Program-Guided Framework for Interpreting and Acquiring Complex Skills with Learning Robots

by

Shao-Hua Sun

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2022

Copyright 2022 Shao-Hua Sun

Table of Contents

List of Tables
List of Figures
Abstract

Part I: Introduction
    Chapter 1: Introduction
        1.1 Overview
        1.2 Published Works

Part II: Program Inference
    Chapter 2: Learning to Synthesize Programs from Demonstrations
        2.1 Introduction
        2.2 Related Work
        2.3 Problem Overview
        2.4 Approach
        2.5 Experiments
        2.6 Conclusion
        2.7 Appendix
    Chapter 3: Learning to Synthesize Programs from Reward Functions
        3.1 Introduction
        3.2 Related Work
        3.3 Problem Formulation
        3.4 Approach
        3.5 Experiments
        3.6 Discussion
        3.7 Appendix

Part III: Primitive Skill Acquisition
    Chapter 4: Meta-Learning on Multimodal Task Distributions
        4.1 Introduction
        4.2 Related Work
        4.3 Preliminaries
        4.4 Method
        4.5 Experiments
        4.6 Conclusion
        4.7 Appendix
    Chapter 5: Meta-Learning on Long-Horizon and Sparse-Reward Tasks
        5.1 Introduction
        5.2 Related Work
        5.3 Problem Formulation and Preliminaries
        5.4 Approach
        5.5 Experiments
        5.6 Conclusion
        5.7 Appendix
    Chapter 6: Learning from Observation
        6.1 Introduction
        6.2 Related Work
        6.3 Method
        6.4 Experiments
        6.5 Conclusion
        6.6 Appendix

Part IV: Task Execution
    Chapter 7: Learning to Execute Programs
        7.1 Introduction
        7.2 Related Work
        7.3 Problem Formulation
        7.4 Approach
        7.5 Experiments
        7.6 Conclusion
        7.7 Appendix
    Chapter 8: Learning to Compose Skills
        8.1 Introduction
        8.2 Related Work
        8.3 Approach
        8.4 Experiments
        8.5 Conclusion
        8.6 Appendix

Part V: Conclusion
    Chapter 9: Conclusion
        9.1 Summary
        9.2 Future Directions

Bibliography

List of Tables

2.1 Karel Results
2.2 Ablation Study on Summarizer Module
2.3 ViZDoom Results
2.4 ViZDoom If-else Results
3.1 Ablation Study
3.2 Program Embedding Space Smoothness
3.3 Results
3.4 Generalization Results
3.5 LEAPS Close Latent Program Interpolation
3.6 LEAPS Far Latent Program Interpolation
3.7 Program Evolution Over CEM Search
3.8 Human Debugging Experiment Results
3.9 Additional Generalization Results
3.10 Unseen Configurations Performance
3.11 Program Token Generation Probabilities
3.12 LEAPS Length 100 Synthesized Karel Programs
4.1 Regression Results
4.2 Classification Results
4.3 RL Results
4.4 Additional Regression Results
4.5 Classification Dataset Details
4.6 Hyperparameters for Classification
4.7 2-mode Classification Results
4.8 3-mode Classification Results
4.9 5-mode Classification Results
4.10 Hyperparameters for RL
6.1 Environment Details
6.2 Hyperparameters for Baselines
6.3 Hyperparameters for Our Method
7.1 Results
7.2 End-to-end Architecture Details
8.1 Robotic Manipulation Results
8.2 Locomotion Results
8.3 Hyperparameters

List of Figures

1.1 Overview of Program-Guided Framework for Interpreting and Acquiring Complex Skills with Learning Robots
2.1 Illustration of Neural Program Synthesis from Demonstration Videos
2.2 Domain-Specific Language
2.3 Model Overview
2.4 Summarizer Module
2.5 Karel Results
2.6 ViZDoom Results
2.7 Analysis on Varying Numbers of Demonstration Videos
3.1 Domain Specific Language
3.2 Framework Overview
3.3 Karel Problem Set
3.4 Visualizations of Learned Program Embedding Space
3.5 StairClimber CEM Trajectory Visualization
3.6 FourCorner CEM Trajectory Visualization
3.7 Human Debugging Experiment User Interface
3.8 Human Debugging Experiment Example Programs (TopOff)
3.9 Human Debugging Experiment Example Programs (FourCorner)
3.10 Human Debugging Experiment Example Programs (Harvester)
3.11 Ground-Truth Test Programs and Karel Programs
3.12 Program Reconstruction Task Synthesized Programs (naïve, LEAPS-P, LEAPS-P+R)
3.13 Program Reconstruction Task Synthesized Programs (LEAPS-P+L, LEAPS)
3.14 LEAPS Karel Tasks Synthesized Programs
3.15 LEAPS Ablations Illustrations
3.16 Baseline Methods Illustrations
3.17 Program Length Histograms
3.18 Karel Task Start/End State Depictions
3.19 Karel Rollout Visualizations
4.1 Model Overview
4.2 Regression Results
4.3 Visualization of Embedded Tasks
4.4 RL Environments
4.5 Point Mass Results
4.6 Reacher and Ant Results
4.7 Visualization of Embedded Regression Tasks
4.8 Visualization of Regression Results
4.9 Classification Dataset Summary
4.10 Visualization of Embedded Classification Tasks
4.11 Training Curves on RL Tasks
4.12 Additional Regression Results
4.13 Additional Point Mass Results
4.14 MAML and Multimodal MAML Training Curves on Classification Tasks
4.15 Multi-MAML Training Curves on Classification Tasks
5.1 Overview
5.2 Method Overview
5.3 Environments
5.4 Target Task Learning Efficiency
5.5 Qualitative Results
5.6 Meta-Training Task Distribution Analysis
5.7 Task Distributions for Task Length Ablation
5.8 Meta-Training Performance for Task Length Ablation
5.9 Qualitative Result of Meta-RL Method Ablation
5.10 Performance with Few Episodes of Target Task Interaction
5.11 Image-Based Maze Navigation with Distribution Shift
5.12 Maze Meta-Training and Target Task Distributions
5.13 Maze Meta-Training and Target Task Distributions for Meta-training Task Distribution Analysis
6.1 Framework Overview
6.2 Environments
6.3 Goal Completion Rate
6.4 Generalization Analysis
6.5 Goal Proximity Function Analysis
6.6 Ablation Study
6.7 Analysis on Generalizing to Unseen States
6.8 Additional Ablation Study
6.9 Analysis on Proximity Discounting Factor
6.10 Effect of Spectral Normalization
6.11 Proximity Prediction Visualization
6.12 Navigation 25% Heldout Set
7.1 Overview
7.2 Domain-Specific Language
7.3 Framework Overview
7.4 Learning a Multitask Policy via Learned Modulation
7.5 Analysis on End-to-end Learning Models
7.6 Environment Example
7.7 First Failure Rate of Subtasks
7.8 Average Time Cost of Subtasks
7.9 Additional Analysis on Completion Rates
7.10 Exemplar Data and Language Ambiguity
7.11 Training Program Statistics
7.12 Testing Program Statistics
7.13 Complex Testing Program Statistics
8.1 Transition Policy Illustration
8.2 Framework Overview
8.3 Training Overview
8.4 Results
8.5 Average Transition Length and Proximity Reward
8.6 Visualization of Transition Trajectories
8.7 Ablation Study on Proximity Discounting Factor
Abstract

Recent development in artificial intelligence and machine learning has remarkably advanced machines' ability to understand images and videos, comprehend natural languages and speech, and outperform human experts in complex games. However, building intelligent robots that can physically interact with their surroundings as well as learn to operate in unstructured environments, manipulate unknown objects, and acquire novel skills – to free humans from tedious or dangerous manual work – remains challenging. The focus of my research is to develop a robot learning framework that enables robots to acquire long-horizon and complex skills with hierarchical structures, such as furniture assembly and cooking. Specifically, I aim to devise a robot learning framework which is: (1) interpretable: by decoupling interpreting skill specifications (e.g. demonstrations, reward functions) and executing skills, (2) programmatic: by generalizing from simple instances to complex instances without additional learning, (3) hierarchical: by operating on a proper level of abstraction that enables human users to interpret the high-level plans of robots and allows for composing primitive skills to solve long-horizon tasks, and (4) modular: by being equipped with modules specialized in different functions (e.g. perception, action) which collaborate, allowing for better generalization. This dissertation discusses a series of research projects toward building such an interpretable, programmatic, hierarchical, and modular robot learning framework.

Part I: Introduction

Chapter 1: Introduction

1.1 Overview

Recent advancement in artificial intelligence and machine learning has remarkably advanced machines' ability to understand images and videos (e.g. object detection, semantic segmentation, action recognition, image captioning), comprehend natural languages and speech (e.g. machine translation, document summarization, speech recognition), and even outperform human experts in complex games (e.g. Go, Dota 2, StarCraft II). With the ability to learn, those systems generalize reasonably well on a wide range of tasks, and some have even been deployed as widely used products. However, building reliable artificial intelligence agents (i.e. robots) that can physically interact with their surroundings while learning to operate in unstructured environments, manipulate unseen objects, and acquire novel skills – to free humans from tedious or dangerous manual work – remains challenging.

Figure 1.1: An overview of the proposed program-guided framework for interpreting and acquiring complex skills with learning robots. Learning and inference modules (in blue) connect demonstration, environment, program, primitive skills, and executable action (in black). The example program shown in the figure is:

    def run():
        pickUp(top, left_arm, on, floor)
        moveTo(left_arm, (0,0,5), (0,0,0))
        for(x, num(leg, on, floor)):
            pickUp(leg, right_arm, on, floor)
            moveTo(right_arm, (2,0,3), (0,0,0))
            attach()
            release(right_arm)
            moveTo(left_arm, (0,0,5), (0,0,360.0/x))
        if(isThere(back, on, floor)):
            pickUp(back, right_arm, on, floor)
            moveTo(right_arm, (1,0,7), (90,0,0))
            attach()
            release(right_arm)

Recently, the success of deep reinforcement learning (DRL) has led many researchers to develop learning frameworks to control robots. Compared to traditional robotics pipelines, DRL methods approximate policies using deep neural networks, which are optimized to automate the process of designing sensing, planning, and control algorithms by letting robots learn in an end-to-end fashion. Despite the recent progress in the field, such neural network policies suffer from several fundamental issues. First, such black-box policies are not interpretable to humans and are therefore difficult to debug when robots fail to perform a task. Second, acquiring complex skills through trial and error still remains challenging, and these neural network policies often have difficulty generalizing to novel scenarios. Third, most existing works are limited to short-horizon skills such as pushing and picking up objects. Finally, most approaches are designed to acquire skills from scratch instead of building upon previously learned skills.

The focus of my research is to develop a robot learning framework that allows robots to acquire long-horizon and complex skills with hierarchical structures, such as furniture assembly and cooking. Specifically, I aim to devise a robot learning framework which is: (1) interpretable: by decoupling interpreting skill specifications (e.g. demonstrations, reward functions) and executing skills, (2) programmatic: by generalizing from simple instances to complex instances without additional training, (3) hierarchical: by operating on a proper level of abstraction that enables human users to interpret the high-level plans of robots and allows for composing primitive skills to solve long-horizon tasks, and (4) modular: by being equipped with modules specialized in different functions (e.g. perception, action) which collaborate, allowing for better generalization.

To this end, I present a robot learning framework which represents desired behaviors as programs and which acquires and utilizes primitive skills to learn to execute the desired skills, as shown in Figure 1.1. Specifically, instead of learning in an end-to-end manner, I propose to design specialized learning modules that aim to (1) perform program inference to explicitly infer underlying programs that describe the skills of interest, (2) acquire primitive skills that can be used to compose more complex and longer-horizon skills, and (3) perform task execution by following the inferred program and utilizing the acquired primitive skills to replicate the desired skills. In the following, I will discuss the details of each module.
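The three modules described above can be pictured as one pipeline from task specification to executable behavior. Below is a minimal Python sketch of that structure under assumed, simplified interfaces: the function names, the list-of-steps program representation, and the toy skills are illustrative placeholders for the actual components developed in the remaining chapters.

    from typing import Callable, Dict, List, Tuple

    # A "program" is represented here as a flat list of (skill_name, arguments)
    # steps; the framework in Figure 1.1 instead uses a DSL with control flow.
    Program = List[Tuple[str, tuple]]

    def infer_program(demonstrations: List[list]) -> Program:
        # Stand-in for the program inference module (Chapters 2-3), which
        # synthesizes a program from demonstrations or reward signals.
        return [("pickUp", ("top", "left_arm")),
                ("moveTo", ("left_arm", (0, 0, 5))),
                ("attach", ())]

    def acquire_primitive_skills() -> Dict[str, Callable[[tuple], str]]:
        # Stand-in for the primitive skill acquisition module (Chapters 4-6);
        # each skill maps its arguments to low-level behavior in the environment.
        return {
            "pickUp": lambda args: f"pick up {args[0]} with {args[1]}",
            "moveTo": lambda args: f"move {args[0]} to {args[1]}",
            "attach": lambda args: "attach the held parts",
        }

    def execute(program: Program, skills: Dict[str, Callable[[tuple], str]]) -> None:
        # Stand-in for the task execution module (Chapters 7-8): follow the
        # program and dispatch each step to the corresponding primitive skill.
        for skill_name, args in program:
            print(skills[skill_name](args))

    if __name__ == "__main__":
        execute(infer_program(demonstrations=[]), acquire_primitive_skills())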
1.1.1 Program Inference

I propose to utilize programs, structured in a formal domain-specific language, to describe behaviors (see an example program in Figure 1.1). Programs are not only interpretable to humans but also more directly machine-executable, since they are less ambiguous than natural languages. However, specifying a task or a desired behavior by writing a program requires substantial expertise and can be tedious. Therefore, I propose to learn to perform program inference, which aims to construct a program that describes a task-solving procedure from task specifications that are easier to provide (e.g. demonstrations, reward functions). In the following, I will describe the projects carried out by my collaborators and me, which focus on learning to perform program inference from such task specifications.

1.1.1.1 Learning to Synthesize Programs from Demonstrations

In [287], we aim to interpret the provided demonstrations and infer the intended skill by learning to explicitly synthesize the underlying program which describes the skill in a structured language. A synthesized program explicitly describes diverse situations and the corresponding subtasks to execute. Moreover, a program describes tasks and their hierarchies at a set level of abstraction (e.g. actions such as moveTo and attach, and perceptions such as isThere), providing scaffolding for learning long-horizon, hierarchically structured tasks. Specifically, in [287], we study the problem of mimicking behaviors presented in a set of demonstration videos, where the demonstrator's behaviors vary from video to video due to different environmental conditions. To address it, we propose to synthesize a program from the demonstration videos and then execute the program to replicate the demonstrator's behavior. The results suggest that learning to explicitly synthesize programs instead of learning a black-box policy encourages the model to pay extra attention to the decision-making logic of the demonstrator, leading to its superior performance.
1.1.1.2 Learning to Synthesize Programs from Reward Functions

While [287] achieves promising results on learning to synthesize programs from demonstrations, obtaining demonstrations can sometimes be expensive or even impossible. Therefore, in [304], we aim to devise a framework that can learn to perform program inference directly from reward signals provided by a Markov decision process (MDP). To alleviate the difficulty of learning to compose programs that induce the desired agent behavior from scratch, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner and then search over the learned program embedding space to yield a program that maximizes the return for a given task. The results suggest that the proposed framework can produce interpretable and more generalizable policies which outperform DRL methods.

Besides inferring programs from a variety of task specifications, our recent work explores synthesizing a program by hierarchically composing multiple programs. This allows for synthesizing programs that are long and deeply nested, which can induce more complex behaviors, and therefore increases the applicability of representing behaviors using programs.
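To make the search step of [304] concrete, the following is a minimal sketch of a cross-entropy-method search over a learned program embedding space. It assumes a decode function that maps a latent vector to a program and an evaluate function that returns the program's episodic return; both are placeholders here (the toy usage simply searches for a fixed target vector), and the actual procedure and hyperparameters in [304] differ.

    import numpy as np

    def cem_search(decode, evaluate, dim=64, iterations=20,
                   population=64, elite_frac=0.1, seed=0):
        # Cross-entropy method: sample latent vectors, decode them into programs,
        # keep the ones with the highest return, and refit the sampling
        # distribution to those elites.
        rng = np.random.default_rng(seed)
        mean, std = np.zeros(dim), np.ones(dim)
        n_elite = max(1, int(population * elite_frac))
        best_program, best_return = None, -np.inf
        for _ in range(iterations):
            samples = rng.normal(mean, std, size=(population, dim))
            returns = np.array([evaluate(decode(z)) for z in samples])
            elite_idx = returns.argsort()[-n_elite:]
            mean = samples[elite_idx].mean(axis=0)
            std = samples[elite_idx].std(axis=0) + 1e-3
            if returns[elite_idx[-1]] > best_return:
                best_return = returns[elite_idx[-1]]
                best_program = decode(samples[elite_idx[-1]])
        return best_program, best_return

    if __name__ == "__main__":
        target = np.full(64, 0.5)   # toy stand-in for a task-solving behavior
        program, ret = cem_search(decode=lambda z: z,
                                  evaluate=lambda p: -np.linalg.norm(p - target))
        print(round(float(ret), 3))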
1.1.2 Primitive Skill Acquisition

The aim of this stage is to robustly and efficiently acquire a set of primitive skills (e.g. moveTo, attach, place in Figure 1.1) that can be used to compose more complex skills, enabling the execution of behaviors specified in programs obtained from the program inference stage. To this end, my research focuses on meta-learning and learning from demonstrations. In the following, I will describe the projects carried out by my collaborators and me, which focus on these two directions.

1.1.2.1 Meta-Learning & Meta-Reinforcement Learning

Meta-Learning on Multimodal Task Distributions. To efficiently acquire a diverse set of skills, we propose to leverage the recent advancement in meta-learning and meta-reinforcement learning. Specifically, we aim to leverage model-agnostic meta-learning (MAML), which allows an agent to learn from a distribution of tasks and then quickly adapt to novel tasks with few gradient updates. Yet, MAML seeks a common initialization shared across the entire task distribution, substantially limiting the diversity of the task distributions that it can learn from. To enable an agent to adapt to a diverse set of primitive skills, we propose a multimodal MAML framework [315, 316], which can modulate its meta-learned prior parameters according to the identified task family, allowing more efficient fast adaptation. The proposed multimodal MAML framework achieves superior performance not only on reinforcement learning but also on few-shot regression and image classification.
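The modulation idea can be illustrated with a short PyTorch sketch: a modulation network embeds the support set of a task and produces scale and shift vectors that condition the task network's features before gradient-based fast adaptation. The architecture, the simple set encoder, and the inner loop below are assumptions made for illustration; they are not the exact models or training procedure of [315, 316], and the meta (outer-loop) update across tasks is omitted.

    import torch
    import torch.nn as nn

    class ModulatedTaskNetwork(nn.Module):
        # A task network whose hidden features are modulated (scaled and shifted)
        # by vectors predicted from the support set, so that a single meta-learned
        # initialization can be specialized to the identified task mode.
        def __init__(self, in_dim=1, out_dim=1, hidden_dim=32):
            super().__init__()
            self.set_encoder = nn.Linear(in_dim + out_dim, 2 * hidden_dim)  # modulation network
            self.fc1 = nn.Linear(in_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, out_dim)

        def modulation(self, support_x, support_y):
            task_embedding = torch.cat([support_x, support_y], dim=-1).mean(dim=0)
            tau, beta = self.set_encoder(task_embedding).chunk(2)
            return tau, beta

        def forward(self, x, tau, beta):
            h = torch.relu(self.fc1(x)) * (1 + tau) + beta   # feature-wise modulation
            return self.fc2(h)

    def fast_adapt(model, support_x, support_y, inner_lr=1e-2, steps=3):
        # MAML-style inner loop on one task: recompute the modulation vectors and
        # take a few gradient steps on the support set.
        optimizer = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(steps):
            tau, beta = model.modulation(support_x, support_y)
            loss = nn.functional.mse_loss(model(support_x, tau, beta), support_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return model.modulation(support_x, support_y)

    if __name__ == "__main__":
        torch.manual_seed(0)
        net = ModulatedTaskNetwork()
        x = torch.linspace(-1.0, 1.0, 10).unsqueeze(1)
        y = torch.sin(3.0 * x)           # one task drawn from a multimodal distribution
        tau, beta = fast_adapt(net, x, y)
        print(float(nn.functional.mse_loss(net(x, tau, beta), y)))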
Meta-Learning on Long-Horizon and Sparse-Reward Tasks. In our recent works [205, 206, 207], we aim to address a common issue of most existing meta-reinforcement learning methods: they are limited to short-horizon tasks with dense rewards. To this end, we propose to first extract composable skills and a skill prior from agent play data in the form of offline datasets, which then enables meta-learning on long-horizon, sparse-reward tasks.

1.1.2.2 Learning from Demonstrations

Learning from Observation. Another research direction that aims to improve sample efficiency for acquiring skills is learning from demonstrations. However, demonstrators' actions might not always be available, and how a learning agent can generalize beyond the situations seen in demonstrations remains challenging. In [161], we study learning from observation (i.e. imitating demonstrators without access to their actions, using only state sequences) and generalization to novel situations beyond the demonstrations. Specifically, we propose to learn a task progress estimator and use the task progress estimate as a dense reward for training a policy. We show that our method generalizes robustly compared to prior methods on a set of tasks in navigation, locomotion, and robotic manipulation, even with demonstrations that cover only part of the state space.

1.1.3 Task Execution

To replicate a skill by following an inferred program (Section 1.1.1) and utilizing a set of acquired primitive skills (Section 1.1.2), an agent needs to (1) deduce the correct order of primitive skills that need to be executed as well as (2) smoothly chain them together.

1.1.3.1 Learning to Execute Programs

To address (1), in [288] we propose a framework that learns to interpret and follow a program by employing a perception module, which learns to decide which path in a program should be taken, and a policy, which fulfills each desired primitive skill.

1.1.3.2 Learning to Compose Skills

For (2), when the primitive skills are obtained independently and the environment dynamics are complex (e.g. continuous control), simply executing the primitive skills in sequence can often fail. Therefore, in [160] we propose to learn transition policies which effectively navigate from an ending state of one primitive skill to suitable initial states of the following primitive skill.

1.2 Published Works

This dissertation presents a number of techniques for allowing learning robots to interpret and acquire complex skills, which were presented at top-tier computer science and machine learning venues. This section enumerates these published works and briefly describes the content of each chapter in this dissertation.

Part II: Program Inference

• Chapter 2 corresponds to a paper [287] published at the International Conference on Machine Learning (ICML) 2018. It aims to infer the decision-making logic in demonstration videos, allowing for accurately imitating the demonstrator's behaviors. To this end, we propose a framework that is able to explicitly synthesize underlying programs that describe the demonstrator's decision-making logic from behaviorally diverse and visually complicated demonstration videos.

• Chapter 3 corresponds to a paper [304] published at Neural Information Processing Systems (NeurIPS) 2021. It presents a framework that learns to synthesize a program, detailing the procedure to solve a task in a flexible and expressive manner, solely from reward signals. To alleviate the difficulty of learning to compose programs to induce the desired agent behavior from scratch, we propose to learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner and then search over the learned program embedding space to yield a program that maximizes the return for a given task.

Part III: Primitive Skill Acquisition

• Chapter 4 corresponds to a paper [315] published at Neural Information Processing Systems (NeurIPS) 2019 and a paper [316] presented in the Meta-Learning Workshop at Neural Information Processing Systems (NeurIPS) 2018. Model-agnostic meta-learners aim to acquire meta-prior parameters from a distribution of tasks and adapt to novel tasks with few gradient updates. Yet, seeking a common initialization shared across the entire task distribution substantially limits the diversity of the task distributions that they are able to learn from. We propose a multimodal MAML (MMAML) framework, which is able to modulate its meta-learned prior according to the identified mode, allowing more efficient fast adaptation.

• Chapter 5 corresponds to a paper [207] published at the International Conference on Learning Representations (ICLR) 2022, a paper [205] presented in the Meta-Learning Workshop at Neural Information Processing Systems (NeurIPS) 2021, and a paper [206] presented in the Deep RL Workshop at Neural Information Processing Systems (NeurIPS) 2021. We devise a method that enables meta-learning on long-horizon, sparse-reward tasks, allowing us to solve unseen target tasks with orders of magnitude fewer environment interactions. Specifically, we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose learned skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task.

• Chapter 6 corresponds to a paper [161] published at Neural Information Processing Systems (NeurIPS) 2021. Task progress is intuitive and readily available task information that can guide an agent closer to the desired goal. Furthermore, a progress estimator can generalize to new situations. From this intuition, we propose a simple yet effective imitation-learning-from-observation method for goal-directed tasks that uses a learned goal proximity function as a task progress estimator, for better generalization to unseen states and goals. We obtain this goal proximity function from expert demonstrations and online agent experience, and then use the learned goal proximity as a dense reward for policy training.

Part IV: Task Execution

• Chapter 7 corresponds to a paper [288] published at the International Conference on Learning Representations (ICLR) 2020. We propose to utilize programs, structured in a formal language, as a precise and expressive way to specify tasks, instead of natural languages, which can often be ambiguous. We then devise a modular framework that learns to perform a task specified by a program: as different circumstances give rise to diverse ways to accomplish the task, our framework can perceive which circumstance it is currently under and instruct a multitask policy accordingly to fulfill each subtask of the overall task.

• Chapter 8 corresponds to a paper [160] published at the International Conference on Learning Representations (ICLR) 2019. Humans acquire complex skills by exploiting previously learned skills and making transitions between them. To empower machines with this ability, we propose a method that can learn transition policies which effectively connect primitive skills to perform sequential tasks without handcrafted rewards. To efficiently train our transition policies, we introduce proximity predictors which induce rewards gauging proximity to suitable initial states for the next skill; a schematic sketch of this transition mechanism follows this list.
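The following is a schematic Python sketch of how a proximity predictor could shape rewards while chaining primitive skills, as referenced above. All interfaces here (the environment's reset/step, the skill and policy objects, and the threshold-based hand-off) are assumed placeholders rather than the design of [160]; the reward shown is one simple instantiation, the increase in predicted proximity.

    def transition_reward(proximity_predictor, state, next_state):
        # Dense reward for a transition policy: the increase in predicted proximity
        # to states from which the next primitive skill tends to succeed.
        return proximity_predictor(next_state) - proximity_predictor(state)

    def execute_skill_sequence(env, skills, transition_policies, proximity_predictors,
                               max_transition_steps=100, proximity_threshold=0.9):
        # Run primitive skills in order; between consecutive skills, run the learned
        # transition policy until the proximity predictor judges the current state
        # suitable for starting the next skill (or a step budget runs out).
        state = env.reset()
        for i, skill in enumerate(skills):
            state = skill.run(env, state)                     # execute primitive skill i
            if i + 1 == len(skills):
                break
            policy = transition_policies[i]
            predictor = proximity_predictors[i + 1]
            for _ in range(max_transition_steps):
                if predictor(state) >= proximity_threshold:   # ready for skill i + 1
                    break
                action = policy(state)
                next_state, done = env.step(action)
                # transition_reward(predictor, state, next_state) would be stored
                # here to train the transition policy online.
                state = next_state
                if done:
                    break
        return state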
Part II: Program Inference

Chapter 2: Learning to Synthesize Programs from Demonstrations

2.1 Introduction

Imagine you are watching others drive cars. You will easily notice many common behaviors even if you know nothing about driving. For example, cars stop when a traffic light turns red and move again when the light turns green. Cars also slow down when drivers see a pedestrian jay-walking. Just like in this example, humans can abstract behaviors, especially the structural relationship between action and perception (e.g. light, pedestrian). Can machines also reason about the decision-making logic behind behaviors? While there has been tremendous effort and success in understanding videos, these efforts have mostly focused on recognizing actions, finding and naming objects, or predicting future outcomes. However, reasoning about the decision-making logic behind behaviors is a crucial skill for machines that mimic and collaborate with humans. Hence, our goal is to step towards developing a method that can interpret perception-based decision-making logic from visual behavior demonstrations.

Our insight is to exploit declarative programs, structured in a formal language, as behavior-interpreting representations. The formal language is composed of action blocks, perception blocks, and control flow (e.g. if/else). Programs written in such a language can explicitly model the connection between an observation (e.g. traffic light, biker) and an action (e.g. stop). An example is shown in Figure 2.1. Described in a formal language, programs are logically interpretable and executable. Thus, the problem of interpreting decision-making logic from visual demonstrations can be reduced to extracting an underlying program.

Figure 2.1: An illustration of neural program synthesis from demonstration videos. Given multiple demonstration videos exhibiting diverse behaviors, our neural program synthesizer learns to produce interpretable and executable underlying programs. Divergence above occurs based on perception in the second frame. The synthesized program shown in the figure is:

    def run():
        while frontIsClear():
            move()
        turnRight()
        if thereIsPig():
            attack()
        else:
            if not thereIsWolf():
                spawnPig()
            else:
                giveBone()

In fact, there have been many neural network frameworks proposed recently for program induction or synthesis. First, a variety of frameworks [132, 247, 332, 61] have been proposed to induce latent representations of underlying programs. While they can be efficient at mimicking desired behaviors, they do not explicitly yield interpretable programs, resulting in inexplicable failure cases. On the other hand, another line of work directly synthesizes programs [63, 35], giving full interpretability. These approaches have shown highly competitive results on simple input/output pairs but are limited in the expressibility of the programs they consider. Hence, in this paper, we extend this line of models to synthesize programs while handling complex sequential visual inputs.

Our goal is to interpret the logic behind various visual demonstrations. In other words, we want a model that can synthesize programs mimicking the behaviors in demonstrations. The model is therefore required to handle diverse sequential visual data. To address this requirement, we propose a program synthesis architecture augmented with a summarizer module: a module that encodes the inter-relation between multiple demonstrations and summarizes them into a compact vector representation.
We extensively evaluate our model in two distinct environments: a fully observable, third-person environment (Karel) and a partially observable, egocentric game (ViZDoom). Our experiments in both environments show that directly modeling a behavior as a program has several benefits such as a better reasoning of the underlying conditions behind action compared against other methods. We also present an additional strength of our approach such as interpretability that enables human interaction for fixing and debugging. In summary, in this paper, we introduce a novel problem of program synthesis from demonstrations of sequential visual data and a method to address it. This ultimately enables machines to explicitly interpret decision making logic and further interact with humans through a debugging process. We also demonstrate that our algorithm can synthesize programs reliably on multiple environments. 2.2 RelatedWork ProgramInduction Learning to perform a specific task by inducing latent representations of underlying task-specific programs is known as program induction. Various approaches have been developed: designing end-to-end differentiable architectures [95, 96, 347, 132, 131, 97, 208], learning to call subprograms using step-by-step supervision [247, 38], and few-shot program induction [61]. Contrary to our work, those method do not return explicit programs. ProgramSynthesis The line of work in program synthesis focuses on explicitly producing programs that are restricted to certain languages. [24] train a model to predict program attributes and used external search algorithms for inductive program synthesis. [219, 63] directly synthesize simple string transformation programs. [35] employ reinforcement learning to directly optimize the execution of generated programs. 14 Programm:= def run():s Statements:= while(b):(s)|s 1 ;s 2 |a| repeat(r):(s) | if(b):(s)| ifelse(b):(s 1 ) else:(s 2 ) Repetitionr := Number of repetitions Conditionb:= percept| notb Perceptionp:= Domain dependent perception primitives Actiona:= Domain dependent action primitives Figure 2.2: Domain-specific language for the program representation. The program is composed of domain dependent perception and action primitives and control flows. However, those methods are limited to synthesizing programs from input-output pairs, which substan- tially restricts the expressibility of the programs that are considered; instead, we address the problem of synthesizing programs from full demonstrations videos. ImitationLearning The methods that are concerned with acquiring skills from expert demonstrations, dubbed imitation learning, can be split into behavioral cloning [236, 237, 251] which casts the problem as a supervised learning task and inverse reinforcement learning [210] that extracts estimated reward functions given demonstrations. Recently, [69, 82, 332] have studied the task of mimicking given few demonstrations. This line of work can be considered as program induction, as they imitate demonstrations without explicitly modeling underlying programs. While those methods are able to mimic given few demonstrations, it is not clear if they could deal with multiple demonstrations with diverse branching conditions. 2.3 ProblemOverview In this section, we define our formulation for program synthesis from diverse demonstration videos. We would like to comprehend and replicate demonstrated behaviors by revealing the formal structure between demonstrators’ perception and actions. 
We therefore formally define programs in a domain-specific language (DSL) with perception primitives, action primitives and control flows. Action primitives define the way that 15 Demo Encoder Program Decoder Perception Decoder Action Decoder … Demo 1 … def run(): move() if leftIsClear(): turnLeft() REPEAT R=5: turnRight() if MarkersPresent(): pickupMarker() else: move() Program Demo Encoder Demo Encoder move() def run() move() Demo 2 Demo k else … c c frontIsClear() rightIsClear() MarkersPresent() yes no yes … move() turnLeft() pickupMarker() move() turnRight() pickupMarker() … move() turnLeft() move() … … … … Summarizer Module v 1 demo v 2 demo v k demo v summary Figure 2.3: Model Architecture. The demonstration encoder encodes each of the k demonstrations separately and the summarizer network aggregates them to construct a summary vector. The summary vector is used by the program decoder to produce program tokens sequentially. The encoded demonstrations are used to decode the action sequence and perception conditions as additional supervision. agents can interact with an environment, while perception primitives describe how agents can percept it. On the other hand, control flows can include if/else statements, while loops, repeat statements, and simple logic operations. An example of control flows introduced in [226] is shown in Figure 2.2. Note that we focus on perceptions with boolean returns in this paper, while more generic perception type constraint is possible. A programη is a deterministic function that outputs an actiona∈A given a history of states at time stept,H t =(s 1 ,s 2 ,...,s t ), wheres∈S is a state of the environment. The generation of an action given the history of states is represented asa t =η (H t ). In this paper, we focus on a program type that can be represented in DSL by a codeC =(w 1 ,w 2 ,...,w N ), which is a sequence of tokensw. A demonstrationτ =((s 1 ,a 1 ),(s 2 ,a 2 ),...,(s T ,a T )) is a sequence of state and action tuples generated by an underlying programη ∗ given a initial states 1 . Given an initial states 1 and its corresponding state history H 1 , the program generates new action a 1 = η ∗ (H 1 ). The following state s 2 is generated by a state transition functionT : s 2 ∼ T(s 1 ,a 1 ). The newly sampled state is incorporated into the state history H 2 = H 1 ⌢ (s 2 ) and this process is iterated until the end of file action EOF ∈ A is returned by the program. A set of demonstrationsD ={τ 1 ,τ 2 ,...,τ K } can be generated by running a single programη ∗ 16 on different initial states s 1 1 ,s 2 1 ,...,s K 1 , where each initial state is sampled from a initial state distribution (i.e.s K 1 ∼ P 0 (s 1 )). While we are interested in inferring a programη ∗ from a set of demonstrationsD, it is preferable to predict a codeC ∗ instead, because it is a more accessible representation while immediately convertible to a program. Formally, we formulate the problem as a sequence prediction where the input is a set of demonstrationsD and the output is a codeC. Note that our objective is not about inferring a code perfectly but instead generating a code that can infer a program. This requires a carefully chosen measure for successful code synthesis, discussed in Section 2.5.1. 
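To make this generative process concrete, the following Python sketch rolls out a program to produce a single demonstration. The callables program, transition, and sample_initial_state are placeholders for the executor of the program η, the transition function T, and the initial-state distribution P0; they are illustrative assumptions and not part of any released implementation.

# Illustrative sketch (not the implementation used in this chapter): rolling out
# a program eta to produce one demonstration tau = ((s_1, a_1), ..., (s_T, a_T)).
EOF = "EOF"  # special end-of-file action emitted when the program terminates

def generate_demonstration(program, transition, sample_initial_state, max_steps=20):
    """program:  maps a state history H_t to an action a_t (a_t = eta(H_t)).
    transition: samples the next state s_{t+1} ~ T(s_t, a_t).
    sample_initial_state: draws s_1 from the initial-state distribution P_0."""
    state = sample_initial_state()
    history = [state]                      # H_1 = (s_1)
    demonstration = []
    for _ in range(max_steps):
        action = program(history)          # a_t = eta(H_t)
        demonstration.append((state, action))
        if action == EOF:                  # the program signals termination
            break
        state = transition(state, action)  # s_{t+1} ~ T(s_t, a_t)
        history.append(state)              # H_{t+1} = H_t concatenated with (s_{t+1})
    return demonstration

# A set of K demonstrations D is obtained by repeating this rollout with K
# different initial states sampled from P_0 under the same underlying program.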
2.4 Approach Inferring a program behind a set of demonstrations requires (1) interpreting each demonstration video (2) spotting and summarizing the difference among demonstrations to infer the conditions behind the taken actions (3) describing the understanding of demonstrations in a written language, Based on this intuition, we design a neural architecture composed of three components: • DemonstrationEncoder receives a demonstration video as input and produces an embedding that captures an agent’s actions and perception. • SummarizerModule discovers and summarizes where actions diverge between demonstrations and upon which branching conditions subsequent actions are taken. • ProgramDecoder represents the summarized understanding of demonstrations as a code sequence. The details of the three main components are described in the Section 2.4.1, and the learning objective of the proposed model is described in Section 2.4.2. Section 2.4.3 introduces auxiliary tasks for encouraging the model to learn the knowledge that is essential to infer a program. 17 2.4.1 ModelArchitecture Figure 2.3 illustrates the overall architecture of the proposed model, which is composed of demonstration encoders, a summarizer module, and a program decoder. Details of each component are described in the following sections. 2.4.1.1 DemonstrationEncoder The demonstration encoder performs two types of understanding over a single demonstration. The first is understanding visible actions in each time steps and the second is summarizing the overall action sequence in a demonstration as a single idea. The demonstration encoder performs both types of understanding at the same time using a state encoder and an LSTM (Long Short Term Memory) [115]. The state encoder, a stack of convolutional layers, encodes a states t to its embedding as a state vector v t state = CNN enc (s t )∈R d , wheret∈[1,T] is the time-step. The LSTM encodes each state representation and summarized representation at the same time. c t enc ,h t enc = LSTM enc (v t state ,c t− 1 enc ,h t− 1 enc ), (2.1) where,t∈[1,T] is the time step,c t enc is the cell state,h t enc is the hidden state. Both the final state tuples (c T enc ,h T enc ) encode the overall idea of the demonstration and intermediate hidden states{h 1 enc ,h 2 enc ,...,h T enc } containing high level understanding of each state and are used as an input to the following modules. 2.4.1.2 SummarizerModule The summarizer module first reviews each demonstration again with the context of the whole demon- strations to infer underlying conditions behind visible actions. The inferred conditions are summarized within a demonstration and then summarized again over multiple demonstrations. An illustration of the summarizer is shown in Figure 2.4. 18 The first summarization is performed by a reviewer module, an LSTM initialized with the pooled final state tuples of the demonstration encoder. The pooled final state tuples of the demonstration encoder is formally written as follows c 0 review = 1 K K X k=1 c T,k enc , h 0 review = 1 K K X k=1 h T,k enc , (2.2) where (c T,k enc ,h T,k enc ) is the final state tuple of the kth demonstration encoder. Then the reviewer LSTM encodes the hidden states c t review ,h t review = LSTM review (h t enc ,c t− 1 review ,h t− 1 review ), (2.3) where the final hidden state becomes a demonstration vector v demo = h T review ∈R d , which includes the summarized information within a single demonstration. 
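For concreteness, the following PyTorch-style sketch illustrates Equations 2.1–2.3, i.e., the per-demonstration encoder and the reviewer LSTM initialized with the pooled final states. It is not the implementation used in this chapter (which is written in TensorFlow; see Section 2.7.2): the layer sizes, the pooling layer, and the module names are placeholder choices made only for illustration.

import torch
import torch.nn as nn

class DemoEncoderAndReviewer(nn.Module):
    """Sketch of Eqs. 2.1-2.3; all hyperparameters are illustrative placeholders."""
    def __init__(self, in_channels=16, d=128):
        super().__init__()
        # CNN_enc: per-frame state encoder producing v_state^t (Eq. 2.1)
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 48, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(48, d),
        )
        self.lstm_enc = nn.LSTM(d, d, batch_first=True)     # LSTM_enc   (Eq. 2.1)
        self.lstm_review = nn.LSTM(d, d, batch_first=True)  # LSTM_review (Eq. 2.3)

    def forward(self, demos):
        # demos: (K, T, C, H, W) -- K demonstrations of T frames each,
        # encoded separately by treating K as the batch dimension.
        K, T = demos.shape[:2]
        v_state = self.cnn(demos.flatten(0, 1)).view(K, T, -1)   # (K, T, d)
        h_enc, (h_T, c_T) = self.lstm_enc(v_state)               # hidden states h_enc^t
        # Eq. 2.2: average-pool the K final state tuples to initialize the reviewer
        h0 = h_T.mean(dim=1, keepdim=True).expand_as(h_T).contiguous()
        c0 = c_T.mean(dim=1, keepdim=True).expand_as(c_T).contiguous()
        _, (h_review, _) = self.lstm_review(h_enc, (h0, c0))     # Eq. 2.3
        return h_review.squeeze(0)                               # v_demo per demo: (K, d)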
The final summarization, which is performed over multiple demonstrations, is performed by an aggre- gation module. The aggregation module getsK demonstration vectors and summarize them into a single compact vector representation. To effectively model complex relations between different demonstrations, we employ relational networks (RN) [257]. The aggregation process is formally written as follows. v summary = RN v 1 demo ,v 2 demo ,...,v K demo , (2.4) wherev summary ∈R d is the the summarized demonstration vector. We show that employing the summarizer module significantly alleviate the difficulty of handling multiple demonstrations and improve generalization over different number of generations in Section 2.5 . 19 v summary LSTM LSTM LSTM LSTM s1 1 sT 1 LSTM LSTM LSTM LSTM s1 k sT k … Demo 1 Demo k Average Pooling zero state zero state Relation Network … … z Demonstration Encoder Summarizer Module z z … … v k demo v 1 demo Figure 2.4: Summarizer Module. The demonstration encoder (inner layer) encodes each demonstration starting from a zero state. The summarizer module (outer layer) aggregates the outputs of the demonstration encoder with a relation network to provide context from other demonstrations. 2.4.1.3 ProgramDecoder The program decoder generates programs from a summarized vector representation of the demonstrations. We use LSTMs similar to [290, 312] as a program decoder. Initialized with the summary vectorv summary , the LSTM at each time step gets the previous token embedding as an input and outputs a probability of the following program tokens as in the Equation 2.5. During training, the previous ground truth token is fed as an input, and during inference, the predicted token in the previous steps is fed as an input. 2.4.2 Learning The proposed model learns a conditional distribution between a set of demonstrations D and a corre- sponding codeC ={w 1 ,w 2 ,...,w N }. By employing the LSTM program decoder, this problem becomes an autoregressive sequence prediction [290]. For a given demonstration and previous code tokenw i− 1 , our model is trained to predict the following ground truth tokenw ∗ i , where the cross entropy loss is optimized. L code =− 1 NM M X m=1 N X n=1 logp(w ∗ m,n |W m m,n− 1 ,D m ), (2.5) 20 whereM is the total number of training examples,w m,n is thenth token of themth training example and D m aremth training demonstrations. W m,n ={w m,1 ,...,w m,n } is the history of previous token inputs at time stepn. 2.4.3 Multi-taskObjective To reason an underlying program from a set of demonstrations, the primary and essential step is recognizing actions and perceptions happening in each step of the demonstration. However, it can be difficult to learn meaningful representations purely from the sequence loss of programs when environments increase in visual complexity. To alleviate this issue, we propose to predict additional action sequences and a perception vector from the demonstrations as auxiliary tasks. The overview of the auxiliary tasks are illustrated in Figure 2.3. The first auxiliary task is predicting action sequences. Given a demo vector encoded by the demonstra- tion encoder, an action decoder LSTM decodeskth demo vectorv k demo into a sequence of actions. 
During training, a sequential cross entropy loss similar to Equation 2.5 is optimized: L A =− 1 MKT M X m=1 K X k=1 T X t=1 logp(a k∗ m,t |A k m,t− 1 ,v k demo ), (2.6) where,a k m,t is thet-th action token ink− th demonstration ofm-th training example,A k m,t ={a k m,1 ,...,a k m,t} is the history of previous actions at time stept. The second auxiliary task is predicting perceptions for each frame of the demonstrations. We denote a perception vectorΦ ={ϕ 1 ,...,ϕ L }∈{0,1} L as aL dimension binary vector obtained by executingL 21 def run(): while frontIsClear(): move() putMarker() turnLeft() move() putMarker() move() move() (b) Underlying Program Synthesized Program (a) Program seen demo unseen demo def run(): turnRight() turnRight() while frontIsClear(): move() if markersPresent(): turnLeft() move() else: turnRight() def run(): turnRight() turnRight() while frontIsClear(): move() else: turnRight() Figure 2.5: Karel Results. Seen training examples are on top row (in blue) and unseen testing examples are on the bottom row (in green). (a) A successful case with a program sequence match (b) Due to a missing branch condition execution in training data (top images), the synthesized program doesn’t incorporate the condition, resulting in execution mismatch in lower right testing image. perception primitives on a given states. Specifically, we formulate the perception vector prediction as a sequential multi-label binary classification problem and optimizes the binary cross entropy. L Φ = − 1 MKTL M X m=1 K X k=1 T X t=1 L X l=1 logp(ϕ k∗ m,t,l |P k m,t− 1 ,v k demo ), (2.7) whereP k m,t ={f(Φ k m,1 ),...,f(Φ k m,t )} is the history of encoded previous perception vectors andf(·) is an encoding function. The aggregated multi-task objective is as follows:L=L C +α L A +β L Φ , whereα andβ are hyper- parameter controlling the importance of each loss. For all the experiments in this paper, we setα =β =1. 2.5 Experiments We perform experiments in different environments: Karel [226] and ViZDoom [137]. We first describe the experimental setup and then present the experimental results. 22 2.5.1 EvaluationMetric To verify whether a model is able to infer an underlying programη ∗ from a given set of demonstrationsD, we evaluate accuracy based on the synthesized codes and the underlying program (sequence accuracy and program accuracy) as well as the execution of the program (execution accuracy). Sequenceaccuracy Comparison in the code space is based on the instantiated codeC ∗ of a ground truth program and the synthesized code ˆ C from a program synthesizer. The sequence accuracy counts exact match of two code sequences, which is formally written as: Acc seq = 1 M M X m=1 1 seq (C ∗ m , ˆ C m ), whereM is the number of testing examples and1 seq (·,·) is the indicator function of exact sequence match. Program accuracy While the sequence accuracy is simple, it is a pessimistic estimation of program accuracy since it does not considerprogramaliasing – different codes with identical program semantics ( e.g. repeat(2):(move()) andmove() move()). Therefore, we measure theprogramaccuracy by enumerating variations of codes. Specifically, we exploit the syntax of DSL to identify variations: e.g. unfolding repeat statements, decomposing if-else statement into two if statements, etc. Formally, the program accuracy is Acc program = 1 M P M m=1 1 prog (C ∗ m , ˆ C m ), where1 prog (C ∗ m , ˆ C m ) is an indicator function that returns 1 if any variations of ˆ C m match any variations ofC ∗ m . 
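As an illustration, the Python sketch below computes the sequence accuracy and shows one of the rewrites used to enumerate program variations, namely unfolding repeat statements. The token format is hypothetical, and the full set of rewrites used for the program accuracy (e.g., decomposing if-else statements) is broader than this single rule.

# Illustrative sketch of the sequence accuracy and of one "program aliasing"
# rewrite; the token format is hypothetical.
def sequence_accuracy(gt_codes, pred_codes):
    # Exact token-sequence match, averaged over the M test examples.
    return sum(gt == pred for gt, pred in zip(gt_codes, pred_codes)) / len(gt_codes)

def unfold_repeat(tokens):
    """Rewrite  repeat n ( body )  into n copies of body, so that e.g.
    ["repeat", "2", "(", "move", ")"] and ["move", "move"] are treated as
    the same program when computing the program accuracy."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "repeat":
            n = int(tokens[i + 1])            # number of repetitions
            depth, j = 1, i + 3               # tokens[i + 2] is the opening "("
            while depth:                      # find the matching ")"
                depth += {"(": 1, ")": -1}.get(tokens[j], 0)
                j += 1
            out += unfold_repeat(tokens[i + 3 : j - 1]) * n
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out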
Note that theprogramaccuracy is only computable when the DSL is relatively simple and some assumptions are made i.e. termination of loops. The details of computing program accuracy are presented in Appendix (Section 2.7). Executionaccuracy To evaluate how well a synthesized program can capture the behaviors of an un- derlying program, we compare the execution results of the synthesized program code ˆ C and the demon- strations D ∗ generated by a ground truth program η ∗ , where both are generated from the same set of sampled initial states I K = {s 1 1 ,...,s K 1 }. We formally define the execution accuracy as: Acc execution = 23 1 M P M m=1 1 execution (D ∗ m , ˆ D m ), where1 execution (D ∗ m , ˆ D m ) is the indicator function of exact sequence match. Note that when the number of sampled initial states becomes infinitely large, the execution accuracy converges to the program accuracy. 2.5.2 EvaluationSetting For training and evaluation, we collectM train training programs andM test test programs. Each program codeC ∗ m is randomly sampled from an environment specific DSL and compiled into an executable form η ∗ m . The corresponding demonstrations D ∗ m = {τ 1 ,...,τ K } are generated by running the program on K = K seen +K unseen different initial states. The seen demonstrations are used as an input to the pro- gram synthesizer, and the unseen demonstrations are used for computing execution accuracy. We train our model on the training setΩ train = {(C ∗ 1 ,D ∗ 1 ),...,(C ∗ M train ,D ∗ M train )} and test them on the testing set Ω test ={(C ∗ 1 ,D ∗ 1 ),...,(C ∗ M test ,D ∗ M test )}. Note thatΩ train andΩ test are disjoint. Both sequence and execution accuracies are used for the evaluation. The training details are described in Appendix (Section 2.7). 2.5.3 Baselines We compare our proposed model (ours) against baselines to evaluate the effectiveness of: (1) explicitly modeling the underlying programs (2) our proposed model with the summarizer module and multi-task objective. To address (1), we design a programinductionbaseline based on [69], which bypasses synthesizing programs and directly predicts action sequences. We modified the architecture to incorporate multiple demonstrations as well as pixel inputs. The details are presented in Appendix (Section 2.7). For a fair comparison with our model that gets supervision of perception primitives, we feed the perception primitive vector of every frame as an input to the induction baseline . To verify (2), we compose a program synthesis baseline simply consisting of a demonstration encoder and a program decoder without a summarizer module 24 and multi-task loss. To integrate all the demonstration encoder outputs across demos, an average pooling layer is applied. 2.5.4 Karel We first focus on a visually simple environment to verify the feasibility of program synthesis from demon- strations. We consider Karel [226] featuring an agent navigating through a gridworld with walls and interacting with markers based on the underlying program. 2.5.4.1 EnvironmentandDataset Karel has 5 action primitives for moving and interacting with markers and 5 perception primitives for detecting obstacles and markers. A gridworld of8× 8 size is used for our experiments. To evaluate the generalization ability of the program synthesizer to novel programs, we randomly generate 35,000 unique programs and split them into a training set with 25,000 program, a validation set with 5,000 program, and a testing set with 5,000 programs. 
The maximum length of the program codes is43. For each program,10 seen demonstrations and5 unseen demonstrations are generated. The maximum length of the demonstrations is 20. Methods Execution Program Se- quence Induction baseline 62.8% (69.1%) - - Synthesis baseline 64.1% 42.4% 35.7% + summarizer (ours) 68.6% 45.3% 38.3% + multi-task loss (ours-full) 72.1% 48.9% 41.0% Table 2.1: Performance evaluation on Karel environment. Synthesis baseline outperforms induction baseline . The summarizer module and the multi-task objective introduce significant improvement. 25 2.5.4.2 PerformanceEvaluation The evaluation results of our proposed model and baselines are shown in Table 2.1. Comparison of execution accuracy shows relative performance of the proposed model and the baselines. Synthesis baseline outperforms induction baseline based on the execution accuracy, which shows the advantage of explicit modeling the underlying programs. Induction baseline often matches some of theK unseen demonstration, but fails to match all of them from a single program. This observation is supported by the number in the parenthesis (69.1%), which counts the number of correct demonstrations while execution accuracy counts the number of program whose demonstrations match perfectly. This finding has also been reported in [63]. The proposed model shows consistent improvement oversynthesisbaseline for all the evaluation metrics. The sequence accuracy for our full model is41.0%, which is a reasonable generalization performance given that none of the test programs are seen during training. We observe that our model often synthesizes programs that do not exactly match with the ground truth program but are semantically identical. For example, given a ground truth programrepeat(4):( turnLeft; turnLeft; turnLeft ), our model predictsrepeat (12): ( turnLeft ). These cases are considered correct for program accuracy. Note that comparison based on the execution and sequence accuracy is consistent with the program accuracy, which justifies using them as a proxy for the program accuracy when it is not computable. The qualitative success and failure cases of the proposed model are described in Figure 2.5. The Figure 2.5(a) shows a correct case where a single program is used to generate diverse action sequences. Figure 2.5(b) show a failure case, where part of the ground truth program tokens are not generated due to missing seen demonstration hitting that condition. Methods k=3 k=5 k=10 Synthesis baseline 58.5% 60.1% 64.1% + summarizer (ours) 60.6% 63.1% 68.6% Improvement 2.1% 3.0% 4.5% Table 2.2: Effect of the summarizer module. Employing the proposed summarizer module brings more improvement as the number of seen demonstration increases over synthesis baseline . 26 Underlying Program inTarget HellKnight ! attack() inTarget HellKnight not inTarget Demon ! moveRight() Demo 1 inTarget HellKnight and inTarget Demon inTarget HellKnight ! attack() inTarget Demon ! attack() Demo 2 def run(): if inTarget HellKnight: attack() if inTarget Demon: attack() else: moveRight() def run(): if inTarget HellKnight: attack() if not inTarget Demon: moveRight() else: attack() Synthesized Program Figure 2.6: ViZDoom Results. Annotations below frames are the perception conditions and actions. Hel- lknight, Revenant, and Demon monsters are white, black, and pink respectively. The model is able to correctly percepts the condition and actions as well as synthesize a precise program. 
Note that the synthe- sized and the underlying program are semantically identical. 2.5.4.3 EffectofSummarizer To verify the effectiveness of our proposed summarizer module, we conduct experiments where models are trained on varying numbers of demonstrations and compare the execution accuracy in Table 2.2. As the number of demonstrations increases, both models enjoy a performance gain due to extra available information. However, the gap between our proposed model and synthesis baseline also grows, which demonstrates the effectiveness of our summarizer module. 2.5.5 ViZDoom Doom is a 3D first-person shooter game where a player can move in a continuous space and interact with monsters, items and weapons. We use ViZDoom [137], an open-source Doom-based AI platform, for our experiments. ViZDoom’s increased visual complexity and a richer DSL could test the boundary of models in state comprehension, demo summarization, and program synthesis. 27 2.5.5.1 EnvironmentandDataset The ViZDoom environment has 7 action primitives including diverse motions and attack as well as 6 perception primitives checking the existence of different monsters and whether they are targeted. Each state is represented by an image with120× 160× 3 pixels. For each demonstration, initial state is sampled by randomly spawning different types of monsters and ammos in different location and placing an agent randomly. To ensure that the program behavior results in the same execution, we control the environment to be deterministic. We generate 80,000 training programs and 8,000 testing programs. To encourage diverse behavior of generated program, we give a higher sampling rate to the perception primitives that has higher entropy over K different initial states. We use 25 seen demonstrations for program synthesis and 10 unseen demonstrations for execution accuracy measure. The maximum length of programs is32 and the maximum length of demonstrations is20. 2.5.5.2 PerformanceEvaluation Table 2.3 shows the result on ViZDoom environment. Synthesis baseline outperforms induction baseline in terms of the execution accuracy, which shows the strength of program synthesis for understanding diverse demonstrations. In addition, the proposed summarizer module and the multi-task objective bring improvement in terms of all evaluation metrics. Also we found that the syntax of the synthesized programs is about99.9% accurate. This tells that the program synthesizer correctly learn the syntax of the DSL. Figure 2.6 shows the qualitative result. It is shown that the generated program covers different condi- tional behavior in the demonstration successfully. In the example, the synthesized program does not match the underlying program in the code space, while matching the underlying program in the program space. 28 Methods Execution Program Sequence Induction baseline 35.1% (60.6%) - - Synthesis baseline 48.2% 39.9% 33.1% Ours-full 78.4% 62.5% 53.2% Table 2.3: Performance evaluation on ViZDoom environment. The proposed model outperforms induction baseline and synthesis baseline significantly as the environment is more visually complex. Methods Execution Program Sequence Induction baseline 26.5% (83.1%) - - Synthesis baseline 59.9% 44.4% 36.1% Ours-full 89.4% 69.1% 58.8% Table 2.4: If-else experiment on ViZDoom environment. Single if-else statement with two branching consequences is used to evaluate ability of inferring underlying conditions. 
2.5.5.3 Analysis To verify the importance of inferring underlying conditions, we perform evaluation only with programs containing a single if-else statement with two branching consequences. This setting is sufficiently simple to isolate other diverse factors that might affect the evaluation result. For the experiment, we use 25 seen demonstrations to understand a behavior and10 unseen demonstrations for testing. The result is shown in Table 2.4. Induction baseline has difficulty inferring the underlying condition to match all unseen demonstrations most of the times. In addition, our proposed model outperforms synthesis baseline ,2 which demonstrates the effectiveness of the summarizer module and the multi-task objective. Figure 2.7 illustrates how models trained with a fixed number ( 25) of seen demonstration generalize to fewer or more seen demonstrations during testing time. This shows our model and synthesis baseline are able to leverage more seen demonstrations to synthesize more accurate programs as well as achieve reasonable performance when fewer demonstrations are given. On the contrary, Induction baseline could not exploit more than10 demonstrations well. 29 Figure 2.7: Generalization over different number of K seen . The baseline models and our model trained with 25 seen demonstration are evaluated with fewer or more seen demonstrations. 2.5.5.4 DebuggingtheSynthesizedProgram One of the intriguing properties of the program synthesis is that synthesized programs are interpretable and interactable by human. This makes it possible to debug a synthesized program and fix minor mistakes to correct the behaviors. To verify this idea, we use edit distance between synthesized program and ground truth program as a number of minimum token that is required to get a exactly correct program. With this setting, we found that fixing at most 2 program token provides 4.9% improvement in sequence accuracy and4.1% improvement in execution accuracy. 2.6 Conclusion We propose the task of synthesizing a program from diverse demonstration videos. To address this, we introduce a model augmented with a summarizer module to deal with branching conditions and a multi-task objective to induce meaningful latent representations. Our method is evaluated on a fully observable, third-person environment (Karel environment) and a partially observable, egocentric game (ViZDoom environment). The experiments demonstrate that the proposed model is able to reliably infer underlying programs and achieve satisfactory performances. 30 2.7 Appendix 2.7.1 DetailedNetworkArchitectures 2.7.1.1 DemonstrationEncoder The demonstration encoder consists of a stack of convolutional layers and an LSTM. The stack of convolu- tional layers consists of five layers, which can be represented as: C {3,2,16} →C {3,2,32} →C {3,2,48} →C {3,2,48} →C {3,2,48} , whereC k,s,n denotes a convolutional layer with a kernel sizek, strides, and a number of channeln. Then the encoded feature maps are flatten and passed to an LSTM. We experiment with RNN, GRU, and LSTM and found that LSTM works the best. 2.7.1.2 SummarizerModule For the relation network of the summarizer module, we use two fully-connected layers with LeakyReLU activation. We also experiment with RNN, GRU, and LSTM for summarizer module and found that LSTM works the best. 2.7.1.3 ProgramDecoder For the token embedding function used to produce embedding vectors of program tokens, we create an embedding lookup with a hidden size of 128. 
An LSTM with a hidden size of 512 is utilized to decode program tokens. 31 2.7.2 TrainingDetails We implement the proposed model and its submodules described in the main paper in TensorFlow [1] and trained it using batch size of 32 with Adam optimizer [140]. 2.7.3 One-shotImitationLearningBaseline To evaluate the effectiveness of our proposed model, we implement an One-shot imitation learning model proposed in [69]. Since the model proposed in the original paper is not able to 1. Incorporate multiple seen specification demonstration sequences 2. Handle a varying-length number of demonstrations 3. Deal with visual input we make modifications as follows: 1. Augment the demonstration encoder with a stack of convolutional layers to process visual input 2. Remove temporal dropout, temporal convolution, and neighborhood attention 3. Add an LSTM with an attention mechanism [178]. We also experimented with the monotonic attention mechanism [241] and empirically found [178] works better. 4. Replace the context network with an average pooling layer to handle a varying-length number of demonstrations 5. Change the manipulation network to an LSTM decoder, which optimizes the predictions of one-hot action vectors at each time step 32 2.7.4 DatasetDetails 2.7.4.1 Karel We use 5 action primitives and 5 perception primitives for Karel, which is formally defined as follows: action:= move| turnRight| turnLeft| pickMarker | putMarker perception:= frontIsClear| leftIsClear| rightIsClear | markersPresent| noMarkersPresent For Karel environment, we use 8× 8× 16 state representation, where each channel of the state representation has its own meaning. 0: agent facing north|1: agent facing south∥ 2: agent facing west|3: agent facing east| 4: wall or empty|5∼ 15:0∼ 10markers 33 2.7.4.2 ViZDoom ViZDoom model contains 7 action primitives and 6 perception primitives, which is formally defined as follows: action:= moveBackward| moveForward| moveLeft | moveRight| turnLeft| turnRight|attack perception:= isTherem| inTargetm monsterm := demon| hellKnight| revenant ViZDoom environment has120× 160× 3 image as a state representation. We resize them to80× 80× 3 to feed to our model as an input. To generate meaningful program and collecting diverse behavior we use heuristics to sample codes and demonstrations. Given each state we sequentially increase the program length by adding more statement. At the same time action is instantly taken to the environment and state transition is performed. Whenever statement with condition is sampled for the program, we give higher sampling probability to perception that makes current state more diverse. 34 Chapter3 LearningtoSynthesizeProgramsfromRewardFunctions 3.1 Introduction Recently, deep reinforcement learning (DRL) methods have demonstrated encouraging performance on a variety of domains such as outperforming humans in complex games [200, 276, 277, 310] or controlling robots [39, 98, 15, 105, 334, 348, 159]. Despite the recent progress in the field, acquiring complex skills through trial and error still remains challenging and these neural network policies often have difficulty generalizing to novel scenarios. Moreover, such policies are not interpretable to humans and therefore are difficult to debug when these challenges arise. 
To address these issues, a growing body of work aims to learn programmatic policies that are structured in more interpretable and generalizable representations such as decision trees [25], state-machines [123], and programs described by domain-specific programming languages [308, 307]. Yet, the programmatic representations employed in these works are often limited in expressiveness due to constraints on the policy spaces. For example, decision tree policies are incapable of naïvely generating repetitive behaviors, state machine policies used in [123] are computationally complex to scale to policies representing diverse behaviors, and the programs of [308, 307] are constrained to a set of predefined program templates. On 35 the other hand, program synthesis works that aim to represent desired behaviors using flexible domain- specific programs often require extra supervision such as input/output pairs [63, 35, 46, 273, 155] or expert demonstrations [287, 55], which can be difficult to obtain. In this paper, we present a framework to instead synthesize human-readable programs in an expressive representation, solely from rewards, to solve tasks described by Markov Decision Processes (MDPs). Specifically, we represent a policy using a program composed of control flows ( e.g. if/else and loops) and an agent’s perceptions and actions. Our programs can flexibly compose behaviors through perception- conditioned loops and nested conditional statements. However, composing individual program tokens (e.g. if,while,move()) in a trial-and-error fashion to synthesize programs that can solve given MDPs can be extremely difficult and inefficient. To address this problem, we propose to first learn a latent program embedding space where nearby latent programs correspond to similar behaviors and allows for smooth interpolation, together with a program decoder that can decode a latent program to a program consisting of a sequence of program tokens. Then, when a task is given, this embedding space allows us to iteratively search over candidate latent programs to find a program that induces desired behavior to maximize the reward. Specifically, this embedding space is learned through reconstruction of randomly generated programs and the behaviors they induce in the environment in an unsupervised manner. Once learned, the embedding space can be reused to solve different tasks without retraining. To evaluate the proposed framework, we consider the Karel domain [226], featuring an agent navigating through a gridworld and interacting with objects to solve tasks such as stacking and navigation. The experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms program synthesis and deep RL baselines. In addition, we justify the necessity of the proposed two-stage learning scheme as well as conduct an extensive analysis comparing various approaches for learning the latent program embedding spaces. Finally, we perform 36 experiments which highlight that the programs produced by our proposed framework can both generalize to larger state spaces and unseen state configurations as well as be interpreted and edited by humans to improve their task performance. 3.2 RelatedWork Neuralprograminductionandsynthesis. 
Program induction methods [155, 332, 61, 208, 95, 132, 89, 247, 38, 329, 37, 335, 168, 120] aim to implicitly induce the underlying programs to mimic the behaviors demonstrated in given task specifications such as input/output pairs or expert demonstrations. On the other hand, program synthesis methods [63, 35, 46, 273, 287, 31, 219, 273, 175, 171, 169, 75, 76, 24, 173, 3, 117, 278, 328, 44, 5, 48, 18, 117, 327, 47, 215] explicitly synthesize the underlying programs and execute the programs to perform the tasks from task specifications such input/output pairs, demonstrations, language instructions. In contrast, we aim to learn to synthesize programs solely from reward described by an MDP without other task specifications. Similarly to us, a two-stage synthesis method is proposed in [173]. Yet, the task is to match truth tables for given test programs rather than solve MDPs. Their first stage requires the entire ground-truth table for each program synthesized during training, which is infeasible to apply to our problem setup (i.e. synthesizing imperative programs for solving MDPs). Learningprogrammaticpolicies. Prior works have also addressed the problem of learning program- matic policies [52, 326, 154]. Bastani, Pu, and Solar-Lezama [25] learns a decision tree as a programmatic policy for pong and cartpole environments by imitating an oracle neural policy. However, decision trees are incapable of representing repeating behaviors on their own. Silver et al. [278] addresses this by including a loop-style token for their decision tree policy, though it is still not as expressive as synthesized loops. Inala et al. [123] learns programmatic policies as finite state machines by imitating a teacher policy, although finite state machine complexity can scale quadratically with the number of states, making them difficult to scale to more complex behaviors. 37 Programρ := DEF run m(s m) Repetitionn:= Number of repetitions Perceptionh:= Domain-dependent perceptions Conditionb:= perception h| not perception h Actiona:= Domain-dependent actions Statements:= while c(b c) w(s w)|s 1 ;s 2 |a| repeat R=n r(s r)| if c(b c) i(s i)| ifelse c(b c) i(s 1 i) else e(s 2 e) Figure 3.1: The domain specific language (DSL) for constructing programs. Another line of work instead synthesizes programs structured in Domain-Specific Languages (DSLs), allowing humans to design tokens (e.g. conditions and operations) and control flows ( e.g. while loops, if statements, reusable functions) to induce desired behaviors and can produce human interpretable programs. Verma et al. [308, 307] distill neural network policies into programmatic policies. Yet, the initial programs are constrained to a set of predefined program templates. This significantly limits the scope of synthesizable programs and requires designing such templates for each task. In contrast, our method can synthesize diverse programs, without templates, which can flexibly represent the complex behaviors required to solve various tasks. 3.3 ProblemFormulation We are interested in learning to synthesize a program structured in a given DSL that can be executed to solve a given task described by an MDP, purely from reward. In this section, we formally define our definition of a program and DSL, tasks described by MDPs, and the problem formulation. ProgramandDomain-SpecificLanguage. The programs, or programmatic policies, considered in this work are defined based on a DSL as shown in Figure 3.1. 
The DSL consists of control flows and an agent’s 38 perceptions and actions. A perception indicates circumstances in the environment (e.g.frontIsClear()) that can be perceived by an agent, while an action defines a certain behavior that can be performed by an agent (e.g.move(),turnLeft()). Control flow includes if/else statements, loops, and boolean/logical operators to compose more sophisticated conditions. A policy considered in this work is described by a programρ which is executed to produce a sequence of actions given perceptions from the environment. MDP. We consider finite-horizon discounted MDPs with initial state distribution µ (s o ) and discount factor γ . For a fixed sequence {(s 0 ,a 0 ),...,(s t ,a t )} of states and actions obtained from a rollout of a given policy, the performance of the policy is evaluated based on a discounted return P T t=0 γ t r t , whereT is the horizon of the episode andr t =R(s t ,a t ) the reward function. Objective. Our objective ismax ρ E a∼ EXEC(ρ ),s 0 ∼ µ [ P T t=0 γ t r t ], where EXEC returns the actions induced by executing a program policyρ in the environment. Note that one can view this objective as a special case of the standard RL objective, where the policy is represented as a program which follows the grammar of the DSL and the policy rollout is obtained by executing the program. 3.4 Approach Our goal is to develop a framework that can synthesize a program (i.e. a programmatic policy) structured in a given DSL that can be executed to solve a task of interest. This requires the ability to synthesize a program that is not only valid for execution (e.g. grammatically correct) but also describes desired behaviors for solving the task from only the reward. Yet, learning to synthesize such a program from scratch for every new task can be difficult and inefficient. To this end, we propose ourLearningEmbeddings for lAtentProgramSynthesis framework, dubbed LEAPS, as illustrated in Figure 3.2. LEAPS first learns a latent program embedding space that continuously parameterizes diverse behaviors and a program decoder that decodes a latent program to a program consisting of a sequence of program tokens. Then, when a task is given, we iteratively search over this 39 Latent Program z Program ⇢ (a) Learning Program Embedding Stage (b) Latent Program Search Stage def run(): if frontIsClear(): move() else: turnLeft() def run(): if frontIsClear(): move() else: turnLeft() Environment Execute Environment Execute Cross Entropy Method Sample Next Candidate Latent Programs Noise + Candidate Latent Program def run(): if frontIsClear(): move() else: turnLeft() Predicted Program Environment r s a L P L R Reconstructed Program ˆ ⇢ L L Figure 3.2: (a)Learningprogramembeddingstage: we propose to learn a program embedding space by training a program encoderq ϕ that encodes a program as a latent programz, a program decoderp θ that decodes the latent programz back to a reconstructed program ˆ ρ , and a policyπ that conditions on the latent programz and acts as a neural program executor to produce the execution trace of the latent programz. The model optimizes a combination of a program reconstruction lossL P , a program behavior reconstruction lossL R , and a latent behavior reconstruction lossL L . a 1 ,a 2 ,..,a t denotes actions produced by either the policyπ or program execution. 
(b)Latentprogramsearchstage: we use the Cross Entropy Method to iteratively search for the best candidate latent programs that can be decoded and executed to maximize the reward to solve given tasks. embedding space and decode each candidate latent program using the decoder to find a program that maximizes the reward. This two-stage learning scheme not only enables learning to synthesize programs to acquire desired behaviors described by MDPs solely from reward, but also allows reusing the learned embedding space to solve different tasks without retraining. In the rest of this section, we describe how we construct the model and our learning objectives for the latent program embedding space in Section 3.4.1. Then, we present how a program that describes desired behaviors for a given task can be found through a search algorithm in Section 3.4.2. 3.4.1 LearningaProgramEmbeddingSpace To learn a latent program embedding space, we propose to train a variational autoencoder (VAE) [141] that consists of a program encoderq ϕ which encodes a programρ to a latent programz and a program decoder 40 p θ which reconstructs the program from the latent. Specifically, the VAE is trained through reconstruction of randomly generated programs and the behaviors they induce in the environment in an unsupervised manner. Architectural details are listed in Section 3.7.12.6. Since we aim to iteratively search over the learned embedding space to achieve certain behaviors when a task is given, we want this embedding space to allow for smooth behavior interpolation (i.e. programs that exhibit similar behaviors are encoded closer in the embedding space). To this end, we propose to train the model by optimizing the following three objectives. 3.4.1.1 ProgramReconstruction To learn a program embedding space, we train a program encoderq ϕ and a program decoderp θ to reconstruct programs composed of sequences of program tokens. Given an input programρ consisting of a sequence of program tokens, the encoder processes the input program one token at a time and produces a latent program embeddingz. Then, the decoder outputs program tokens one by one from the latent program embeddingz to synthesize a reconstructed program ˆ ρ . Both the encoder and the decoder are recurrent neural networks and are trained to optimize theβ -VAE [113] loss: L P θ,ϕ (ρ )=− E z∼ q ϕ (z|ρ ) [logp θ (ρ |z)]+βD KL (q ϕ (z|ρ )∥p θ (z)). (3.1) 3.4.1.2 ProgramBehaviorReconstruction While the loss in Equation 3.1 enforces that the model encodes syntactically similar programs close to each other in the embedding space, we also want to encourage programs with the same semantics to have similar program embeddings. An example that demonstrates the importance of this is the program aliasing issue, where different programs have identical program semantics ( e.g.repeat(2): move() andmove() move()). Thus, we introduce an objective that compares the execution traces of the input program and the 41 reconstructed program. Since the program execution process is not differentiable, we optimize the model via REINFORCE [325]: L R θ,ϕ (ρ )=− E z∼ q ϕ (z|ρ ) [R mat (p θ (ρ |z),ρ )], (3.2) whereR mat (ˆ ρ,ρ ), the reward for matching the input program’s behavior, is defined as R mat (ˆ ρ,ρ )=E µ [ 1 N N X t=1 1{EXEC i (ˆ ρ )== EXEC i (ρ )∀i=1,2,...t} | {z } stays 0 after the first t where EXECt(ˆ ρ )!= EXECt(ρ ) ], (3.3) whereN is the maximum of the lengths of the execution traces of both programs, and EXEC i (ρ ) represents the action taken by programρ at timei. 
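The following Python sketch illustrates how the behavior-matching reward R_mat of Equation 3.3 can be computed from execution traces. The function exec_trace is a placeholder for the program executor that returns the action sequence induced by a program from a given initial state; it is an assumption for illustration and not part of the actual implementation.

# Illustrative sketch of R_mat (Eq. 3.3): fraction of time steps, averaged over
# sampled initial states, for which the two execution traces agree on every
# step up to and including that one.
def r_mat(pred_program, gt_program, initial_states, exec_trace):
    rewards = []
    for s0 in initial_states:                     # expectation over mu(s_0)
        pred = exec_trace(pred_program, s0)
        gt = exec_trace(gt_program, s0)
        n = max(len(pred), len(gt))               # N in Eq. 3.3
        matched = 0
        for a_pred, a_gt in zip(pred, gt):
            if a_pred != a_gt:                    # credit stops at the first mismatch
                break
            matched += 1
        # If one trace is a strict prefix of the other, the remaining steps count
        # as unmatched, mirroring the indicator in Eq. 3.3.
        rewards.append(matched / n if n else 1.0)
    return sum(rewards) / len(rewards)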
Thus this objective encourages the model to embed behaviorally similar yet possibly syntactically different programs to similar latent programs. 3.4.1.3 LatentBehaviorReconstruction To further encourage learning a program embedding space that allows for smooth behavior interpolation, we devise another source of supervision by learning a program embedding-conditioned policy. Denoted π (a|z,s t ), this recurrent policy takes the program embeddingz produced by the program encoder and learns to predict corresponding agent actions. One can view this policy as a neural program executor that allows gradient propagation through the policy and the program encoder by optimizing the cross entropy between the actions obtained by executing the input programρ and the actions predicted by the policy: L L π (ρ,π )=− E µ [ M X t=1 |A| X i=1 1{EXEC i (ˆ ρ )== EXEC i (ρ )}logπ (a i |z,s t )], (3.4) whereM denotes the length of the execution ofρ . Optimizing this objective directly encourages the program embeddings, through supervised learning instead of RL as inL R , to be useful for action reconstruction, thus further ensuring that similar behaviors are encoded together and allowing for smooth interpolation. 42 Note that this policy is only used for improving learning the program embedding space not for solving the tasks of interest in the later stage. In summary, we propose to optimize three sources of supervision to learn the program embedding space that allows for smooth interpolation and can be used to search for desired agent behaviors: (1)L P (Equation 3.1), theβ -VAE objective for program reconstruction, (2)L R (Equation 3.2), an RL environment- state matching loss for the reconstructed program, and (3)L L (Equation 3.4), a supervised learning loss to encourage predicting the ground-truth agent action sequences. Thus our combined objective is: min θ,ϕ,π λ 1 L P θ,ϕ (ρ )+λ 2 L R θ,ϕ (ρ )+λ 3 L L π (ρ,π ), (3.5) whereλ 1 ,λ 2 , andλ 3 are hyperparameters controlling the importance of each loss. Optimizing the com- bination of these losses encourages the program embedding to be both semantically and syntactically informative. More training details can be found in Section 3.7.12.6. 3.4.2 LatentProgramSearch: SynthesizingaTask-SolvingProgram Once the program embedding space is learned, our goal becomes searching for a latent program that maximizes the reward described by a given task MDP. To this end, we adapt the Cross Entropy Method (CEM) [253], a gradient-free continuous search algorithm, to iteratively search over the program embedding space. Specifically, we (1) sample a distribution of latent programs, (2) decode the sampled latent programs into programs using the learned program decoderp θ , (3) execute the programs in the task environment and obtain the corresponding rewards, and (4) update the CEM sampling distribution based on the rewards. This process is repeated until either convergence or the maximum number of sampling steps has been reached. 43 stairClimber (a) StairClimber fourCorners (b) FourCorner topOf f (c) TopOff maze (d) Maze cleanHouse (e) CleanHouse harvester (f) Harvester Figure 3.3: TheKarelproblemset: the domain features an agent navigating through a gridworld with walls and interacting with markers, allowing for designing tasks that demand certain behaviors. The tasks are further described in Section 3.7.11 with visualizations in Figure 3.18. 3.5 Experiments We first introduce the environment and the tasks in Section 3.5.1 and describe the experimental setup in Section 3.5.2. 
Then, we justify the design of LEAPS by conducting extensive ablation studies in Section 3.5.3. We describe the baselines used for comparison in Section 3.5.4, followed by the experimental results presented in Section 3.5.5. In Section 3.5.6, we conduct experiments to evaluate the ability of our method to generalize to a larger state space without further learning. Finally, we investigate how LEAPS’ interpretability can be leveraged by conducting experiments that allow humans to debug and improve the programs synthesized by LEAPS in Section 3.5.7 3.5.1 KarelDomain To evaluate the proposed framework, we consider the Karel domain [226], as featured in [35, 273, 287], which features an agent navigating through a gridworld with walls and interacting with markers. The agent has 5 actions for moving and interacting with marker and 5 perceptions for detecting obstacles and markers. The tasks of interest are shown in Figure 3.3. Note that most tasks have randomly sampled agent, wall, marker, and/or goal configurations. When either training or evaluating, we randomly sample initial configurations upon every episode reset. More details can be found in Section 3.7.11. 44 3.5.2 Programs To produce programs for learning the program embedding space, we randomly generated a dataset of 50,000 unique programs. Note that the programs are generated independently of any Karel tasks; each program is created only by sampling tokens from the DSL, similar to the procedures used in [63, 35, 46, 273, 287, 55]. This dataset is split into a training set with 35,000 programs a validation set with 7,500 programs, and a testing set with 7,500 programs. The validation set is used to select the learned program embedding space to use for the program synthesis stage. For each program, we sample random Karel states and execute the program on them from different starting states to obtain10 environment rollouts to compute the program behavior reconstruction lossL R and the latent behavior reconstruction lossL L when learning the program embedding space. We perform checks to ensure rollouts cover all execution branches in the program so that they are representative of all aspects of the program’s behavior. The maximum length of the programs is44 tokens and the average length is17.9. We plot a histogram of their lengths in Figure 3.17 (in Appendix). More dataset generation details can be found in Section 3.7.10. 3.5.3 AblationStudy We first ablate various components of our proposed framework in order to (1) justify the necessity of the proposed two-stage learning scheme and (2) identify the effects of the proposed objectives. We consider the following baselines and ablations of our method (illustrated Section 3.7.9). • Naïve: a program synthesis baseline that learns to directly synthesize a program from scratch by recurrently predicting a sequence of program tokens. This baseline investigates if an end-to-end learning method can solve the problem. More details can be found in Section 3.7.12.4. 45 • LEAPS-P: the simplest ablation of LEAPS, in which the program embedding space is learned by only optimizing the program reconstruction lossL P (Equation 3.1). • LEAPS-P+R: an ablation of LEAPS which optimizes both the program reconstruction lossL P (Equation 3.1) and the program behavior reconstruction lossL R (Equation 3.2). • LEAPS-P+L: an ablation of LEAPS which optimizes both the program reconstruction lossL P (Equation 3.1) and the latent behavior reconstruction lossL L (Equation 3.4). 
• LEAPS (LEAPS-P+R+L): LEAPS with all the losses, optimizing our full objective in Equation 3.5. • LEAPS-rand-{8/64}: similar to LEAPS, this ablation also optimizes the full objective (Equation 3.5) for learning the program embedding space. Yet, when searching latent programs, instead of CEM, it simply randomly samples 8/64 candidate latent programs and chooses the best performing one. These baselines justify the effectiveness of using CEM for searching latent programs. Table 3.1: Program behavior reconstruction rewards (standard deviations) across all methods. WHILE IFELSE+WHILE 2IF+IFELSE WHILE+2IF+IFELSE Avg Reward Naïve 0.65 (0.33) 0.83 (0.07) 0.61 (0.33) 0.16 (0.06) 0.56 LEAPS-P 0.95 (0.13) 0.82 (0.08) 0.58 (0.35) 0.33 (0.17) 0.67 LEAPS-P+R 0.98 (0.09) 0.77 (0.05) 0.63 (0.25) 0.52 (0.27) 0.72 LEAPS-P+L 1.06 (0.00) 0.84 (0.10) 0.77 (0.23) 0.33 (0.13) 0.75 LEAPS-rand-8 0.62 (0.24) 0.49 (0.09) 0.36 (0.18) 0.28 (0.14) 0.44 LEAPS-rand-64 0.78 (0.22) 0.63 (0.09) 0.55 (0.20) 0.37 (0.09) 0.58 LEAPS 1.06 (0.08) 0.87 (0.13) 0.85 (0.30) 0.57 (0.23) 0.84 Program Behavior Reconstruction. To determine the effectiveness of the proposed two-stage learning scheme and the learning objectives, we measure how effective each ablation is at reconstructing the behaviors of input programs. We use programs from the test set (shown in Figure 3.11 in Appendix), and utilize the environment state matching rewardR mat (ˆ ρ,ρ ) (Equation 3.3), with a0.1 bonus for synthesizing a syntactically correct program. Thus the return ranges between[0,1.1]. We report the mean cumulative return, over 5 random seeds, of the final programs after convergence. 46 The results are reported in Table 3.1. Each test is named after its control flows ( e.g.IFELSE+WHILE has an if-else statement and a while loop). The naïve program synthesis baseline fails on the complex WHILE+2IF+IFELSE program, as it rarely synthesizes conditional and loop statements, instead generating long sequences of action tokens that attempt to replicate the desired behavior of those statements (see synthesized programs in Figure 3.12). We believe that this is because it is incentivized to initially predict action tokens to gain more immediate reward, making it less likely to synthesize other tokens. LEAPS and its variations perform better and synthesize more complex programs, demonstrating the importance of the proposed two-stage learning scheme in biasing program search. We also note that LEAPS-P achieves the worst performance out of the CEM search LEAPS ablations, indicating that optimizing the program reconstruction lossL P (the VAE loss) alone does not yield satisfactory results. Jointly optimizingL P with either the program behavior reconstruction lossL R or the latent behavior reconstruction lossL L improves the performance, and optimizing our full objective with all three achieves the best performance across all tasks, indicating the effectiveness of the proposed losses. Finally, LEAPS outperforms LEAPS-rand-8/64, suggesting the necessity of adopting better search algorithms such as CEM. Table 3.2: Program embedding space smoothness. For each program, we execute the ten nearest programs in the learned embedding space of each model to calculate the mean state-matching rewardR mat against the original program. We report R mat averaged over all programs in each dataset. LEAPS-P LEAPS-P+R LEAPS-P+L LEAPS Training 0.22 0.22 0.31 0.31 Validation 0.22 0.21 0.27 0.27 Testing 0.22 0.22 0.28 0.27 Program Embedding Space Smoothness. 
We investigate if the program and latent behavior reconstruction losses encourage learning a behaviorally smooth embedding space. To quantify behavioral smoothness, we measure how much a change in the embedding space corresponds to a change in behavior by comparing execution traces. For all programs, we compute the pairwise Euclidean distance between their embeddings in each model. We then calculate the environment state matching reward R_mat between the decoded programs by executing them from the same initial state.

The results are reported in Table 3.2. LEAPS and LEAPS-P+L perform the best, suggesting that optimizing the latent behavior reconstruction objective L_L in Equation 3.4 is essential for improving the smoothness of the latent space in terms of execution behavior. We further analyze and visualize the learned program embedding space in Section 3.7.1 and Figure 3.4 (in Appendix).

3.5.4 Baselines

We evaluate LEAPS against the following baselines (illustrated in Figure 3.16 in Appendix Section 3.7.9).
• DRL: a neural network policy trained on each task and taking raw states (Karel grids) as input.
• DRL-abs: a recurrent neural network policy directly trained on each Karel task but taking abstract states as input (i.e. it sees the same perceptions as LEAPS, e.g. frontIsClear()==true).
• DRL-abs-t: a DRL transfer learning baseline in which, for each task, we train DRL-abs policies on all other tasks and then fine-tune them on the current task. Thus it acquires a prior by learning to first solve other Karel tasks. Rewards are reported for the transferred policy that achieves the highest return. We only transfer DRL-abs policies as some tasks have different state spaces.
• HRL: a hierarchical RL baseline in which a VAE is first trained on action sequences from the program execution traces used by LEAPS. Once trained, the decoder is utilized as a low-level policy, and a high-level policy is learned to sample actions from it. Similar to LEAPS, this baseline utilizes the dataset to produce a prior of the domain. It takes raw states (Karel grids) as input.
• HRL-abs: the same method as HRL but taking abstract states (i.e. local perceptions) as input.
• VIPER [25]: a decision-tree programmatic policy which imitates the behavior of a deep RL teacher policy via a modified DAgger algorithm [251]. This decision tree policy cannot synthesize loops, allowing us to highlight the performance advantages of the more expressive program representation that LEAPS can leverage.

Table 3.3: Mean return (standard deviation) of all methods across Karel tasks, evaluated over 5 random seeds. DRL methods, program synthesis baselines, and LEAPS ablations are separately grouped.
StairClimber FourCorner TopOff Maze CleanHouse Harvester DRL 1.00 (0.00) 0.29 (0.05) 0.32 (0.07) 1.00 (0.00) 0.00 (0.00) 0.90 (0.10) DRL-abs 0.13 (0.29) 0.36 (0.44) 0.63 (0.23) 1.00 (0.00) 0.01 (0.02) 0.32 (0.18) DRL-abs-t 0.00 (0.00) 0.05 (0.10) 0.17 (0.11) 1.00 (0.00) 0.01 (0.02) 0.16 (0.18) HRL -0.51 (0.17) 0.01 (0.00) 0.17 (0.11) 0.62 (0.05) 0.01 (0.00) 0.00 (0.00) HRL-abs -0.05 (0.07) 0.00 (0.00) 0.19 (0.12) 0.56 (0.03) 0.00 (0.00) -0.03 (0.02) Naïve 0.40 (0.49) 0.13 (0.15) 0.26 (0.27) 0.76 (0.43) 0.07 (0.09) 0.21 (0.25) VIPER 0.02 (0.02) 0.40 (0.42) 0.30 (0.06) 0.69 (0.05) 0.00 (0.00) 0.51 (0.07) LEAPS-rand-8 0.10 (0.17) 0.10 (0.14) 0.28 (0.05) 0.40 (0.50) 0.00 (0.00) 0.07 (0.06) LEAPS-rand-64 0.18 (0.40) 0.20 (0.11) 0.33 (0.07) 0.58 (0.41) 0.03 (0.06) 0.12 (0.05) LEAPS 1.00 (0.00) 0.45 (0.40) 0.81 (0.07) 1.00 (0.00) 0.18 (0.14) 0.45 (0.28) All the baselines are trained with PPO [265] or SAC [104], including the VIPER teacher policy. More training details can be found in Section 3.7.12. 3.5.5 Results We present the results of the baselines and our method evaluated on the Karel task set based on the environment rewards in Table 3.3. The reward functions are sparse for all tasks, and are normalized such that the final cumulative return is within [− 1,1] for tasks with penalties and[0,1] for tasks without; reward functions for each task are detailed in Section 3.7.11. OverallTaskPerformance. Across all but one task, LEAPS yields the best performance. The LEAPS- rand baselines perform significantly worse than LEAPS on all Karel tasks, demonstrating the need for using a search algorithm like CEM during synthesis. The performance of VIPER is bounded by its RL teacher policy, and therefore is outperformed by the DRL baselines on most of the tasks. Meanwhile, DRL-abs-t is generally unable to improve upon DRL-abs across the board, suggesting that transferring Karel behaviors with RL from one task to another is ineffective. Furthermore, both the HRL baselines achieve poor performance, likely because agent actions alone provide insufficient supervision for a VAE to encode useful action trajectories on unseen tasks—unlike programs. Finally, the poor performance of the naïve 49 program synthesis baseline highlights the difficulty and inefficiency of learning to synthesize programs from scratch using only rewards. In the appendix, we present programs synthesized by LEAPS in Figure 3.14, example optimal programs for each task in Section 3.7.6 (Figure 3.11), rollout visualizations in Figure 3.19, and additional results analysis in Section 3.7.8. Repetitive Behaviors. Solving StairClimber and FourCorner requires acquiring repetitive (or looping) behaviors. StairClimber, which can be solved by repeating a short, 4-step stair-climbing behavior until the goal marker is reached, is not solved by DRL-abs. LEAPS fully solves the task given the same perceptions, as this behavior can be simply represented with a while loop that repeats the stair-climbing skill. However VIPER performs poorly as its decision tree cannot represent such loops. Similarly, the baselines are unable to perform as well onFourCorner, a task in which the agent must pickup a marker located in each corner of the grid. This behavior takes at least 14 timesteps to complete, but can be represented by two nested loops. Similar toStairClimber, the bias introduced by the DSL and our generated dataset (which includes nested loops), results in LEAPS being able to perform much better. Exploration. 
TopOff rewards the agent for adding markers to locations with existing markers. However, there are no restrictions for the agent to wander elsewhere around the environment, thus making exploration a problem for the RL baselines, and thereby also constraining VIPER. LEAPS performs best on this task, as the ground-truth program can be represented by a simple loop that just moves forward and places markers when a marker is detected. Maze also involves exploration, however its small size (8× 8) results in many methods, including LEAPS, solving the task. Complexity. SolvingHarvester andCleanHouse requires acquiring complex behaviors, resulting in poor performance from all methods. CleanHouse requires an agent to navigate through a house and pick up all markers along the walls on the way. This requires repeated execution of a skill, of varied length, which navigates around the house, turns into rooms, and picks up markers. As such, all baselines perform very poorly. However, LEAPS is able to perform substantially better because these behaviors 50 can be represented by a program of medium complexity with a while loop and some nested conditional statements. On the other hand,Harvester involves simply navigating to and picking up a marker on every spot on the grid. However, this is a difficult program to synthesize given our random dataset generation process; the program we manually derive to solveHarvester is long and more syntactically complex than most training programs. As a result, DRL and VIPER outperform LEAPS on this task. LearnedProgramEmbeddingSpace. More analysis on our learned program embedding space can be found in the appendix. We present CEM search trajectory visualizations in Section 3.7.2, demonstrating how the search population’s rewards change over time. To qualitatively investigate the smoothness of the learned program embedding space, we linearly interpolate between pairs of latent programs and display their corresponding decoded programs in Section 3.7.3. In Section 3.7.4, we illustrate how predicted programs evolve over the course of CEM search. 3.5.6 Generalization Table 3.4: Rewards on100× 100 grids. StairClimber Maze DRL 0.00 (0.00) 0.00 (0.00) DRL-abs 0.00 (0.00) 0.04 (0.05) VIPER 0.00 (0.00) 0.10 (0.12) LEAPS 1.00 (0.00) 1.00 (0.00) We are also interested in learning whether the baselines and the programs synthesized by LEAPS can generalize to novel scenarios without further learning. Specifically, we investigate how well they can generalize to larger state spaces. We expand bothStairClimber andMaze to100× 100 grid sizes (from12× 12 and8× 8, respec- tively). We directly evaluate the policies or programs obtained from the original tasks with smaller state spaces for all methods except DRL (its observation space changes), which we retrain from scratch. The results are shown in Table 3.4. All baselines perform significantly worse than before on both tasks. On the contrary, the programs synthesized by LEAPS for the smaller task instances achieve zero-shot generalization to larger task instances without 51 losing any performance. Larger grid size experiments for the other Karel tasks and additional unseen configuration experiments can be found in Section 3.7.7. 3.5.7 Interpretability Interpretability in machine learning [172, 271] is particularly crucial when it comes to learning a policy that interacts with the environment [350, 112, 83, 106, 255, 29, 297, 17]. 
The proposed framework produces programmatic policies that are interpretable from the following aspects as outlined in [271]. • Trust: interpretable machine learning methods and models may more easily be trusted since humans tend to be reluctant to trust systems that they do not understand. Programs synthesized by LEAPS can naturally be better trusted since one can simply read and interpret them. • Contestability: the program execution traces produce a chain of reasoning for each action, providing insights on the induced behaviors and thus allowing for contesting improper decisions. • Safety: synthesizing readable programs allows for diagnosing issues earlier (i.e. before execution) and provides opportunities to intervene, which is especially critical for safety-critical tasks. In the rest of this section, we investigate how the proposed framework enjoys interpretability from the three aforementioned aspects. Specifically, synthesized programs are not only readable to human users but also interactive, allowing non-expert users with a basic understanding of programming to diagnose and make edits to improve their performance. To demonstrate this, we asked non-expert humans to read, interpret, and edit suboptimal LEAPS policies to improve their performance. Participants edited LEAPS programs on 3 Karel tasks with suboptimal reward: TopOff,FourCorner, andHarvester. With just 3 edits, participants obtained a mean reward improvement of 97.1%, and with 5 edits, participants improved it by 125%. This justifies how our synthesized policies can be manually diagnosed and improved, a property which DRL methods lack. More details and discussion can be found in Section 3.7.5. 52 3.6 Discussion We propose a framework for solving tasks described by MDPs by producing programmatic policies that are more interpretable and generalizable than neural network policies learned by deep reinforcement learning methods. Our proposed framework adopts a flexible program representation and requires only minimal supervision compared to prior programmatic reinforcement learning and program synthesis works. Our proposed two-stage learning scheme not only alleviates the difficulty of learning to synthesize programs from scratch but also enables reusing its learned program embedding space for various tasks. The experiments demonstrate that our proposed framework outperforms DRL and programmatic baselines on a set of Karel tasks by producing expressive and generalizable programs that can consistently solve the tasks. Ablation studies justify the necessity of the proposed two-stage learning scheme as well as the effectiveness of the proposed learning objectives. While the proposed framework achieves promising results, we would like to acknowledge two assump- tions that are implicitly made in this work. First, we assume the existence of a program executor that can produce execution traces of programs. This program executor needs to be able to return perceptions from the environment state as well as apply actions to the environment. While this assumption is widely made in program synthesis works, a program executor can still be difficult to obtain when it comes to real-world robotic tasks. Fortunately, in research fields such as computer vision or robotics, a great amount of effort has been put into satisfying this assumption such as designing modules that can return high-level abstraction of raw sensory input (e.g. with object detection networks, proximity/tactile sensors, etc.). 
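To make this executor assumption concrete, below is a minimal sketch of such an interface for a Karel-like gridworld. The class and method names are illustrative assumptions, not the executor used in this work; it only shows the two required capabilities: perceptions computed from the environment state and actions applied to it.

```python
# A minimal sketch of the assumed program executor for a Karel-like gridworld.
# Class and method names are illustrative; this is not the executor used in this work.

class GridworldExecutor:
    def __init__(self, grid, agent_pos, agent_dir):
        self.grid = grid          # 2D list of cells: 'wall', 'marker', or 'empty'
        self.pos = agent_pos      # (row, col)
        self.dir = agent_dir      # unit vector (drow, dcol), e.g. (0, 1) faces east

    # --- perceptions: boolean predicates computed from the environment state ---
    def frontIsClear(self):
        r, c = self.pos[0] + self.dir[0], self.pos[1] + self.dir[1]
        inside = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
        return inside and self.grid[r][c] != 'wall'

    def markersPresent(self):
        return self.grid[self.pos[0]][self.pos[1]] == 'marker'

    # --- actions: primitives that modify the environment state ---
    def move(self):
        if self.frontIsClear():
            self.pos = (self.pos[0] + self.dir[0], self.pos[1] + self.dir[1])

    def turnLeft(self):
        self.dir = (-self.dir[1], self.dir[0])

    def putMarker(self):
        self.grid[self.pos[0]][self.pos[1]] = 'marker'

    def rollout(self, action_tokens, max_steps=100):
        """Execute a straight-line sequence of action tokens (control flow omitted for brevity)
        and return the trace of visited states."""
        trace = [(self.pos, self.dir)]
        for token in action_tokens[:max_steps]:
            getattr(self, token)()        # e.g. 'move', 'turnLeft', 'putMarker'
            trace.append((self.pos, self.dir))
        return trace

# Example usage: executing "move move turnLeft" from some initial state.
executor = GridworldExecutor([['empty'] * 3 for _ in range(3)], (2, 0), (-1, 0))
print(executor.rollout(['move', 'move', 'turnLeft']))
```

State traces of this kind are what the state-matching reward R_mat compares when reconstructing program behaviors.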
Secondly, we assume that it is possible to generate a distribution of programs whose behaviors are at least remotely related to the desired behaviors for solving the tasks of interest. It can be difficult to synthesize programs which represent behaviors that are more complex than the ones in the training program distribution, although one possible solution is to employ a better program generation process that generates programs inducing more complex behaviors. Also, the choice of DSL plays an important role in how complex the programs can be. Ideally, employing a more complex DSL would allow our proposed framework to synthesize more advanced agent behaviors.

In the future, we hope to extend the proposed framework to more challenging domains such as real-world robotics. We believe this framework would allow for deploying robust, interpretable policies for safety-critical tasks such as robotic surgeries. One way to make LEAPS applicable to robotics domains would be to simultaneously learn perception modules and action controllers. Other possible solutions include incorporating program execution methods [11, 216, 288, 337, 160, 351] that are designed for such settings, or designing DSLs that allow pre-training of perception modules and action controllers. Also, the proposed framework shares some characteristics with works in multi-task RL [216, 11, 295, 299, 281, 124, 232] and meta-learning [280, 315, 311, 77, 316, 45, 158, 212, 239, 49]. Specifically, it learns a program embedding space from a distribution of tasks/programs. Once the program embedding space is learned, it can be reused to solve different tasks without retraining. Yet, extending LEAPS to such domains can potentially lead to some negative societal impacts. For example, our framework can still capture unintended bias during learning or suffer from adversarial attacks. Furthermore, policies deployed in the real world can create great economic impact by causing job losses in some sectors. Therefore, we would encourage further work to investigate the biases, safety issues, and potential economic impacts to ensure that deployment in the field does not cause far-reaching, negative societal impacts.

3.7 Appendix

3.7.1 Program Embedding Space Visualizations

In this section, we present and analyze visualizations providing insights into the program embedding spaces learned by LEAPS and its variations. To investigate the learned program embedding space, we perform dimensionality reduction with PCA [130] to embed the following data into a 2D space for the visualizations shown in Figure 3.4:
• Latent programs from the training dataset encoded by the learned encoder q_ϕ, visualized as blue scatters. There are 35k training programs.
• Samples drawn from a normal distribution N(0,1), visualized as green scatters. This is to show what a distribution would look like if the embedding space were learned with a highly weighted KL-divergence penalty (i.e. a large β value in the VAE loss). We compare this against the latent program distribution learned by our method to justify the effectiveness of the proposed objectives: the program behavior reconstruction loss (L_R) and the latent behavior reconstruction loss (L_L).
• Ground-truth (GT) test programs from the testing dataset, encoded by the learned encoder q_ϕ, visualized as plus signs (+) with different colors. We selected 4 test programs.
• Reconstructed programs, which are predicted (Pred) by each method given the test programs, visualized as crosses (×) with different colors.
Since there are 4 test programs selected, 4 reconstructed programs are visualized. Each pair of test program and predicted program is visualized with the same color. These predicted (i.e. synthesized) programs are also shown in Figure 3.12.

Embedding Space Coverage. Even though the testing programs are not in the training program dataset, and therefore are unseen to the models, their embedding vectors still lie in the distribution learned by all the models. This indicates that the learned embedding spaces cover a wide distribution of programs.

Latent Program Distribution vs. Normal Distribution. We now compare two distributions: the latent program distribution formed by encoding all the training programs into the program embedding space, and a normal distribution N(0,1). One can view the normal distribution as the distribution obtained by heavily enforcing the weight of the KL-divergence term when training a VAE model. We discuss the shape of the latent program distribution in the learned program embedding space as follows:
• LEAPS-P: since LEAPS-P simply optimizes the β-VAE loss (the program reconstruction loss L_P), which puts a lot of emphasis on the KL-divergence term, the shape of the latent program distribution is very similar to a normal distribution, as shown in Figure 3.4 (a).
• LEAPS-P+R: while LEAPS-P+R additionally optimizes the program behavior reconstruction loss L_R, the shape of the latent program distribution is still similar to a normal distribution, as shown in Figure 3.4 (b). We hypothesize that this is because the program behavior reconstruction loss alone might not be strong or explicit enough to introduce a change.
• LEAPS-P+L: the shape of the latent program distribution in the program embedding space learned by LEAPS-P+L is significantly different from a normal distribution, as shown in Figure 3.4 (c). This suggests that employing the latent behavior reconstruction loss L_L contributes substantially to the learning. We believe this is because the latent behavior reconstruction loss is optimized with direct gradients and therefore provides a stronger learning signal, especially compared to the program behavior reconstruction loss L_R, which is optimized using REINFORCE [325].
• LEAPS (LEAPS-P+R+L): LEAPS optimizes the full objective that includes all three proposed objectives and forms a distribution shape similar to the one learned by LEAPS-P+L. However, the distance between each pair of ground-truth testing program and predicted program is much smaller in the program embedding space learned by LEAPS than in the space learned by LEAPS-P+L. This justifies the effectiveness of the proposed program behavior reconstruction loss L_R, which can bring programs with similar behaviors closer together in the embedding space.

Summary. The visualizations of the program embedding spaces learned by LEAPS and its ablations qualitatively justify the effectiveness of the proposed learning objectives, complementing the quantitative results presented in the main paper.

3.7.2 Cross Entropy Method Trajectory Visualization

As described in the main paper, once the program embedding space is learned by LEAPS, our goal becomes searching for a latent program that maximizes the reward described by a given task MDP. To this end, we adapt the Cross Entropy Method (CEM) [253], a gradient-free continuous search algorithm, to iteratively search over the program embedding space. Specifically, we iteratively perform the following steps:
1. Sample a distribution of candidate latent programs.
2.
Decode the sampled latent programs into programs using the learned program decoderp θ . 3. Execute the programs in the task environment and obtain the corresponding rewards. 4. Update the CEM sampling distribution based on the rewards. This process is repeated until either convergence or the maximum number of sampling steps has been reached. We perform dimensionality reduction with PCA [130] to embed the following data to a 2D space; the visualizations of CEM trajectories are shown in Figure 3.5 and Figure 3.6: • Latent programs from the training dataset encoded by a learned encoderq ϕ , visualized as blue scatters. There are 35k training programs. This is to visualize the shape of the program distribution in the learned program embedding space. This is also visualized in Figure 3.4. • Ground-truth (GT) programs that exhibit optimal behaviors for solving the Karel tasks, visualized as red stars (⋆). Ideally, the CEM population should iteratively move toward where the GT programs are located. • CEM population is a batch of sampled candidate latent programs at each iteration, visualized as red scatters. Each candidate latent program can be decoded as a program that can be executed in the 57 task environment to obtain a reward. By averaging the reward obtained by every candidate latent program, we can calculate the average reward of this population and show it in the figures as Avg. Reward. • CEM Next Center, visualized as cross signs (× ), indicates the center vector around which the next batch of candidate latent programs will be sampled. This vector is calculated based on a set of candidate latent programs that achieve best reward (i.e. elite samples) at each iteration. In this case, it is a weighted average based on the reward each candidate gets from its execution. From Figure 3.5, we observe that both the average reward of the entire population and the reward of the next candidate program (CEM Next Center) consistently increase as the number of iterations increases, justifying the effectiveness of CEM. Moreover, we observe that the CEM population gradually moves toward where the ground-truth program is located, which aligns well with the fact that our proposed framework can reliably synthesize task-solving programs. Yet, the populations might not always exactly converge to where the ground-truth latent program is. We hypothesize this could be attributed to the following reasons: 1. CEM convergence: while the CEM search converges, it can still be suboptimal. Since the search terminates when the next candidate latent program obtains the maximum reward (1.1 as shown in the figure) for 10 iterations, it might not exactly converge to where a ground-truth program is. 2. Dimensionality reduction: we visualized the trajectories and programs by performing dimensionality reduction from 256 to 2 dimensions with PCA, which could cause visual distortions. 3. Suboptimal learned program embedding space: while we aim to learn a program embedding space where all the programs inducing the same behaviors are mapped to the same spot in the embedding space, it is still possible that programs that induce the desired behavior can distribute to more than one 58 Table 3.5: Decoded linear interpolations of programs close to each other in the latent space. 
Latent Program Decoded Program START DEF run m( turnRight move WHILE c( frontIsClear c) w( move w) WHILE c( not c( frontIsClear c) c) w( move w) IF c( frontIsClear c) i( move i) m) 1 DEF run m( turnRight move WHILE c( frontIsClear c) w( move w) WHILE c( not c( frontIsClear c) c) w( move w) IF c( frontIsClear c) i( move i) m) 2 DEF run m( turnRight move WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( move i) m) 3 DEF run m( turnRight move WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( move i) m) 4 DEF run m( turnRight move WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( move i) m) 5 DEF run m( turnRight move WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( move i) m) 6 DEF run m( turnRight move WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( move i) m) 7 DEF run m( turnRight move turnLeft WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( putMarker i) m) 8 DEF run m( turnRight move turnLeft WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( putMarker i) m) END DEF run m( turnRight move turnLeft WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( putMarker i) m) location in a learned program embedding space. Therefore, CEM search can converge to somewhere that is different from the ground-truth latent program. On the other hand, the CEM trajectory shown in Figure 3.6 does not converge and terminates when reaching the maximum number of iterations. The ground-truth program lies far away from the initial sampled distribution, which might contribute to the difficulty of converging. This aligns with the relatively unsatisfactory performance achieved by LEAPS. Employing a more sophisticated searching algorithm or conducting a more thorough hyperparameter search could potentially improve the performance but it is not the main focus of this work. 3.7.3 ProgramEmbeddingSpaceInterpolations To learn a program embedding space that allows for smooth interpolation, we propose three sources of supervision. We aim to verify the effectiveness of it by investigating interpolations in the learned program 59 Table 3.6: Decoded linear interpolations of programs far from each other in the latent space. Latent Program Decoded Program START DEF run m( turnRight turnLeft turnLeft move turnRight putMarker move m) 1 DEF run m( turnRight turnLeft turnLeft move turnRight putMarker move m) 2 DEF run m( turnRight turnLeft turnLeft move WHILE c( frontIsClear c) w( putMarker w) turnRight move m) 3 DEF run m( turnRight turnLeft move turnLeft WHILE c( frontIsClear c) w( putMarker w) move m) 4 DEF run m( turnRight turnLeft move WHILE c( frontIsClear c) w( turnLeft w) IF c( not c( frontIsClear c) c) i( move i) m) 5 DEF run m( turnRight move turnLeft WHILE c( frontIsClear c) w( move w) IF c( not c( frontIsClear c) c) i( putMarker i) m) 6 DEF run m( move turnRight turnLeft move WHILE c( frontIsClear c) w( IF c( not c( rightIsClear c) c) i( putMarker i) w) m) 7 DEF run m( move turnRight turnLeft move WHILE c( frontIsClear c) w( IF c( not c( rightIsClear c) c) i( turnLeft i) w) m) 8 DEF run m( move turnRight move WHILE c( frontIsClear c) w( IF c( not c( rightIsClear c) c) i( turnLeft i) w) m) END DEF run m( move turnRight move WHILE c( frontIsClear c) w( IF c( not c( rightIsClear c) c) i( turnLeft i) w) m) embedding space. To this end, we follow the procedure described below to produce results shown in Table 3.5 and Table 3.6. 1. 
Sampling a pair of programs from the dataset (START program andEND program). 2. Encoding the two programs into the learned program embedding space. 3. Linearly interpolating between the two latent programs to obtain a number of (eight) interpolated latent programs. 4. Decoding the latent programs to obtain interpolated programs (program1 to program8). We show two pairs of programs and their interpolations in between below as examples. Specifically, the first pair of programs, shown in Table 3.5, are closer to each other in the latent space and the second pair of programs, shown in Table 3.6, are further from each other. We observe that the interpolations between the closer program pair exhibit smoother transitions and the interpolations between the further program pair display more dramatic change. 60 3.7.4 ProgramEvolution In this section, we aim to investigate how predicted programs evolve over the course of searching. We visualize converged CEM search trajectories and the reward each program gets on the StairClimber task in Appendix Figure 3.5. In Table 3.7, we present the predicted programs corresponding to the CEM search trajectory on the StairClimber task in Figure 3.5. We observe that the sampled programs consistently improve as the number of iterations increases, justifying the effectiveness of the learned program embedding and the CEM search. 3.7.5 Interpretability: HumanDebuggingofLEAPSPrograms Interpretability in Machine Learning is crucial for several reasons [172, 271]. First, trust – interpretable machine learning methods and models may more easily be trusted since humans tend to be reluctant to trust systems that they do not understand. Second, interpretability can improve the safety of machine learning systems. A machine learning system that is interpretable allows for diagnosing issues (e.g. the distribution shift from training data to testing data) earlier and provides more opportunities to intervene. This is especially important for safety-critical tasks such as medical diagnosis [22, 270, 92, 40, 279] and real-world robotics [39, 98, 15, 105, 334, 348, 159] tasks. Finally, interpretability can lead to contestability, by producing a chain of reasoning, providing insights on how a decision is made and therefore allowing humans to contest unfair or improper decisions. We believe interpretability is especially crucial when it comes to learning a policy that interacts with the environment. In this work, we propose a framework that offers an effective way to acquire an interpretable programmatic policy structured in a program. In the following, we discuss how the proposed framework enjoys interpretability from the three aforementioned aspects. Programs synthesized by the proposed framework can naturally be better trusted since one can simply read and understand them. Also, through the program execution trace produced by executing a program, each decision made by the policy (i.e. the 61 Table 3.7: How predicted programs evolve throughout the course of CEM search forStairClimber. See Figure 3.5 for the corresponding visualization of this CEM search. 
Search Iteration Best Predicted Program Iteration: 1 DEF run m( IF c( frontIsClear c) i( pickMarker i) WHILE c( leftIsClear c) w( move w) IFELSE c( frontIsClear c) i( turnRight move i) ELSE e( move e) m) Iteration: 2 DEF run m( WHILE c( markersPresent c) w( move w) IFELSE c( frontIsClear c) i( turnLeft i) ELSE e( move e) WHILE c( leftIsClear c) w( move w) m) Iteration: 3 DEF run m( WHILE c( not c( frontIsClear c) c) w( move turnRight w) WHILE c( leftIsClear c) w ( turnLeft move w) m) Iteration: 4 DEF run m( WHILE c( not c( frontIsClear c) c) w( pickMarker move w) WHILE c( leftIsClear c) w( turnLeft move w) m) Iteration: 5 DEF run m( WHILE c( not c( frontIsClear c) c) w( pickMarker turnRight w) WHILE c( leftIsClear c) w( move turnLeft w) m) Iteration: 6 DEF run m( WHILE c( not c( frontIsClear c) c) w( pickMarker turnRight w) WHILE c( leftIsClear c) w( move turnLeft w) m) Iteration: 7 DEF run m( WHILE c( not c( leftIsClear c) c) w( turnRight w) IFELSE c( frontIsClear c) i( move i) ELSE e( turnLeft e) WHILE c( rightIsClear c) w( move w) m) Iteration: 8 DEF run m( WHILE c( not c( leftIsClear c) c) w( turnRight move w) WHILE c( markersPresent c) w( turnLeft move w) m) Iteration: 9 DEF run m( WHILE c( not c( noMarkersPresent c) c) w( turnRight move w) WHILE c( not c( frontIsClear c) c) w( turnLeft move w) m) Iteration: 10 DEF run m( WHILE c( not c( noMarkersPresent c) c) w( turnRight move w) WHILE c( leftIsClear c) w( turnLeft move w) m) Iteration: 11 DEF run m( WHILE c( not c( leftIsClear c) c) w( turnRight move w) WHILE c( noMarkersPresent c) w( turnLeft move w) m) Iteration: 12 DEF run m( WHILE c( not c( leftIsClear c) c) w( turnRight move w) WHILE c( noMarkersPresent c) w( turnLeft move w) m) Iteration: 13 DEF run m( WHILE c( not c( leftIsClear c) c) w( turnRight move w) WHILE c( noMarkersPresent c) w( turnLeft move w) m) Iteration: 14 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 15 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 16 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 17 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 18 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 19 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 20 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 21 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Iteration: 22 DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( move turnLeft w) m) Converged DEF run m( WHILE c( not c( markersPresent c) c) w( turnRight move w) WHILE c( rightIsClear c ) w( turnLeft move w) m) 62 Table 3.8: Mean return (standard deviation) [% increase in performance] after debugging by non-expert humans of LEAPS synthesized programs for 3 statement edits and 5 statement edits. Chosen LEAPS programs are median-reward programs out of 5 LEAPS seeds for each task. 
Karel Task Original Program 3 Edits 5 Edits TopOff 0.86 0.95 (0.07) [10.5%] 1.0 (0.00) [16.3%] FourCorner 0.25 0.75 (0.35) [200%] 0.92 (0.12) (268%) Harvester 0.47 0.85 (0.05) [80.9%] 0.89 (0.00) [89.4%] Average % Increase - 97.1% 125% program) is traceable and therefore satisfies the contestability property. Finally, the programs produced by our framework satisfy the safety property of interpretability as humans can diagnose and correct for issues by reading and editing the programs. Our synthesized programs are not only readable to human users but also interactable, allowing non- expert users with a basic understanding of programming to diagnose and make edits to improve their performance. To test this hypothesis, we asked people with programming experience who are unfamiliar with our DSL or Karel tasks to edit suboptimal LEAPS programs to improve performance as much as possible on 3 Karel tasks: TopOff,FourCorner, andHarvester through a user interface displayed in Figure 3.7. Each person was given 1.5 hours (30 minutes per program), including time required to understand what the LEAPS programs were doing, understand the DSL tokens, and fully debug/test their edited programs. For each program, participants were required to modify up to 5 statements, then attempt the task again with up to only 3 modifications as calculated by the Levenshtein distance metric [162]. A single statement modification is defined as any modification/removal/addition of a IF, WHILE, IFELSE, REPEAT, or ELSE statement, or a removal/addition/change of an action statement (e.g. move, turnLeft, etc.). Participants were allowed to ask clarification questions, but we would not answer questions regarding how to specifically improve the performance of their program. We display example edited programs in Figure 3.8, and the aggregated results of editing in Table 3.8. We see a significant increase in performance in all three tasks, with an average 97.1% increase in performance with 3 edits and an average 125% increase in performance with 5. These numbers are averaged over 3 people, 63 with standard deviations reported in the table. Thus we see that even slight modifications to suboptimal LEAPS programs can enable much better Karel task performance when edited by non-expert humans. Our experiments in this section make an interesting connection to works in program/code repair (i.e. automatic bug fixing) [341, 211, 129, 331, 93, 266, 146, 72, 166, 43, 324, 103, 320, 192], where the aim is to develop algorithms and models that can find bugs or even repair programs without the intervention of a human programmer. While the goal of these works is to fix programs produced by humans, our goal in this section is to allow humans to improve programs synthesized by the proposed framework. Another important benefit of programmatic policies is verifiability - the ability to verify different properties of policies such as correctness, stability, smoothness, robustness, safety, etc. Since programmatic policies are highly structured, they are more amenable to formal verification methods developed for traditional software systems as compared to neural policies. Recent works [25, 308, 307, 352] show that various properties of programmatic policies (programs written using DSLs, decision trees) can be verified using existing verification algorithms, which can also be applied to programs synthesized by the proposed framework. 
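As a reference for how the edit budget in this study can be measured, the sketch below counts statement-level edits with a Levenshtein distance over a simplified statement tokenization. The tokenization and function names are our own illustrative assumptions, not the exact criteria or tooling used with the participants.

```python
# A minimal sketch (not the exact tool used in the study) of counting statement-level edits
# between an original and a human-edited program via Levenshtein distance.

def levenshtein(a, b):
    """Standard dynamic-programming edit distance over two token sequences."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution / match
            prev = cur
    return dp[-1]

# Treat each action or control-flow keyword as one "statement" token; delimiters are ignored.
STATEMENTS = {'move', 'turnLeft', 'turnRight', 'putMarker', 'pickMarker',
              'IF', 'IFELSE', 'ELSE', 'WHILE', 'REPEAT'}

def statement_edits(original_program, edited_program):
    to_stmts = lambda p: [t for t in p.split() if t in STATEMENTS]
    return levenshtein(to_stmts(original_program), to_stmts(edited_program))

# Example: wrapping the body in a WHILE loop counts as a single statement edit.
before = "DEF run m( move putMarker m)"
after = "DEF run m( WHILE c( frontIsClear c) w( move putMarker w) m)"
print(statement_edits(before, after))  # -> 1
```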
3.7.6 OptimalandSynthesizedPrograms In this section, we present the programs from the testing set which are selected for conducting ablation studies in the main paper in Figure 3.11. Also, we manually write programs that induce optimal behaviors to solve the Karel tasks and present them in Figure 3.11. Note that while we only show one optimal program for each task, there exist multiple programs that exhibit the desired behaviors for each task. Then, we analyze the program reconstructed by LEAPS, its ablations, and the naïve program synthesis baseline in Section 3.7.6.1, and discuss the programs synthesized by LEAPS for Karel tasks in Section 3.7.6.2. 64 3.7.6.1 ProgramBehaviorReconstruction This section serves as a complement to the ablation studies in the main paper, where we aim to justify the effectiveness of the proposed framework and the learning objectives. To this end, we select programs that are unseen to LEAPS and its ablations during the learning program embedding space from the testing set and reconstruct those programs using LEAPS, its ablations and the naïve program synthesis baseline. Those selected programs are shown in Figure 3.11 and the reconstructed programs are shown in Figure 3.12. The naïve program synthesis baseline fails on the complexWHILE+2IF+IFELSE program, as it rarely synthesizes conditional and loop statements, instead generating long sequences of action tokens that attempt to replicate the desired behavior of those statements. We believe that this is because it is incentivized to initially predict action tokens to gain more immediate reward, making it less likely to synthesize other tokens. LEAPS and its variations perform better and synthesize more complex programs, demonstrating the importance of the proposed two-stage learning scheme in biasing program search. Also, LEAPS synthesizes programs that are more concise and induce behaviors which are more similar to given testing programs, justifying the effectiveness of the proposed learning objectives. 3.7.6.2 KarelEnvironmentTasks This section is complementary to the main experiments in the main paper, where we compare LEAPS against the baselines on a set of Karel tasks, which is described in detail in Section 3.7.11. The programs synthesized by LEAPS are presented in Figure 3.14. The synthesized programs solve bothStairClimber andMaze. ForTopOff, since the average expected number of markers presented in the last row is3, LEAPS synthesizes a sub-optimal program that conducts the topoff behavior three times. For CleanHouse, while all the baselines fail on this task, the synthesized program achieves some performance by simply moving around and try to pick up markers. ForHarvester, 65 Table 3.9: Extended reward comparison on original tasks with8× 8 or12× 12 grids and zero-shot generalization to 100× 100 grids. LEAPS achieves the best generalization performance on all the tasks except forHarvester. 
StairClimber Maze FourCorner TopOff Harvester DRL Original 1.00 (0.00) 1.00 (0.00) 0.29 (0.05) 0.32 (0.07) 0.90 (0.10) 100x100 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.01 (0.01) 0.00 (0.00) DRL-abs Original 0.13 (0.29) 1.00 (0.00) 0.36 (0.44) 0.63 (0.23) 0.32 (0.18) 100x100 0.00 (0.00) 0.04 (0.05) 0.37 (0.44) 0.15 (0.12) 0.02 (0.01) DRL-FCN Original 1.00 (0.00) 0.97 (0.03) 0.20 (0.34) 0.28 (0.12) 0.46 (0.16) 100x100 -0.20 (0.10) 0.01 (0.01) 0.00 (0.00) 0.01 (0.01) 0.02 (0.00) VIPER Original 0.02 (0.02) 0.69 (0.05) 0.40 (0.42) 0.30 (0.06) 0.51 (0.07) 100x100 0.00 (0.00) 0.10 (0.12) 0.40 (0.42) 0.03 (0.00) 0.04 (0.00) LEAPS Original 1.00 (0.00) 1.00 (0.00) 0.45 (0.40) 0.81 (0.07) 0.45 (0.28) 100x100 1.00 (0.00) 1.00 (0.00) 0.45 (0.37) 0.21 (0.03) 0.00 (0.00) LEAPS fails to acquire the desired behavior that required nested loops but produces a sub-optimal program that contains only action tokens. 3.7.7 AdditionalGeneralizationExperiments Here, we present additional generalization experiments to complement those presented in Section 3.5.6. In Section 3.7.7.1, we extend the 100x100 state size zero-shot generalization experiments to 3 additional tasks. In Section 3.7.7.2, we analyze how well baseline methods and LEAPS can generalize to unseen configurations of a given task. 3.7.7.1 GeneralizationonFourCorner,TopOff,andHarvester Evaluating zero-shot generalization performance assumes methods to work reasonably well on the original tasks. For this reason (and due to space limitations) we present onlyStairClimber andMaze for general- ization experiments in the main text in Section 3.5.6 because most methods achieve reasonable performance on these two tasks, with DRL and LEAPS both solving these tasks fully and DRL-abs solving Maze fully. However, here we also present full results for all tasks except CleanHouse (as no method except LEAPS has a reasonable level of performance on it). The results are summarized in Table 3.9. We see that LEAPS generalizes well onFourCorner and maintains the best performance onTopOff. It is outperformed 66 onHarvester, although none of the methods do well onHarvester as the highest obtained reward by any method is 0.04 (by VIPER). In summary, LEAPS performs the best on 4 out of these 5 tasks, further demonstrating its superior zero-shot generalization performance. Furthermore, we note that it is possible that a DRL policy employing a fully convolutional network (FCN) as proposed in Long, Shelhamer, and Darrell [177] can handle varying observation sizes. FCNs were also demonstrated in Silver et al. [278] to demonstrate better generalization performance than traditional convolutional neural network policies. However, we hypothesize that the generalization performance here will still be poor as there is a large increase in the number of features that the FCN architecture needs to aggregate when transferring from 8x8/12x12 state inputs to 100x100 inputs—a 10x input size increase that FCN is not specifically designed to deal with. We have included both FCN’s zero-shot generalization results and its results on the original grid sizes in Table 3.9. DRL-FCN, where we have replaced the policy and value function networks of PPO with an FCN, does manage to perform zero-shot transfer marginally better than DRL performs when training from scratch (as it DRL’s architecture cannot handle varied input sizes) onMaze andHarvester. However, it obtains a negative reward onStairClimber as it attempts to navigate away from the stairs when transferring to the100× 100 grid size. 
Its performance is still far worse than LEAPS and VIPER on most tasks, demonstrating that the programmatic structure of the policy is important for these tasks. 3.7.7.2 GeneralizationtoUnseenConfigurations We present a generalization experiment in the main paper to study how well the baselines and the programs synthesized by the proposed framework can generalize to larger state spaces that are unseen during training without further learning on theStairClimber andMaze tasks. In this section, we investigate the ability of generalizing to different configurations, which are defined based on the marker placement related to solve a task, on both theTopOff task andHarvester task. 67 Table 3.10: Mean return (standard deviation) [% change in performance] on generalizing to unseen configu- rations onTopOff andHarvester task. TopOff Training configuration % 75% 50% 25% 10% 5% DRL 0.17 (0.05) [-46.8%] 0.12 (0.09) [-62.5%] 0.12 (0.06) [-62.5%] 0.17 (0.13) [-46.8%] 0.13 (0.04) [-59.4%] DRL-abs 0.23 (0.29) [-63.5%] 0.29 (0.36) [-54.0%] 0.45 (0.45) [-28.6%] 0.24 (0.38) [-61.9%] 0.26 (0.37) [-18.8%] VIPER 0.27 (0.03) [-10.0%] 0.28 (0.04) [-6.67%] 0.27 (0.06) [-10.0%] 0.27 (0.02) [-10.0%] 0.28 (0.03) [-6.67%] LEAPS 0.68 (0.18) [-15.0%] 0.65 (0.13) [-18.8%] 0.61 (0.24) [-23.8%] 0.68 (0.21) [-15.0%] 0.67 (0.18) [-16.3%] Harvester Training configuration % 75% 50% 25% 10% 5% DRL 0.64 (0.24) [-28.9%] 0.71 (0.29) [-21.1%] 0.21 (0.06) [-76.7%] 0.14 (0.09) [-84.4%] 0.04 (0.01) [-95.6%] DRL-abs 0.14 (0.21) [-56.3%] 0.24 (0.25) [-25.0%] 0.05 (0.06) [-84.4%] 0.13 (0.21) [-59.4%] 0.31 (0.31) [-3.13%] VIPER 0.54 (0.01) [+5.88%] 0.54 (0.02) [+5.88%] 0.55 (0.01) [+7.84%] 0.54 (0.01) [+5.88%] 0.44 (0.22) [-13.7%] LEAPS 0.40 (0.30) [-13.0%] 0.42 (0.27) [-8.69%] 0.50 (0.35) [+08.69%] 0.12 (0.19) [-73.9%] 0.01 (0.03) [-97.6%] Since solvingTopOff requires an agent to put markers on top of all markers on the last row, the initial configurations are determined by the marker presence on the last row. The grid has a size of 10× 10 inside the surrounding wall. We do not spawn a marker at the bottom right corner in the last row, leaving 9 possible locations with marker, allowing2 9 possible initial configurations. On the other hand, Harvester requires an agent to pick up all the markers placed in the grid. The grid has a size of 6× 6 inside the surrounding wall, leaving 36 possible locations in grid with a marker, resulting in 2 36 possible initial configurations. We aim to test if methods can learn from only a small portion of configurations during training and still generalize to all the possible configurations without further learning. To this end, we experiment using 75%,50%,25%,10%,5% of the configurations for training DRL, DRL-abs, and VIPER and for the program search stage of LEAPS. Then, we test zero-shot generalization of the learned models and programs on all the possible configurations. We report the performance in Table 3.10. We compare the performance each method achieves to its own performance learning from all the configurations (reported in the main paper) to investigate how limiting training configurations affects the performance. Note that the results of training and testing on100% configurations are reported in the main paper, where no generalization is required. 68 TopOff. LEAPS outperforms all the baselines on the mean return on all the experiments. 
VIPER and LEAPS show the lowest and the second-lowest performance decreases when learning from only a portion of the configurations, which demonstrates the strength of programmatic policies. DRL-abs slightly outperforms DRL, with better absolute performance and a lower performance decrease. We believe that this is because DRL takes entire Karel grids as input, and therefore held-out configurations are completely unseen to it. In contrast, DRL-abs takes abstract states (i.e. local perceptions) as input, which can alleviate this issue.

Harvester. VIPER outperforms almost all other methods on absolute performance and performance decrease, while LEAPS achieves the second-best results, which again justifies the generalization ability of programmatic policies. Both DRL and DRL-abs are unable to generalize well when learning from a limited set of configurations, except in the case of DRL-abs learning from 5% of configurations, which can be attributed to the high variance of DRL-abs results.

3.7.8 Additional Analysis on Experimental Results

Due to the limited space in the main paper, we include additional analysis of the experimental results in this section.

3.7.8.1 DRL vs. DRL-abs

We hypothesize that DRL-abs does not always outperform DRL due to imperfect perception (i.e. state abstraction) design. DRL-abs takes abstract states as input (i.e. frontIsClear(), leftIsClear(), rightIsClear(), markerPresent() in our design), which only describe local perceptions while omitting information about the entire map. Therefore, for tasks such as StairClimber, Harvester, and CleanHouse, which would be easier to solve with access to the entire Karel grid, DRL might outperform DRL-abs. In this work, DRL-abs' abstract states are the perceptions from the DSL we synthesize programs with, which makes the comparison against our method fair and allows us to analyze the effects of abstract states in the DRL domain. However, a more sophisticated design for perception/state abstraction could potentially improve the performance of DRL-abs.

3.7.8.2 VIPER Generalization

VIPER operates on the abstract state space, which is invariant to grid size. However, for the reasons below, it is still unable to transfer the behavior to the larger grid despite its abstract state representation. We hypothesize that VIPER's performance suffers on zero-shot generalization for two main reasons.
1. It is constrained to imitate the DRL teacher policy during training, which is trained on the smaller grid sizes. Thus its learned policy also experiences difficulty in zero-shot generalization to larger grid sizes.
2. Its decision tree policies cannot represent certain looping behaviors as they simply perform a one-to-one mapping from abstract state to action, thus making it difficult to learn optimal behaviors that require a one-to-many mapping between an abstract state and a set of desired actions. Empirically, we observed that training losses for VIPER decision trees were much higher for tasks such as StairClimber, which require such behaviors.

3.7.9 Detailed Descriptions and Illustrations of Ablations and Baselines

This section provides details on the variations of LEAPS used for ablation studies and the baselines which we compare against. The descriptions of the ablations of LEAPS are presented in Section 3.7.9.1 and the illustrations are shown in Figure 3.15. The naïve program synthesis baseline is illustrated in Figure 3.16 (c) for better visualization. Then, the descriptions of the baselines are presented in Section 3.7.9.2 and the illustrations are shown in Figure 3.16.
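Since several of the baselines below (DRL-abs, HRL-abs) consume the same abstract perceptions discussed in Section 3.7.8.1, we first sketch what that observation looks like. The helper and function names are illustrative assumptions; any environment exposing the DSL's boolean predicates would work.

```python
# A sketch (not our exact implementation) of the abstract observation used by DRL-abs/HRL-abs:
# the DSL's boolean perceptions, concatenated into a binary feature vector.
import numpy as np

def abstract_state(env):
    """`env` is assumed to expose the DSL perceptions as boolean-returning methods."""
    perceptions = [
        env.frontIsClear(),
        env.leftIsClear(),
        env.rightIsClear(),
        env.markersPresent(),
        not env.markersPresent(),   # noMarkersPresent()
    ]
    return np.asarray(perceptions, dtype=np.float32)   # e.g. [1., 0., 1., 0., 1.]
```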
70 3.7.9.1 Ablations We first ablate various components of our proposed framework in order to (1) justify the necessity of the proposed two-stage learning scheme and (2) identify the effects of the proposed objectives. We consider the following baselines and ablations of our method. • Naïve: the naïve program synthesis baseline is a policy that learns to directly synthesize a program from scratch by recurrently predicting a sequence of program tokens. The architecture of this baseline is a recurrent neural network which takes an initial starting token as the input at the first time step, and then sequentially outputs a program token at each time step to compose a program until an end token is produced. Note that the observation of this baseline is its own previously outputted program token instead of the state of the task environment (e.g. Karel grids). Also, at each time step, this baseline produces a distribution over all the possible program tokens in the given DSL instead of a distribution over agent’s action in the task environment (e.g.move()). This baseline investigates if an end-to-end learning method can solve the problem. This baseline is illustrated in Figure 3.16 (c). • LEAPS-P: the simplest ablation of LEAPS, in which the program embedding space is learned by only optimizing the program reconstruction lossL P . This baseline is illustrated in Figure 3.15 (a). • LEAPS-P+R: an ablation of LEAPS which optimizes both the program reconstruction lossL P and the program behavior reconstruction lossL R . This baseline is illustrated in Figure 3.15 (b). • LEAPS-P+L: an ablation of LEAPS which optimizes both the program reconstruction lossL P and the latent behavior reconstruction lossL L . This baseline is illustrated in Figure 3.15 (c). • LEAPS (LEAPS-P+R+L): LEAPS with all the losses, optimizing our full objective. • LEAPS-rand-{8/64}: like LEAPS, this ablation also optimizes the full objective for learning the program embedding space. But when searching latent programs, instead of CEM, it simply randomly samples 71 8/64 candidate latent programs and chooses the best performing one. These baselines justify the effectiveness of using CEM for searching latent programs. 3.7.9.2 Baselines We evaluate LEAPS against the following baselines (illustrated in Figure 3.16). • DRL: a neural network policy trained on each task and taking raw states (Karel grids) as input. A Karel grid is represented as a binary tensor with dimensionW× H× 16 (there are 16 possible states for each grid square) instead of an image. This baseline is illustrated in Figure 3.16 (a). • DRL-abs: a recurrent neural network policy directly trained on each Karel task but instead of taking raw states (Karel grids) as input it takes abstract states as input (i.e. it sees the same percep- tions as LEAPS). Specifically, all returned values of perceptions such as frontIsClear()==true, leftIsClear()==false, rightIsClear()==true, markersPresent()==false, and noMarkersPresent()==true are concatenated as a binary vector, which is then fed to the DRL-abs policy as its input. This baseline allows for a fair comparison to LEAPS since the program execution process also utilizes abstract state information. This baseline is illustrated in Figure 3.16 (b). • DRL-abs-t: a DRL transfer learning baseline in which for each task, we train DRL-abs policies on all other tasks, then fine-tune them on the current task. Thus it acquires a prior by learning to first solve other Karel tasks. 
Rewards are reported for the policies from the task that transferred with highest return. We only transfer DRL-abs policies as some tasks have different state spaces so that transferring a DRL policy trained on a task to another task with a different state space is not possible. This baseline is designed to investigate if acquiring task related priors allows DRL policies to perform better on our Karel tasks. Unlike LEAPS, which acquires priors from a dataset consisting of randomly 72 generated programs and the behaviors those program induce in the environment, DRL-abs-t allows for acquiring priors from goal-oriented behaviors (i.e. other Karel tasks). • HRL: a hierarchical RL baseline in which a VAE is first trained on action sequences from program execution traces used by LEAPS. Once trained, the decoder is utilized as a low-level policy for learning a high-level policy to sample actions from. Similar to LEAPS, this baseline utilizes the dataset to produce a prior of the domain. It takes raw states (Karel grids) as input. This baseline is also designed to investigate if acquiring priors allow DRL policies to perform better. Similar to LEAPS, which acquires priors from a dataset consisting of randomly generated programs and the behaviors those program induce in the environment, HRL is trained to acquire priors by learning to reconstruct the behaviors induced by the programs. One can also view this baseline as a version of the framework proposed in [108] with some simplifications, which also learns an embedding space using a VAE and then trains a high-level policy to utilize this embedding space together with the low-level policy whose parameters are frozen. This baseline is illustrated in Figure 3.16 (d). • HRL-abs: the same method as HRL but taking abstract states (i.e. local perceptions) as input. This baseline is illustrated in Figure 3.16 (d). • VIPER [25]: A decision-tree programmatic policy which imitates the behavior of a deep RL teacher policy via a modified DAgger algorithm [251]. This decision tree policy cannot synthesize loops, allowing us to highlight the performance advantages of more expressive program representation that LEAPS is able to take advantage of. All the baselines are trained with PPO [265] or SAC [104], including the VIPER teacher policy. More training details can be found in Section 3.7.12. 73 3.7.10 ProgramDatasetGenerationDetails To learn a program embedding space for the proposed framework and its ablations, we randomly generate 50k programs to form a dataset with 35k training programs and 7.5k programs for validation and testing. Simply generating programs by uniformly sampling all the tokens from the DSL would yield programs that mainly only contain action tokens since the chance to synthesize conditional statements with correct grammar is low. Therefore, to produce programs that are longer and deeply nested with conditional statements to induce more complex behaviors, we propose to sample programs using a probabilistic sampler. To generate each program, we sample program tokens according to the probabilities listed in Table 3.11 at every step until we sample an ending token or when a maximum program length is reached. When generating programs, we ensure that no program is identical to any other. Each token is generated sequentially, and length is effectively governed by the STMT_STMT token detailed in Table 3.11’s caption. 
There is a maximum depth limit of 4 nested conditional/loop statements, and a maximum statement depth limit of 6 (can’t have more than 6 nestedSTMT_STMT tokens). Note that this sampling procedure does not guarantee that the programs generated will terminate, hence when executing them to obtain ground-truth interactions for training the Program Behavior and Latent Behavior Reconstruction losses we limit the max program execution length to 100 environment timesteps. This sampling procedure results in the distribution of program lengths seen in Figure 3.17. Intuitively, shorter lengths can bias synthesized programs to compress the same behaviors into fewer tokens through the use of loops, making program search easier. Therefore, in our experiments, we have limited the maximum output program length of LEAPS to 45 tokens (as the maximum in the dataset is 44). As shown in the example programs generated by LEAPS in Figure 3.14, LEAPS successfully generates loops for our Karel tasks, which can be probably attributed to this bias of program length. We further verify this intuition by rerunning LEAPS with the max program length set to 100 tokens on the Karel tasks. We 74 display generated programs in Table 3.12, where we see that some of the generated programs are indeed much longer and lack loop statements and structures. Table 3.11: The probability of sampling program tokens when generating the program dataset. Tokens are generated sequentially, andSTMT_STMT refers to breaking up the current token into two tokens, each of which is selected according to the same probability distribution again. Thus it effectively controls how long programs will be. WHILE REPEAT STMT_STMT ACTION IF IFELSE Standard Dataset 0.15 0.03 0.5 0.2 0.08 0.04 3.7.11 KarelTaskDetails MDPTasks We utilize environment state based reward functions for the RL tasksStairClimber,Four- Corner, TopOff, Maze, Harvester, and CleanHouse. For each task, we average performance of the policies on 10 random environment start configurations. For all tasks with marker placing objectives, the final reward will be 0—regardless of the any other agent actions—if a marker is placed in the wrong location. This is done in order to discourage “spamming” marker placement on every grid location to exploit the reward functions. All rewards described below are then normalized so that the return is between [0, 1.0] for tasks without penalties, and [-1.0, 1.0] for tasks with negative penalties, for easier learning for the DRL methods. We visualize all tasks as well as their start and ideal end states in Figure 3.18 on a10× 10 grid for consistency in the visualizations (exceptCleanHouse). 3.7.11.1 StairClimber The goal is to climb the stairs to reach where the marker is located. The reward is defined as a sparse reward: 1 if the agent reaches the goal in the environment rollout, -1 if the agent moves to a position off of the stairs during the rollout, and 0 otherwise. This is on a12× 12 grid, and the marker location and agent’s initial location are randomized between rollouts. 75 Table 3.12: LEAPSLength100SynthesizedKarelPrograms. Line breaks are not shown here as the programs are very long. The examples picked are ones that represent the programs generated by most seeds for each task. Without the 45 token restriction on program lengths, programs forTopOff,fourCorner, andHarvester are very long and have repetitive movements that can easily be put intoREPEAT orWHILE loops. TheCleanHouse program also contains repeated, somewhat redundantWHILE loops. 
Maze and StairClimber programs are mostly unaffected by the change in maximum program length. These programs demonstrate that the bias induced by program length restriction is important for producing more complex programs in the program synthesis phase of LEAPS. Karel Task Program StairClimber DEF run m( turnLeft turnRight turnLeft turnLeft turnRight WHILE c( noMarkersPresent c)} w( turnLeft move w) m) TopOff DEF run m( WHILE c( noMarkersPresent c) w( move w) turnRight turnRight turnRight turnRight turnRight} turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight putMarker turnRight turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move turnRight move m) CleanHouse DEF run m( turnRight pickMarker turnLeft turnRight turnLeft pickMarker move turnLeft WHILE c( leftIsClear c) w( pickMarker move w) turnRight turnLeft pickMarker move turnLeft WHILE c( leftIsClear c) w( pickMarker move w) turnLeft pickMarker} WHILE c( leftIsClear c) w( pickMarker move turnLeft pickMarker w)} WHILE c( noMarkersPresent c) w( turnLeft move pickMarker w) turnLeft pickMarker turnLeft m) fourCorner DEF run m( turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight turnRight WHILE c( frontIsClear c) w( move w) turnRight WHILE c( frontIsClear c) w( move w) turnRight WHILE c ( frontIsClear c) w( move w) turnRight putMarker WHILE c( frontIsClear c) w( move w) turnRight putMarker WHILE c( frontIsClear c) w( move w)} turnRight putMarker WHILE c( frontIsClear c) w( move w) turnRight putMarker m) Maze DEF run m( WHILE c( noMarkersPresent c) w( REPEAT R=1 r( turnRight r) move w) turnLeft turnRight m) Harvester DEF run m( turnLeft turnRight pickMarker move pickMarker move turnRight move pickMarker move pickMarker move turnRight move pickMarker move pickMarker move pickMarker move turnRight move pickMarker move pickMarker move pickMarker move turnRight move pickMarker move pickMarker move pickMarker move pickMarker move turnRight move pickMarker move pickMarker move pickMarker move pickMarker move turnRight move pickMarker move pickMarker move pickMarker move pickMarker move pickMarker move turnRight move pickMarker move pickMarker move pickMarker move pickMarker move pickMarker move turnRight move m) 76 3.7.11.2 FourCorner The goal is to place a marker at each corner of the Karel environment grid. The reward is defined as sum of corners having a marker divided by four. If the Karel state has a marker placed in wrong location, the reward will be 0. This is on a12× 12 grid. 3.7.11.3 TopOff The goal is to place a marker wherever there is already a marker in the last row of the environment, and end up in the rightmost square on the bottom row at the end of the rollout. The reward is defined as the number of consecutive places until the agent either forgets to place a marker where the marker is already present or places a marker at an empty location in last row, with a bonus for ending up on the last square. 
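To make the shape of this reward concrete, one plausible reading is sketched below; the last-row representation, the "topped off" test, and the normalization are assumptions made for illustration rather than the exact benchmark implementation.

def topoff_reward(initial_last_row, final_last_row, agent_on_last_square):
    """Count consecutive correctly handled cells in the last row, left to right.

    initial_last_row / final_last_row: lists of per-cell marker counts.
    A cell that started with a marker should be topped off (gain a marker);
    a cell that started empty should stay empty. The first mistake ends the count.
    """
    correct = 0
    for before, after in zip(initial_last_row, final_last_row):
        if before > 0 and after > before:        # topped off an existing marker
            correct += 1
        elif before == 0 and after == 0:         # left an empty cell untouched
            correct += 1
        else:                                    # forgot to top off or placed wrongly
            break
    bonus = 1 if agent_on_last_square else 0
    # Normalize by the maximum achievable score so the return lies in [0, 1].
    return (correct + bonus) / (len(initial_last_row) + 1)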
This is on a12× 12 grid, and the marker locations in the last row are randomized between rollouts. 3.7.11.4 Maze The goal is to find a marker in randomly generated maze. The reward is defined as a sparse reward: 1 if the agent finds the marker in the environment rollout, 0 otherwise. This is on a 8× 8 grid, and the marker location, agent’s initial location, and the maze configuration itself are randomized between rollouts. 3.7.11.5 CleanHouse We design a complex14× 22 Karel environment grid that resembles an apartment. The goal is to pick up the garbage (markers) placed at 10 different locations and reach the location where there is a dustbin (2 markers in 1 location). To make the task simpler, we place the markers adjacent to any wall in the environment. The reward is defined as total locations cleaned (markers picked) out of the total number of markers placed in initial Karel environment state (10). The agent’s initial location is fixed but the marker locations are randomized between rollouts. 77 3.7.11.6 Harvester The goal is to pickup a marker from each location in the Karel environment. The final reward is defined as the number of markers picked up divided the total markers present in the initial Karel environment state. This is on a8× 8 grid. We run bothMaze andHarvester on smaller Karel environment grids to save time and compute resources because these are long horizon tasks. 3.7.12 HyperparametersandTrainingDetails 3.7.12.1 DRLandDRL-abs RL training directly on the Karel environment is performed with the PPO algorithm [265] for 2M timesteps using the ALF codebase ∗ . We tried a discretized SAC [104] implementation (by replacing Gaussian distribu- tions with Categorical distributions), but it was outperformed by PPO on the Karel tasks on all environments. We also tried tabular Q-learning from raw Karel grids (it wouldn’t work well on abstract states as the state is partially observed), however it was also consistently outperformed by PPO. For DRL, the policies and value networks are the same with a shared convolutional encoder that first processes the state (as the Karel state size is(H× W × 16) for16 possible agent direction or marker placement values that each state in the grid can take on at a time. The convolutional encoder consists of two layers: the first with 32 filters, kernel size2, and stride1, the second with32 filters, kernel size 4, and stride1. For DRL-abs, the policy and value networks are both comprised of an LSTM layer and a 2-layer fully connected network, all with hidden sizes of100. For each task, we perform a comprehensive hyperparameter grid search over the following parameters, and report results from the run with the best averaged final reward over 5 seeds. The hyperparameter grid is listed below, shared parameters are also listed: • Importance Ratio Clipping: {0.05, 0.1, 0.2} ∗ https://github.com/HorizonRobotics/alf/ 78 • Advantage Normalization: {True, False} • Entropy Regularization: {0.1, 0.01, 0.001} • Number of updates per training iteration (This controls the ratio of gradient steps to environment steps): {1, 4, 8, 16} • Number of environment steps per set of training iterations: 32 • Number of parallel actors: 10 • Optimizer: Adam • Learning Rate: 0.001 • Batch Size: 128 Hyperparameters that performed best for each task are listed below. 
DRL Import Ratio Clip Adv Norm Entropy Reg Updates per Train Iter CleanHouse 0.1 True 0.01 4 FourCorner 0.2 True 0.01 16 Harvester 0.05 True 0.01 8 Maze: 0.05 True 0.001 8 StairClimber 0.1 True 0.1 4 TopOff 0.05 True 0.001 4 79 DRL-abs Import Ratio Clip Adv Norm Entropy Reg Updates per Train Iter CleanHouse 0.2 True 0.01 8 FourCorner 0.05 True 0.01 4 Harvester 0.2 True 0.01 4 Maze: 0.2 True 0.001 4 StairClimber 0.05 True 0.1 16 TopOff 0.2 True 0.001 8 3.7.12.2 DRL-abs-t DRL-abs-t is limited to DRL-abs policies as the state spaces are different for some of the Karel tasks. For DRL- abs-t, we use the best hyperparameter configuration for each Karel task to train a policy to 1M timesteps. Then, we attempt direct policy transfer to each other task by training for another 1M timesteps on the new task with the same hyperparameters (excluding transferring to the same task). Numbers reported are from the task transfer that achieved the highest reward. The tasks that we transfer from for each task are listed below: 80 DRL-abs-t Transferred from CleanHouse Harvester FourCorner TopOff Harvester Maze Maze StairClimber StairClimber Harvester TopOff Harvester 3.7.12.3 HRL Pretrainingstage: We first train a VAE to reconstruct action trajectories generated from our program dataset. For each program, we generate 10 rollouts in randomly configured Karel environments to produce the HRL dataset, giving this baseline the same data as LEAPS. These variable-length action sequences are encoded via an LSTM encoder into a 10-dimensional, continuous latent space and decoded by an LSTM decoder into the original action trajectories. We chose 10-dimensional so as to not make downstream RL too difficult. We tune the KL divergence weight ( β ) of this network such that it’s as high as possible while being able to reconstruct the trajectories well. Network/training details below: • β : 1.0 • Optimizer: Adam (All optimizers) • Learning Rates: 0.0003 • Hidden layer size: 128 • # LSTM layers (both encoder/decoder): 2 81 • Latent embedding size: 10 • Nonlinearity: ReLU • Batch Size: 128 Downstream(Hierarchical)RL On our Karel tasks, we use the VAE’s decoder to decode latent vectors (actions for the RL agent) into varied-length action sequences for all Karel tasks. The decoder parameters are frozen and used for all environments. The RL agent is retrained from scratch for each task, in the same manner as the standard RL baselines DRL-abs and DRL. We use Soft-Actor Critic (SAC, Haarnoja et al. [104]) as the RL algorithm as it is state of the art in many continuous action space environments. SAC grid search parameters for all environments follow below: • Number of updates per training iteration: {1, 8} • Number of environment steps per set of training iterations: 8 (multiplied by the number of steps taken by the decoder in the environment) • Polyak Averaging Coefficient: {0.95, 0.9} • Number of parallel actors: 1 • Batch size: 128 • Replay buffer size: 1M The best hyperparameters follow: 82 HRL-abs Updates per Train Iter Polyak Coefficient CleanHouse 1 0.95 FourCorner 8 0.9 Harvester 8 0.95 Maze 1 0.95 StairClimber 1 0.9 TopOff 1 0.9 HRL Updates per Train Iter Polyak Coefficient CleanHouse 1 0.9 FourCorner 1 0.95 Harvester 1 0.95 Maze 8 0.9 StairClimber 8 0.95 TopOff 8 0.95 3.7.12.4 Naïve The naïve program synthesis baseline takes an initial token as input and outputs an entire program at each timestep to learn a recurrent policy guided by the rewards of these programs. 
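A rough sketch of this recurrent synthesis policy is shown below. The hidden size follows the hyperparameters listed later in this section, but the single linear policy and value heads, the sampling loop, and the 45-token cap are illustrative simplifications rather than the exact implementation.

import torch
import torch.nn as nn

class NaiveSynthesisPolicy(nn.Module):
    """Recurrent policy that emits one DSL token per step until an end token."""

    def __init__(self, num_tokens, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, num_tokens)  # token logits
        self.value_head = nn.Linear(hidden_size, 1)            # baseline for PPO

    def forward(self, prev_token, hidden):
        x = self.embed(prev_token).unsqueeze(1)                # (B, 1, H)
        out, hidden = self.gru(x, hidden)
        out = out.squeeze(1)
        return self.policy_head(out), self.value_head(out), hidden

def rollout_program(policy, start_token, end_token, max_len=45):
    """Sample a full program; it only yields a reward once executed in Karel."""
    tokens, hidden = [start_token], None
    prev = torch.tensor([start_token])
    for _ in range(max_len):
        logits, _, hidden = policy(prev, hidden)
        prev = torch.distributions.Categorical(logits=logits).sample()
        tokens.append(prev.item())
        if prev.item() == end_token:
            break
    return tokens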
We execute these generated programs on 10 random environment start configurations in Karel to get the reward. We run PPO for 2M Karel environment timesteps. The policy network is comprised of one shared GRU layer, followed by two fully connected layers, for both the policy and value networks. For evaluation, we generate 64 programs 83 from the learned policy, and choose the program with the maximum reward on 10 demonstrations. For each task, we perform a hyperparameter grid search over the following parameters, and report results from the run with the best averaged final reward over 5 seeds. We exponentially decay the entropy loss coefficient in PPO from the initial to final entropy coefficient to avoid local minima during the initial training steps. • Learning Rate: 0.0005 • Batch Size (B): {64, 128, 256} • initial entropy coefficient ( E i ): {1.0, 0.1} • final entropy coefficient: {0.01} • Hidden Layer Size: 64 Hyperparameters that performed best for each task are listed below. Naïve B E i WHILE 128 0.1 IFELSE+WHILE 256 1.0 2IF+IFELSE 256 0.1 WHILE+2IF+IFELSE 128 0.1 84 Naïve B E i CleanHouse 128 0.1 FourCorner 128 1.0 Harvester 128 1.0 Maze 256 1.0 StairClimber 128 1.0 TopOff 128 1.0 3.7.12.5 VIPER VIPER [25] builds a decision tree programmatic policy by imitating a given teacher policy. We use the best DRL policies as teachers instead of the DQN [200] teacher policy used in Bastani, Pu, and Solar-Lezama [25]. We did this in order to give the teacher the best performance possible for maximum fairness in comparison against VIPER, as we empirically found the PPO policy to perform much better on our tasks than a DQN policy. We perform a grid search over VIPER hyperparameters, listed below: • Max depth of decision tree: {6, 12, 15} • Max number of samples for tree policy: {100k, 200k, 400k} • Sample reweighting: {True, False} The best hyperparameters found for each task are listed below: 85 VIPER Max Depth Max Num Samples Sample Reweighting CleanHouse 6 100k False FourCorner 12 100k False Harvester 12 400k True Maze 12 100k True StairClimber 12 400k True TopOff 15 100k False 3.7.12.6 ProgramEmbeddingSpaceVAEModel Encoder-DecoderArchitecture. The encoder and decoder are both recurrent networks. The encoder structure consists of a PyTorch token embedding layer, then a recurrent GRU cell, and two linear layers that produceµ andlogσ vectors to sample the program embedding. The decoder consists of a recurrent GRU cell which takes in the embedding of the previous token generated and then a linear token output layer which models the log probabilities of all discrete tokens. Since we have access to DSL grammar during program synthesis, we utilize a syntax checker based on the Karel DSL grammar from Bunel et al. [35] at the output of the decoder to limit predictions to syntactically valid tokens. We restrict our decoder from predicting syntactically invalid programs by masking out tokens that make a program syntactically invalid at each timestep. This syntax checker is designed as a state machine that keeps track of a set of valid next tokens based on the current token, open code blocks (e.g. while, if, ifelse) in the given partial program, and the grammar rules of our DSL. 
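To illustrate how this masking is applied in practice, the short sketch below adds a {-inf, 0} mask to the decoder's token logits before the softmax; the mask itself is defined formally next, and the valid_next_tokens interface of the checker is a hypothetical stand-in rather than the exact implementation.

import torch

NEG_INF = float("-inf")

def masked_token_distribution(token_logits, checker, checker_state):
    """Zero out syntactically invalid tokens by adding a {-inf, 0} mask to the logits.

    token_logits: (num_dsl_tokens,) raw decoder outputs for the current step.
    checker / checker_state: the grammar state machine and its current state.
    """
    mask = torch.full_like(token_logits, NEG_INF)
    mask[list(checker.valid_next_tokens(checker_state))] = 0.0
    return torch.softmax(token_logits + mask, dim=-1)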
Since we generate a program as a sequence of tokens, the syntax checker outputs at each timestep a mask $M \in \{-\infty, 0\}^{|\text{DSL tokens}|}$, where

$$M_j = \begin{cases} -\infty & \text{if the } j\text{-th token is not valid in the current context} \\ 0 & \text{otherwise.} \end{cases}$$

This mask is added to the output of the last layer of the decoder, just before the Softmax operation that normalizes the output to a probability over the tokens.

π Architecture. The program-embedding-conditioned policy π consists of a GRU layer that operates on the inputs and three MLP layers that output the log probabilities of environment actions. Specifically, it takes a latent program vector, the current environment state, and the previous action as input and outputs the predicted environment action at each timestep. To evaluate how close the predicted neural execution traces are to the execution traces of the ground-truth programs, we consider the following metrics:
• Action token accuracy: the percentage of matching actions between the predicted execution traces and the ground-truth execution traces.
• Action sequence accuracy: the percentage of predicted execution traces that entirely match the corresponding ground-truth execution traces.
After convergence, our model achieves an action token accuracy of 96.5% and an action sequence accuracy of 91.3%.

Training. The reinforcement learning algorithm used for the program behavior reconstruction loss L^R is REINFORCE [325]. When training LEAPS with all losses, we first train with the Program Reconstruction (L^P) and Latent Behavior Reconstruction (L^L) losses, essentially setting λ_1 = λ_3 = 1 and λ_2 = 0 in our full objective, reproduced below:

$$\min_{\theta,\phi,\pi} \; \lambda_1 \mathcal{L}^{P}_{\theta,\phi}(\rho) + \lambda_2 \mathcal{L}^{R}_{\theta,\phi}(\rho) + \lambda_3 \mathcal{L}^{L}_{\pi}(\rho,\pi). \qquad (3.6)$$

Once this model is trained for one epoch, we then train exclusively with the Program Behavior Reconstruction loss (L^R), setting λ_2 = 1 and λ_1 = λ_3 = 0, with an equal number of updates. These two update steps are repeated alternately until convergence. This is done to avoid potential issues with applying supervised and reinforcement learning gradients at the same time. We did not attempt to train these three losses jointly. All other shared hyperparameters and training details are listed below:
• β: 0.1
• Optimizer: Adam (all optimizers)
• Supervised Learning Rate: 0.001
• RL Learning Rate: 0.0005
• Batch Size: 256
• Hidden Layer Size: 256
• Latent Embedding Size: 256
• Nonlinearity: Tanh()

3.7.12.7 Cross-Entropy Method (CEM)
CEM search works as follows: we sample an initial latent program vector from the initial distribution D_I and generate a population of latent program vectors by adding samples drawn from a N(0, σI_d) distribution, where I_d is the identity matrix of dimension d, to the initial vector. The resulting latent program vectors are decoded into programs, which are executed to obtain their rewards. The population is then sorted by reward, and the set of "elites" with the highest rewards is reduced via a weighted mean to a single latent program vector for the next sampling iteration. This process repeats for all CEM iterations. We include the following sets of hyperparameters when searching over the program embedding space, either to maximize R_mat to reproduce ground-truth program behavior or to maximize rewards in the Karel task MDP.
• Population Size (S): {8, 16, 32, 64} • µ : {0.0} • σ : {0.1,0.25,0.5} • % of population elites (this refers to the percent of the population considered ‘elites’): {0.05,0.1,0.2} • Exponentialσ decay † : {True, False} • Initial distributionD I :{N(1,0),N(0,I d ),N(0,0.1I d )} Since a comprehensive grid search over the hyperparameter space would be too computationally expensive, we choose parameters heuristically. We report results from the run with the best averaged reward over 5 seeds. Hyperparameters that performed best for each task are listed below. Ground-TruthProgramReconstruction We include the following sets of hyperparameters when searching over the program embedding space to maximizeR mat to reproduce ground-truth program behavior. † Over the first 500 epochs, we exponentially decay σ to0.1, and then we keep it at0.1 for the rest of the epochs if True. 89 We allow the search to run for1000 CEM iterations, counting the search as a success when it achieves 10 consecutive CEM iterations with matching the ground-truth program behaviors exactly in the environment across 10 random environment start configurations. We use same hyperparameter set to compare LEAPS-P, LEAPS-P+R, LEAPS-P+L, and LEAPS. CEM S σ # Elites Exp Decay D I WHILE 32 0.25 0.1 False N(0,0.1I d ) IFELSE+WHILE 32 0.25 0.1 True N(0,0.1I d ) 2IF+IFELSE 16 0.25 0.2 True N(0,0.1I d ) WHILE+2IF+IFELSE 32 0.25 0.2 False N(0,0.1I d ) MDP Task Performance We include the following sets of hyperparameters when searching over the LEAPS program embedding space to maximize rewards in the MDP. We allow the search to run for1000 CEM iterations, counting the search as a success when it achieves 10 consecutive CEM iterations of maximizing environment reward (solving the task) across 10 random environment start configurations. CEM S σ # Elites Exp Decay D I CleanHouse 32 0.25 0.05 True N(1,0) FourCorner 64 0.5 0.2 False N(0,0.1I d ) Harvester 32 0.5 0.1 True N(0,I d ) Maze 16 0.1 0.1 False N(1,0) StairClimber 32 0.25 0.05 True N(0,0.1I d ) TopOff 64 0.25 0.05 False N(0,0.1I d ) 90 3.7.12.8 RandomSearchLEAPSAblation The random search LEAPS ablations (LEAPS-rand-8 and LEAPS-rand-64) replace the CEM search method for latent program synthesis with a simple random search method. Both use the full LEAPS model trained with all learning objectives. We sample an initial vector from an initial distributionD I and add it to either 8 or 64 latent vector samples from aN(0,σI d ) distribution. We then decode those vectors into programs and evaluate their rewards, and then report the rewards of the best-performing latent program from that population. As such, the only parameters that we require are the initial sampling distribution andσ . We perform a grid search over the following for both LEAPS-rand-8 and LEAPS-rand-64. • σ : {0.1, 0.25, 0.5} • Initial distributionD I :{N(0,I d ),N(0,0.1I d )} Ground-TruthProgramReconstruction We report hyperparameters below for both random search methods on program reconstruction tasks. LEAPS-rand-8 σ D I WHILE 0.1 N(0,0.1I d ) IFELSE+WHILE 0.5 N(0,0.1I d ) 2IF+IFELSE 0.5 N(0,0.1I d ) WHILE+2IF+IFELSE 0.5 N(0,0.1I d ) 91 LEAPS-rand-64 σ D I WHILE 0.5 N(0,0.1I d ) IFELSE+WHILE 0.5 N(0,0.1I d ) 2IF+IFELSE 0.5 N(0,0.1I d ) WHILE+2IF+IFELSE 0.5 N(0,0.1I d ) MDPTaskPerformance We report hyperparameters below for both random search methods on Karel tasks. 
LEAPS-rand-8 σ D I CleanHouse 0.5 N(0,0.1I d ) FourCorner 0.5 N(0,0.1I d ) Harvester 0.5 N(0,0.1I d ) Maze 0.25 N(0,0.1I d ) StairClimber 0.5 N(0,I d ) TopOff 0.25 N(0,0.1I d ) 92 LEAPS-rand-64 σ D I CleanHouse 0.5 N(0,0.1I d ) FourCorner 0.25 N(0,0.1I d ) Harvester 0.5 N(0,0.1I d ) Maze 0.1 N(0,0.1I d ) StairClimber 0.25 N(0,0.1I d ) TopOff 0.5 N(0,0.1I d ) 3.7.13 ComputationalResources For our experiments, we used both internal and cloud provider machines. Our internal machines are: • M1: 40-vCPU Intel Xeon with 4 GTX Titan Xp GPUs • M2: 72-vCPU Intel Xeon with 4 RTX 2080 Ti GPUs The cloud instances that we used are either 128-thread AMD Epyc or 96-thread Intel Xeon based cloud instances with 4-8 NVIDIA Tesla T4 GPUs. Experiments were run in parallel across many CPUs whenever possible, thus requiring the high vCPU count machines. The experiment costs (GPU memory/time) are as follows: Learning Program Embedding Stage: • LEAPS-P: 4.2GB/13hrs on either M1 or M2 • LEAPS-P+R: 4.2GB/44-54hrs on M2 • LEAPS-P+L: 8.7GB/26hrs on either M1 or M2 • LEAPS: 8.8GB/104hrs on M1, 8.8GB/58hrs on M2 93 Policy Learning Stage: • CEM search: 0.8GB/4-10min (depends on the CEM population size and the number of iterations until convergence) • DRL/DRL-abs/DRL-abs-t: 0.7-2GB/1hr per run with parallelization across 10 processes • HRL/HRL-abs: 1-2GB/2.5hrs per run • VIPER: 0.7GB/20-30 minutes (excluding the time for learning its teacher policy) 3.7.14 TowardRoboticsApplications One way to make the LEAPS framework applicable to robotics domains would be simultaneously learning perception modules and action controllers. Other possible solutions include incorporating program execu- tion methods [11, 216, 288, 337, 160, 124] that are designed to allow program execution or designing DSLs that allow pre-training of perception modules and action controllers. Also, the proposed framework share similarity with works in multi-task RL [216, 11, 295, 299, 281] and meta-learning [280, 202, 319, 315, 311, 244, 77, 139, 316, 45, 158, 212, 252, 193, 239, 49]. Specifically, the proposed framework learns a program embedding space from a distribution of tasks/programs. Once the program embedding space is learned, it can be be reused to solve different tasks without retraining. 94 (a) LEAPS-P (b) LEAPS-P+R (c) LEAPS-P+L (d) LEAPS Figure 3.4: Visualizationsoflearnedprogramembeddingspace. We perform dimensionality reduction with PCA to embed encoded programs from the training dataset, samples drawn from a normal distribution, programs from the testing dataset, and programs reconstructed by models to a 2D space. The shape of the latent training programs in the program embedding spaces learned by LEAPS-P and LEAPS-P+R are similar to a normal distribution, while in the program embedding spaces learned by LEAPS and LEAPS-P+L, the shape is more twisted, suggesting the effectiveness of the proposed latent behavior reconstruction objective. Moreover, the distances between pairs of ground-truth programs and their reconstructions are smaller in the program embedding space learned by LEAPS, highlighting the advantage of employing both of the two proposed behavior reconstruction objectives. 95 (a) Iteration 1 (b) Iteration 4 (c) Iteration 9 (d) Iteration 14 (e) Iteration 19 (f) Iteration 23 Figure 3.5: StairClimberCEMTrajectoryVisualization. Latent training programs from the training dataset, a ground-truth program for StairClimber task, CEM populations, and CEM next candidate programs are embedded to a 2D space using PCA. 
Both the average reward of the entire population and the reward of the next candidate program (CEM Next Center) consistently increase as the number of iterations increase. Also, the CEM population gradually moves toward where the ground-truth program is located. 96 (a) Iteration 1 (b) Iteration 211 (c) Iteration 422 (d) Iteration 633 (e) Iteration 843 (f) Iteration 1000 Figure 3.6: FourCornerCEMTrajectoryVisualization. Latent training programs from the training dataset, a ground-truth program for the FourCorner task, CEM populations, and CEM next candidate programs are embedded to a 2D space using PCA. The CEM trajectory does not converge. The ground-truth program lies far away from the initial sampled distribution, which might contribute to the difficulty of converging. 97 Figure 3.7: UserInterfacefortheHumanDebuggingInterpretabilityExperiments. The top contains moving rollout visualizations of the current program in the “Input Program” box, which users are allowed to edit. “Input Program” will first contain the program synthesized by LEAPS. Syntax errors or other issues with code (such as the edit distance being too high) are displayed in the “Issue with Code?” box, the reward of the current inputted program is in the “New Reward” box, and the reward of the original program synthesized by LEAPS is in the “Orig Reward” box. The user’s best reward across all inputted programs is kept track of in the “Best Reward” box. 98 Figure 3.8: Human Debugging Experiment Example Programs (TopOff). Example original and human-edited programs for each Karel task for edit distances 3 and 5. 99 Figure 3.9: HumanDebuggingExperimentExamplePrograms(FourCorner). Example original and human-edited programs for each Karel task for edit distances 3 and 5. 100 Figure 3.10: HumanDebuggingExperimentExamplePrograms(Harvester). Example original and human-edited programs for each Karel task for edit distances 3 and 5. 101 Figure 3.11: Ground-TruthTestandKarelPrograms. Here we display ground-truth test set programs used for reconstruction experiments and example ground-truth programs that we write which can solve the Karel tasks (there are an infinite number of programs that can solve each task). Conditionals are enclosed inc( c), while loops are enclosed inw( w), if statements are enclosed ini( i), and the main program is enclosed inDEF run m( m). 102 Figure 3.12: Exampleprogramreconstructiontaskprogramsgeneratedbynaïve,LEAPS-P,and LEAPS-P+R. The programs that achieve the highest reward while being representative of programs generated by most seeds are shown. The naïve program synthesis baseline usually generates the simplest programs, with fewer conditional statements and loops than the LEAPS ablations. Notably, it fails to generate IFELSE statements on these examples, while LEAPS has no problem doing so. 103 Figure 3.13: ExampleprogramreconstructiontaskprogramsgeneratedbyLEAPS-P+LandLEAPS. The programs that achieve the highest reward while being representative of programs generated by most seeds are shown. The naïve program synthesis baseline usually generates the simplest programs, with fewer conditional statements and loops than the LEAPS ablations. Notably, it fails to generate IFELSE statements on these examples, while LEAPS has no problem doing so. 104 Figure 3.14: ExampleKarelprogramsgeneratedbyLEAPS. The programs that achieved the best reward out of all seeds are shown. 
Figure 3.15: LEAPS Variations Illustrations. Blue trapezoids represent the modules whose parameters are being learned in the learning program embedding stage. Red diamonds represent the learning objectives. Gray rounded rectangles represent latent programs (i.e. program embeddings), which are vectors. (a) LEAPS-P: the simplest ablation of LEAPS, in which the program embedding space is learned by only optimizing the program reconstruction loss L^P. (b) LEAPS-P+R: an ablation of LEAPS which optimizes both the program reconstruction loss L^P and the program behavior reconstruction loss L^R. (c) LEAPS-P+L: an ablation of LEAPS which optimizes both the program reconstruction loss L^P and the latent behavior reconstruction loss L^L. (d) LEAPS (LEAPS-P+R+L): our proposed framework that optimizes all the proposed objectives.
Figure 3.16: Baseline Methods Illustrations. (a) DRL: a DRL policy that takes raw state input (i.e. a Karel grid represented as a W×H×16 binary tensor, as there are 16 possible states for each grid square). (b) DRL-abs: a DRL policy that takes abstract state input, containing a vector of the returned values of perceptions, e.g. frontIsClear()==true and markersPresent()==false. (c) Naïve: a naïve program synthesis baseline that learns to directly synthesize a program from scratch by recurrently predicting a sequence of program tokens. (d) HRL/HRL-abs: a hierarchical RL baseline in which a VAE, consisting of an encoder enc and a decoder dec, is first trained to reconstruct action sequences from the program execution traces used by LEAPS. Once the action embedding space is learned, it employs a high-level policy π that learns from scratch to solve the task by predicting a distribution in the learned action embedding space. Note that the parameters of the decoder dec are frozen (represented in gray) when the high-level policy is learning. The HRL policy takes raw state input (same as the DRL baseline) and the HRL-abs policy takes abstract state input (same as the DRL-abs baseline).
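To make the HRL baseline of Figure 3.16 (d) concrete, below is a minimal sketch of the action-sequence VAE whose frozen decoder serves as the low-level policy. The layer sizes follow Section 3.7.12.3 (2-layer LSTMs with hidden size 128 and a 10-dimensional latent), while the class and method names are illustrative rather than the exact implementation.

import torch
import torch.nn as nn

class ActionSequenceVAE(nn.Module):
    """VAE over variable-length action sequences from program execution traces."""

    def __init__(self, num_actions, hidden_size=128, latent_size=10):
        super().__init__()
        self.embed = nn.Embedding(num_actions, hidden_size)
        self.encoder = nn.LSTM(hidden_size, hidden_size, num_layers=2, batch_first=True)
        self.to_mu = nn.Linear(hidden_size, latent_size)
        self.to_logvar = nn.Linear(hidden_size, latent_size)
        self.decoder = nn.LSTM(hidden_size + latent_size, hidden_size,
                               num_layers=2, batch_first=True)
        self.action_head = nn.Linear(hidden_size, num_actions)

    def encode(self, actions):
        # actions: (B, T) integer action indices from an execution trace
        _, (h, _) = self.encoder(self.embed(actions))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return z, mu, logvar

# After pretraining, the decoder is frozen; the high-level SAC policy then outputs
# 10-dimensional latent "actions" that the decoder unrolls into action sequences.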
Figure 3.17: Histograms of the program length (i.e. number of program tokens) in the training and validation datasets.
Figure 3.18: Example initial configurations and their ideal end states for the Karel tasks. Note that we show only one example pair of an initial configuration and its ideal end state for each task; markers, walls, and the agent's position are randomized in the initial configurations depending on the task. Please see Section 3.7.11 for more details.
(a) StairClimber: LEAPS and DRL are able to climb the stairs; DRL-abs is unable to do so. (b) FourCorner: In this example, LEAPS generates a program which completely solves the task. Both DRL methods learn to place only a single marker in the bottom-left corner. (c) TopOff: Here, LEAPS generates a program that solves the task by "topping off" each marker. Both DRL methods only learn to top off the initial marker. (d) Maze: All three methods are able to solve the task. (e) CleanHouse: While both DRL methods learn no meaningful behaviors (generally just spinning around in place), LEAPS generates a program that is able to navigate to and clean the leftmost room. (f) Harvester: All three methods make partial progress on Harvester. Figure 3.19: Karel Rollout Visualizations. Example rollouts for LEAPS, DRL-abs, and DRL for each task.

Part III
Primitive Skill Acquisition

Chapter 4
Meta-Learning on Multimodal Task Distributions

4.1 Introduction

Humans make effective use of prior knowledge to acquire new skills rapidly. When the skill of interest is related to a wide range of skills that one has mastered before, we can recall relevant knowledge of prior skills and exploit it to accelerate the acquisition of the new skill. For example, imagine that we are learning a novel snowboarding trick with knowledge of basic skills in snowboarding, skiing, and skateboarding. We accomplish this feat quickly by exploiting our basic snowboarding knowledge together with inspiration from our skiing and skateboarding experience. Can machines likewise quickly master a novel skill based on a variety of related skills they have already acquired?

Recent advances in meta-learning [311, 79, 70] have attempted to tackle this problem. They offer machines a way to rapidly adapt to a new task using few samples by first learning an internal representation shared across similar tasks. Such representations can be learned by considering a distribution over similar tasks as the training data distribution. Model-based (i.e. RNN-based) meta-learning approaches [70, 319, 202, 193] propose to recognize the task identity from a few data samples, use the task identity to adjust a model's state (e.g. an RNN's internal state or an external memory), and make the appropriate predictions with the adjusted model. These methods demonstrate good performance but require hand-designed architectures, and the optimal strategy for designing a meta-learner for arbitrary tasks may not always be obvious to humans. On the other hand, model-agnostic meta-learning frameworks [77, 81, 139, 158, 94, 212, 254, 252] seek an initialization of model parameters such that a small number of gradient updates leads to superior performance on a new task.
With the flexibility in the model choices, these frameworks demonstrate appealing performance on a variety of domains, including regression, image classification, and reinforcement learning. While most of the existing model-agnostic meta-learners rely on a single initialization, different tasks sampled from a complex task distributions can require substantially different parameters, making it difficult to find a single initialization that is close to all target parameters. If the task distribution is multimodal with disjoint and far apart modes (e.g. snowboarding, skiing), one can imagine that a set of separate meta-learners with each covering one mode could better master the full distribution. However, associating each task with one of the meta-learners not only requires additional task identity information, which is often not available or could be ambiguous when the modes are not clearly disjoint, but also disables transferring knowledge across different modes of the task distribution. To overcome this issue, we aim to develop a meta-learner that is able to acquire mode-specific prior parameters and adapt quickly given tasks sampled from a multimodal task distribution. To this end, we leverage the strengths of the two main lines of existing meta-learning techniques: model-based and model-agnostic meta-learning. Specifically, we propose to augment MAML [77] with the capability of generalizing across a multimodal task distribution. Instead of learning a single initialization point in the parameter space, we propose to first compute the task identity of a sampled task by examining task related data samples. Given the estimated task identity, our model then performs modulation to condition the meta-learned initialization on the inferred task mode. Then, with these modulated parameters as the initialization, a few steps of gradient-based adaptation are performed towards the target task to progressively improve its performance. An illustration of our proposed framework is shown in Figure 4.1. 114 To investigate whether our method can acquire meta-learned prior parameters by learning tasks sampled from multimodal task distributions, we design and conduct experiments on a variety of domains, including regression, image classification, and reinforcement learning. The results demonstrate the effectiveness of our approach against other systems. A further analysis has also shown that our method learns to identify task modes without extra supervision. The main contributions of this paper are three-fold as follows: • We identify and empirically demonstrate the limitation of having to rely on a single initialization in a family of widely used model-agnostic meta-learners. • We propose a framework together with an algorithm to address this limitation. Specifically, it generates a set of meta-learned prior parameters and adapts quickly given tasks from a multimodal task distribution leveraging both model-based and model-agnostic meta-learning. • We design a set of multimodal meta-learning problems and demonstrate that our model offers a better generalization ability in a variety of domains, including regression, image classification, and reinforcement learning. 4.2 RelatedWork The idea of empowering the machines with the capability oflearningtolearn [298] has been widely explored by the machine learning community. To improve the efficiency of handcrafted optimizers, a flurry of recent works has focused on learning to optimize a learner model. 
Pioneered by [261, 28], optimization algorithms with learned parameters have been proposed, enabling the automatic exploitation of the structure of learning problems. From a reinforcement learning perspective, [165] represents an optimization algorithm as a learning policy. [13] trains LSTM optimizers to learn update rules from the gradient history, and [244] 115 trains a meta-learner LSTM to update a learner’s parameters. Similar approach for continual learning is explored in [314]. Recently, investigating how we can replicate the ability of humans to learn new concepts from one or a few instances, known as few-shot learning, has drawn people’s attention due to its broad applicability to different fields. To classify images with few examples, metric-based meta-learning frameworks have been proposed [143, 311, 280, 274, 289, 217, 45], which strive to learn a metric or distance function that can be used to compare two different samples effectively. Recent works along this line [217, 342, 158] share a conceptually similar idea with us and seek to perform task-specific adaptation with different type transformations. Due to the limited space, we defer the detailed discussion to Appendix (Section 4.7). While impressive results have been shown, it is nontrivial to adopt them for complex tasks such as acquiring robotic skills using reinforcement learning [118, 170, 133, 242, 98, 104, 160]. On the other hand, instead of learning a metric, model-based (i.e. RNN-based) meta-learning models learn to adjust model states (e.g. a state of an RNN [193, 70, 318] or external memory [256, 202]) using a training dataset and output the parameters of a learned model or the predictions given test inputs. While these methods have the capacity to learn any mapping from datasets and test samples to their labels, they could suffer from overfitting and show limited generalization ability [79]. Model-agnostic meta-learners [77, 81, 139, 158, 94, 212, 254, 252] are agnostic to concrete model configurations. Specifically, they aim to learn a parameter initialization under a certain task distribution, that aims to provide a favorable inductive bias for fast gradient-based adaptation. With its model agnostic nature, appealing results have been shown on a variety of learning problems. However, assuming tasks are sampled from a concentrated distribution and pursuing a common initialization to all tasks can substantially limit the performance of such methods on multimodal task distributions where the center in the task space becomes ambiguous. 116 In this paper, we aim to develop a more powerful model-agnostic meta-learning framework which is able to deal with complex multimodal task distributions. To this end, we propose a framework, which first identifies the mode of sampled tasks, similar to model-based meta-learning approaches, and then it modulates the meta-learned prior parameters to make the model better fit to the identified mode. Finally, the model is fine-tuned on the target task rapidly through gradient steps. 4.3 Preliminaries The goal of meta-learning is to quickly learn task-specific functions that map between input data and the desired output(x k ,y k ) Kt k=1 for different tasks t, where the number of dataK t is small. A task is defined by the underlying data generating distributionP(X) and a conditional probabilityP t (Y|X). For instance, we consider five-way image classification tasks with x k to be images andy k to be the corresponding labels, sampled from a task distribution. 
The data generating distribution is unimodal if it contains classification tasks that belong to a single input and label domain (e.g. classifying different combination of digits). A multimodal counterpart therefore contains classification tasks from multiple different input and label domains (e.g. classifying digits vs. classifying birds). We denote the later distribution of tasks to be the multimodal task distribution. In this paper, we aim to rapidly adapt to a novel task sampled from a multimodal task distribution. We consider a target datasetD consisting of tasks sampled from a multimodal distribution. The dataset is split into meta-training and meta-testing sets, which are further divided into task-specific training D train T and validationD val T sets. A meta-learner learns about the underlying structure of the task distribution through training on the meta-training set and is evaluated on meta-testing set. Our work builds upon Model-Agnostic Meta-Learning (MAML) algorithm [77]. MAML seeks an initialization of parametersθ for a meta-learner such that it can be optimized towards a new task with a small number of gradient steps minimizing the task-specific objectives on the training data D train T , with 117 Modulation Network Task Network x y ( ( K ⇥ Samples Task Encoder Task Embedding Modulation Network Modulation Network MLPs x y ✓ 2 ⌧ 2 ✓ 1 ⌧ 1 ⌧ n ✓ n … ˆ y Figure 4.1: Modeloverview. The modulation net- work produces a task embeddingυ , which is used to generate parameters{τ i } that modulates the task network. The task network adapts modulated param- eters to fit to the target task. Algorithm1MMAMLMeta-TrainingProcedure. 1: Input: Task distributionP(T), Hyper-parametersα andβ 2: Randomly initializeθ andω. 3: while not DONEdo 4: Sample batches of tasksT j ∼ P(T) 5: forall jdo 6: Inferυ =h({x,y} K ;ω h ) with K samples fromD train Tj . 7: Generate parametersτ ={g i (υ ;ω g )|i=1,··· ,N} to modulate each block of the task networkf. 8: Evaluate∇ θ L Tj (f(x;θ,τ );D train Tj ) w.r.t the K samples 9: Compute adapted parameter with gradient descent: 10: θ ′ Tj =θ − α ∇ θ L Tj f(x;θ,τ );D train Tj 11: endfor 12: Updateθ withβ ∇ θ P Tj∼ P(T) L Tj f(x;θ ′ ,τ );D val Tj 13: Updateω g withβ ∇ ωg P Tj∼ P(T) L Tj f(x;θ ′ ,τ );D val Tj 14: Updateω h withβ ∇ ω h P Tj∼ P(T) L Tj f(x;θ ′ ,τ );D val Tj 15: endwhile the adapted parameters generalize well to the validation dataD val T . The initialization of the parameters is trained by sampling mini-batches of tasks fromD, computing the adapted parameters for allD train T in the batch, evaluating adapted parameters to compute the validation losses on theD val T and finally update the initial parametersθ using the gradients from the validation losses. 4.4 Method Our goal is to develop a framework to quickly master a novel task from a multimodal task distribution. We call the proposed framework Multimodal Model-Agnostic Meta-Learning (MMAML). The main idea of MMAML is to leverage two complementary neural networks to quickly adapt to a novel task. First, a network called the modulation network predicts the identity of the mode of a task. Then the predicted mode identity is used as an input by a second network called the task network, which is further adapted to the task using gradient-based optimization. Specifically, the modulation network accesses data points from the target task and produces a set of task-specific parameters to modulate the meta-learned prior parameters of the task network. 
Finally, the modulated task network (but not the task-specific parameters from modulation 118 network) is further adapted to target task through gradient-based optimization. A conceptual illustration can be found in Figure 4.1. In the rest of this section, we introduce our modulation network and a variety of modulation operators in Section 4.4.1. Then we describe our task network and the training details for MMAML in Section 4.4.2. 4.4.1 ModulationNetwork As mentioned above, modulation network is responsible for identifying the mode of a sampled task, and generate a set of parameters specific to the task. To achieve this, it first takes the given K data points and their labels{x k ,y k } k=1,...,K as input to the task encoderf and produces an embedding vectorυ that encodes the characteristics of a task: υ =h {(x k ,y k )|k =1,··· ,K}; ω h (4.1) Then the task-specific parameters τ are computed based on the encoded task embedding vectorυ , which is further used to modulate the meta-learned prior parameters of the task network. The task network (introduced later in Section 4.4.2) can be an arbitrarily parameterized function, with multiple building blocks (or layers) such as deep convolutional networks [110], or multi-layer recurrent networks [234]. To modulate the parameters of each block in the task network as good initialization for solving the target task, we apply block-wise transformations to scale and shift the output activation of each hidden unit in the network (i.e. the output of a channel of a convolutional layer or a neuron of a fully-connected layer). Specifically, the modulation network produces the modulation vectors for each block i, denoted as τ i =g i (υ ;ω g ), where i=1,··· ,N, (4.2) 119 whereN is the number of blocks in the task network. We formalize the procedure of applying modulation as: ϕ i =θ i ⊙ τ i , whereϕ i is the modulated prior parameters for the task network, and⊙ represents a general modulation operator. We investigate some representative modulation operations including attention-based (softmax) modulation [199, 306] and feature-wise linear modulation (FiLM) [231, 220, 121]. We empirically observe that FiLM performs better and more stable than attention-based modulation (see Section 4.5 for details), and therefore use FiLM as default operator for modulation. The details of these modulation operators can be found in Appendix (Section 4.7). 4.4.2 TaskNetwork The parameters of each block of the task network are modulated using the task-specific parameters τ = {τ i | i = 1,··· ,N} generated by the modulation network, which can generate a mode-aware initialization in the parameter spacef(x;θ,τ ). After the modulation step, few steps of gradient descent are performed on the meta-learned prior parameters of the task network to further optimize the objective function for a target taskT i . Note that the task-specific parameters τ i are kept fixed and only the meta- learned prior parameters of the task network are updated. We describe the concrete procedure in the form of the pseudo-code as shown in Algorithm 1. The same procedure of modulation and gradient-based optimization is used both during meta-training and meta-testing time. Detailed network architectures and training hyper-parameters are different by the domain of applica- tions, we defer the complete details to Appendix. 
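As a concrete illustration of the FiLM-style modulation used here, the sketch below shows one way the task embedding υ can be mapped to per-block parameters τ_i = (γ_i, β_i) that scale and shift the activations of a task-network block. The module names and sizes are illustrative and do not reproduce the exact architectures used in our experiments.

import torch
import torch.nn as nn

class FiLMGenerator(nn.Module):
    """Maps the task embedding v to per-block scale (gamma) and shift (beta) vectors."""

    def __init__(self, embedding_size, block_sizes):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(embedding_size, 2 * size) for size in block_sizes])

    def forward(self, v):
        taus = []
        for head in self.heads:
            gamma, beta = head(v).chunk(2, dim=-1)
            taus.append((gamma, beta))
        return taus

def modulate(block_output, tau):
    """FiLM: scale and shift each hidden unit's activation of one task-network block."""
    gamma, beta = tau
    return gamma * block_output + beta

# During adaptation, only the task-network parameters theta receive gradient updates;
# the generated tau = {(gamma_i, beta_i)} are kept fixed for the sampled task.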
4.5 Experiments We evaluate our method (MMAML) and baselines in a variety of domains including regression, image classification, and reinforcement learning, under the multimodal task distributions. We consider the following model-agnostic meta-learning baselines: 120 Table 4.1: Mean square error (MSE) on the multimodal 5-shot regression with 2, 3, and 5 modes. A Gaussian noise withµ =0 andσ =0.3 is applied. Multi-MAML uses ground-truth task modes to select the corresponding MAML model. Our method (with FiLM modulation) outperforms other methods by a margin. Method 2Modes 3Modes 5Modes Post Modulation Post Adaptation Post Modulation Post Adaptation Post Modulation Post Adaptation MAML [77] - 1.085 - 1.231 - 1.668 Multi-MAML - 0.433 - 0.713 - 1.082 LSTM Learner 0.362 - 0.548 - 0.898 - Ours: MMAML (Softmax) 1.548 0.361 2.213 0.444 2.421 0.939 Ours: MMAML (FiLM) 2.421 0.336 1.923 0.444 2.166 0.868 • MAML[77] represents the family of model-agnostic meta-learners. The architecture of MAML on each individual domain is designed to be the same as task network in MMAML. • Multi-MAML consists ofM (the number of modes) MAML models and each of them is specifically trained on the tasks sampled from a single mode. The performance of this baseline is evaluated by choosing models based on ground-truth task-mode labels. This baseline can be viewed as the upper- bound of performance for MAML. If it outperforms MAML, it indicates that MAML’s performance is degenerated due to the multimodality of task distributions. Note that directly comparing the other algorithms to Multi-MAML is not fair as it uses additional information which is not available in real world scenarios. Note that we aim to develop a general model-agnostic meta-learning framework and therefore the comparison to methods that achieved great performance on only an individual domain are omitted. A more detailed discussion can be found in Appendix (Section 4.7). 4.5.1 RegressionExperiments Setups. We experiment with our models in multimodal few-shot regression. In our setup, five pairs of input/output data{x k ,y k } k=1,...,K are sampled from a one dimensional function and provided to a learning 121 Data Points Ground Truth MAML MultiMAML MMAML Sinusoidal Linear Quadratic Transformedℓ 1 Norm Tanh (a) MMAMLpostmodulation vs. other prior models (b) MMAMLpostadaptation vs. other posterior models Figure 4.2: Qualitative Visualization of Regression on Five-modes Simple Functions Dataset. (a): We compare the predicted function shapes of modulated MMAML against the prior models of MAML and Multi-MAML, before gradient updates. Our model can fit the target function with limited observations and no gradient updates. (b): The predicted function shapes after five steps of gradient updates, MMAML is qualitatively better. More visualizations in Appendix (Section 4.7). model. The model is asked to predictL output valuesy q 1 ,...,y q L for input queriesx q 1 ,...,x q L . To construct the multimodal task distribution, we set up five different functions: sinusoidal, linear, quadratic, transformed ℓ 1 norm, and hyperbolic tangent functions, and treat them as discrete task modes. We then evaluate three different task combinations with two functions, three functions and five functions in them. For each task, five pairs of data are sampled and Gaussian noise is added to the output value y, which further increases the difficulty of identifying which function generated the data. 
Baselines and Our Approach. As mentioned above, we use MAML and Multi-MAML as the two baseline methods, both with MLP task networks. Our method (MMAML) augments the task network with a modulation network. We choose an LSTM as the modulation network because it is well suited to processing sequential inputs and producing predictive outputs. Data points (sorted by x value) are first fed to this network to generate task-specific parameters that modulate the task network. The modulated task network is then further adapted using gradient-based optimization. We explore two variants of modulation operators, Softmax and FiLM, in our approach. Additionally, to study the effectiveness of the LSTM model, we evaluate another baseline (referred to as the LSTM Learner) that uses the LSTM as the modulation network (with FiLM) but does not perform gradient-based updates. Please refer to the Appendix (Section 4.7) for the concrete specification of each model.

Results. The quantitative results are shown in Table 4.1. We observe that MAML has the highest error in all settings and that incorporating task identity (Multi-MAML) improves over MAML significantly. This suggests that MAML degenerates under multimodal task distributions. The LSTM Learner outperforms both MAML and Multi-MAML, showing that the sequence model can effectively tackle this regression task. MMAML improves over the LSTM Learner significantly, which indicates that with a better initialization (produced by the modulation network), gradient-based optimization leads to superior performance. Finally, since FiLM consistently outperforms Softmax in the regression experiments, we use it as the modulation method in the rest of the experiments.

We visualize the predicted function shapes of MAML, Multi-MAML, and MMAML (with FiLM) in Figure 4.2. We observe that modulation can significantly modify the prediction of the initial network to be close to the target function (see Figure 4.2 (a)). The prediction is then further improved by gradient-based optimization (see Figure 4.2 (b)). A tSNE [180] visualization of the task embedding (Figure 4.3) shows that our embedding learns to separate the input data of different tasks, which can be seen as evidence of the mode identification capability of MMAML.

Figure 4.3: tSNE plots of the task embeddings produced by our model from randomly sampled tasks; marker color indicates different modes of a task distribution. The plots (b) and (d) reveal a clear clustering according to different task modes, which demonstrates that MMAML is able to identify the task from a few samples and produce a meaningful embedding υ. (a) Regression: the distance between modes aligns with the intuition of the similarity of functions (e.g., a quadratic function can sometimes be similar to a sinusoidal or a linear function, while a sinusoidal function is usually different from a linear function). (b) Few-shot image classification: each dataset (i.e., mode) forms its own cluster. (c-d) Reinforcement learning (Reacher and Point Mass): the numbered clusters represent different modes of the task distribution; tasks from different modes are clearly clustered together in the embedding space.

4.5.2 Image Classification

Setup. The task of few-shot image classification considers the problem of classifying images into N classes with a small number (K) of labeled samples available (i.e., N-way K-shot). To create a multimodal few-shot image classification task, we combine multiple widely used datasets (Omniglot [151], Mini-ImageNet [244], FC100 [217], CUB [317], and Aircraft [181]) to form a meta-dataset, following the train/test splits used in prior work, similar to [303, 284]. The details of all the datasets can be found in the Appendix (Section 4.7). We train models on the meta-datasets with two modes (Omniglot and Mini-ImageNet), three modes (Omniglot, Mini-ImageNet, and FC100), and five modes (all five datasets). We use a 4-layer convolutional network for both MAML and our task network.

Table 4.2: Classification testing accuracies on multimodal few-shot image classification with 2, 3, and 5 modes. Multi-MAML uses ground-truth dataset labels to select the corresponding MAML models. Our method outperforms MAML and achieves comparable results with Multi-MAML in all scenarios.

Method | 2 Modes (5-way 1-shot / 5-way 5-shot / 20-way 1-shot) | 3 Modes (5-way 1-shot / 5-way 5-shot / 20-way 1-shot) | 5 Modes (5-way 1-shot / 5-way 5-shot / 20-way 1-shot)
MAML [77] | 66.80% / 77.79% / 44.69% | 54.55% / 67.97% / 28.22% | 44.09% / 54.41% / 28.85%
Multi-MAML | 66.85% / 73.07% / 53.15% | 55.90% / 62.20% / 39.77% | 45.46% / 55.92% / 33.78%
MMAML (ours) | 69.93% / 78.73% / 47.80% | 57.47% / 70.15% / 36.27% | 49.06% / 60.83% / 33.97%

Results. The results are shown in Table 4.2, demonstrating that our method achieves better results than MAML and performs comparably to Multi-MAML. The performance gap between ours and MAML becomes larger as the number of modes increases, suggesting that our method handles multimodal task distributions better than MAML. Also, compared to Multi-MAML, our method achieves slightly better results, partially because our method learns from all the datasets, while each Multi-MAML model is likely to overfit to a single dataset with a smaller number of classes (e.g., Mini-ImageNet and FC100). This finding aligns with the current trend of meta-learning from multiple datasets [303]. The detailed performance on each dataset can be found in the Appendix (Section 4.7).

To gain insight into the task embeddings υ produced by our model, we randomly sample 2000 5-mode 5-way 1-shot tasks and employ tSNE to visualize υ in Figure 4.3 (b), which shows that our task embedding network captures the relationship among modes, where each dataset forms an individual cluster. This structure shows that our task encoder learns a reasonable task embedding space, which allows the modulation network to modulate the parameters of the task network accordingly.
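To illustrate how episodes are drawn from such a multi-dataset meta-dataset, the sketch below samples a single N-way K-shot task; each episode comes from one dataset (i.e., one mode), and all images are assumed to already be resized to 84×84 as described above. The data structure and names are illustrative, not the actual data pipeline.

import numpy as np

def sample_episode(meta_dataset, rng, n_way=5, k_shot=1, n_query=15):
    """Draw one N-way K-shot episode from a single dataset (one task mode)."""
    name = rng.choice(sorted(meta_dataset))                       # e.g. "Omniglot"
    classes = rng.choice(sorted(meta_dataset[name]), size=n_way, replace=False)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = meta_dataset[name][cls]                          # images pre-resized to 84x84
        idx = rng.choice(len(images), size=k_shot + n_query, replace=False)
        support += [(images[i], label) for i in idx[:k_shot]]
        query += [(images[i], label) for i in idx[k_shot:]]
    return name, support, query

# Tiny stand-in for the real meta-dataset: dataset name -> class name -> list of images.
rng = np.random.default_rng(0)
meta_dataset = {
    "Omniglot": {f"char_{i}": [rng.random((84, 84, 3)) for _ in range(20)] for i in range(8)},
    "MiniImageNet": {f"class_{i}": [rng.random((84, 84, 3)) for _ in range(20)] for i in range(8)},
}
mode, support, query = sample_episode(meta_dataset, rng)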
4.5.3 Reinforcement Learning

Figure 4.4: RL environments. Three environments are used to explore the capability of MMAML to adapt to multimodal task distributions in RL. In all of the environments, the agent is tasked to reach a goal, marked by a star or a sphere in the figures. The goals are sampled from a multimodal distribution in two or three dimensions, depending on the environment. In Point Mass (a), the agent navigates a simple point mass in two dimensions. In Reacher (b), the agent controls a 3-link robot arm in two dimensions. In Ant (c), the agent controls a four-legged ant robot and has to navigate to the goal; the goals are sampled from the 2-dimensional distribution shown in (d), while the agent itself is 3-dimensional.

Setup. Along with few-shot classification and regression, reinforcement learning (RL) has been a central problem where meta-learning has been studied [263, 262, 319, 77, 193, 252]. Similarly to the other domains, the objective in meta-reinforcement learning (meta-RL) is to adapt to a novel task based on limited experience with the task. For RL problems, the inner-loop updates of gradient-based meta-learning take the form of policy gradient updates. For a more detailed description of the meta-RL problem setting, we refer the reader to [252]. We seek to verify the ability of MMAML to learn to adapt to tasks sampled from multimodal task distributions based on limited interaction with the environment. We do so by evaluating MMAML and the baselines on four continuous control environments using the MuJoCo physics simulator [300]. In all of the environments, the agent is rewarded on every time step for minimizing the distance to the goal. The goals are sampled from multimodal goal distributions with environment-specific parameters. The agent does not observe the location of the goal directly but has to learn to find it based on the reward instead. To provide intuition about the environments, illustrations of the robots are presented in Figure 4.4. Example trajectories are presented in Figure 4.5 for Point Mass and in Figure 4.6 for Ant and Reacher. Complete details of the environments and goal distributions can be found in the Appendix (Section 4.7).

Figure 4.5: Visualizations of MMAML and ProMP trajectories in the 4-mode Point Mass 2D environment. Each trajectory originates at the green star. The contours present the multimodal goal distribution. Multiple trajectories are shown per update step. For each column: the leftmost figure depicts the initial exploratory trajectories without modulation or gradient adaptation applied. The middle figure presents ProMP after one gradient adaptation step and MMAML after a gradient adaptation step and the modulation step, which are computed based on the same initial trajectories. The figure on the right presents the methods after two gradient adaptation steps in addition to the MMAML modulation step.

Figure 4.6: Visualizations of MMAML and ProMP trajectories in the Ant and Reacher environments. The figures represent randomly sampled trajectories after the modulation step and two gradient steps for Reacher and three for Ant. Each frame sequence represents a complete trajectory, with the beginning, middle, and end of the trajectories captured by the left, middle, and right frames, respectively. Videos of the trained agents can be found at https://vuoristo.github.io/MMAML/.

Baselines and Our Approach. To identify the mode of a task distribution with MMAML, we run the initial policy to interact with the environment and collect a batch of trajectories. These initial trajectories serve two purposes: computing the adapted parameters using a gradient-based update, and modulating the updated parameters based on the task embedding υ computed by the modulation network. The modulation vectors τ are kept fixed for the subsequent gradient updates. Descriptions of the network architectures and training hyperparameters are deferred to the Appendix (Section 4.7). Due to credit-assignment problems present in the MAML for RL algorithm [77], as identified in [252], we optimize our policies and modulation networks with the ProMP [252] algorithm, which resolves these issues. We use ProMP both as the training algorithm for MMAML and as a baseline. Multi-ProMP is an artificial baseline that shows the performance of training one policy per mode using ProMP. In practice, we train an agent for only one of the modes since the task distributions are symmetric and the agent is initialized to a random pose.
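The adaptation flow described above can be summarized in code. The sketch below is a simplified, self-contained rendering of that flow — embed the initial trajectories with a recurrent modulation network, compute modulation vectors τ that are then held fixed, and take gradient steps on the modulated policy — using random tensors in place of real rollouts and a plain REINFORCE-style surrogate in place of ProMP; all dimensions, names, and the single modulated layer are illustrative.

import torch
import torch.nn as nn

obs_dim, act_dim, hid = 4, 2, 64

policy = nn.Sequential(nn.Linear(obs_dim, hid), nn.Tanh(),
                       nn.Linear(hid, hid), nn.Tanh(),
                       nn.Linear(hid, act_dim))
gru = nn.GRU(input_size=obs_dim + act_dim + 1, hidden_size=32, batch_first=True)
to_tau = nn.Linear(32, 2 * hid)          # FiLM-style scales and shifts for one hidden layer

def modulated_policy(obs, tau):
    """Run the policy with its second hidden layer modulated by tau = (gamma, beta)."""
    gamma, beta = tau.chunk(2, dim=-1)
    h = torch.tanh(policy[0](obs))
    h = torch.tanh(policy[2](h) * gamma + beta)    # modulate the pre-activation
    return policy[4](h)

# (1) dummy initial rollouts: (num_trajectories, time, obs + act + reward)
traj = torch.randn(20, 100, obs_dim + act_dim + 1)
# (2) embed trajectories and compute modulation vectors (kept fixed afterwards)
_, h_last = gru(traj)
tau = to_tau(h_last[-1].mean(dim=0, keepdim=True)).detach()
# (3) gradient-based adaptation of the modulated policy (REINFORCE-style surrogate)
opt = torch.optim.SGD(policy.parameters(), lr=0.01)
obs = traj[:, :, :obs_dim].reshape(-1, obs_dim)
act = traj[:, :, obs_dim:obs_dim + act_dim].reshape(-1, act_dim)
adv = torch.randn(obs.shape[0], 1)                 # placeholder advantages
for _ in range(2):                                 # two inner gradient steps
    logp = -((modulated_policy(obs, tau) - act) ** 2).sum(-1, keepdim=True)  # unit-variance Gaussian log-likelihood (up to a constant)
    loss = -(logp * adv).mean()
    opt.zero_grad(); loss.backward(); opt.step()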
Table 4.3: The mean and standard deviation of cumulative reward per episode for multimodal reinforcement learning problems with 2, 4, and 6 modes, reported across 3 random seeds. Multi-ProMP is ProMP trained on an easier task distribution which consists of a single mode of the multimodal distribution, to provide an approximate upper limit on the performance on each task.

Method | Point Mass 2D (2 / 4 / 6 Modes) | Reacher (2 / 4 / 6 Modes) | Ant (2 / 4 Modes)
ProMP [252] | -397±20 / -523±51 / -330±10 | -12±2.0 / -13.8±2.5 / -14.9±2.9 | -761±48 / -953±46
Multi-ProMP | -109±6 / -109±6 / -92±4 | -4.3±0.1 / -4.3±0.1 / -4.3±0.1 | -624±38 / -611±31
Ours | -136±8 / -209±32 / -169±48 | -10.0±1.0 / -11.0±0.8 / -10.9±1.1 | -711±25 / -904±37

Results. The results for the meta-RL experiments presented in Table 4.3 show that MMAML consistently outperforms the unmodulated ProMP. The good performance of Multi-ProMP, which only considers a single mode, suggests that the difficulty of adaptation in our environments results mainly from the multiple modes. We find that the difficulty of the RL tasks does not scale directly with the number of modes, i.e., the performance gap between MMAML and ProMP for Point Mass with 6 modes is smaller than the gap between them for 4 modes. We hypothesize that the more distinct the modes of the task distribution are, the more difficult it is for a single policy initialization to cover all of them. Therefore, adding intermediate modes (going from 4 to 6 modes) can in some cases make the task distribution easier to learn. The tSNE visualizations of embeddings of random tasks in the Point Mass and Reacher domains are presented in Figure 4.3. The clearly clustered embedding space shows that the task encoder is capable of identifying the task mode, and the good results MMAML achieves suggest that the modulation network effectively utilizes the task embeddings to tackle the multimodal task distribution.

4.6 Conclusion

We present a novel approach that leverages the strengths of both model-based and model-agnostic meta-learners to discover and exploit the structure of multimodal task distributions. Given a few samples from a target task, our modulation network first identifies the mode of the task distribution and then modulates the meta-learned prior in parameter space. Next, the gradient-based meta-learner efficiently adapts to the target task through gradient updates. We empirically observe that our modulation network is capable of effectively recognizing the task modes and producing embeddings that capture the structure of a multimodal task distribution. We evaluate our proposed model on multimodal few-shot regression, image classification, and reinforcement learning, and achieve superior generalization performance on tasks sampled from multimodal task distributions.

4.7 Appendix

4.7.1 Details on Modulation Operators

Attention-based modulation has been widely used in modern deep learning models and has proved its effectiveness across various tasks [340, 199, 349, 333]. Inspired by these works, we employ attention to modulate the prior model. Concretely, the modulation network computes either an attention over the outputs of all neurons (Softmax) or a binary gating value (Sigmoid) on each neuron's output. These modulation vectors τ are then used to scale the pre-activation F_θ of each neural network layer, such that F_φ = F_θ ⊗ τ, where ⊗ denotes channel-wise multiplication.
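As a concrete illustration of this attention-based modulation, the sketch below applies a Softmax- or Sigmoid-gated vector τ to the pre-activation of a single fully connected layer. The modulation network that would produce tau_logits is omitted, the rescaling of the softmax gates so they average to one is an implementation choice of this sketch, and applying a ReLU after modulation is likewise an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedLinear(nn.Module):
    """A linear layer whose pre-activation F_theta is scaled channel-wise by tau."""
    def __init__(self, in_dim, out_dim, gating="softmax"):
        super().__init__()
        self.layer = nn.Linear(in_dim, out_dim)
        self.gating = gating

    def forward(self, x, tau_logits):
        pre = self.layer(x)                                   # F_theta
        if self.gating == "softmax":
            # attention over all neurons, rescaled so the gates average to one (sketch choice)
            tau = F.softmax(tau_logits, dim=-1) * tau_logits.shape[-1]
        else:
            tau = torch.sigmoid(tau_logits)                   # per-neuron soft binary gate
        return torch.relu(pre * tau)                          # F_phi = F_theta (x) tau, then nonlinearity

x = torch.randn(8, 16)
tau_logits = torch.randn(8, 32)        # in practice produced by the modulation network
layer = ModulatedLinear(16, 32, gating="softmax")
out = layer(x, tau_logits)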
Feature-wise linear modulation (FiLM) has been proposed to modulate neural networks to achieve conditioning on data from different modalities. We adopt FiLM as an option for modulating our task network parameters. Specifically, the modulation vectors τ are divided into two components, τ_γ and τ_β, such that for a layer of the neural network with pre-activation F_θ, we have F_φ = F_θ ⊗ τ_γ + τ_β. FiLM can be viewed as a more generic form of the attention mechanism; please refer to [231] for the complete details. In a recent few-shot image classification paper [217], FiLM modulation is used in a metric learning model and achieves high performance. Similarly, FiLM modulation has been shown effective on a variety of tasks such as image synthesis [136, 220, 121, 6], visual question answering [231, 230], style transfer [71], recognition [119, 330], reading comprehension [65], etc.

4.7.2 Further Discussion on Related Works

Discussions on Task-Specific Adaptation/Modulation. As mentioned in the related work of the main text, some recent works [217, 342, 158] leverage task-specific adaptation or modulation to achieve few-shot image classification. We now discuss them in detail. [217] proposes to learn a task-specific network that adapts the weights of the visual embedding network via feature-wise linear modulation (FiLM) [231]. Similarly, [342] learns to perform task-specific adaptation for few-shot image classification via a Transformer [306]. [158] learns a visual embedding network with a task-specific metric and task-agnostic parameters, where the task-specific metric can be updated via a fixed number of gradient steps, similar to [79]. In contrast, we aim to leverage the power of task-specific modulation to develop a more powerful model-agnostic meta-learning framework, which is able to effectively adapt to tasks sampled from a multimodal task distribution. Note that our proposed framework is capable of solving few-shot regression, classification, and reinforcement learning tasks.

4.7.3 Baselines

Since we aim to develop a general model-agnostic meta-learning framework, the comparison to methods that achieve great performance in only an individual domain is omitted.

Image Classification. While Prototypical Networks [280], Proto-MAML [303], and TADAM [217] learn a metric space for comparing samples and are therefore not directly applicable to the regression and reinforcement learning domains, we believe it would be informative to evaluate those methods on our multimodal image classification setting. For this purpose, we refer the readers to a recent work [303] which presents extensive experiments on a similar multimodal setting with a wide range of methods, including model-based (RNN-based) methods, model-agnostic meta-learners, and metric-based methods.

Figure 4.7: tSNE plots of the task embeddings produced by our model from randomly sampled regression tasks. We visualize the task embeddings for the two-mode, three-mode, and five-mode settings.

Reinforcement Learning. We believe comparing MMAML to ProMP [252] on reinforcement learning tasks highlights the advantage of using a separate modulation network in addition to the task network, given that in the reinforcement learning setting MMAML uses ProMP as the optimization algorithm.
Besides ProMP, Bayesian MAML [139] presents an appealing baseline for multimodal task distributions. We tried to run Bayesian MAML on our multimodal task distributions but had technical difficulties with it. The source code for Bayesian MAML in classification and regression is not publicly available. 4.7.4 AdditionalExperimentalDetails 4.7.4.1 Regression Setups To form multimodal task distributions for regression, we consider a family of functions including sinusoidal functions (in forms ofA· sinw· x+b+ϵ , withA∈ [0.1,5.0],w∈ [0.5,2.0] andb∈ [0,2π ]), linear functions (in forms ofA· x+b, withA∈ [− 3,3] andb∈ [− 3,3]), quadratic functions (in forms 131 of A· (x− c) 2 + b, with A ∈ [− 0.15,− 0.02]∪ [0.02,0.15], c ∈ [− 3.0,3.0] and b ∈ [− 3.0,3.0] ), ℓ 1 norm function (in forms ofA·|x− c|+b, withA ∈ [− 0.15,− 0.02]∪[0.02,0.15], c ∈ [− 3.0,3.0] and b∈[− 3.0,3.0]), and hyperbolic tangent function (in forms ofA· tanh(x− c)+b, withA∈[− 3.0,3.0], c∈[− 3.0,3.0] andb∈[− 3.0,3.0]). Gaussian observation noise withµ =0 andϵ =0.3 is added to each data point sampled from the target task. In all the experiments,K is set to5 andL is set to10. We report the mean squared error (MSE) as the evaluation criterion. Due to the multimodality and uncertainty, this setting is more challenging comparing to [77]. ModelsandOptimization In the regression task, we trained a 4-layer fully connected neural network with the hidden dimensions of100 and ReLU non-linearity for each layer, as the base model for both MAML and MMAML. In MMAML, an additional model with a Bidirectional LSTM of hidden size40 is trained to generateτ and to modulate each layer of the base model. We used the same hyper-parameter settings as the regression experiments presented in [77] and used Adam [140] as the meta-optimizer. For all our models, we train on 5 meta-train examples and evaluate on 10 meta-val examples to compute the loss. EvaluationProtocol In the evaluation of regression experiments, we samples 25,000 tasks for each task mode and evaluate all models with 5 gradient steps during the adaptation (if applicable), with the adaptation learning rate set to be the one models learned with. Therefore, the results for 2 mode experiments is computed over 50,000 tasks, corresponding 3 mode experiment is computed over 75,000 tasks and 5 mode has 125,000 tasks in total. We evaluate all methods over the function range between -5 and 5, and report the accumulated mean squared error as performance measures. EffectofModulationandAdaptation We analyze the effect of modulation and adaptation steps on the regression experiments. Specifically, we show both the qualitative and quantitative results on the 5-mode regression task, and plot the induced 132 function curves as well as measure the Mean Squared Error (MSE) after applying modulation step or both modulation and adaptation step. Note that MMAML starts from a learned prior parameters (denoted as prior params), and then sequentially performs modulation and adaptation steps. The results are shown in the Figure 4.8 and Table 4.4. We see that while inference with prior parameters itself induces high error, adding modulation as well as further adaptation can significantly reduce such error. We can see that the modulation step is trying to seek a rough solution that captures the shape of the target curve, and the gradient based adaptation step refines the induced curve. Figure 4.8: 5-mode Regression: Visualization with Linear & Quadratic Function. 
Linear Quadratic Table 4.4: 5-mode Regression: Performance mea- sured in mean squared error (MSE). MMAML MSE Prior Params 17.299 +Modulation 2.166 +Adaptation 0.868 4.7.4.2 ImageClassification Meta-dataset To create a meta-dataset by merging multiple datasets, we utilize five popular datasets: Omniglot, Mini-ImageNet,FC100,CUB, andAircraft. The detailed information of all the datasets are summarized in Table 4.5. To fit the images from all the datasets to a model, we resize all the images to 84× 84. The images randomly sampled from all the datasets are shown in Figure 4.9, demonstrating a diverse set of modes. Hyperparameters 133 (a) Omniglot (e) Aircraft (c) FC100 (b) Mini-ImageNet (d) CUB Figure 4.9: Examples of images from all the datasets. Table 4.5: Details of few-shot image classification datasets. Dataset Trainclasses Validationclasses Testclasses Imagesize Imagechannel Imagecontent Omniglot 4112 688 1692 28× 28 1 handwritten characters Mini-ImageNet 64 16 20 84× 84 3 objects FC100 64 16 20 32× 32 3 objects CUB 140 30 30 ∼ 500× 500 3 birds Aircraft 70 15 15 ∼ 1-2 Mpixels 3 aircrafts We present the hyperparameters for all the experiments in Table 4.6. We use the same set of hyperpa- rameters to train our model and MAML for all experiments, except that we use a smaller meta batch-size for 20-way tasks and train the jobs for more iterations due to the limited memory of GPUs that we have access to. We use 15 examples per class for evaluating the post-update meta-gradient for all the experiments, following [77, 244]. All the trainings use the Adam optimizer [140] with default hyperparameters. For Multi-MAML, since we train a MAML model for each dataset, it gives us the freedom to use different sets of hyperparameters for different datasets We tried our best to find the best hyperparameters for each dataset. NetworkArchitectures 134 Table 4.6: Hyperparameters for multimodal few-shot image classification experiments. We experiment different hyperparameters for each dataset for Multi-MAML. The dataset group Grayscale includesOm- niglot andRGB includesMini-ImageNet andFC100,CUB, andAircraft. Method Setup Datasetgroup Slowlr Fastlr Metabach-size Numberofupdates Trainingiterations MAML 5-way 1-shot - 0.001 0.05 10 5 60000 5-way 5-shot MMAML (ours) 5-way 1-shot 5-way 5-shot MAML 20-way 1-shot - 0.001 0.05 5 5 80000 20-way 3-shot MMAML (ours) 20-way 1-shot 20-way 3-shot Multi-MAML 5-way 1-shot Grayscale 0.001 0.4 10 1 60000 RGB 0.01 4 5 5-way 5-shot Grayscale 0.4 10 1 RGB 0.01 4 5 20-way 1-shot Grayscale 0.1 4 5 80000 RGB 0.01 2 5 20-way 3-shot Grayscale 0.1 4 5 RGB 0.01 2 5 TaskNetwork. For the task network, we use the exactly same architecture as the MAML convolutional network proposed in [77]. It consists of four convolutional layers with the channel size32,64,128, and256, respectively. All the convolutional layers have a kernel size of3 and stride of2. A batch normalization layer follows each convolutional layer, followed by ReLU. With the input tensor size of(n· k)× 84× 84× 3 for an- wayk-shot task, the output feature maps after the final convolutional layer have a size of (n· k)× 6× 6× 256. The feature maps are then average pooled along spatial dimensions, resulting feature vectors with a size of (n· k)× 256. A linear fully-connected layer takes the feature vector as input, and produce a classification prediction with a size ofn for n-way classification tasks. TaskEncoder. For the task encoder, we use the exactly same architecture as the task network. 
It consists of four convolutional layers with the channel size32,64,128, and256, respectively. All the convolutional layers have a kernel size of3, stride of2, and use valid padding. A batch normalization layer follows each convolutional layer, followed by ReLU. With the input tensor size of(n· k)× 84× 84× 3 for an-way 135 Table 4.7: The performance (classification accuracy) on the multimodalfew-shotimageclassification with2modes on each dataset. Setup Method Datasets Omniglot Mini-ImageNet Overall 5-way 1-shot MAML 89.24% 44.36% 66.80% Multi-MAML 97.78% 35.91% 66.85% MMAML (ours) 94.90% 44.95% 69.93% 5-way 5-shot MAML 96.24% 59.35% 77.79% Multi-MAML 98.48% 47.67% 73.07% MMAML (ours) 98.47% 59.00% 78.73% 20-way 1-shot MAML 55.36% 15.67% 35.52% Multi-MAML 91.59% 14.71% 53.15% MMAML (ours) 83.14% 12.47% 47.80% Table 4.8: The performance (classification accuracy) on the multimodalfew-shotimageclassification with3modes on each dataset. Setup Method Datasets Omniglot Mini-ImageNet FC100 Overall 5-way 1-shot MAML 86.76% 43.27% 33.29% 54.55% Multi-MAML 97.78% 35.91% 34.00% 55.90% MMAML (ours) 93.67% 41.07% 33.67% 57.47% 5-way 5-shot MAML 95.11% 61.48% 47.33% 67.97% Multi-MAML 98.48% 47.67% 40.44% 62.20% MMAML (ours) 99.56% 60.67% 50.22% 70.15% 20-way 1-shot MAML 57.87% 15.06% 11.74% 28.22% Multi-MAML 91.59% 14.71% 13.00% 39.77% MMAML (ours) 85.00% 13.00% 10.81% 36.27% k-shot task, the output feature maps after the final convolutional layer have a size of (n· k)× 6× 6× 256. The feature maps are then average pooled along spatial dimensions, resulting feature vectors with a size of(n· k)× 256. To produce an aggregated embedding vector from all the feature vectors representing all samples, we perform an average pooling, resulting a feature vector with a size of256. Finally, a fully- connected layer followed by ReLU takes the feature vector as input, and produce a task embedding vector υ with a size of128. 136 Table 4.9: The performance (classification accuracy) on the multimodalfew-shotimageclassification with5modes on each dataset. Setup Method Datasets Omniglot Mini-ImageNet FC100 CUB Aircraft Overall 5-way 1-shot MAML 83.63% 37.78% 33.70% 86.96% 36.74% 35.48% Multi-MAML 97.78% 35.91% 34.00% 93.44% 32.03% 27.59% MMAML (ours) 91.48% 42.89% 32.59% 93.56% 38.30% 36.82% 5-way 5-shot MAML 89.41% 51.26% 43.41% 82.30% 45.80% 43.92% Multi-MAML 98.48% 47.67% 40.44% 98.56% 45.70% 47.29% MMAML (ours) 97.96% 51.29% 44.08% 97.88% 53.80% 51.53% 20-way 1-shot MAML 59.10% 15.49% 11.75% 59.45% 16.31% 31.57% Multi-MAML 91.59% 14.71% 13.00% 85.46% 18.87% 30.72% MMAML (ours) 86.28% 14.35% 11.59% 91.86% 24.05% 30.89% (a) 2-mode classification (b) 3-mode classification (c) 5-mode classification Figure 4.10: tSNE plots of task embeddings produced in multimodal few-shot image classification domain. (a) 2-mode 5-way 1-shot (b) 3-mode 5-way 1-shot (c) 5-mode 5-way 5-shot. ModulationMLPs . Since the task network consists of four convolutional layers with the channel size 32,64,128, and256 and modulating each of them requires producing bothτ γ andτ β , we employ four linear fully-connected layers to convert the task embedding vectorυ to{τ γ 1 ,τ β 1 } (with a dimension of32),{τ γ 2 , τ β 2 } (with a dimension of64),{τ γ 3 ,τ β 3 } (with a dimension of128), and{τ γ 4 ,τ β 4 } (with a dimension of 256). Note the modulation for each layer is performed byθ i ⊙ γ i +β i , where⊙ denotes the Hadamard product. 
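To make the modulation MLPs concrete, the following sketch maps a task embedding υ of size 128 to per-layer FiLM parameters {τ_γi, τ_βi} matching the channel sizes above, and applies θ_i ⊙ γ_i + β_i to a convolutional feature map. Shapes and names are illustrative; this is a sketch of the described computation, not the released implementation.

import torch
import torch.nn as nn

class FiLMGenerator(nn.Module):
    """Maps the task embedding v (dim 128) to per-layer FiLM parameters (gamma_i, beta_i)
    matching the conv channel sizes 32, 64, 128, and 256."""
    def __init__(self, embed_dim=128, channels=(32, 64, 128, 256)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(embed_dim, 2 * c) for c in channels)

    def forward(self, v):
        taus = []
        for head in self.heads:
            gamma, beta = head(v).chunk(2, dim=-1)
            taus.append((gamma, beta))
        return taus

def film(feature_map, gamma, beta):
    """Apply theta (.) gamma + beta channel-wise to a conv feature map of shape (N, C, H, W)."""
    return feature_map * gamma[:, :, None, None] + beta[:, :, None, None]

v = torch.randn(1, 128)                         # task embedding for one task
gen = FiLMGenerator()
(g1, b1), *_ = gen(v)
feats = torch.randn(10, 32, 42, 42)             # example feature maps after the first conv layer (spatial size illustrative)
modulated = film(feats, g1, b1)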
137 (a)PointMass 2 Modes (b)PointMass 4 Modes (c)PointMass 6 Modes (a)Reacher 2 Modes (b)Reacher 4 Modes (c)Reacher 6 Modes (a)Ant 2 Modes (a)Ant 4 Modes Figure 4.11: Training curves for MMAML and ProMP in reinforcement learning environments. The curves indicate the average return per episode after gradient-based updates and modulation. The shaded region indicates standard deviation across three random seeds. The curves have been smoothed by averaging the values within a window of 10 steps. 4.7.4.3 ReinforcementLearning Environments The training curves for all environments are presented in Figure 4.11. PointMass . We consider three variants of thePointMass environment with 2, 4, and 6 modes. The agent controls a point mass by outputting changes to the velocity. At every time step the agent receives the negative euclidean distance to the goal as the reward. The goals are sampled from a multimodal goal distribution by first selecting the mode center and then adding Gaussian noise to the goal location. In the 4 mode variant the modes are the points(− 5,− 5),(− 5,5),(5,− 5),(5,5). In the 2 mode variant the modes are the points(− 5,− 5),(5,5). In the 6 mode variant the modes are the vertices of a regular hexagon with 138 at distance5 from the origin. All variants have noise scale of2.0. Visualizations of agent trajectories can be found in Figure 4.13. Reacher . We consider three variants of theReacher environment with 2, 4, and 6 modes. The agent controls a 2-dimensional robot arm with three links simulated in the MuJoCo [300] simulator. The goal distribution is similar to the goal distributions inPointMass but different parameters are used to match the scale of the environment. The reward for the environment is R(s,a)=− 1∗ (x point − x goal ) 2 −∥ a∥ 2 wherex point is the location of the point of the arm,x goal if the location of the goal anda is the action chosen by the agent. The modes of the goal distribution in the 4 mode variant are located at(− 0.225,− 0.225), (0.225,− 0.225),(− 0.225,0.225),(0.225,0.225) and the goal noise has scale of0.1. In the 2 mode variant the modes are located at(− 0.225,− 0.225),(0.225,0.225) and the noise scale is0.1. In the 6 mode variant the mode centers are the vertices of a regular hexagon with distance to the origin of0.318 and the noise scale is0.1. Ant . We consider two variants of theAnt environment with two and four modes. The agent controls an ant robot with four limbs simulated in the MuJoCo [300] simulator. The reward for the environment is R(s,a)=− 1∗ (x torso − x goal ) 2 − λ control ∗∥ a∥ 2 wherex torso is the location of the torso of the robot,x goal if the location of the goal,λ control =0.1 is the weighting for the control cost anda is the action chosen by the agent. The modes of the goal distribution in the 4 mode variant are located at(− 4,0),(− 2,3.46),(2,3.46),(4.0,0) and the goal noise has scale of 0.8. In the 2 mode variant the modes are located at(− 4.0,0),(4.0,0) and the noise scale is0.8. 139 Table 4.10: Hyperparameter settings for reinforcement learning. Environment Algorithm TrainingIterations TrajectoryLength Slowlr Fastlr InnerGradientSteps Clipeps PointMass MMAML 400 100 0.0005 0.01 2 0.1 ProMP Multi-ProMP Reacher MMAML 800 50 0.001 0.1 2 0.1 ProMP Multi-ProMP Ant MMAML 800 250 0.001 0.1 3 0.1 ProMP Multi-ProMP NetworkArchitecturesandHyperparameters For all RL experiments we use a policy network with two 64-unit hidden layers. The modulation network in RL tasks consists of a GRU-cell and post processing layers. 
The inputs to the GRU are the concatenated observations, actions and reward for each trajectory. The trajectories are processed separately. An MLP is used to process the last hidden states of each trajectory. The outputs of the MLPs are averaged and used by another MLP to compute the modulation vectorsτ . All MLPs have a single hidden layer of size 64. We sample 40 tasks for each update step. For each gradient step for each task we sample 20 trajectories. The hyperparameters, which differ from setting to setting are presented in Table 4.10. 4.7.5 AdditionalExperimentalResults 4.7.5.1 Regression We show visualization of embeddings for regression experiments with a varying number of task modes as Figure 4.7. We observe a linear separation in the two task modes and three task modes scenarios, which indicates that our method is capable of identifying data from different task modes. On the visualization of five task mode, we observe that data from linear, transformed ℓ 1 norm and hyperbolic tangent functions 140 cluttered. This is due to the fact that those functions are very similar to each other, especially with the Gaussian noise we added in the output space. Data Points Ground Truth MAML MultiMAML MMAML Sinusoidal Linear Quadratic Transformedℓ 1 Norm Tanh Figure 4.12: Additional qualitative results of the regression tasks. MMAMLafteradaptation vs. other posterior models. 141 4.7.5.2 ImageClassification We provide the detailed performance of our method and the baselines on each individual dataset for all 2, 3, and 5 mode experiments, shown in Table 4.7, Table 4.8, and Table 4.9, respectively. Note that the main paper presents the overall performance (the last columns of each table) on each of 2, 3, and 5 mode experiments. We found the results onOmniglot andMini-ImageNet demonstrate similar tendency shown in [303]. Note that the performance of Omniglot andFC100 might be slightly different from the results reported in the related papers because (1) all the images are resized and tiled along the spatial dimensions, (2) different hyperparamters are used, and (3) different numbers of training iterations. Additional tSNE plots for predicted task embeddings of 2-mode 5-way 1-shot classification, 3-mode 5-way 1-shot classification, and 5-mode 20-way 1-shot classification are shown in Figure 4.10. 4.7.5.3 ReinforcementLearning Additional trajectories sampled from the 2D navigation environment are presented in Figure 4.13. 142 Figure 4.13: Additional trajectories sampled from the point mass environment with MMAML and ProMP for six tasks. The contour plots represents the multimodal task distribution. The stars mark the start and goal locations. The curves depict five trajectories sampled using each method after zero, one and two update steps. In the figure, the modulation step takes place between the initial policy and the step after one update. 143 (a) 2 Modes 5-way trainings (b) 2 Modes 20-way trainings (c) 3 Modes 5-way trainings (d) 3 Modes 20-way trainings (e) 5 Modes 5-way trainings (f) 5 Modes 20-way trainings Figure 4.14: Training curves of MAML and our method for few-shot image classification. We show the losses and classification accuracies for training and validation tasks after adaptation. MAML trainings are less stable, while ours curves are smoother. 144 (a) 5-way 1-shot trainings (b) 5-way 5-shot trainings (c) 20-way 1-shot trainings Figure 4.15: Training curves of Multi-MAML for few-shot image classification. 
We show the losses and classification accuracies for training and validation tasks after adaptation.

Chapter 5

Meta-Learning on Long-Horizon and Sparse-Reward Tasks

5.1 Introduction

In recent years, deep reinforcement learning methods have achieved impressive results in robot learning [98, 15, 134]. Yet, existing approaches are sample inefficient, rendering the learning of complex behaviors through trial-and-error infeasible, especially on real robot systems. In contrast, humans are capable of effectively learning a variety of complex skills in only a few trials. This can be greatly attributed to our ability to learn how to learn new tasks quickly by efficiently utilizing previously acquired skills. Can machines likewise learn how to learn by efficiently utilizing learned skills like humans? Meta-reinforcement learning (meta-RL) holds the promise of allowing RL agents to acquire novel tasks with improved efficiency by learning to learn from a distribution of tasks [77, 243]. In spite of recent advances in the field, most existing meta-RL algorithms are restricted to short-horizon, dense-reward tasks. To facilitate efficient learning on long-horizon, sparse-reward tasks, recent works aim to leverage experience from prior tasks in the form of offline datasets without additional reward and task annotations [179, 232, 42]. While these methods can solve complex tasks with substantially improved sample efficiency over methods learning from scratch, millions of environment interactions are still required to acquire long-horizon skills.

Figure 5.1: Overview. We propose a method that jointly leverages (1) a large offline dataset of prior experience collected across many tasks without reward or task annotations and (2) a set of meta-training tasks to learn how to quickly solve unseen long-horizon tasks. Our method extracts reusable skills from the offline dataset and meta-learns a policy to quickly use them for solving new tasks.

In this work, we aim to take a step towards combining the capabilities of both learning how to quickly learn new tasks and leveraging prior experience in the form of unannotated offline data (see Figure 5.1). Specifically, we aim to devise a method that enables meta-learning on complex, long-horizon tasks and can solve unseen target tasks with orders of magnitude fewer environment interactions than prior works. We propose to leverage the offline experience by extracting reusable skills – short-term behaviors that can be composed to solve unseen long-horizon tasks. We employ a hierarchical meta-learning scheme in which we meta-train a high-level policy to learn how to quickly reuse the extracted skills. To efficiently explore the learned skill space during meta-training, the high-level policy is guided by a skill prior which is also acquired from the offline experience data.

We evaluate our method and prior approaches in deep RL, skill-based RL, meta-RL, and multi-task RL on two challenging continuous control environments: maze navigation and kitchen manipulation, which require long-horizon control and provide only sparse rewards. Experimental results show that our method can efficiently solve unseen tasks by exploiting meta-training tasks and offline datasets, while prior approaches require substantially more samples or fail to solve the tasks.
In summary, the main contributions of this paper are threefold: • To the best of our knowledge, this is the first work to combine meta-reinforcement learning algorithms with task-agnostic offline datasets that do not contain reward or task annotations. 147 • We propose a method that combines meta-learning with offline data by extracting learned skills and a skill prior as well as meta-learning a hierarchical skill policy regularized by the skill prior. • We empirically show that our method is significantly more efficient at learning long-horizon sparse- reward tasks compared to prior methods in deep RL, skill-based RL, meta-RL, and multi-task RL. 5.2 RelatedWork Meta-ReinforcementLearning. Meta-RL approaches [70, 319, 77, 345, 252, 102, 316, 204, 53, 54, 243, 315, 338, 354, 122, 355, 174] hold the promise of allowing learning agents to quickly adapt to novel tasks by learning to learn from a distribution of tasks. Despite the recent advances in the field, most existing meta-RL algorithms are limited to short-horizon and dense-reward tasks. In contrast, we aim to develop a method that can meta-learn to solve long-horizon tasks with sparse rewards by leveraging offline datasets. Offlinedatasets. Recently, many works have investigated the usage of offline datasets for agent training. In particular, the field of offline reinforcement learning [163, 275, 150, 346] aims to devise methods that can perform RL fully offline from pre-collected data, without the need for environment interactions. However, these methods require target task reward annotations on the offline data for every new tasks that should be learned. These reward annotations can be challenging to obtain, especially if the offline data is collected from a diverse set of prior tasks. In contrast, our method is able to leverage offline datasets without any reward annotations. OfflineMeta-RL. Another recent line of research aims to meta-learn from static, pre-collected datasets including reward annotations [196, 238, 68]. After meta-training with the offline datasets, these works aim to quickly adapt to a new task with only a small amount of data from that new task. In contrast to the aforementioned offline RL methods these works aim to adapt to unseen tasks and assume access to only limited data from the new tasks. However, in addition to reward annotations, these approaches often 148 require that the offline training data is split into separate datasets for each training tasks, further limiting the scalability. Skill-basedLearning. An alternative approach for leveraging offline data that does not require reward or task annotations is through the extraction of skills – reusable short-horizon behaviors. Methods for skill-based learning recombine these skills for learning unseen target tasks and converge substantially faster than methods that learn from scratch [160, 108, 269]. When trained from diverse datasets these approaches can extract a wide repertoire of skills and learn complex, long-horizon tasks [191, 179, 232, 4, 42, 233]. Yet, although they are more efficient than training from scratch, they still require a large number of environment interactions to learn a new task. Our method instead combines skills extracted from offline data with meta-learning, leading to significantly improved sample efficiency. 5.3 ProblemFormulationandPreliminaries Our approach builds on prior work for meta-learning and learning from offline datasets and aims to combine the best of both worlds. 
In the following, we formalize our problem setup and briefly summarize relevant prior work.

Problem Formulation. Following prior work on learning from large offline datasets [179, 232, 233], we assume access to a dataset of state-action trajectories D = {s_t, a_t, ...} which is collected either across a wide variety of tasks or as “play data” with no particular task in mind. We thus refer to this dataset as task-agnostic. With a large number of data collection tasks, the dataset covers a wide variety of behaviors and can be used to accelerate learning on diverse tasks. Such data can be collected at scale, e.g., through autonomous exploration [108, 269, 57], human teleoperation [259, 101, 183, 179], or from previously trained agents [87, 100]. We additionally assume access to a set of meta-training tasks T = {T_1, ..., T_N}, where each task is represented as a Markov decision process (MDP) defined by a tuple {S, A, P, r, ρ, γ} of states, actions, transition probability, reward, initial state distribution, and discount factor.

Our goal is to leverage both the offline dataset D and the meta-training tasks T to accelerate the training of a policy π(a|s) on a target task T*, which is also represented as an MDP. Crucially, we do not assume that T* is part of the set of training tasks T, nor that D contains demonstrations for solving T*. Thus, we aim to design an algorithm that can leverage offline data and meta-training tasks for learning how to quickly compose known skills for solving an unseen target task. Next, we describe existing approaches that either leverage offline data or meta-training tasks to accelerate target task learning. Then, we describe how our approach takes advantage of the best of both worlds.

Skill-based RL. One successful approach for leveraging task-agnostic datasets to accelerate the learning of unseen tasks is through the transfer of reusable skills, i.e., short-horizon behaviors that can be composed to solve long-horizon tasks. Prior work in skill-based RL called Skill-Prior RL (SPiRL, Pertsch, Lee, and Lim [232]) proposes an effective way to implement this idea. Specifically, SPiRL uses a task-agnostic dataset to learn two models: (1) a skill policy π(a|s, z) that decodes a latent skill representation z into a sequence of executable actions and (2) a prior over latent skill variables p(z|s) which can be leveraged to guide exploration in skill space. SPiRL uses these skills for learning new tasks efficiently by training a high-level skill policy π(z|s) that acts over the space of learned skills instead of primitive actions. The target task RL objective extends Soft Actor-Critic (SAC, Haarnoja et al. [104]), a popular off-policy RL algorithm, by guiding the high-level policy with the learned skill prior:

max_π Σ_t E_{(s_t, z_t) ∼ ρ_π} [ r(s_t, z_t) − α D_KL( π(z|s_t), p(z|s_t) ) ].    (5.1)

Here D_KL denotes the Kullback-Leibler divergence between the policy and the skill prior, and α is a weighting coefficient.
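A minimal sketch of this prior-regularized objective is given below, showing how the SAC entropy bonus is replaced by a KL term against the frozen skill prior for a single actor update; the critic and target-network updates of SAC are omitted, and the network sizes and names are illustrative assumptions.

import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

state_dim, skill_dim = 10, 5

class GaussianHead(nn.Module):
    """State-conditioned Gaussian over latent skills z (used for both pi(z|s) and the prior p(z|s))."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * skill_dim))
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5, 2).exp())

policy, prior = GaussianHead(), GaussianHead()                    # prior is pre-trained and frozen in practice
critic = nn.Sequential(nn.Linear(state_dim + skill_dim, 64), nn.ReLU(), nn.Linear(64, 1))
alpha = 0.1                                                       # prior-regularization weight (illustrative)

s = torch.randn(32, state_dim)                                    # batch of states from the replay buffer
pi = policy(s)
z = pi.rsample()                                                  # reparameterized skill sample
q_value = critic(torch.cat([s, z], dim=-1))
with torch.no_grad():
    p = prior(s)                                                  # frozen skill prior
# SAC-style actor loss with the entropy term replaced by KL(pi || p), as in Eq. (5.1)
actor_loss = (alpha * kl_divergence(pi, p).sum(-1) - q_value.squeeze(-1)).mean()
actor_loss.backward()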
Off-Policy Meta-RL. Rakelly et al. [243] introduced an off-policy meta-RL algorithm called probabilistic embeddings for actor-critic RL (PEARL) that leverages a set of training tasks T to enable quick learning of new tasks. Specifically, PEARL leverages the meta-training tasks for learning a task encoder q(e|c). This encoder takes in a small set of state-action-reward transitions c and produces a task embedding e. This embedding is used to condition the actor π(a|s, e) and critic Q(s, a, e). In PEARL, the actor, critic, and task encoder are trained by jointly maximizing the obtained reward and the policy's entropy H [104]:

max_π E_{T ∼ p_T, e ∼ q(·|c^T)} Σ_t E_{(s_t, a_t) ∼ ρ_π|e} [ r_T(s_t, a_t) + α H( π(a|s_t, e) ) ].    (5.2)

Additionally, the task embedding output of the task encoder is regularized towards a constant prior distribution p(e).

Figure 5.2: Method Overview. Our proposed skill-based meta-RL method has three phases. (1) Skill Extraction: learns reusable skills from snippets of task-agnostic offline data through a skill extractor (yellow) and low-level skill policy (blue), and also trains a prior distribution over skill embeddings (green). (2) Skill-based Meta-training: meta-trains a high-level skill policy (red) and task encoder (purple) while using the pre-trained low-level policy; the pre-trained skill prior is used to regularize the high-level policy during meta-training and guide exploration. (3) Target Task Learning: leverages the meta-trained hierarchical policy for quick learning of an unseen target task. After conditioning the policy by encoding a few transitions c* from the target task T*, we continue fine-tuning the high-level skill policy on the target task while regularizing it with the pre-trained skill prior.

5.4 Approach

We propose Skill-based Meta-Policy Learning (SiMPL), an algorithm for jointly leveraging offline data as well as a set of meta-training tasks to accelerate the learning of unseen target tasks. Our method has three phases: (1) skill extraction: we extract reusable skills and a skill prior from the offline data (Section 5.4.1), (2) skill-based meta-training: we utilize the meta-training tasks to learn how to leverage the extracted skills and skill prior to efficiently solve new tasks (Section 5.4.2), and (3) target task learning: we fine-tune the meta-trained policy to rapidly adapt to and solve an unseen target task (Section 5.4.3). An illustration of the proposed method is shown in Figure 5.2.

5.4.1 Skill Extraction

To acquire a set of reusable skills from the offline dataset D, we leverage the skill extraction approach proposed in Pertsch, Lee, and Lim [232]. Concretely, we jointly train (1) a skill encoder q(z|s_0:K, a_0:K−1) that embeds a K-step trajectory randomly cropped from the sequences in D into a low-dimensional skill embedding z, and (2) a low-level skill policy π(a_t|s_t, z) that is trained with behavioral cloning to reproduce the action sequence a_0:K−1 given the skill embedding. To learn a smooth skill representation, we regularize the output of the skill encoder towards a unit Gaussian prior distribution and weight this regularization by a coefficient β [113]:

max_{q, π} E_{z ∼ q} [ Σ_{t=0}^{K−1} log π(a_t|s_t, z) − β D_KL( q(z|s_0:K, a_0:K−1), N(0, I) ) ],    (5.3)

where the first term corresponds to behavioral cloning and the second to embedding regularization. Additionally, we follow Pertsch, Lee, and Lim [232] and learn a skill prior p(z|s) that captures the distribution of skills likely to be executed in a given state under the training data distribution. The prior is trained to match the output of the skill encoder: min_p D_KL( ⌊q(z|s_0:K, a_0:K−1)⌋, p(z|s_0) ). Here ⌊·⌋ indicates that gradient flow into the skill encoder is stopped when training the skill prior.
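The sketch below renders the skill extraction objective and the skill prior training as one joint update on a batch of K-step snippets: the encoder embeds the snippet, the low-level policy reconstructs the actions, the posterior is regularized towards a unit Gaussian with weight β, and the prior is trained to match the detached posterior. The dimensions, the value of β, and the simple MLP parameterizations are assumptions of this sketch rather than the released implementation.

import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

state_dim, act_dim, z_dim, K = 10, 4, 8, 10

class GaussianMLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 2 * out_dim))
    def forward(self, x):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5, 2).exp())

skill_encoder = GaussianMLP((K + 1) * state_dim + K * act_dim, z_dim)   # q(z | s_0:K, a_0:K-1)
skill_policy  = GaussianMLP(state_dim + z_dim, act_dim)                 # pi(a_t | s_t, z)
skill_prior   = GaussianMLP(state_dim, z_dim)                           # p(z | s_0)
beta = 5e-4                                                             # regularization weight (illustrative)

# one training batch of K-step trajectory snippets cropped from the offline dataset D
B = 32
states = torch.randn(B, K + 1, state_dim)
actions = torch.randn(B, K, act_dim)

q = skill_encoder(torch.cat([states.flatten(1), actions.flatten(1)], dim=-1))
z = q.rsample()
z_rep = z[:, None, :].expand(-1, K, -1)
# behavioral cloning term: log-likelihood of the K actions under pi(a_t | s_t, z)
log_pi = skill_policy(torch.cat([states[:, :K], z_rep], dim=-1)).log_prob(actions).sum(-1)
bc = log_pi.sum(-1).mean()
# regularize the skill posterior towards a unit Gaussian, as in Eq. (5.3)
kl_unit = kl_divergence(q, Normal(torch.zeros_like(q.mean), torch.ones_like(q.mean))).sum(-1).mean()
skill_loss = -(bc - beta * kl_unit)
# train the skill prior p(z | s_0) to match the detached posterior (stop-gradient through the encoder)
q_detached = Normal(q.mean.detach(), q.stddev.detach())
prior_loss = kl_divergence(q_detached, skill_prior(states[:, 0])).sum(-1).mean()
(skill_loss + prior_loss).backward()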
5.4.2 Skill-based Meta-Training

We aim to learn a policy that can quickly learn to leverage the extracted skills to solve new tasks. We use off-policy meta-RL (see Section 5.3) to learn such a policy from our set of meta-training tasks T. Similar to PEARL [243], we train a task encoder that takes in a set of sampled transitions and produces a task embedding e. Crucially, we leverage our learned skills by training a task-embedding-conditioned policy over skills instead of primitive actions, π(z|s, e), thus equipping the policy with a set of useful pre-trained behaviors and reducing the meta-training task to learning how to combine these behaviors instead of learning them from scratch. We find that this use of offline data through learned skills is crucial for enabling meta-training on complex, long-horizon tasks (see Section 5.5).

Prior work has shown that the efficiency of RL on learned skill spaces can be substantially improved by guiding the policy with a learned skill prior [232, 4]. Thus, instead of regularizing with a maximum entropy objective as done in prior work on off-policy meta-RL [243], we propose to regularize the meta-training policy with our pre-trained skill prior, leading to the following meta-training objective:

max_π E_{T ∼ p_T, e ∼ q(·|c^T)} Σ_t E_{(s_t, z_t) ∼ ρ_π|e} [ r_T(s_t, z_t) − α D_KL( π(z|s_t, e), p(z|s_t) ) ],    (5.4)

where α determines the strength of the prior regularization. We automatically tune α via dual gradient descent by choosing a target divergence δ between policy and prior [232].

To compute the task embedding e, we use conditioning sets c of multiple different sizes. We find that we can increase training stability by adjusting the strength of the prior regularization to the size of the conditioning set. Intuitively, when the high-level policy is conditioned on only a few transitions, i.e., when the set c is small, it has little information about the task at hand and should thus be regularized more strongly towards the task-agnostic skill prior. Conversely, when c is large, the policy likely has more information about the target task and should thus be allowed to deviate more from the skill prior to solve the task, i.e., have a weaker regularization strength. To implement this intuition, we employ a simple approach: we define two target divergences δ_1 and δ_2 with δ_1 < δ_2 and associated auto-tuned coefficients α_1 and α_2. We regularize the policy using the larger coefficient α_1 when the conditioning transition set is small, and otherwise using the smaller coefficient α_2. We found this technique simple yet sufficient in our experiments and leave the investigation of more sophisticated regularization approaches for future work.
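This two-coefficient regularization can be implemented with the standard dual-gradient-descent update for α, selecting the (δ, α) pair by the size of the conditioning set. The target values, the size threshold, and the learning rate below are placeholders, not the values used in our experiments.

import torch

# Two target divergences: stronger regularization (smaller delta, thus typically larger alpha)
# when the conditioning set c is small, weaker when it is large (values illustrative).
log_alpha = {"small_c": torch.zeros(1, requires_grad=True),
             "large_c": torch.zeros(1, requires_grad=True)}
delta = {"small_c": 1.0, "large_c": 5.0}
alpha_opt = torch.optim.Adam(log_alpha.values(), lr=3e-4)

def alpha_for(num_transitions, threshold=4):
    """Pick the regularization coefficient based on the conditioning-set size."""
    return "small_c" if num_transitions <= threshold else "large_c"

def update_alpha(kl_value, key):
    """Dual gradient descent: alpha grows when the policy/prior KL exceeds its target delta, shrinks otherwise."""
    alpha = log_alpha[key].exp()
    loss = alpha * (delta[key] - kl_value.detach())
    alpha_opt.zero_grad(); loss.backward(); alpha_opt.step()
    return log_alpha[key].exp().detach()

# e.g. after computing the policy/prior KL for a batch conditioned on |c| = 2 transitions:
kl_value = torch.tensor(2.3)
key = alpha_for(num_transitions=2)
alpha = update_alpha(kl_value, key)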
5.4.3 Target Task Learning

When a target task is given, we aim to leverage the meta-trained policy to quickly learn how to solve it. Intuitively, the policy should first explore different skill options to learn about the task at hand and then rapidly narrow its output distribution to those skills that solve the task. We implement this intuition by first collecting a small set of conditioning transitions c* from the target task by exploring with the meta-trained policy. Since we have no information about the target task at this stage, we explore the environment by conditioning our pre-trained policy on task embeddings sampled from the task prior p(e). Then, we encode this set of transitions into a target task embedding e* ∼ q(e|c*). By conditioning our meta-trained high-level policy on this encoding, we can rapidly narrow its skill distribution to skills that solve the given target task: π(z|s, e*). Empirically, we find that this policy is often already able to achieve high success rates on the target task. Note that only very few environment interactions for collecting c* are required to learn a complex, long-horizon, unseen target task with sparse rewards. This is substantially more efficient than prior approaches such as SPiRL, which require orders of magnitude more target task interactions to achieve comparable performance.

To further improve the performance on the target task, we fine-tune the conditioned policy with target task rewards while guiding its exploration with the pre-trained skill prior:

max_π E_{e* ∼ q(·|c*)} Σ_t E_{(s_t, z_t) ∼ ρ_π|e*} [ r_T*(s_t, z_t) − α D_KL( π(z|s_t, e*), p(z|s_t) ) ].    (5.5)

Other regularization distributions are possible during fine-tuning, e.g., the high-level policy conditioned on task prior samples, π(z|s, e ∼ p(e)), or the target-task-embedding-conditioned policy π(z|s, e*) before fine-tuning. Yet, we found regularization with the pre-trained task-agnostic skill prior to work best in our experiments.

In practice, we propose several techniques for stabilizing meta-training and fine-tuning: (1) adaptively regularizing the policy based on the size of the conditioning trajectory set, as described in Section 5.4.2, (2) parameterizing the policy as a residual policy that outputs differences to the pre-trained skill prior, instead of directly fine-tuning the skill prior as in Pertsch, Lee, and Lim [232], and (3) initializing the Q-function and the α parameter during fine-tuning with meta-trained parameters instead of randomly initialized networks. We discuss these techniques in detail in Section 5.7.4.

5.5 Experiments

Our experiments aim to answer the following questions: (1) Can our proposed method learn to efficiently solve long-horizon, sparse-reward tasks? (2) Is it crucial to utilize offline datasets to achieve this? (3) How can we best leverage the training tasks for efficient learning of target tasks? (4) How does the training task distribution affect target task learning?

5.5.1 Experimental Setup

We evaluate our approach in two challenging continuous control environments: maze navigation and kitchen manipulation, as illustrated in Figure 5.3. While meta-RL algorithms are typically evaluated on tasks that span only a few dozen time steps and provide dense rewards [77, 252, 243, 355], our tasks require learning long-horizon behaviors over hundreds of time steps from sparse reward feedback and thus pose a new challenge to meta-learning algorithms.

Figure 5.3: Environments. We evaluate our proposed framework in two domains that require learning complex, long-horizon behaviors from sparse rewards. These environments are substantially more complex than those typically used to evaluate meta-RL algorithms. (a) Maze Navigation: the agent needs to navigate for hundreds of steps to reach unseen target goals and only receives a binary reward upon task success. (b) Kitchen Manipulation: the 7-DoF agent needs to execute an unseen sequence of four subtasks, spanning hundreds of time steps, and only receives a sparse reward upon completion of each subtask.

5.5.1.1 Maze Navigation

Environment. This 2D maze navigation domain, based on the maze navigation problem in Fu et al. [87], requires long-horizon control with hundreds of steps for a successful episode and only provides sparse reward feedback upon reaching the goal.
The observation space of the agent consists of its 2D position and velocity and it acts via planar, continuous velocity commands. OfflineDataset&Meta-training/TargetTasks. Following Fu et al. [87] we collect a task-agnostic offline dataset by randomly sampling start-goal locations in the maze and using a planner to generate a trajectory that reaches from start to goal. Note that the trajectories are not annotated with any reward or task labels (i.e. which start-goal location is used for producing each trajectory). To generate a set of meta-training and target tasks, we fix the agent’s initial position in the center of the maze and sample 40 random goal locations for meta-training and another set of 10 goals for target tasks. All meta-training and target tasks use the same sparse reward formulation. More details can be found in Section 5.7.6.1. 156 5.5.1.2 KitchenManipulation Environment. The FrankaKitchen environment of Gupta et al. [101] requires the agent to control a 7-DoF robot arm via continuous joint velocity commands and complete a sequence of manipulation tasks like opening the microwave or turning on the stove. Successful episodes span 300-500 steps and the agent is only provided a sparse reward signal upon successful completion of a subtask. Offline Dataset & Meta-training / Target Tasks. We leverage a dataset of 600 human-teleoperated manipulation sequences of Gupta et al. [101] for offline pre-training. In each trajectory, the robot executes a sequence of four subtasks. We then define a set of 23 meta-training tasks and 10 target tasks that in turn require the consecutive execution of four subtasks (see Figure 5.3 for examples). Note that each task consists of a unique combination of subtasks. More details can be found in Section 5.7.6.2. 5.5.2 Baselines We compare SiMPL to prior approaches in RL, skill-based RL, meta-RL, and multi-task RL. • SAC [104] is a state of the art deep RL algorithm. It learns to solve a target task from scratch without leveraging the offline dataset nor the meta-training tasks. • SPiRL [232] is a method designed to leverage offline data through the transfer of learned skills. It acquires skills and a skill prior from the offline dataset but does not utilize the meta-training tasks. This investigates the benefits our method can obtain from leveraging the meta-training tasks. • PEARL [243] is a state of the art off-policy meta-RL algorithm that learns a policy which can quickly adapt to unseen test tasks. It learns from the meta-training tasks but does not use the offline dataset. This examines the benefits of using learned skills in meta-RL. • PEARL-ft demonstrates the performance of a PEARL [243] model further fine-tuned on a target task using SAC [104]. 157 • Multi-taskRL(MTRL) is a multi-task RL baseline which learns from the meta-training tasks by distilling individual policies specialized in each task into a shared policy, similar to Distral [295]. Each individual policy is trained using SPiRL by leveraging skills extracted from the offline dataset. Therefore, it utilizes both the meta-training tasks and offline dataset similar to our method. This provides a direct comparison of multi-task learning (MTRL) from the training tasks vs. meta-learning using them (ours). More implementation details on the baselines can be found in Section 5.7.5. 5.5.3 Results We present the quantitative results in Figure 5.4 and the qualitative results on the maze navigation domain in Figure 5.5. 
In Figure 5.4, SiMPL demonstrates much better sample efficiency for learning the unseen target tasks compared to all the baselines. Without leveraging the offline dataset and the meta-training tasks, SAC is not able to make learning progress on most of the target tasks. While PEARL is first trained on the meta-training tasks, it still achieves poor performance on the target tasks, and fine-tuning it (PEARL-ft) does not yield significant improvement. We believe this is because both environments provide only sparse rewards yet require the model to exhibit long-horizon, complex behaviors, which is known to be difficult for meta-RL methods [196]. On the other hand, by first extracting skills and acquiring a skill prior from the offline dataset, SPiRL's performance consistently improves with more samples from the target tasks. Yet, it requires significantly more environment interactions than our method to solve the target tasks, since its policy is optimized with vanilla RL, which is not designed to quickly learn new tasks. While the multi-task RL (MTRL) baseline first learns a multi-task policy from the meta-training tasks, its sample efficiency on target task learning is similar to SPiRL, which highlights the strength of our proposed method – meta-learning from the meta-training tasks for fast target task learning.

Figure 5.4: Target Task Learning Efficiency. SiMPL demonstrates better sample efficiency compared to all the baselines, verifying the efficacy of meta-learning on long-horizon tasks by leveraging skills and a skill prior extracted from an offline dataset. For both environments, we train each model on each target task with 3 different random seeds. SiMPL and PEARL-ft first collect 20 episodes of environment interactions (vertical dotted line) for conditioning the meta-trained policy before fine-tuning it on the target tasks.

Compared to the baselines, our method learns the target tasks much quicker. Within only a few episodes, the policy converges to solve more than 80% of the target tasks in the maze environment and two out of four subtasks in the kitchen manipulation environment. The prior-regularized fine-tuning then continues to improve performance. The rapidly increasing performance and the overall faster convergence show the benefits of leveraging meta-training tasks in addition to learning from offline data: by first learning how to quickly solve tasks using the extracted skills and the skill prior, our policy can efficiently solve the target tasks.

The qualitative results presented in Figure 5.5 show that all the methods that leverage the offline dataset (i.e., SiMPL, SPiRL, and MTRL) effectively explore the maze in the first episode. Then, SiMPL converges in far fewer episodes compared to SPiRL and MTRL, underlining the effectiveness of meta-training. In contrast, PEARL-ft is not able to make learning progress, justifying the necessity of employing offline datasets for acquiring long-horizon, complex behaviors.

Figure 5.5: Qualitative Results. All the methods that leverage the offline dataset (i.e., SiMPL, SPiRL, and MTRL) effectively explore the maze in the first episode. Then, SiMPL converges in far fewer episodes compared to SPiRL and MTRL.
5.5.4 Meta-Training Task Distribution Analysis

In this section, we aim to investigate the effect of the meta-training task distribution on our skill-based meta-training and target task learning phases. Specifically, we examine the effect of (1) the number of tasks in the meta-training task distribution and (2) the alignment between the meta-training task distribution and the target task distribution. We conduct experiments and analyses in the maze navigation domain. More details on the task distributions can be found in Section 5.7.6.1.

Number of meta-training tasks. To investigate how the number of meta-training tasks affects the performance of our method, we train our method with fewer meta-training tasks (i.e. 10 and 20) and evaluate it on the same set of target tasks. The quantitative results presented in Figure 5.6(a) suggest that even with sparser meta-training task distributions (i.e. fewer meta-training tasks), SiMPL is still more sample efficient than the best-performing baseline (i.e. SPiRL).

Meta-train / test task alignment. We aim to examine whether a model trained on a meta-training task distribution that aligns better/worse with the target tasks would yield improved/deteriorated performance. To this end, we create biased meta-training / test task distributions: we create a meta-train set by sampling goal locations from only the top 25% portion of the maze (T_Train-Top). To rule out the effect of the density of the task distribution, we sample 10 (i.e. 40 × 25%) meta-training tasks. Then, we create two target task distributions that have good and bad alignment with this meta-training distribution, respectively, by sampling 10 target tasks from the top 25% portion of the maze (T_Target-Top) and 10 target tasks from the bottom 25% portion of the maze (T_Target-Bottom).

Figure 5.6: Meta-training Task Distribution Analysis. (a) With sparser meta-training task distributions (i.e. fewer meta-training tasks), SiMPL still achieves better sample efficiency compared to SPiRL, highlighting the benefit of leveraging meta-training tasks. (b) When trained on a meta-training task distribution that aligns better with the target task distribution, SiMPL achieves improved performance. (c) When trained on a meta-training task distribution that is misaligned with the target tasks, SiMPL yields worse performance. For all the analyses, we train each model on each target task with 3 different random seeds.

Figure 5.6(b) and Figure 5.6(c) present the target task learning efficiency for models trained with good task alignment (meta-train on T_Train-Top, learn target tasks from T_Target-Top) and bad task alignment (meta-train on T_Train-Top, learn target tasks from T_Target-Bottom), respectively. The results demonstrate that SiMPL can achieve improved performance when trained on a better-aligned meta-training task distribution. On the other hand, not surprisingly, SiMPL and MTRL perform slightly worse compared to SPiRL when trained with misaligned meta-training tasks (see Figure 5.6(c)). This is expected given that SPiRL does not learn from the misaligned meta-training tasks. In summary, from Figure 5.6, we can conclude that meta-learning from either a diverse task distribution or a better-informed task distribution can yield improved performance for our method.
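The biased task distributions used in this analysis can be expressed as a simple spatial filter over candidate goal cells; the quantile-based region split below is an illustrative assumption rather than the exact construction.

```python
import numpy as np

def sample_region_goals(free_cells, region, num_goals=10, seed=0):
    """Sample goals whose y-coordinate falls in a vertical slice of the maze (e.g. the top 25%)."""
    ys = np.array([cell[1] for cell in free_cells])
    lo, hi = np.quantile(ys, region)                # e.g. region=(0.75, 1.0) for the top 25%
    candidates = [c for c in free_cells if lo <= c[1] <= hi]
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(candidates), size=num_goals, replace=False)
    return [candidates[i] for i in idx]

# Sketch of the three distributions used in Figure 5.6 (free_cells is an assumed list of cells):
# T_Train-Top     = sample_region_goals(free_cells, (0.75, 1.0), seed=0)
# T_Target-Top    = sample_region_goals(free_cells, (0.75, 1.0), seed=1)
# T_Target-Bottom = sample_region_goals(free_cells, (0.0, 0.25), seed=2)
```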
5.6 Conclusion

We propose a skill-based meta-RL method, dubbed SiMPL, that can meta-learn on long-horizon tasks by leveraging prior experience in the form of large offline datasets without additional reward or task annotations. Specifically, our method first learns to extract reusable skills and a skill prior from the offline data. Then, we meta-train a high-level policy that leverages these skills for efficient learning of unseen target tasks. To effectively utilize the learned skills, the high-level policy is regularized by the acquired prior. The experimental results on challenging long-horizon, continuous control navigation and manipulation tasks with sparse rewards demonstrate that our method outperforms prior approaches in deep RL, skill-based RL, meta-RL, and multi-task RL. In the future, we aim to demonstrate the scalability of our method to high-DoF continuous control problems on real robotic systems to show the benefits of our improved sample efficiency.

5.7 Appendix

5.7.1 Meta-Reinforcement Learning Method Ablation

In this section, we compare the learning efficiency of different meta-RL algorithms with respect to the length of the training tasks. Specifically, we hypothesize that our approach SiMPL, which extracts temporally extended skills from offline experience, is better suited for learning long-horizon tasks than prior meta-RL algorithms. To cleanly investigate the importance of the temporally extended skills vs. the importance of using prior experience, we include two additional comparisons to methods that leverage prior experience for meta-RL but via flat behavioral cloning instead of through temporally extended skills:
• BC+PEARL first learns a behavior cloning (BC) policy through supervised learning from the offline dataset. Then, analogous to our approach SiMPL, during the meta-training phase, a task encoder and a meta-learned policy are meta-trained with the BC-policy-constrained SAC objective. For a fair comparison, we use the same residual policy parameterization as described in Section 5.7.4.1.
• BC+MAML follows the same learning procedure described above, but uses MAML [77] for meta-training instead of PEARL. We follow the original learning objective in Finn, Abbeel, and Levine [77] (i.e. using REINFORCE [325] for task adaptation and TRPO [264] for meta-policy optimization).

We compare these methods as well as the standard meta-RL approach PEARL [243] on three meta-training task distributions of increasing complexity in the maze navigation environment (see Figure 5.7): (1) short-range goals with small variance (T_Train-Easy), (2) short-range goals with larger variance (T_Train-Medium), and (3) long-range goals with large variance (T_Train-Hard), which we used in our original maze experiments. By increasing the variance and length of the tasks in each task distribution, we can investigate the learning capability of the meta-RL algorithms.

We present the quantitative results in Figure 5.8 and the corresponding qualitative analysis in Figure 5.9. On the simplest task distribution, we find that all approaches can learn to solve the tasks efficiently, except for BC+MAML. While the latter also learns to solve the task eventually (see performance upon convergence as the dashed orange line in Figure 5.8(a)), it uses on-policy meta-RL and thus requires substantially more environment interactions during meta-training. We thus only consider the more sample-efficient BC+PEARL off-policy meta-RL method in the remaining comparisons.
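For reference, the flat behavioral cloning pre-training used by the BC+PEARL and BC+MAML comparisons amounts to a supervised regression from states to actions on the offline dataset. The sketch below assumes a simple MSE objective and an iterable of (state, action) tensor batches; it is an illustration, not the exact setup.

```python
import torch
import torch.nn as nn

def pretrain_bc_policy(offline_batches, state_dim, action_dim, epochs=50, lr=3e-4):
    """Fit a flat BC policy on offline (state, action) pairs with an MSE loss."""
    policy = nn.Sequential(
        nn.Linear(state_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, action_dim),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for states, actions in offline_batches:
            loss = ((policy(states) - actions) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```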
On the more complex task distributionsT Train-Medium andT Train-Hard , we find that using prior data for meta-learning is generally beneficial: both BC+PEARL and SiMPL learn more efficiently on the task distribution of medium difficulty T Train-Medium , as shown in Figure 5.8(b), since the policy pre-trained from offline data allows for more efficient exploration during meta-training. Importantly, on the hardest task distributionT Train-Hard , as shown in Figure 5.8(c), which consists exclusively of long-horizon tasks, we find that only SiMPL is able to effectively learn, highlighting the importance of leveraging the offline data via 163 (a) TTrain-Easy (b) TTrain-Medium (c) TTrain-Hard Figure 5.7: TaskDistributionsforTaskLengthAblation. We propose three meta-training task distribu- tions of increasing difficulty to compare different meta-RL algorithms: T Train-Easy uses short-horizon tasks with adjacent goal locations, making exploration easier during meta-training,T Train-Medium uses similar task horizon but increases the goal position variance,T Train-Hard contains long-horizon tasks with high variance in goal position and thus is the hardest of the tested task distributions. temporally extended skills instead of flat behavioral cloning. This supports our intuition that the abstraction provided by skills is particularly beneficial for meta-learning on long-horizon tasks. 5.7.2 LearningEfficiencyonTargetTaskswithFewEpisodesofExperience In this section, we examine the data efficiency of the compared methods on the target tasks, specifically when provided with only a few (<20) episodes of online interaction with an unseen target task. Being able to learn new tasks this quickly is a major strength of meta-RL approaches [77, 243]. We hypothesize that our skill-based meta-RL algorithm SiMPL can learn similarly fast, even on long-horizon, sparse-reward tasks. In our original evaluations in Section 5.5, we used 20 episodes of initial exploration to condition our meta-trained policy. In Figure 5.10, we instead compare performance of different approaches when only provided with very few episodes of online interactions. We find that SiMPL learns to solve the unseen tasks substantially faster than all alternative approaches. On the kitchen manipulation tasks our approach learns to almost solve two out of four subtasks within a time span equivalent to only a few minutes of real-robot execution time. In contrast, prior meta-RL methods struggle at making progress at all on such long-horizon tasks, showing the benefit of combining meta-RL with prior offline experience. 164 (a) TTrain-Easy (b) TTrain-Medium (c) TTrain-Hard Figure 5.8: Meta-TrainingPerformanceforTaskLengthAblation. We find that most meta-learning approaches can solve the simplest task distribution, but using prior experience in BC+PEARL and SiMPL helps for the more challenging distributions (b) and (c). We find that only our approach, which uses the prior data by extracting temporally extended skills, is able to learn the challenging long-horizon tasks efficiently. 5.7.3 InvestigatingOfflineDatavs. TargetDomainShift To provide more insights on comparing SiMPL and SPiRL [232], we evaluate SiMPL in the maze navigation task setup proposed in Pertsch, Lee, and Lim [232]. This tests whether our approach can scale to image-based observations: Pertsch, Lee, and Lim [232] use32× 32px observations centered around the agent. 
Even more importantly, it allows us to investigate the robustness of the approach to the domain shifts between the offline pre-training data and the target task: we use the maze navigation offline dataset from Pertsch, Lee, and Lim [232] which was collected on randomly sampled 20× 20 maze layouts and test on tasks in the unseen, randomly sampled40× 40 test maze layout from Pertsch, Lee, and Lim [232]. We visualize the meta-training task distribution in Figure 5.11(a) and the target task distribution in Figure 5.11(b). We compare the performance of our method to the best-performing baseline, SPiRL [232], in Figure 5.11(c). Similar to the result presented in Figure 5.4, SiMPL can learn the target task faster by combining skills learned from the offline dataset with efficient meta-training. This shows that our approach can scale to image-based inputs and is robust to substantial domain shifts between the offline pre-training data and the target tasks. 165 (a) PEARL onTTrain-Easy (b) BC+PEARL onTTrain-Easy (c) SiMPL onTTrain-Easy (d) PEARL onTTrain-Medium (e) BC+PEARL onTTrain-Medium (f) SiMPL onTTrain-Medium (g) PEARL onTTrain-Hard (h) BC+PEARL onTTrain-Hard (i) SiMPL onTTrain-Hard Figure 5.9: QualitativeResultofMeta-RLMethodAblation. Top. All the methods can learn to solve short-horizon tasksT Train-Easy . Middle. On medium-horizon tasksT Train-Medium , PEARL struggles at exploring further, while BC+PEARL exhibits more consistent exploration yet still fails to solve some of the tasks. SiMPL can explore well and solve all the tasks. Bottom. On long-horizon tasksT Train-Hard , PEARL falls into a local minimum, focusing only on one single task on the left. BC+PEARL explores slightly better and can solve a few more tasks. SiMPL can effectively learn all the tasks. 166 SiMPL (Ours) SPiRL MTRL PEARL SAC Figure 5.10: Performance with Few Episodes of Target Task Interaction. We find that our skill- based meta-RL approach SiMPL is able to learn complex, long-horizon tasks within few episodes of online interaction with a new task while prior meta-RL approaches and non-meta-learning baselines require many more interactions or fail to learn the task altogether. (a) TTrain-Image-based (b) TTarget-Image-based (c) Target Task Learning Efficiency Figure 5.11: Image-BasedMazeNavigationwithDistributionShift. (a-b): Meta-training and target task distributions. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represent the initial location of the agent, which is equivalent to the one used in Pertsch, Lee, and Lim [232]. (c): Performance on the target task. Our approach SiMPL can leverage skills learned from offline data for efficient meta-RL on the maze navigation task and is robust to the domain shift between offline data environments and the target environment. 167 5.7.4 ImplementationDetailsonOurMethod In this section, we describe the additional implementation details on our proposed method. The details on model architecture is presented in Section 5.7.4.1, followed by the training detailed described in Section 5.7.4.2. 5.7.4.1 ModelArchitecture We describe the details on our model architecture in this section. SkillPrior We followed architecture and learning procedure of Pertsch, Lee, and Lim [232] for learning a low-level skill policy and a skill prior. Please refer to Pertsch, Lee, and Lim [232] for more details on the architectures for learning skills and skill priors from offline datasets. 
Task Encoder. Following Rakelly et al. [243], our task encoder is a permutation-invariant neural network. Specifically, we adopt a Set Transformer [157] that consists of the layers [2× ISAB_32 → PMA_1 → 3× MLP] for expressive and efficient set encoding. All the hidden layers are 128-dimensional and all attention layers have 4 attention heads. The encoder takes a set of high-level transitions as input, where each transition is a vector concatenation of the high-level transition tuple. The output of the encoder is (µ_e, σ_e), the parameters of the Gaussian task posterior p(e|c) = N(e; µ_e, σ_e). We vary the task vector dimension dim(e) depending on the complexity of the task distribution: dim(e) = 10 for Kitchen Manipulation, dim(e) = 6 for Maze Navigation with 40 meta-training tasks, and dim(e) = 5 otherwise.

Policy. We parameterize our policy with a neural network. We employ 4-layer MLPs with 256 hidden units for Maze Navigation and 6-layer MLPs with 128 hidden units for Kitchen Manipulation. Instead of directly parameterizing the policy, the network output is added to the skill prior to make learning more stable. Specifically, the policy network takes the concatenation of (s, e) as input and outputs residual parameters (µ_z, log σ_z) with respect to the skill prior distribution p(z|s) = N(z; µ_p, σ_p). The distribution resulting from this residual parameterization is π(z|s) = N(z; µ_p + µ_z, exp(log σ_p + log σ_z)).

Critic. The critic network takes the concatenation of s, e, and the skill z as input and outputs an estimate of the task-conditioned Q-value Q(s, z, e). We employ double Q networks [305] to mitigate Q-value overestimation. The critic architecture follows that of the policy.
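To make the residual parameterization concrete, here is a minimal PyTorch-style sketch of the policy head; the layer sizes match the Maze Navigation setting described above, while the module interface and the distribution wrapper are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class ResidualSkillPolicy(nn.Module):
    """Gaussian high-level policy parameterized as a residual on the skill prior."""

    def __init__(self, state_dim, task_dim, skill_dim, hidden=256, layers=4):
        super().__init__()
        dims = [state_dim + task_dim] + [hidden] * layers
        body = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            body += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.body = nn.Sequential(*body)
        self.head = nn.Linear(hidden, 2 * skill_dim)  # residual (mu_z, log_sigma_z)

    def forward(self, state, task_emb, prior_mu, prior_log_sigma):
        h = self.body(torch.cat([state, task_emb], dim=-1))
        mu_z, log_sigma_z = self.head(h).chunk(2, dim=-1)
        # Residual parameterization: pi(z|s) = N(mu_p + mu_z, exp(log_sigma_p + log_sigma_z))
        mu = prior_mu + mu_z
        sigma = torch.exp(prior_log_sigma + log_sigma_z)
        return torch.distributions.Normal(mu, sigma)
```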
5.7.4.2 Training Details

For all the network updates, we use the Adam optimizer [140] with a learning rate of 3e-4, β_1 = 0.9, and β_2 = 0.999. We describe the training details of the skill-based meta-training phase and the target task learning phase below.

Skill-based Meta-training. Our meta-training procedure is similar to the procedure adopted in [243]. The encoder and critic networks are updated to minimize the MSE between the Q-value prediction and the target Q-value. The policy network is updated to optimize Equation 5.4 without updating the encoder network. All networks are updated with the average gradients of 20 randomly sampled tasks. Each batch of gradients is computed from 1024 and 256 transitions for the Maze Navigation and Kitchen Manipulation experiments, respectively. We train our models for 10000, 18000, and 16000 episodes for the Maze Navigation experiments with 10, 20, and 40 meta-training tasks, respectively, and 3450 episodes for Kitchen Manipulation. As stated in Section 5.4.2, we apply different regularization coefficients depending on the size of the conditioning transitions. In the Maze Navigation experiment, we set the target KL divergence to 0.1 for batches conditioned on 4 transitions and to 0.4 for batches conditioned on 8192 transitions. In the Kitchen Manipulation experiment, we set the target KL divergence to 0.4 for batches conditioned on 1024 transitions, while the KL coefficient for batches conditioned on 2 transitions is fixed to 0.3.

Target Task Learning. We initialize the Q function and the auto-tuned value α with the values learned in the skill-based meta-training phase. The policy is initialized after observing and encoding 20 episodes obtained from task-unconditioned policy rollouts. For the target task learning phase, the target KL δ is 1 for Maze Navigation and 2 for Kitchen Manipulation. To compute a gradient step, 256 high-level transitions are sampled from a replay buffer of size 20000. Note that we use the same setup for the baselines that use SPiRL fine-tuning (SPiRL and MTRL).

5.7.5 Implementation Details on Baselines

In this section, we describe additional implementation details on producing the results of the baselines.

5.7.5.1 SAC

The SAC [104] baseline learns to solve a target task from scratch without leveraging the offline dataset or the meta-training tasks. We initialize α to 0.1 and set the target entropy to H = −dim(A). To compute a gradient step, 4096 and 1024 environment transitions are sampled from a replay buffer for the Maze Navigation and Kitchen Manipulation experiments, respectively.

5.7.5.2 PEARL and PEARL-ft

PEARL [243] learns from the meta-training tasks but does not use the offline dataset. Therefore, we directly train PEARL models on the meta-training tasks without the phase of learning from offline datasets. We use gradients averaged from 20 randomly sampled tasks, where each task gradient is computed from a batch sampled from a per-task buffer. The target entropy is set to H = −dim(A) and α is initialized to 0.1. While the method proposed in Rakelly et al. [243] does not fine-tune on target/meta-testing tasks, we extend PEARL to be fine-tuned on target tasks for a fair comparison; we call this variant PEARL-ft. Since PEARL does not use learned skills or a skill prior, the target task learning of PEARL is simply running SAC with a task-encoded initialization. Similar to the target task learning of our method, we initialize the Q function and the entropy coefficient α to the values learned during the meta-training phase. Also, we initialize the policy to the task-conditioned policy after observing 20 episodes of experience from task-unconditioned policy rollouts. The hyperparameters used for fine-tuning are the same as for SAC.

5.7.5.3 SPiRL

Similar to our method, for SPiRL we initialize the high-level policy to the skill prior and fix the low-level policy for target task learning. α is initialized to 0.01, and we use the same hyperparameters for the SPiRL models as for our method.

5.7.5.4 Multi-task RL (MTRL)

Inspired by Distral [295], our multi-task RL baseline first learns a set of individual policies, where each is specialized in one task; then, a shared multi-task policy is learned by distilling the individual policies. Since it is inefficient to learn an individual policy from scratch, we learn each individual policy using SPiRL with the learned skills and skill prior. Then, we distill the individual policies using the following objective:

$\max_{\pi_0} \; \mathbb{E}_{\mathcal{T} \sim p_{\mathcal{T}}} \left[ \sum_t \mathbb{E}_{(s_t, z_t) \sim \rho_{\pi_0}} \left[ r_{\mathcal{T}}(s_t, z_t) - \alpha D_{\mathrm{KL}}\left( \pi_0(z|s_t, e), \, p(z|s_t) \right) \right] \right]. \quad (5.6)$

We use the same setup for α as our method, where α is auto-tuned to satisfy a target KL, δ = 0.1 for Maze Navigation and δ = 0.2 for Kitchen Manipulation. The target task learning phase for MTRL is similar to ours, except that MTRL is not initialized with a meta-trained Q function and a learned α.

Figure 5.12: Maze Meta-training and Target Task Distributions. (a) Meta-training 40 tasks. (b) Meta-training 20 tasks. (c) Meta-training 10 tasks. (d) Target tasks. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represents the initial location of the agent.

Figure 5.13: Maze Meta-training and Target Task Distributions for the Meta-training Task Distribution Analysis. (a) T_Train-Top. (b) T_Target-Top. (c) T_Target-Bottom.
The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represent the initial location of the agent. 5.7.6 Meta-TrainingTasksandTargetTasks. In this section, we present the meta-training tasks and target tasks used in the maze navigation domain and the kitchen manipulation domain. 5.7.6.1 MazeNavigation The meta-training tasks and target tasks are visualized in Figure 5.12 and Figure 5.13. 5.7.6.2 KitchenManipulation The meta-training tasks are: 172 • microwave→kettle→bottom burner→slide cabinet • microwave→bottom burner→top burner→slide cabinet • microwave→top burner→light switch→hinge cabinet • kettle→bottom burner→light switch→hinge cabinet • microwave→bottom burner→hinge cabinet→top burner • kettle→top burner→light switch→slide cabinet • microwave→kettle→slide cabinet→bottom burner • kettle→light switch→slide cabinet→bottom burner • microwave→kettle→bottom burner→top burner • microwave→kettle→slide cabinet→hinge cabinet • microwave→bottom burner→slide cabinet→top burner • kettle→bottom burner→light switch→top burner • microwave→kettle→top burner→light switch • microwave→kettle→light switch→hinge cabinet • microwave→bottom burner→light switch→slide cabinet • kettle→bottom burner→top burner→light switch • microwave→light switch→slide cabinet→hinge cabinet • microwave→bottom burner→top burner→hinge cabinet • kettle→bottom burner→slide cabinet→hinge cabinet 173 • bottom burner→top burner→slide cabinet→light switch • microwave→kettle→light switch→slide cabinet • kettle→bottom burner→top burner→hinge cabinet • bottom burner→top burner→light switch→slide cabinet The target tasks are: • microwave→bottom burner→light switch→top burner • microwave→bottom burner→top burner→light switch • kettle→bottom burner→light switch→slide cabinet • microwave→kettle→top burner→hinge cabinet • kettle→bottom burner→slide cabinet→top burner • kettle→light switch→slide cabinet→hinge cabinet • kettle→bottom burner→top burner→slide cabinet • microwave→bottom burner→slide cabinet→hinge cabinet • bottom burner→top burner→slide cabinet→hinge cabinet • microwave→kettle→bottom burner→hinge cabinet 174 Chapter6 LearningfromObservation 6.1 Introduction Humans are effective at learning a task from demonstrations and applying the learned behaviors to other situations. We achieve this by extracting the underlying structure of the task when observing others fulfilling the task, instead of simply memorizing the demonstrator’s low-level actions [26, 116]. This high-level task structure generalizes to new situations and thus helps us to quickly learn the task in new situations. One intuitive and readily available instance of such high-level task structure is task progress, measuring how much of the task the agent completed. Inspired by this insight, we propose a novel imitation learning method that utilizes task progress for better generalization to unseen states and goals. Typical learning from demonstration (LfD) approaches [236, 80] greedily imitate the expert policy and thus suffer from accumulated errors causing a drift away from states seen in the demonstrations [251]. To make the imitation policy more robust to states not in demonstrations, adversarial imitation learning methods [114, 88] encourage the agent to stay near the expert trajectories using a learned reward that distinguishes expert and agent behaviors. 
However, such learned reward functions often overfit to the expert demonstrations by learning spurious correlations between task-irrelevant features and expert/agent labels [356], and thus suffer from generalization to slightly different initial and goal configurations from the ones seen in the demonstrations (e.g. holdout goal regions or larger perturbation in goal sampling).

Figure 6.1: In goal-directed tasks, states on an expert trajectory have increasing proximity toward the goal as the expert makes progress towards fulfilling a task.
Inspired by this intuition, we propose to learn a proximity functionf ϕ from expert demonstrations and agent experience, which predicts goal proximity (i.e. an estimate of temporal distance to the goal). Then, using this proximity function, we train a policyπ θ to progressively move to states with higher predicted goal proximity (italicized numbers) and eventually reach the goal. We alternate these two learning phases to improve both the proximity function and policy, leading to not only better generalization but also superior performance. To learn a more generalizable and informative reward from demonstrations, we propose an imitation learning from observation (LfO) method, which learns a task progress estimator and uses the task progress estimate as a dense reward for training a policy as illustrated in Figure 6.1. Unlike discriminating expert and agent behaviors by predicting binary labels in prior adversarial imitation learning methods, which is prone to overfitting to task-irrelevant features, the task progress estimator is required to learn more task-relevant information to precisely predict the task progress on a continuous scale. Hence, it can generalize better to unseen states and provide more informative rewards. As a measure of progress in goal-directed tasks, we define goal proximity, which is an estimate of temporal distance to the goal (i.e. the number of actions required to reach the goal) and entails all semantic information about how to reach the goal. We then train a proximity function to predict the goal proximity from expert demonstrations and agent experience. This proximity function acts as a dense reward to guide a reinforcement learning agent to reach states with high proximity, leading to the goal. In this paper, we focus on learning the proximity function and policy in a state space shared by the expert and learner, and leave generalizing to different embodiments as future work. 176 However, the predicted goal proximity can still be inaccurate on states not in the demonstrations, resulting in unstable policy learning. To improve the accuracy of the proximity function, we continually update it with trajectories from both the expert and learning agent. In addition, we penalize trajectories with the uncertainty of the proximity prediction to prevent the policy from exploiting inaccurate high proximity predictions. By leveraging the agent experience and predicting proximity function uncertainty, the proposed method achieves more efficient and stable policy learning. The main contribution of this paper is an LfO algorithm for goal-directed tasks with better generalization to new goals or states not in demonstrations using goal proximity that informs an agent of the task progress. Together with a difference-based reward and uncertainty penalty of goal proximity estimation, our method provides more informative and robust rewards. Our extensive experiments show that the policy learned with the goal proximity function generalizes better than the state-of-the-art LfO algorithms on various goal-directed tasks, including navigation, locomotion, and robotic manipulation. Moreover, our method shows comparable results with LfD methods which learn from expert actions and a goal-conditioned imitation learning method which uses a sparse task reward. 6.2 RelatedWork Imitation learning [258] aims to leverage expert demonstrations to acquire skills. 
While behavioral cloning [236] is simple but effective with a large number of demonstrations, it suffers from compounding errors caused by covariate shift [251]. On the other hand, inverse reinforcement learning (IRL) [210, 2, 353] estimates the underlying reward from demonstrations and trains a policy through reinforcement learning (RL) with this reward, which can better handle the compounding errors. Specifically, generative adversarial imitation learning (GAIL) [114] shows improved demonstration efficiency by training a discriminator to distinguish expert and agent transitions and using the discriminator output as a reward for policy training. 177 GoalGAIL [67] further improves sample efficiency for goal-directed tasks by relabeling transitions [14] and using true environment rewards. While these imitation learning algorithms require expert actions, imitation learning from observation (LfO) approaches learn from state-only demonstrations, such as videos and kinesthetic demonstrations. To imitate demonstrations without expert actions, inverse dynamics models [213, 301, 225], reachability functions [159], or learned reward functions [74, 268, 267, 176] can be learned and used for policy training, but training such models requires a large amount of quality data or additional test-time demonstrations. On the other hand, state-only adversarial imitation learning [302] can imitate from a few demonstrations. However, in such adversarial imitation learning approaches, the discriminator tends to find spurious associations between task-irrelevant features and expert/agent labels [356]. This becomes problematic when the agent encounters unseen states and the discriminator erroneously assigns agent behaviors low scores based on these task-irrelevant features, providing a poor reward for the agent. To overcome finding spurious associations, in addition to discriminating expert and agent trajectories, we propose to also estimate the proximity to the goal, which requires more task-relevant information and thus generalizes better to new states. Temporal progress estimation has shown its effectiveness as an auxiliary reward for RL [188, 73, 160] and decision making criteria [51, 16, 36]. However, these methods learn the progress estimator only from the given demonstrations. This hinders policy learning when the progress estimator fails to generalize to agent experience, allowing the agent to exploit inaccurate progress predictions for higher reward. Moreover, greedily choosing an action with the highest predicted temporal progress [51, 16, 36] could lead to low long-term returns. By incorporating online updates, uncertainty estimates, and a difference-based proximity reward, our method robustly learns from demonstrations to solve goal-directed tasks without access to expert actions or the true environment reward. 178 6.3 Method In this paper, we address the problem of LfO for goal-directed tasks with a focus on generalization to states or goals not covered in the demonstrations. Adversarial LfO methods [302, 336] suggest learning a reward function that penalizes agent state transitions deviating from the expert trajectories. However, these learned reward functions often focus on task-irrelevant features [356] and do not generalize to states not in the demonstrations, leading to unsuccessful policy training. To learn a generalizable reward, we propose to leverage task progress information freely available in demonstrations, in terms of goal proximity, which estimates temporal distance to the goal (i.e. 
number of actions required to reach the goal). Predicting precise goal proximity on a continuous scale, rather than simply distinguishing expert and agent states, requires the model to capture task-relevant information, allowing the proximity prediction to generalize to states not in the demonstrations (Section 6.3.2). Then, a policy learns to reach states with higher proximity predictions, leading to the goal (Section 6.3.3). Moreover, we propose to use the uncertainty of the proximity prediction to prevent the policy from exploiting over-optimistic proximity predictions and yielding undesired behaviors.

6.3.1 Preliminaries

We formulate our problem as a Markov decision process [292] defined through a tuple $(\mathcal{S}, \mathcal{A}, R, P, \rho_0, \gamma)$ of the state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R(s, a, s')$, transition distribution $P(s'|s, a)$, initial state distribution $\rho_0$, and discounting factor $\gamma$. We define a policy $\pi(a|s)$ that maps a state $s$ to an action $a$ and correspondingly moves an agent to a new state $s'$ according to the transition probability $P(s'|s, a)$. The policy is trained to maximize the sum of discounted rewards, $\mathbb{E}_{(s_0, a_0, \dots, s_{T_i}) \sim \pi}\left[\sum_{t=0}^{T_i - 1} \gamma^t R(s_t, a_t, s_{t+1})\right]$, where $T_i$ is the variable episode length.

In imitation learning, the learner receives a set of $N$ expert demonstrations, $\mathcal{D}^e = \{\tau^e_1, \dots, \tau^e_N\}$. In this paper, we specifically consider the LfO setup, where each demonstration $\tau^e_i$ is a sequence of states. Moreover, we assume that goal information is explicitly or implicitly included in the states, and that all demonstrations are successful; therefore, the final state of each trajectory achieves the task goal.

6.3.2 Learning Goal Proximity Function

To effectively leverage expert demonstrations and generalize to new states or new goals, learning a generalizable reward function is essential. In goal-directed tasks, an estimate of how close an agent is to the goal can be utilized as a dense and direct learning signal. Moreover, predicting the continuous goal proximity requires understanding the task structure and thus encourages finding more task-relevant features, resulting in better generalization. Therefore, instead of learning to simply discriminate agent and expert trajectories, we propose to learn a goal proximity function, $f : \mathcal{S} \rightarrow [0, 1]$, which predicts the goal proximity of a state $s$, a discounted value based on the temporal distance to the goal (i.e. inversely proportional to the number of actions required to reach the goal). In this paper, we define goal proximity as the exponentially discounted proximity $f(s_t) = \delta^{(T_i - t)}$, where $\delta \in (0, 1)$ is a discounting factor and $T_i$ is the episode length. Note that the goal proximity function measures the temporal distance, not the spatial distance, between the current and goal states. Therefore, a single proximity value can entail all information about the task, the goal, and any roadblocks. There are alternative ways to define goal proximity, such as linearly discounted proximity [160] and ranking-based proximity [34, 36], but in this paper we use the exponentially discounted proximity as it performs better across most tasks (see appendix, Figure 6.8).

We train a goal proximity function $f_\phi$ parameterized by $\phi$ to minimize the following objective:

$\mathcal{L}_\phi = \mathbb{E}_{\tau^e_i \sim \mathcal{D}^e,\, s_t \sim \tau^e_i}\left[\left(f_\phi(s_t) - \delta^{(T_i - t)}\right)^2\right]. \quad (6.1)$
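As a concrete illustration of Eq. 6.1, the following sketch computes the exponentially discounted proximity targets for one expert trajectory and the corresponding regression loss. The network interface and tensor shapes are assumptions; this is not the authors' implementation.

```python
import torch

def proximity_targets(num_states, delta=0.95):
    """Targets delta^(T_i - t) for states s_0 ... s_{T_i} of an expert trajectory (T_i = num_states - 1)."""
    t = torch.arange(num_states)
    horizon = num_states - 1                         # T_i; the final (goal) state gets target 1.0
    return delta ** (horizon - t).float()

def expert_proximity_loss(f_phi, expert_states, delta=0.95):
    """Regression loss of Eq. 6.1 for a single expert trajectory (expert_states: [T_i + 1, state_dim])."""
    targets = proximity_targets(len(expert_states), delta).to(expert_states.device)
    pred = f_phi(expert_states).squeeze(-1)          # predicted goal proximity in [0, 1]
    return ((pred - targets) ** 2).mean()
```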
Since the goal proximity function trained only on expert demonstrations can overfit to the data, we further train the goal proximity function with online agent experience by setting the target proximity of states in agent trajectories to 0, similar to adversarial imitation learning methods [114]:

$\mathcal{L}_\phi = \mathbb{E}_{\tau^e_i \sim \mathcal{D}^e,\, s_t \sim \tau^e_i}\left[\left(f_\phi(s_t) - \delta^{(T_i - t)}\right)^2\right] + \mathbb{E}_{\tau \sim \pi_\theta,\, s_t \sim \tau}\left[f_\phi(s_t)^2\right]. \quad (6.2)$

By learning to predict the goal proximity, $f_\phi$ not only learns to discriminate agent and expert trajectories (i.e. predict 0 proximity for an agent trajectory and positive proximity for an expert trajectory) but also acquires the task information about temporal progress entailed in the trajectories. From this freely available additional supervision, the proximity function is required to learn task-relevant features. Hence, the resulting proximity function generalizes better to unseen states and provides more informative learning signals to the policy, as empirically shown in Section 6.4. Due to the lack of environment reward, successful agent experience is also used as negative examples for proximity function training, and thus the proximity function learns to predict low goal proximity even for successful trajectories. However, early stopping and learning rate decay can ease this problem [356], and the optimal proximity function still outputs the average of the expert and agent labels, which is $\delta^{(T_i - t)}/2$ for ours and 0.5 for GAIL [88].

6.3.3 Training Policy with Proximity Reward

In a goal-directed task, a policy $\pi_\theta$ aims to get close to and eventually reach the goal. We can formalize this objective as maximizing the difference-based proximity reward $R_\phi$, the increase in goal proximity, at every timestep, which corresponds to making consistent progress towards the goal:

$R_\phi(s_t, s_{t+1}) = f_\phi(s_{t+1}) - f_\phi(s_t). \quad (6.3)$

Given the proximity reward $R_\phi$, the policy is trained to maximize the expected discounted return:

$\mathbb{E}_{(s_0, \dots, s_{T_i}) \sim \pi_\theta}\left[\sum_{t=0}^{T_i - 1} \gamma^t R_\phi(s_t, s_{t+1})\right]. \quad (6.4)$

However, a policy trained with the proximity reward can sometimes acquire undesired behaviors by exploiting over-optimistic proximity predictions on states not seen in the expert demonstrations. This becomes critical when the expert demonstrations are limited and cannot sufficiently cover the state space. To avoid inaccurate predictions leading an agent to undesired states, we propose to (1) fine-tune the proximity function with online agent experience to reduce optimistic proximity predictions; and (2) penalize agent trajectories with high uncertainty in the goal proximity prediction.

To alleviate the effect of inaccurate proximity estimation in policy training, we discourage the policy from visiting states with uncertain proximity estimates. Specifically, we model the uncertainty $U_\phi(s_t)$ as the disagreement of an ensemble of proximity functions by computing the standard deviation of their outputs [218, 152]. Then, we use this estimated uncertainty to penalize exploration of states with high uncertainty. The proximity estimate $f_\phi(s_t)$ is the average prediction of the ensemble. With the uncertainty penalty, the modified proximity reward can be written as:

$R_\phi(s_t, s_{t+1}) = f_\phi(s_{t+1}) - f_\phi(s_t) - \lambda \cdot U_\phi(s_{t+1}), \quad (6.5)$

where $\lambda$ is a tunable hyperparameter that balances the proximity reward and the uncertainty penalty. A larger $\lambda$ results in more conservative exploration outside the states covered by the expert demonstrations.
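To make Eqs. 6.3 and 6.5 concrete, the sketch below computes the ensemble-mean proximity, the disagreement-based uncertainty, and the penalized difference reward for a batch of transitions; the ensemble interface and the λ value are assumptions, not the actual implementation.

```python
import torch

def proximity_stats(ensemble, states):
    """Ensemble-mean proximity estimate and disagreement (standard deviation) for a batch of states."""
    preds = torch.stack([f(states).squeeze(-1) for f in ensemble], dim=0)  # [K, batch]
    return preds.mean(dim=0), preds.std(dim=0)

def proximity_reward(ensemble, s_t, s_next, lam=0.1):
    """Difference-based proximity reward with uncertainty penalty (Eq. 6.5)."""
    f_t, _ = proximity_stats(ensemble, s_t)
    f_next, u_next = proximity_stats(ensemble, s_next)
    return f_next - f_t - lam * u_next
```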
In summary, we propose to learn a goal proximity function to robustly provide a reward signal on states or goals not covered by demonstrations. We train the goal proximity function to estimate how close the current state is to the goal, and train a policy to maximize the goal proximity while avoiding states with uncertain proximity predictions. We jointly train the proximity function and policy as described in appendix, Algorithm 2.
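The joint training referenced above (appendix, Algorithm 2) alternates between policy updates on the proximity reward and proximity-function updates on expert and agent states. Below is a minimal sketch of one alternation; all callables and the trajectory container are assumed interfaces, and the policy update stands in for any on-policy RL step such as PPO.

```python
def joint_training_iteration(env, policy, ensemble, expert_demos,
                             collect_rollout, reward_fn, policy_update, proximity_update,
                             num_rollouts=10, lam=0.1):
    """One alternation of policy learning and proximity-function learning (sketch).

    Assumed interfaces: collect_rollout(env, policy) -> trajectory with .states and .rewards,
    reward_fn(ensemble, s, s_next, lam) implementing Eq. 6.5, policy_update(policy, rollouts)
    (e.g. a PPO step), and proximity_update(ensemble, expert_demos, rollouts) minimizing Eq. 6.2.
    """
    rollouts = []
    for _ in range(num_rollouts):
        traj = collect_rollout(env, policy)
        # Relabel each transition with the uncertainty-penalized proximity reward.
        traj.rewards = [reward_fn(ensemble, s, s_next, lam)
                        for s, s_next in zip(traj.states[:-1], traj.states[1:])]
        rollouts.append(traj)
    policy_update(policy, rollouts)                      # maximize the proximity return (Eq. 6.4)
    proximity_update(ensemble, expert_demos, rollouts)   # expert -> delta^(T-t), agent -> 0
```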
6.4 Experiments

In this paper, we propose a generalizable LfO algorithm that leverages task progress information (i.e. goal proximity) freely acquired from demonstrations. Hence, in our experiments, we aim to answer the following questions: (1) Does our method lead to policies that generalize better to states and goals not in the demonstrations? (2) How does our method's efficiency and performance compare against prior work in LfO and LfD? (3) What factors contribute to the performance of our method? To answer these questions, we consider diverse goal-directed tasks: navigation, locomotion, and robotic manipulation.

Figure 6.2: Six goal-directed tasks are used for our experiments. (a) Navigation: the agent must navigate across rooms to reach the goal. (b) Maze2D: the agent needs to navigate the maze to reach the goal. (c) AntReach: the ant agent must walk towards the flag. (d) FetchPick and (e) FetchPush: the robotic arm is required to pick up or push the block towards the goal (red). (f) HandRotate: the dexterous robot hand needs to rotate the block in-hand to the desired rotation.

6.4.1 Experimental Setup

To demonstrate the improved generalization capabilities of policies trained with the goal proximity, we benchmark our method under two different setups: expert demonstrations are collected from (1) only a fraction of the possible initial and goal states (e.g. 25%, 50% coverage) and (2) initial states with smaller amounts of noise. These generalization experimental setups serve to mimic the reality that expert demonstrations may be collected in a different setting from agent learning. For instance, due to the cost of demonstration collection, the demonstrations may poorly cover the state space, which corresponds to setup (1). Likewise, in setup (2), demonstrations may be collected in controlled circumstances with little noise. Then, an agent in an actual environment would encounter more noise than presented in the demonstrations, leading to a wider initial state distribution.

In our experiments, we use the discounting factor δ = 0.95 for the goal proximity. We use an ensemble of 5 proximity functions to model uncertainty across all tasks. For policy optimization, we use PPO [265], which is widely used in LfO and LfD methods, and its hyperparameters are tuned for each method and task (see appendix, Table 6.2). Each baseline implementation is verified against the results reported in its original paper. We train each task with 5 random seeds and report the mean and standard deviation. See Section 6.6.6 for further implementation details.

6.4.2 Baselines

We compare our method to the state-of-the-art methods in LfO (BCO, GAIfO, GAIfO-s) as well as LfO with reward (GoalGAIL) and LfD (BC, GAIL, SQIL) approaches, which require additional supervision, such as task reward and expert actions:
• BCO [301] learns an inverse dynamics model from environment interaction to provide action labels in demonstrations for behavioral cloning.
• GAIfO [302] trains a discriminator with state transitions (s, s′) instead of (s, a) as in GAIL.
• GAIfO-s [336] learns a discriminator based on a single state, not a state transition as in GAIfO.
• GoalGAIL [67] uses a goal-reaching reward and relabeling to improve the sample efficiency of GAIL.
• BC [236] fits a policy to the demonstration state-action pairs (s, a) with supervised learning.
• GAIL [114] is an adversarial imitation learning method with a discriminator trained on state-action pairs (s, a) from both expert and agent.
• SQIL [246] is a sample-efficient imitation learning method which adds expert transitions (s, a) with reward 1 to the replay buffer of off-policy RL and assigns 0 reward to all agent experience.

Figure 6.3: Goal completion rates of our method and baselines. The agent must generalize to wider state and goal distributions than seen in the demonstrations. Demonstrations cover only a part of the states (a, b) or are generated with less noise (c, d, e, f). Our method learns more stably, faster, and achieves higher goal completion rates than prior LfO methods. Moreover, our method outperforms the LfD baselines in Navigation, the Fetch tasks, and Maze2D, and achieves comparable results in AntReach. GoalGAIL performs well in Maze2D since it can easily acquire environment rewards.

6.4.3 Navigation

We first examine the Navigation task between four rooms shown in Figure 6.2(a) to demonstrate the generalization capability of our method, and visualize the learned goal proximity function. The agent observes the 19 × 19 × 4 2D map of the maze and moves in one of four directions. In this task, the agent starting and goal positions are randomly sampled (see an example in appendix, Figure 6.12). We provide 250 expert demonstrations obtained using a shortest path algorithm. During demonstration collection, we hold out 0%, 25%, 50%, and 75% of the possible agent starting and goal positions uniformly at random. In contrast, during agent learning and evaluation, start and goal positions are sampled from all possible positions.

Figure 6.3(b) shows that our method achieves a near-100% success rate in 2M environment steps even with demonstrations only covering 25% of starting and goal states, while other LfO methods fail to learn the task. Although BC, GAIL, and BCO achieve success rates of about 60%, 30%, and 35%, respectively, they show limited generalization to unseen configurations. This result shows that the learned goal proximity function generalizes well to unseen configurations. Figure 6.4(e) visualizes the proximity function trained with 50% coverage demonstrations and 250k steps of agent training. Our proximity function predicts high proximity near the goal and lower proximity when the agent is farther away from the goal. This demonstrates that our proximity function can learn the semantic, non-Euclidean relationship between high-dimensional observations and goals.
Since the proximity function is conditioned on the state, similar states are likely to have similar predicted proximity, and thus the proximity function learns a spatially consistent measure of proximity from temporal supervision. Moreover, as the task progress is a relative position within a trajectory, both slow and fast demonstrations result in the same task progress. More visualizations can be found in appendix, Section 6.6.5.

Finally, we investigate our hypothesis that the goal proximity function allows for greater generalization, which results in better performance with smaller demonstration coverages. We compare the cases where extreme (25% coverage), moderate (50% and 75% coverage), and no generalization (100% coverage) are required. Figure 6.3(b) and Figure 6.4 show that our method consistently achieves almost 100% success rates in 2M steps across all coverages and is not as affected by the increasingly difficult generalization settings as the baselines. In contrast, all LfO baselines struggle to learn the task when the demonstrations do not cover all configurations. LfD methods also show limited generalization in the 25% coverage setting since the discriminator can easily learn spurious associations between the actions and labels, which hurts generalization to new actions. This supports our hypothesis that the goal proximity function is able to capture the task structure and therefore generalizes better to unseen configurations.

[Figure 6.4 plots goal completion (%) over 5M environment steps for (b) 100% coverage, (c) 75% coverage, and (d) 50% coverage, comparing LfD and LfO methods, alongside (e) a proximity heatmap.]
Figure 6.4: Analyzing the effect of improved generalization as the cause for the performance increase in our method. (a) Performance with no generalization required. (b, c) Performance with increasing difference between the start and goal distributions of demonstrations and agent learning. (d) Visualization of the learned proximity function for a fixed goal (green) in the 50% coverage case. The proximity function was evaluated for every state on the grid; lighter cells correspond to states with higher estimated proximity to the goal.

6.4.4 Maze2D

We further evaluate our method in Maze2D [87] with the medium maze, a continuous version of Navigation. The agent observes its position, velocity, and goal position, and then outputs an x- and y-velocity to navigate the maze. The agent starting and goal positions are randomly sampled. We collect 100 demonstrations (18k transitions) using a planner from Fu et al. [87].

Our method outperforms LfO baselines over all demonstration coverages (see appendix, Figure 6.7). More importantly, in the low coverage case, our method outperforms BC, which has access to expert actions, as shown in Figure 6.3(c). This could be because our proximity function generalizes well whereas BC is not robust to unseen states under small demonstration coverages. On the other hand, GoalGAIL shows the best performance regardless of coverage as the task can be easily solved with the sparse reward and goal relabeling, which are not available to our method and the other baselines.

6.4.5 Ant Locomotion

In Ant Reach [91], the quadruped ant is tasked to reach a randomly generated goal, which lies along the perimeter of a half circle of radius 5 m centered around the ant (see Figure 6.2(c)). The 132D state consists of joint angles, velocities, contact forces, and the goal position relative to the agent.
We collect 1k demonstrations (25k transitions) using a pre-trained policy (trained for 40M steps). When demonstrations are collected, no noise is added to the initial pose of the ant, whereas random noise is added to the initial pose during policy learning, which requires the reward functions to generalize to unseen states.

In Figure 6.3(d), with 0.05 added noise, our method achieves a 35% success rate while BCO, GAIfO, and GAIfO-s achieve 1%, 2%, and 7%, respectively. This result illustrates the importance of learning proximity, as opposed to discriminating expert and agent states, for generalization to unseen states. The performance of GAIfO and GAIfO-s drops drastically with larger joint angle randomness, as shown in appendix, Figure 6.7. As Ant Reach is not as sensitive to noise in actions compared to other tasks, BC and GAIL show superior results, but our method still achieves comparable performance.

6.4.6 Robotic Manipulation

We evaluate our method in two robotic manipulation tasks with the 7-DoF Fetch robotics arm: Fetch Pick and Fetch Push [235]. The robot must grasp and move a block to a target position for Fetch Pick, and push a block to a target position for Fetch Push. The 16D state consists of the gripper pose, object pose, the gripper pose relative to the object, and goal position. Both the initial and target positions of the block are randomly initialized. We generate 1k demonstrations using a hard-coded policy, consisting of 33k and 28k transitions for Fetch Pick and Fetch Push, respectively. The policy is then trained in an environment with larger noise applied to the starting and target block positions.

In Fetch Pick, our method achieves about an 80% success rate, outperforming all baselines despite the LfD methods learning with expert actions (see Figure 6.3(e)). The best performing baseline, BC, only obtains around a 40% success rate. The high variance in performance between seeds comes from the difficulty of learning the grasping behavior with large noise. In Fetch Push, our method outperforms the baselines in generalization to unseen states by achieving more than a 90% success rate (see Figure 6.3(f)). This shows that our proximity function is able to accelerate policy learning in continuous control environments with superior generalization capability.

[Figure 6.5 plots goal completion (%) over 5M environment steps for (b) Navigation 50%, (c) Fetch Pick 1.75x, (d) Fetch Push 1.75x, and (e) Ant Reach 0.03, comparing Prox+Diff+Uncert (Ours), Prox+Diff, Prox+Abs+Uncert, Prox+Abs, GAIfO-s, GAIfO-s+Uncert, and GAIfO-s+Ensemble.]
Figure 6.5: Analysis of the contribution of the goal proximity function, uncertainty penalty, and reward formulation to the performance. "Prox" uses the goal proximity function while "GAIfO-s" does not. "+Diff" uses R(s_t, s_{t+1}) = f(s_{t+1}) − f(s_t) and "+Abs" uses R(s_t) = f(s_t) as a reward. "+Uncert" adds the uncertainty penalty to the reward. "+Ensemble" uses an ensemble for the discriminator.

6.4.7 Dexterous Hand Manipulation

We evaluate our method in a challenging in-hand object manipulation task [235], Hand Rotate, as shown in Figure 6.2(f). In Hand Rotate, a 24-DoF Shadow Dexterous Hand must rotate a block in-hand to a target z-axis rotation. The state consists of the agent's joint angles and velocities, object pose, and the target rotation.
Due to the high dimensionality of the state (68D) and action space (20D), Hand Rotate is extremely challenging for both RL and IL without a dense reward. We therefore ease the task by constraining the possible initial and target z rotations to [−π/32, π/32] and [π/3, π/2]. We collect 10k demonstrations (98k transitions) using a pre-trained policy (trained for 8M steps).

In Figure 6.3(g), GAIfO-s performs well because its reward function is biased to provide large negative rewards, encouraging the agent to end the episode early, which is only possible by succeeding. In contrast, our difference-based reward is designed to provide positive rewards, which does not exploit this task property, and it performs poorly even with an additional constant penalty of −0.005 every step. To test the generalization capability of our proximity function, we additionally examine a variant of our method (Ours-GAIL), which uses the same reward formulation as GAIfO-s, log f_ϕ(s_t) − log(1 − f_ϕ(s_t)). With this biased reward function, our method outperforms both GAIfO-s and GAIL, which verifies the benefit of our proximity function for generalization to noisy environments. While BC achieves a high success rate with 10x more demonstrations compared to other tasks, SQIL shows poor performance due to the lack of the negative reward bias.

[Figure 6.6 plots goal completion (%) over 5M environment steps for (a) the uncertainty penalty λ ∈ {0, 0.0001, 0.001, 0.01, 0.1}, (b) proximity function design (Ours, No Offline, Linear, w/ Action, No Uncert, Rank, No Online), and (c) baseline regularization (Ours, GAIfO-s, GAIfO-s+Spectral Norm).]
Figure 6.6: Ablation of our method and comparison to a regularized baseline on Fetch Pick 1.75x to investigate (a) effects of the uncertainty penalty coefficient λ; (b) effects of proximity function design and online/offline training; and (c) generalization capability of a regularization technique and our proximity function.

6.4.8 Ablation Study

Dissecting the proximity reward. We analyze the contribution of the proximity function, reward formulation, and uncertainty penalty to our method's performance across four tasks in Figure 6.5. Adding uncertainty to GAIfO-s (GAIfO-s+Uncert) produced an 18.4% boost in average success rate compared to regular GAIfO-s. Proximity supervision, without the uncertainty penalty, resulted in a 66.7% increase in average performance over GAIfO-s with the difference-based reward f(s_{t+1}) − f(s_t) (Prox+Diff) and a 25.8% increase with the absolute reward f(s_t) (Prox+Abs). This higher performance means modeling proximity is more important than the uncertainty penalty for our method.

Although we choose the difference-based reward with exponentially discounted goal proximity, the goal proximity can use either linear or exponential discounting, and both can be used for either a difference-based or absolute reward, which perform differently across tasks. For example, the difference-based proximity reward is better for policy learning than the absolute proximity reward except on Ant Reach and Hand Rotate, where the bias of the absolute reward [145] helps the agent survive longer and reach the goal. This is a fundamental problem in IRL, where inductive bias in reward functions is crucial and varies across tasks [145].
Nonetheless, our extensive experiments (Figures 6.5, 6.6(b), and 6.8) show that our goal proximity reward provides a more stable and generalizable learning signal than the baselines under the same reward bias.

Moreover, we found that the uncertainty penalty and proximity function have a synergistic interaction. Combining both the proximity and uncertainty gives a 68.7% increase with the difference-based reward (Prox+Diff+Uncert) and a 26.4% increase with the absolute reward (Prox+Abs+Uncert). The uncertainty penalty is especially important for the proximity function as it models fine-grained temporal information where inaccuracies can be easily exploited, as opposed to the binary classification of other adversarial imitation learning methods.

Ensemble networks. Next, we study whether the robustness of our method comes from the use of ensemble networks or from task progress. We verify this by applying an ensemble of discriminators to the best performing baseline, GAIfO-s. Figure 6.5 shows that GAIfO-s with ensemble networks (GAIfO-s+Ensemble) only achieves 19.6% higher success rates, which is still 39.7% lower than our method on average. Therefore, the use of task progress is key to learning a generalizable reward, not the use of ensemble networks.

Regularization of discriminators. In our experiments above, we show that our goal proximity function generalizes to unseen states and goals, which leads to successful imitation learning. We verify whether standard regularization techniques, such as spectral normalization [197], can also provide the same generalization benefit. In the Fetch Pick 1.75x noise setting (Figure 6.6(c)), GAIfO-s without regularization struggles to learn, achieving only a 1.43% success rate. Not surprisingly, applying spectral normalization [197] to the discriminator of GAIfO-s improves the success rate to 14.56%, which suggests that generalization of the reward function is key to imitation learning with insufficient demonstration coverage. Despite this improvement, our method performs much better at 75.45% success. In summary, predicting goal proximity enables significantly better generalization than regularizing the baselines. Figure 6.10 in the appendix shows similar results across most other tasks.

Uncertainty penalty coefficient λ. In Figure 6.6(a), we investigate how the uncertainty penalty coefficient λ affects the performance, showing that our method performs best with λ = 0.001. A higher or lower λ yields worse performance since a higher λ prevents exploring unseen states while a lower λ encourages exploiting uncertain predictions.

Proximity function training. In Figure 6.6(b), we test the importance of online and offline training of the proximity function. Note that we update the policy with online interactions in both scenarios. The result shows that the online proximity function update is crucial for our method, as the agent fails without online updates. Meanwhile, removing pre-training (Ours (No Offline)) slows down training. Similar results can be observed across all tasks (see appendix, Figure 6.8).

Our ablation experiments show that (1) goal proximity generalizes better and is more informative for policy learning; (2) the difference-based proximity reward generally performs better than the absolute one; and (3) the uncertainty penalty boosts the performance of our method. In conclusion, all three components of proximity, difference-based reward, and uncertainty are crucial for our method.
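To make the interplay of these three components concrete, the following minimal PyTorch sketch shows one way the difference-based proximity reward with an ensemble uncertainty penalty could be computed; the class name, network sizes, and the use of the ensemble standard deviation as the uncertainty measure are illustrative assumptions rather than the exact released implementation.

import torch
import torch.nn as nn

class ProximityEnsemble(nn.Module):
    """Ensemble of small MLPs, each predicting goal proximity in [0, 1]."""
    def __init__(self, state_dim, num_nets=5, hidden=64):
        super().__init__()
        self.nets = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                          nn.Linear(hidden, hidden), nn.Tanh(),
                          nn.Linear(hidden, 1), nn.Sigmoid())
            for _ in range(num_nets)])

    def forward(self, s):
        preds = torch.stack([net(s).squeeze(-1) for net in self.nets], dim=0)
        return preds.mean(dim=0), preds.std(dim=0)   # mean proximity, ensemble disagreement

def proximity_reward(ensemble, s_t, s_next, uncertainty_coef=0.001):
    """Difference-based proximity reward with an uncertainty penalty."""
    with torch.no_grad():
        f_t, _ = ensemble(s_t)
        f_next, std_next = ensemble(s_next)
    # Reward progress toward the goal; penalize transitions with uncertain predictions.
    return (f_next - f_t) - uncertainty_coef * std_next

A PPO agent would then substitute this quantity for the environment reward at every transition, following the joint training loop of Algorithm 2 in the appendix.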
6.5 Conclusion

We propose a generalizable learning from observation (LfO) method inspired by how humans acquire generalizable task information and learn skills in new situations by watching others performing goal-directed tasks. We specifically propose to use task progress, which is intuitive and readily available task information that can guide an agent closer to the goal. Inspired by this insight, we learn a goal proximity function and utilize it as a dense reward for policy learning. We hypothesize that predicting the task progress requires more task-relevant information than estimating an occupancy measure [114], and thus generalizes to states not seen in the demonstrations. Our extensive experiments on navigation, locomotion, and robotic manipulation show that our goal proximity function improves generalization in imitation learning, which results in better performance compared to LfO methods and comparable performance to LfD methods, which learn from expert actions.

In imitation learning, the generalization ability can include generalization to (1) unseen states and goals, (2) new visual environments (e.g. background), (3) unseen objects, and (4) different embodiments (e.g. humans to robots or different dynamics). In this paper, we focus on generalization to (1) unseen states and goals. This is especially important when the number of demonstrations is not sufficient to cover all possible states and goals, which is very common in imitation learning due to costly demonstration collection. Our approach suggests an effective way of using demonstrations with limited coverage by learning a generalizable goal proximity reward.

Generalization to a different environment and embodiment is another important research direction and is indeed our immediate future work. Recent advances in generalizable representation learning [267, 282, 322], robust policy learning [125, 149], and cross-domain correspondence [348] enable us to train a policy that generalizes to new environments and embodiments. Yet, these approaches are orthogonal and complementary to our method as our goal proximity function can be trained on top of the learned representations [267, 282, 322, 125, 149, 348]. We believe that our method can be combined with these approaches and improve their performance with better demonstration efficiency and additional supervision about task progress.

Societal Impact. Our method aims to increase the ability of autonomous agents, such as robots and self-driving cars, to imitate experts (e.g. humans) from observation alone. This enables autonomous agents to utilize data even without expert actions, such as kinesthetic demonstrations and video demonstrations. Ultimately, it could allow autonomous agents to acquire skills even from watching YouTube videos. Since our method learns from experts, it inherits any biases of the demonstrator, such as sub-optimal or unsafe behaviors. Additionally, because demonstrations are an easy and intuitive way to specify behaviors, their potential for automation poses a threat to job security. However, we overall see enormous benefit in this technology increasing human quality of life and automating difficult jobs.

6.6 Appendix

6.6.1 Comparison with GAIL and Its Variants

Our method shares a similar adversarial training process with GAIL [114]. First of all, similar to the discriminator in GAIfO-s [336], our proximity function takes only the current state as input.
However, rather than training the discriminator to classify expert from agent, we train the proximity function to regress to proximity labels, which are 0 for agent states and a time-discounted value between 0 and 1 for expert states. Our reward formulation also differs from GAIL approaches, which give a log-probability reward based on the discriminator output. We instead incorporate a proximity estimation uncertainty penalty and a difference-based proximity reward as shown in Equation 6.3.

6.6.2 Failure of GAIfO and SQIL

We found that SQIL training is unstable and often collapses after a certain number of training steps (see the Ant experiments in Figure 6.7). Similar trends can be observed in the original paper [246] and other recent papers [293, 240]. We hypothesize that GAIfO easily overfits to demonstrations compared to other baselines (e.g. GAIfO-s) since GAIfO conditions its discriminator on both the current and next observations. We evaluated these methods with demonstrations from the same initial and goal state distributions in the first column of Figure 6.7. Even though they are trained for the same goal distributions as the demonstrations, they still overfit to the demonstration states and thus cannot generalize to unseen states encountered during online rollouts for most tasks.

6.6.3 Analysis on Generalization of Our Method and Baselines

By learning to predict the goal proximity, the proximity function not only learns to discriminate expert and agent states but also models task progress, which encourages acquiring task-relevant information. With this additional supervision on learning goal proximity, we expect the proximity function to provide a more informative learning signal to the policy and generalize better to unseen states than baselines, which easily overfit the reward function to expert demonstrations.

To analyze how well our method and the baselines can generalize to unseen states, we vary the difference between the states encountered in expert demonstrations and agent training as described in Section 6.4. One way we vary the difference between expert demonstrations and agent learning is restricting the expert demonstrations to only cover parts of the state space. For Navigation and Maze2D, we show results for expert demonstrations that cover 100%, 75%, 50%, and 25% of the state space. For the discrete state space in Navigation, we restrict expert demonstrations to a fraction of the possible agent start and goal configurations. For Maze2D, we break the maze into 6 × 6 cells and sample a part of the cells for starting states and another part for goal states.

Likewise, we also measure generalization by adding more noise to the initial state during agent learning. On Fetch Pick, Fetch Push, Ant Reach, and Hand Rotate, we show results for four different noise settings. For the two Fetch tasks, the 2D sampling region of the object and goal is scaled by the noise factor. For Ant Reach, uniform noise scaled by the noise factor is added to the initial joint angles, whereas the demonstrations have no noise. For Hand Rotate, uniform noise scaled by the noise factor is added to the possible initial and target object pose. If our method allows for greater generalization from the expert demonstrations, it should perform well even under states different from those in the expert demonstrations. The results of our method and baselines across varying degrees of generalization are shown in Figure 6.7.
Note that the results in the main paper are for 1.75x noise in Fetch Pick and Fetch Push, 0.05 noise in Ant Reach, 0.35 noise in Hand Rotate, and 25% coverage in Maze2D. Across both harder and easier generalization, our method demonstrates more consistent performance compared to the baseline methods. While GAIfO-s performs well with high coverage or low noise, which require little generalization in agent learning, its performance deteriorates as the expert demonstration coverage decreases.

[Figure 6.7 plots goal completion (%) over 5M environment steps for LfO, LfD, and LfO+reward methods across Navigation 100%/75%/50%/25%, Maze2D 100%/75%/50%/25%, Ant Reach 0.00/0.01/0.03/0.05, Fetch Pick 1x/1.25x/1.75x/2x, Fetch Push 1x/1.25x/1.75x/2x, and Hand Rotate 0.0/0.25/0.35/0.5.]
Figure 6.7: Analyzing generalization to unseen states from expert demonstrations. The Navigation and Maze2D tasks are tested with different coverages of state spaces in demonstrations, while the Fetch, Ant Reach, and Hand Rotate tasks are tested in more noisy environments. The number indicates the amount of additional noise in agent learning compared to that in the expert demonstrations, with more noise requiring harder generalization. The noise level increases from left to right.

6.6.4 Further Ablations

We include additional ablations to further highlight the advantages of our main proposed method over its variants. We evaluate against the same ablations proposed in the main paper (Figure 6.6(b)), but across all environments. We present all these results in Figure 6.8. Our method shows the best performance in the majority of environments. In all tasks, incorporating online updates is crucial since the proximity function can overfit to expert trajectories and poorly generalize to agent trajectories.
Updating the proximity function with online agent experience lowers the proximity prediction outside of the expert trajectories and thus leads an agent to follow the expert. Our method with the uncertainty penalty shows superior performance in Fetch Pick and Hand Rotate, while it performs similarly to our method without the uncertainty penalty in the other environments. Our method using the linear proximity function achieves similar or slightly lower performance than the exponential proximity function used in the main paper. Offline pre-training of the proximity function is also helpful in most environments.

We also compare to an ablation which learns the proximity function through a ranking-based loss similar to Brown et al. [34] and Burke et al. [36]. However, we empirically found it to be ineffective and difficult to train. This ranking-based loss uses the criterion that for two states s_{t_1}, s_{t_2} from an expert trajectory, the proximities should obey f(s_{t_1}) < f(s_{t_2}) if t_1 < t_2. We therefore train the proximity function with the cross-entropy loss −Σ_{t_i < t_j} log( exp f_ϕ(s_{t_j}) / (exp f_ϕ(s_{t_i}) + exp f_ϕ(s_{t_j})) ). We incorporate agent experience by adding an additional loss which ranks expert states above agent states for randomly sampled pairs of expert and agent states (s_e, s_a) through the cross-entropy loss:

−Σ_{s_a ∼ π_θ, s_e ∼ D_e} log( exp f_ϕ(s_e) / (exp f_ϕ(s_a) + exp f_ϕ(s_e)) ).   (6.6)

Unlike the discounting factor in the discounting-based proximity function, the ranking-based training requires no hyperparameters. However, as shown in Figure 6.8, the lack of supervision on ground-truth proximity scores results in less meaningful predicted proximity and a worse learning signal for the agent, which could explain its poor performance (a minimal sketch contrasting the two training objectives is given after Figure 6.8).

We also show results for applying spectral normalization [197] to GAIfO-s [302] in Figure 6.10 across all tasks. While regularizing the GAIfO-s discriminator can consistently improve its performance, it still cannot generalize as well as our method for the majority of tasks. As mentioned in Section 6.4.7, GAIfO-s has a bias to provide negative rewards, encouraging the agent to end the episode early, which is a desirable property for the Hand Rotate task. Vanilla GAIfO-s therefore performs better than our method in this environment, and spectral normalization for the discriminator further improves GAIfO-s performance.

[Figure 6.8 plots goal completion (%) over 5M environment steps for Navigation 50%, Ant Reach 0.03, Maze2D 50%, Fetch Pick 1.75x, Fetch Push 1.75x, and Hand Rotate 0.35, comparing Ours, Ours (Linear), Ours (Rank), Ours (w/ Action), Ours (No Offline), Ours (No Uncert), and Ours (No Online).]
Figure 6.8: Ablation on proximity function design and online/offline proximity function training. We compare our method to the proximity function with actions as input or with a ranking-based objective (Equation 6.6). Our method shows consistently superior or comparable performance over all ablations.
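To contrast the discounted-regression objective used by our method with the ranking-based alternative above, the following minimal PyTorch sketch implements both losses for a single, time-ordered expert trajectory; the function names, the mean-squared-error regression form, and the random pairing of expert and agent states are illustrative assumptions rather than the exact released implementation.

import torch
import torch.nn.functional as F

def discounted_proximity_loss(f, expert_states, delta=0.95):
    """Regress predicted proximity to delta^(steps remaining to the goal)."""
    preds = f(expert_states).squeeze(-1)                       # (T,)
    steps_to_goal = torch.arange(len(preds) - 1, -1, -1, dtype=preds.dtype)
    targets = delta ** steps_to_goal                           # 1.0 at the final (goal) state
    return F.mse_loss(preds, targets)

def ranking_proximity_loss(f, expert_states, agent_states):
    """Ranking-based alternative (Equation 6.6): later expert states outrank earlier
    ones, and expert states outrank agent states."""
    preds = f(expert_states).squeeze(-1)                       # (T,), time-ordered
    i, j = torch.triu_indices(len(preds), len(preds), offset=1)
    pair_logits = torch.stack([preds[i], preds[j]], dim=-1)    # classes: (earlier, later)
    loss = F.cross_entropy(pair_logits, torch.ones(len(i), dtype=torch.long))
    # Rank randomly paired expert states above agent states.
    k = min(len(expert_states), len(agent_states))
    e = f(expert_states[torch.randperm(len(expert_states))[:k]]).squeeze(-1)
    a = f(agent_states[torch.randperm(len(agent_states))[:k]]).squeeze(-1)
    loss = loss + F.cross_entropy(torch.stack([a, e], dim=-1), torch.ones(k, dtype=torch.long))
    return loss

The online training of our method labels agent experience with 0 proximity in the same regression form, whereas the ranking loss never sees a ground-truth proximity value, which matches the weaker learning signal observed in Figure 6.8.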
[Figure 6.9 plots goal completion (%) over 5M environment steps for (a) Fetch Pick 1.75x and (b) Fetch Push 1.75x, comparing proximity discounting factors δ ∈ {0.5, 0.7, 0.9, 0.95, 0.99}.]
Figure 6.9: Analyzing different choices of the proximity discounting factor δ for training the proximity function. The model learns similarly well over a range of δ values around 0.95, but struggles for too large or too small δ.

6.6.5 Qualitative Results

It is important for agent learning that the proximity function gives higher values for states that are temporally closer to the goal. To verify this intuition, we visualize the proximity values predicted by the proximity function in a successful episode from agent learning in Figure 6.11. In Figure 6.11, we can observe that the predicted proximity increases as the agent moves closer to the goal (except in Hand Rotate). This provides an example of the proximity function generalizing to agent experience and providing a meaningful reward signal for agent learning.

We notice that while the predictions increase as the agent nears the goal, the proximity prediction values are often low (<0.1), as shown in Figure 6.11(c). These low values are mostly predicted for the states not covered in the demonstrations due to the adversarial online training of the proximity function. During online proximity function training, we label agent experience with 0 proximity and therefore proximity predictions get lower, especially for states not in the demonstrations.

For Hand Rotate, the proximity function fails to predict increasing proximity for states near the goal as the agent cannot learn to imitate the exact expert trajectories. Instead, due to the negatively biased reward, the agent finds a new way to solve the task, as discussed in Section 6.4.7, and therefore receives low proximity predictions even for successful trajectories, as shown in Figure 6.11(e). However, our method still provides relatively higher proximity values near goal states compared to baseline methods, which leads our agent to achieve higher performance in noisy environments.

[Figure 6.10 plots goal completion (%) over 5M environment steps for Navigation 50%, Ant 0.03, Maze 50%, Fetch Pick 1.75x, Fetch Push 1.75x, and Hand Rotate 0.35, comparing Ours, GAIfO-s, and GAIfO-s+Spectral Norm.]
Figure 6.10: Effect of applying spectral normalization to the GAIfO-s baseline compared to the performance of our method. While regularization helps GAIfO-s, it is still outperformed by our method in the majority of tasks.

6.6.6 Implementation Details

We use PyTorch [223] for our implementation and all experiments are conducted on a workstation with an Intel Xeon E5-2640 v4 CPU and an NVIDIA Titan Xp GPU. Most adversarial imitation learning methods and our method are trained for around 3 hours with 32 parallel workers. GoalGAIL and SQIL training takes around 48 hours since they use off-policy optimization with a single worker.

6.6.6.1 Environment Details

In this section, we summarize details of the six goal-directed tasks discussed in this paper. For all environments, the starting and goal states are randomly initialized.
All units in this section are in meters and radians unless otherwise specified. A summary of the observation spaces, action spaces, and episode lengths is provided in Table 6.1. To evaluate the generalization capability of our method and baselines, we constrain the coverage of expert demonstrations or add additional starting state noise during agent learning, as discussed in Section 6.6.3.

Navigation [50]. In Navigation, the state consists of a one-hot vector for each grid cell encoding wall, empty space, agent, or goal. Navigation has four discrete actions for moving in four directions. We collect 250 expert demonstrations using the shortest path algorithm (BFS search). The 25% holdout region is visualized in Figure 6.12.

Maze2D [87]. In Maze2D, the state consists of the agent's 2D position, velocity, and the goal position. The point-mass agent moves around the maze by controlling the continuous value of its (x, y) velocity. The only modification to this environment from maze2d-medium-v1 [87] is the episode length, reduced from 600 to 400. We collect 100 expert demonstrations using a planner provided by Fu et al. [87].

Ant Reach [91]. In Ant Reach, the state consists of joint angles, velocities, forces, and the relative goal position, and the agent is controlled using joint torque control. We collect 1,000 demonstrations using an expert policy trained with PPO [265] based on the reward function R(s, a) = 1 − 0.2 · ||p_ant − p_goal||_2 − 0.005 · ||a||_2^2, where p_ant and p_goal are the (x, y)-positions of the ant and goal, respectively, and a is an action. Please refer to the code for more details.

Fetch Pick and Fetch Push [235]. The actions in the Fetch experiments use 3D end-effector position control and 1D continuous control for the gripper (fixed for Fetch Push). The 16-dimensional state in the Fetch tasks consists of the relative position of the goal from the object, the relative position of the end-effector to the object, and the robot joint state. We found that not including the velocity information was beneficial for all learning from observation approaches in the Fetch tasks. In Fetch Pick, we generate 1,000 demonstrations by hard-coding the Sawyer robot to first reach above the object, then reach down and grasp, and finally move to the target position. Similarly, in Fetch Push, we collect 664 demonstrations by hard-coding the Sawyer to reach behind the object and then execute a planar push towards the goal.

Hand Rotate [235]. The original task HandManipulateBlockRotateZ-v0 proposed in Plappert et al. [235] is challenging to solve without reward due to its large and combinatorial state space and large action space. Hence, we reduce the initial and goal z rotations of the block to [−π/32, π/32] and [π/3, π/2]. The 68D state space consists of the agent's joint angles and velocities, and the object pose. The 20D action space is for joint torque control of the 24-DoF Shadow Dexterous Hand. We collect 10,000 demonstrations using an expert policy trained with DDPG+HER [14] using a sparse reward.

6.6.6.2 Network Architectures

Actor and critic networks: We use the same architecture for the actor and critic networks except for the output layer, where the actor network outputs an action distribution while the critic network outputs a value. For Navigation, the actor and critic network consists of CONV(3, 2, 16) − ReLU − MaxPool(2, 2) − CONV(3, 2, 32) − ReLU − CONV(3, 2, 64) followed by two fully-connected layers with hidden layer size 64, where CONV(k, s, c) represents a c-channel convolutional layer with kernel size k and stride s.
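A minimal PyTorch sketch of this Navigation encoder is given below; the padding of 1 on each convolution, the use of LazyLinear to infer the flattened feature size, and the class name are assumptions made only so the sketch runs on the 19 × 19 × 4 observation, and are not taken from the released code.

import torch.nn as nn

class NavigationEncoder(nn.Module):
    """CONV(3,2,16) - ReLU - MaxPool(2,2) - CONV(3,2,32) - ReLU - CONV(3,2,64),
    followed by two fully-connected layers of hidden size 64."""
    def __init__(self, in_channels=4, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.Flatten())
        self.fc = nn.Sequential(
            nn.LazyLinear(hidden), nn.ReLU(),   # infers the flattened size on the first call
            nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, obs):                     # obs: (B, 4, 19, 19) one-hot grid
        return self.fc(self.conv(obs))

Actor and critic heads (action logits and a scalar value) would then sit on top of the 64-dimensional output.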
For the other tasks, we model the actor and critic networks as two separate 3-layer MLPs with hidden layer size 256. For the continuous control tasks, the last layer of the actor MLP has two heads that output the mean and standard deviation of a Gaussian distribution from which an action is sampled. We use the ReLU activation for Navigation and tanh for the other tasks.

Goal proximity function and discriminator: The goal proximity function and discriminator use a CNN encoder (the same CNN architecture as the actor and critic networks) followed by a hidden layer of size 64 for Navigation, and a 3-layer MLP with a hidden layer of size 64 for the other tasks. When measuring the uncertainty of predictions, we use an ensemble of 5 networks.

6.6.6.3 Training Details

For our method and all baselines except BC [236] and BCO [301], we train policies using PPO [265]. The hyperparameters for policy training are shown in Table 6.2, while the hyperparameters for the proximity function and discriminator are shown in Table 6.3. For our method, we found it helpful to normalize the reward based on the moving average and standard deviation of returns. We also did so for baselines when it helped. For hyperparameter tuning, we searched over entropy coefficients {0.0001, 0.001, 0.01}, state normalization {True, False}, uncertainty coefficients {0.0001, 0.001, 0.01, 0.1}, learning rates {0.0001, 0.0003, 0.001}, and reward normalization {True, False}.

In BC, the demonstrations were split into 80% training data and 20% validation data. The policy was trained on the training data until the validation loss stopped decreasing. The policy is then evaluated for 1,000 episodes to get an average success rate. In GAIfO-s and GAIL, we use the reward forms log D(s) − log(1 − D(s)) and log D(s, a) − log(1 − D(s, a)), respectively, from Finn et al. [78] and Fu, Luo, and Levine [88]. For GoalGAIL [67], we use the default hyperparameters used in the original implementation. For the policy network, we use a deterministic policy for DDPG [170] and use the tanh activation to normalize the policy output between [−1, 1]. We update the policy and critic every 2 environment steps and the discriminator every 10 environment steps to prevent overfitting.

Algorithm 2: Imitation learning with learned goal proximity
Require: Expert demonstrations D_e = {τ_e^1, ..., τ_e^N}
1: Initialize goal proximity function f_ϕ and policy π_θ
2: for i = 0, 1, ..., M do
3:     Sample expert demonstration τ_e ∼ D_e
4:     Update f_ϕ with τ_e to minimize Equation 6.1            ▷ Offline proximity function training
5: end for
6: for i = 0, 1, ..., L do
7:     Rollout trajectories τ_i = (s_0, ..., s_{T_i}) with π_θ
8:     Compute proximity reward R_ϕ(s_t, s_{t+1}) for (s_t, s_{t+1}) ∼ τ_i using Equation 6.5
9:     Update π_θ using any RL algorithm                        ▷ Policy training
10:    Update f_ϕ with τ_i and τ_e ∼ D_e to minimize Equation 6.2   ▷ Online proximity function training
11: end for

Table 6.1: Environment details. In Navigation and Maze2D, the goal and agent are randomly initialized anywhere on the grid. In Ant Reach, the angle of the goal and the velocity of the agent are randomly initialized. The goal and object noise in Fetch describes the amount of uniform noise applied to the (x, y) coordinates of the object and goal. In Hand Rotate, the state and goal noises are applied to the initial and goal object rotations.

Task | State | Action | Goal noise | State noise | Episode len. | # demos
Navigation | (19, 19, 4) | 4 | - | - | 50 | 250
Maze2D | 6 | 2 | - | - | 600 | 100
Ant Reach | 132 | 8 | θ ∈ [0, π] | v ∈ [±.005] | 50 | 1,000
Fetch Pick | 16 | 4 | (x, y) ∈ [±.02, ±.05] (goal and state) | 50 | 1,000
Fetch Push | 16 | 3 | (x, y) ∈ [±.02, ±.05] (goal and state) | 60 | 664
Hand Rotate | 68 | 20 | θ ∈ [π/3, π/2] | θ ∈ [±π/32] | 50 | 10,000

Table 6.2: PPO hyperparameters used for baselines and our method.

Hyperparameter | Value
Learning Rate | 3e-4
Learning Rate Decay | Linear decay
# Mini-batches | 4 (Navigation), 32 (others)
# Epochs per Update | 4 (Navigation), 10 (others)
Discount Factor γ | 0.99
Rollout Size | 16,000 (Ant Reach), 4,096 (others)
Entropy Coefficient | 0.01 (Navigation), 0.001 (others)
State Normalization | False (Navigation), True (others)

[Figure 6.11 shows four selected frames per task with the predicted proximity below each frame: (a) Ant Reach 0.162 / 0.195 / 0.374 / 0.805; (b) Maze2D 0.000 / 0.038 / 0.157 / 0.517; (c) Fetch Pick 0.015 / 0.028 / 0.050 / 0.072; (d) Fetch Push 0.000 / 0.151 / 0.322 / 0.347; (e) Hand Rotate 0.464 / 0.013 / 0.004 / 0.006.]
Figure 6.11: Visualizing the proximity predictions for a successful trajectory from agent learning. Four informative frames are selected from the overall trajectory and the predicted proximity value is displayed below. The proximity prediction visualization for Navigation can be found in Figure 6.4(e).

Figure 6.12: The goals of the expert demonstrations in red for the Navigation 25% holdout setting.

Table 6.3: Hyperparameters for goal proximity functions (ours) and discriminators (baselines).

Hyperparameter | Value
# Networks for Ensemble | 5
# Epochs for Pre-training | 5
Discount Factor δ | 0.95 (exponential), 1/H (linear)
Uncertainty Coefficient λ | 0.001 (Fetch), 0.01 (others)
Learning Rate (ours) | 1e-3 (Navigation, Fetch, Maze2D), 1e-4 (Ant Reach, Hand Rotate)
Learning Rate (baselines) | 1e-4
Batch Size | 32 (Navigation), 128 (others)
# Updates per Agent Update | 1
Experience Buffer Size | 16,000 (Ant Reach), 4,096 (others)
Reward Norm. (ours) | True
Reward Norm. (baselines) | True (Fetch), False (others)

Part IV: Task Execution

Chapter 7: Learning to Execute Programs

7.1 Introduction

Humans are capable of leveraging instructions to accomplish complex tasks. A comprehensive instruction usually comprises a set of descriptions detailing a variety of situations and the corresponding subtasks that are required to be fulfilled. To accomplish a task, we can leverage instructions to estimate the progress, recognize the current state, and perform corresponding actions. For example, to make a gourmet dish, we can follow recipes and procedurally create the desired dish by recognizing what ingredients and tools are missing, what alternatives are available, and what corresponding preparations are required. With sufficient practice, we can improve our ability to perceive (e.g. knowing when food is well-cooked) as well as master cooking skills (e.g. cutting food into same-sized pieces), and eventually accomplish difficult recipes.

Can machines likewise learn to follow and exploit comprehensive instructions like humans? Utilizing expert demonstrations to instruct agents has been widely studied in [82, 345, 332, 225, 283, 69, 323]. However, demonstrations could be expensive to obtain and are less flexible (e.g. altering subtask orders in demonstrations is nontrivial). On the other hand, natural language instructions are flexible and expressive [182, 128, 138, 194, 86, 135, 21].
Yet, language has the caveat of being ambiguous even to humans, due to its lack of structure as well as unclear coreferences and entities. [11, 216] investigate a hierarchical approach, where the instructions consist of a set of symbolically represented subtasks. Nonetheless, those instructions are not a function of states (i.e. they do not describe a variety of circumstances and the corresponding desired subtasks), which substantially limits their expressiveness.

We propose to utilize programs, written in a formal language, as a structured, expressive, and unambiguous representation to specify tasks. Specifically, we consider programs, which are composed of control flows (e.g. if/else and loops), environmental conditions, as well as corresponding subtasks, as shown in Figure 7.1. Not only do programs have expressiveness by describing diverse situations (e.g. a river exists) and the corresponding subtasks which are required to be executed (e.g. mining wood), but they are also unambiguous due to their explicit scoping. To study the effectiveness of using programs as task specifications, we introduce a new problem, where we aim to develop a framework which learns to comprehend a task specified by a program as well as perceive and interact with the environment to accomplish the task.

To address this problem, we propose a modular framework, program guided agent, which exploits the structural nature of programs to decompose and execute them as well as learn to ground program tokens with the environment. Specifically, our framework consists of three modules: (1) a program interpreter that leverages a grammar provided by the programming language to parse and execute a program, (2) a perception module that learns to respond to conditional queries (e.g. is_there[River]) produced by the interpreter, and (3) a policy that learns to fulfill a variety of subtasks (e.g. mine(Wood)) extracted from the program by the interpreter. To effectively instruct the policy with symbolically represented subtasks, we introduce a learned modulation mechanism that leverages a subtask to modulate the encoded state features instead of concatenating them. Our framework (shown in Figure 7.3) utilizes a rule-based program interpreter to deal with programs, and learns a perception module and a policy when it is necessary to perceive or interact with the environment. With this modularity, our framework can generalize to more complex program-specified tasks without additional learning.

To evaluate the proposed framework, we consider a Minecraft-inspired 2D gridworld environment, where an agent can navigate itself across different terrains and interact with objects, similar to [11, 281].

Program
def run():
    if is_there[River]:
        mine(Wood)
        build_bridge()
        if agent[Iron] < 3:
            mine(Iron)
            place(Iron, 1, 1)
        else:
            goto(4, 2)
    while env[Gold] > 0:
        mine(Gold)

Figure 7.1: An illustration of the proposed problem. We are interested in learning to fulfill tasks specified by written programs. A program consists of control flows (e.g. if, while), branching conditions (e.g. is_there[River]), and subtasks (e.g. mine(Wood)).
Program p := def run(): s
Statement s := while(c): (s) | b | loop(i): (s) | if(c): (s) | elseif(c): (s) | else: (s)
Item t := Gold | Wood | Iron
Terrain u := Bridge | River | Merchant | Wall | Flat
Operators o := > | ≥ | == | < | ≤
Numbers i := a positive integer or zero
Perception h := agent[t] | env[t] | is_there[t] | is_there[u]
Behavior b := mine(t) | goto(i, i) | place(t, i, i) | build_bridge() | sell(t)
Conditions c := h[t] o i | h[u] o i

Figure 7.2: The domain-specific language (DSL) for constructing programs. Each program is composed of domain-dependent perception, subtasks, and control flows.

A corresponding domain-specific language (DSL) defines the rules of constructing programs for instructing an agent to accomplish certain tasks. Our proposed framework demonstrates superior generalization ability – learning from simpler tasks while generalizing to complex tasks. We also conduct extensive analysis on various end-to-end learning models which learn not only from program instructions but also from natural language descriptions. Furthermore, our proposed learned policy modulation mechanism yields better learning efficiency compared to other commonly used methods that simply concatenate a state and goal.

7.2 Related Work

Learning from language instructions. Prior works have investigated leveraging natural languages to specify tasks in a wide range of applications, including navigation [194, 296, 86, 313, 272, 32, 33, 195, 296], spatial reasoning for goal reaching [127], game playing [135, 85, 249], and grounding visual concepts [135, 21, 10]. However, natural language descriptions can often be ambiguous even to humans. Moreover, it is not clear how end-to-end learning agents trained with simpler instructions can generalize well to much more complex ones. In contrast, we propose to utilize a precise and structured representation, programs, to specify tasks.

Learning from demonstrations. When a task cannot be easily described in language (e.g. object texture or geometry), expert demonstrations offer an alternative way to provide instructions. Prior works have explored learning from video demonstrations [82, 345, 332, 225, 283, 19] or expert trajectories [69, 323]. However, demonstrations can be expensive to obtain and are less expressive about the diverging behaviors of a complex task, which are better captured by control flow in programs. Moreover, editing demonstrations, such as rearranging the order of subtasks, is often difficult.

Program induction and synthesis. To acquire programmatic skills such as digit addition and string transformations and achieve better generalization, program induction methods [332, 61, 208, 95, 132, 247, 38, 329] aim to implicitly induce the underlying programs to mimic the behaviors demonstrated in task specifications (e.g. input/output pairs, demonstrations). On the other hand, program synthesis methods [31, 219, 63, 46, 273, 35, 175, 287, 171, 169] explicitly synthesize the underlying programs and execute the programs to perform the tasks. Instead of trying to infer programs from task specifications, we are interested in explicitly executing programs. Also, our framework can potentially be leveraged to obtain program execution results for evaluating program synthesis frameworks when no program executor is available (e.g. when programs describe real-world activities instead of behaviors in simulation).

Symbolic planning and programmable agents. Classical symbolic planning concerns the problem of achieving a goal state from an initial state through a series of symbolically represented executions [90, 144]. Our work shares a similar spirit but assumes a task (i.e. a program) is given, where the agent is required to learn to ground symbolic concepts [185, 107] and follow the control flow. Executing programs with reinforcement learning has been studied in programmable hierarchies of abstract machines [221, 8, 9], which provide partial descriptions and subroutines of the desired task. [59, 155] train agents to execute declarative programs by grounding these well-structured languages in their learning environments. In contrast, our modular framework consists of modules for perceiving the environment and interacting with
Our work shares a similar spirit but assume a task (i.e. a program) is given, where the agent is required to learn to ground symbolic concepts [185, 107] and follow the control flow. Executing programs with reinforcement learning has been studied in programmable hierarchies of abstract machines [221, 8, 9], which provide partial descriptions and subroutines of the desired task. [59, 155] train agents to execute declarative programs by grounding these well-structured languages in their learning environments. In contrast, our modular framework consists of modules for perceiving the environment and interacting with 212 it by following an imperative program which specifies the task. An extended discussion on the related work can be found in Section 7.7.3. 7.3 ProblemFormulation We are interested in learning to comprehend and execute an instruction specified by a program to fulfill the desired task. In this section, we formally describe our definition of programs, the family of Markov Decision Processes (MDPs), and the problem formulation. Program. The programs considered in this work are defined based on a Domain-Specific Language (DSL) as shown in Figure 7.2. The DSL is composed of perception primitives, action primitives, and control flow. A perception primitive indicates circumstances in the environment ( e.g.is_there(River), and agent[Gold]<3) that can be perceived by an agent, while an action primitive defines a subtask that describes a certain behavior (e.g. mine(Gold), and goto(1,1)). Control flow includes if/else statements, loops, and Boolean/logical operators to compose more sophisticated conditions. A programp is a deterministic function that outputs a desired behavior (i.e. subtask) given a history of stateso t =p(H j ), where H j = {s 1 ,...,s t } is a state history with s ∈ S denoting a state of the environment, and o ∈ O denotes an instructed behavior (subtask). We denote a program asp∼P , an infinite program set containing all executable programs given a DSL. Note that a discussion on the DSL design principle can be found in Section 7.7.2. MDPs. We consider a family of finite-horizon discounted MDPs in a shared environment, specified by a tuple(S,A,P,T,R,ρ,γ ), whereS denotes a set of states,A denotes a set of low-level actions an agent can take,P denotes a set of programs specifying instructions,T :S×A×S → R denotes a transition probability distribution,R denotes a task-specific reward function, ρ denotes an initial state distribution, andγ denotes a discount factor. For a fixed sequence {(s 0 ,a 0 ),...,(s t ,a t )} of states and actions obtained 213 Module Module Output Environment Action Policy Goal Program Interpreter Response Query Perception Module Program def run(): while env[Gold] > 0: mine(Gold) if is_there[River]: build_bridge() place(Wood, 2, 3) State 3 0 1 Figure 7.3: ProgramGuidedAgent. The proposed modular framework comprehends and fulfills a desired task specified by a program. The program interpreter executes the program by altering between querying the perception module with a queryq when an environment condition is encountered (e.g.env[Gold]>0, is_there[River]) and instructing a policy when it needs to fulfill a goal/subtask g (e.g. mine(Gold), build_bridge()). The perception module produces a responseh to answer the query, determining which paths in the program should be chosen. The policy takes a sequence of low-level actionsa (e.g.moveUp, moveLeft,Pickup) interacting with the environment to accomplish the given subtask (e.g.mine(Gold)). 
from a rollout of a given policy π, the performance of the policy is evaluated based on the discounted return Σ_{t=0}^{T} γ^t r_t, where T is the horizon of the episode.

Problem Formulation. We consider developing a framework which can comprehend and fulfill an instruction specified by a program. Specifically, we consider a sampled MDP with a program describing the desired task. Addressing this task requires the ability to keep track of which parts of the program are finished and which are not, to perceive the environment and decide which paths in the program to take, and to perform actions interacting with the environment to fulfill subtasks.

7.4 Approach

Accomplishing an instructed task described by a program requires (1) executing the program control flow and conditions, (2) recognizing the situations to infer which path in the program should be chosen, and (3) executing a series of actions interacting with the environment to fulfill the subtasks. Based on this intuition, we design a modular framework with three modules:

• Program interpreter (Section 7.4.1) reads a program and executes it by querying a perception module with environment conditions (e.g. env[Gold]>0) and instructing the policy with subtasks (e.g. mine(Gold)).
• Perception module (Section 7.4.2) responds to perception queries (e.g. env[Gold]>0) by examining the observation and predicting responses (e.g. true).
• Policy (action module) (Section 7.4.3) performs low-level actions (e.g. moveUp, moveLeft, pickUp) to fulfill the symbolically represented subtasks (e.g. mine(Gold)) provided by the program interpreter.

Our key insight is to only learn a module when its input or output is associated with the environment (i.e. a function approximator is needed) – the perception module learns to ground the queries to its observation and answer them; the policy learns to ground the symbolically represented subtasks and interact with the environment in a trial-and-error way (Section 7.4.4). On the other hand, we do not learn the program interpreter; instead, we utilize a rule-based parser to execute programs. An overview of the proposed framework is illustrated in Figure 7.3.

7.4.1 Program Interpreter

To execute a program instruction, we group program tokens into three main categories: (1) subtasks, which indicate what the agent should perform (e.g. mine(Gold)); (2) perceptions, the essential information extracted from the environment (e.g. env[Gold]>0); and (3) control flows, which determine which paths in a program should be taken according to the perceived information (i.e. perceptions). Then, we devise a program interpreter, which can execute a program and keep track of the progress by leveraging the structure of programs. Specifically, it consists of a program line parser and a program executor. The parser first transforms the program into a program tree by representing each line of a program as a tree node. Each node is either a leaf node (subtask) or a non-leaf node (perception or control flow) that has various subroutines as children nodes. The executor then performs a pre-order traversal on the program tree to execute the program, utilizing the parsed contents to alternate between querying the perception module when an environment condition is encountered and instructing the policy when it reaches a leaf node (subtask). The details and the algorithm are summarized in Section 7.7.1. Note that the program interpreter is a rule-based algorithm instead of a learning module.
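A minimal Python sketch of such a rule-based executor is given below to make the pre-order traversal concrete; the dictionary-based tree representation and the perceive/act callbacks are illustrative assumptions rather than the parser and executor detailed in Section 7.7.1.

def execute(node, perceive, act):
    """Pre-order traversal of a parsed program tree (rule-based, no learning).

    A node is a dict such as
        {"type": "run",     "body": [...]}                                    # root: def run()
        {"type": "if",      "cond": "is_there[River]", "body": [...], "else": [...]}
        {"type": "while",   "cond": "env[Gold]>0",     "body": [...]}
        {"type": "subtask", "goal": "mine(Gold)"}
    perceive(query) returns True/False from the perception module;
    act(goal) runs the policy until the subtask terminates.
    """
    if node["type"] == "subtask":
        act(node["goal"])                       # leaf node: hand the goal to the policy
    elif node["type"] == "if":
        branch = node["body"] if perceive(node["cond"]) else node.get("else", [])
        for child in branch:
            execute(child, perceive, act)
    elif node["type"] == "while":
        while perceive(node["cond"]):           # re-query the perception module each iteration
            for child in node["body"]:
                execute(child, perceive, act)
    elif node["type"] == "run":
        for child in node["body"]:
            execute(child, perceive, act)

In the full framework, perceive would invoke the learned perception module on the current observation and act would run the learned multitask policy until the subtask is fulfilled.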
7.4.2 Perception Module

Determining which paths should be chosen when executing a program requires grounding a symbolically represented query (e.g. is_there[River] can be represented as a sequence of symbols) and perceiving the environment. To this end, we employ a perception module Φ that learns to map a query and the current observation to a response: h = Φ(q, s), where q denotes a query and h denotes the corresponding perception output (e.g. true/false). Note that we focus on Boolean perception outputs in this paper, but a more generic perception type can be used (e.g. object attributes such as color, shape, and size).

7.4.3 Policy

When program execution reaches a subtask/leaf node (e.g. mine(Gold)), the agent is required to take a sequence of low-level actions (e.g. moveUp, moveLeft, pickUp) to interact with the environment to fulfill it. To enable this execution, we employ a multitask policy π (i.e. action module) which is instructed by a symbolic goal (e.g. mine(Gold)) provided by the program interpreter indicating the details of the corresponding subtask. To learn to perform different subtasks, we train the policy using actor-critic reinforcement learning, which takes a goal vector g and an environment state s and outputs a probability distribution over low-level actions, a ∼ π(s_t, g_t | θ). The value estimator used for our policy optimization is also goal-conditioned: V^π(s_t, g_t) = E[Σ_t γ^t R_t | s_0 = s, π, g_t].

While the most common way to feed a state and goal to a policy parameterized by a neural network is to concatenate them in a raw space or a latent space, we find this less effective when the policy has to learn a diverse set of tasks. Therefore, we propose a modulation mechanism to effectively learn the policy. Specifically, we employ a goal network to encode the goal and compute affine transform parameters γ and β, which are used to modulate the state features e_s to ê_s = γ · e_s + β. Then, the modulated features ê_s are used to predict the action a and value V. With the modulation mechanism, the goal network learns to activate state features related to the current goal and deactivate others. An illustration is shown in Figure 7.4(a). A more detailed discussion of related works that utilize similar learned modulation mechanisms can be found in Section 7.7.4.

Figure 7.4: Learning a multitask policy via learned modulation. (a) A multitask policy takes both a state s and a goal specification g as inputs and produces an action distribution a ∼ π(s, g). Instead of simply concatenating the state and goal in a raw space or a latent space, we propose to modulate state features e_s using the goal. Specifically, the goal network learns to predict affine transform parameters γ and β to modulate the state features: ê_s = γ · e_s + β. Then, the final layers use the modulated features to predict actions. (b) We experiment with different ways of feeding a state and goal for learning a multitask policy. The training curves demonstrate that all modulation variants, including modulating state feature maps of convolutional layers (Modulation conv), modulating state feature vectors of fully-connected layers (Modulation fc), or both (Modulation conv fc), are more efficient than concatenating the state and the goal in a raw space (Concat raw) or a latent space (Concat).
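A minimal PyTorch sketch of this goal-conditioned modulation is shown below; the module name, layer sizes, and the choice to modulate a single fully-connected feature vector are illustrative assumptions (the full model can also modulate convolutional feature maps, as compared in Figure 7.4(b)).

import torch
import torch.nn as nn

class GoalModulatedPolicy(nn.Module):
    """The goal network predicts (gamma, beta) that modulate encoded state features."""
    def __init__(self, state_dim, goal_dim, num_actions, hidden=64):
        super().__init__()
        self.state_encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.goal_network = nn.Linear(goal_dim, 2 * hidden)      # predicts gamma and beta
        self.actor = nn.Linear(hidden, num_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, state, goal):
        e_s = self.state_encoder(state)                          # state features e_s
        gamma, beta = self.goal_network(goal).chunk(2, dim=-1)
        e_hat = gamma * e_s + beta                               # modulated features
        logits = self.actor(torch.relu(e_hat))                   # action distribution
        value = self.critic(torch.relu(e_hat))                   # goal-conditioned value V(s, g)
        return torch.distributions.Categorical(logits=logits), value

Compared with concatenation, the goal here only scales and shifts the state features, so the same state encoder is shared across subtasks while the goal network decides which features matter for the current subtask.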
7.4.4 Learning

To follow a program by perceiving the environment and taking actions to interact with it, we employ two learning modules: a perception module and a policy. In this section, we discuss how each module is trained, their training objectives, and the optimization methods. More training details and the architectures can be found in Section 7.7.5.4.

7.4.4.1 Perception Module

We formulate training the perception module as a supervised learning task. Given tuples of (query $q$, state $s$, ground truth perception $h_{gt}$), we train a neural network Φ to predict the perception output $h$ by optimizing the binary cross-entropy loss: $\mathcal{L}_{CE} = -h_{gt}\log(h) - (1 - h_{gt})\log(1 - h)$. A query such as is_there[River] is represented as a sequence of symbols. Note that when a perception describes more than a Boolean, the perception module can be trained by optimizing other losses such as the categorical cross-entropy loss. We train the perception module only on the queries appearing in the training programs with randomly sampled states, requiring it to generalize to novel queries to perform well when executing testing programs.

7.4.4.2 Policy

We train the policy using Advantage Actor-Critic (A2C) [198, 64], which is commonly used for gridworld environments with discrete action spaces. A2C computes policy gradients $A_t \nabla_\theta \log \pi_\theta(a_t | s_t, g_t)$, where $A_t = R_t - V(s_t, g_t)$ is the advantage function based on the empirical return $R_t$ starting from $s_t$ and the learned value estimator $V(s_t, g_t)$ conditioned on the goal vector $g_t$. We denote the learning rate as α, and the policy update rule is as follows:

$\theta \leftarrow \theta + \alpha \left( A_t \nabla_\theta \log \pi_\theta(a_t | s_t, g_t) + \beta \nabla_\theta H^{\pi_\theta} \right)$,    (7.1)

where $H^{\pi_\theta}$ denotes the policy entropy, whose maximization improves overall exploration, and β determines the strength of the entropy regularization term.

7.5 Experiments

Our experiments aim to answer the following questions: (1) Can our proposed framework learn to perform tasks specified by programs? (2) Can our modular framework generalize better to more complex tasks compared to end-to-end learning models? (3) How well can a variety of end-to-end learning models (e.g. LSTM, Tree-RNN, Transformer) learn from programs and natural language instructions? (4) Is the proposed learned modulation more efficient for learning a multitask (multi-goal) policy than simply concatenating a state and goal?

7.5.1 Experimental Setups

7.5.1.1 Environment

To evaluate the proposed framework in an environment where an agent can perceive diverse scenarios and interact with the environment to perform various subtasks, we construct a discrete Minecraft-inspired gridworld environment, similar to [11, 281]. As illustrated in Figure 7.1, the agent can navigate through a grid world and interact with resources (e.g. Wood, Iron, Gold) and obstacles (e.g. River, Wall), build tools (e.g. Bridge), and sell resources to a merchant visualized as an alpaca. The environment gives a sparse task completion reward of +1 when an instruction (i.e. an entire program or natural language instruction) is successfully executed. More details can be found in Section 7.7.5.1.

7.5.1.2 Task Instructions

Programs. We sample 4,500 programs using our DSL and split them into 4,000 training programs (train) and 500 testing programs (test). To examine the framework's ability to generalize to more complex instructions, we generate 500 programs which are on average twice as long and contain more condition branches to construct a harder testing set (test-complex).
Natural language instructions. To obtain the natural language counterparts of those instructions, we asked annotators to construct natural language translations of all the programs. The data collection details, as well as sample programs and their corresponding natural language translations, can be found in Section 7.7.5.3 and Figure 7.10, respectively. We include a brief discussion on how annotated natural language instructions can be ambiguously interpreted as several valid programs.

7.5.2 Training

During training, we randomly sample programs from the training set as well as randomly sample an environment state to execute the program interpreter. The program interpreter produces a goal to instruct the policy when encountering a subtask in the program. The policy takes actions $a \sim \pi(s, g)$ and receives a reward of +1 only when the entire program is completed. While we do not explicitly introduce a curriculum like [11], this setup naturally induces a curriculum where the policy first learns to solve simpler programs and gains a better understanding of subtasks by obtaining the task completion reward, which eventually allows the policy to complete more complex programs. Note that the perception module is pre-trained beforehand in a supervised manner. More training details can be found in Section 7.7.5.7.

7.5.3 End-to-end Learning Models

In contrast to the proposed modular framework, we experiment with a variety of end-to-end learning models. Considering programs and natural language instructions as sequences of tokens, we investigate two types of state-of-the-art sequence encoders: LSTM [115] (Seq-LSTM) and Transformers [306, 62] (Transformer). To leverage the explicit structure of programs, we also investigate encoding programs using a generalization of RNNs for tree-structured input [294, 7] (Tree-RNN). All the models are trained using A2C. The details of these architectures can be found in Section 7.7.5.4.

Table 7.1: Task completion rate. For each method, we iterate over all the programs in a testing set by randomly sampling ten initial environment states and running three models trained using different random seeds for this method. The averaged task completion rates and their standard deviations are reported. Note that all the end-to-end learning models learning from natural language descriptions and programs suffer from a significant performance drop when evaluated on the more complex testing set.

Instruction          Natural language descriptions    Programs
Method               Seq-LSTM     Transformer         Seq-LSTM     Tree-RNN     Transformer    Ours (concat)    Ours
test                 54.9±1.8%    52.5±2.6%           56.7±1.9%    50.1±1.2%    49.4±1.6%      88.6±0.8%        94.0±0.5%
test-complex         32.4±4.9%    38.2±2.6%           38.8±1.2%    42.2±2.4%    40.9±1.5%      85.2±0.8%        91.8±0.2%
Generalization gap   40.9%        27.2%               31.6%        15.8%        17.2%          3.8%             2.3%

7.5.4 Results

7.5.4.1 Task Completion

We train the proposed framework and the end-to-end learning models on the training programs and evaluate their performance using the percentage of completed instructions on the test and test-complex sets (shown in Table 7.1). Our proposed framework achieves a satisfactory test performance and only suffers a negligible drop (i.e. generalization gap) when it is evaluated on the test-complex set. This can be attributed to the modular design, which explicitly utilizes the structure and grammar of programs, allowing the two learning modules (i.e. perception and policy) to focus on their local jobs. A more detailed failure analysis can be found in Section 7.7.5.6.
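For concreteness, the evaluation protocol described in the caption of Table 7.1 can be sketched as the nested loop below; the function and argument names are placeholders, not our actual evaluation code.

import statistics

def evaluate(method_models, test_programs, sample_initial_state, run_episode, num_states=10):
    """Average task completion rate over programs, sampled initial states, and
    models trained with different random seeds (three per method in our experiments)."""
    per_model_rates = []
    for model in method_models:                         # e.g. three seeds of one method
        successes, total = 0, 0
        for program in test_programs:
            for _ in range(num_states):
                state = sample_initial_state(program)   # valid initialization for this program
                successes += int(run_episode(model, program, state))
                total += 1
        per_model_rates.append(successes / total)
    return statistics.mean(per_model_rates), statistics.stdev(per_model_rates)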
On the other hand, all the end-to-end learning models suffer a significant performance drop between the test and test-complex sets, while the drop is less significant for the models learning from programs, potentially indicating that models learning from instructions with explicit structures can generalize better to complex instructions. Among them, Seq-LSTM achieves the best results on the test set but performs the worst on the test-complex set. Transformer has smaller generalization gaps, which could be attributed to its multi-head attention mechanism capturing the instruction semantics better. By leveraging the explicit structure of programs, Tree-RNN achieves the best generalization performance.

Figure 7.5: Analysis on end-to-end learning models. (a) Instruction length: models learning from programs generalize better to longer instructions. Transformer is more robust to longer instructions (upper). Tree-RNN, exploiting the program structure, generalizes the best but performs worst for shorter programs (lower). (b) Instruction diversity: Seq-LSTM learning from both types of instructions performs worse as the diversity increases. Transformer learns better from natural language when the instructions are less diverse (upper). Transformer and Tree-RNN learning from programs are more consistent as the diversity increases, yet perform worse on less diverse instructions (lower).

7.5.4.2 Analysis

An analysis of the end-to-end learning models with respect to varying instruction length and complexity is shown in Figure 7.5, where all the instructions from the test and test-complex sets are considered.

Instruction length. As shown in Figure 7.5 (a), both Seq-LSTM and Transformer suffer from a performance drop as the instruction length increases. Seq-LSTM performs better when instructions are shorter, but struggles to generalize to longer instructions. On the other hand, Transformer may learn on a more semantic level, which leads to similar overall performance across the two types of instructions. Tree-RNN leverages the structure of programs and achieves a better performance.

Instruction diversity. We define the diversity of a program based on the number of control flows it contains (i.e. the number of branches). Figure 7.5 (b) shows a clear trend of performance drop for all the models. Transformer is more robust to diverse instructions, which could be attributed to its better ability to learn the semantics. While Seq-LSTM learning from programs is consistently better across different levels of diversity, Tree-RNN demonstrates the most consistent performance.

7.5.5 Policy Modulation

We investigate if learning a multitask policy with the learned modulation mechanism is more effective.
We compare against the two most commonly used methods: concatenating a state and goal in a raw space (Concat raw) or a latent space (Concat). An illustration is shown in Figure 7.4 (a). Since our state contains an environment map, which is encoded by a CNN and an MLP, we experiment with modulating convolutional feature maps (Modulation conv), feature vectors (Modulation fc), or both (Modulation conv fc). Figure 7.4 (b) demonstrates that the proposed policy modulation mechanism is more sample efficient. Table 7.1 shows that the multitask policy learned using modulation achieves better performance on task completion.

7.6 Conclusion

We propose to utilize programs, structured in a formal language, as an expressive and precise way to specify tasks instead of commonly used natural language instructions. We introduce the problem of developing a framework that can comprehend a program as well as perceive and interact with the environment to accomplish the desired task. To address this problem, we devise a modular framework, program guided agent, which executes programs with a program interpreter by alternating between querying a perception module when a branching condition is encountered and instructing a policy to fulfill subtasks. We employ a policy modulation mechanism to improve the efficiency of learning the multitask policy. The experimental results on a 2D Minecraft environment demonstrate that the proposed framework learns to reliably fulfill program instructions and generalizes well to more complex instructions without additional training. We also investigate the performance of various models that learn from programs and natural language descriptions in an end-to-end fashion.

7.7 Appendix

7.7.1 Program Execution

We describe our program interpreter in Section 7.4 and provide more details in this section. The program instruction considered in this work contains the following three components: (1) subtasks, (2) perceptions, and (3) control flows. Accordingly, our program interpreter is designed to consist of (1) a parser, which parses each line of the program following the grammar defined by our DSL in Figure 7.2, and (2) a program executor, which executes the program conditioned on the parsed contents. The interpreter transforms the program into a tree-like object by exploiting its structure (i.e. scopes) and then utilizes the parsed contents to traverse it to execute the program.

A program tree is built by representing each line of a program as a tree node. Each tree node is a data structure containing the following members: (1) node.line, the original line in the program, (2) node.isLeaf(), whether the current node is a leaf node (i.e. a subtask), and (3) node.children, all the subroutines of the current node (i.e. the processes under the scope of the current line of the program). The interpreter parses according to the original line contained in the node and decides whether to call the policy if it is a leaf node (subtask) or to produce a query to call the perception module, deciding which child node (subroutine) to go into. The subroutines of a node should correspond to the proper scoping of the program. For example, in the program shown in Figure 7.1, the line if is_there[River] has subroutines mine(Wood), build_bridge(), if agent[Iron]<3, and place(Iron,1,1), but not mine(Iron), which should be a subroutine of if agent[Iron]<3. Once the program tree is built, the program executor performs a pre-order traversal to initiate the execution. Algorithm 3 summarizes the details of the program execution utilizing the transformed program tree.
Algorithm 3 Program Execution
Require: P: program to be executed
Require: s: environmental state
Require: π: agent policy parameterized by θ; Φ: perception module
Require: node: has member node.line as the original program line and children nodes node.children
1:  procedure Execute(node)
2:      if node.isLeaf() then
3:          subtask = parse_subtask(node.line)
4:          π(subtask, θ)                             ▷ Calls the agent policy to execute the subtask
5:      else
6:          control_flow, perception_query = parse_ctrl_percept(node.line)
7:          h = Φ(perception_query, s)                ▷ Calls the perception module with a query and state
8:          control_flow h                            ▷ e.g. if, while, loop; calls the subroutines depending on h
9:          for child in node.children do
10:             Execute(child)
11:         end for
12:     end if
13: end procedure

7.7.2 DSL Design Principle

Since different domains require different DSLs, we aim to design our DSL by following a design principle that would potentially allow us to easily adapt our DSL to different domains. Specifically, we develop a DSL design principle that considers a general setting where an agent can perceive and interact with the environment to fulfill some tasks. Accordingly, our DSL consists of control flows, perceptions, and actions. While control flows are domain independent, perceptions and actions can be designed based on the domain of interest, which would require certain expertise and domain knowledge. We aim to design our DSL to be (1) intuitive: the actions and perceptions intuitively align with human common sense; (2) modular: actions are reasonably distinct and can be used to compose more complex behaviors; and (3) hierarchical: a proper level of abstraction enables describing long-horizon tasks.

7.7.3 Extended Related Work

We present an extended discussion of the related work in this section.

Multitask reinforcement learning. To achieve multi-task reinforcement learning, previous works devised hierarchical approaches where an RL agent is trained to achieve a series of subtasks to accomplish the task. In [11], a sequence of policy sketches is predefined to guide an agent towards the desired goal by leveraging modularized neural network policies. [216] propose to learn a controller to predict whether to proceed, revert, or stay at a current subgoal, which is sampled from a list of simple symbolic instructions. In this paper, hierarchical tasks are described by programs with increased diversity through branching conditions, and therefore our framework is required to determine which branches in a program should be executed. On the other hand, the framework proposed by [281] requires a subtask graph describing a set of subtasks and their dependencies and aims to find the optimal subtask to execute. This is different from our problem formulation, where the agent is asked to follow a given program/procedure.

Hierarchical reinforcement learning. Our work is also closely related to hierarchical reinforcement learning, where a meta-controller learns to predict which sub-policy to take at each time step [148, 20, 66, 84, 309, 160, 23, 203, 184]. Previous works also investigated explicitly specifying sub-policies with symbolic representations for the meta-controller to utilize, or an explicit selection process over lower-level motor skills [201, 299].

Programmable agents. We would like to emphasize that our work differs from programmable agents [59] in motivation, problem formulation, proposed methods, and contributions. First, [59] concern declarative programs, which specify what is to be computed (e.g. a target object in a reaching task).
However, the programs considered in our work are imperative, which specify how the result is to be computed (i.e. a procedure). Also, [59] consider only one-liner programs that contain only AND, OR, and object attributes. On the other hand, we consider programs that are much longer and describe more complex procedures. While [59] aim to generalize to novel combinations of object attributes, our work is mainly interested in generalizing to more complex tasks (i.e. programs) by leveraging the structure of programs.

Programs vs. natural language instructions. In this work, we advocate utilizing programs as a task representation and propose a modular framework that can leverage the structure of programs to address this problem. Yet, natural language instructions enjoy better accessibility and are more intuitive to users who do not have experience with programming languages. While addressing the accessibility of programs or converting a natural language instruction to a more structural form is beyond the scope of this work, we look forward to future research that leverages the strengths of both programs and natural language instructions by bridging the gap between these two representations, such as synthesizing programs from natural language [171, 60, 245], semantic parsing that bridges unstructured languages and structural formal languages [344, 343], and naturalizing programs [321].

7.7.4 Discussions on Learned Modulation Mechanisms

To fuse the information from an input domain (e.g. an image) with another condition domain (e.g. a language query, an image such as a segmentation map, noise, etc.), a wide range of works have demonstrated the effectiveness of predicting affine transforms based on the condition to scale and bias the input in visual question answering [231, 230], image synthesis [6, 136, 220, 121], style transfer [71], recognition [119, 330], reading comprehension [65], few-shot learning [217, 158], etc. Many of those works present an extensive ablation study to compare the learned modulation against traditional ways to merge the information from the input and condition domains.

Recently, a few works have employed a similar learned modulation technique in reinforcement learning frameworks for learning to follow language instructions [21] and for meta-reinforcement learning [316, 315]. However, there has not been a comprehensive ablation study on how to fuse the information from the input domain (e.g. a state) and the condition domain (e.g. a goal or a task embedding) in the reinforcement learning setting. In this work, we conduct an ablation study in our 2D Minecraft environment, where an agent is required to fulfill a navigational task specified by a program, and show the effectiveness of learning to modulate input features with a symbolically represented goal, as well as present a number of modulation variations (i.e. modulating the fully-connected layers, the convolutional layers, or both). We look forward to future research that verifies if this learned modulation mechanism is effective in dealing with more complex domains such as robot manipulation or locomotion.

7.7.5 Additional Experimental Details

7.7.5.1 Environment Details

In the following paragraphs, we provide some details of the environment used in this work.

Objects in the environment. The major environmental resources that the agent can interact with are wood, gold, and iron. There is a certain probability that the environment will contain a river, which the agent cannot go across unless a bridge is built (or pre-built).
The environment is surrounded by brick walls, which draw the boundaries of the world.

Agent action space. The agent's actions are (1) crafting actions, including mining (collecting resources), placing, building a bridge, and selling an item; and (2) motor actions, including moving in four directions (up, down, left, right). The crafting actions are only allowed on the grid cell the agent is currently standing on, e.g. the subtask mine(gold) requires the agent to navigate to a specific location containing a gold with motor actions, and then perform the crafting action mine at the current location. To build a bridge, the agent consumes one of the wood it possesses. To sell an item, the agent needs to travel to a merchant. With certain probabilities, there can be 2 to 4 merchants at different locations.

Initialization. During training, when each training program is sampled, a valid environment is randomly initialized, where validity refers to the property that the agent will be able to successfully follow the program with sufficient environmental resources provided. At test time, we pre-sample 20 valid initializations of the environment with 20 different random seeds to ensure the validity of the two test sets.

Agent observation space (state representations). The state used by our reinforcement learning policy consists of an environment map s_map and an inventory status of the agent s_inv. s_map is of size 10×10×9, where each channel-wise slice of size 10×10×1 represents the binary existence of a certain object at a specific location, e.g. if (3,4,2) is 1, there is a gold at location (3,4) (environment objects in the channel dimension are zero-indexed). The objects represented by the channels are ordered as follows: wood, iron, gold, agent, wall, goal (a 2-D representation of the intended goal coordinates), river, bridge, and merchant. The agent inventory status s_inv is augmented with the agent location, resulting in a 1-D integer vector of size 5. The ordered entries of this vector are: the agent's wood count, the agent's iron count, the agent's gold count, the agent's location coordinate x, and then y.

Goal representations. For our proposed framework, we represent the goal of a subtask as a 1-D vector of size 10, produced by the program interpreter. The first five entries of the goal vector are a one-hot representation of the subtask: goto, place, mine, sell, build_bridge. The 6th to 8th entries are a one-hot representation of the resources: wood, iron, and gold. The last two entries are the intended goal location. For example, place(iron,3,5) will be represented as [0,1,0,0,0,0,1,0,3,5]. For end-to-end learning models, such a goal representation is produced by the input encoder as a continuous latent vector representation.

Figure 7.6: Exemplar rendered environment maps. The agent, objects, and terrain are represented as blocks with their corresponding textures. Specifically, the agent is represented as a female character. Gold is represented as a golden block, wood is shown as a tree, and iron is represented as a silver block. River is shown as a blue grid cell with a water texture, while bridge is presented as a wooden grid cell. Merchant is shown as an alpaca, which is supposed to transport the sold objects. Notice that there are 2 merchants in (a) and (b), while (c) and (d) contain 3 and 4 of them, respectively. The boundaries of the map are shown as brick walls.

Exemplar environment maps. We show several exemplary rendered environment maps in Figure 7.6.
As can be seen, the essential resources such as wood, gold, and iron are represented as block objects, and the merchant is depicted by an alpaca. The agent is shown as a female human character. River grid cells are shown in blue with a water texture, and bridge blocks built on them, which are created by consuming wood, have a wooden texture. The boundaries can be seen in the surroundings, represented as brick wall grids.

7.7.5.2 Ground Truth Perceptions for End-to-end Learning Baselines

Since we train our perception module using ground truth information, we also provide the ground truth perception information to all our baselines, a detail not elaborated in the original paper. Specifically, at every time step, we feed the ground truth perception (i.e. the answers to queries such as env[Gold]>0 and is_there[River]) to the baseline models. The ground truth perception is represented as a vector whose dimension is the number of all possible queries, where each element corresponds to a binary answer to a query. Therefore, the baseline models can learn to utilize this ground truth information to infer the desired subtasks. During testing, the baseline models can still access all this ground truth perception information, even though this is usually not possible in practice. On the contrary, during testing, our perception module predicts the answers to the given queries, and the performance of the whole framework depends on the predicted answers.

7.7.5.3 Task Instructions Details

Programs. We generate the program sets by sampling program tokens with normal distributions and constructing programs according to the DSL grammar we define. The training set is composed of programs with on average 32 tokens and 4.6 lines; the more complex test set, i.e. test-complex, contains on average 65 tokens and 9.8 lines. We include the plotted statistics of various essential properties of the three datasets in Figure 7.11, Figure 7.12, and Figure 7.13, respectively. Note that the maximum indent of a program is the maximum depth of its scope, or the height of its transformed program tree. The number of recurring procedures includes both while and loop.

Natural language instructions. For each of the three program sets, we chunked the programs into several subsets and assigned them to annotators for their corresponding natural language translations. The annotators were instructed to read the provided DSL to understand the details of the program syntax, as well as some exemplary translations, before they were allowed to start the task. The annotators were encouraged to give diverse and colloquial translations to avoid constantly giving dull line-by-line translations. The collected (translated) natural language instructions were then cleaned with spell checks and grammatical error fixes. On average, the annotators used 27, 28, and 61 words to describe the instructions for the train, test, and test-complex sets, respectively. The total vocabulary size of the natural language instructions is 448.

Qualitative results on natural language analysis. We show several example data points from our testing sets in Figure 7.10. The leftmost column displays natural language instructions, the middle column shows our sampled ground truth programs, while the rightmost column illustrates how language can be ambiguous and lead to possible alternative interpreted programs.

7.7.5.4 Network Architectures

The proposed framework and the end-to-end learning baselines are implemented in TensorFlow [1].

Our framework

Perception module.
The perception module takes a query q and a state s as input and outputs a response h. A query has a size of 6×186, since the longest query has a length of 6 and 186 is the dimension of the one-hot program tokens. Shorter queries are zero-padded to this size. The state map s_map is encoded by a four-layer CNN with channel sizes of 32, 64, 96, and 128. Each convolutional layer has kernel size 3 and stride 2 and is followed by a ReLU nonlinearity. The final feature map is flattened to a feature vector, denoted as f_m. The state inventory s_inv is encoded by a two-layer MLP with a channel size of 32 for both layers. Each fully-connected layer is followed by a ReLU nonlinearity. The resulting feature vector is denoted as f_i.

Each token in the query is first encoded by a two-layer MLP with a channel size of 32 for both layers. Each fully-connected layer is followed by a ReLU nonlinearity. Then, all the query token features are concatenated along the feature dimension into a single vector. This vector is then encoded by another two-layer MLP with a channel size of 32 for both layers. Each fully-connected layer is followed by a ReLU nonlinearity. The resulting feature vector is denoted as f_q.

All encoded features (f_m, f_i, and f_q) are then concatenated along the feature dimension into a single vector. This vector is processed by a three-layer MLP with channel sizes of 128, 64, and 32. Each fully-connected layer is followed by a ReLU nonlinearity. Finally, a linear fully-connected layer produces an output of size 1, which should have a higher value if the response to the query is true and a lower value otherwise.

Policy. The policy takes a goal g and a state s as input and outputs an action distribution a, where the state is encoded by two types of modules: (1) a four-layer CNN encoder to encode the state map s_map, and (2) a two-layer MLP to encode the agent inventory status s_inv.

The goal g is encoded by a two-layer MLP with a channel size of 64 for both layers. Each fully-connected layer is followed by a ReLU nonlinearity. The resulting feature vector is denoted as f_g. Given the encoded goal vector, we employ four linear fully-connected layers to predict modulation parameters {γ_i, β_i}_{i=1,...,4} for the state CNN encoder, where γ_1 and β_1 have size 32, γ_2 and β_2 have size 64, γ_3 and β_3 have size 96, and γ_4 and β_4 have size 128. Note that these modulation parameters are predicted for modulating convolutional features (i.e. modulation conv). For modulation fc, a linear fully-connected layer is used to produce γ_fc and β_fc with size 64.

The state map s_map is encoded by a four-layer CNN with channel sizes of 32, 64, 96, and 128. Each convolutional layer has kernel size 3 and stride 2 and is followed by a ReLU nonlinearity. After each convolutional layer, the produced feature maps e are modulated to γ·e + β, where γ and β are broadcast along the spatial dimensions. The final feature map is flattened to a feature vector, denoted as f^π_m. The state inventory s_inv is encoded by a two-layer MLP with a channel size of 64 for both layers. Each fully-connected layer is followed by a ReLU nonlinearity. The resulting feature vector is denoted as f^π_i.

The two encoded features (f^π_m and f^π_i) are then concatenated along the feature dimension. Two fully-connected layers with a channel size of 64 are used to process the feature. Each layer is followed by a ReLU nonlinearity. The final encoded feature u is then modulated to γ_fc·u + β_fc if modulation fc is used.
Finally, the modulated features û are used to produce an action distribution a and a predicted value V using two separate MLPs. Each MLP has two fully-connected layers with a channel size of 64 for both layers. A linear layer then outputs a vector of size 8 (the number of low-level actions). Another linear layer outputs a vector of size 1 as the predicted value.

End-to-end learning models. In addition to the input encoder, the end-to-end learning models can utilize a mechanism to remember which subtasks from the instructions have been accomplished. The agent can then explicitly memorize where it stands in the instruction while completing the task. We implement such a memorization mechanism using another LSTM network, which takes as input the encoded states throughout the execution trajectory. After the agent takes each action, the last hidden state encoding the trajectory up to the current step is used to compute attention scores to pool the outputs of the input encoders. For the Tree-RNN encoder, we simply concatenate the hidden representation from the memorization LSTM with the root representation of the Tree-RNN before feeding them to subsequent layers. The agent policy network then learns to perform the task conditioned on this attention-pooled latent instruction vector. We provide details of our various end-to-end learning models in Table 7.2. The program token embedding is jointly trained with the whole module, while GloVe [229] (50-D version) is used for word embeddings when instructions are in natural language.

Table 7.2: Architectural details for end-to-end learning models.

Seq-LSTM (0.62M parameters): LSTM size of 128; both program and word embeddings are of dimension 50. Attention LSTM size of 128. Attention weights of size [256×128], with bias of size [128]. Word embeddings utilize pre-trained GloVe.

Tree-RNN (0.51M parameters): Program embeddings are of dimension 128. Attention LSTM size of 128. The composition module (to aggregate all the children representations of a node) is of size [128×128], and the output projection weights are of size [128×128], with bias of size [128]. The program embeddings are average-pooled across the same program line, so that each line is mapped to a fixed-dimension representation. The composition layer is applied when combining the pooled embeddings from all the children of a node.

Transformer (2.63M parameters): Number of hidden layers: 2, with 8 attention heads and an intermediate size of 256. Hidden size is 128. No dropout is applied.

7.7.5.5 Raw RGB Input

To verify if our framework can be extended to high-dimensional raw state inputs (i.e. RGB images), where a hand-crafted policy or perception module might not be easy to obtain, we performed an additional experiment where the perception module and the policy are trained on raw RGB inputs instead of the symbolic state representation. The results suggest that our framework can utilize RGB inputs while maintaining similar performance (93.2% on the test set) and generalization ability (91.4% on the test-complex set).

7.7.5.6 Failure Analysis

To gain a better understanding of how our proposed framework and the end-to-end learning models work or fail, we conduct a detailed failure analysis on the execution traces of our model. The analysis is organized as follows:

• We first present an analysis of our framework on the subtasks that appear to be the first failed subtask, which immediately leads to failing the whole task.
This analysis sheds some light on which subtasks most commonly cause the failure of task execution. (Section 7.7.5.6)

• We show an analysis of how many time steps each successfully executed subtask takes on average for our framework, through which we explain which subtasks we find to be harder than others. (Section 7.7.5.6)

• We show additional visualizations of the completion rates of different end-to-end learning models plotted with metrics not shown in the main paper, where we aim to deliver a more complete view of how these models perform. (Section 7.7.5.6)

First failure rate of subtasks. As the first step of the failure analysis, we want to get an idea of which subtasks cause the failure of the model in executions more often. To make this possible, we define the "first failed subtask" as the first subtask that ends as a failure in an unsuccessful execution of a program. Based on this definition, we further define the "first failure rate" as the percentage of occurrences of a specific subtask that turn out to be the first failed subtask of the execution that includes them.

Figure 7.7: First failure rate of subtasks. Every colored grid shows the first failure rate of each subtask. From top-left to bottom-right, each block of grids shows the results for the subtask categories goto, place, build_bridge, mine, and sell. Warmer colors indicate a higher first failure rate, while colder colors indicate a lower first failure rate. White grids indicate subtasks that either never occur as the first failed subtask in any execution or do not exist in the executions that lead to this figure.

Figure 7.8: Average time cost of subtasks. The setup of this plot is similar to that of Figure 7.7. Warmer colors indicate a higher average subtask time cost, while colder colors indicate a lower average subtask time cost. White grids indicate subtasks that do not exist in the executions that lead to this figure.

We collect the first failure rate of all subtasks for the result we obtain from running our full model over the more complex test set, i.e. test-complex. The results are plotted in a visually interpretable format in Figure 7.7. As seen in the figure, subtasks in the goto and place categories are more likely to be the first failed subtask than subtasks in the build_bridge, mine, and sell categories. Within the goto and place subtask categories, subtasks requiring the agent to navigate to grid cells near the border of the world have a higher first failure rate than ones near the center of the world. This shows that these subtasks are more prone to failure than other subtasks.
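To make the metric precise, the following sketch computes the first failure rate defined above from execution traces; the trace format (a per-episode list of (subtask, success) pairs in execution order) is an assumption for illustration, not our logging format.

from collections import defaultdict

def first_failure_rates(traces):
    """traces: list of episodes, each a list of (subtask, success) pairs in execution order.
    For each subtask, returns the fraction of its occurrences that were the first failed
    subtask of the episode containing them."""
    occurrences = defaultdict(int)
    first_failures = defaultdict(int)
    for episode in traces:
        first_failed = next((name for name, ok in episode if not ok), None)
        for name, _ in episode:
            occurrences[name] += 1
        if first_failed is not None:
            first_failures[first_failed] += 1
    return {name: first_failures[name] / occurrences[name] for name in occurrences}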
Figure 7.9: Additional analysis on completion rates. The results of executing program instructions on both datasets are used to produce six plots of completion rate against the number of lines, the number of if or else statements, the number of loop or while statements, the max indent, the number of subtasks, and the number of branches. In each plot, each color corresponds to a different model (Seq-LSTM, Tree-RNN, and Transformer). For the two rightmost plots, a very small number of outliers that extend beyond the right boundary of the plot are omitted for visual interpretability. Please note that the use of colors in this figure is not the same as that in Figure 7.5.

Average time cost of subtasks. Continuing from the previous analysis, we show the average time cost of all successful subtask executions in Figure 7.8. As can be seen in the figure, subtasks in the build_bridge, mine, and sell categories take a relatively small number of time steps to complete. In the goto and place categories, the closer to the border the subtask requires the agent to reach, the more time-consuming it is for the agent to complete the subtask. This corresponds to the finding in the analysis of first failure rates that subtasks with destinations close to the border are more likely to fail. In other words, the closer to the border the agent has to reach, the more likely it is to fail the subtask.

Additional analysis on end-to-end learning model completion rates. To conclude the failure analysis, we focus on the variation of completion rates of program executions with respect to different conditioning variables. As shown in Figure 7.9, the trends of completion rates evaluated against different independent variables show that execution failure is more common when the program consists of a larger number of lines, more loop and while statements, and a larger number of subtasks. Note that this subtask count is only a summation of the occurrence counts of each subtask in a program, which does not accurately reflect the number of times each subtask is invoked (i.e. it does not reflect the repetitive counts when there is a loop). Meanwhile, the effect of the number of if and else statements and the maximum indent value of a program on completion rates seems to vary across different models. For the Seq-LSTM model, having a larger number of if and else statements or a larger max indent value results in more failures, while for the Transformer and Tree-RNN models, having larger values results instead in fewer failures. This is probably because the Transformer and Tree-RNN models are designed in a way that better handles hierarchical structures with jumps in instruction execution (this point is also mentioned in the main paper). Despite this difference, the overall change in performance as the number of if and else statements and the maximum indent value change is much less significant than that in the previous case.

During our analysis, we also designed an algorithm to calculate an estimate of the number of branches a program has. Here, the number of branches is defined as the number of distinct sets of lines that an execution of the program can follow.
For a program without control flows (no if, else-if, else, loop, or while statements), the number of branches is always 1. For if, else-if, and else statements, the exact number of branches these statements incur can be calculated easily. In the cases of loop and while statements, we treat loops as being executed only once and while statements as if statements when we calculate the number of branches. The result shown in the analysis does not reveal a clear trend. We attribute this to two possibilities: either the metric we create is not accurate enough, or it is not a very suitable metric to inspect.

Figure 7.10: Exemplar data and language ambiguity. Each example pairs a natural language instruction with its ground truth program and a possible alternative interpretation. The goal of the examples is to show that natural language instructions, while being flexible enough to capture the high-level semantics of the task, can be ambiguous in different ways and thus might lead to impaired performance. In example (a), the modifier "repeat the following 3 times" has an unclear scope, resulting in two possible interpretations shown in program format on the right side; in example (b), "repeat 4 times" can be used to modify either the previous part of the description or the latter part of it, resulting in ambiguity; in example (c), the last sentence starting with "If" has an unclear scope. In all of the above cases, a model that learns to execute instructions presented in natural language format might fail to execute the instructions successfully because of the ambiguity of the language instructions.

7.7.5.7 Hyperparameters

We use the following hyperparameters to train A2C agents for our model and all the end-to-end learning models: learning rate: 1×10^-3, number of environments: 64, number of workers: 64, and number of update roll-out steps: 5.

7.7.5.8 Computational Resources

We train all our models on a single Nvidia Titan-X GPU in a 40-core Ubuntu 16.04 Linux server.

Figure 7.11: Program set statistics for the training set (train), showing histograms of the number of branches, max indent, number of subtasks, number of lines, number of recurring procedures (loops), and number of tokens.
Figure 7.12: Program set statistics for the same-complexity testing set (test), showing histograms of the same properties.

Figure 7.13: Program set statistics for the more complex testing set (test-complex), showing histograms of the same properties.

Chapter 8

Learning to Compose Skills

8.1 Introduction

While humans are capable of learning complex tasks by reusing previously learned skills, composing and mastering complex skills is not as trivial as sequentially executing those acquired skills. Instead, it requires a smooth transition between skills, since the final pose of one skill may not be appropriate to initiate the following one. For example, scoring in basketball with a quick shot after receiving a ball can be decomposed into catching and shooting. However, it is still difficult for beginners who have learned to catch passes and statically shoot. To master this skill, players must practice adjusting their footwork and body into a comfortable shooting pose after catching a pass. Can machines similarly learn new and complex tasks by reusing acquired skills and learning transitions between them?

Learning to perform composite and long-term tasks from scratch requires extensive exploration and sophisticated reward design, which can introduce undesired behaviors [250]. Thus, instead of employing intricate reward functions and learning from scratch, modular methods sequentially execute acquired skills with a rule-based meta-policy, enabling machines to solve complicated tasks [222, 201, 11]. These modular approaches assume that a task can be clearly decomposed into several subtasks which are smoothly connected to each other. In other words, an ending state of one subtask falls within the set of starting states, the initiation set, of the next subtask [291]. However, this assumption does not hold in many continuous control problems, where a given skill may be executed from starting states not considered during training or design and thus fail to achieve its goal.

To bridge the gap between skills, we propose a transition policy which learns to smoothly navigate from an ending state of a skill to suitable initial states of the following skill, as illustrated in Figure 8.1. However, learning a transition policy between skills without reward shaping is difficult, as the only available learning signal is the sparse reward for the successful execution of the next skill. A sparse success/failure reward is challenging to learn from due to the temporal credit assignment problem [292] and the lack of information from failing trajectories. To alleviate these problems, we propose a proximity predictor which outputs the proximity to the initiation set of the next skill and acts as a dense reward function for the transition policy.
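As a preview of how such a dense signal can be used, the sketch below shows one plausible way a learned proximity predictor could shape the transition policy's reward: the agent is rewarded for increasing the predicted proximity to the next skill's initiation set, plus the sparse reward when the next skill succeeds. This particular shaping is an illustrative assumption, not the exact formulation developed later in this chapter.

def transition_reward(proximity_predictor, state, next_state, success, sparse_reward=1.0):
    """Dense reward for a transition policy: the change in predicted proximity to the
    initiation set of the next skill, plus the sparse reward when the next skill succeeds.
    proximity_predictor(s) is assumed to return a scalar in [0, 1]."""
    dense = proximity_predictor(next_state) - proximity_predictor(state)
    return dense + (sparse_reward if success else 0.0)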
The main contributions of this paper include (1) the concept of learning transition policies to smoothly connect primitive skills; (2) a novel modular framework with transition policies that is able to compose complex skills by reusing existing skills; and (3) a joint training algorithm with the proximity predictor specifically designed for efficiently training transition policies. This framework is suited for learning complex skills that require sequential execution of acquired primitive skills, which are common for humans yet relatively unexplored in robot learning. Our experiments on simulated environments demonstrate that employing transition policies solves complex continuous control tasks which traditional policy gradient methods struggle at.

8.2 Related Work

Learning continuous control of diverse behaviors in locomotion [190, 111, 228] and robotic manipulation [91] is an active research area in reinforcement learning (RL). While some complex tasks can be solved through extensive reward engineering [209], undesired behaviors often emerge [250] when tasks require several different primitive skills. Moreover, training complex skills from scratch is not computationally practical.

Figure 8.1: Concept of a transition policy. Composing complex skills using primitive skills requires smooth transitions between primitive skills, since a following primitive skill might not be robust to the ending states of the previous one. In this example, the ending states (red circles) of the primitive policy $p_{\text{jump}}$ are not good initial states to execute the following policy $p_{\text{walk}}$. Therefore, executing $p_{\text{walk}}$ from these states will fail (red arrow). To smoothly connect the two primitive policies, we propose a transition policy which navigates the agent to suitable initial states for $p_{\text{walk}}$ (dashed arrow), leading to a successful execution of $p_{\text{walk}}$ (green arrow).

Real-world tasks often require diverse behaviors and longer temporal dependencies. In hierarchical reinforcement learning, the option framework [291] learns meta actions (options), a series of primitive actions over a period of time. Typically, a hierarchical reinforcement learning framework consists of two components: a high-level meta-controller and low-level controllers. A meta-controller determines the order of subtasks to achieve the final goal and chooses corresponding low-level controllers that generate a sequence of primitive actions. Unsupervised approaches to discover meta actions have been proposed [260, 56, 20, 309, 66, 164, 84, 248, 184]. However, to deal with more complex tasks, additional supervision signals [11, 190, 299] or pre-defined low-level controllers [148, 216] are required. To exploit pre-trained modules as low-level controllers, neural module networks [12] have been proposed, which construct a new network dedicated to a given query using a collection of reusable modules. In the RL domain, a meta-controller is trained to follow instructions [216] and demonstrations [332], and to support multi-level hierarchies [99]. In the robotics domain, Pastor et al. [222], Kober et al. [142], and Mülling et al. [201] have proposed a modular approach that learns table tennis by selecting appropriate low-level controllers. On the other hand, Andreas, Klein, and Levine [11] and Frans et al.
[84] learn abstract skills while experiencing a distribution of tasks and then solve a new task with the learned primitive skills. However, these modular approaches result in undefined behavior when two skills are not smoothly connected.
sha1_base64="Dk1kSmpn8/yAFc/1TZ8TK1W+hyU=">AAAB63icbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMeAF48RzAOSJcxOZpMhM7PLTK8QQn7BiwdFvPpD3vwbZ5M9aGJBQ1HVTXdXlEph0fe/vdLG5tb2Tnm3srd/cHhUPT5p2yQzjLdYIhPTjajlUmjeQoGSd1PDqYok70STu9zvPHFjRaIfcZryUNGRFrFgFHOpjzQbVGt+3V+ArJOgIDUo0BxUv/rDhGWKa2SSWtsL/BTDGTUomOTzSj+zPKVsQke856imittwtrh1Ti6cMiRxYlxpJAv198SMKmunKnKdiuLYrnq5+J/XyzC+DWdCpxlyzZaL4kwSTEj+OBkKwxnKqSOUGeFuJWxMDWXo4qm4EILVl9dJ+6oe+PXg4brWuC7iKMMZnMMlBHADDbiHJrSAwRie4RXePOW9eO/ex7K15BUzp/AH3ucPHPeOOg==</latexit> <latexit sha1_base64="Dk1kSmpn8/yAFc/1TZ8TK1W+hyU=">AAAB63icbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMeAF48RzAOSJcxOZpMhM7PLTK8QQn7BiwdFvPpD3vwbZ5M9aGJBQ1HVTXdXlEph0fe/vdLG5tb2Tnm3srd/cHhUPT5p2yQzjLdYIhPTjajlUmjeQoGSd1PDqYok70STu9zvPHFjRaIfcZryUNGRFrFgFHOpjzQbVGt+3V+ArJOgIDUo0BxUv/rDhGWKa2SSWtsL/BTDGTUomOTzSj+zPKVsQke856imittwtrh1Ti6cMiRxYlxpJAv198SMKmunKnKdiuLYrnq5+J/XyzC+DWdCpxlyzZaL4kwSTEj+OBkKwxnKqSOUGeFuJWxMDWXo4qm4EILVl9dJ+6oe+PXg4brWuC7iKMMZnMMlBHADDbiHJrSAwRie4RXePOW9eO/ex7K15BUzp/AH3ucPHPeOOg==</latexit> = termination 1 2 3 4 a t ,⌧ trans <latexit sha1_base64="0eQ/607gU1FaD6YVSJ6+yrxGHbk=">AAACAHicbZDLSsNAFIYnXmu9RV24cBMsggspiRR0WXDjsoK9QBvCyXTSDp1MwsyJWEI2voobF4q49THc+TZOLwtt/WHg4z/ncOb8YSq4Rtf9tlZW19Y3Nktb5e2d3b19++CwpZNMUdakiUhUJwTNBJesiRwF66SKQRwK1g5HN5N6+4EpzRN5j+OU+TEMJI84BTRWYB9DgBc9hCzIe8geMUcFUhdFYFfcqjuVswzeHCpkrkZgf/X6Cc1iJpEK0LrruSn6OSjkVLCi3Ms0S4GOYMC6BiXETPv59IDCOTNO34kSZZ5EZ+r+nsgh1noch6YzBhzqxdrE/K/WzTC69nMu0wyZpLNFUSYcTJxJGk6fK0ZRjA0AVdz81aFDUEDRZFY2IXiLJy9D67LqGb6rVeq1eRwlckJOyTnxyBWpk1vSIE1CSUGeySt5s56sF+vd+pi1rljzmSPyR9bnD8tRlyE=</latexit> <latexit sha1_base64="0eQ/607gU1FaD6YVSJ6+yrxGHbk=">AAACAHicbZDLSsNAFIYnXmu9RV24cBMsggspiRR0WXDjsoK9QBvCyXTSDp1MwsyJWEI2voobF4q49THc+TZOLwtt/WHg4z/ncOb8YSq4Rtf9tlZW19Y3Nktb5e2d3b19++CwpZNMUdakiUhUJwTNBJesiRwF66SKQRwK1g5HN5N6+4EpzRN5j+OU+TEMJI84BTRWYB9DgBc9hCzIe8geMUcFUhdFYFfcqjuVswzeHCpkrkZgf/X6Cc1iJpEK0LrruSn6OSjkVLCi3Ms0S4GOYMC6BiXETPv59IDCOTNO34kSZZ5EZ+r+nsgh1noch6YzBhzqxdrE/K/WzTC69nMu0wyZpLNFUSYcTJxJGk6fK0ZRjA0AVdz81aFDUEDRZFY2IXiLJy9D67LqGb6rVeq1eRwlckJOyTnxyBWpk1vSIE1CSUGeySt5s56sF+vd+pi1rljzmSPyR9bnD8tRlyE=</latexit> <latexit sha1_base64="0eQ/607gU1FaD6YVSJ6+yrxGHbk=">AAACAHicbZDLSsNAFIYnXmu9RV24cBMsggspiRR0WXDjsoK9QBvCyXTSDp1MwsyJWEI2voobF4q49THc+TZOLwtt/WHg4z/ncOb8YSq4Rtf9tlZW19Y3Nktb5e2d3b19++CwpZNMUdakiUhUJwTNBJesiRwF66SKQRwK1g5HN5N6+4EpzRN5j+OU+TEMJI84BTRWYB9DgBc9hCzIe8geMUcFUhdFYFfcqjuVswzeHCpkrkZgf/X6Cc1iJpEK0LrruSn6OSjkVLCi3Ms0S4GOYMC6BiXETPv59IDCOTNO34kSZZ5EZ+r+nsgh1noch6YzBhzqxdrE/K/WzTC69nMu0wyZpLNFUSYcTJxJGk6fK0ZRjA0AVdz81aFDUEDRZFY2IXiLJy9D67LqGb6rVeq1eRwlckJOyTnxyBWpk1vSIE1CSUGeySt5s56sF+vd+pi1rljzmSPyR9bnD8tRlyE=</latexit> <latexit sha1_base64="0eQ/607gU1FaD6YVSJ6+yrxGHbk=">AAACAHicbZDLSsNAFIYnXmu9RV24cBMsggspiRR0WXDjsoK9QBvCyXTSDp1MwsyJWEI2voobF4q49THc+TZOLwtt/WHg4z/ncOb8YSq4Rtf9tlZW19Y3Nktb5e2d3b19++CwpZNMUdakiUhUJwTNBJesiRwF66SKQRwK1g5HN5N6+4EpzRN5j+OU+TEMJI84BTRWYB9DgBc9hCzIe8geMUcFUhdFYFfcqjuVswzeHCpkrkZgf/X6Cc1iJpEK0LrruSn6OSjkVLCi3Ms0S4GOYMC6BiXETPv59IDCOTNO34kSZZ5EZ+r+nsgh1noch6YzBhzqxdrE/K/WzTC69nMu0wyZpLNFUSYcTJxJGk6fK0ZRjA0AVdz81aFDUEDRZFY2IXiLJy9D67LqGb6rVeq1eRwlckJOyTnxyBWpk1vSIE1CSUGeySt5s56sF+vd+pi1rljzmSPyR9bnD8tRlyE=</latexit> a t ,⌧ p c <latexit 
sha1_base64="SXcjfjd44UHVDxLievnEKU+0ZeE=">AAAB9XicbZBNS8NAEIY39avWr6pHL4tF8CAlkYIeC148VrAf0MYw2W7bpZtN2J0oJfR/ePGgiFf/izf/jds2B219YeHhnRlm9g0TKQy67rdTWFvf2Nwqbpd2dvf2D8qHRy0Tp5rxJotlrDshGC6F4k0UKHkn0RyiUPJ2OL6Z1duPXBsRq3ucJNyPYKjEQDBAaz1AgBc9hDTIkoBNg3LFrbpz0VXwcqiQXI2g/NXrxyyNuEImwZiu5yboZ6BRMMmnpV5qeAJsDEPetagg4sbP5ldP6Zl1+nQQa/sU0rn7eyKDyJhJFNrOCHBklmsz879aN8XBtZ8JlaTIFVssGqSSYkxnEdC+0JyhnFgApoW9lbIRaGBogyrZELzlL69C67LqWb6rVeq1PI4iOSGn5Jx45IrUyS1pkCZhRJNn8krenCfnxXl3PhatBSefOSZ/5Hz+AJw1koc=</latexit> <latexit sha1_base64="SXcjfjd44UHVDxLievnEKU+0ZeE=">AAAB9XicbZBNS8NAEIY39avWr6pHL4tF8CAlkYIeC148VrAf0MYw2W7bpZtN2J0oJfR/ePGgiFf/izf/jds2B219YeHhnRlm9g0TKQy67rdTWFvf2Nwqbpd2dvf2D8qHRy0Tp5rxJotlrDshGC6F4k0UKHkn0RyiUPJ2OL6Z1duPXBsRq3ucJNyPYKjEQDBAaz1AgBc9hDTIkoBNg3LFrbpz0VXwcqiQXI2g/NXrxyyNuEImwZiu5yboZ6BRMMmnpV5qeAJsDEPetagg4sbP5ldP6Zl1+nQQa/sU0rn7eyKDyJhJFNrOCHBklmsz879aN8XBtZ8JlaTIFVssGqSSYkxnEdC+0JyhnFgApoW9lbIRaGBogyrZELzlL69C67LqWb6rVeq1PI4iOSGn5Jx45IrUyS1pkCZhRJNn8krenCfnxXl3PhatBSefOSZ/5Hz+AJw1koc=</latexit> <latexit sha1_base64="SXcjfjd44UHVDxLievnEKU+0ZeE=">AAAB9XicbZBNS8NAEIY39avWr6pHL4tF8CAlkYIeC148VrAf0MYw2W7bpZtN2J0oJfR/ePGgiFf/izf/jds2B219YeHhnRlm9g0TKQy67rdTWFvf2Nwqbpd2dvf2D8qHRy0Tp5rxJotlrDshGC6F4k0UKHkn0RyiUPJ2OL6Z1duPXBsRq3ucJNyPYKjEQDBAaz1AgBc9hDTIkoBNg3LFrbpz0VXwcqiQXI2g/NXrxyyNuEImwZiu5yboZ6BRMMmnpV5qeAJsDEPetagg4sbP5ldP6Zl1+nQQa/sU0rn7eyKDyJhJFNrOCHBklmsz879aN8XBtZ8JlaTIFVssGqSSYkxnEdC+0JyhnFgApoW9lbIRaGBogyrZELzlL69C67LqWb6rVeq1PI4iOSGn5Jx45IrUyS1pkCZhRJNn8krenCfnxXl3PhatBSefOSZ/5Hz+AJw1koc=</latexit> <latexit sha1_base64="SXcjfjd44UHVDxLievnEKU+0ZeE=">AAAB9XicbZBNS8NAEIY39avWr6pHL4tF8CAlkYIeC148VrAf0MYw2W7bpZtN2J0oJfR/ePGgiFf/izf/jds2B219YeHhnRlm9g0TKQy67rdTWFvf2Nwqbpd2dvf2D8qHRy0Tp5rxJotlrDshGC6F4k0UKHkn0RyiUPJ2OL6Z1duPXBsRq3ucJNyPYKjEQDBAaz1AgBc9hDTIkoBNg3LFrbpz0VXwcqiQXI2g/NXrxyyNuEImwZiu5yboZ6BRMMmnpV5qeAJsDEPetagg4sbP5ldP6Zl1+nQQa/sU0rn7eyKDyJhJFNrOCHBklmsz879aN8XBtZ8JlaTIFVssGqSSYkxnEdC+0JyhnFgApoW9lbIRaGBogyrZELzlL69C67LqWb6rVeq1PI4iOSGn5Jx45IrUyS1pkCZhRJNn8krenCfnxXl3PhatBSefOSZ/5Hz+AJw1koc=</latexit> ⌧ p c <latexit sha1_base64="QOOQ5ofVdAHhLfY+VT0BMXS8/4g=">AAAB8XicbZBNS8NAEIYn9avWr6pHL4tF8FQSKeix4MVjBfuBbQib7aZdutmE3YlQQv+FFw+KePXfePPfuG1z0NYXFh7emWFn3jCVwqDrfjuljc2t7Z3ybmVv/+DwqHp80jFJphlvs0QmuhdSw6VQvI0CJe+lmtM4lLwbTm7n9e4T10Yk6gGnKfdjOlIiEoyitR4HSLMgTwM2C6o1t+4uRNbBK6AGhVpB9WswTFgWc4VMUmP6npuin1ONgkk+qwwyw1PKJnTE+xYVjbnx88XGM3JhnSGJEm2fQrJwf0/kNDZmGoe2M6Y4Nqu1uflfrZ9hdOPnQqUZcsWWH0WZJJiQ+flkKDRnKKcWKNPC7krYmGrK0IZUsSF4qyevQ+eq7lm+b9SajSKOMpzBOVyCB9fQhDtoQRsYKHiGV3hzjPPivDsfy9aSU8ycwh85nz/ij5D/</latexit> <latexit sha1_base64="QOOQ5ofVdAHhLfY+VT0BMXS8/4g=">AAAB8XicbZBNS8NAEIYn9avWr6pHL4tF8FQSKeix4MVjBfuBbQib7aZdutmE3YlQQv+FFw+KePXfePPfuG1z0NYXFh7emWFn3jCVwqDrfjuljc2t7Z3ybmVv/+DwqHp80jFJphlvs0QmuhdSw6VQvI0CJe+lmtM4lLwbTm7n9e4T10Yk6gGnKfdjOlIiEoyitR4HSLMgTwM2C6o1t+4uRNbBK6AGhVpB9WswTFgWc4VMUmP6npuin1ONgkk+qwwyw1PKJnTE+xYVjbnx88XGM3JhnSGJEm2fQrJwf0/kNDZmGoe2M6Y4Nqu1uflfrZ9hdOPnQqUZcsWWH0WZJJiQ+flkKDRnKKcWKNPC7krYmGrK0IZUsSF4qyevQ+eq7lm+b9SajSKOMpzBOVyCB9fQhDtoQRsYKHiGV3hzjPPivDsfy9aSU8ycwh85nz/ij5D/</latexit> <latexit 
sha1_base64="QOOQ5ofVdAHhLfY+VT0BMXS8/4g=">AAAB8XicbZBNS8NAEIYn9avWr6pHL4tF8FQSKeix4MVjBfuBbQib7aZdutmE3YlQQv+FFw+KePXfePPfuG1z0NYXFh7emWFn3jCVwqDrfjuljc2t7Z3ybmVv/+DwqHp80jFJphlvs0QmuhdSw6VQvI0CJe+lmtM4lLwbTm7n9e4T10Yk6gGnKfdjOlIiEoyitR4HSLMgTwM2C6o1t+4uRNbBK6AGhVpB9WswTFgWc4VMUmP6npuin1ONgkk+qwwyw1PKJnTE+xYVjbnx88XGM3JhnSGJEm2fQrJwf0/kNDZmGoe2M6Y4Nqu1uflfrZ9hdOPnQqUZcsWWH0WZJJiQ+flkKDRnKKcWKNPC7krYmGrK0IZUsSF4qyevQ+eq7lm+b9SajSKOMpzBOVyCB9fQhDtoQRsYKHiGV3hzjPPivDsfy9aSU8ycwh85nz/ij5D/</latexit> <latexit sha1_base64="QOOQ5ofVdAHhLfY+VT0BMXS8/4g=">AAAB8XicbZBNS8NAEIYn9avWr6pHL4tF8FQSKeix4MVjBfuBbQib7aZdutmE3YlQQv+FFw+KePXfePPfuG1z0NYXFh7emWFn3jCVwqDrfjuljc2t7Z3ybmVv/+DwqHp80jFJphlvs0QmuhdSw6VQvI0CJe+lmtM4lLwbTm7n9e4T10Yk6gGnKfdjOlIiEoyitR4HSLMgTwM2C6o1t+4uRNbBK6AGhVpB9WswTFgWc4VMUmP6npuin1ONgkk+qwwyw1PKJnTE+xYVjbnx88XGM3JhnSGJEm2fQrJwf0/kNDZmGoe2M6Y4Nqu1uflfrZ9hdOPnQqUZcsWWH0WZJJiQ+flkKDRnKKcWKNPC7krYmGrK0IZUsSF4qyevQ+eq7lm+b9SajSKOMpzBOVyCB9fQhDtoQRsYKHiGV3hzjPPivDsfy9aSU8ycwh85nz/ij5D/</latexit> ⌧ trans <latexit sha1_base64="C58Zhhu5B2jTGUetWFF7FAOl28Y=">AAAB/HicbZDLSsNAFIYnXmu9Rbt0M1gEVyURQZcFNy4r2As0IUymk3boZBJmTsQQ4qu4caGIWx/EnW/jtM1CW38Y+PjPOZwzf5gKrsFxvq219Y3Nre3aTn13b//g0D467ukkU5R1aSISNQiJZoJL1gUOgg1SxUgcCtYPpzezev+BKc0TeQ95yvyYjCWPOCVgrMBueECyoPCAPUIBikhdloHddFrOXHgV3AqaqFInsL+8UUKzmEmggmg9dJ0U/IIo4FSwsu5lmqWETsmYDQ1KEjPtF/PjS3xmnBGOEmWeBDx3f08UJNY6j0PTGROY6OXazPyvNswguvYLLtMMmKSLRVEmMCR4lgQeccUoiNwAoYqbWzGdEEUomLzqJgR3+cur0LtouYbvLpttp4qjhk7QKTpHLrpCbXSLOqiLKMrRM3pFb9aT9WK9Wx+L1jWrmmmgP7I+fwADYpWV</latexit> <latexit sha1_base64="C58Zhhu5B2jTGUetWFF7FAOl28Y=">AAAB/HicbZDLSsNAFIYnXmu9Rbt0M1gEVyURQZcFNy4r2As0IUymk3boZBJmTsQQ4qu4caGIWx/EnW/jtM1CW38Y+PjPOZwzf5gKrsFxvq219Y3Nre3aTn13b//g0D467ukkU5R1aSISNQiJZoJL1gUOgg1SxUgcCtYPpzezev+BKc0TeQ95yvyYjCWPOCVgrMBueECyoPCAPUIBikhdloHddFrOXHgV3AqaqFInsL+8UUKzmEmggmg9dJ0U/IIo4FSwsu5lmqWETsmYDQ1KEjPtF/PjS3xmnBGOEmWeBDx3f08UJNY6j0PTGROY6OXazPyvNswguvYLLtMMmKSLRVEmMCR4lgQeccUoiNwAoYqbWzGdEEUomLzqJgR3+cur0LtouYbvLpttp4qjhk7QKTpHLrpCbXSLOqiLKMrRM3pFb9aT9WK9Wx+L1jWrmmmgP7I+fwADYpWV</latexit> <latexit sha1_base64="C58Zhhu5B2jTGUetWFF7FAOl28Y=">AAAB/HicbZDLSsNAFIYnXmu9Rbt0M1gEVyURQZcFNy4r2As0IUymk3boZBJmTsQQ4qu4caGIWx/EnW/jtM1CW38Y+PjPOZwzf5gKrsFxvq219Y3Nre3aTn13b//g0D467ukkU5R1aSISNQiJZoJL1gUOgg1SxUgcCtYPpzezev+BKc0TeQ95yvyYjCWPOCVgrMBueECyoPCAPUIBikhdloHddFrOXHgV3AqaqFInsL+8UUKzmEmggmg9dJ0U/IIo4FSwsu5lmqWETsmYDQ1KEjPtF/PjS3xmnBGOEmWeBDx3f08UJNY6j0PTGROY6OXazPyvNswguvYLLtMMmKSLRVEmMCR4lgQeccUoiNwAoYqbWzGdEEUomLzqJgR3+cur0LtouYbvLpttp4qjhk7QKTpHLrpCbXSLOqiLKMrRM3pFb9aT9WK9Wx+L1jWrmmmgP7I+fwADYpWV</latexit> <latexit sha1_base64="C58Zhhu5B2jTGUetWFF7FAOl28Y=">AAAB/HicbZDLSsNAFIYnXmu9Rbt0M1gEVyURQZcFNy4r2As0IUymk3boZBJmTsQQ4qu4caGIWx/EnW/jtM1CW38Y+PjPOZwzf5gKrsFxvq219Y3Nre3aTn13b//g0D467ukkU5R1aSISNQiJZoJL1gUOgg1SxUgcCtYPpzezev+BKc0TeQ95yvyYjCWPOCVgrMBueECyoPCAPUIBikhdloHddFrOXHgV3AqaqFInsL+8UUKzmEmggmg9dJ0U/IIo4FSwsu5lmqWETsmYDQ1KEjPtF/PjS3xmnBGOEmWeBDx3f08UJNY6j0PTGROY6OXazPyvNswguvYLLtMMmKSLRVEmMCR4lgQeccUoiNwAoYqbWzGdEEUomLzqJgR3+cur0LtouYbvLpttp4qjhk7QKTpHLrpCbXSLOqiLKMrRM3pFb9aT9WK9Wx+L1jWrmmmgP7I+fwADYpWV</latexit> Figure 8.2: Ourmodularnetworkaugmentedwithtransitionpolicies. To perform a complex task, our model repeats the following steps: (1) The meta-policy chooses a primitive policy of indexc; (2) The corresponding transition policy helps initiate the chosen primitive policy; (3) The primitive policy executes the skill; and (4) A success or failure signal for the primitive skill is produced. connected. 
Our proposed framework aims to bridge this gap by training transition policies in a model-free manner to navigate the agent from states unseen by the following skills to suitable initial states. Deep RL techniques for continuous control demand dense reward signals; otherwise, they suffer from long training times. Instead of manual reward shaping for denser rewards, adversarial reinforcement learning [114, 190, 323, 21] employs a discriminator which learns to judge the state or the policy, and the policy takes the output of the discriminator as its reward. While those methods assume ground truth trajectories or goal states are given, our method collects both success and failure trajectories online to train proximity predictors, which provide rewards for transition policies.

8.3 Approach

In this paper, we address the problem of solving a complex task that requires sequential composition of primitive skills given only sparse and binary rewards (i.e. subtask completion rewards). The sequential execution of primitive skills fails when two consecutive skills are not smoothly connected. We propose a modular framework with transition policies that learn to transition from one policy to the subsequent policy and can therefore exploit the given primitive skills to compose complex skills. To accelerate the training of transition policies, additional networks, proximity predictors, are jointly trained to provide proximity rewards as intermediate feedback to transition policies. In Section 8.3.2, we describe our framework in detail. Next, in Section 8.3.3, we elaborate on how transition policies are efficiently trained with the induced proximity reward.

8.3.1 Preliminaries

We formulate our problem as a Markov decision process defined by a tuple {S, A, T, R, ρ, γ} of states, actions, transition probability, reward, initial state distribution, and discount factor. An action distribution of an agent is represented as a policy π_θ(a_t | s_t), where s_t ∈ S is a state, a_t ∈ A is an action at time t, and θ are the parameters of the policy. An initial state s_0 is randomly sampled from ρ, and then the agent iteratively takes an action a_t sampled from the policy π_θ(a_t | s_t) and receives a reward r_t until the episode ends. The performance of the agent is evaluated based on the discounted return

R = \sum_{t=0}^{T-1} \gamma^t r_t,

where T is the episode horizon.

8.3.2 Modular Framework with Transition Policies

To learn a new task given primitive skills {p_1, p_2, ..., p_n}, we design a modular framework that consists of the following components: a meta-policy, primitive policies, and transition policies. The meta-policy chooses a primitive skill p_c to execute at the beginning and whenever the current primitive skill is terminated. Prior to running p_c, the transition policy for p_c is executed to bring the current state to a plausible initial state for p_c, so that p_c can be successfully performed. This procedure is repeated to compose complex skills, as illustrated in Figure 8.2 and Algorithm 5.

We denote the meta-policy as π_meta(p_c | s), where c ∈ [1, n] is a primitive policy index. The observation of the meta-policy contains the low-level information of the primitives and task specifications indicating high-level goals (e.g. moving direction and target object position). For example, a walking primitive only takes joint information as observation, while the meta-policy additionally takes the target direction. In this paper, we use a rule-based meta-policy and focus on transitioning between consecutive primitive policies.
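To make the execution procedure concrete, the following is a minimal sketch of the loop in Figure 8.2; `env`, `meta_policy`, `transition_policies`, and `primitive_policies` are hypothetical stand-ins for the components described above, and the gym-style `env.step` interface is an assumption made only for illustration, not the implementation used in this work.

```python
# Minimal sketch of the execution loop in Figure 8.2. All object names and
# the gym-style env interface are hypothetical stand-ins.

def run_episode(env, meta_policy, transition_policies, primitive_policies):
    state, done = env.reset(), False
    while not done:
        # (1) The meta-policy picks the index c of the next primitive skill.
        c = meta_policy.select(state)

        # (2) The transition policy for p_c steers the agent toward an
        #     initial state from which p_c can succeed, until it terminates.
        while not done and not transition_policies[c].terminate(state):
            state, _, done, _ = env.step(transition_policies[c].act(state))

        # (3) The primitive policy p_c executes the skill until its
        #     termination signal leaves "continue".
        signal = "continue"
        while not done and signal == "continue":
            state, _, done, _ = env.step(primitive_policies[c].act(state))
            signal = primitive_policies[c].termination(state)

        # (4) `signal` is "success" or "fail"; it is later used to label the
        #     preceding transition trajectory when training transition policies.
```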
Once a primitive skill p_c is chosen to be executed, the agent generates an action a_t ∼ π_{p_c}(a | s_t) based on the current state s_t. Note that we do not differentiate the state spaces of primitive policies for simplicity of notation (e.g. the observation of the jumping primitive contains the distance to a curb, while that of the walking primitive only contains joint poses and velocities). Every primitive policy is required to generate a termination signal τ_{p_c} ∈ {continue, success, fail} to indicate policy completion and whether it believes the execution was successful or not. While our method is agnostic to the form of the primitive policies (e.g. rule-based, inverse kinematics), we consider the case of a pre-trained neural network in this paper.

For smooth transitions between primitive policies, we add a transition policy π_{φ_c}(a | s) before executing primitive skill p_c, which guides the agent to p_c's initiation set, where φ_c denotes the parameters of the transition policy for p_c. Note that the transition policy for p_c is shared across different preceding primitive policies, since a successful transition is defined by the success of the following primitive skill p_c. For brevity of notation, we omit the primitive policy index c in the following equations where unambiguous. The transition policy's state and action spaces are the same as the primitive policy's. The transition policy also learns a termination signal τ_trans which indicates transition termination to successfully initiate p_c. Our framework contains one transition policy for each primitive skill, in total n transition policies {π_{φ_1}, π_{φ_2}, ..., π_{φ_n}}.

8.3.3 Training Transition Policies

In our framework, transition policies are trained to make the execution of the corresponding following primitive policies successful. During rollouts, transition trajectories are collected, and each trajectory can be naively labeled by the successful execution of its corresponding primitive policy. Then, transition policies are trained to maximize the average success of the respective primitive policy.
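As a concrete reading of this naive labeling, here is a minimal sketch; `rollout` is a hypothetical record holding one transition trajectory together with the termination signal of the primitive skill executed right after it (none of these names appear in the text).

```python
# Minimal sketch of the naive labeling described above. `rollout` is a
# hypothetical record with the transition trajectory and the termination
# signal of the following primitive skill.

def label_transition_trajectory(rollout):
    """The only learning signal is a sparse, binary reward at the last step
    of the transition, indicating whether the following primitive succeeded."""
    success = rollout.primitive_signal == "success"
    rewards = [0.0] * len(rollout.states)
    if rewards:
        rewards[-1] = 1.0 if success else 0.0
    return rollout.states, rollout.actions, rewards
```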
In this scenario, by definition, the only available learning signal for the transition policies is the sparse and binary rewards for the completion of the next task. To alleviate the sparsity of rewards and maximize the objective of moving to viable initial states for the next primitive, we propose a proximity predictor that learns and provides a dense reward, dubbed the proximity reward, of how close transition states are to the initiation set of the corresponding primitive p_c, as shown in Figure 8.3.

Figure 8.3: Training of transition policies and proximity predictors. After executing a primitive policy, a previously performed transition trajectory is labeled and added to a replay buffer based on the execution success. A proximity predictor is trained on states sampled from the two buffers to output the proximity to the initiation set. The predicted proximity serves as a reward to encourage the transition policy to move toward good initial states for the corresponding primitive policy.

We denote a proximity predictor as P_{ω_c}, which is parameterized by ω_c. We define the proximity of a state as the future discounted proximity, v = δ^step, where step is the number of steps required to reach an initiation set of the following primitive policy. The proximity of a state can also be a linearly discounted function such as v = 1 − δ · step. We refer the readers to Appendix Section 8.6 for a comparison of the two proximity functions. The proximity predictor is trained to minimize a mean squared error of proximity prediction:

L_P(\omega, B^S, B^F) = \frac{1}{2}\,\mathbb{E}_{(s,v) \sim B^S}\big[(P_\omega(s) - v)^2\big] + \frac{1}{2}\,\mathbb{E}_{s \sim B^F}\big[P_\omega(s)^2\big], \quad (8.1)

where B^S and B^F are collections of states from success and failure transition trajectories, respectively. To estimate the proximity to an initiation set, B^S contains not only the state that directly leads to the success of the following primitive policy, but also the intermediate states of the successful trajectories with their proximities.
By minimizing this objective, given a state, the proximity predictor learns to predict 1 if the state is in the initiation set, a value between 0 and 1 if the state leads the agent to end up at a desired initial state, and 0 when the state leads to a failure. The goal of a transition policy is to get close to an initiation set, which can be formulated as seeking a state predicted to be in the initiation set by the proximity predictor (i.e. P_ω(s) is close to 1). To achieve this goal, the transition policy learns to maximize the predicted proximity at the ending state of the transition trajectory, P_ω(s_T). In addition to providing a reward at the end, we also use the increase in predicted proximity to the initiation set, P_ω(s_{t+1}) − P_ω(s_t), at every timestep as a reward, dubbed the proximity reward, to create a denser reward. The transition policy is trained to maximize the expected discounted return:

R_{\text{trans}}(\phi) = \mathbb{E}_{(s_0, s_1, \ldots, s_T) \sim \pi_\phi}\Big[ \gamma^T P_\omega(s_T) + \sum_{t=0}^{T-1} \gamma^t \big(P_\omega(s_{t+1}) - P_\omega(s_t)\big) \Big]. \quad (8.2)

However, in general skill learning scenarios, ground truth states (B^S and B^F) for training proximity predictors are not available. Hence, the training data for a proximity predictor is obtained online while training its corresponding transition policy. Specifically, we label the states in a transition trajectory as success or failure based on whether the following primitive is successfully executed or not, and add them into the corresponding buffers B^S or B^F, respectively. As stated in Algorithm 4, we train transition policies and proximity predictors by alternating between an Adam [140] gradient step on ω to minimize Equation (8.1) with respect to P_ω and a PPO [265] step on φ to maximize Equation (8.2) with respect to π_φ. We refer readers to Appendix Section 8.6 for further details on training.

In summary, we propose to compose complex skills with transition policies that enable smooth transitions between previously acquired primitive policies. Specifically, we propose to reward transition policies based on how close the current state is to suitable initial states of the subsequent policy (i.e. its initiation set). To provide the proximity of a state, we collect failing and successful trajectories on the fly and train a proximity predictor to predict the proximity. Utilizing the learned proximity predictors and proximity rewards for training transition policies is beneficial in the following respects: (1) the dense rewards speed up transition policy training by differentiating failing states from states in a successful trajectory; and (2) the joint training mechanism prevents a transition policy from getting stuck in local optima. Whenever a transition policy gets into a local optimum (i.e. fails the following skill while receiving a high proximity reward), the proximity predictor learns to lower the proximity for the failing transition as those states are added to its failure buffer, escaping the local optimum.

8.4 Experiments

We conducted experiments on two classes of continuous control tasks: robotic manipulation and locomotion. To illustrate the potential of the proposed framework, the modular framework with Transition Policies (TP), we designed a set of complex tasks that require agents to utilize diverse primitive skills which are not optimized for smooth composition. All of our environments are simulated in the MuJoCo physics engine [300].
Figure 8.4: Tasks and success count curves of our model (blue), TRPO (purple), PPO (magenta), and transition policies (TP) trained on task reward (green) and sparse proximity reward (yellow), for (a) Repetitive picking up, (b) Repetitive catching, (c) Serve, (d) Patrol, (e) Hurdle, and (f) Obstacle course. Our model achieves the best performance and convergence time. Note that TRPO and PPO are trained 5 times longer than ours with dense rewards, since TRPO and PPO do not have primitive skills and learn from scratch. In the success count curves, different temporal scales are used for TRPO and PPO (bottom x-axis) and ours (top x-axis).

8.4.1 Baselines

We evaluate our method to answer how transition policies benefit complex task learning and how joint training with proximity predictors boosts the training of transition policies. To investigate the impact of the transition policy, we compared policies learned from dense rewards with our modular framework that only learns from sparse and binary rewards (i.e. subtask completion rewards). Moreover, we conducted ablation studies to dissect each component in the training method of transition policies. To answer these questions, we compare the following methods:

• Trust Region Policy Optimization with dense reward (TRPO) represents a state-of-the-art policy gradient method [264], which we use for the standard RL comparison.
• Proximal Policy Optimization with dense reward (PPO) is another state-of-the-art policy gradient method [265], which is more stable than TRPO with smaller batch sizes.
• Without transition policies (Without-TP) sequentially executes primitive policies without transition policies and has no learnable components.
• Transition policies trained on task rewards (TP-Task) represents a modular network augmented with transition policies learned from the sparse and binary reward (i.e. subtask completion reward), whereas our model learns from the dense proximity reward.
• Transition policies trained on sparse proximity rewards (TP-Sparse) is a variant of our model which receives the proximity reward only at the end of the transition trajectory. In contrast, our model learns from dense proximity rewards generated at every timestep.
• Transition policies trained on dense proximity rewards (TP-Dense, Ours) is our final model, where transition policies learn from dense proximity rewards.

Initially, we tried comparing baseline methods with our method using only sparse and binary rewards. However, the baselines could not solve any of the tasks due to the complexity and sparse reward of the environments. To provide more competitive comparisons, we engineer dense rewards for the baselines (TRPO and PPO) to boost their performance and give the baselines 5 times longer training times. We show that transition policies trained with sparse rewards can compete with and even outperform baselines learning from dense rewards. As the performance of TRPO and PPO varies significantly between runs, we train each task with 3 different random seeds and report the mean and standard deviation in Figure 8.4.

8.4.2 Robotic Manipulation

For robotic manipulation, we simulate a Kinova Jaco, a 9 DoF robotic arm with 3 fingers. The agent receives full state information, including the absolute location of external objects. The agent uses joint torque control to perform actions. The results are shown in Figure 8.4 and Table 8.1.

Pre-trained primitives. There are four pre-trained primitives available: Picking up, Catching, Tossing, and Hitting.
Picking up requires the robotic arm to pick up a small block, which is randomly placed on the table. If the box is not picked up after a certain amount of time, the agent fails. Catching learns to catch a block that is thrown towards the arm with a random initial position and velocity. The agent fails if it does not catch and stably hold the box for a certain amount of time. Tossing requires the robot to pick up a box, toss it vertically in the air, and land the box at a specified position. Hitting requires the robot to hit a box dropped overhead at a target ball.

Table 8.1: Success count for robotic manipulation, comparing our method against baselines with or without transition policies (TP). Our method achieves the best performance over both RL baselines and the ablated variants. Each entry in the table represents the average success count and standard deviation over 50 runs with 3 random seeds.

                  Reward   Repetitive picking up   Repetitive catching   Serve
TRPO              dense    0.69 ± 0.46             4.54 ± 1.21           0.32 ± 0.47
PPO               dense    0.95 ± 0.53             4.26 ± 1.63           0.00 ± 0.00
Without TP        sparse   0.99 ± 0.08             1.00 ± 0.00           0.11 ± 0.32
TP-Task           sparse   0.99 ± 0.08             4.87 ± 0.58           0.05 ± 0.21
TP-Sparse         sparse   1.52 ± 1.12             4.88 ± 0.59           0.92 ± 0.27
TP-Dense (ours)   sparse   4.84 ± 0.63             4.97 ± 0.33           0.92 ± 0.27

Repetitive picking up. The Repetitive picking up task requires the agent to complete the Picking up task 5 times. After each successful pick, the box disappears and a new box is placed randomly on the table again. Our model achieves the best performance and converges the fastest by learning from the proposed proximity reward. With our dense proximity reward at every transition step, we alleviate credit assignment compared to providing a sparse proximity reward (TP-Sparse) or using a sparse task reward (TP-Task). Conversely, TRPO and PPO with dense rewards take significantly longer to learn and are unable to pick up the second box, as the ending pose after the first picking up is too unstable to initialize the next picking up.

Repetitive catching. Similar to Repetitive picking up, the Repetitive catching task requires the agent to catch boxes consecutively up to 5 times. In this task, other than the modular network without a transition policy, all baselines are able to eventually learn, while our model still learns the fastest. We believe this is because the Catching primitive policy has a larger initiation set and therefore the sparse reward problem is less severe, since random exploration is able to succeed with a higher chance.
This result shows that once the policy falls into a local optimum, it is not able to escape because the policy will never get a sparse task reward. On the other hand, our method is robust to local optima since the jointly learned dense proximity reward provides a learning signal to the agent even when it cannot get a task reward.

8.4.3 Locomotion

For locomotion, we simulate a 9 DoF planar (2D) bipedal walker. The observation of the agent includes joint position, rotation, and velocity. When the agent needs to interact with objects in the environment, we provide additional inputs such as the distance to the curb and ceiling in front of the agent. The agent uses joint torque control to perform actions. The results are shown in Figure 8.4 and Table 8.2.

Pre-trained primitives. Forward and Backward require the walker to walk forward and backward with a certain velocity, respectively. Balancing requires the walker to robustly stand still under random external forces. Jumping requires the walker to jump over a randomly located curb and land safely. Crawling requires the walker to crawl under a ceiling. In all the aforementioned scenarios, the walker fails when its height is lower than a threshold.

Table 8.2: Success count for locomotion, comparing our method against baselines with or without transition policies (TP). Our method outperforms all baselines in Patrol and Obstacle course. In Hurdle, the reward function for TRPO was extensively engineered, which is not directly comparable to our method. Our method outperforms baselines learning from sparse reward, showing the effectiveness of the proposed proximity predictor. Each entry in the table represents the average success count and standard deviation over 50 runs with 3 random seeds.

                  Reward   Patrol        Hurdle         Obstacle course
TRPO              dense    1.37 ± 0.52   4.13 ± 1.54    0.98 ± 1.09
PPO               dense    1.53 ± 0.53   2.87 ± 1.92    0.85 ± 1.07
Without TP        sparse   1.02 ± 0.14   0.49 ± 0.75    0.72 ± 0.72
TP-Task           sparse   1.69 ± 0.63   1.73 ± 1.28    1.08 ± 0.78
TP-Sparse         sparse   2.51 ± 1.26   1.47 ± 1.53    1.32 ± 0.99
TP-Dense (Ours)   sparse   3.33 ± 1.38   3.14 ± 1.69*   1.90 ± 1.45

Patrol (Forward and backward). The Patrol task involves walking forward and backward toward goal points on either side and balancing in between to smoothly change direction. As illustrated in Figure 8.4, our method consistently outperforms TRPO, PPO, and the ablated baselines in stably walking forward and transitioning to walking backward. The agent trained with dense rewards is not able to consistently switch directions, whereas our model can utilize previously learned primitives, including Balancing, to stabilize a reversal in velocity.

Hurdle (Walking forward and jumping). The Hurdle task requires the agent to walk forward and jump across curbs, which requires a transition between walking and jumping as well as landing from the jump back into walking forward. As shown in Figure 8.4, our method outperforms the sparse reward baselines, showing the efficiency of our proposed proximity reward. While TRPO with dense rewards can learn this task as well, it requires dense rewards consisting of eight different components to collectively enable TRPO to learn the task. It can be considered as learning both primitive skills and transitions between skills from dense rewards. However, the main focus of this paper is to learn a complex task by reusing acquired skills, avoiding extensive reward design.

Obstacle Course (Walking forward, jumping, and crawling).
Obstacle Course is the most difficult among the locomotion tasks, where the walker must walk forward, jump across curbs, and crawl underneath ceilings. It requires three different behaviors and transitions between two very different primitive skills: crawling and jumping. Since the task requires significantly different behaviors that are hard to transition between, TRPO fails to learn the task and only tries to crawl toward the curb without attempting to jump. In contrast, our method learns to transition between all pairs of primitive skills and often succeeds in crossing multiple obstacles.

8.4.4 Ablation Study

We conducted additional experiments to understand the contribution of transition policies, proximity predictors, and dense proximity rewards. The modular framework without transition policies (Without-TP) tends to fail at executing the second skill, since the second skill is not trained to cover the ending states of the first skill. In particular, in continuous control, making a primitive skill cover all possible states is very challenging. Transition policies trained from the task completion reward (TP-Task) and the sparse proximity reward (TP-Sparse) learn to connect consecutive primitives more slowly because sparse rewards are hard to learn from due to the credit assignment problem. On the other hand, our model alleviates the credit assignment problem and learns quickly by giving a predicted proximity reward for every transition state-action pair.

8.4.5 Training of Transition Policy and Proximity Predictor

To investigate how transition policies learn to solve the tasks, we present the lengths of transition trajectories and the obtained proximity rewards during training in Figure 8.5. For manipulation, we show the results of Repetitive picking up and Repetitive catching. For locomotion, we show Patrol with three different transition policies.

Figure 8.5: Average transition length and average proximity reward of transition trajectories over training on Manipulation (left) and Patrol (right).

The transition policy quickly learns to maximize the proximity reward regardless of the accuracy of the proximity predictor. All the transition policies increase their length while exploring in the beginning, especially for picking up (55 steps) and balancing (45 steps). This is because a randomly initialized proximity predictor outputs high proximity for unseen states, and a transition policy tries to get a high reward by visiting these states. However, as these failing initial states with high proximity are collected in the failure buffers, the proximity predictor lowers their proximity and the transition policy learns to avoid them.
In other words, the transition policy will end up seeking successful states. As transition policies learn to transition to the following skills, the transition length decreases to obtain higher proximity rewards earlier.

8.4.6 Visualizing Transition Trajectory

Figure 8.6(a) shows two transition trajectories (from s_0 to t_0 and from s_1 to t_1) and the two-dimensional PCA embedding of the ending states (blue) and initiation states (red) of the Picking up primitive. A transition policy starts from states s_0 and s_1, where the previous Picking up primitive is terminated. As can be seen in Figure 8.6(a), the proximity predictor outputs small values for s_0 and s_1 since they are far from the initiation set of the Picking up primitive. The trajectories in the figure show that the transition policy moves toward states with higher proximity and finally ends up in states t_0 and t_1, which are in the initiation set of the primitive policy.

Figure 8.6: Visualization of transition trajectories of (a) Repetitive picking up and (b) Patrol. Top and bottom rows: rendered frames of transition trajectories. Middle row: states extracted from each primitive skill execution projected onto PCA space. The dots connected with lines are extracted from the same transition trajectory, where the marker color indicates the proximity prediction P(s). A higher P(s) value indicates proximity to states suitable for initializing the next primitive skill. Left: two picking up transition trajectories demonstrate that the transition policy learns to navigate from terminal states s_0 and s_1 to t_0 and t_1. Right: the forward-to-balancing transition moves between the Forward and Balancing state distributions, and the balancing-to-backward transition moves from the Balancing states close to the Backward states.

Figure 8.6(b) illustrates the PCA embeddings of the initiation sets of three primitive skills: Forward (green), Backward (orange), and Balancing (blue). A transition from Forward to Balancing has a very long trajectory, but the predicted proximity helps the transition policy reach an initiation state t_0. On the other hand, transitioning between Balancing and Backward only requires 7 steps.

8.5 Conclusion

In this work, we propose a modular framework with transition policies to empower reinforcement learning agents to learn complex tasks with sparse rewards by utilizing prior knowledge. Specifically, we formulate the problem as executing existing primitive skills while smoothly transitioning between them. To learn transition policies in a sparse reward setting, we propose a proximity predictor which generates dense reward signals, and we jointly train transition policies and proximity predictors. Our experimental results on robotic manipulation and locomotion tasks demonstrate the effectiveness of employing transition policies. The proposed framework solves complex tasks without reward shaping and outperforms baseline RL algorithms and other ablated baselines.

There are many future directions to investigate. Our method is designed to focus on acquiring transition policies that connect a given set of primitive policies under a predefined meta-policy. We believe that jointly learning a meta-policy and transition policies on a new task would make our framework more flexible.
Moreover, we made an assumption that a successful transition between two consecutive policies should be achievable by random exploration. To alleviate the exploration problem with sparse rewards, our transition policy training can incorporate exploration methods such as count-based exploration bonuses [27, 187] and curiosity-driven intrinsic reward [224]. We also assume that our primitive policies return a signal indicating whether the execution should be terminated, similar to Kulkarni et al. [148], Oh et al. [216], and Le et al. [156]. Learning to assess the successful termination of primitive policies together with learning transition policies is a promising future direction.

8.6 Appendix

8.6.1 Acquiring Primitive Policies

The modular framework proposed in this paper allows a primitive policy to be any of a pre-trained neural network, an inverse kinematics module, or a hard-coded policy. In this paper, we use neural networks trained with TRPO [264] on dedicated environments as primitive policies (see Section 8.6.3 for the details of environments and reward functions). All policy networks we used consist of 2 layers of 32 hidden units with tanh nonlinearities and predict the mean and standard deviation of a Gaussian distribution over the action space. We trained all primitive policies until the total return converged (up to 10,000 iterations). Given a state, a primitive policy outputs an action as well as a termination signal indicating whether the execution is done and whether the skill was successfully performed (see Section 8.6.3 for details on primitive skills and termination conditions).

8.6.2 Training Details

8.6.2.1 Implementation Details

For the TRPO and PPO implementations, we used OpenAI Baselines [64] with default hyperparameters, including learning rate, KL penalty, and entropy coefficients, unless specified below.

Table 8.3: Hyperparameter values for the transition policy, proximity predictor, and primitive policy as well as the TRPO and PPO baselines.

Hyperparameter         Transition policy    Proximity predictor    Primitive policy     TRPO                 PPO
Learning rate          1e-4                 1e-4                   1e-3 (for critic)    1e-3 (for critic)    1e-4
# Mini-batch           150                  150                    32                   150                  150
Mini-batch size        64                   64                     64                   64                   64
Learning rate decay    no                   no                     no                   no                   linear decay

For all networks, we use the Adam optimizer with a mini-batch size of 64. We use 4 workers for rollouts and parameter updates. The rollout size for each update is 10,000 steps. We limit the maximum length of a transition trajectory to 100.

8.6.2.2 Replay Buffers

A success buffer B^S contains states and their proximity to the corresponding initiation set in successful transitions. On the other hand, a failure buffer B^F contains states in failed transitions. Both buffers are FIFO (i.e., new items are added on one end and, once a buffer is full, a corresponding number of items are discarded from the opposite end). For all experiments, we use buffers B^S and B^F with a capacity of one million states. For efficient training of the proximity predictors, we collect successful trajectories of primitive skills, which can be sampled during the training of the primitive skills. We run 1,000 episodes for each primitive and put the first 10-20% of states in each trajectory into the success buffer as an initiation set. While initiation sets can be discovered via random exploration, we found that this initialization of success buffers improves the efficiency of training by providing initial training data for the proximity predictors.
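To make the buffer bookkeeping above concrete, the following is a minimal sketch of the per-primitive success and failure buffers. It assumes NumPy-like state vectors, uses a seeding fraction of 0.15 (within the 10-20% range described above), and treats the seeded initiation-set states as having proximity 1; all class and method names are illustrative rather than taken from our released implementation.

from collections import deque

class ProximityBuffers:
    """FIFO success/failure buffers for one primitive skill (illustrative sketch)."""

    def __init__(self, capacity=1_000_000):
        # Success buffer holds (state, proximity) pairs; failure buffer holds states only.
        self.success = deque(maxlen=capacity)
        self.failure = deque(maxlen=capacity)

    def add_success(self, states, proximities):
        self.success.extend(zip(states, proximities))

    def add_failure(self, states):
        self.failure.extend(states)

    def initialize_from_primitive(self, episodes, fraction=0.15):
        # Seed the success buffer with the first 10-20% of states of each successful
        # primitive trajectory, treated as the initiation set (assumed proximity 1).
        for trajectory in episodes:
            cutoff = max(1, int(len(trajectory) * fraction))
            self.add_success(trajectory[:cutoff], [1.0] * cutoff)

Because both deques are bounded, the oldest entries are discarded automatically once the one-million-state capacity is reached, matching the FIFO behavior described above.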
8.6.2.3 Proximity Reward

Transition policies receive rewards based on the outputs of proximity predictors. Before computing the reward at every time step, we clip the output of the proximity predictor P by clip(P(s), 0, 1), which indicates how close the state s is to the initiation set of the following primitive (higher values correspond to closer states). We define the proximity of a state to an initiation set as an exponentially discounted function δ^step, where step is the shortest number of timesteps required to get to a state in the initiation set. We use δ = 0.95 for all experiments. To make the reward denser, for every timestep t, we provide the increase in proximity, P(s_{t+1}) − P(s_t), as the reward for the transition policy.

Figure 8.7: Success count curves of our model with an exponentially discounted proximity function and a linearly discounted proximity function over training on Obstacle course (left) and Repetitive catching (right).

Using a linearly discounted proximity function, 1 − δ · step, is also a valid choice. We compare the two proximity functions on a manipulation task (Repetitive catching) and a locomotion task (Obstacle course), as shown in Figure 8.7, where δ for exponential decay and linear decay are 0.95 and 0.01, respectively. The results demonstrate that our model is able to learn well with both proximity functions and that they perform similarly. Originally, we opted for the exponential proximity function with the intuition that the faster initial decay near the initiation set would help the policy discriminate successful states from failing states near the initiation set. Also, in our experiments, since we use 0.95 as the decay factor, the proximity is still reasonably large (e.g., 0.35 for 20 timesteps and 0.07 for 50 timesteps). In this paper, we use the exponential proximity function for all experiments.

8.6.2.4 Proximity Predictor

A proximity predictor takes a state as input, which includes joint state information, joint acceleration, and any task specification, such as ceiling and curb information. A proximity predictor consists of 2 fully connected layers of 96 hidden units with ReLU nonlinearities and predicts the proximity to the initiation set based on the states sampled from the success and failure buffers. Each training iteration consists of 10 epochs over a batch size of 64 and uses a learning rate of 1e-4. The predictor optimizes the loss in Equation (8.1), similar to the LSGAN loss [186].

8.6.2.5 Transition Policies

The observation space of a transition policy consists of joint state information and joint acceleration. A transition policy consists of 2 fully connected layers of 32 hidden units with tanh nonlinearities and predicts the mean and standard deviation of a Gaussian distribution over the action space. A 2-way softmax layer follows the last fully connected layer to predict whether to terminate the current transition. We train all transition policies using PPO [265] since PPO is robust to smaller batch sizes, and the number of transition states collected for each update is much smaller than the size of a rollout. Each training iteration consists of 5 epochs over a batch.
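Before listing the full training and rollout procedures (Algorithms 4 and 5), the following minimal sketch summarizes the two quantities defined in Section 8.6.2.3: the discounted proximity target δ^step for states along a successful transition, and the per-step transition reward P(s_{t+1}) − P(s_t). The function and variable names are ours and serve only as an illustration of the formulas above, not as the released implementation.

import numpy as np

DELTA = 0.95  # discount for the proximity target (Section 8.6.2.3)

def proximity_targets(num_states, delta=DELTA):
    # Exponentially discounted proximity delta**step for a successful transition
    # trajectory whose final state reaches the initiation set (step = 0 at the end).
    steps_to_initiation = np.arange(num_states)[::-1]
    return delta ** steps_to_initiation

def proximity_reward(predictor, state, next_state):
    # Dense transition reward: increase in the (clipped) predicted proximity.
    p_now = np.clip(predictor(state), 0.0, 1.0)
    p_next = np.clip(predictor(next_state), 0.0, 1.0)
    return p_next - p_now

The first function produces the regression targets stored in the success buffer for a successful transition, while the second corresponds to the reward r_t assigned in line 11 of Algorithm 5 below.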
Algorithm 4 Train
 1: Input: primitive policies {π_p1, ..., π_pn}.
 2: Initialize success buffers {B^S_1, ..., B^S_n} with successful trajectories of primitive policies.
 3: Initialize failure buffers {B^F_1, ..., B^F_n}.
 4: Randomly initialize the parameters of transition policies {φ_1, ..., φ_n} and proximity predictors {ω_1, ..., ω_n}.
 5: repeat
 6:     Initialize rollout buffers {R_1, ..., R_n}.
 7:     Collect trajectories using Rollout.
 8:     for i = 1 to n do
 9:         Update P_ωi to minimize Equation (8.1) using B^S_i and B^F_i.
10:         Update π_φi to maximize Equation (8.2) using R_i.
11:     end for
12: until convergence

Algorithm 5 Rollout
 1: Input: meta policy π_meta, primitive policies {π_p1, ..., π_pn}, transition policies {π_φ1, ..., π_φn}, and proximity predictors {P_ω1, ..., P_ωn}.
 2: Initialize an episode and receive the initial state s_0.
 3: t ← 0
 4: while episode is not terminated do
 5:     c ∼ π_meta(s_t)
 6:     Initialize a rollout buffer B.
 7:     while episode is not terminated do
 8:         a_t, τ_trans ∼ π_φc(s_t)
 9:         Terminate the transition policy if τ_trans = terminate.
10:         s_{t+1}, τ_env ← ENV(s_t, a_t)
11:         r_t ← P_ωc(s_{t+1}) − P_ωc(s_t)
12:         Store (s_t, a_t, r_t, τ_env, s_{t+1}) in B
13:         t ← t + 1
14:     end while
15:     while episode is not terminated do
16:         a_t, τ_pc ∼ π_pc(s_t)
17:         Terminate the primitive policy if τ_pc ≠ continue.
18:         s_{t+1}, τ_env ← ENV(s_t, a_t)
19:         t ← t + 1
20:     end while
21:     Compute the discounted proximity v of each state s in B.
22:     Add pairs of (s, v) to B^S_c or B^F_c according to τ_pc.
23:     Add B to the rollout buffer R_c.
24: end while

8.6.2.6 Scalability

Each sub-policy requires its corresponding transition policy, proximity predictor, and two buffers. Hence, both the time and memory complexities of our method are linearly dependent on the number of sub-policies. The memory overhead is affordable since a transition policy (2 layers of 32 hidden units), a proximity predictor (2 layers of 96 hidden units), and the replay buffers (1M states) are small.

8.6.3 Environment Descriptions

For every task, we add a control penalty, -0.001 · ||a||^2, to regularize the magnitude of actions, where a is the torque action performed by the agent. Note that all measures are in meters, and we omit the units here for clarity of presentation.

8.6.3.1 Robotic Manipulation

In the object manipulation tasks, a 9-DoF Jaco robotic arm* is used as the agent, and a cube with a side length of 0.06 m is used as the target object. We follow the tasks and environment settings proposed in Ghosh et al. [91]. The observation consists of the position of the base of the Jaco arm, joint angles, and angular velocities, as well as the position, rotation, velocity, and angular velocity of the cube. The action space is torque control on the 9 joints.

* http://www.mujoco.org/forum/index.php?resources/kinova-arms.12/

Reward Design and Termination Condition

Picking up: In the Picking up task, the position of the box is randomly initialized within a square region of size 0.1 m × 0.1 m with a center at (0.5, 0.2). There is an initial guide reward to guide the arm to the box. There is also an over reward to guide the hand directly over the box. When the arm is not picking up the box, there is a pick reward to incentivize the arm to pick the box up. There is an additional hold reward that makes the arm hold the box in place after picking it up. Finally, there is a success reward given after the arm has held the box for 50 frames. The success reward is scaled with the number of timesteps to encourage the arm to succeed as quickly as possible.
R(s) = λ_guide · 1[Box not picked and Box on ground] + λ_pick · 1[Box in hand and not picked] + λ_hold · 1[Box picked and near hold point],
with λ_guide = 2, λ_pick = 100, and λ_hold = 0.1.

Catching: The position of the box is initialized at (0, 2.0, 1.5), and a directional force of size 110 is applied to throw the box toward the agent with randomness (0.1 m × 0.1 m).

R(s) = 1[Box in air and Box within 0.06 of Jaco end-effector]

Tossing: The box is randomly initialized on the ground at (0.4, 0.3, 0.05) within a 0.005 × 0.005 square region. A guide reward is given to guide the arm to the top of the box. A pick reward is then given to lift the box up to a specified release height. A release reward is given if the box is no longer in the hand. A stable reward is given to minimize variation in the box's x and y directions. An up reward is given while the box is traveling upwards in the air, up until the box reaches a specified z height. Finally, a success reward of +100 is given based on the landing position of the box and the specified landing position.

Hitting: The box is randomly initialized above the arm at (0.4, 0.3, 1.2) within a 0.005 × 0.005 m square region. The box falls, and the arm is given a hit reward of +10 for hitting the box. Once the box has been hit, a target reward is given based on how close the box is to the target.

Repetitive picking up: The Repetitive picking up task has two reward variants. The sparse version gives a reward of +1 for every successful pick. The dense version gives a guide reward to the box after each successful pick, following the reward for the Picking up task.

Repetitive catching: The Repetitive catching task gives a reward of +1 for every successful catch. For dense reward, it uses the same reward function as the Catching task.

Serve: The Serve task gives a toss reward of +1 for a successful toss and a target reward of +1 for successfully hitting the target. The dense reward setting provides the Tossing and Hitting rewards according to the box position.

8.6.3.2 Locomotion

A 9-DoF bipedal planar walker is used for simulating the locomotion tasks. The observation consists of the position and velocity of the torso, joint angles, and angular velocities. The action space is torque control on the 6 joints.

Reward Design

Different locomotion tasks share many components of reward design, such as velocity, stability, and posture. We use the same form of reward function, but with different hyperparameters for each task. The basic form of the reward function is as follows:

R(s) = λ_vel · |v_x − v_target| + λ_alive − λ_height · |1.1 − min(1.1, Δh)| + λ_angle · cos(angle) − λ_foot · (v_right_foot + v_left_foot),

where v_x, v_right_foot, and v_left_foot are the forward velocity, right foot angular velocity, and left foot angular velocity; and Δh and angle are the distance between the foot and torso and the angle of the torso, respectively. The foot velocity terms help the agent move its feet naturally. Δh and angle are used to maintain the height of the torso and encourage an upright pose.
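For concreteness, the sketch below evaluates this shared reward form directly; the function and argument names are ours, and the per-task coefficients (λ_vel, λ_alive, λ_height, λ_angle, λ_foot, v_target) are the values listed with each task below.

import math

def locomotion_reward(v_x, v_right_foot, v_left_foot, delta_h, angle,
                      lam_vel, lam_alive, lam_height, lam_angle, lam_foot,
                      v_target):
    # Generic walker reward shared by the locomotion tasks; per-task lambda
    # values and v_target are given in the task descriptions that follow.
    return (lam_vel * abs(v_x - v_target)
            + lam_alive
            - lam_height * abs(1.1 - min(1.1, delta_h))
            + lam_angle * math.cos(angle)
            - lam_foot * (v_right_foot + v_left_foot))

# Example: the Forward task uses lam_vel=2, lam_alive=1, lam_height=2,
# lam_angle=0.1, lam_foot=0.01, and v_target=3.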
Forward: The Forward task requires the walker agent to walk forward for 20 meters. To make the agent robust, we apply a random force with arbitrary magnitude and direction to a randomly selected joint every 10 timesteps.

λ_vel = 2, λ_alive = 1, λ_height = 2, λ_angle = 0.1, λ_foot = 0.01, and v_target = 3

Backward: Similar to Forward, the Backward task requires the walker to walk backward for 20 meters under random forces.

λ_vel = 2, λ_alive = 1, λ_height = 2, λ_angle = 0.1, λ_foot = 0.01, and v_target = -3

Balancing: In the Balancing task, the agent learns to balance under strong random forces for 1000 timesteps. Similar to the other tasks, the random forces are applied to a random joint every 10 timesteps, but with a magnitude 5 times larger.

λ_vel = 1, λ_alive = 1, λ_height = 0.5, λ_angle = 0.1, λ_foot = 0, and v_target = 0

Crawling: In the Crawling task, a ceiling of height 1.0 and length 16 is located in front of the agent, and the agent is required to crawl under the ceiling without touching it. If the agent touches the ceiling, we terminate the episode. The task is completed when the agent passes a point 1.5 after the ceiling, and the agent receives an additional reward of 100.

λ_vel = 2, λ_alive = 1, λ_height = 0, λ_angle = 0.1, λ_foot = 0.01, and v_target = 3

Jumping: In the Jumping task, a curb of height 0.4 and length 0.2 is located in front of the walker agent. The observation contains the distance to the curb in addition to the 17-dimensional joint information, where the distance is clipped at 3. The x location of the curb is randomly chosen from [2.5, 5.5]. In addition to the reward function above, the agent receives an additional reward of 100 for passing the curb and 200 · v_y when it passes the front, middle, and end slices of the curb, where v_y is the y-velocity. If the agent touches the curb, it receives a -10 penalty and the episode is terminated.

λ_vel = 2, λ_alive = 1, λ_height = 2, λ_angle = 0.1, λ_foot = 0.01, and v_target = 3

Patrol: The Patrol task is repetitive running forward and backward between two goals at x = -2 and x = 2. Once the agent touches a goal, the target is changed to the other goal and a sparse reward of +1 is given. The dense reward alternates between the reward functions of Forward and Backward: the agent receives the reward of Forward when it is heading toward x = 2 and the reward of Backward otherwise.

Hurdle: The Hurdle environment consists of 5 curbs positioned at x = {8, 18, 28, 38, 48} and requires repetitive walking and jumping behaviors. The position of each curb is randomized with a value uniformly sampled from [-0.5, 0.5]. A sparse reward of +1 is given when the agent jumps over a curb (i.e., passes a point 1.5 after a curb). The dense reward for Hurdle is the same as for Jumping and has 8 reward components to guide the agent to learn the desired behavior. By extensively designing dense rewards, it is possible to solve complex tasks. In comparison, our proposed method learns from sparse rewards by re-using prior knowledge and does not require reward shaping.

Obstacle Course: The Obstacle Course environment replaces two curbs in Hurdle with a ceiling of height 1.0 and length 3. A sparse reward of +1 is given when the agent jumps over a curb or passes through a ceiling (i.e., passes a point 1.5 after a curb or a ceiling). The dense reward alternates between Jumping before a curb and Crawling before a ceiling.

Termination Signal: Locomotion tasks except Crawling fail if h < 0.8, and Crawling fails if h < 0.3. The Forward and Backward tasks are considered successful when the walker reaches the target or a point 5 in front of the obstacles. The Balancing task is considered successful when the agent does not fail for 50 timesteps. The agent succeeds on Jumping and Crawling if it passes the obstacles by a distance of 1.5.

Part V

Conclusion

Chapter 9

Conclusion

9.1 Summary

This dissertation describes a robot learning framework that is designed to allow robots to interpret and acquire complex skills.
The key insight is to represent a skill or a task-solving procedure using programs that are structured in a formal domain-specific language (DSL). Specifically, Part II describes techniques that can infer programs from expert demonstrations (Chapter 2) and reward functions (Chapter 3). Then, Part III introduces two lines of work that aim to efficiently acquire a set of primitive skills: meta-reinforcement learning (Chapter 4 and Chapter 5) and learning from demonstrations (Chapter 6). Finally, Part IV presents how robots can learn to execute inferred programs (Chapter 7) and hierarchically compose a set of acquired primitive skills (Chapter 8). The novelty, feasibility, and potential impact of this proposed framework have been justified by a series of publications presented at top-tier computer science and machine learning conferences.

9.2 Future Directions

To further improve the proposed robot learning framework, I plan to continue researching in the following directions.

9.2.1 Program Inference

This dissertation introduces methods that are designed to infer programs for mimicking expert behaviors and for addressing reinforcement learning tasks. To develop more practical program inference frameworks, I plan to conduct research in the following directions.

Synthesizing programs from real-world videos. I aim to further leverage my experience in computer vision [121, 285, 286] to devise a more general program synthesis framework for synthesizing programs from more complex task specifications. This includes directions such as incorporating the scene understanding ability of computer vision models trained on large-scale datasets [58, 109, 147, 214] and exploiting multimodal demonstrations (e.g., cooking instructional videos) that contain audio, captions or subtitles, manuals, etc.

Leveraging language models for program synthesis. Language models trained on large-scale datasets have achieved tremendous success in a wide range of natural language processing tasks [30, 62, 153, 339]. I believe developing and leveraging language models can significantly increase the scalability of program synthesis methods. Recently, encouraging results have been shown in leveraging language models for program synthesis in competitive programming [18, 44, 167]. In contrast, I aim to develop and leverage language models for programs that describe behaviors for agent learning.

Learning programmatic priors. The program inference works described in this dissertation rely on learning from randomly generated programs. Yet, it would be more effective to learn from goal-oriented behavioral programs, which would allow for extracting meaningful programmatic behavioral priors.

Developing differentiable program executors. Existing program synthesis methods assume the availability of program executors that can execute programs to produce execution results. In most cases, such an execution process is not differentiable and therefore cannot be utilized as direct supervision (i.e., gradients) for learning. Developing differentiable program executors would allow learning models to be optimized based on execution results, which yields more accurate and direct supervision.

Building programmatic multi-agent learning systems. This dissertation shows encouraging results in single-agent environments. I believe the advantages of the proposed framework, such as interpretability and generalization ability, can apply to multi-agent robot learning systems.

Programming from negative examples.
Existing programming-from-examples methods [35, 46, 47, 63, 273, 287] are designed to learn from "positive" examples that demonstrate inputs and outputs produced by correct program execution. It would be interesting to explore how "negative" examples, which contain incorrect program execution results, can play a role in learning program inference.

9.2.2 Primitive Skill Acquisition

To efficiently acquire a set of primitive skills, this dissertation introduces meta-learning and learning from demonstration methods. I plan to continue research in meta-learning and in directions that exploit the relationship among skills to further improve learning efficiency.

9.2.3 Task Execution

The techniques proposed in this dissertation are mainly evaluated in simulated environments. Yet, to justify and improve the effectiveness of applying these techniques to real robots, I plan to work on the following directions.

Deploying proposed algorithms to real robots. I aim to investigate the advantages and the limitations of applying the proposed framework to real robot learning systems. Specifically, I plan to explore robot manipulation tasks such as object rearrangement and furniture assembly.

Researching sim-to-real techniques. I plan to research sim-to-real techniques [41, 126, 189, 227] that can translate the success achieved in simulation to the real world.

Bibliography

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. “TensorFlow: A System for Large-Scale Machine Learning”. In: USENIX Symposium on Operating Systems Design and Implementation. 2016.
[2] Pieter Abbeel and Andrew Y Ng. “Apprenticeship learning via inverse reinforcement learning”. In: International Conference on Machine Learning. 2004.
[3] Daniel A Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, and Quoc V Le. “Neural program synthesis with priority queue training”. In: arXiv preprint arXiv:1801.03526 (2018).
[4] Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. “OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning”. In: International Conference on Learning Representations. 2021.
[5] Ferran Alet, Javier Lopez-Contreras, James Koppel, Maxwell Nye, Armando Solar-Lezama, Tomas Lozano-Perez, Leslie Kaelbling, and Joshua Tenenbaum. “A large-scale benchmark for few-shot program induction and synthesis”. In: International Conference on Machine Learning. 2021.
[6] Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. “Augmented cyclegan: Learning many-to-many mappings from unpaired data”. In: International Conference on Machine Learning. 2018.
[7] Uri Alon, Omer Levy, and Eran Yahav. “code2seq: Generating sequences from structured representations of code”. In: International Conference on Learning Representations. 2019.
[8] David Andre and Stuart J Russell. “Programmable reinforcement learning agents”. In: Neural Information Processing Systems. 2001.
[9] David Andre and Stuart J Russell. “State abstraction for programmable reinforcement learning agents”. In: National Conference on Artificial Intelligence. 2002.
[10] Jacob Andreas, Dan Klein, and Sergey Levine. “Learning with latent language”. In: North American Chapter of the Association for Computational Linguistics. 2017.
[11] Jacob Andreas, Dan Klein, and Sergey Levine. “Modular multitask reinforcement learning with policy sketches”. In: International Conference on Machine Learning. 2017.
[12] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. “Neural module networks”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016. [13] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. “Learning to learn by gradient descent by gradient descent”. In: Neural Information Processing Systems. 2016. [14] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. “Hindsight experience replay”. In: Neural Information Processing Systems. 2017. [15] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. “Learning Dexterous In-Hand Manipulation”. In: The International Journal of Robotics Research (2020). [16] Daniel Angelov, Yordan Hristov, Michael Burke, and Subramanian Ramamoorthy. “Composing Diverse Policies for Temporally Extended Tasks”. In: IEEE Robotics and Automation Letters (2020). [17] Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. “Provably safe and robust learning-based model predictive control”. In: Automatica (2013). [18] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. “Program Synthesis with Large Language Models”. In: arXiv preprint arXiv:2108.07732 (2021). [19] Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. “Playing hard exploration games by watching YouTube”. In: Neural Information Processing Systems. 2018. [20] Pierre-Luc Bacon, Jean Harb, and Doina Precup. “The Option-Critic Architecture.” In: AAAI Conference on Artificial Intelligence . 2017. [21] Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Pushmeet Kohli, and Edward Grefenstette. “Learning to Understand Goal Specifications by Modelling Reward”. In: International Conference on Learning Representations. 2019. [22] Mihalj Bakator and Dragica Radosav. “Deep learning and medical diagnosis: A review of literature”. In: Multimodal Technologies and Interaction (2018). [23] Bram Bakker, Jürgen Schmidhuber, et al. “Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization”. In: Intelligent Autonomous Systems. 2004. [24] Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. “Deepcoder: Learning to write programs”. In: International Conference on Learning Representations. 2017. 276 [25] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. “Verifiable reinforcement learning via policy extraction”. In: Neural Information Processing Systems. 2018. [26] Harold Bekkering, Andreas Wohlschlager, and Merideth Gattis. “Imitation of gestures in children is goal-directed”. In: The Quarterly Journal of Experimental Psychology: Section A (2000). [27] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. “Unifying count-based exploration and intrinsic motivation”. In: Neural Information Processing Systems. 2016. [28] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the Optimization of a Synaptic Learning Rule. 1997. [29] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. “Safe Model-Based Reinforcement Learning with Stability Guarantees”. In: Neural Information Processing Systems. 2017. 
[30] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. “On the opportunities and risks of foundation models”. In: arXiv preprint arXiv:2108.07258 (2021). [31] Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel. “Programming with a Differentiable Forth Interpreter”. In: International Conference on Machine Learning. 2017. [32] Satchuthananthavale RK Branavan, Harr Chen, Luke S Zettlemoyer, and Regina Barzilay. “Reinforcement learning for mapping instructions to actions”. In: Assosiation of Computational Linguistics. 2009. [33] SRK Branavan, Nate Kushman, Tao Lei, and Regina Barzilay. “Learning high-level planning from text”. In: Assosiation of Computational Linguistics. 2012. [34] Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. “Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations”. In: International Conference on Machine Learning. 2019. [35] Rudy R Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. “Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis”. In: International Conference on Learning Representations. 2018. [36] Michael Burke, Katie Lu, Daniel Angelov, Art¯ uras Straižys, Craig Innes, Kartic Subr, and Subramanian Ramamoorthy. “Learning robotic ultrasound scanning using probabilistic temporal ranking”. In: arXiv preprint arXiv:2002.01240 (2020). [37] Michael Burke, Svetlin Penkov, and Subramanian Ramamoorthy. “From explanation to synthesis: Compositional program induction for learning from demonstration”. In: arXiv preprint arXiv:1902.10657 (2019). [38] Jonathon Cai, Richard Shin, and Dawn Song. “Making neural programming architectures generalize via recursion”. In: International Conference on Learning Representations. 2017. 277 [39] Alexandre Campeau-Lecours, Hugo Lamontagne, Simon Latour, Philippe Fauteux, Véronique Maheu, François Boucher, Charles Deguire, and Louis-Joseph Caron L’Ecuyer. “Kinova modular robot arms for service robotics applications”. In: Rapid Automation: Concepts, Methodologies, Tools, and Applications. 2019. [40] Chia-Jung Chang, Wei Guo, Jie Zhang, Jon Newman, Shao-Hua Sun, and Matt Wilson. “Behavioral clusters revealed by end-to-end decoding from microendoscopic imaging”. In: bioRxiv (2021). [41] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. “Closing the sim-to-real loop: Adapting simulation randomization with real world experience”. In: IEEE International Conference on Robotics and Automation. 2019. [42] Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, et al. “Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills”. In: arXiv preprint arXiv:2104.07749 (2021). [43] Liushan Chen, Yu Pei, and Carlo A Furia. “Contract-based program repair without the contracts”. In: International Conference on Automated Software Engineering. 2017. [44] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. “Evaluating large language models trained on code”. In: arXiv preprint arXiv:2107.03374 (2021). [45] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. “A Closer Look at Few-shot Classification”. 
In: International Conference on Learning Representations. 2019. [46] Xinyun Chen, Chang Liu, and Dawn Song. “Execution-Guided Neural Program Synthesis”. In: International Conference on Learning Representations. 2019. [47] Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, and Denny Zhou. “SpreadsheetCoder: Formula Prediction from Semi-structured Context”. In: International Conference on Machine Learning. 2021. [48] Xinyun Chen, Dawn Song, and Yuandong Tian. “Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages”. In: arXiv preprint arXiv:2107.00101 (2021). [49] Yun-Chun Chen, Chao-Te Chou, and Yu-Chiang Frank Wang. “Learning to Learn in a Semi-supervised Fashion”. In: European Conference on Computer Vision. 2020. [50] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid. 2018. [51] Hao-Tien Lewis Chiang, Jasmine Hsu, Marek Fiser, Lydia Tapia, and Aleksandra Faust. “RL-RRT: Kinodynamic motion planning via learning reachability estimators from RL policies”. In: IEEE Robotics and Automation Letters (2019). [52] Dongkyu Choi and Pat Langley. “Learning teleoreactive logic programs from problem solving”. In: International Conference on Inductive Logic Programming. 2005. 278 [53] Ignasi Clavera, Anusha Nagabandi, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. “Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning”. In: International Conference on Learning Representations. 2019. [54] Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. “Model-Based Reinforcement Learning via Meta-Policy Optimization”. In: Conference on Robot Learning. 2018. [55] Raphaël Dang-Nhu. “PLANS: Neuro-Symbolic Program Learning from Videos”. In: Neural Information Processing Systems. 2020. [56] Christian Daniel, Herke Van Hoof, Jan Peters, and Gerhard Neumann. “Probabilistic inference for determining options in reinforcement learning”. In: Machine Learning (2016). [57] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. “RoboNet: Large-scale multi-robot learning”. In: Conference on Robot Learning. 2019. [58] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009. [59] Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. “Programmable agents”. In: arXiv preprint arXiv:1706.06383 (2017). [60] Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Subhajit Roy, et al. “Program synthesis using natural language”. In: International Conference on Software Engineering. 2016. [61] Jacob Devlin, Rudy R Bunel, Rishabh Singh, Matthew Hausknecht, and Pushmeet Kohli. “Neural Program Meta-Induction”. In: Neural Information Processing Systems. 2017. [62] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirectional transformers for language understanding”. In: North American Chapter of the Association for Computational Linguistics. 2018. [63] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. “Robustfill: Neural program learning under noisy I/O”. 
In: International Conference on Machine Learning. 2017. [64] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. 2017. [65] Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. “Gated-attention readers for text comprehension”. In: Assosiation of Computational Linguistics. 2017. [66] Nat Dilokthanakul, Christos Kaplanis, Nick Pawlowski, and Murray Shanahan. “Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning”. In: arXiv preprint arXiv:1705.06769 (2017). 279 [67] Yiming Ding, Carlos Florensa, Mariano Phielipp, and Pieter Abbeel. “Goal-conditioned imitation learning”. In: Neural Information Processing Systems. 2019. [68] Ron Dorfman, Idan Shenfeld, and Aviv Tamar. “Offline Meta Learning of Exploration”. In: Neural Information Processing Systems. 2021. [69] Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. “One-shot imitation learning”. In: Neural Information Processing Systems. 2017. [70] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. “RL 2 : Fast Reinforcement Learning via Slow Reinforcement Learning”. In: arXiv preprint arXiv:1611.02779 (2016). [71] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. “A learned representation for artistic style”. In: International Conference on Learning Representations. 2017. [72] Thomas Durieux and Martin Monperrus. “Dynamoth: dynamic code synthesis for automatic program repair”. In: International Workshop on Automation of Software Test. 2016. [73] Ashley D. Edwards and Charles L. Isbell. “Perceptual Values from Observation”. In: arXiv preprint arXiv:1905.07861 (2019). [74] Ashley D. Edwards, Charles L. Isbell, and Atsuo Takanishi. “Perceptual reward functions”. In: arXiv preprint arXiv:1608.03824 (2016). [75] Kevin Ellis, Maxwell Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando Solar-Lezama. “Write, execute, assess: Program synthesis with a repl”. In: Neural Information Processing Systems. 2019. [76] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. “Dreamcoder: Growing generalizable, interpretable knowledge with wake-sleep bayesian program learning”. In: arXiv preprint arXiv:2006.08381 (2020). [77] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: International Conference on Machine Learning. 2017. [78] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. “A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models”. In: Adversarial Training Workshop on Neural Information Processing Systems (2016). [79] Chelsea Finn and Sergey Levine. “Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm”. In: International Conference on Learning Representations. 2018. [80] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. “Deep spatial autoencoders for visuomotor learning”. In: IEEE International Conference on Robotics and Automation. 2016. 280 [81] Chelsea Finn, Kelvin Xu, and Sergey Levine. “Probabilistic Model-Agnostic Meta-Learning”. In: Neural Information Processing Systems. 2018. 
[82] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. “One-shot visual imitation learning via meta-learning”. In: Conference on Robot Learning. 2017. [83] Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. “A general safety framework for learning-based control in uncertain robotic systems”. In: IEEE Transactions on Automatic Control (2018). [84] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. “Meta Learning Shared Hierarchies”. In: International Conference on Learning Representations. 2018. [85] Daniel Fried, Jacob Andreas, and Dan Klein. “Unified Pragmatic Models for Generating and Following Instructions”. In: North American Chapter of the Association for Computational Linguistics. 2017. [86] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. “Speaker-Follower Models for Vision-and-Language Navigation”. In: Neural Information Processing Systems. 2018. [87] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. “D4rl: Datasets for deep data-driven reinforcement learning”. In: arXiv preprint arXiv:2004.07219 (2020). [88] Justin Fu, Katie Luo, and Sergey Levine. “Learning Robust Rewards with Adverserial Inverse Reinforcement Learning”. In: International Conference on Learning Representations. 2018. [89] Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, and Daniel Tarlow. “Differentiable Programs with Neural Libraries”. In: International Conference on Machine Learning. 2017. [90] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: theory and practice. Elsevier, 2004. [91] Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. “Divide and Conquer Reinforcement Learning”. In: International Conference on Learning Representations. 2018. [92] Biraja Ghoshal and Allan Tucker. “Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection”. In: arXiv preprint arXiv:2003.10769 (2020). [93] Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. “Automated program repair”. In: Communications of the ACM (2019). [94] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes”. In: International Conference on Learning Representations. 2018. [95] Alex Graves, Greg Wayne, and Ivo Danihelka. “Neural turing machines”. In: arXiv preprint arXiv:1410.5401 (2014). 281 [96] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. “Hybrid computing using a neural network with dynamic external memory”. In: Nature (2016). [97] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. “Learning to transduce with unbounded memory”. In: Neural Information Processing Systems. 2015. [98] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates”. In: IEEE International Conference on Robotics and Automation. 2017. [99] Aditya Gudimella, Ross Story, Matineh Shaker, Ruofan Kong, Matthew Brown, Victor Shnayder, and Marcos Campos. “Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks”. In: arXiv preprint arXiv: 1709.06977 (2017). 
Abstract
Recent developments in artificial intelligence and machine learning have remarkably advanced machines' ability to understand images and videos, comprehend natural language and speech, and outperform human experts in complex games. However, building intelligent robots that can physically interact with their surroundings as well as learn to operate in unstructured environments, manipulate unknown objects, and acquire novel skills, thereby freeing humans from tedious or dangerous manual work, remains challenging. The focus of my research is to develop a robot learning framework that enables robots to acquire long-horizon, complex skills with hierarchical structures, such as furniture assembly and cooking. Specifically, I aim to devise a robot learning framework that is: (1) interpretable, by decoupling the interpretation of skill specifications (e.g., demonstrations, reward functions) from skill execution; (2) programmatic, by generalizing from simple instances to complex instances without additional learning; (3) hierarchical, by operating at a level of abstraction that enables human users to interpret the robot's high-level plans and allows primitive skills to be composed to solve long-horizon tasks; and (4) modular, by being equipped with modules specialized in different functions (e.g., perception, action) that collaborate, allowing for better generalization. This dissertation discusses a series of research projects toward building such an interpretable, programmatic, hierarchical, and modular robot learning framework.
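To make the decoupling described above concrete, the following is a minimal, hypothetical Python sketch, not taken from the dissertation: it only illustrates the idea of interpreting a skill specification into a readable high-level plan and then executing that plan by composing primitive skills. All names here (infer_program, execute_program, the pick/place skills) are illustrative assumptions; a real system would synthesize richer programs with control flow and learned perception and action modules.

    # Illustrative sketch only: a "program" is a list of (skill_name, argument) steps.
    from typing import Callable, Dict, List, Tuple

    Program = List[Tuple[str, str]]  # e.g., [("pick", "leg"), ("place", "table_top")]

    def infer_program(demonstrations: List[List[str]]) -> Program:
        # Hypothetical interpretation module: map demonstrations to a high-level plan.
        # A real synthesizer would produce structured programs (loops, conditionals);
        # here we simply keep the most common demonstrated step sequence.
        assert demonstrations, "need at least one demonstration"
        steps = max(demonstrations, key=demonstrations.count)
        return [tuple(step.split(":", 1)) for step in steps]  # "pick:leg" -> ("pick", "leg")

    def execute_program(program: Program, skills: Dict[str, Callable[[str], None]]) -> None:
        # Execution module: compose primitive skills according to the inferred plan.
        for skill_name, argument in program:
            skills[skill_name](argument)  # each primitive skill is a separate module

    if __name__ == "__main__":
        demos = [["pick:leg", "place:table_top"], ["pick:leg", "place:table_top"]]
        primitive_skills = {
            "pick": lambda obj: print(f"picking {obj}"),
            "place": lambda obj: print(f"placing on {obj}"),
        }
        plan = infer_program(demos)              # interpretable plan a user can read
        execute_program(plan, primitive_skills)  # long-horizon behavior from primitives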
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Scaling robot learning with skills
Algorithms and systems for continual robot learning
Leveraging prior experience for scalable transfer in robot learning
Characterizing and improving robot learning: a control-theoretic perspective
Sample-efficient and robust neurosymbolic learning from demonstrations
High-throughput methods for simulation and deep reinforcement learning
Efficiently learning human preferences for proactive robot assistance in assembly tasks
Accelerating robot manipulation using demonstrations
Machine learning of motor skills for robotics
Leveraging cross-task transfer in sequential decision problems
Robot life-long task learning from human demonstrations: a Bayesian approach
Data-driven acquisition of closed-loop robotic skills
Rethinking perception-action loops via interactive perception and learned representations
Robust loop closures for multi-robot SLAM in unstructured environments
Learning from planners to enable new robot capabilities
Quickly solving new tasks, with meta-learning and without
Identifying and leveraging structure in complex cooperative tasks for multi-agent reinforcement learning
Advancing robot autonomy for long-horizon tasks
Learning objective functions for autonomous motion generation
Decision support systems for adaptive experimental design of autonomous, off-road ground vehicles
Asset Metadata
Creator
Sun, Shao-Hua (author)
Core Title
Program-guided framework for interpreting and acquiring complex skills with learning robots
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2022-05
Publication Date
03/07/2022
Defense Date
03/06/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
agent learning,deep learning,deep reinforcement learning,deep RL,learning from demonstration,learning from observation,meta-learning,OAI-PMH Harvest,program execution,program synthesis,reinforcement learning,robot learning,skill learning
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Lim, Joseph J. (committee chair), Nguyen, Quan (committee member), Nikolaidis, Stefanos (committee member), Sukhatme, Gaurav (committee member)
Creator Email
shaohuas@usc.edu,waltersun81@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC110768223
Unique identifier
UC110768223
Legacy Identifier
etd-SunShaoHua-10428
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Sun, Shao-Hua
Type
texts
Source
20220308-usctheses-batch-915 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu