Scaling Robot Learning with Skills
by
Youngwoon Lee
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2022
Copyright 2022 Youngwoon Lee
Acknowledgements
First and foremost, I am immensely grateful to my Ph.D. advisor, Joseph Lim, for his trust,
mentorship, and continuous support over the last six years. I was fortunate enough to be one of
his first students, to start a new lab together with him, and to receive so much invaluable advice not only on
research but also on life.
I would like to thank my dissertation committee members, Gaurav Sukhatme, Stefanos Niko-
laidis, and Somil Bansal, for their support and guidance. I would also like to thank Jesse Thomason,
Yan Liu, and Rahul Jain for serving on my qualification exam and thesis proposal committees. I am
also grateful to my master's advisor, Sung-Eui Yoon, and my mentor, Jae-Pil Heo, for inspiring me to
pursue a career in research.
I am grateful to work with wonderful people in the CLVR lab as my friends and collaborators.
In particular, I would like to thank the starting members of our lab, Shao-Hua Sun, Jiayuan Mao,
Honghua Dong, Jongwook Choi, and Taehoon Kim, who taught me a lot about deep learning and
TensorFlow, inspired me with thoughtful discussions, and gave me so much encouragement. I was
also very fortunate to have amazing peers in the lab, Ayush Jain, Karl Pertsch, Te-Lin Wu, Grace
Zhang, and Jesse Zhang, who were extremely supportive, helped me on every part of my Ph.D. life,
and made my journey full of fun.
I am very happy to have had the opportunity to collaborate with an amazing set of students
throughout my Ph.D. study. I would especially like to thank Edward Hu, Zhengyu Yang, Alex
Yin, Zetai Yu, and Taehoon Kim for helping with my first Ph.D. project, the IKEA Furniture Assembly
Environment, which made me struggle the most for more than three years but turned out to be the
most rewarding work. For the work in this thesis, I enjoyed working with Karl Pertsch, Edward
Hu, Shao-Hua Sun, Sriram Somasundaram, Jingyun Yang, Lucy Shi, Minho Heo, and Doohyun
Lee. Beyond the work in this thesis, I had the pleasure of working with a number of other students:
Grace Zhang, Andrew Szot, Yuan-Hong Liao, Jun Yamada, Peter Englert, Gautam Salhotra, Dweep
Trivedi, Linghan Zhong, I-Chun Liu, Shagun Uppal, and Shivin Dass. I would like to thank all
of the undergraduate and master's students that I have mentored, including Edward Hu, Sriram
Somasundaram, Jingyun Yang, Andrew Szot, Zhengyu Yang, Alex Yin, Linghan Zhong, I-Chun
Liu, Shagun Uppal, Lucy Shi, Minho Heo, and Shivin Dass, for your enthusiasm and hard work in
the projects we undertook and your patience as I learned how to best advise you.
I am grateful to have had the opportunity to do my internships at NVIDIA under Yuke Zhu
and Anima Anandkumar, at SKT T-Brain, and at NAVER AI Labs. I was fortunate enough to have
tremendous freedom during all of my internships.
My time at USC would not have been the same without continuous encouragement from my
friends, including Jae-Pil, Junghwan, Taejoon, Joonsung, Eunhye, Jaeheon, Siyoung, Ilgyu, Jihoon,
Taejin, Changhee, Jangho, Jisu, and many others. I would especially like to thank Eunji, who
introduced me to my Ph.D. advisor, Joseph. Moreover, I am tremendously grateful for all the friends
from KAIST Sorimoeum and ETRI, including Narae, Sungoh, Seongha, Yoonkoo, Kyungmin,
Jaeyoon, Junmin, Cheolhee, Byeongwook, Kyungyoon, Minjoo, Hyeyeon, Changuk, Seungkeun,
Geunchang, Hyesoo, Jeonghee, Suho, Kihoon, Suwoong, Hyunwoo, and Seokbin.
I would like to give a special thank you to Soohwan Jo for your endless love and support. Finally,
I would love to thank my parents, Siyoung Lee and Yunran Kim, and my sister, Younga Lee, for
always believing in me and encouraging me to be the best version of myself.
TABLE OF CONTENTS
Acknowledgements ii
List of Tables x
List of Figures xii
Abstract xxii
Chapter 1: Introduction 1
1.1 Long-Horizon Task Benchmarks: Furniture Assembly in Simulation and Real World 2
1.2 Skill Chaining for Solving Complex Long-Horizon Tasks . . . . . . . . . . . . . . 3
1.3 Accelerating Reinforcement Learning with Learned Skills . . . . . . . . . . . . . 4
I Long-Horizon Task Benchmarks: Furniture Assembly in Simulation and Real World 6
Chapter 2: IKEA Furniture Assembly Simulator 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 IKEA Furniture Assembly Environment . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Environment Development . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Furniture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Furniture Assembly Simulation . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.5 Reward Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.6 Demonstration Collection and Generation . . . . . . . . . . . . . . . . . . 16
2.2.7 Domain Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.8 Assembly Difficulty by Furniture . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Evaluation Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Reinforcement Learning Benchmark . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Imitation Learning Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Chapter 3: Real-World Furniture Assembly Environment 24
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Furniture Assembly Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Furniture Assembly as a Complex Long-Horizon Task . . . . . . . . . . . 28
3.3.2 Reproducible System Design and Benchmark Setup . . . . . . . . . . . . . 29
3.3.2.1 Furniture Models . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Reducing Human Effort and Intervention . . . . . . . . . . . . . . . . . . 33
3.3.3.1 Reward Function . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.3.2 Reset Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3.3 Data Collection using Teleoperation . . . . . . . . . . . . . . . 34
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Baselines and Evaluation Protocol . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.3 One-Leg Assembly Task Results . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.4 Full Assembly Task Results . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
II Skill Chaining for Solving Complex Long-Horizon Tasks 42
Chapter 4: Composing Skills via Transition Policies 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Modular Framework with Transition Policies . . . . . . . . . . . . . . . . 47
4.3.3 Training Transition Policies . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.2 Robotic Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.3 Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.5 Training of Transition Policies and Proximity Predictors . . . . . . . . . . 57
4.4.6 Visualization of Transition Trajectory . . . . . . . . . . . . . . . . . . . . 58
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5: Skill Chaining via Terminal State Regularization 61
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.2 Learning Subtask Policies . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3.3 Skill Chaining with Terminal State Regularization . . . . . . . . . . . . . 68
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 6: Coordinating Skills via Skill Behavior Diversification 77
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.2 Modular Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3.3 Training Agent-Specific Primitive Skills with Diverse Behaviors . . . . . . 83
6.3.4 Composing Primitive Skills with Meta Policy . . . . . . . . . . . . . . . . 84
6.3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4.2 Jaco Pick-Push-Place . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.3 Jaco Bar-Moving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4.4 Ant Push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.5 Effect of Diversity of Primitive Skills . . . . . . . . . . . . . . . . . . . . 91
6.4.6 Effect of Skill Selection Interval T_low . . . . . . . . . . . . . . . . . . 91
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
III Accelerating Reinforcement Learning with Learned Skills 94
Chapter 7: Reinforcement Learning with Learned Skills 95
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.3.2 Learning Continuous Skill Embedding and Skill Prior . . . . . . . . . . . 99
7.3.3 Skill Prior Regularized Reinforcement Learning . . . . . . . . . . . . . . 101
7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.4.1 Environments & Comparisons . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4.2 Maze Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.4.3 Robotic Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 8: Demonstration-Guided Reinforcement Learning with Learned Skills 110
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.3.2 Skill Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . 115
8.3.3 Demonstration-Guided RL with Learned Skills . . . . . . . . . . . . . . . 116
8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.4.1 Experimental Setup and Comparisons . . . . . . . . . . . . . . . . . . . . 119
8.4.2 Demonstration-Guided RL with Learned Skills . . . . . . . . . . . . . . . 121
8.4.3 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.4.4 Robustness to Partial Demonstrations . . . . . . . . . . . . . . . . . . . . 123
8.4.5 Data Alignment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Chapter 9: Skill-based Model-based Reinforcement Learning 126
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
9.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
9.3.2 SkiMo Model Components . . . . . . . . . . . . . . . . . . . . . . . . . . 131
9.3.3 Pre-Training Skill Dynamics Model and Skills from Task-agnostic Data . . 132
9.3.4 Downstream Task Learning with Learned Skill Dynamics Model . . . . . . 134
9.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
9.4.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
9.4.2 Baselines and Ablated Methods . . . . . . . . . . . . . . . . . . . . . . . 137
9.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9.4.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
9.4.5 Long-Horizon Prediction with Skill Dynamics Model . . . . . . . . . . . . 141
9.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Chapter 10: Conclusion 143
References 146
Appendices 164
Appendix A
Real-World Furniture Assembly Environment . . . . . . . . . . . . . . . . . . . . . . . 166
A Environment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.1 Environment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.2 Multi-Camera Pose Estimation using AprilTag . . . . . . . . . . . . . . . 167
A.3 Reset Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.4 Reward Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.5 Sensory Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.6 Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.7 Hardware Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
B Furniture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Appendix B
Composing Skills via Transition Policies . . . . . . . . . . . . . . . . . . . . . . . . . . 180
A Acquiring Primitive Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
B Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
B.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
B.2 Replay Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
B.3 Proximity Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
B.4 Proximity Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
B.5 Transition Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
B.6 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
C Environment Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
C.1 Robotic Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
C.1.1 Reward Design and Termination Condition . . . . . . . . . . . . 186
C.2 Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
C.2.1 Reward Design . . . . . . . . . . . . . . . . . . . . . . . . . . 188
C.2.2 Termination Condition . . . . . . . . . . . . . . . . . . . . . . . 190
Appendix C
Skill Chaining via Terminal State Regularization . . . . . . . . . . . . . . . . . . . . . . 191
A Environment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
B Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
C Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Appendix D
Coordinating Skills via Skill Behavior Diversification . . . . . . . . . . . . . . . . . . . 194
A Environment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
A.1 Environment Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . 194
A.2 Reward Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
B Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
B.1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
B.2 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
B.3 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Appendix E
Reinforcement Learning with Learned Skills . . . . . . . . . . . . . . . . . . . . . . . . 200
A Action-prior Regularized Soft Actor-Critic . . . . . . . . . . . . . . . . . . . . . . 200
B Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
B.1 Model Architecture and Training Objective . . . . . . . . . . . . . . . . . 203
B.2 Reinforcement Learning Setup . . . . . . . . . . . . . . . . . . . . . . . . 204
C Environments and Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 205
D State-Conditioned Skill Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
E Prior Regularization Ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
F Prior Initialization Ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
G Training with Sub-Optimal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
H Reuse of Learned Skill Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Appendix F
Demonstration-Guided Reinforcement Learning with Learned Skills . . . . . . . . . . . 213
A Full Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
B Implementation and Experimental Details . . . . . . . . . . . . . . . . . . . . . . 213
B.1 Implementation Details: Pre-Training . . . . . . . . . . . . . . . . . . . . 213
B.2 Implementation Details: Downstream RL . . . . . . . . . . . . . . . . . . 216
B.3 Implementation Details: Comparisons . . . . . . . . . . . . . . . . . . . . 216
B.4 Environment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
C Skill Representation Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
D Demonstration-Guided RL Comparisons with Task-Agnostic Experience . . . . . . 221
E Skill-Based Imitation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
F Kitchen Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Appendix G
Skill-based Model-based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 225
A Further Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
A.1 Skill Horizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
A.2 Planning Horizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
A.3 Fine-Tuning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
B Qualitative Analysis on Maze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
B.1 Exploration and Exploitation . . . . . . . . . . . . . . . . . . . . . . . . . 227
B.2 Long-Horizon Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
C Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
C.1 Computing Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
C.2 Algorithm Implementation Details . . . . . . . . . . . . . . . . . . . . . . 230
C.3 Environments and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
List of Tables
2.1 Evaluation Results in IKEA Furniture Assembly Environment . . . . . . . . . . . 20
3.1 BC hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 IQL hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Average phases over 10 evaluation runs with variations of the baselines and different
numbers of demonstrations (100, 200, 300, 400, and 900 demonstrations). The
numbers in parentheses indicate the best completed phase among 10 evaluation runs. 39
3.4 Average phases over 10 evaluation runs with improved baselines and more
demonstrations (up to 200 trajectories). *Human scores are reported as success rates. 40
4.1 Success count for robotic manipulation, comparing our method against baselines
with or without transition policies (TP). Our method achieves the best performance
over both RL baselines and the ablated variants. Each entry in the table represents
average success count and standard deviation over 50 runs with 3 random seeds. . 53
4.2 Success count for locomotion, comparing our method against baselines with or
without transition policies (TP). Our method outperforms all baselines in Patrol
and Obstacle course. In Hurdle, the reward function for TRPO was extensively
engineered, which is not directly comparable to our method. Our method
outperforms baselines learning from sparse reward, showing the effectiveness of the
proposed proximity predictor. Each entry in the table represents average success
count and standard deviation over 50 runs with 3 random seeds. . . . . . . . . . . 56
5.1 Average progress of the furniture assembly task. Each subtask amounts to 0.25
progress. Hence, 1 represents successful execution of all four subtasks while 0
means the agent does not achieve any subtask. Our method learns to complete all
four subtasks in sequence and outperforms the policy sequencing baseline and
standard RL and IL methods. We report the mean and standard deviation across 5
seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1 Success rates for all tasks, comparing our method against baselines. Each entry in
the table represents average success rate and standard deviation over 100 runs. The
baselines learning from scratch fail to learn complex tasks with multiple agents. . 91
A.1 Sensory inputs available in our environment. . . . . . . . . . . . . . . . . . . . . . 170
B.1 Hyperparameter values for transition policy, proximity predictor, and primitive
policy as well as TRPO and PPO baselines. . . . . . . . . . . . . . . . . . . . . . 181
C.1 PPO hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
D.1 Environment details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
D.2 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
E.1 Number of blocks stacked vs fractions of random training data. . . . . . . . . . . . 210
G.1 Comparison to prior work and ablated methods. . . . . . . . . . . . . . . . . . . . 230
G.2 SkiMo hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
List of Figures
2.1 IKEA Furniture Assembly Environment . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Furniture Assembly Subtask Decomposition . . . . . . . . . . . . . . . . . . . . . 9
2.3 Diverse IKEA Furniture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Supported Robots in IKEA Furniture Assembly Environment . . . . . . . . . . . . 12
2.5 Available Observations of IKEA Furniture Assembly Environment . . . . . . . . . 14
2.6 Domain Randomization Supported in IKEA Furniture Assembly Environment . . 16
2.7 Learning curves in IKEA Furniture Assembly Environment . . . . . . . . . . . . . 19
3.1 Furniture assembly benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Dexterous skills required for furniture assembly. Note that screwing is difficult
with one arm; therefore, we fix a long plastic bar to the table to screw objects against. 26
3.3 Furniture models in our benchmark. Each furniture model is inspired by IKEA
furniture. Due to the limitations imposed by using a single robotic arm, we modify
some furniture pieces to be feasible to assemble with one hand. . . . . . . . . . . . 29
3.4 (a) Experiment configuration, (b) visual inputs from three cameras, and
(c) furniture models with AprilTag markers attached. . . . . . . . . . . . . . . . . 32
3.5 Overall robot system design. The agent receives a proprioceptive robot state and a
front-view RGB-D image, and takes an action (end-effector goal pose and a binary
gripper action) at a frequency of 5 Hz. For automatic reset and reward functions,
we use images collected via three RGB-D cameras and estimate furniture poses
using AprilTag. Note that the estimated poses of furniture parts are only used for
environment reset and reward, not for the learning agent. . . . . . . . . . . . . . . 34
3.6 Teleoperation setup. To collect demonstrations, we primarily use an Oculus Quest
2 controller to control a 7-DoF robotic arm and use a keyboard to rotate the wrist
without moving the arm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Concept of a transition policy. Composing complex skills using primitive skills
requires smooth transitions between primitive skills since a following primitive
skill might not be robust to ending states of the previous one. In this example, the
ending states (red circles) of the primitive policy p_jump are not good initial states to
execute the following policy p_walk. Therefore, executing p_walk from these states
will fail (red arrow). To smoothly connect the two primitive policies, we propose a
transition policy which navigates an agent to suitable initial states for p_walk (dashed
arrow), leading to a successful execution of p_walk (green arrow). . . . . . . . . . . 44
4.2 Our modular network augmented with transition policies. To perform a complex
task, our model repeats the following steps: (1) The meta-policy chooses a primitive
policy of index c; (2) The corresponding transition policy helps initiate the chosen
primitive policy; (3) The primitive policy executes the skill; and (4) A success or
failure signal for the primitive skill is produced. . . . . . . . . . . . . . . . . . . . 46
4.3 Training of transition policies and proximity predictors. After executing a primitive
policy, a previously performed transition trajectory is labeled and added to a replay
buffer based on the execution success. A proximity predictor is trained on states
sampled from the two buffers to output the proximity to the initiation set. The
predicted proximity serves as a reward to encourage the transition policy to move
toward good initial states for the corresponding primitive policy. . . . . . . . . . . 49
4.4 Tasks and success count curves of our model (blue), TRPO (purple), PPO (magenta),
and transition policies (TP) trained on task reward (green) and sparse proximity
reward (yellow). Our model achieves the best performance and convergence time.
Note that TRPO and PPO are trained 5 times longer than ours with dense rewards
since TRPO and PPO do not have primitive skills and learn from scratch. In the
success count curves, different temporal scales are used for TRPO and PPO (bottom
x-axis) and ours (top x-axis). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Average transition length and average proximity reward of transition trajectories
over training on Manipulation (left) and Patrol (right). . . . . . . . . . . . . . . . 57
4.6 Visualization of transition trajectories of (a) Repetitive picking up and (b) Patrol.
TOP AND BOTTOM ROWS: contain rendered frames of transition trajectories.
MIDDLE ROW: contains states extracted from each primitive skill execution
projected onto PCA space. The dots connected with lines are extracted from the
same transition trajectory, where the marker color indicates the proximity prediction
P(s). A higher P(s) value indicates proximity to states suitable for initializing the
next primitive skill. LEFT: two picking up transition trajectories demonstrate that
the transition policy learns to navigate from terminal states s_0 and s_1 to t_0 and t_1.
RIGHT: the forward to balance transition moves between the forward and balance
state distributions and the balance to backward transition moves from the balancing
states close to the backward states. . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 We aim to solve a long-horizon task, e.g., furniture assembly, using independently
trained subtask policies. (a) Each subtask policy, π_i, works successfully only on its
initiation set (green), I_i, and results in its termination set (pink), β_i; thus, it fails
when performed outside of I_i (red curve). (b) To enable sequencing policies, a
subsequent policy π_i needs to widen its initiation set to cover the termination set
of the prior policy β_{i−1}. But this can result in an increase of its termination set
β_i, which makes fine-tuning of the following policy π_{i+1} even more challenging.
This effect is exacerbated when more policies are chained together. (c) During
fine-tuning of a policy π_i, we regularize the terminal state distribution β_i to be close
to the initiation set of the next policy I_{i+1}. In contrast to the boundless increase of
β̃_i in (b), our approach effectively keeps the required initiation set small over the
chain of policies with the terminal state regularization. . . . . . . . . . . . . . . . 63
5.2 Our adversarial skill chaining framework regularizes the terminal state distribution
to be close to the initiation set of the subsequent subtask. The initiation set
discriminator models the initiation set distribution by discerning the initiation
set and states in agent trajectories, while the policy learns to reach states close to
the initiation set by augmenting the reward with the discriminator output, dubbed
terminal state regularization. Our method jointly trains all policies and initiation
set discriminators, pushing the termination set close to the initiation set of the
next policy, which leads to smaller changes required for the policies that follow,
especially effective in a long chain of skills. . . . . . . . . . . . . . . . . . . . . . 67
5.3 Two furniture assembly tasks (Lee et al., 2021b) consist of four subtasks (four
table legs for TABLE LACK; two seat supports, chair seat, and front legs for
CHAIR INGOLF). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 To demonstrate the benefit of our terminal state regularization, we visualize
the changes in termination sets over training of the third subtask policy on
CHAIR INGOLF. We plot each terminal state by projecting its object configuration
into 2D space using PCA. Through 36M to 45M training steps, both the policy
sequencing baseline (Clegg et al., 2018) and our method successfully learn to cover
most terminal states from the second subtask. However, without regularization, the
policy sequencing method (red) shows an increasing size of the termination set
(e.g., spreading horizontally at 39M and vertically at 42M steps) as more initial
states are covered by the policy. In contrast, in our approach (blue), the terminal
state distribution is bounded, which shows that the terminal state regularization
can effectively prevent the terminal state distribution from diverging. This bounded
termination set makes learning of the following skills efficient, and thus helps
chain a long sequence of skills. . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1 Composing complex skills using multiple agents’ primitive skills requires proper
coordination between agents since concurrent execution of primitive skills requires
temporal and behavioral coordination. For example, to move a block into a
container on the other end of the table, the agent needs to not only utilize pick,
place, and push primitive skills at the right time but also select the appropriate
behaviors for these skills, represented as latent vectors z_1, z_2, z_3, and z_4 above.
Naive methods neglecting either temporal or behavioral coordination will produce
unintended behaviors, such as collisions between end-effectors. . . . . . . . . . . 78
6.2 Our method is composed of two components: a meta policy and a set of
agent-specific primitive policies relevant to task completion. The meta policy
selects which primitive skill to run for each agent as well as the behavior embedding
(i.e. variation in behavior) of the chosen primitive skill. Each selected primitive
skill takes as input the agent observation and the behavior embedding and outputs
action for that agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Different multi-agent architectures. (a) The vanilla RL method considers all agents
as a monolithic agent; thus a single policy takes the full observation as input and
outputs the full action. (b) The multi-agent RL method (MARL) consists of N
policies that operate on the observations and actions of corresponding agents.
(c) The modular network consists of N sets of skills for the N agents trained in
isolation and a meta policy that selects a skill for each agent. (d-f) The RL, MARL,
and modular network methods augmented with skill behavior diversification (SBD)
have a meta policy that outputs a skill behavior embedding vector z for each skill. . 82
6.4 The composite tasks pose a challenging combination of object manipulation and
locomotion skills, which requires coordination of multiple agents and temporally
extended behaviors. (a) The left Jaco arm needs to pick up a block while the right
Jaco arm pushes a container, and then it places the block into the container. (b) Two
Jaco arms are required to pick and place a bar-shaped block together. (c) Two ants
push the red box to the goal location (green circle) together. . . . . . . . . . . . . 86
6.5 Success rates of our method (Modular-SBD) and baselines. For modular
frameworks (Modular and Modular-SBD), we shift the learning curves rightwards by
the total number of environment steps the agent takes to learn the primitive skills
(0.9 M, 1.2 M, and 2.0 M, respectively). Our method substantially improves
learning speed and performance on JACO PICK-PUSH-PLACE and ANT PUSH. The
shaded areas represent the standard deviation of results from six different seeds.
The curves are smoothed using moving average over 10 runs. . . . . . . . . . . . 88
6.6 Learning curves of our method with different diversity coefficients λ_2 on ANT PUSH. 92
6.7 Success rates of our method with different T_low coefficients on Jaco environments. . 92
7.1 Intelligent agents can use a large library of acquired skills when learning new
tasks. Instead of exploring skills uniformly, they can leverage priors over skills as
guidance, based, e.g., on the current environment state. Such priors capture which
skills are promising to explore, like moving a kettle when it is already grasped,
and which are less likely to lead to task success, like attempting to open an already
opened microwave. In this chapter, we propose to jointly learn an embedding space
of skills and a prior over skills from unstructured data to accelerate the learning of
new tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.2 Deep latent variable model for joint learning of skill embedding and skill prior.
Given a state-action trajectory from the dataset, the skill encoder maps the
action sequence to a posterior distribution q(z|a_i) over latent skill embeddings.
The action trajectory gets reconstructed by passing a sample from the posterior
through the skill decoder. The skill prior maps the current environment state to
a prior distribution p_a(z|s_1) over skill embeddings. Colorful arrows indicate the
propagation of gradients from reconstruction, regularization and prior training
objectives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.3 For each environment we collect a diverse dataset from a wide range of training
tasks (examples on top) and test skill transfer to more complex target tasks
(bottom), in which the agent needs to: navigate a maze (left), stack as many blocks
as possible (middle) and manipulate a kitchen setup to reach a target configuration
(right). All tasks require the execution of complex, long-horizon behaviors and
need to be learned from sparse rewards. . . . . . . . . . . . . . . . . . . . . . . . 104
7.4 Downstream task learning curves for our method and all comparisons. Both the learned
skill embedding and the skill prior are essential for downstream task performance:
single-action priors without temporal abstraction (Flat Prior) and learned skills
without skill prior (SSP w/o Prior) fail to converge to good performance. Shaded
areas represent standard deviation across three seeds. . . . . . . . . . . . . . . . . 106
7.5 Exploration behavior of our method vs. alternative transfer approaches on the
downstream maze task vs. random action sampling. Through learned skill
embeddings and skill priors our method can explore the environment more widely.
We visualize positions of the agent during 1M steps of exploration rollouts in blue
and mark episode start and goal positions in green and red respectively. . . . . . . 107
7.6 Ablation analysis of skill horizon and skill space dimensionality on block stacking
task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.1 We leverage large, task-agnostic datasets collected across many different tasks
for efficient demonstration-guided reinforcement learning by (1) acquiring a rich
motor skill repertoire from such offline data and (2) understanding and imitating
the demonstrations based on the skill repertoire. . . . . . . . . . . . . . . . . . . 110
8.2 Our approach, SkiLD, combines task-agnostic experience and task-specific
demonstrations to efficiently learn target tasks in three steps: (1) extract skill
representation from task-agnostic offline data, (2) learn task-agnostic skill prior
from task-agnostic data and task-specific skill posterior from demonstrations,
and (3) learn a high-level skill policy for the target task using prior knowledge
from both task-agnostic offline data and task-specific demonstrations. Left: Skill
embedding model with skill extractor (yellow) and closed-loop skill policy (blue).
Middle: Training of skill prior (green) from task-agnostic data and skill posterior
(purple) from demonstrations. Right: Training of high-level skill policy (red) on a
downstream task using the pre-trained skill representation and regularization via
the skill prior and posterior, mediated by the demonstration discriminator D(s). . . 113
8.3 We leverage prior experience data D and demonstration data D_demo. Our policy
is guided by the task-specific skill posterior q_ζ(z|s) within the support of the
demonstrations (green) and by the task-agnostic skill prior p_a(z|s) otherwise (red).
The agent also receives a reward bonus for reaching states in the demonstration
support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
8.4 Left: Test environments, top to bottom: 2D maze navigation, robotic kitchen
manipulation and robotic office cleaning. Right: Target task performance vs
environment steps. By using task-agnostic experience, our approach more
efficiently leverages the demonstrations than prior demonstration-guided RL
approaches across all tasks. The comparison to SPiRL shows that demonstrations
improve efficiency even if the agent has access to large amounts of prior experience. 119
8.5 Visualization of our approach on the maze navigation task (visualization states
collected by rolling out the skill prior). Left: the given demonstration trajectories;
Middle left: output of the demonstration discriminator D(s) (the greener, the
higher the predicted probability of a state to be within demonstration support,
red indicates low probability). Middle right: policy divergences to the skill
posterior and Right: divergence to the skill prior (blue indicates small and red
high divergence). The discriminator accurately infers the demonstration support,
the policy successfully follows the skill posterior only within the demonstration
support and the skill prior otherwise. . . . . . . . . . . . . . . . . . . . . . . . . . 121
8.6 Ablation studies. We test the performance of SkiLD for different sizes of the
demonstration dataset |D_demo| on the maze navigation task (left) and ablate the
components of our objective on the kitchen manipulation task (right). . . . . . . . 122
8.7 Left: Robustness to partial demonstrations. SkiLD can leverage partial
demonstrations by seamlessly integrating task-agnostic and task-specific datasets
(see Section 8.4.4). Right: Analysis of data vs. task alignment. The benefit of using
demonstrations in addition to prior experience diminishes if the prior experience is
closely aligned with the target task (solid), but gains are high when data and task
are not well-aligned (dashed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.1 Intelligent agents can use their internal models to imagine potential futures for
planning. Instead of planning out every primitive action (black arrows in a), they
aggregate action sequences into skills (red and blue arrows in b). Further, instead
of simulating each low-level step, they can leap directly to the predicted outcomes
of executing skills in sequence (red and blue arrows in c), which leads to better
long-term prediction and planning compared to predicting step-by-step (blurriness
of images represents the level of error accumulation in prediction). With the
ability to plan over skills, the agent can accurately imagine and efficiently plan for
long-horizon tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.2 Our approach, SkiMo, combines model-based RL and skill-based RL for sample
efficient learning of long-horizon tasks. SkiMo consists of two phases: (1) learn a
skill dynamics model and a skill repertoire from offline task-agnostic data, and
(2) learn a high-level policy for the downstream task by leveraging the learned
model and skills. We omit the encoded latent state h in the figure and directly write
observation s for clarity, but most modules take the encoded state h as input. . . . 130
9.3 We evaluate our method on four long-horizon, sparse reward tasks. (a) The green
point mass navigates the maze to reach the goal (red). (b, c) The robot arm in the
kitchen must complete four tasks in the correct order (Microwave - Kettle - Bottom
Burner - Light and Microwave - Light - Slide Cabinet - Hinge Cabinet). (d) The
robot arm needs to compose skills learned from extremely task-agnostic data (Open
Drawer - Turn on Lightbulb - Move Slider Left - Turn on LED). . . . . . . . . . . 136
9.4 Learning curves of our method and baselines. All averaged over 5 random seeds. . 138
9.5 Learning curves of our method and ablated models. All averaged over 5 random
seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.1 (a) Environment overview. (b) Camera pose. (c) Marker and obstacle positions. . . 166
A.2 For easy and accurate marker placement, all 3D models have AprilTag placeholders
on their surfaces with corresponding AprilTag IDs. . . . . . . . . . . . . . . . . . 167
A.3 Furniture 3D models. (left) IKEA model furniture, (middle) 3D furniture model,
(right) 3D printed furniture model. . . . . . . . . . . . . . . . . . . . . . . . . . . 173
A.4 Assembly procedures: (a) lamp (b) square table (c) desk . . . . . . . . . . . . . 174
A.4 Assembly procedures: (d) drawer (e) cabinet . . . . . . . . . . . . . . . . . . . . 175
A.4 Assembly procedures: (f) round table (g) stool . . . . . . . . . . . . . . . . . . . 176
A.4 Assembly procedures: (h) chair . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.5 Blueprints of furniture models. The leftmost column shows the final configuration
(top left) and how furniture parts are assembled (bottom right). The rest of the
columns illustrate dimensions of all furniture parts. . . . . . . . . . . . . . . . . . 178
A.6 Sequences of policy evaluation. In desk assembly, (a) the BC policy can reach
the tabletop but fails to grasp it, while (b) the IQL policy can reach the tabletop,
pick it up, and place it at the corner; after placing it, however, the robot picks
the tabletop up again and drops it. In round table assembly, (c) the BC agent can reach
the round table and push the obstacle, while (d) the IQL agent can also grasp the
leg but fails to insert it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.1 Success count curves of our model with exponentially discounted proximity
function and linearly discounted proximity function over training on Obstacle
course (left) and Repetitive catching (right). . . . . . . . . . . . . . . . . . . . . . 182
E.1 Image-based state representation for maze (left) and block stacking (right)
environment (downsampled to 32× 32 px for policy). . . . . . . . . . . . . . . . 205
E.2 Comparison of policy execution traces on the kitchen environment. Following Fu
et al. (2020), the agent’s task is to (1) open the microwave, (2) move the kettle
backwards, (3) turn on the burner and (4) switch on the light. Red frames mark
the completion of subtasks. Our skill-prior guided agent (top) is able to complete
all four subtasks. In contrast, the agent using a flat single-action prior (middle)
only learns to solve two subtasks, but lacks temporal abstraction and hence fails
to solve the complete long-horizon task. The skill-space policy without prior
guidance (bottom) cannot efficiently explore the skill space and gets stuck in a
local optimum in which it solves only a single subtask. Best viewed electronically
and zoomed in. For videos, see: clvrai.com/spirl. . . . . . . . . . . . . . . . 207
E.3 Results for state-conditioned skill decoder network. left: Exploration visualization
as in Fig. 7.5. Even with state-conditioned skill decoder, exploration without
skill prior is not able to explore a large fraction of the maze. In contrast, skills
sampled from the learned skill prior lead to wide-ranging exploration when
using the state-conditioned decoder. right: Downstream learning performance
of our approach and skill-space policy w/o learned skill prior: w/ vs. w/o
state-conditioning for skill decoder. Only guidance through the learned skill prior
enables learning success. State-conditioned skill-decoder can make the downstream
learning problem more challenging, leading to lower performance (”ours” vs. ”ours
w/ state cond.”). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
E.4 Ablation of prior regularization during downstream RL training. Initializing the
high-level policy with the learned prior but finetuning with conventional SAC is not
sufficient to learn the task well. . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
E.5 Ablation of prior initialization. Initializing the downstream task policy with
the prior network improves training stability and convergence speed. However,
the ”w/o Init” runs demonstrate that the tasks can also be learned with prior
regularization only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
E.6 Success rate on maze environment with sub-optimal training data. Our approach,
using a prior learned from sub-optimal data generated with the BC policy, is able to
reliably learn to reach the goal while the baseline that does not use the learned prior
fails. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
E.7 Reuse of one learned skill prior for multiple downstream tasks. We train a single
skill embedding and skill prior model and then use it to guide downstream RL for
multiple tasks. Left: We test prior reuse on three different maze navigation tasks in
the form of different goals that need to be reached. (1)-(3): Agent rollouts during
training; the darker the rollout paths, the later during training they were collected.
The same prior enables efficient exploration for all three tasks, while allowing for
convergence to task-specific policies that reach each of the goals. . . . . . . . . . . 211
F.1 Qualitative results for GAIL+RL on maze navigation. Even though it makes
progress towards the goal (red), it fails to ever obtain the sparse goal reaching
reward. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
F.2 We compare the exploration behavior in the maze. We roll out skills sampled from
SPiRL’s task-agnostic skill prior (left) and our task-specific skill posterior (right)
and find that the latter leads to more targeted exploration towards the goal (red). . 217
F.3 Office cleanup task. The robot agent needs to place three randomly sampled
objects (1-7) inside randomly sampled containers (a-c). During task-agnostic data
collection we apply random noise to the initial position of the objects. . . . . . . . 218
F.4 Comparison of our closed-loop skill representation with the open-loop
representation of Pertsch et al. (2020a). Top: Skill prior rollouts for 100 k steps
in the maze environment. Bottom: Subtask success rates for prior rollouts in the
kitchen environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
F.5 Downstream task performance for prior demonstration-guided RL approaches with
combined task-agnostic and task-specific data. All prior approaches are unable to
leverage the task-agnostic data, showing a performance decrease when attempting
to use it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
F.6 Imitation learning performance on maze navigation and kitchen tasks. Compared to
prior imitation learning methods, SkiLD can leverage prior experience to enable
the imitation of complex, long-horizon behaviors. Finetuning the pre-trained
discriminator D(s) further improves performance on more challenging control tasks
like in the kitchen environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
F.7 Subtask transition probabilities in the kitchen environment’s task-agnostic training
dataset from Gupta et al. (2019). Each dataset trajectory consists of four consecutive
subtasks, of which we display three (yellow: first, green: second, grey: third
subtask). The transition probability to the fourth subtask is always near 100 %. In
Section 8.4.5 we test our approach on a target task with good alignment to the
task-agnostic data (Microwave - Kettle - Light Switch - Hinge Cabinet) and a target
task which is mis-aligned to the data (Microwave - Light Switch - Slide Cabinet -
Hinge Cabinet). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
G.1 Ablation analysis on skill horizon H. . . . . . . . . . . . . . . . . . . . . . . . . . 225
G.2 Ablation analysis on planning horizon N. . . . . . . . . . . . . . . . . . . . . . . 226
G.3 Ablation analysis on fine-tuning the model. . . . . . . . . . . . . . . . . . . . . . 226
G.4 Exploration and exploitation behaviors of our method and baseline approaches.
We visualize trajectories in the replay buffer at 1.5M training steps in blue (light
blue for early trajectories and dark blue for recent trajectories). Our method shows
wide coverage of the maze at the early stage of training, and fast convergence to the
solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
G.5 Prediction results of 500 timesteps using a flat single-step model (a) and skill
dynamics model (b), starting from the ground truth initial state and 500 actions (50
skills for SkiMo). The predicted states from the flat model deviate from the ground
truth trajectory quickly while the prediction of our skill dynamics model has little
error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
G.6 Illustration of SkiMo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Abstract
Humans are remarkably efficient at learning new complex long-horizon tasks, such as cooking
and furniture assembly, from only a few trials. To do so, we leverage our prior experience by
building up a rich repertoire of skills and knowledge about the world and similar tasks. This enables
us to efficiently reason, plan, and perform a wide variety of tasks. However, in most reinforcement
learning and imitation learning approaches, every task is learned from scratch, requiring a large
amount of interaction data and limiting scalability to realistic, complex tasks. In this thesis, we
propose novel benchmarks and skill-based learning approaches for scaling robot learning from
simple short-horizon tasks to complex long-horizon tasks faced in our daily lives – consisting of
multiple subtasks and requiring high dexterity skills.
To study the problem of scaling robot learning to complex long-horizon tasks, we first develop
both simulated and real-world “furniture assembly” benchmarks, which require reasoning over long
horizons and dexterous manipulation skills. We then present a series of skill chaining algorithms
that solve such long-horizon tasks by executing a sequence of pre-defined skills, which need to
be smoothly connected, like catching and shooting in basketball. Finally, we extend skill-based
learning to efficiently leverage very diverse skills learned from large-scale task-agnostic agent
experience, allowing for scalable skill reuse for complex tasks.
We hope our benchmarks and skill-based learning approaches can help robot learning researchers
to transition to solving long-horizon tasks, easily compare the performance of their approaches with
prior work, and eventually solve more complicated long-horizon tasks in the real world.
Chapter 1
Introduction
Humans are particularly efficient at learning new complex tasks from only a few trials under
rough guidance, such as a user’s manual, a demonstration, or intermittent feedback. To enable
such efficient learning, we leverage our prior experience by reusing a rich repertoire of skills and
knowledge about the world and relevant tasks that we have built up over time to efficiently reason,
plan, and perform a novel task. How can we build intelligent robots that can learn complex tasks
with the same efficiency?
Recent advances in reinforcement learning and imitation learning have enabled robots to
learn a diverse set of tasks, including grasping (Levine et al., 2016; Kalashnikov et al., 2018;
Brohan et al., 2022), in-hand dexterous manipulation (OpenAI et al., 2020; Akkaya et al., 2019),
and locomotion (Kumar et al., 2021; Miki et al., 2022). However, robot learning research and
benchmarks today are typically confined to simple short-horizon tasks. In contrast, tasks in our daily
lives are much more complicated – consisting of multiple subtasks and requiring high dexterity
skills.
Let’s imagine assembling an IKEA chair. Humans can perform this complex task by identifying
necessary steps (e.g. which parts should be assembled and in what order) and executing manipulation
skills following the plan (e.g. robust grasping, accurate alignment of two parts, screwing). However,
assembling furniture is seemingly impossible without knowing such sophisticated skills in advance, as it is for young children.
Similarly, for robots, learning a task from scratch does not scale well to complex long-horizon
tasks, as they often require a huge amount of experience and are more prone to failure. Thus,
leveraging prior experience is key to extending the range of tasks that robots can learn. One intuitive
way of utilizing prior experience is to acquire a reusable skill set and efficiently harness the skills
for a novel task. This skill-based approach enables robots to solve long-horizon tasks by acting
with skills, temporally extended actions that represent useful behaviors, instead of primitive actions
akin to muscle movements of humans (Sutton et al., 1999; Pastor et al., 2009; Kober et al., 2010;
Mülling et al., 2013; Konidaris and Barto, 2009; Konidaris et al., 2012; Gudimella et al., 2017; Dalal et al., 2021; Brohan et al., 2022). This temporal abstraction of actions allows robots to perform systematic long-range exploration and reduces the number of decisions required for each task, making long-horizon
downstream tasks more tractable (Nachum et al., 2019b).
The furniture assembly example above, along with cooking, driving, surgery, and machine
repair, is just one of the many complex problems that require not only long-term planning but
also the execution of diverse skills. In this thesis, we explore the problem of scaling a robot’s
learning capability to such complex long-horizon tasks with “skills”. The contributions of this
thesis are primarily categorized into two parts: (1) complex long-horizon task benchmarks and
(2) skill-based learning algorithms. More specifically, we first propose novel benchmarks for
complex long-horizon robotic manipulation (Part I), and then we develop long-horizon task learning
algorithms by composing a set of pre-defined skills (Part II) or continuous skills learned from
large-scale data (Part III).
1.1 Long-Horizon Task Benchmarks: Furniture Assembly in Simulation and
Real World
Benchmarks with the right vision and difficulty have played a key role in the recent advances in
machine learning, e.g., ImageNet (Deng et al., 2009) for computer vision and OpenAI Gym (Brock-
man et al., 2016) for reinforcement learning. Recently, many simulated (Zhu et al., 2020; Gupta
et al., 2019; James et al., 2020; Yu et al., 2019; Xiang et al., 2020; Mees et al., 2022) and real-
world (Yang et al., 2019; Lee et al., 2021a; Bauer et al., 2022b) benchmarks have been introduced
for robotic manipulation. The simulated benchmarks are easy to access and fast to verify ideas with
approximated simulation, while the real-world benchmarks ensure that algorithms can directly work
on real robots. Yet, these benchmarks are limited to simple short-horizon tasks, such as picking
and pushing. The lack of a standardized benchmark for long-horizon tasks is a major bottleneck in
advancing robot learning toward solving complex long-horizon tasks. To this end, we propose to
use furniture assembly as a benchmarking task for long-horizon manipulation.
In Chapter 2, we first develop the simulated benchmark, IKEA Furniture Assembly Environment,
equipped with realistic rendering, a variety of furniture models, and scripted expert policies. This
work was published previously as Lee et al. (2021b).
In Chapter 3, we develop the “Real-world” Furniture Assembly Benchmark to illustrate chal-
lenges of real-world robot learning. It aims to measure an algorithm’s capability to learn realistic
long-horizon tasks requiring diverse skills. It lowers the entry barriers of real-world experimentation
with an easily replicable setup of 3D-printed furniture pieces and pre-collected teleoperation data.
We hope robot learning researchers can leverage these benchmarks to identify challenges in
learning realistic long-term tasks, easily compare the performance of their approaches over prior
work, and eventually solve more complicated long-horizon tasks in the real world.
1.2 Skill Chaining for Solving Complex Long-Horizon Tasks
In our benchmarks above, we have shown that learning a complex long-horizon task from scratch
is not efficient and often infeasible due to its high computational costs and exploration burdens.
To extend the robot learning capability from primitive and short-horizon skills to complex and
long-horizon tasks, skill chaining approaches (Konidaris and Barto, 2009; Konidaris et al., 2012;
Clegg et al., 2018) decompose a whole task into smaller chunks, learn a policy for each subtask (or
reuse previously learned skills), and sequentially execute the policies to accomplish the entire task.
However, in robot manipulation, the seemingly trivial execution of a skill chain is actually a central challenge of skill composition. This is because complex interactions between a robot and objects can lead to a wide range of robot and object configurations, which the following skill cannot feasibly cover. Thus, naively executing one policy after another would fail whenever a policy encounters a starting state never seen during its training.
We address this problem with three different approaches based on target applications: (1) learning a transition policy that brings an agent to one of the starting states of the next skill in Chapter 4, (2) widening skills' starting state distributions while preventing the terminal state distributions from becoming too large in Chapter 5, and (3) leveraging modulated skills for smooth coordination between simultaneous skills as well as consecutive skills in Chapter 6. All these approaches effectively bridge the gap between one skill's terminating state and a state from which the next skill can initiate, especially when executing a long chain of skills. These works were published previously as
Lee et al. (2019b), Lee et al. (2021c), and Lee et al. (2020), respectively. With the ability of skill
composition, Lee et al. (2021c) shows the first successful solution to our simulated IKEA furniture
assembly benchmark, whereas prior skill chaining approaches fail.
1.3 Accelerating Reinforcement Learning with Learned Skills
Robust chaining of independent skills can achieve the desired behavior when guided by a skill ordering or a program. However, such guidance is not always available. This leads to another key algorithmic challenge: finding the correct skill ordering. It becomes harder as the number of skills grows, since an agent must explore the full set of available skills during downstream learning.
Yet, intuitively, not all skills should be explored with equal probability; instead, information about
the current environment state can hint at which skills are promising to explore. Inspired by this
intuition, we have proposed multiple approaches that leverage large, task-agnostic experience to
guide policy learning over skills.
In Chapter 7, we present Skill Prior Reinforcement Learning (SPiRL), which leverages a prior
over skills learned from a task-agnostic agent experience dataset. The learned skill prior effectively
guides exploration on a new downstream task and demonstrates that it is essential for effectively
using a rich skill set. This work appeared previously as Pertsch et al. (2020a).
In Chapter 8, we further apply the same intuition to demonstration-guided RL, which efficiently
leverages the provided demonstrations by following the demonstrated skills instead of the primitive
actions. This shows substantial performance improvements over prior demonstration-guided RL
approaches. This work was previously published as Pertsch et al. (2021).
In Chapter 9, to achieve further sample efficiency gain for long-horizon tasks, we propose
another way of utilizing large task-agnostic data, learning a skill-level dynamics model that enables
a robot to plan over skills. This combines the benefits of skill-based RL and model-based RL and
shows that skill-level planning allows for accurate long-term planning, resulting in significantly
better sample efficiency over typical model-based RL as well as skill-based RL. This work was
published previously as Shi et al. (2022).
Finally, we conclude by discussing open challenges in learning complex long-horizon tasks and
some future directions in Chapter 10.
Part I
Long-Horizon Task Benchmarks: Furniture Assembly in
Simulation and Real World
Chapter 2
IKEA Furniture Assembly Simulator
Figure 2.1: The IKEA Furniture Assembly Environment is a furniture assembly simulator with 60
furniture models, 6 robots, and customizable background, lighting, and textures.
2.1 Introduction
The ability to perform complex manipulation of physical objects is necessary to use tools,
build structures, and ultimately interact with the world in a meaningful way. Simulated bench-
marks (Brockman et al., 2016; James et al., 2020; Yu et al., 2019; Zhu et al., 2020; Tassa et al., 2020)
have played a key role in the recent advances in reinforcement learning (RL) and imitation learning
(IL) for robotic manipulation. However, these benchmarks are limited to simple, short-horizon tasks,
such as picking, pushing, and peg inserting. This lack of a “standardized” simulated environment
for long-term tasks is the main bottleneck of advancing RL and IL techniques toward solving
long-horizon tasks. To this end, we introduce the IKEA Furniture Assembly Environment as a new
benchmark for complex long-horizon robot manipulation. We believe our environment can play a
key role in advancing RL and IL methods on long-horizon robotics tasks.
Even for humans, furniture assembly is not simple. Imagine that you are building IKEA
furniture. First of all, it is not trivial to figure out how to assemble pieces into the final configuration.
Specifically, it is not apparent from pieces on the floor which parts to choose for attachment and
in what order. Hence, we need to dissect the final configuration and deduce the sequence of tasks
necessary to build the furniture. Moreover, connecting two parts requires complicated manipulation
skills, such as accurate alignment of two attaching points and sophisticated force control to firmly
attach them. Therefore, furniture assembly is a comprehensive robotics task (Niekum et al., 2013;
Knepper et al., 2013; Suárez-Ruiz et al., 2018) requiring reliable 3D perception, high-level planning,
and sophisticated control, making it a suitable benchmark for robot learning algorithms.
The IKEA Furniture Assembly Environment is a visually realistic environment that simulates
the task of furniture assembly as a step toward autonomous long-horizon manipulation. The
environment simulates 60 furniture models and supports various agents including Sawyer, Baxter,
Jaco, Panda, and Fetch robots. To provide further diversity, the environment supports randomization
in physics, lighting, textures, and more factors of variations. Lighting conditions, textures, and
backgrounds can be customized or randomized using Unity (Juliani et al., 2018) as illustrated in
Figure 2.1.
A variety of research problems could be investigated with this new environment that broadly
span perception, planning, and control. For perception, the environment could be used to solve
3D object detection, pose estimation, instance segmentation, scene graph generation, and shape
estimation problems. For robotic planning and control, the environment provides dense reward
functions and automated demonstration generation, making it suitable for testing RL and IL on
long-horizon complex manipulation tasks. Further, diverse shapes of furniture pieces can be used to
learn and evaluate generalizable and transferable skills.
In this chapter, our aim is to provide a testbed and benchmark for complex long-horizon robotic
manipulation tasks that allows researchers to study reinforcement learning and imitation learning.
For benchmarking RL and IL algorithms, we provide a shaped dense reward and demonstrations for
8 selected furniture models. Our empirical evaluation of RL and IL methods on the 8 benchmark
furniture models shows that, despite recent progress in RL and IL, current methods are not able to
solve complex robotic manipulation tasks, providing ample opportunities for future research.
2.2 IKEA Furniture Assembly Environment
To advance reinforcement learning and imitation learning from simple, videogame-esque tasks
to complex and realistic tasks, the IKEA furniture assembly environment features long-horizon
and hierarchical tasks, realistic rendering, and domain randomization. Furniture assembly can be accomplished by repeating (1) selecting two compatible parts, (2) grasping the part(s), (3) aligning the attachable connectors, and (4) firmly attaching them, as illustrated in Figure 2.2, until all parts are assembled. Thus, furniture assembly has a long horizon (200-1500 steps) compared to
prior manipulation benchmarks, e.g., 280 for Franka Kitchen (Gupta et al., 2019).
(a) Selecting (b) Grasping (c) Aligning (d) Attaching
Figure 2.2: Our environment simulates robotic furniture assembly: a robot (a) decides which parts
to assemble, (b) grasps the desired parts, and (c) aligns and (d) attaches the grasped parts. This
procedure is repeated until all parts are assembled.
Figure 2.3: Diverse furniture models included in our environment. Each furniture model is created following IKEA's user's manual. Different parts are colored differently for visualization.
2.2.1 Environment Development
As a challenging robotic manipulation testbed including long-horizon tasks and 3D alignment
of various shapes of objects, we propose a novel 3D environment that supports assembling IKEA
furniture. For fast and accurate physics simulation, we use MuJoCo (Todorov et al., 2012) as the
underlying physics engine. Specifically, our environment is built on top of Robosuite (Zhu et al.,
2020), which features modularized API design and diverse robot controllers. We use the Unity
game engine (Juliani et al., 2018) and the MuJoCo-Unity interface from DoorGym (Urakami et al.,
2019) for realistic and configurable 3D rendering. Our environment follows the OpenAI Gym
interface (Brockman et al., 2016) for easy integration with existing RL and IL libraries.
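For illustration, a Gym-style interaction loop with such an environment might look like the sketch below; the environment ID is a placeholder rather than the exact registered name, so please refer to the released code for the actual entry points and options.

    # Minimal sketch of a Gym-style interaction loop (the environment ID below is a
    # placeholder; see the released code for the actual environment names and options).
    import gym

    env = gym.make("FurnitureAssembly-v0")  # hypothetical ID
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # random arm, gripper, and connect action
        obs, reward, done, info = env.step(action)
    env.close()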
2.2.2 Furniture Models
Our environment simulates the assembly of 60 different furniture models, as shown in Figure 2.3. Each furniture model is created following IKEA's official user's manuals. Due to the
limitation of physics simulation and difficulty in modeling, some complex structures are simplified
and small details, such as carving and screws, are omitted.
Although screwing is an important aspect of many robotic assembly tasks, conducting realistic
screwing with available grippers is infeasible without developing end-effectors or robot arms
dedicated to screwing. Moreover, accurate physical simulation of screws is not supported by the
MuJoCo physics engine, and most physics engines in general. Currently, the environment contains
an abstracted screwing phase where the robot must select the connect action to attach the parts. We
plan to replace the abstracted connect action with a realistic screwing action as future work.
Parts: Based on IKEA’s official user’s manuals, the furniture models are created using the 3D
modeling tool Rhino, and each furniture part is converted to a separate 3D mesh file in STL format, which is used for both physics simulation in MuJoCo and rendering in Unity. For the robots
with fixed bases (i.e. limited reach), our environment supports downscaling of furniture models to
fit in the workspace.
Connectors: Physics simulators show limited accuracy for sophisticated screwing and peg
inserting interactions between attaching points of furniture parts. Therefore, we abstract connection
information and attachment points between two furniture parts, such as screws and holes, with
connectors. Connectors are located on the furniture parts and serve as areas of attachment by
representing the correspondence between attaching points and relative 3D pose.
Specifically, we represent connectors with their ID, position, and orientation relative to the part. The IDs are used to verify compatibility between two connectors. For example, a pair of connectors on part A and part B have IDs A-B and B-A, respectively. Identical parts can be used interchangeably
since they share the same connector IDs. For furniture pieces with symmetric shapes, acceptable
attachment angles are also specified in connectors (e.g. {0°, 90°, 180°, 270°} for rectangular table
legs and [0°, 360°] for bars). All furniture parts and connectors are manually annotated.
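To make the annotation concrete, the following sketch shows one way such connector metadata could be represented; the field names and compatibility rule are illustrative assumptions, not the environment's exact schema.

    # Illustrative representation of a connector annotation (field names are
    # assumptions for exposition, not the environment's actual schema).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Connector:
        conn_id: str                 # e.g. "A-B" pairs with "B-A"
        part_name: str               # furniture part the connector belongs to
        position: List[float]        # 3D position relative to the part
        orientation: List[float]     # quaternion relative to the part
        allowed_angles: List[float]  # e.g. [0, 90, 180, 270] for a rectangular table leg

    def is_compatible(a: Connector, b: Connector) -> bool:
        # Two connectors match if their IDs are mirrored, e.g. "A-B" and "B-A".
        return a.conn_id.split("-") == b.conn_id.split("-")[::-1]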
(a) Sawyer (b) Baxter (c) Jaco
(d) Panda (e) Fetch (f) Cursor
Figure 2.4: The Sawyer, Baxter, Jaco, Panda, and Fetch mobile manipulator robots are supported.
Various action spaces are provided including end-effector space, joint velocity, and joint torque
controls.
2.2.3 Furniture Assembly Simulation
In our environment, robotic arms can move around the environment and interact with furniture
parts. While assembling two furniture parts requires screwing in the real world, we abstract it
with a connect action. Therefore, in addition to actions for robot control (e.g. joint torque or
3D end-effector pose), the action space has one additional action dimension, connect, which
attaches two attachable furniture parts (i.e. two corresponding connectors are compatible and
aligned). Specifically, we examine the Euclidean distance between the connectors d_L2(x_A, x_B), the cosine similarities between the connector up vectors s_cos(u_A, u_B) and forward vectors s_cos(f_A, f_B), and the projections of the up vectors onto the segment between the two connectors, s_cos(u_A, x_B - x_A) and s_cos(u_B, x_A - x_B), where x, u, and f denote the 3D coordinate, up vector, and forward vector of a connector, respectively. By default, two connectors are attachable when they are within 2 cm and the cosine similarities are larger than 0.99.
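A minimal sketch of this attachability check, assuming the listed cosine similarities are all compared against the same default threshold of 0.99 (the function and variable names are illustrative):

    # Sketch of the connector attachability check described above.
    import numpy as np

    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def is_attachable(x_a, u_a, f_a, x_b, u_b, f_b, dist_thresh=0.02, cos_thresh=0.99):
        # x: 3D coordinate, u: up vector, f: forward vector of each connector
        close_enough = np.linalg.norm(x_a - x_b) < dist_thresh
        up_aligned = cos_sim(u_a, u_b) > cos_thresh
        fwd_aligned = cos_sim(f_a, f_b) > cos_thresh
        # The up vectors should also point along the segment between the connectors.
        seg_aligned = (cos_sim(u_a, x_b - x_a) > cos_thresh and
                       cos_sim(u_b, x_a - x_b) > cos_thresh)
        return close_enough and up_aligned and fwd_aligned and seg_aligned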
2.2.4 Agents
Our environment supports a variety of robots for interacting with furniture: Rethink Sawyer,
Rethink Baxter, Kinova Jaco, Franka Emika Panda, Fetch mobile manipulator, and Cursor (abstract
agent), as illustrated in Figure 2.4.
Observation space: The observation space is fully configurable to fit a variety of problem set-
tings. It can consist of agent state (e.g. joint positions, velocities, and contact forces), environmental
state (e.g. 3D coordinates and orientations of furniture parts), and camera observations. Aside
from third-person view RGB images, the environment also supports object segmentation masks
(pixel-wise dense annotation) and depth camera observations as shown in Figure 2.5. More cameras
can be added to collect images from diverse views, such as egocentric view and wrist view.
Action space: The action space consists of arm movement, gripper control, and connect action
but varies by control modes: 6D end-effector space control with inverse kinematics, joint velocity
control, and joint torque control. The Fetch mobile manipulator has two additional actions for
moving and turning its base.
For RL and IL benchmark, we use the Rethink Sawyer robot with joint velocity control as an
agent. The observation consists of the robot state (joint angles and velocities), end-effector state
(end-effector position, rotation, and velocity), and object state (3D coordinates and orientations of
all parts).
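As an illustration, the action for the benchmark agent can be thought of as a flat vector of joint velocity commands followed by gripper and connect commands; the exact dimensionality and value conventions below are assumptions for exposition, not the environment's precise specification.

    # Illustrative action layout for joint velocity control with a 7-DoF arm
    # (dimension ordering and value conventions are assumptions for exposition).
    import numpy as np

    action = np.zeros(9)
    action[:7] = 0.1   # 7 joint velocity commands
    action[7] = -1.0   # gripper command (e.g. close)
    action[8] = 1.0    # connect: attempt to attach two aligned connectors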
2.2.5 Reward Function
For all furniture models, our environment provides a sparse reward for every successful pick,
attachment, and full assembly. However, learning from such a sparse reward signal is not practical
(a) RGB (b) Seg. map (c) Depth (d) Goal
Figure 2.5: An RGB image, pixel-wise part segmentation map, depth map, and goal image are
available from cameras.
with existing RL methods. To ease the challenge of learning from a sparse reward, we provide a
well-shaped dense reward function for 8 furniture models based on manually annotated way-points.
The dense reward function is a multi-phase reward defined with respect to a pair of furniture
parts to attach (e.g. a table leg and a table top) and the corresponding manually annotated way-points,
such as a target gripping point g for each part. The reward function for a pair of furniture parts
consists of eight different phases as follows:
1. Initial phase: The robot has to reconfigure its arm pose to an appropriate pose p_init for grasping a new furniture part. The reward is proportional to the negative distance between the end-effector p_eff and p_init.
2. Reach phase: The robot reaches above a target furniture part. The reward is proportional to the negative distance between the end-effector p_eff and a point p_reach 5 cm above the gripping point g.
3. Lower phase: The gripper is lowered onto the target part. The phase reward is proportional to the negative distance between p_eff and the target gripping point g.
4. Grasp phase: The robot learns to grasp the target part. The reward is given if the gripper contacts the part, and is proportional to the force exerted by the grippers.
5. Lift phase: The robot lifts the gripped part up to p_lift. The reward is proportional to the negative distance between the gripped part p_part and the target point p_lift.
6. Align phase: The robot roughly rotates the gripped part before moving it. The reward is proportional to the cosine similarity between the up vectors u_A, u_B and forward vectors f_A, f_B of the two connectors.
7. Move phase: The robot moves and aligns the gripped part to another part. The reward is proportional to the negative distance between the connector of the gripped part and a point p_move_to 5 cm above the connector of the other part, and the cosine similarity between the two connector up vectors, u_A and u_B, and forward vectors, f_A and f_B. Note that all connectors are labeled with aligned up vectors and forward vectors.
8. Fine-grained move phase: The robot must finely align the two connectors until they are attached. The same reward is used as in the move phase with a higher coefficient, making the reward more sensitive to small changes. In addition, we provide a reward based on the activation of the connect action a[connect] when the parts are attachable.
Upon every phase completion, a completion reward is given to encourage the agent to move on to the next phase. In addition to the phase-based reward, a control penalty ||a||_2, a stable wrist pose reward s_cos(u_wrist, (0, 0, -1)) and s_cos(f_wrist, g_1 - g_2), and a gripping reward (i.e. open the gripper only in the initial, reach, and lower phases) are provided throughout the episode. If the robot releases the gripped object, the episode terminates early with a negative reward.
Phase completion is determined based on whether the robot and part configurations satisfy a distance and angle constraint with respect to a target configuration. When all phases are completed, the phase is reset to the initial phase. This process is repeated until all parts are attached.
Our dense reward function is verified for full assembly on a simple three-block assembly task. Note that these phases are only used for computing rewards and are not available to the policy. Please refer to our code for further details. [1]
[1] https://github.com/clvrai/furniture
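To make the structure of this reward concrete, the sketch below shows how the first few phase rewards could be computed; the phase names match the list above, while the function signature, coefficients, and completion bonus are illustrative rather than the environment's exact implementation.

    # Simplified sketch of the multi-phase dense reward (first three phases only).
    # Way-points (p_init, p_reach, g) come from the manual annotations; the exact
    # coefficients and completion check are defined in the released code.
    import numpy as np

    def phase_reward(phase, p_eff, p_init, p_reach, g, completion_bonus=1.0, completed=False):
        if phase == "initial":
            r = -np.linalg.norm(p_eff - p_init)
        elif phase == "reach":
            r = -np.linalg.norm(p_eff - p_reach)  # p_reach is 5 cm above the gripping point g
        elif phase == "lower":
            r = -np.linalg.norm(p_eff - g)
        else:
            raise NotImplementedError(phase)
        if completed:
            r += completion_bonus  # encourage advancing to the next phase
        return r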
Scenes
Lighting
Material
Figure 2.6: Examples of diverse visual properties. The first row shows different scenes. The second
row shows different lighting configurations, such as soft light, ambient light, and low-visibility. The
final row shows variations in furniture textures, such as wood, aluminum, and glass.
2.2.6 Demonstration Collection and Generation
Our environment provides multiple ways to collect demonstrations. Human teleoperation can be used to collect demonstrations with end-effector space control using a keyboard, a 3D space mouse, or HTC Vive VR controllers.
Scripted policies are provided to generate demonstrations for the benchmark furniture models
with the Sawyer robot to assist in evaluating IL algorithms. The hard-coded policies operate by
iterating through the predefined set of phases described in Section 2.2.5 and manually annotated
way-points. Each phase consists of either minimizing the Euclidean distance between relevant
points or maximizing cosine similarity between relevant directional vectors. The corresponding
actions can be computed using positional or rotational differences, and then performed using the
end-effector space control.
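As a rough sketch of how one scripted phase can be turned into actions (the gain, clipping, and action layout here are illustrative assumptions, not the exact values used to generate the demonstrations):

    # Sketch of one scripted step: move the end-effector toward the current way-point
    # using a proportional positional difference, clipped to a small step size.
    import numpy as np

    def scripted_step(p_eff, waypoint, gain=5.0, max_delta=0.05):
        pos_delta = np.clip(gain * (waypoint - p_eff), -max_delta, max_delta)
        rot_delta = np.zeros(3)    # rotational difference handled analogously
        gripper = np.array([1.0])  # keep the gripper closed in this phase (example)
        connect = np.array([0.0])  # trigger only once the connectors are aligned
        return np.concatenate([pos_delta, rot_delta, gripper, connect])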
2.2.7 Domain Randomization
To promote generalization of the learned skills, the environment should provide enough vari-
ability in furniture compositions, visual appearances, object shapes, and physical properties. Our
environment provides variability in furniture compositions and object shapes by containing a diverse
set of furniture including chairs, tables, cabinets, bookcases, desks, shelves, and TV units (see
Figure 2.3). For a given furniture, the environment can randomly initialize the pose of each furniture
part in the scene to increase the diversity of the initial states. Additionally, the environment can
randomize physical properties, such as gravity, scale, density, and friction, to add more variation in
the task. The environment also supports diverse visual properties, such as lighting, backgrounds,
and textures, as illustrated in Figure 2.6.
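For illustration, such randomization options could be grouped into a configuration like the sketch below; the key names and value ranges are placeholders, not the environment's actual arguments.

    # Illustrative domain randomization configuration (key names and ranges are
    # placeholders for exposition, not the environment's actual arguments).
    randomization_config = {
        "furniture_init_pos_noise": 0.02,   # meters, in the (x, y)-plane
        "furniture_init_rot_noise": 3.0,    # degrees
        "lighting": "random",               # e.g. soft, ambient, low-visibility
        "texture": "random",                # e.g. wood, aluminum, glass
        "background": "random",
        "physics": {"gravity_scale": (0.9, 1.1), "friction_scale": (0.8, 1.2)},
    }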
2.2.8 Assembly Difficulty by Furniture
The difficulty of a furniture model largely depends on the shape of the furniture pieces. For
example, toy table is more difficult to grasp due to its cylindrical legs, while the rectangular legs of bench bjursta are easier to grasp. Chairs are generally more difficult to assemble due to their
irregular shapes (e.g. chair seat and chair back). Bookcases are generally more difficult than chairs
due to the wide and thin pieces, which are difficult to grasp. We rank the furniture models by
assembly difficulty with respect to the number and shape of the furniture parts.
2.3 Experiments
To benchmark reinforcement learning (RL) and imitation learning (IL) methods on complex
long-horizon manipulation tasks, we selected 8 furniture models as a benchmark and annotated
way-points for the dense reward and demonstration generation. The first four benchmark furniture
models require peg-insertion-like attachment: three blocks, toy table, table dockstra, and table bjorkudden, where a peg-like part must be precisely inserted into a recessed receptacle of another part. The remaining four furniture models (bench bjursta, chair agne, chair ingolf, and table lack) do not have recessed receptacles, which enables the parts to snap together like magnets.
The benchmark is conducted with the Sawyer robotic arm and joint velocity control, and the
furniture models are sampled mostly from tables and chairs since other furniture models (e.g.
bookcases and cabinets) often consist of multiple thin boards, which require two grippers to grasp.
With the goal of providing a challenging benchmark to compare performance in learning long-
horizon manipulation tasks, we design an evaluation protocol for RL and IL methods in Section 2.3.1.
Then, we evaluate RL (Section 2.3.2) and IL (Section 2.3.3) methods on the 8 benchmark furniture
models.
2.3.1 Evaluation Protocol
Evaluation metric: The most basic metric for IL and RL algorithms is the successful assembly
of a furniture. However, to provide more fine-grained progress measure than a simple success and
failure signal, we record the number of successful phase completions in an episode (as defined
in Section 2.2.5). The trained models are evaluated on the hold-out episodes (first 50 episodes
with the hold-out random seed 0). Following this evaluation metric, we benchmark RL and IL
algorithms and report average episodic phase completions for each benchmark furniture over 3
different training seeds.
Experimental setup: For both RL and IL benchmarks, we use joint velocity control. During
demonstration collection, training, and testing, each furniture part is initialized with randomness of [-2 cm, 2 cm] and [-3°, 3°] in the (x, y)-plane. We sequentially initialize furniture parts to avoid collisions between parts. After all furniture parts are initialized, the robot is initialized to a predetermined initial position with added random noise. We set the episode length to 200 steps per attachment, e.g., 600 steps for 4 parts. Two parts are attachable if the corresponding connectors are within 2 cm and their
forward and up vector cosine similarities are larger than 0.99.
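A minimal sketch of this evaluation protocol, assuming the environment reports the number of completed phases through its info dictionary (the "phase" key and seeding call are assumptions):

    # Sketch of the evaluation protocol: average phase completions over 50 hold-out
    # episodes with the hold-out random seed 0.
    def evaluate(env, policy, n_episodes=50, seed=0):
        env.seed(seed)
        completions = []
        for _ in range(n_episodes):
            obs, done, phases = env.reset(), False, 0
            while not done:
                obs, reward, done, info = env.step(policy(obs))
                phases = info.get("phase", phases)  # number of completed phases so far
            completions.append(phases)
        return sum(completions) / len(completions)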
Reinforcement learning: For RL benchmark, each algorithm is trained with the shaped dense
reward described in Section 2.2.5. To compare learning performance (i.e. sample efficiency),
[Figure 2.7 panels: (a) three blocks, (b) toy table, (c) table bjorkudden, (d) table dockstra, (e) bench bjursta, (f) chair agne, (g) chair ingolf, (h) table lack. Each panel plots the number of completed phases (y-axis, 0-10) against environment steps in millions (x-axis, 0-20) for PPO, SAC, GAIL, GAIL+PPO, and BC.]
Figure 2.7: Training curves of RL and IL algorithms on the benchmark furniture models. Successful
grasping corresponds with completion of four phases, and successful assembly of one pair of parts
corresponds with completion of eight phases. All algorithms were trained for about 24 hours. The
dashed SAC line shows its final performance after 2M steps.
we compare the learning curves of evaluation results every 10K environment steps up to 2M for
off-policy methods and 20M for on-policy methods. The SAC off-policy update is more expensive
in terms of wall-clock time – SAC took 24 hours to reach 2M steps while PPO could reach 50M
steps in 24 hours.
Imitation learning: For IL benchmark, we collect 100 demonstrations for each of 8 benchmark
furniture models with a hard-coded assembly policy described in Section 2.2.6. Each demonstration
is around 200-1500 steps long due to the long-horizon nature of the task.
2.3.2 Reinforcement Learning Benchmark
We evaluate two state-of-the-art model-free RL algorithms, Soft Actor-Critic (SAC (Haarnoja et al., 2018b)) and Proximal Policy Optimization (PPO (Schulman et al., 2017)), on the benchmark
furniture models with the dense reward function. The number of completed phases is plotted over
training environment steps in Figure 2.7. Completing four phases means the agent has grasped the
Furniture model PPO SAC GAIL GAIL+PPO BC
three blocks 4.96 3.88 1.06 4.89 1.00
toy table 3.34 3.40 1.16 5.88 1.00
table bjorkudden 5.66 3.66 1.01 5.54 1.00
table dockstra 5.24 3.61 1.01 6.21 1.00
bench bjursta 3.66 3.23 1.01 3.33 1.00
chair agne 6.26 3.60 1.06 6.38 1.02
chair ingolf 4.91 7.54 1.19 7.33 1.00
table lack 6.66 4.10 1.00 5.88 1.00
Table 2.1: Average number of phase completions of RL and IL algorithms. SAC, PPO, and
GAIL+PPO learn to pick an object (corresponding to 4 phase completions) and succeed in attaching two parts in chair ingolf (corresponding to 8 phase completions), while BC and GAIL rarely pass the reach phase. The first four models require peg-insertion style attachments, which are more difficult
to accomplish than the magnet style attachment of the remaining four.
target part and accomplishing eight phases denotes the successful assembly of one pair of parts.
Note that the y-axis can go up to 8× #parts, not 8.
Both SAC and PPO learn to pick the first furniture part in all tasks. In many furniture models,
SAC struggles at completing the lift phase since the agent must firmly hold the gripped part and
avoid collision while lifting. PPO successfully attaches the first pair of parts in three blocks,
table bjorkudden, and chair ingolf, in which the two parts to be attached are initialized close to each other. However, initializing the arm pose for the next step is challenging since the agent pushes the gripped part away while reconfiguring the arm pose.
SAC is more sample efficient than PPO, reaching comparable performance levels millions of steps earlier. In most cases, however, PPO learns to complete more phases through extensive exploration with diverse rollouts collected from 16 parallel workers, at the cost of more samples.
This implies that learning to assemble even one pair of parts poses a challenging exploration
problem.
While SAC and PPO can attach the first pair of parts in some furniture models, no policy is able
to advance to the next pair of parts to attach. The best run is able to assemble the first pair of parts
and pick the next part to be assembled. This highlights the difficulty of the long-horizon nature in
furniture assembly, and shows ample room to improve RL algorithms for complex long-horizon
manipulation tasks with our environment.
2.3.3 Imitation Learning Benchmark
For the IL benchmark, we evaluate Behavioral Cloning (BC (Pomerleau, 1991)) and Generative
Adversarial Imitation Learning (GAIL (Ho and Ermon, 2016)) with joint velocity control. In
addition, we also evaluate the demonstration-guided RL approach, GAIL+PPO (Kang et al., 2018),
which learns from the weighted sum of GAIL and task rewards.
The quantitative results in Figure 2.7 show that BC and GAIL fail to accomplish even the reach phase as the policy suffers from compounding errors. On the other hand, GAIL+PPO successfully learns to pick an object and attach it in three blocks and chair ingolf.
This result demonstrates the importance of having access to the shaped reward function to learn the
assembly behavior.
We additionally compare with BC policies on end-effector space control, which is more robust to
action errors than joint velocity control. The end-effector BC policies demonstrate some successful
picking behaviors, while the joint velocity control BC policies failed completely. This suggests that
BC on joint velocity control is more challenging, and therefore it is prone to failure and requires
more demonstration data. Please refer to our website for further results and analysis.
2.3.4 Implementation Details
For all methods, we use a 3-layer MLP with 256 hidden units for the policy, value, and discriminator networks. The policy and value networks use ReLU nonlinearities while the discriminator networks use the tanh activation function. The output of the policy is squashed into [-1, 1] using tanh. The discount factor is 0.99. We use the Adam optimizer (Kingma and Ba, 2015) with momentum (0.9, 0.999). PPO is trained with an entropy coefficient of 4 × 10^-3 and rollouts of 8192 transitions collected from 16 parallel workers. We normalize observations using the moving average and standard deviation. SAC first collects 1000 transitions to fill the replay buffer, and the policy and critic networks are updated with a learning rate of 3 × 10^-4. BC is trained with batch size 128 for 500 epochs with a learning rate of 3 × 10^-4. GAIL is trained using PPO without changing any hyperparameters. For the reward, we use the vanilla GAIL reward r_GAIL = -log(1 - D(s, a)). GAIL+PPO trains a policy using the combined reward r = 0.8 · r_GAIL + 0.2 · r_ENV.
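For concreteness, the GAIL and GAIL+PPO rewards above can be computed as in the following sketch, assuming the discriminator output D(s, a) lies in (0, 1):

    # Sketch of the GAIL and GAIL+PPO rewards used in the benchmark.
    import numpy as np

    def gail_reward(d_sa):
        # d_sa: discriminator output D(s, a) in (0, 1)
        return -np.log(1.0 - d_sa + 1e-8)

    def gail_ppo_reward(d_sa, env_reward):
        # Weighted combination of the GAIL reward and the environment (task) reward.
        return 0.8 * gail_reward(d_sa) + 0.2 * env_reward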
2.4 Related Work
Deep reinforcement learning (RL) has made rapid progress with the advent of standardized,
simulated environments. Most progress has been made in game environments, such as Atari (Belle-
mare et al., 2013), VizDoom (Kempka et al., 2016), and StarCraft2 (Vinyals et al., 2017). Recently,
many simulated environments have been introduced in diverse applications, such as autonomous
driving (Shah et al., 2017; Dosovitskiy et al., 2017), indoor navigation (Kolve et al., 2017; Xia et al.,
2018; Puig et al., 2018; Manolis Savva* et al., 2019), continuous control (Brockman et al., 2016;
Zhu et al., 2020; Tassa et al., 2020; Lee et al., 2019b), and recommendation systems (Rohde et al.,
2018).
In robotic manipulation, most existing environments have been focused on short-term manipula-
tion tasks, such as picking and placing (Brockman et al., 2016; Lee et al., 2019a), in-hand dexterous
manipulation (OpenAI et al., 2020; Rajeswaran et al., 2018), door opening (Urakami et al., 2019),
and peg insertion (Chebotar et al., 2019; Yamada et al., 2020). Recent advancements in simulators,
such as Robosuite (Zhu et al., 2020), RLBench (James et al., 2020), PyRoboLearn (Delhaisse
et al., 2019), and Meta-World (Yu et al., 2019), take a step towards a comprehensive manipulation
simulator by offering a variety of manipulation tasks. However, these tasks, which consist of lifting,
stacking, and picking and placing, are still limited to primitive skills.
Composite manipulation tasks, e.g., block stacking (Duan et al., 2017), ball serving (Lee et al.,
2019b), and kitchen tasks (Gupta et al., 2019; Pertsch et al., 2020a), have been proposed but limited
to little variation in shapes and physical properties of objects. In contrast, we simulate a complex
manipulation task, furniture assembly, to evaluate long-term planning and generalizable skills for
various shapes and materials of objects. While robotic furniture assembly has been studied in
instrumented and constrained settings (Niekum et al., 2013; Knepper et al., 2013; Suárez-Ruiz
et al., 2018), our simulated environment increases accessibility of the furniture assembly task to the
community.
Moreover, recent progress in vision-based RL methods (Yarats et al., 2021; Laskin et al., 2020;
Stooke et al., 2021) shows comparable sample efficiency to state-based RL. To test the scalability of
such vision-based RL approaches to real-world scenarios, our environment can serve as a visually
realistic benchmark, which provides photorealistic rendering with diverse and configurable textures,
backgrounds, and lighting.
2.5 Discussion
In this chapter, we proposed the IKEA Furniture Assembly Environment as a novel benchmark
for testing complex long-horizon manipulation tasks. Furniture assembly is a challenging task
requiring 3D perception, high-level planning, and sophisticated low-level control. The experimental
results show that current RL and IL methods cannot solve the task due to its long horizon and
complex manipulation. Therefore, it is well suited as a benchmark for robot learning algorithms
aiming to tackle complex long-horizon manipulation tasks. An important future direction is to
enable sim-to-real transfer by improving realistic screwing interaction, accurate 3D modeling, and
robot calibration. Moreover, supporting multi-arm or multi-robot collaboration can be another
future work to overcome insufficient payloads of each robot. Finally, benchmarking visuomotor
control and multi-task learning are the natural followups for the proposed environment.
Chapter 3
Real-World Furniture Assembly Environment
3.1 Introduction
Figure 3.1: Furniture assembly benchmark.
How can we enable robots to perform complex,
long-horizon tasks? Currently, deep reinforcement
learning (RL) and imitation learning (IL) present
promising frameworks for learning to solve impres-
sive robotic manipulation tasks (Levine et al., 2016;
Suárez-Ruiz and Pham, 2016; Rajeswaran et al.,
2018; Jain et al., 2019; Jang et al., 2021; Ha and
Song, 2021). Yet, the learned robot behaviors are
limited to simple tasks, such as grasping, pushing,
or placing objects. Scaling the learning capability
of robots to more complex long-horizon real-world
tasks, such as furniture assembly and cooking, is a long-standing goal for robotics. This goal calls
for complex, long-horizon robotic manipulation benchmarks more challenging than existing ones
which only test picking-and-placing (Mandlekar et al., 2018; Yang et al., 2019) or stacking (Lee
et al., 2021a). To this end, we propose to focus on furniture assembly as the next milestone for
complex, long-horizon robotic manipulation.
Furniture assembly is a difficult, long-horizon manipulation task; solving it requires addressing many of the core challenges in robotic manipulation. It has a hierarchical task structure, requiring
reasoning over long horizons (e.g. deciding in which order and how to assemble various furniture
pieces). In addition, connecting different furniture pieces together requires complex path planning
and dexterous manipulation skills (e.g. robust grasping, accurate alignment of attachment points,
and deliberate force control for inserting and screwing), as illustrated in Figure 3.2.
The previous chapter introduced the IKEA Furniture Assembly Environment for robotic furniture
assembly with a variety of furniture models and realistic rendering. This simulator successfully
served as a testbed for robot learning algorithms for complex long-horizon manipulation (Yamada
et al., 2020; Nasiriany et al., 2021). However, this simulated environment still has a large simulation-
to-real gap (Jakobi et al., 1995) both in visual perception and physics simulation (Zhang et al., 2021)
that hampers the ability to train a policy which can directly be transferred to work in the real world.
More importantly, this environment abstracts away a complicated but essential physical interaction
for assembly—screwing—mainly due to the limitations of its physics engine (Todorov et al., 2012).
To incorporate these challenges into real-world robot learning, we propose the “Real-world”
Furniture Assembly Benchmark. It aims to measure a robot learning algorithm’s learning capa-
bility on realistic long-horizon tasks with physical robots while offering low barriers of entry for
performing real-world robot experiments. In this chapter, our main contributions are threefold:
• We provide a testbed and benchmark for real-world complex long-horizon robotic ma-
nipulation tasks, which allows robot learning researchers to study RL and IL in the real
world.
• We ensure reproducibility and reduce required human effort by incorporating 3D printed
furniture pieces, including a comprehensive environment setup guide, and providing a scripted
environment reset function.
Diverse dexterous skills
Grasping Inserting Screwing Flipping Collision avoidance
Figure 3.2: Dexterous skills required for furniture assembly. Note that screwing is difficult with one
arm, therefore we fix a long plastic bar to the table to screw objects against.
• We collect 40+ hours of teleoperation demonstrations that make real-world experiments feasible, and we benchmark imitation learning and offline reinforcement learning methods with the collected data.
We believe that this benchmark will enable robot learning researchers to identify challenges in
solving long-term tasks, easily compare the performance of their approaches over prior work, and
eventually solve more complicated long-horizon tasks in the real world.
3.2 Related Work
Deep reinforcement learning (RL) has made rapid progress with the advent of standardized,
simulated benchmarks, such as Atari (Bellemare et al., 2013) and continuous control (Brockman
et al., 2016; Tassa et al., 2018) benchmarks. In robotic manipulation, most existing simulated
environments have been focused on tasks like picking and placing (Brockman et al., 2016; Lee
et al., 2019a), in-hand dexterous manipulation (OpenAI et al., 2020; Rajeswaran et al., 2018), door
opening (Urakami et al., 2019), peg insertion (Chebotar et al., 2019; Yamada et al., 2020), and
screwing (Narang et al., 2022). However, these tasks, which consist of lifting, stacking, and picking
and placing, are limited to simpler primitive skills.
Composite manipulation tasks such as block stacking (Duan et al., 2017), ball serving (Lee et al.,
2019b), kitchen tasks (Gupta et al., 2019; Pertsch et al., 2020a), and table-top manipulation (Mees
et al., 2022) have been proposed but are limited to small variations in shapes and physical properties
of objects. In contrast, the IKEA furniture assembly environment (Lee et al., 2021b) simulates a
complex manipulation task, furniture assembly, to evaluate long-term planning and generalization
of learned skills across various shapes and object materials.
Although simulated benchmarks are easily accessible and useful for quickly verifying ideas, they
do not ensure that algorithms can directly work on real robots due to the complexity and stochastic
nature of the real environment. Real-world robot learning involves many challenges that are not considered in simulated environments, such as collisions, noisy and delayed sensory inputs, automatic environment resetting, and providing reward signals. Our benchmark provides a real-world
environment that resolves or minimizes all of these problems. Some recent robotic manipulation
benchmarks have introduced reproducible benchmarking environments with an affordable robot
and easy-to-make cage (Yang et al., 2019), 3D printed objects (Lee et al., 2021a), and cloud-based evaluation (Bauer et al., 2022a). However, these works focus on tasks much simpler than furniture
assembly.
Robotic furniture assembly has been studied in instrumented and constrained settings (Niekum
et al., 2013; Knepper et al., 2013; Suárez-Ruiz et al., 2018). Yet, these works are not suitable
for benchmarking as they use their own furniture models and software. NIST Assembly Task
Board (Kimble et al., 2020) provides a suite of complex manufacturing tasks and evaluation metrics,
but it mainly focuses on providing task specifications and lacks a standard experimental setup,
such as a robot, hardware, and observation space. Thus, in this chapter, we propose a reproducible,
easy-to-use real-world robot learning benchmark. In addition, we provide diverse furniture models
to evaluate different types of manipulation skills.
3.3 Furniture Assembly Benchmark
To advance robot learning research from simple and artificial tasks to complex and realistic
tasks, we need a benchmark that helps researchers identify challenges in learning such complex,
long-horizon real-world tasks, and allows them to easily evaluate and compare various approaches.
To this end, we design our benchmark with three key points of focus: (1) to have meaningful, long-
horizon, complex, real-world tasks—furniture assembly; (2) to be reproducible across institutions
and easy to set up; (3) to minimize the amount of human effort and intervention required to run
experiments and train robotic agents. We detail these three points in the following sections below.
3.3.1 Furniture Assembly as a Complex Long-Horizon Task
Even for humans, furniture assembly is not a simple task. We need to understand how pieces are
assembled into the final configuration and make a plan to assemble them in the right order. Assembly
also involves many complex manipulation skills, such as the accurate alignment of two parts and
sophisticated force control to screw them together. Thus, furniture assembly is a comprehensive
robotic manipulation task requiring high-level long-term planning, sophisticated control, and reliable
3D perception, making it a suitable benchmark for robot learning algorithms (Niekum et al., 2013;
Knepper et al., 2013; Suárez-Ruiz et al., 2018).
The furniture assembly task is long-horizon and hierarchical, as illustrated in Figure 3.2. Furni-
ture assembly can be accomplished by repeating 1-step assembly as follows: (1) selecting two parts
to be assembled, (2) grasping one part [1]
, (3) moving the part towards the other, (4) aligning the at-
tachable points (e.g. a screw and hole), and (5) firmly attaching them via screwing or insertion, until
all parts are assembled. Thus, furniture assembly has a much longer horizon on average (90-360
sec, 450-1800 steps) than prior manipulation benchmarks, e.g., 186 sec for Roboturk (Mandlekar
et al., 2018).
This seemingly simple procedure requires diverse challenging manipulation skills, as illustrated in Figure 3.2. First, grasping a furniture piece for assembly requires a specific grasping pose for the following sub-tasks. It sometimes requires re-orienting the furniture piece both in hand and on the table. Second, assembling complex furniture pieces leads to a collision-prone environment.
[1] In this benchmark, we focus on single-arm manipulation, as this is the most common robot arm setup in research labs. Using one robotic arm results in limited dexterity to handle multiple objects simultaneously or one large object; we overcome this issue by adding an obstacle fixed on the table (see Figure 3.4), which can be used to hold an object, and by slightly modifying the 3D furniture models to suit one-hand manipulation. We leave extending our environment to multi-arm and mobile manipulation for future work.
(a) lamp (b) square table (c) drawer (d) cabinet
(e) round table (f) desk (g) stool (h) chair
Figure 3.3: Furniture models in our benchmark. Each furniture model is designed inspired by IKEA furniture. Due to the limitations imposed by using a single robotic arm, we modify some furniture pieces to be feasible to assemble with one hand.
As furniture assembly proceeds, maneuvering an arm and manipulating furniture parts need more
thorough planning to avoid collisions with irrelevant parts. For example, a robot needs to avoid
colliding with already attached table legs; otherwise, it can move the furniture pieces around.
Lastly, screwing, by itself, is a challenging manipulation skill. Screwing requires precise alignment
between a screw and a hole, followed by repeated rotation of the screw while gently pressing it. Moreover, due to the limited wrist joint rotation, screwing (i.e. rotating a part by 540°) requires more than six 90°-screwing and re-grasping motions.
3.3.2 Reproducible System Design and Benchmark Setup
While bringing real-world complexity into the robot learning community, the benchmark was
also designed to be reproducible and easy-to-use. In this section, we first describe the design
of our tasks and furniture models, and then explain how to establish a consistent, reproducible
experimental setup. We will release the code and tools to reproduce the environment setup upon
paper publication.
3.3.2.1 Furniture Models
Our environment features 8 different furniture models, each of which is modeled after an existing piece of IKEA furniture, as shown in Figure 3.3. Due to the limitations of having one robotic arm, we modify some furniture pieces so that a robot can assemble them with one hand. To make this benchmark easy to use and reproducible, we use 3D-printed furniture models. More details about the 3D models and their assembly procedures can be found in the appendix, Section B. [2]
Assembling our furniture models requires multiple challenging manipulation skills–especially
screwing. Screwing with one hand is not trivial as the screwing motion for assembly generally
requires two hands to perform. Thus, we fix a long bar obstacle to the table to replace the other
hand, as illustrated in Figure 3.2. The robot can use this obstacle to prevent parts from rotating
while screwing. Furthermore, each furniture model has its own assembly challenges as explained
below:
• lamp: The robot needs to screw in a light bulb inside a lamp base, and then place a lamp hood
on top of the bulb. It must perform sophisticated grasping skills since the bulb can easily slip when being grasped and screwed.
• square table: The robot needs to screw four table legs to a table top. As more legs are
attached, it is easier to collide with already assembled table legs. Thus, careful planning to
avoid collisions is necessary to solve the task.
• desk: This task is similar to square table, but the legs are larger and longer, making it potentially more challenging as the robot needs to orient its hand more precisely before grasping.
[2] 3D printing our furniture models requires a 3D printer with a build capacity larger than 25 cm × 25 cm × 15 cm, and printing one furniture model takes approximately 12 hours.
• drawer: The robot needs to insert two drawer containers into a drawer box. It must carefully
align the rails on the drawer box and the holes in the containers. Once aligned, the robot has to
push the container into the drawer box.
• cabinet: The task consists of a cabinet body, two cabinet doors, and one cabinet top. The
agent must first insert the doors into the bars on each side of the body. Next, it must lock the
top so that the doors do not slide out. This task requires careful motion control to align the doors and slide them onto the bars. Moreover, diverse skills like flipping the cabinet body, pushing the top to screw it in, and grasping the top tightly are also needed to finish the task.
• round table: The robot should handle an easily movable round pole to complete the task. After inserting the pole into the round table top, the robot must screw it into a cross-shaped table base, which requires difficult screwing interactions.
• stool: The stool consists of one round chair seat and three tilted legs. Since the legs are
not upright, the entire robot arm needs to move while rotating its wrist. If one leg is not
fully assembled, it may face inward and prevent the assembly of other legs. Moreover, the
round-shaped seat is slippery when screwing; thus, screwing it requires a stronger pressing force.
• chair: The task consists of one chair seat, one chair back, two table legs, and two chair nuts
(6 parts in total). This task requires an accurate and complex interaction between parts (e.g.
screwing, insertion) and diverse strategies in every stage of assembly.
3.3.2.2 System Design
System overview: Our benchmarking environment consists of one 7-DoF Franka Emika Panda
robot arm with a parallel gripper, three Intel RealSense D435 cameras, and an obstacle fixed on the
table, as shown in Figure 3.4a. The obstacle is introduced to help one hand assembly, especially for
(a) Environment setup (labeled: cameras 1-3, obstacle, furniture, base marker) (b) Camera views (c) Example furniture model (chair)
Figure 3.4: (a) Experiments configuration, (b) visual inputs from three cameras, and (c) furniture
models with AprilTag markers attached.
holding one part while screwing another part (Figure 3.2). The green cloth is used to produce a fixed background during experimentation, and two ring lights are used to brighten the workspace. [3]
Robot specification: A Franka Emika Panda robot arm is placed on the manipulation workspace
and fixed using Franka Emika bench clamp. Due to the height of the bench clamp, the base of
the robot arm is 2 cm above the table. The robot arm is controlled using the Operational-Space
motion Controller (OSC (Khatib, 1987)), which computes joint torques to reach a given end-effector
goal position (3D Cartesian coordinate) and orientation (4D quaternion). As shown in Figure 3.5,
the robot arm takes an end-effector control command at 5 Hz, and the OSC controller produces
joint torques in real time (1000 Hz). The resulting joint torques are damped to prevent unstable
movements due to high DoFs (Kang and Freeman, 1992), and these conservative motions help
maintain easy and safe robot control.
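To illustrate the timing structure, the sketch below shows a 5 Hz command loop around a hypothetical robot interface; the method names are placeholders, and the OSC torque computation at 1000 Hz happens inside the controller.

    # Sketch of the 5 Hz policy loop; the robot interface methods are placeholders.
    import time

    CONTROL_HZ = 5

    def control_loop(robot, policy):
        while True:
            start = time.time()
            obs = robot.get_observation()              # proprioception + camera image
            eef_pose, gripper = policy(obs)            # end-effector goal pose, gripper open/close
            robot.send_eef_command(eef_pose, gripper)  # OSC converts this to torques at 1 kHz
            time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.time() - start)))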
Observations: The observation space for the environment consists of a front-view image and proprioceptive robot state. An Intel RealSense D435 camera produces the front-view observation, and the 1280×720 resolution image is downsampled and cropped to 224×224. The proprioceptive robot state includes the end-effector position, orientation (quaternion), velocity, and gripper width.
[3] We found that AprilTag detection is sensitive to lighting conditions. Thus, brightening the workspace and reducing the effect of shadows is important for optimal AprilTag tracking performance.
Tracking cameras: Real robot experiments require significant human interventions, such as
resetting the environment, evaluating performance (reward), and preventing unsafe robot motions—
one of the major bottlenecks in benchmarking real-world robots. To reduce the human intervention
required for robot experiments, our benchmark automates two major jobs, resetting and reward
assignment, using the AprilTag pose tracker (Olson, 2011).
To achieve robust pose estimation from AprilTag, our benchmark uses three Intel RealSense
D435 cameras. The front-view camera captures a comprehensive view of the scene, while the two
cameras in the back focus more on the workspace (see Figure 3.4b). The 1280 × 720 images are taken
at 30 Hz, and AprilTag returns the detected poses of furniture parts at 10 Hz. To make AprilTag
pose estimation more stable, outliers are filtered out and the estimated poses from the three cameras are
averaged. We pair each furniture part with an AprilTag ID, and all 3D furniture models have
placeholders for the corresponding markers on their surfaces, as shown in Figure 3.4c. Note that
AprilTag pose estimation is only used for the reset and reward functions, and is not provided to the
learning agent.
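The multi-camera fusion step can be sketched as follows. This is a simplified illustration rather than the benchmark's exact implementation; it assumes the per-camera detections have already been transformed into a common frame (e.g. via the base marker) and uses the common sign-align-and-normalize approximation for averaging nearby quaternions:

import numpy as np

def fuse_part_pose(positions, quaternions, max_dev=0.02):
    # positions: list of (3,) detections of one part from up to three cameras (common frame).
    # quaternions: list of (4,) orientations in (x, y, z, w), same order as positions.
    positions = np.asarray(positions, dtype=np.float64)
    quaternions = np.asarray(quaternions, dtype=np.float64)

    # Reject positional outliers: keep detections close to the median estimate.
    median = np.median(positions, axis=0)
    keep = np.linalg.norm(positions - median, axis=1) < max_dev
    if not keep.any():
        return None  # no reliable detection for this frame
    positions, quaternions = positions[keep], quaternions[keep]

    # Align quaternion signs to the first kept detection (q and -q are the same rotation),
    # then average and renormalize; a reasonable approximation for nearby rotations.
    ref = quaternions[0]
    signs = np.sign(np.sum(quaternions * ref, axis=1, keepdims=True))
    signs[signs == 0] = 1.0
    q_mean = (quaternions * signs).mean(axis=0)
    q_mean /= np.linalg.norm(q_mean)

    return positions.mean(axis=0), q_mean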
3.3.3 Reducing Human Effort and Intervention
Finally, in order to make our benchmark easy to use with as little human effort and intervention
as possible, we provide reward functions, automatic reset functions, and a dataset of successful
demonstrations for every assembly task, which we describe below.
3.3.3.1 Reward Function
At a high level, our reward functions identify currently assembled parts based on their relative
poses estimated via AprilTag. When the relative orientation and position are close enough, a +1
reward is given. Thus, the total reward of an episode is the number of successful assemblies of two parts.
For more reward function details, see Appendix, Section A.4.
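In code, this check could look roughly like the sketch below; the tolerances and the quaternion-distance measure are illustrative assumptions rather than the exact values used in Appendix A.4:

import numpy as np

def parts_assembled(rel_pos, rel_quat, target_pos, target_quat,
                    pos_tol=0.01, ang_tol_deg=10.0):
    # rel_pos/rel_quat: current relative pose of part B in part A's frame (from AprilTag).
    # target_pos/target_quat: relative pose the two parts should have when assembled.
    pos_err = np.linalg.norm(np.asarray(rel_pos) - np.asarray(target_pos))
    # Angle between two unit quaternions: theta = 2 * arccos(|<q1, q2>|).
    dot = abs(float(np.dot(rel_quat, target_quat)))
    ang_err = 2.0 * np.degrees(np.arccos(np.clip(dot, -1.0, 1.0)))
    return pos_err < pos_tol and ang_err < ang_tol_deg

def step_reward(pairs, assembled_so_far):
    # pairs: dict mapping a part pair to its (rel_pos, rel_quat, target_pos, target_quat).
    # Give +1 the first time each pair of parts becomes assembled within the episode.
    reward = 0.0
    for pair, poses in pairs.items():
        if pair not in assembled_so_far and parts_assembled(*poses):
            assembled_so_far.add(pair)
            reward += 1.0
    return reward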
Figure 3.5: Overall robot system design. The agent receives a proprioceptive robot state and a
front-view RGB-D image, and takes an action (end-effector goal pose and a binary gripper action)
at a frequency of 5 Hz. For automatic reset and reward functions, we use images collected via three
RGB-D cameras and estimate furniture poses using AprilTag. Note that the estimated poses of
furniture parts are only used for environment reset and reward, not for the learning agent.
3.3.3.2 Reset Function
The automatic reset function is essential for automating RL training as well as the evaluation of
algorithms. The reset function iterates through each part of the furniture model; for each part, the robot
clears out the part's starting location and then places the part back at its proper starting position. For more
details and the reset function pseudocode, see Appendix, Section A.3.
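Without reproducing the pseudocode from Appendix A.3, the described loop can be sketched as follows, where robot.pick_and_place, get_part_pose, and staging_pose are assumed interfaces (poses are treated as (position, quaternion) tuples for illustration):

import numpy as np

def close_to(pose, target, pos_tol=0.03):
    # pose/target: (position(3,), quaternion(4,)) tuples; position-only check, for brevity.
    return np.linalg.norm(np.asarray(pose[0]) - np.asarray(target[0])) < pos_tol

def reset_furniture(robot, parts, start_poses, get_part_pose, staging_pose):
    # parts: list of furniture part ids; start_poses: their nominal starting poses.
    # get_part_pose(part): current AprilTag-estimated pose, or None if not detected.
    # robot.pick_and_place(part, pose) is an assumed pick-and-place primitive.
    for part in parts:
        target = start_poses[part]
        # 1) Clear the target location if another part currently occupies it.
        for other in parts:
            other_pose = get_part_pose(other)
            if other != part and other_pose is not None and close_to(other_pose, target):
                robot.pick_and_place(other, staging_pose)
        # 2) Move the part itself back to its proper starting pose.
        current = get_part_pose(part)
        if current is not None and not close_to(current, target):
            robot.pick_and_place(part, target)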
3.3.3.3 Data Collection using Teleoperation
Our furniture assembly benchmark proposes the suite of challenging long-horizon manipulation
tasks described above (Section 3.3.2.1). Due to complex manipulation interactions and the task
time horizon, learning these tasks from scratch may require millions of real-world interactions
for the state-of-the-art RL methods. To make this benchmark more feasible, we collect a large
demonstration dataset using teleoperation and evaluate IL and offline RL methods trained on this
dataset.
Figure 3.6: Teleoperation setup. To collect demonstrations, we primarily use an Oculus Quest 2
controller to control a 7-DoF robotic arm and use a keyboard to rotate the wrist without moving the
arm.
We collected 40 hours of demonstration data over all eight furniture models using an Oculus
Quest 2 controller and keyboard. Although the VR controller makes data collection easier and faster,
it is not suited for fine-grained control, such as screwing, since the human operator tends to move the
position of the hand while rotating it. Therefore, we use a keyboard to rotate the wrist. We collected
200 demonstrations for three easy models (cabinet, drawer, square table), 100 demonstrations
for two intermediate models (lamp, desk), and 50 demonstrations for three difficult models (chair,
stool, round table). We summarize the statistics of the collected demonstrations in Table 3.4.
For most furniture models, human success rates are very close to 1, and there is between
1 and 10 hours of demonstration data for each model. Each demonstration is around 300-3000 steps
long due to the long-horizon nature of the task.
To scale the dataset, we will provide a data repository for our benchmark such that everyone can
share their collected data. We believe we can achieve a generalizable furniture assembly agent with
data from different subjects and diverse environments (e.g. backgrounds, lighting conditions, noise
in robot and camera setups, errors in 3D-printed models, and types of table surfaces).
Each timestep of a demonstration consists of RGB-D frames from the three cameras, the robot action,
the reward, and AprilTag poses for all furniture parts. In addition, each demonstration includes metadata,
such as the furniture model ID, errors during data collection, and episode success. The statistics of the
collected data can be found in Table 3.4.
To cover diverse lighting conditions, we use two ring lights with three different colors (warm
white, natural white, and cool white) and vary their positions and directions over time. We plan to
extend the diversity of data by collecting demonstrations from multiple different laboratories.
3.4 Experiments
3.4.1 Baselines and Evaluation Protocol
Baselines: We evaluate our benchmark with imitation learning (BC, IRIS) and a state-of-the-
art offline RL method (IQL). BC (Behavioral Cloning (Pomerleau, 1989)) fits a policy to the demonstration
state-action pairs (s, a) with supervised learning. We also test IRIS (Implicit Reinforcement without
Interaction (Mandlekar et al., 2020a)) without reward, which is a hierarchical imitation learning
algorithm. IRIS learns a high-level goal-setting policy and a low-level goal-conditioned policy from
demonstrations. IQL (Implicit Q-Learning (Kostrikov et al., 2022)) has demonstrated state-of-the-
art performance on long-horizon navigation tasks. It performs advantage-weighted BC, where the
advantage partially comes from a value function trained with an expectile regression loss.
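Concretely, using the symbols of Table 3.2 (expectile $\tau$, inverse temperature $\beta$), IQL fits a value function with the asymmetric expectile loss and then extracts the policy by advantage-weighted behavioral cloning (Kostrikov et al., 2022):
\[
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\,|\tau - \mathbb{1}(u < 0)|\,u^2\,\big], \qquad u = Q_{\hat{\theta}}(s,a) - V_\psi(s),
\]
\[
L_\pi(\phi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\exp\!\big(\beta\,(Q_{\hat{\theta}}(s,a) - V_\psi(s))\big)\,\log\pi_\phi(a\,|\,s)\big],
\]
so that actions with larger estimated advantage receive exponentially larger weight in the behavioral cloning objective.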
Evaluation metric: We record the cumulative return obtained in each episode (rewards defined
in Section 3.3.3.1). The trained models are evaluated on 20 episodes and their initial states are set
by the provided reset function (Section 3.3.3.2). Following this evaluation metric, we benchmark
RL and IL algorithms and report average episodic return for each benchmark furniture.
Experimental setup: For both RL and IL benchmarks, we use end-effector position control.
During demonstration collection, training, and testing, each furniture part is initialized with
randomness of [−3 cm, 3 cm] and [−15°, 15°] in the (x, y)-plane. The episode length is set to
5000 timesteps.
3.4.2 Implementation Details
The observation space of our environment consists of the visual input from the front-view
camera and the proprioceptive robot state (end-effector position, orientation, and gripper width). For
efficient visual policy learning, we use the pre-trained visual encoder R3M (Nair et al., 2022),
which is trained on large-scale egocentric human videos. The R3M encoder outputs a feature
vector of size 2048, and we use a 2-layer MLP with 256 hidden units for the policy network. ReLU is
used as the activation function. Furthermore, we normalize the action space to [−1, 1] by dividing the
(x, y, z)-positional actions by 0.1, which is the maximum absolute positional action.
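As a reference, the policy head on top of the frozen R3M features can be sketched as below. The proprioceptive dimensionality and the action layout (3D positional delta, orientation, gripper) are assumptions for illustration, and the hidden sizes follow the text above (Table 3.1 lists a slightly larger variant):

import torch
import torch.nn as nn

MAX_POS_ACTION = 0.1  # meters; (x, y, z) actions are divided by this to lie in [-1, 1]

class PolicyHead(nn.Module):
    def __init__(self, r3m_dim=2048, proprio_dim=8, action_dim=8, hidden=256):
        # proprio_dim and action_dim assume 3 position + 4 quaternion + 1 gripper entries.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(r3m_dim + proprio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),  # regresses normalized actions in [-1, 1]
        )

    def forward(self, r3m_feat, proprio):
        return self.net(torch.cat([r3m_feat, proprio], dim=-1))

def denormalize_action(action):
    # Scale the first three (x, y, z) dimensions back to meters before sending them to OSC.
    action = action.clone()
    action[..., :3] = action[..., :3] * MAX_POS_ACTION
    return action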
Behavioral Cloning (BC (Pomerleau, 1989)) We train the policy with a batch size of 32 for
1,500 epochs. The policy is optimized using the Adam optimizer (Kingma and Ba, 2015) with a
learning rate of 0.001.
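A minimal training loop matching these settings might look like the following; mean-squared-error regression onto demonstration actions is an assumption, since the exact BC loss is not stated here:

import torch
import torch.nn.functional as F

def train_bc(policy, dataloader, epochs=1500, lr=1e-3, device="cuda"):
    # dataloader yields (r3m_feat, proprio, action) mini-batches of size 32.
    policy.to(device)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for r3m_feat, proprio, action in dataloader:
            r3m_feat, proprio, action = (x.to(device) for x in (r3m_feat, proprio, action))
            loss = F.mse_loss(policy(r3m_feat, proprio), action)  # assumed regression loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy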
Implicit Q-Learning (IQL (Kostrikov et al., 2022)) We use the official implementation of IQL
in JAX (Bradbury et al., 2018), with a minor modification to support the R3M visual encoder (Nair et al.,
2022).
Implicit Reinforcement without Interaction (IRIS (Mandlekar et al., 2020a)) Instead of
generating an image goal, we use the 2048-dimensional R3M image feature (Nair et al., 2022) as the
state and goal. In addition, we adapt IRIS to the imitation learning setup, where reward is not
available from the demonstration data.
Hyperparameters for our baseline implementations are listed in Table 3.1 and Table 3.2.
During our initial benchmarking, we found that IL and offline RL approaches fail to learn any
meaningful behaviors. After careful investigation, we found that these poor results come from (1)
the non-unified sign of the quaternion and the small scale of the positional actions (−0.1 to 0.1, in meters),
which make optimization hard; and (2) the high positional and velocity gains for Operational Space
Control (OSC), which lead to shaky and rapid robot motions that make our dataset violate safety constraints.
Table 3.1: BC hyperparameters.
Hyperparameter Value
Learning rate 2e-3
Learning rate decay Step decay
Learning rate decay factor 0.5
# Mini-batches 512
# Hidden units (512, 256, 256)
Activation ReLU
Image encoder R3M
Action normalization True
Table 3.2: IQL hyperparameters.
Hyperparameter Value
Learning rate 3e-4
Learning rate decay Cosine decay
# Mini-batches 256
Policy # Hidden units (512, 256, 256)
Q/V # Hidden units (512, 256, 256)
Activation ReLU
Image encoder R3M
Reward discount factor (γ) 0.996
Expectile (τ) 0.8
Inverse temperature (β) 0.5
Action normalization True
Even with the engineering efforts described above, we found that the robot arm often stops moving
in the middle of an episode and fails the task. This happens because teleoperators frequently stop
moving while collecting data, either to re-orient the VR controller when the robot pose and controller
pose differ or to accurately control the robot for a fine-grained interaction, such as inserting.
We correct this issue by post-processing demonstrations to remove non-moving transitions and by
adding noise to observations to make the robot move when it gets stuck. This change significantly
improves the performance of both BC and IQL, as presented in Section 3.4.3.
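The post-processing can be sketched as below; the motion threshold and the noise scale are illustrative assumptions:

import numpy as np

def filter_non_moving(observations, actions, min_delta=1e-3):
    # Drop transitions where the end-effector barely moves (teleoperator paused).
    # observations: (T, D) array whose first three entries are the end-effector position.
    keep = [0]
    for t in range(1, len(observations)):
        if np.linalg.norm(observations[t, :3] - observations[keep[-1], :3]) > min_delta:
            keep.append(t)
    return observations[keep], actions[keep]

def jitter_observation(obs, noise_std=0.005, rng=np.random):
    # Small Gaussian noise on the observation, so the policy does not learn to stay
    # perfectly still and can escape stuck states at execution time.
    return obs + rng.normal(0.0, noise_std, size=obs.shape)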
3.4.3 One-Leg Assembly Task Results
To verify the tractability of the proposed benchmark, we devise an easier benchmarking task,
one-leg assembly, which is a part of square table consisting of 6 phases: (1) reaching the table top,
(2) grasping it, (3) placing it in the corner, (4) picking up a table leg, (5) inserting the leg into the
hole on the table top, and (6) screwing it in, as illustrated in Appendix, Figure A.4b. The task horizon
is approximately 313 timesteps. We first collected 500 demonstrations from a fixed viewpoint
and then collected an additional 400 demonstrations with diverse camera viewpoints, which helps the
learned policy generalize.
Table 3.3 shows that both BC and IQL perform significantly better than the results in Table 3.4.
Both BC and IQL mostly pick up the table top and move it to the desired position. Moreover, they
often pick up the table leg after moving the table top, which corresponds to phase 4. However, in
most cases, they fail to learn “insertion”.
BC generally works better than IQL with small amounts of data, but it shows limited scalability
since diverse actions from the same state are averaged out and the policy tends to stop moving. On the
other hand, the performance of IQL improves as more data is available. Finally, IQL solves the
one-leg assembly task with 900 demonstrations at a 20% success rate.
Throughout this experiment, we found that the “inserting” interaction is very challenging
as it requires sophisticated control and involves frequent collisions; thus, only IQL with 900
demonstrations succeeds at it. This experiment verifies that our benchmark is tractable, but the
baseline algorithms require a huge amount of data (900+ demonstrations) to solve even a part of a full
furniture assembly task.
Table 3.3: Average completed phases over 10 evaluation runs for variations of the baselines with different
numbers of demonstrations (100, 200, 300, 400, and 900 demonstrations). The numbers in
parentheses are the best completed phase among the 10 evaluation runs.
Viewpoint (# demos)   Fixed (500)   Variable (100)   Variable (200)   Variable (300)   Variable (400)   Mixed (500 + 400)
IQL 1.2 (2) 2 (3) 1.1 (3) 2.6 (4) 3.3 (4) 4.4 (6)
BC 0 (0) 2.8 (4) 3.3 (4) 3.6 (4) 2.8 (4) 3.2 (4)
IRIS 0.5 (2) 2.8 (4) 2.7 (3) 3.4 (4) 3.2 (4) 3.1 (4)
3.4.4 Full Assembly Task Results
We present the full furniture assembly evaluation results in Table 3.4. Overall, neither BC nor IQL
performs well. The learned policies reach the target furniture part in most cases, and often push or
grasp the part, but they mostly fail to achieve further phases.
Qualitatively, the policies learned by the algorithms are unable to consistently reach the first
part, let alone grasp and attach any parts together. This behavior was consistent across many
hyperparameter configurations for both methods, even when using the authors’ original implementation of
IQL.
Table 3.4: Average phases over 10 evaluation runs with improved baselines and more demonstrations
(up to 200 trajectories). *Human scores are reported as success rates.
Furniture model   # parts   # phases   # demos   Avg. length   Total time (min)   Human*   BC   IQL
lamp 3 8 102 368.24 125.20 0.92 1.20 1.40
square table 5 15 200 1046.99 698.00 0.95 1.50 1.60
desk 5 15 102 1044.18 355.02 0.91 1.40 3.00
drawer 3 8 200 319.96 213.30 1.00 1.00 1.00
cabinet 4 14 200 508.46 338.97 0.97 0.50 3.60
round table 3 9 58 657.66 127.14 0.91 1.40 2.70
stool 4 11 62 979.58 202.45 0.84 1.00 1.50
chair 6 18 70 1453.60 339.17 0.76 1.20 2.70
3.5 Discussion
In this chapter, we proposed the real-world IKEA Furniture Assembly Benchmark as a novel
benchmark for testing complex long-horizon manipulation tasks on real robots. To provide a repro-
ducible and easy-to-use benchmarking environment, we release 3D-printable furniture part models,
an experiment setup guide, and automatic reset and reward functions. The benchmarking results show
that common imitation learning algorithms and a state-of-the-art offline reinforcement learning
method struggle to solve furniture assembly, due to its long-horizon nature and the complex
manipulation skills required. Therefore, it is a well-suited benchmark to encourage
robot learning researchers to tackle complex long-horizon manipulation tasks with better sample-
efficiency and demonstration-efficiency. An important future direction is to support multi-arm or
multi-robot collaboration to enable another level of dexterity.
Limitations While the proposed furniture assembly benchmark captures many aspects of
real-world furniture assembly, the 3D furniture models are still tailored to common robotic arms
for research; e.g., all pieces have widths larger than 2 cm for easy grasping, which is larger than
the tiny screws used in real-world IKEA furniture. Furthermore, this benchmark only supports a
single Franka Emika Panda arm for benchmarking furniture assembly. Using only a single hand
limits the dexterity of possible manipulation skills, resulting in the limited set of interactions tested in the
benchmark and the need for the obstacle fixed to the workspace.
Part II
Skill Chaining for Solving Complex Long-Horizon Tasks
Chapter 4
Composing Skills via Transition Policies
4.1 Introduction
While humans are capable of learning complex tasks by reusing previously learned skills,
composing and mastering complex skills is not as trivial as sequentially executing those acquired
skills. Instead, doing so requires smooth transitions between skills since the final pose of one skill may not
be appropriate to initiate the following one. For example, scoring in basketball with a quick shot
after receiving a ball can be decomposed into catching and shooting. However, it is still difficult for
beginners who have learned to catch passes and statically shoot. To master this skill, players must
practice adjusting their footwork and body into a comfortable shooting pose after catching a pass.
Can machines similarly learn new and complex tasks by reusing acquired skills and learning
transitions between them? Learning to perform composite and long-term tasks from scratch
requires extensive exploration and sophisticated reward design, which can introduce undesired
behaviors (Riedmiller et al., 2018). Thus, instead of employing intricate reward functions and
learning from scratch, modular methods sequentially execute acquired skills with a rule-based
meta-policy, enabling machines to solve complicated tasks (Pastor et al., 2009; Mülling et al., 2013;
Andreas et al., 2017). These modular approaches assume that a task can be clearly decomposed into
several subtasks which are smoothly connected to each other. In other words, an ending state of
one subtask falls within the set of starting states, the initiation set, of the next subtask (Sutton et al.,
1999). However, this assumption does not hold in many continuous control problems where a given
skill may be executed from starting states not considered during training or design, and thus fails to
achieve its goal.
Figure 4.1: Concept of a transition policy. Composing complex skills using primitive skills requires
smooth transitions between primitive skills since a following primitive skill might not be robust to
ending states of the previous one. In this example, the ending states (red circles) of the primitive
policy $p_{\text{jump}}$ are not good initial states to execute the following policy $p_{\text{walk}}$. Therefore, executing
$p_{\text{walk}}$ from these states will fail (red arrow). To smoothly connect the two primitive policies, we
propose a transition policy which navigates an agent to suitable initial states for $p_{\text{walk}}$ (dashed
arrow), leading to a successful execution of $p_{\text{walk}}$ (green arrow).
To bridge the gap between skills, we propose a transition policy which learns to smoothly
navigate from an ending state of a skill to suitable initial states of the following skill, as illustrated in
Figure 4.1. However, learning a transition policy between skills without reward shaping is difficult
as the only available learning signal is the sparse reward for the successful execution of the next skill.
Sparse success/failure reward is challenging to learn from due to the temporal credit assignment
problem (Sutton, 1984) and the lack of information from failing trajectories. To alleviate these
problems, we propose a proximity predictor which outputs the proximity to the initiation set of the
next skill and acts as a dense reward function for the transition policy.
The main contributions of this chapter include (1) the concept of learning transition policies to
smoothly connect primitive skills; (2) a novel modular framework with transition policies that is
able to compose complex skills by reusing existing skills; and (3) a joint training algorithm with the
proximity predictor specifically designed for efficiently training transition policies. This framework
is suited for learning complex skills that require sequential execution of acquired primitive skills,
which are common for humans yet relatively unexplored in robot learning. Our experiments on
simulated environments demonstrate that employing transition policies solves complex continuous
control tasks that traditional policy gradient methods struggle with.
4.2 Related Work
Learning continuous control of diverse behaviors in locomotion (Merel et al., 2017; Heess et al.,
2017; Peng et al., 2017b) and robotic manipulation (Ghosh et al., 2018) is an active research area in
reinforcement learning (RL). While some complex tasks can be solved through extensive reward
engineering (Ng et al., 1999), undesired behaviors often emerge (Riedmiller et al., 2018) when
tasks require several different primitive skills. Moreover, training complex skills from scratch is not
computationally practical.
Real-world tasks often require diverse behaviors and longer temporal dependencies. In hierarchi-
cal reinforcement learning, the option framework (Sutton et al., 1999) learns meta actions (options),
a series of primitive actions over a period of time. Typically, a hierarchical reinforcement learning
framework consists of two components: a high-level meta-controller and low-level controllers. A
meta-controller determines the order of subtasks to achieve the final goal and chooses corresponding
low-level controllers that generate a sequence of primitive actions. Unsupervised approaches to
discover meta actions have been proposed (Schmidhuber, 1990; Daniel et al., 2016; Bacon et al.,
2017; Vezhnevets et al., 2017; Dilokthanakul et al., 2019; Levy et al., 2017; Frans et al., 2018;
Co-Reyes et al., 2018; Mao et al., 2018). However, to deal with more complex tasks, additional
supervision signals (Andreas et al., 2017; Merel et al., 2017; Tianmin Shu, 2018) or pre-defined
low-level controllers (Kulkarni et al., 2016; Oh et al., 2017) are required.
To exploit pre-trained modules as low-level controllers, neural module networks (Andreas et al.,
2016) have been proposed, which construct a new network dedicated to a given query using a collec-
tion of reusable modules. In the RL domain, a meta-controller is trained to follow instructions (Oh
et al., 2017) and demonstrations (Xu et al., 2018), and to support multi-level hierarchies (Gudimella
et al., 2017). In the robotics domain, Pastor et al. (2009); Kober et al. (2010); Mülling et al. (2013)
have proposed a modular approach that learns table tennis by selecting appropriate low-level con-
trollers. On the other hand, Andreas et al. (2017); Frans et al. (2018) learn abstract skills while
experiencing a distribution of tasks and then solve a new task with the learned primitive skills.
However, these modular approaches result in undefined behavior when two skills are not smoothly
connected. Our proposed framework aims to bridge this gap by training transition policies in a
model-free manner to navigate the agent from states unseen by the following skill to suitable initial
states.
Figure 4.2: Our modular network augmented with transition policies. To perform a complex task,
our model repeats the following steps: (1) The meta-policy chooses a primitive policy of index c;
(2) The corresponding transition policy helps initiate the chosen primitive policy; (3) The primitive
policy executes the skill; and (4) A success or failure signal for the primitive skill is produced.
Deep RL techniques for continuous control demand dense reward signals; otherwise, they
suffer from long training times. Instead of manual reward shaping for a denser reward, adversarial
reinforcement learning (Ho and Ermon, 2016; Merel et al., 2017; Wang et al., 2017; Bahdanau et al.,
2019) employs a discriminator which learns to judge the state or the policy, and the policy takes the
output of the discriminator as its reward. While those methods assume that ground-truth trajectories
or goal states are given, our method collects both success and failure trajectories online to train
proximity predictors, which provide rewards for transition policies.
4.3 Approach
In this chapter, we address the problem of solving a complex task that requires sequential
composition of primitive skills given only sparse and binary rewards (i.e. subtask completion
rewards). The sequential execution of primitive skills fails when two consecutive skills are not
smoothly connected. We propose a modular framework with transition policies that learn to
transition from one policy to the subsequent policy and, therefore, can exploit the given primitive
skills to compose complex skills. To accelerate the training of transition policies, additional networks,
proximity predictors, are jointly trained to provide proximity rewards as intermediate feedback to the
transition policies. In Section 4.3.2, we describe our framework in detail. Next, in Section 4.3.3,
we elaborate on how transition policies are efficiently trained with the induced proximity reward.
4.3.1 Preliminaries
We formulate our problem as a Markov decision process defined by a tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \rho, \gamma\}$
of states, actions, transition probability, reward, initial state distribution, and discount factor. An
action distribution of an agent is represented as a policy $\pi_\theta(a_t|s_t)$, where $s_t \in \mathcal{S}$ is a state, $a_t \in \mathcal{A}$
is an action at time $t$, and $\theta$ are the parameters of the policy. An initial state $s_0$ is randomly sampled
from $\rho$, and then the agent iteratively takes an action $a_t$ sampled from the policy $\pi_\theta(a_t|s_t)$ and receives
a reward $r_t$ until the episode ends. The performance of the agent is evaluated based on the discounted
return $R = \sum_{t=0}^{T-1} \gamma^t r_t$, where $T$ is the episode horizon.
4.3.2 Modular Framework with Transition Policies
To learn a new task given primitive skills $\{p_1, p_2, \ldots, p_n\}$, we design a modular framework that
consists of the following components: a meta-policy, primitive policies, and transition policies. The
meta-policy chooses a primitive skill $p_c$ to execute at the beginning and whenever the primitive skill
is terminated. Prior to running $p_c$, the transition policy for $p_c$ is executed to bring the current state
to a plausible initial state for $p_c$, and therefore, $p_c$ can be successfully performed. This procedure is
repeated to compose complex skills, as illustrated in Figure 4.2 and Algorithm 8.
We denote the meta-policy as $\pi_{\text{meta}}(p_c|s)$, where $c \in [1, n]$ is a primitive policy index. The
observation of the meta-policy contains the low-level information of primitives and task specifications
indicating high-level goals (e.g. moving direction and target object position). For example, a
walking primitive only takes joint information as observation while the meta-policy additionally
takes the target direction. In this chapter, we use a rule-based meta-policy and focus on transitioning
between consecutive primitive policies.
Once a primitive skill $p_c$ is chosen to be executed, the agent generates an action $a_t \sim \pi_{p_c}(a|s_t)$
based on the current state $s_t$. Note that we do not differentiate state spaces for primitive policies
for simplicity of notation (e.g. the observation of the jumping primitive contains a
distance to a curb while that of the walking primitive only has joint poses and velocities). Every
primitive policy is required to generate a termination signal $\tau_{p_c} \in \{\text{continue}, \text{success}, \text{fail}\}$ to indicate
policy completion and whether it believes the execution is successful or not. While our method is
agnostic to the form of primitive policies (e.g. rule-based, inverse kinematics), we consider the case
of pre-trained neural networks in this chapter.
For smooth transitions between primitive policies, we add a transition policy $\pi_{\phi_c}(a|s)$ before
executing primitive skill $p_c$, which guides an agent to $p_c$'s initiation set, where $\phi_c$ are the parameters
of the transition policy for $p_c$. Note that the transition policy for $p_c$ is shared across different
preceding primitive policies since a successful transition is defined by the success of the following
primitive skill $p_c$. For brevity of notation, we omit the primitive policy index $c$ in the following
equations where unambiguous. The transition policy's state and action spaces are the same as the
primitive policy's. The transition policy also learns a termination signal $\tau_{\text{trans}}$ which indicates
transition termination to successfully initiate $p_c$. Our framework contains one transition policy for
each primitive skill, in total $n$ transition policies $\{\pi_{\phi_1}, \pi_{\phi_2}, \ldots, \pi_{\phi_n}\}$.
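Concretely, the execution loop of Figure 4.2 (detailed in Algorithm 8) can be sketched as follows; the policy and environment interfaces, and the string-valued termination signals, are simplifications for illustration:

def run_episode(env, meta_policy, primitives, transition_policies, max_steps=1000):
    # primitives[c] and transition_policies[c] map a state to (action, termination_signal).
    # env.step is assumed to return only the next state, for brevity.
    state, step = env.reset(), 0
    while step < max_steps:
        c = meta_policy(state)                       # (1) choose the next primitive skill p_c
        while step < max_steps:                      # (2) run the transition policy for p_c
            action, tau_trans = transition_policies[c](state)
            if tau_trans == "terminate":             # ready to initiate p_c
                break
            state, step = env.step(action), step + 1
        while step < max_steps:                      # (3) execute the primitive skill p_c
            action, tau_p = primitives[c](state)
            if tau_p != "continue":                  # (4) success or failure signal emitted
                break
            state, step = env.step(action), step + 1
        if env.task_done():                          # assumed episode-completion check
            return state
    return state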
4.3.3 Training Transition Policies
In our framework, transition policies are trained to make the execution of the corresponding
following primitive policies successful. During rollouts, transition trajectories are collected, and
each trajectory can be naively labeled by the successful execution of its corresponding primitive policy.
Then, transition policies are trained to maximize the average success of the respective primitive
policy. In this scenario, by definition, the only available learning signal for the transition policies is
the sparse, binary reward for the completion of the next task.
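Under this naive labeling, the states visited by a transition policy are simply routed into a success or a failure buffer according to the outcome of the following primitive (these buffers appear in the figure below and serve to train the proximity predictors); a minimal sketch:

def collect_transition_data(transition_states, tau_primitive, success_buffer, failure_buffer):
    # transition_states: states visited while the transition policy was active.
    # tau_primitive: termination signal of the primitive skill executed right afterwards.
    if tau_primitive == "success":
        success_buffer.extend(transition_states)   # states that did lead to the initiation set
    else:
        failure_buffer.extend(transition_states)   # states from failed transition attempts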
[Figure: Training transition policies with proximity predictors. For each primitive (e.g. jumping, walking, crawling), transition trajectories are stored in a success buffer or a failure buffer depending on the outcome of the following primitive, and the proximity predictor provides a proximity reward to the corresponding transition policy.]
AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0m0UI8FLx4r2g9oQ9lsN+3SzSbsToQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IJHCoOt+O4WNza3tneJuaW//4PCofHzSNnGqGW+xWMa6G1DDpVC8hQIl7yaa0yiQvBNMbud+54lrI2L1iNOE+xEdKREKRtFKD8ngelCuuFV3AbJOvJxUIEdzUP7qD2OWRlwhk9SYnucm6GdUo2CSz0r91PCEsgkd8Z6likbc+Nni1Bm5sMqQhLG2pZAs1N8TGY2MmUaB7Ywojs2qNxf/83ophjd+JlSSIldsuShMJcGYzP8mQ6E5Qzm1hDIt7K2EjammDG06JRuCt/ryOmlfVT236t3XKo1aHkcRzuAcLsGDOjTgDprQAgYjeIZXeHOk8+K8Ox/L1oKTz5zCHzifP/5DjYw=
s
AAAB6HicbZBNS8NAEIYn9avWr6pHL4tF8CAlkYIeC148tmA/oA1ls520azebsLsRSugv8OJBEa/+JG/+G7dtDtr6wsLDOzPszBskgmvjut9OYWNza3unuFva2z84PCofn7R1nCqGLRaLWHUDqlFwiS3DjcBuopBGgcBOMLmb1ztPqDSP5YOZJuhHdCR5yBk11mrqQbniVt2FyDp4OVQgV2NQ/uoPY5ZGKA0TVOue5ybGz6gynAmclfqpxoSyCR1hz6KkEWo/Wyw6IxfWGZIwVvZJQxbu74mMRlpPo8B2RtSM9Wptbv5X66UmvPUzLpPUoGTLj8JUEBOT+dVkyBUyI6YWKFPc7krYmCrKjM2mZEPwVk9eh/Z11bPcrFXqV3kcRTiDc7gED26gDvfQgBYwQHiGV3hzHp0X5935WLYWnHzmFP7I+fwB2AmM4Q==
AAAB6HicbZBNS8NAEIYn9avWr6pHL4tF8CAlkYIeC148tmA/oA1ls520azebsLsRSugv8OJBEa/+JG/+G7dtDtr6wsLDOzPszBskgmvjut9OYWNza3unuFva2z84PCofn7R1nCqGLRaLWHUDqlFwiS3DjcBuopBGgcBOMLmb1ztPqDSP5YOZJuhHdCR5yBk11mrqQbniVt2FyDp4OVQgV2NQ/uoPY5ZGKA0TVOue5ybGz6gynAmclfqpxoSyCR1hz6KkEWo/Wyw6IxfWGZIwVvZJQxbu74mMRlpPo8B2RtSM9Wptbv5X66UmvPUzLpPUoGTLj8JUEBOT+dVkyBUyI6YWKFPc7krYmCrKjM2mZEPwVk9eh/Z11bPcrFXqV3kcRTiDc7gED26gDvfQgBYwQHiGV3hzHp0X5935WLYWnHzmFP7I+fwB2AmM4Q==
AAAB6HicbZBNS8NAEIYn9avWr6pHL4tF8CAlkYIeC148tmA/oA1ls520azebsLsRSugv8OJBEa/+JG/+G7dtDtr6wsLDOzPszBskgmvjut9OYWNza3unuFva2z84PCofn7R1nCqGLRaLWHUDqlFwiS3DjcBuopBGgcBOMLmb1ztPqDSP5YOZJuhHdCR5yBk11mrqQbniVt2FyDp4OVQgV2NQ/uoPY5ZGKA0TVOue5ybGz6gynAmclfqpxoSyCR1hz6KkEWo/Wyw6IxfWGZIwVvZJQxbu74mMRlpPo8B2RtSM9Wptbv5X66UmvPUzLpPUoGTLj8JUEBOT+dVkyBUyI6YWKFPc7krYmCrKjM2mZEPwVk9eh/Z11bPcrFXqV3kcRTiDc7gED26gDvfQgBYwQHiGV3hzHp0X5935WLYWnHzmFP7I+fwB2AmM4Q==
AAAB6HicbZBNS8NAEIYn9avWr6pHL4tF8CAlkYIeC148tmA/oA1ls520azebsLsRSugv8OJBEa/+JG/+G7dtDtr6wsLDOzPszBskgmvjut9OYWNza3unuFva2z84PCofn7R1nCqGLRaLWHUDqlFwiS3DjcBuopBGgcBOMLmb1ztPqDSP5YOZJuhHdCR5yBk11mrqQbniVt2FyDp4OVQgV2NQ/uoPY5ZGKA0TVOue5ybGz6gynAmclfqpxoSyCR1hz6KkEWo/Wyw6IxfWGZIwVvZJQxbu74mMRlpPo8B2RtSM9Wptbv5X66UmvPUzLpPUoGTLj8JUEBOT+dVkyBUyI6YWKFPc7krYmCrKjM2mZEPwVk9eh/Z11bPcrFXqV3kcRTiDc7gED26gDvfQgBYwQHiGV3hzHp0X5935WLYWnHzmFP7I+fwB2AmM4Q==
(s,v)
AAAB7HicbZBNSwMxEIZn61etX1WPXoJFqFDKrgh6LHjxWMGthXYp2XS2Dc1mlyRbKKW/wYsHRbz6g7z5b0zbPWjrC4GHd2bIzBumgmvjut9OYWNza3unuFva2z84PCofn7R0kimGPktEotoh1Si4RN9wI7CdKqRxKPApHN3N609jVJon8tFMUgxiOpA84owaa/lVXRtf9soVt+4uRNbBy6ECuZq98le3n7AsRmmYoFp3PDc1wZQqw5nAWambaUwpG9EBdixKGqMOpotlZ+TCOn0SJco+acjC/T0xpbHWkzi0nTE1Q71am5v/1TqZiW6DKZdpZlCy5UdRJohJyPxy0ucKmRETC5QpbnclbEgVZcbmU7IheKsnr0Prqu5ZfriuNGp5HEU4g3Ooggc30IB7aIIPDDg8wyu8OdJ5cd6dj2VrwclnTuGPnM8f3taN/A==
AAAB7HicbZBNSwMxEIZn61etX1WPXoJFqFDKrgh6LHjxWMGthXYp2XS2Dc1mlyRbKKW/wYsHRbz6g7z5b0zbPWjrC4GHd2bIzBumgmvjut9OYWNza3unuFva2z84PCofn7R0kimGPktEotoh1Si4RN9wI7CdKqRxKPApHN3N609jVJon8tFMUgxiOpA84owaa/lVXRtf9soVt+4uRNbBy6ECuZq98le3n7AsRmmYoFp3PDc1wZQqw5nAWambaUwpG9EBdixKGqMOpotlZ+TCOn0SJco+acjC/T0xpbHWkzi0nTE1Q71am5v/1TqZiW6DKZdpZlCy5UdRJohJyPxy0ucKmRETC5QpbnclbEgVZcbmU7IheKsnr0Prqu5ZfriuNGp5HEU4g3Ooggc30IB7aIIPDDg8wyu8OdJ5cd6dj2VrwclnTuGPnM8f3taN/A==
AAAB7HicbZBNSwMxEIZn61etX1WPXoJFqFDKrgh6LHjxWMGthXYp2XS2Dc1mlyRbKKW/wYsHRbz6g7z5b0zbPWjrC4GHd2bIzBumgmvjut9OYWNza3unuFva2z84PCofn7R0kimGPktEotoh1Si4RN9wI7CdKqRxKPApHN3N609jVJon8tFMUgxiOpA84owaa/lVXRtf9soVt+4uRNbBy6ECuZq98le3n7AsRmmYoFp3PDc1wZQqw5nAWambaUwpG9EBdixKGqMOpotlZ+TCOn0SJco+acjC/T0xpbHWkzi0nTE1Q71am5v/1TqZiW6DKZdpZlCy5UdRJohJyPxy0ucKmRETC5QpbnclbEgVZcbmU7IheKsnr0Prqu5ZfriuNGp5HEU4g3Ooggc30IB7aIIPDDg8wyu8OdJ5cd6dj2VrwclnTuGPnM8f3taN/A==
AAAB7HicbZBNSwMxEIZn61etX1WPXoJFqFDKrgh6LHjxWMGthXYp2XS2Dc1mlyRbKKW/wYsHRbz6g7z5b0zbPWjrC4GHd2bIzBumgmvjut9OYWNza3unuFva2z84PCofn7R0kimGPktEotoh1Si4RN9wI7CdKqRxKPApHN3N609jVJon8tFMUgxiOpA84owaa/lVXRtf9soVt+4uRNbBy6ECuZq98le3n7AsRmmYoFp3PDc1wZQqw5nAWambaUwpG9EBdixKGqMOpotlZ+TCOn0SJco+acjC/T0xpbHWkzi0nTE1Q71am5v/1TqZiW6DKZdpZlCy5UdRJohJyPxy0ucKmRETC5QpbnclbEgVZcbmU7IheKsnr0Prqu5ZfriuNGp5HEU4g3Ooggc30IB7aIIPDDg8wyu8OdJ5cd6dj2VrwclnTuGPnM8f3taN/A==
Figure 4.3: Training of transition policies and proximity predictors. After executing a primitive
policy, a previously performed transition trajectory is labeled and added to a replay buffer based on
the execution success. A proximity predictor is trained on states sampled from the two buffers to
output the proximity to the initiation set. The predicted proximity serves as a reward to encourage
the transition policy to move toward good initial states for the corresponding primitive policy.
To alleviate the sparsity of rewards and maximize the objective of moving to viable initial states for the next primitive, we propose a proximity predictor that learns and provides a dense reward, dubbed proximity reward, of how close transition states are to the initiation set of the corresponding primitive p_c, as shown in Figure 4.3. We denote a proximity predictor as P_{ω_c}, which is parameterized by ω_c. We define the proximity of a state as the future discounted proximity, v = δ^step, where step is the number of steps required to reach an initiation set of the following primitive policy. The proximity of a state can also be a linearly discounted function such as v = 1 − δ·step. We refer the readers to the supplementary material for a comparison of the two proximity functions.
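To make this labeling concrete, the following minimal sketch computes discounted proximity labels for the states of a successful transition trajectory, assuming the trajectory's final state lies in the initiation set of the next primitive; the function name and default values are illustrative, not taken from the original implementation.

```python
import numpy as np

def proximity_labels(trajectory_states, delta=0.95, linear=False):
    """Label each state of a *successful* transition trajectory with its proximity
    to the initiation set of the next primitive. The last state is assumed to be
    in the initiation set (proximity 1); earlier states are discounted by the
    number of remaining steps."""
    T = len(trajectory_states) - 1
    labels = []
    for t, s in enumerate(trajectory_states):
        steps = T - t  # steps required to reach the initiation set
        if linear:
            # Linear variant v = 1 - delta * steps; here delta is a per-step decrement.
            v = max(0.0, 1.0 - delta * steps)
        else:
            # Exponential variant v = delta ** steps.
            v = delta ** steps
        labels.append((np.asarray(s), v))
    return labels
```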
The proximity predictor is trained to minimize a mean squared error of proximity prediction:

$$\mathcal{L}_P(\omega, \mathcal{B}_S, \mathcal{B}_F) = \frac{1}{2}\, \mathbb{E}_{(s,v) \sim \mathcal{B}_S}\big[ (P_\omega(s) - v)^2 \big] + \frac{1}{2}\, \mathbb{E}_{s \sim \mathcal{B}_F}\big[ P_\omega(s)^2 \big], \qquad (4.1)$$
where B_S and B_F are collections of states from successful and failed transition trajectories, respectively. To estimate the proximity to an initiation set, B_S contains not only the state that directly leads to the success of the following primitive policy, but also the intermediate states of the successful trajectories together with their proximity values. By minimizing this objective, the proximity predictor learns to predict 1 if a given state is in the initiation set, a value between 0 and 1 if the state eventually leads the agent to a desired initial state, and 0 if the state leads to a failure.
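A minimal PyTorch-style sketch of this objective is given below; the predictor architecture, buffer format, and hyperparameters are illustrative assumptions rather than the implementation used in this chapter.

```python
import torch
import torch.nn as nn

class ProximityPredictor(nn.Module):
    """P_omega(s): maps a state to a proximity value in [0, 1]."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def proximity_loss(P, success_states, success_proximity, failure_states):
    """Eq. (4.1): regress the labeled proximities on the success buffer and
    push predictions on the failure buffer toward zero."""
    success_term = ((P(success_states) - success_proximity) ** 2).mean()
    failure_term = (P(failure_states) ** 2).mean()
    return 0.5 * success_term + 0.5 * failure_term
```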
The goal of a transition policy is to get close to an initiation set, which can be formulated as seeking a state s predicted to be in the initiation set by the proximity predictor (i.e., P_ω(s) is close to 1). To achieve this goal, the transition policy learns to maximize the proximity prediction at the ending state of the transition trajectory, P_ω(s_T). In addition to providing a reward at the end, we also use the increase of predicted proximity to the initiation set, P_ω(s_{t+1}) − P_ω(s_t), at every timestep as a reward, dubbed proximity reward, to create a denser reward. The transition policy is trained to maximize the expected discounted return:

$$R_{\text{trans}}(\phi) = \mathbb{E}_{(s_0, s_1, \dots, s_T) \sim \pi_\phi}\left[ \gamma^T P_\omega(s_T) + \sum_{t=0}^{T-1} \gamma^t \big( P_\omega(s_{t+1}) - P_\omega(s_t) \big) \right]. \qquad (4.2)$$
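Given a trained predictor, the reward terms in (4.2) can be computed from a rollout roughly as in the sketch below; the tensor handling and function name are assumptions.

```python
import torch

def proximity_rewards(P, states):
    """Per-step proximity rewards from Eq. (4.2).
    `states` is the sequence s_0, ..., s_T of one transition trajectory.
    Returns the dense shaping terms P(s_{t+1}) - P(s_t) for t = 0..T-1 and the
    terminal value P(s_T); in the return (4.2) the terminal value is additionally
    weighted by gamma^T by the RL algorithm."""
    with torch.no_grad():
        prox = P(torch.as_tensor(states, dtype=torch.float32))  # shape (T+1,)
    step_rewards = (prox[1:] - prox[:-1]).tolist()
    terminal_bonus = prox[-1].item()
    return step_rewards, terminal_bonus
```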
However, in general skill learning scenarios, ground truth states (B_S and B_F) for training proximity predictors are not available. Hence, the training data for a proximity predictor is obtained online while training its corresponding transition policy. Specifically, we label the states in a transition trajectory as success or failure based on whether the following primitive is successfully executed or not, and add them into the corresponding buffer B_S or B_F, respectively. As stated in Algorithm 7, we train transition policies and proximity predictors by alternating between an Adam (Kingma and Ba, 2015) gradient step on ω to minimize (4.1) with respect to P_ω and a PPO (Schulman et al., 2017) step on φ to maximize (4.2) with respect to π_φ. We refer readers to the supplementary material for further details on training.
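The sketch below illustrates one iteration of this alternating scheme, reusing the helper sketches above. The replay-buffer interface, the rollout inputs, and the `ppo_update` routine are placeholders standing in for the actual components of Algorithm 7, not part of this thesis.

```python
import torch

def joint_training_step(P, optimizer, transition_policy, trajectory, next_primitive_succeeded,
                        buffer_success, buffer_failure, ppo_update, batch_size=128):
    """One iteration of the alternating scheme: label the newest transition trajectory,
    take an Adam step on omega (Eq. 4.1), then a PPO step on phi (Eq. 4.2).
    `buffer_*` and `ppo_update` are assumed interfaces."""
    # Label states by the outcome of the following primitive execution.
    if next_primitive_succeeded:
        for s, v in proximity_labels(trajectory):
            buffer_success.add(s, v)
    else:
        for s in trajectory:
            buffer_failure.add(s)

    # Adam step on omega: minimize the proximity prediction loss (4.1).
    s_succ, v_succ = buffer_success.sample(batch_size)
    s_fail = buffer_failure.sample(batch_size)
    loss = proximity_loss(P, s_succ, v_succ, s_fail)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # PPO step on phi: maximize the proximity return (4.2) with rewards
    # computed from the current predictor.
    step_rewards, terminal_bonus = proximity_rewards(P, trajectory)
    ppo_update(transition_policy, trajectory, step_rewards, terminal_bonus)
```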
In summary, we propose to compose complex skills with transition policies that enable smooth
transition between previously acquired primitive policies. Specifically, we propose to reward
transition policies based on how close the current state is to suitable initial states of the subsequent
policy (i.e. initiation set). To provide the proximity of a state, we collect failing and successful
trajectories on the fly and train a proximity predictor to predict the proximity.
Utilizing the learned proximity predictors and proximity rewards for training transition policies
is beneficial in the following perspectives: (1) the dense rewards speed up transition policy training
by differentiating failing states from states in a successful trajectory; and (2) the joint training
mechanism prevents a transition policy from getting stuck in local optima. Whenever a transition
policy gets into a local optimum (i.e. fails the following skill with a high proximity reward), the
proximity predictor learns to lower the proximity for the failing transition as those states are added
to its failure buffer, escaping the local optimum.
4.4 Experiments
We conducted experiments on two classes of continuous control tasks: robotic manipulation
and locomotion. To illustrate the potential of the proposed framework, a modular framework with
Transition Policies (TP), we designed a set of complex tasks that require agents to utilize diverse
primitive skills which are not optimized for smooth composition. All of our environments are
simulated in the MuJoCo physics engine (Todorov et al., 2012).
4.4.1 Baselines
We evaluate our method to answer how transition policies benefit complex task learning and
how joint training with proximity predictors boosts training of transition policies. To investigate the
impact of the transition policy, we compared policies learned from dense rewards with our modular
framework that only learns from sparse and binary rewards (i.e. subtask completion rewards).
Moreover, we conducted ablation studies to dissect each component in the training method of
transition policies. To answer these questions, we compare the following methods:
• Trust Region Policy Optimization with dense reward (TRPO) represents a state-of-the-art
policy gradient method (Schulman et al., 2015), which we use for the standard RL comparison.
[Figure 4.4 panels: (a) Repetitive picking up, (b) Repetitive catching, (c) Serve, (d) Patrol, (e) Hurdle, (f) Obstacle course.]
Figure 4.4: Tasks and success count curves of our model (blue), TRPO (purple), PPO (magenta),
and transition policies (TP) trained on task reward (green) and sparse proximity reward (yellow).
Our model achieves the best performance and convergence time. Note that TRPO and PPO are
trained 5 times longer than ours with dense rewards since TRPO and PPO do not have primitive
skills and learn from scratch. In the success count curves, different temporal scales are used for
TRPO and PPO (bottom x-axis) and ours (top x-axis).
• Proximal Policy Optimization with dense reward (PPO) is another state-of-the-art policy
gradient method (Schulman et al., 2017), which is more stable than TRPO with smaller batch
sizes.
• Without transition policies (Without-TP) sequentially executes primitive policies without
transition policies and has no learnable components.
• Transition policies trained on task rewards (TP-Task) represents a modular network augmented
with transition policies learned from the sparse and binary reward (i.e. subtask completion reward),
whereas our model learns from the dense proximity reward.
Table 4.1: Success count for robotic manipulation, comparing our method against baselines with or
without transition policies (TP). Our method achieves the best performance over both RL baselines
and the ablated variants. Each entry in the table represents average success count and standard
deviation over 50 runs with 3 random seeds.
Method            Reward   Repetitive picking up   Repetitive catching   Serve
TRPO              dense    0.69 ± 0.46             4.54 ± 1.21           0.32 ± 0.47
PPO               dense    0.95 ± 0.53             4.26 ± 1.63           0.00 ± 0.00
Without TP        sparse   0.99 ± 0.08             1.00 ± 0.00           0.11 ± 0.32
TP-Task           sparse   0.99 ± 0.08             4.87 ± 0.58           0.05 ± 0.21
TP-Sparse         sparse   1.52 ± 1.12             4.88 ± 0.59           0.92 ± 0.27
TP-Dense (ours)   sparse   4.84 ± 0.63             4.97 ± 0.33           0.92 ± 0.27
• Transition policies trained on sparse proximity rewards (TP-Sparse) is a variant of our model
which has the proximity reward only at the end of the transition trajectory. In contrast, our model
learns from dense proximity rewards generated every timestep.
• Transition policies trained on dense proximity rewards (TP-Dense, Ours) is our final model
where transition policies learn from dense proximity rewards.
Initially, we tried comparing baseline methods with our method using only sparse and binary
rewards. However, the baselines could not solve any of the tasks due to the complexity and sparse
reward of the environments. To provide more competitive comparisons, we engineer dense rewards
for the baselines (TRPO and PPO) to boost their performance and give them 5 times longer training time. We show that transition policies trained with sparse rewards can compete with and even outperform baselines
learning from dense rewards. As the performance of TRPO and PPO varies significantly between
runs, we train each task with 3 different random seeds and report mean and standard deviation in
Figure 4.4.
4.4.2 Robotic Manipulation
For robotic manipulation, we simulate a Kinova Jaco, a 9 DoF robotic arm with 3 fingers. The
agent receives full state information, including the absolute location of external objects. The agent
uses joint torque control to perform actions. The results are shown in Figure 4.4 and Table 4.1.
Pre-trained primitives There are four pre-trained primitives available: Picking up, Catching,
Tossing, and Hitting. Picking up requires the robotic arm to pick up a small block, which is
randomly placed on the table. If the box is not picked up after a certain amount of time, the agent
fails. In Catching, the arm learns to catch a block that is thrown towards it with a random initial position
and velocity. The agent fails if it does not catch and stably hold the box for a certain amount of
time. Tossing requires the robot to pick up a box, toss it vertically in the air, and land the box at a
specified position. Hitting requires the robot to hit a box dropped overhead at a target ball.
Repetitive picking up The Repetitive picking up task requires the agent to complete the Picking up
task 5 times. After each successful pick, the box disappears and a new box will be placed randomly
on the table again. Our model achieves the best performance and converges the fastest by learning
from the proposed proximity reward. With our dense proximity reward at every transition step, we
alleviate credit assignment when compared to providing a sparse proximity reward (TP-Sparse)
or using a sparse task reward (TP-Task). Conversely, TRPO and PPO with dense rewards take
significantly longer to learn and are unable to pick up the second box, as the ending pose after the first
picking up is too unstable to initialize the next picking up.
Repetitive catching Similar to Repetitive picking up, the Repetitive catching task requires the
agent to catch boxes consecutively up to 5 times. In this task, other than the modular network
without a transition policy, all baselines are able to eventually learn while our model still learns
the fastest. We believe this is because the Catching primitive policy has a larger initiation set and
therefore, the sparse reward problem is less severe since random exploration is able to succeed with
a higher chance.
Serve Inspired by tennis, Serve requires the robot to toss the ball and hit it at a target. Even with
an extensively engineered reward, TRPO and PPO baselines fail to learn because Hitting is not
able to learn to cover all terminal states of Tossing (i.e., the set of initial states for Hitting is large, which demands a longer training time). In contrast, learning to recover from Tossing’s ending states
to Hitting’s initiation set is easier for exploration (11% of Tossing’s ending states are covered by
Hitting’s initiation set as can be seen in Table 4.1), which reduces the complexity of the task. Thus,
our method and the sparse proximity reward baseline are both able to solve it. However, the ablated
variant trained on task reward shows high success rates at the beginning of training and collapses
after 100 iterations. The performance drops because the transition policy tries to solve failure cases
by increasing the transition length until it reaches a point where it hardly gets any reward. This result shows that once the policy falls into a local optimum, it is not able to escape because the policy will
never get a sparse task reward. On the other hand, our method is robust to local optima since the
jointly learned dense proximity reward provides a learning signal to an agent even though it cannot
get a task reward.
4.4.3 Locomotion
For locomotion, we simulate a 9 DoF planar (2D) bipedal walker. The observation of the agent
includes joint position, rotation, and velocity. When the agent needs to interact with objects in the
environment, we provide additional input such as distance to the curb and ceiling in front of the
agent. The agent uses joint torque control to perform actions. The results are shown in Figure 4.4
and Table 4.2.
Pre-trained primitives Forward and Backward require the walker to walk forward and backward
with a certain velocity, respectively. Balancing requires the walker to robustly stand still under random external forces. Jumping requires the walker to jump over a randomly located curb and land
safely. Crawling requires the walker to crawl under a ceiling. In all the aforementioned scenarios,
the walker fails when the height of the walker is lower than a threshold.
Patrol (Forward and backward) The Patrol task involves walking forward and backward toward
goal points on either side and balancing in between to smoothly change its direction. As illustrated
in Figure 4.4, our method consistently outperforms TRPO, PPO, and ablated baselines in stably
Table 4.2: Success count for locomotion, comparing our method against baselines with or without
transition policies (TP). Our method outperforms all baselines in Patrol and Obstacle course. In
Hurdle, the reward function for TRPO was extensively engineered, which is not directly comparable
to our method. Our method outperforms baselines learning from sparse reward, showing the
effectiveness of the proposed proximity predictor. Each entry in the table represents average success
count and standard deviation over 50 runs with 3 random seeds.
Method            Reward   Patrol        Hurdle         Obstacle course
TRPO              dense    1.37 ± 0.52   4.13 ± 1.54    0.98 ± 1.09
PPO               dense    1.53 ± 0.53   2.87 ± 1.92    0.85 ± 1.07
Without TP        sparse   1.02 ± 0.14   0.49 ± 0.75    0.72 ± 0.72
TP-Task           sparse   1.69 ± 0.63   1.73 ± 1.28    1.08 ± 0.78
TP-Sparse         sparse   2.51 ± 1.26   1.47 ± 1.53    1.32 ± 0.99
TP-Dense (Ours)   sparse   3.33 ± 1.38   3.14 ± 1.69*   1.90 ± 1.45
walking forward and transitioning to walk backward. The agent trained with dense rewards is not
able to consistently switch directions, whereas our model can utilize previously learned primitives
including Balancing to stabilize a reversal in velocity.
Hurdle (Walking forward and jumping) The Hurdle task requires the agent to walk forward
and jump across curbs, which requires a transition between walking and jumping as well as landing
the jump to walking forward. As shown in Figure 4.4, our method outperforms the sparse reward
baselines, showing the efficiency of our proposed proximity reward. While TRPO with dense rewards
can learn this task as well, it requires dense rewards consisting of eight different components to
collectively enable TRPO to learn the task. It can be considered as learning both primitive skills and
transition between skills from dense rewards. However, the main focus of this chapter is to learn a
complex task by reusing acquired skills, avoiding an extensive reward design.
Obstacle Course (Walking forward, jumping, and crawling) Obstacle Course is the most
difficult among the locomotion tasks, where the walker must walk forward, jump across curbs,
and crawl underneath ceilings. It requires three different behaviors and transitions between two
very different primitive skills: crawling and jumping. Since the task requires significantly different
behaviors that are hard to transition between, TRPO fails to learn the task and only tries to crawl
toward the curb without attempting to jump. In contrast, our method learns to transition between all
pairs of primitive skills and often succeeds in crossing multiple obstacles.
4.4.4 Ablation Studies
We conducted additional experiments to understand the contribution of transition policies,
proximity predictors, and dense proximity rewards. The modular framework without transition
policies (Without-TP) tends to fail to execute the second skill since the second skill is not trained to cover the ending states of the first skill. In continuous control especially, training a primitive skill that can cover all possible states is very challenging. Transition policies trained from the task completion reward (TP-Task) and the sparse proximity reward (TP-Sparse) learn to connect consecutive primitives more slowly because sparse rewards are hard to learn from due to the credit assignment problem. On the other hand, our model alleviates the credit assignment problem and learns quickly by providing a predicted proximity reward for every transition state-action pair.
Figure 4.5: Average transition length and average proximity reward of transition trajectories over
training on Manipulation (left) and Patrol (right).
4.4.5 Training of Transition Policies and Proximity Predictors
To investigate how transition policies learn to solve the tasks, we present the lengths of transition
trajectories and the obtained proximity rewards during training in Figure 4.5. For manipulation, we
show the results of Repetitive picking up and Repetitive catching. For locomotion, we show Patrol
with three different transition policies.
The transition policy quickly learns to maximize the proximity reward regardless of the accuracy
of the proximity predictor. All the transition policies increase their transition length while exploring at the beginning, especially for picking up (55 steps) and balancing (45 steps). This is because a randomly initialized proximity predictor outputs high proximity for unseen states, and a transition policy tries to get a high reward by visiting these states. However, as these failing initial states with high proximity are collected in the failure buffers, the proximity predictor lowers their proximity and the transition policy learns to avoid them. In other words, the transition policy ends up seeking successful states. As transition policies learn to transition to the following skills, the transition length decreases to obtain higher proximity rewards earlier.
4.4.6 Visualization of Transition Trajectory
Figure 4.6a shows two transition trajectories (from s_0 to t_0 and from s_1 to t_1) and a two-dimensional PCA embedding of the ending states (blue) and initiation states (red) of the Picking up primitive. A transition policy starts from states s_0 and s_1, where the previous Picking up primitive is terminated. As can be seen in Figure 4.6a, the proximity predictor outputs small values for s_0 and s_1 since they are far from the initiation set of the Picking up primitive. The trajectories in the figure show that the transition policy moves toward states with higher proximity and finally ends up in states t_0 and t_1, which are in the initiation set of the primitive policy.
Figure 4.6b illustrates PCA embeddings of the initiation sets of three primitive skills, Forward (green), Backward (orange), and Balancing (blue). A transition from Forward to Balancing has a very long trajectory, but the predicted proximity helps the transition policy reach an initiation state t_0. On the other hand, transitioning between Balancing and Backward only requires 7 steps.
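For reference, an embedding of this kind can be produced with an off-the-shelf PCA projection; the following scikit-learn/matplotlib sketch is illustrative and is not the plotting code used for Figure 4.6.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_state_embedding(initiation_states, ending_states, transition_trajectory, proximities):
    """Project states onto a 2D PCA space fit on the primitive's initiation and
    ending states, then overlay one transition trajectory colored by P(s)."""
    pca = PCA(n_components=2).fit(np.vstack([initiation_states, ending_states]))
    init_2d = pca.transform(initiation_states)
    end_2d = pca.transform(ending_states)
    traj_2d = pca.transform(transition_trajectory)

    plt.scatter(end_2d[:, 0], end_2d[:, 1], c="tab:blue", s=5, label="ending states")
    plt.scatter(init_2d[:, 0], init_2d[:, 1], c="tab:red", s=5, label="initiation states")
    plt.plot(traj_2d[:, 0], traj_2d[:, 1], c="gray", lw=1)
    plt.scatter(traj_2d[:, 0], traj_2d[:, 1], c=proximities, cmap="viridis", label="transition")
    plt.legend()
    plt.show()
```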
4.5 Discussion
In this chapter, we propose a modular framework with transition policies to empower rein-
forcement learning agents to learn complex tasks with sparse reward by utilizing prior knowledge.
Figure 4.6: Visualization of transition trajectories of (a) Repetitive picking up and (b) Patrol. TOP AND BOTTOM ROWS: rendered frames of transition trajectories. MIDDLE ROW: states extracted from each primitive skill execution, projected onto PCA space. The dots connected with lines are extracted from the same transition trajectory, where the marker color indicates the proximity prediction P(s). A higher P(s) value indicates proximity to states suitable for initializing the next primitive skill. LEFT: two picking up transition trajectories demonstrate that the transition policy learns to navigate from terminal states s_0 and s_1 to t_0 and t_1. RIGHT: the forward-to-balance transition moves between the forward and balance state distributions, and the balance-to-backward transition moves from the balancing states close to the backward states.
Specifically, we formulate the problem as executing existing primitive skills while smoothly transi-
tioning between primitive skills. To learn transition policies in a sparse reward setting, we propose a proximity predictor that generates dense reward signals, and we jointly train transition policies
and proximity predictors. The experimental results on robotic manipulation and locomotion tasks
demonstrate the effectiveness of employing transition policies. The proposed framework solves
complex tasks without reward shaping and outperforms baseline RL algorithms and other ablated
baselines.
There are many future directions to investigate. The proposed method is designed to focus
on acquiring transition policies that connect a given set of primitive policies under the predefined
meta-policy. We believe that joint learning of a meta-policy and transition policies on a new task
would make it more flexible. Moreover, we made an assumption that a successful transition between
two consecutive policies should be achievable by random exploration. To alleviate the exploration
problem with sparse rewards, the transition policy training can incorporate exploration methods such
as count-based exploration bonuses (Bellemare et al., 2016; Martin et al., 2017) and curiosity-driven
intrinsic reward (Pathak et al., 2017). We also assume the primitive policies return a signal that
indicates whether the execution should be terminated or not, similar to Kulkarni et al. (2016); Oh
et al. (2017); Le et al. (2018). Learning to assess the successful termination of primitive policies
together with learning transition policies is a promising future direction.
Chapter 5
Skill Chaining via Terminal State Regularization
5.1 Introduction
Deep reinforcement learning (RL) presents a promising framework for learning impressive robot
behaviors (Levine et al., 2016; Suárez-Ruiz and Pham, 2016; Rajeswaran et al., 2018; Jain et al.,
2019). Yet, learning a complex long-horizon task using a single control policy is still challenging
mainly due to its high computational costs and the exploration burdens of RL models (Andrychowicz
et al., 2021). A more practical solution is to decompose a whole task into smaller chunks of subtasks,
learn a policy for each subtask, and sequentially execute the subtasks to accomplish the entire
task (Clegg et al., 2018; Lee et al., 2019b; Peng et al., 2019; Lee et al., 2020).
However, naively executing one policy after another would fail when the subtask policy en-
counters a starting state never seen during training (Clegg et al., 2018; Lee et al., 2019b; 2020). In
other words, a terminal state of one subtask may fall outside of the set of starting states that the
next subtask policy can handle, and thus fail to accomplish the subtask, as illustrated in Figure 5.1a.
Especially in robot manipulation, complex interactions between a high-DoF robot and multiple
objects could lead to a wide range of robot and object configurations, which are infeasible to be
covered by a single policy (Ghosh et al., 2018). Therefore, skill chaining with policies with limited
capability is not trivial and requires adapting the policies to make them suitable for sequential
execution.
To resolve the mismatch between the terminal state distribution (i.e. termination set) of one
policy and the initial state distribution (i.e. initiation set) of the next policy, prior skill chaining
approaches have attempted to learn to bring an agent to a suitable starting state (Lee et al., 2019b),
discover a chain of options (Konidaris and Barto, 2009; Bagaria and Konidaris, 2020), jointly fine-
tune policies to accommodate larger initiation sets that encompass terminal states of the preceding
policy (Clegg et al., 2018), or utilize modulated skills for smooth transition between skills (Pastor
et al., 2009; Kober et al., 2010; Mülling et al., 2013; Hausman et al., 2018; Lee et al., 2020).
Although these approaches can widen the initiation sets to smoothly sequence several subtask
policies, it quickly becomes infeasible as the larger initiation set often leads to an even larger
termination set, which is cascaded along the chain of policies, as illustrated in Figure 5.1b.
Instead of enlarging the initiation set to encompass a termination set modelled as a simple
Gaussian distribution (Clegg et al., 2018), we propose to keep the termination set small and near
the initiation set of the next policy. This can prevent the termination sets from becoming too large
to be covered by the subsequent policies, especially when executing a long sequence of skills;
thus, fine-tuning of subtask policies becomes more sample efficient. As a result, small changes in
the initiation sets of subsequent policies are sufficient to successfully execute a series of subtask
policies.
To this end, we devise an adversarial learning framework that learns an initiation set discriminator
to distinguish the initiation set of the following policy from terminal states, and uses it as a regularization that encourages the terminal states of a policy to lie near the initiation set of the next policy. With this terminal state regularization, the pretrained subtask policies are iteratively fine-tuned to solve the subtask from the new initiation set while keeping the termination set of each policy small enough to be covered by the initiation set of the subsequent policy. As a result, our model is
capable of chaining a sequence of closed-loop policies to accomplish a collection of multi-stage
IKEA furniture assembly tasks (Lee et al., 2021b) that require high-dimensional continuous control
under contact-rich dynamics.
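To give a rough sense of how such a discriminator can act as a terminal state regularizer, the sketch below adds a discriminator-based bonus at the final state of a fine-tuning episode. This is only an illustration of the high-level idea; the actual regularization objective, discriminator training, and weighting used in this chapter are presented in Section 5.3.3, and the class, function names, and additive form here are assumptions.

```python
import torch
import torch.nn as nn

class InitiationSetDiscriminator(nn.Module):
    """D(s): estimated probability that state s lies in the next policy's initiation set."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def regularized_terminal_reward(subtask_reward, terminal_state, discriminator, coeff=1.0):
    """Hypothetical terminal-state regularization: add a bonus when the episode ends
    in a state the discriminator judges to be in the next policy's initiation set.
    `terminal_state` is a 1-D state tensor."""
    with torch.no_grad():
        bonus = discriminator(terminal_state.unsqueeze(0)).item()
    return subtask_reward + coeff * bonus
```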
In summary, our contributions are threefold:
[Figure 5.1 panels: (a) Naive execution of π_i; (b) Widen the initiation set of π_i (Policy Sequencing); (c) Update π_i with terminal state regularization (Ours).]
Figure 5.1: We aim to solve a long-horizon task, e.g., furniture assembly, using independently trained subtask policies. (a) Each subtask policy π_i works successfully only on its initiation set (green) I_i and results in its termination set (pink) β_i; thus, it fails when performed outside of I_i (red curve). (b) To enable sequencing policies, a subsequent policy π_i needs to widen its initiation set to cover the termination set of the prior policy, β_{i−1}. But this can result in an increase of its termination set β_i, which makes fine-tuning of the following policy π_{i+1} even more challenging. This effect is exacerbated when more policies are chained together. (c) During fine-tuning of a policy π_i, we regularize the terminal state distribution β_i to be close to the initiation set of the next policy, I_{i+1}. In contrast to the boundless increase of β̃_i in (b), our approach effectively keeps the required initiation set small over the chain of policies with the terminal state regularization.
• We propose a novel adversarial skill chaining framework with Terminal STAte Regularization,
T-STAR, for learning long-horizon and hierarchical manipulation tasks.
• We demonstrate that the proposed method can learn a long-horizon manipulation task, furniture
assembly of two different types of furniture. Our terminal state regularization algorithm improves the success rate of the policy sequencing method from 0% to 56% for CHAIR INGOLF and from 59% to 87% for TABLE LACK. To the best of our knowledge, this is the first empirical
result of a model-free RL method that solves these furniture assembly tasks without manual
engineering.
• We present comprehensive comparisons with prior skill composition methods and qualitative
visualizations to analyze our model performances.
5.2 Related Work
Deep reinforcement learning (RL) for continuous control (Lillicrap et al., 2016; Schulman
et al., 2015; Haarnoja et al., 2018b) is an active research area. While some complex tasks can be
solved based on a reward function, undesired behaviors often emerge (Riedmiller et al., 2018) when
tasks require several different primitive skills. Moreover, learning such complex tasks becomes
computationally impractical as the tasks become long and complicated due to the credit assignment
problem and the large exploration space.
Imitation learning aims to reduce this complexity of exploration and the difficulty of learning
from reward signal. Behavioral cloning approaches (Pomerleau, 1989; Schaal, 1997; Finn et al.,
2016b; Pathak* et al., 2018) greedily imitate the expert policy and therefore suffer from accumulated
errors, causing a drift away from states seen in the demonstrations. On the other hand, inverse
reinforcement learning (Ng and Russell, 2000; Abbeel and Ng, 2004; Ziebart et al., 2008) and
adversarial imitation learning approaches (Ho and Ermon, 2016; Fu et al., 2018) encourage the
agent to imitate expert trajectories with a learned reward function, which can better handle the
compounding errors. Specifically, generative adversarial imitation learning (GAIL) (Ho and Ermon,
2016) and its variants (Fu et al., 2018; Kostrikov et al., 2019) show improved demonstration
efficiency by training a discriminator to distinguish expert versus agent transitions and using the
discriminator output as a reward for policy training. Although these imitation learning methods
can learn simple locomotion behaviors (Ho and Ermon, 2016) and a handful of short-horizon
manipulation tasks (Zolna et al., 2020; Kostrikov et al., 2019), these methods easily overfit to local
optima and still suffer from temporal credit assignment and accumulated errors for long-horizon
tasks.
Instead of learning an entire task using a single policy, we can tackle the task by decomposing it
into easier and reusable subtasks. Hierarchical reinforcement learning does this by decomposing a
task into a sequence of temporally extended macro actions. It often consists of one high-level policy
and a set of low-level policies, such as in the options framework (Sutton et al., 1999), in which
the high-level policy decides which low-level policy to activate and the chosen low-level policy
generates a sequence of atomic actions until the high-level policy switches it to another low-level
policy. Options can be discovered without supervision (Schmidhuber, 1990; Bacon et al., 2017;
Nachum et al., 2018; Levy et al., 2019; Konidaris and Barto, 2009; Bagaria and Konidaris, 2020),
learned from data (Konidaris et al., 2012; Kipf et al., 2019; Lu et al., 2021b), meta-learned (Frans
et al., 2018), pre-defined (Lee et al., 2019b; Kulkarni et al., 2016; Oh et al., 2017; Merel et al.,
2019a), or attained through task structure supervision (Andreas et al., 2017; Ghosh et al., 2018).
To synthesize complex motor skills with a set of predefined skills, Lee et al. (2019b) learns
to find smooth transitions between subtask policies. This method assumes the subtask policies
are fixed, which makes learning such transitions feasible but, at the same time, leads to failed transitions when an external state has to be changed or when the end state of the prior subtask policy is too far away from the initiation set of the following subtask policy. On the other hand,
Clegg et al. (2018) proposes to sequentially improve subtask policies to cover the terminal states
of previous subtask policies. While this method fails when the termination set of the prior policy
becomes extremely large, our method prevents such boundless expansion of the termination set.
Closely related to our work, prior skill chaining methods (Konidaris and Barto, 2009; Bagaria
and Konidaris, 2020; Konidaris et al., 2012) have proposed to discover a new option that ends with
an initiation set of the previous option. With newly discovered options, the agent can reach the
goal from more initial states. However, these methods share a similar issue with Clegg et al. (2018): discovering a new option requires a sufficiently large initiation set of the previous option, and this requirement is cascaded along the chain of options.
Furniture assembly is a challenging robotics task requiring reliable 3D perception, high-level
planning, and sophisticated control. Prior works (Niekum et al., 2013; Knepper et al., 2013; Suárez-Ruiz et al., 2018) tackle this problem by manually programming the high-level plan and learning
only a subset of low-level controls. In this chapter, we propose to tackle the furniture assembly
task in simulation (Lee et al., 2021b) by learning all the low-level control skills and then effectively
sequencing these skills.
5.3 Approach
In this chapter, we aim to address the problem of chaining multiple policies for long-horizon
complex manipulation tasks, especially in furniture assembly (Lee et al., 2021b). Sequentially
executing skills often fails when a policy encounters a starting state (i.e. a terminal state of the
preceding policy) never seen during its training. The subsequent policy can learn to address these new states, but this may require an even larger set of states to be covered by the policies that follow. To
chain multiple skills without requiring boundlessly large initiation sets of subtask policies, we
introduce a novel adversarial skill chaining framework that constrains the terminal state distribution,
as illustrated in Figure 5.2. Our approach first learns each subtask using subtask rewards and
demonstrations (Section 5.3.2); it then adjusts the subtask policies to effectively chain them to
complete the whole task via the terminal state regularization (Section 5.3.3).
5.3.1 Preliminaries
We formulate our learning problem as a Markov Decision Process (Sutton, 1984) defined through a tuple (S, A, R, P, ρ_0, γ) for the state space S, action space A, reward function R(s, a, s'), transition distribution P(s'|s, a), initial state distribution ρ_0, and discount factor γ. We define a policy π : S → A that maps from states to actions and correspondingly moves an agent to a new state according to the transition probabilities. The policy is trained to maximize the expected sum of discounted rewards E_{(s,a)∼π}[∑_{t=0}^{T−1} γ^t R(s_t, a_t, s_{t+1})], where T is the episode horizon. Each policy comes with an initiation set I ⊂ S and termination set β ⊂ S, where the initiation set I contains all initial states that lead to successful execution of the policy and the termination set β consists of all final states of successful executions. We assume the environment provides a success indicator of each subtask and this can be easily inferred from the final state, e.g., two parts are aligned and the connect action is activated. In addition to the reward function, we assume the learner receives a fixed set of expert demonstrations, D^e = {τ^e_1, ..., τ^e_N}, where a demonstration is a sequence of state-action pairs, τ^e_j = (s_0, a_0, ..., s_{T_j−1}, a_{T_j−1}, s_{T_j}).
[Figure 5.2 diagram: subtask policies with initiation set discriminators, initial and terminal state buffers, and the combined reward R = λ_1 R_ENV + λ_2 R_GAIL + λ_3 R_TSR]
Figure 5.2: Our adversarial skill chaining framework regularizes the terminal state distribution to
be close to the initiation set of the subsequent subtask. The initiation set discriminator models
the initiation set distribution by discerning the initiation set and states in agent trajectories, while
the policy learns to reach states close to the initiation set by augmenting the reward with the
discriminator output, dubbed terminal state regularization. Our method jointly trains all policies
and initiation set discriminators, pushing the termination set close to the initiation set of the next
policy, which reduces the changes required for the policies that follow and is especially effective in a long chain of skills.
5.3.2 Learning Subtask Policies
To solve a complicated long-horizon task (e.g. assembling a table), we decompose the task
into smaller subtasks (e.g. assembling a table leg to a table top), learn a policy π^i_θ for each subtask M_i, and chain these subtask policies. Learning each subtask solely from reward signals is still
challenging in robot manipulation due to a huge state space of the robot and objects to explore,
and complex robot control and dynamics. Thus, for efficient exploration, we use an adversarial
imitation learning approach, GAIL (Ho and Ermon, 2016), which encourages the agent to stay near
the expert trajectories using a learned reward by discerning expert and agent behaviors. Together
with reinforcement learning, the policy can efficiently learn to solve the subtask even on states not
covered by the demonstrations.
More specifically, we train each subtask policy π^i_θ on R^i following the reward formulation proposed in (Zhu et al., 2018; Peng et al., 2021), which uses the weighted sum of the environment and GAIL rewards:

R^i(s_t, a_t, s_{t+1}; φ) = λ_1 R^i_ENV(s_t, a_t, s_{t+1}) + λ_2 R^i_GAIL(s_t, a_t; φ),    (5.1)

where R^i_GAIL(s_t, a_t; φ) = 1 − 0.25 · [f^i_φ(s_t, a_t) − 1]^2 is the reward predicted by the GAIL discriminator f^i_φ (Peng et al., 2021), and λ_1 and λ_2 are hyperparameters that balance the reinforcement learning and imitation learning objectives. We found this reward formulation (Peng et al., 2021) to be the most stable among variants of GAIL (Ho and Ermon, 2016) in our experiments thanks to its reward being bounded in [0, 1]. The discriminator is trained using the following objective:

min_{f^i_φ}  E_{s∼D^e}[(f^i_φ(s) − 1)^2] + E_{s∼π^i}[(f^i_φ(s) + 1)^2].
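To make this reward computation concrete, the sketch below shows one way the bounded GAIL reward and the least-squares discriminator objective could be implemented; the PyTorch framing, the function names, the clamping to [0, 1], and the generic discriminator input x (state or state-action features) are illustrative assumptions rather than the exact implementation used in this thesis.

```python
import torch

def gail_reward(f_disc, x):
    # Bounded GAIL reward from Eq. (5.1): 1 - 0.25 * (f(x) - 1)^2,
    # clamped so the reward stays within [0, 1].
    return torch.clamp(1.0 - 0.25 * (f_disc(x) - 1.0) ** 2, 0.0, 1.0)

def subtask_reward(r_env, f_disc, x, lam1=1.0, lam2=1.0):
    # Weighted sum of environment and GAIL rewards (Eq. 5.1); lam1 and lam2
    # are the balancing hyperparameters (defaults here are placeholders).
    return lam1 * r_env + lam2 * gail_reward(f_disc, x)

def discriminator_loss(f_disc, expert_x, agent_x):
    # Least-squares discriminator objective: expert samples are regressed
    # toward +1 and agent samples toward -1.
    return ((f_disc(expert_x) - 1.0) ** 2).mean() + ((f_disc(agent_x) + 1.0) ** 2).mean()
```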
Due to computational limitations, training a subtask policy for all possible initial states is
impractical, and hence it can cause failure on states unseen during training. Instead of indefinitely
increasing the initiation set, we first train the policy on a limited set of initial states (e.g. predefined
initial states with small noise), and later fine-tune the policy on the set of initial states required for
skill chaining as described in the following section. This pretraining of subtask policies ensures the
quality of the pretrained skills and makes the fine-tuning stage of our method easy and efficient.
5.3.3 Skill Chaining with Terminal State Regularization
Once subtask policies are acquired, one can sequentially execute them to complete more complex tasks. However, naively executing the policies one by one would fail since the policies are not trained to be smoothly connected. As can be seen in Figure 5.1a, independently trained subtask policies only work on a limited range of initial states. Therefore, the execution of a policy π^i fails on the terminal states of the preceding policy β^{i−1} that fall outside of its initiation set I^i.
For successful sequential execution of π^{i−1} and π^i, the terminal states of the preceding policy should be included in the initiation set of the subsequent policy, β^{i−1} ⊂ I^i. This can be achieved
Algorithm 1 T-STAR: Skill chaining via terminal state regularization
Require: Expert demonstrations D^e_1, ..., D^e_K, subtask MDPs M_1, ..., M_K
1: Initialize subtask policies π^1_θ, ..., π^K_θ, GAIL discriminators f^1_φ, ..., f^K_φ, initiation set discriminators D^1_ω, ..., D^K_ω, initial state buffers B^1_I, ..., B^K_I, and terminal state buffers B^1_β, ..., B^K_β
2: for each subtask i = 1, ..., K do
3:     while until convergence of π^i_θ do
4:         Rollout trajectories τ = (s_0, a_0, r_0, ..., s_T) with π^i_θ
5:         Update f^i_φ with τ and τ^e ∼ D^e_i    ▷ Train GAIL discriminator
6:         Update π^i_θ using R^i(s_t, a_t, s_{t+1}; φ) in Equation (5.1)    ▷ Train subtask policy
7:     end while
8: end for
9: for iteration m = 0, 1, ..., M do
10:    for each subtask i = 1, ..., K do
11:        Sample s_0 from the environment or B^{i−1}_β
12:        Rollout trajectories τ = (s_0, a_0, r_0, ..., s_T) with π^i_θ
13:        if τ is successful then
14:            B^i_I ← B^i_I ∪ s_0,  B^i_β ← B^i_β ∪ s_T    ▷ Collect initial and terminal states of successful trajectories
15:        end if
16:        Update f^i_φ with τ and τ^e ∼ D^e_i    ▷ Fine-tune GAIL discriminator
17:        Update D^i_ω with s_β ∼ B^{i−1}_β and s_I ∼ B^i_I    ▷ Train initiation set discriminator
18:        Update π^i_θ using R^i(s_t, a_t, s_{t+1}; φ, ω) in Equation (5.3)    ▷ Fine-tune subtask policy with terminal state regularization
19:    end for
20: end for
either by widening the initiation set of the subsequent policy or by shifting the terminal state
distribution of the preceding policy. However, in robot manipulation, the set of valid terminal states
can be huge with freely located objects (e.g. a robot can mess up the workplace by moving or
throwing other parts away). This issue is cascaded along the chain of policies, which leads to the
boundlessly large initiation set required for policies, as illustrated in Figure 5.1b.
Therefore, a policy needs to not only increase the initiation set (e.g. assemble the table leg
in diverse configurations) but also regularize the termination set to be bounded and close to the
initiation set of the subsequent policies (e.g. keep the workplace organized), as described in
Figure 5.1c. To this end, we devise an adversarial framework which jointly trains an initiation set
discriminator, D^i_ω(s_t), to distinguish the terminal states of the preceding policy and the initiation set of the corresponding policy, and a policy to reach the initiation set of the subsequent policy with the guidance of the initiation set discriminator. We train the initiation set discriminator for each policy to minimize the following objective:

L^i(ω) = E_{s_I ∼ I^i}[(D^i_ω(s_I) − 1)^2] + E_{s_T ∼ β^{i−1}}[(D^i_ω(s_T))^2].
With the initiation set discriminator, we regularize the terminal state distribution of the policy by encouraging the policy to reach a terminal state close to the initiation set of the following policy. The terminal state regularization can be formulated as follows:

R^i_TSR(s; ω) = 1_{s ∈ β^i} · D^{i+1}_ω(s).    (5.2)

Then, we can rewrite the reward function with the terminal state regularization:

R^i(s_t, a_t, s_{t+1}; φ, ω) = λ_1 R^i_ENV(s_t, a_t, s_{t+1}) + λ_2 R^i_GAIL(s_t, a_t; φ) + λ_3 R^i_TSR(s_{t+1}; ω),    (5.3)

where λ_3 is a weighting factor for the terminal state regularization. The first two terms of this reward function guide a policy to accomplish the subtask while the terminal state regularization term forces the termination set to be closer to the initiation set of the following policy.
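As a concrete illustration, the sketch below implements the initiation set discriminator objective and the terminal-state-regularized reward of Equations (5.2) and (5.3); the function names, the is_terminal flag used to realize the indicator 1_{s ∈ β^i}, and the default weights are assumptions made only for the example.

```python
def initiation_disc_loss(D, init_states, prev_terminal_states):
    # Least-squares objective for the initiation set discriminator D^i_omega:
    # initiation-set states are regressed toward 1, terminal states of the
    # preceding policy toward 0.
    return ((D(init_states) - 1.0) ** 2).mean() + (D(prev_terminal_states) ** 2).mean()

def tsr_reward(D_next, state, is_terminal):
    # Terminal state regularization (Eq. 5.2): only the final state of the
    # rollout is scored by the next subtask's initiation set discriminator.
    return D_next(state) if is_terminal else 0.0

def chained_reward(r_env, r_gail, D_next, next_state, is_terminal,
                   lam1=1.0, lam2=1.0, lam3=1.0):
    # Full reward with terminal state regularization (Eq. 5.3).
    return lam1 * r_env + lam2 * r_gail + lam3 * tsr_reward(D_next, next_state, is_terminal)
```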
With this reward function incorporating the terminal state regularization, subtask policies and
GAIL discriminators can be trained to cover unseen initial states while keeping the termination set
closer to the initiation set of the next policy. Once subtask policies are updated, we collect terminal
states and initiation sets with the updated policies, and train the initiation set discriminators. We
alternate these procedures to smoothly chain subtask policies, as summarized in Algorithm 1, where
changes of our algorithm with respect to Clegg et al. (2018) are marked in red.
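For reference, the fine-tuning phase of Algorithm 1 could be organized as in the sketch below; the rollout, update, and buffer interfaces (rollout_fn, update_skill, sample/add) are hypothetical placeholders standing in for the RL machinery, not the actual training code of this thesis.

```python
def finetune_skill_chain(policies, gail_discs, init_discs, init_bufs, term_bufs,
                         env, rollout_fn, update_skill, num_iters):
    # Alternate between collecting rollouts, growing the initial/terminal state
    # buffers, updating the discriminators, and fine-tuning each subtask policy
    # with the terminal-state-regularized reward of Eq. (5.3).
    K = len(policies)
    for _ in range(num_iters):
        for i in range(K):
            # Start from an environment initial state for the first subtask,
            # otherwise from a terminal state of the preceding subtask policy.
            s0 = env.reset(subtask=i) if i == 0 else term_bufs[i - 1].sample()
            traj = rollout_fn(policies[i], env, s0)
            if traj.success:
                init_bufs[i].add(traj.states[0])
                term_bufs[i].add(traj.states[-1])
            D_next = init_discs[i + 1] if i + 1 < K else None
            update_skill(policies[i], gail_discs[i], init_discs[i], D_next, traj,
                         init_bufs[i], term_bufs[i - 1] if i > 0 else None)
```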
5.4 Experiments
In this chapter, we propose a skill chaining approach with terminal state regularization, which encourages the terminal state distribution of the prior skill to match suitable starting states of the following skill. Through our experiments, we aim to verify our hypothesis that policy sequencing fails due to unbounded terminal states, an effect that cascades along the sequence of skills, and to show the effectiveness of our framework on learning a long sequence of complex manipulation skills.
5.4.1 Baselines
We compare our method to the state-of-the-art prior works in reinforcement learning, imitation
learning, and skill composition, which are listed below:
• BC (Pomerleau, 1989) fits a policy to the demonstration actions with supervised learning.
• PPO (Schulman et al., 2017) is a model-free on-policy RL method that learns a policy from the
environment reward.
• GAIL (Ho and Ermon, 2016) is an adversarial imitation learning approach with a discriminator
trained to distinguish expert and agent state-action pairs(s,a).
• GAIL + PPO uses both the environment and GAIL reward (Equation (5.1)) to optimize a policy.
• SPiRL (Pertsch et al., 2020a) is a hierarchical RL approach that learns fixed-length skills and
a skill prior from the dataset, and then learns a downstream task using skill-prior-regularized RL.
• Policy Sequencing (Clegg et al., 2018) first trains a policy for each subtask independently and
finetunes the policies to cover the terminal states of the previous policy.
• T-STAR (Ours) learns subtask policies simultaneously, leading them to be smoothly connected
using the terminal state regularization.
5.4.2 Tasks
We test our method and baselines with two furniture models, TABLE LACK and CHAIR INGOLF, from the IKEA furniture assembly environment (Lee et al., 2021b), as illustrated in Figure 5.3:
• TABLE LACK: Four table legs need to be picked up and aligned to the corners of the table top.
(a) TABLE LACK (b) CHAIR INGOLF
Figure 5.3: Two furniture assembly tasks (Lee et al., 2021b) consist of four subtasks (four table legs for TABLE LACK; two seat supports, chair seat, and front legs for CHAIR INGOLF).
• CHAIR INGOLF: Two chair supports and front legs need to be attached to the chair seat. Then, the chair seat needs to be attached to the chair back while avoiding collisions with each other.
In our experiments, we define a subtask as assembling one part to another; thus, we have four
subtasks for each task. Subtasks are independently trained on the initial (object) states sampled
from the environment with random noise ranging from [−2 cm, 2 cm] and [−3°, 3°] in the (x, y)-plane.
This subtask decomposition is given, i.e., the environment can be initialized for each subtask and
the agent is informed whether the subtask is completed and successful.
For the robotic agent, we use the 7-DoF Rethink Sawyer robot operated via joint velocity control.
For imitation learning, we collected 200 demonstrations for each furniture part assembly with a
programmatic assembly policy. Each demonstration for single-part assembly is around 200-900 steps long due to the long-horizon nature of the task.
The observation space includes robot observations (29 dim), object observations (35 dim),
and task phase information (8 dim). The object observations contain the positions (3 dim) and
quaternions (4 dim) of all five furniture pieces in the scene. Once two parts are attached, the
corresponding subtask is completed and the robot arm moves back to its initial pose in the center of
the workplace. Our approach can in theory solve the task without resetting the robot; however, this
resetting behavior effectively reduces the gap in robot states when switching skills and is available
in most robots.
Table 5.1: Average progress of the furniture assembly task. Each subtask amounts to 0.25 progress.
Hence, 1 represents successful execution of all four subtasks while 0 means the agent does not
achieve any subtask. Our method learns to complete all four subtasks in sequence and outperforms
the policy sequencing baseline and standard RL and IL methods. We report the mean and standard
deviation across 5 seeds.
TABLE LACK CHAIR INGOLF
BC (Pomerleau, 1989) 0.03± 0.00 0.04± 0.01
PPO (Schulman et al., 2017) 0.09± 0.11 0.14± 0.03
GAIL (Ho and Ermon, 2016) 0.00± 0.00 0.00± 0.00
GAIL + PPO (Zhu et al., 2018) 0.21± 0.11 0.22± 0.08
SPiRL (Pertsch et al., 2020a) 0.05± 0.00 0.03± 0.00
Policy Sequencing (Clegg et al., 2018) 0.63± 0.28 0.77± 0.12
T-STAR (Ours) 0.90± 0.07 0.89± 0.04
5.4.3 Results
The results in Table 5.1 show the average progress of the furniture assembly tasks across 200
testing episodes for 5 different seeds. Since each task consists of assembling furniture parts four
times, completing one furniture part assembly amounts to task progress of 0.25. Even though the
environment has small noises in furniture initialization, both BC (Pomerleau, 1989) and GAIL (Ho
and Ermon, 2016) baselines move the robot arm near furniture pieces but struggle at picking up
even one furniture piece. This shows the limitation of BC and GAIL in dealing with compounding
errors in long-horizon tasks with the large state space and continuous action space. Similarly, the
hierarchical skill-based learning approach, SPiRL, also struggles at learning picking up a single
furniture piece. This can be due to the insufficient amount of data to cover a long sequence of
skills on the large state space, e.g., many freely located objects. Moreover, due to the difficulty of
exploration, the model-free RL baseline, PPO, rarely learns to assemble one part. On the other hand,
the GAIL + PPO baseline can consistently learn one-part assembly, but cannot learn to assemble
further parts due to the exploration challenge and temporal credit assignment problem. These
baselines are trained for 200M environment steps (5M for off-policy SPiRL).
By utilizing pretrained subtask policies with GAIL + PPO (25M steps for each subtask), skill
chaining approaches could achieve improved performance compared to the single-policy baselines.
We train the policy sequencing baseline and our method for an additional 100M steps, which requires
in total 200M steps including the pretraining stage. The policy sequencing baseline (Clegg et al.,
2018) achieves 0.63 and 0.77 average task progress, whereas our method achieves 0.90 and 0.89
average task progress on TABLE LACK and CHAIR INGOLF, respectively. The performance gain of
our method comes from the reduced discrepancy between the termination set and the initiation set
of the next subtask thanks to the terminal state regularization.
We can observe this performance gain even more clearly in the success rates. Our terminal state
regularization improves the success rate of policy sequencing from 0% to 56% for CHAIR INGOLF
and from 59% to 87% for TABLE LACK. We observe that with more skills to be chained, the success
rate of the newly chained skill decreases, especially in CHAIR INGOLF; the policy sequencing
baseline learns to complete the first three subtasks, but fails to learn the last subtask due to its excessively large and shifted initial state distribution, i.e., the terminal states of the preceding subtask.
5.4.4 Qualitative Results
To analyze the effect of the proposed terminal state regularization, we visualize the changes in
termination sets over training. We first collect 300 terminal states of the third subtask policy on
CHAIR INGOLF both for our method and the policy sequencing baseline for 36M, 39M, 42M, and
45M training steps. Then, we apply PCA on the object state information in the terminal states and
use the first two principal components to reduce the data dimension.
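The 2D projection used for this visualization can be reproduced with standard PCA tooling, e.g. as in the sketch below; fitting a single PCA on the pooled terminal states of both methods so that they share a common projection is an assumption about the exact protocol.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_terminal_states(states_ours, states_ps):
    # states_*: arrays of shape (num_rollouts, object_state_dim) holding the
    # object configurations of the collected terminal states.
    pooled = np.concatenate([states_ours, states_ps], axis=0)
    pca = PCA(n_components=2).fit(pooled)  # first two principal components
    return pca.transform(states_ours), pca.transform(states_ps)
```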
Figure 5.4 shows that our method effectively constrains the terminal state distribution of a
subtask policy. Before 36M training steps, the policy cannot solve the third subtask due to the
initial state distribution shifted by the second subtask policy. With additional training, at 42M and
45M training steps, the policy learns to solve the subtask on newly added initial states for both
(a) 36M steps (b) 39M steps (c) 42M steps (d) 45M steps
Figure 5.4: To demonstrate the benefit of our terminal state regularization, we visualize the changes
in termination sets over training of the third subtask policy on CHAIR INGOLF. We plot each
terminal state by projecting its object configuration into 2D space using PCA. Through 36M to 45M
training steps, both the policy sequencing baseline (Clegg et al., 2018) and our method successfully
learn to cover most terminal states from the second subtask. However, without regularization, the
policy sequencing method (red) shows an increasing termination set (e.g. spreading out horizontally at 39M and vertically at 42M steps) as more initial states are covered by the policy. In contrast, in our approach (blue), the terminal state distribution is bounded, which shows that the terminal state regularization can effectively prevent the terminal state distribution from diverging. This bounded termination set makes learning of the following skills efficient, and thus helps chain a long sequence of skills.
methods. From 36M to 45M steps of training, the termination set of the policy sequencing baseline
(red) spreads out horizontally after 39M training steps and vertically after 42M steps. For successful
skill chaining, this wide termination set has to be covered by the following policy, requiring a large
change in the policy and potentially causing an even larger termination set. In contrast, the terminal
state distribution of our method (blue) does not excessively increase but actually shrinks in this
experiment, which leads to successful execution and efficient adaptation of subsequent subtask
policies.
5.5 Discussion
In this chapter, we proposed T-STAR, a novel adversarial skill chaining framework that ad-
dresses the problem of the increasing size of initiation sets required for executing a long chain of
manipulation skills. To prevent excessively large initiation sets from being required, we regularize the
terminal state distribution of a subtask policy to be close to the initiation set of the following subtask
policy. Through terminal state regularization, T-STAR jointly trains all subtask policies to ensure
that the final state of one policy is a good initial state for the policy that follows. We demonstrate
the effectiveness of T-STAR on the challenging furniture assembly tasks, where prior skill chaining
approaches fail. These results are promising and motivate future work on chaining more skills
with diverse skill combinations to tackle complex long-horizon problems. Another interesting
research direction is eliminating subtask supervision required in T-STAR and discovering subtask
decomposition from large datasets in an unsupervised manner.
Chapter 6
Coordinating Skills via Skill Behavior Diversification
6.1 Introduction
Imagine you wish to play Chopin’s Fantaisie Impromptu on the piano. With little prior knowl-
edge about the piece, you would first practice playing the piece with each hand separately. After
independently mastering the left and right hand parts, you would move on to practicing with both
hands simultaneously. To find the synchronized and non-interfering movements of two hands,
you would try variable ways of playing the same melody with each hand, and eventually create
a complete piece of music. Through the decomposition of skills into sub-skills of two hands and
learning variations of sub-skills, humans make the learning process of manipulation skills much
faster than learning everything at once.
Can autonomous agents efficiently learn complicated tasks with coordination of different skills
from multiple end-effectors like humans? Learning to perform collaborative and composite tasks
from scratch requires a huge amount of environment interaction and extensive reward engineering,
which often results in undesired behaviors (Riedmiller et al., 2018). Hence, instead of learning a
task at once, modular approaches (Andreas et al., 2017; Oh et al., 2017; Frans et al., 2018; Lee
et al., 2019b; Peng et al., 2019; Goyal et al., 2020) suggest to learn reusable primitive skills and
solve more complex tasks by recombining the skills. However, all these approaches focus on either single end-effector manipulation or single-agent locomotion, and they do not scale to multi-agent problems.
Figure 6.1: Composing complex skills using multiple agents’ primitive skills requires proper
coordination between agents since concurrent execution of primitive skills requires temporal and
behavioral coordination. For example, to move a block into a container on the other end of the
table, the agent needs to not only utilize pick, place, and push primitive skills at the right time
but also select the appropriate behaviors for these skills, represented as latent vectors z_1, z_2, z_3, and z_4 above. Naive methods neglecting either temporal or behavioral coordination will produce
unintended behaviors, such as collisions between end-effectors.
To this end, we propose a modular framework that learns to coordinate multiple end-effectors
with their primitive skills for various robotics tasks, such as bimanual manipulation. The main
challenge is that naive simultaneous execution of primitive skills from multiple end-effectors can
often cause unintended behaviors (e.g. collisions between end-effectors). Thus, as illustrated in
Figure 6.1, an agent needs to learn to appropriately coordinate end-effectors; and hence needs a
way to obtain, represent, and control detailed behaviors of each primitive skill. Inspired by these
intuitions, our method consists of two parts: (1) acquiring primitive skills with diverse behaviors
by mutual information maximization, and (2) learning a meta policy that selects a skill for each
end-effector and coordinates the chosen skills by controlling the behavior of each skill.
The main contribution of this chapter is a modular and hierarchical approach that tackles
cooperative manipulation tasks with multiple end-effectors by (1) learning primitive skills of each
end-effector independently with skill behavior diversification and (2) coordinating end-effectors
using diverse behaviors of the skills. The empirical results indicate that the proposed method is
able to efficiently learn primitive skills with diverse behaviors and coordinate these skills to solve
challenging collaborative control tasks such as picking up a long bar, placing a block inside the
container on the right side, and pushing a box with two ant agents. We provide additional qualitative
results and code at https://clvrai.com/coordination.
6.2 Related Work
Deep reinforcement learning (RL) for continuous control is an active research area. However,
learning a complex task either from a sparse reward or a heavily engineered reward becomes
computationally impractical as the target task becomes complicated. Instead of learning from
scratch, complex tasks can be tackled by decomposing the tasks into easier and reusable sub-tasks.
Hierarchical reinforcement learning temporally splits a task into a sequence of temporally extended
meta actions. It often consists of one meta policy (high-level policy) and a set of low-level policies,
such as options framework (Sutton et al., 1999). The meta policy decides which low-level policy to
activate and the chosen low-level policy generates an action sequence until the meta policy switches
it to another low-level policy. Options can be discovered without supervision (Schmidhuber, 1990;
Bacon et al., 2017; Nachum et al., 2018; Levy et al., 2019), meta-learned (Frans et al., 2018),
pre-defined (Kulkarni et al., 2016; Oh et al., 2017; Merel et al., 2019a; Lee et al., 2019b), or attained
from additional supervision signals (Andreas et al., 2017; Ghosh et al., 2018). However, option frameworks are not flexible enough to solve a task that requires simultaneous activation or interpolation of
multiple skills since only one skill can be activated at each time step.
To solve composite tasks, multiple policies can be simultaneously activated by adding Q-
functions (Haarnoja et al., 2018a), additive composition (Qureshi et al., 2020; Goyal et al., 2020), or
multiplicative composition (Peng et al., 2019). As each policy takes the whole observation as input
and controls the whole agent, it is not robust to changes in unrelated parts of the observation. For
example, a left arm skill can be affected by the pose change in the right arm, which is not relevant
to the left arm skill. Hence, these skill composition approaches can fail when an agent encounters
a new combination of skills or a new skill is introduced since the agent will experience unseen
observations.
Instead of having a policy with the full observation and action space, multi-agent reinforcement
learning (MARL) suggests to explicitly split the observation and action space according to agents
(e.g. robots or end-effectors), which allows efficient low-level policy training as well as flexible skill
composition. For cooperative tasks, communication mechanisms (Sukhbaatar et al., 2016; Peng et al.,
2017a; Jiang and Lu, 2018), sharing policy parameters (Gupta et al., 2017), and decentralized actor
with centralized critic (Lowe et al., 2017; Foerster et al., 2018) have been actively used. However,
these approaches suffer from the credit assignment problem (Sutton, 1984) among agents and the
lazy agent problem (Sunehag et al., 2018). As agents have more complicated morphology and
larger observation space, learning a policy for a multi-agent system from scratch requires extremely
long training time. Moreover, the credit assignment problem becomes more challenging when the
complexity of cooperative tasks increases and all agents need to learn completely from scratch. To
resolve these issues, we propose to first train reusable skills for each agent in isolation, instead of
learning primitive skills of multiple agents together. Then, we recombine these skills (Maes and
Brooks, 1990) to complete more complicated tasks with learned coordination of the skills.
To coordinate skills from multiple agents, the skills have to be flexible so that a skill can be adjusted to collaborate with other agents’ skills. Maximum entropy policies (Haarnoja et al.,
2017; 2018a;b) can learn diverse ways to achieve a goal by maximizing not only reward but also
entropy of the policy. In addition, Eysenbach et al. (2019) proposes to discover diverse skills
without reward by maximizing entropy as well as mutual information between resulting states
and latent representations of skills (i.e. skill embeddings). Our method leverages the maximum
entropy policy (Haarnoja et al., 2018b) with the discriminability objective (Eysenbach et al., 2019)
to learn a primitive skill with diverse behaviors conditioned on a controllable skill embedding. This
controllable skill embedding will be later used as a behavior embedding for the meta policy to
adjust a primitive skill’s behavior for coordination.
6.3 Method
In this chapter, we address the problem of solving cooperative manipulation tasks that require col-
laboration between multiple end-effectors or agents. Note that we use the terms “end-effector” and
“agent” interchangeably in this chapter. Instead of learning a multi-agent task from scratch (Lowe
et al., 2017; Gupta et al., 2017; Sunehag et al., 2018; Foerster et al., 2018), modular approaches (An-
dreas et al., 2017; Frans et al., 2018; Peng et al., 2019) suggest to learn reusable primitive skills and
Figure 6.2: Our method is composed of two components: a meta policy and a set of agent-specific
primitive policies relevant to task completion. The meta policy selects which primitive skill to run
for each agent as well as the behavior embedding (i.e. variation in behavior) of the chosen primitive
skill. Each selected primitive skill takes as input the agent observation and the behavior embedding
and outputs an action for that agent.
solve more complex tasks by recombining these skills. However, concurrent execution of primitive
skills of multiple agents fails when the agents have never experienced a combination of skills during the pre-training stage, or when the skills require temporal or behavioral coordination.
Therefore, we propose a modular and hierarchical framework that learns to coordinate multiple
agents with primitive skills to perform a complex task. Moreover, during primitive skill training, we
propose to learn a latent behavior embedding, which provides controllability of each primitive skill
to the meta policy while coordinating skills. In Section 6.3.2, we describe our modular framework
in detail. Next, in Section 6.3.3, we elaborate how controllable primitive skills can be acquired.
Lastly, we describe how the meta policy learns to coordinate primitive skills in Section 6.3.4.
6.3.1 Preliminaries
We formulate our problem as a Markov decision process defined by a tuple {S, A, T, R, ρ, γ} of states, actions, transition probability, reward, initial state distribution, and discount factor. In our formulation, we assume the environment includes N agents. To promote consistency in our terminology, we use superscripts to denote the index of an agent and subscripts to denote time or the primitive skill index. Hence, the state space and action space for an agent i can be represented as S^i and A^i, where each element of S^i is a subset of the corresponding element in S and A = A^1 × A^2 × ··· × A^N, respectively. For each agent i, we provide a set of m_i skills, Π^i = {π^i_1, ..., π^i_{m_i}}. A policy of an agent i is represented as π^i_{c^i_t}(a^i_t | s^i_t) ∈ Π^i, where c^i_t is a skill index, s^i_t ∈ S^i is a state, and a^i_t ∈ A^i is an agent action at time t. An initial state s_0 is sampled from ρ, and then, the N agents take actions a^1_t, a^2_t, ..., a^N_t sampled from a composite policy π(a^1_t, a^2_t, ..., a^N_t | s_t, c^1_t, c^2_t, ..., c^N_t) = (π^1_{c^1_t}(a^1_t | s_t), π^2_{c^2_t}(a^2_t | s_t), ..., π^N_{c^N_t}(a^N_t | s_t)) and receive a single reward r_t. The performance is evaluated based on a discounted return R = ∑_{t=0}^{T−1} γ^t r_t, where T is the episode horizon.

(a) RL (b) MARL (c) Modular (d) RL-SBD (e) MARL-SBD (f) Modular-SBD
Figure 6.3: Different multi-agent architectures. (a) The vanilla RL method considers all agents as a monolithic agent; thus a single policy takes the full observation as input and outputs the full action. (b) The multi-agent RL method (MARL) consists of N policies that operate on the observations and actions of the corresponding agents. (c) The modular network consists of N sets of skills for the N agents trained in isolation and a meta policy that selects a skill for each agent. (d-f) The RL, MARL, and modular network methods augmented with skill behavior diversification (SBD) have a meta policy that outputs a skill behavior embedding vector z for each skill.
6.3.2 Modular Framework
As illustrated in Figure 6.2, our model is composed of two components: a meta policy π_meta and a set of primitive skills of the N agents, Π^1, ..., Π^N. Note that each primitive skill π^i_{c^i} ∈ Π^i contains variants of behaviors parameterized by an N_z-dimensional latent behavior embedding z^i (see Section 6.3.3). The meta policy selects a skill to execute for each agent, rather than selecting one primitive skill for the entire multi-agent system to execute. Also, we give the meta policy the capability to select which variant of the skill to execute (see Section 6.3.4). Then, the chosen primitive skills are simultaneously executed for T_low time steps.
The concurrent execution of multiple skills often leads to undesired results and therefore requires
coordination between the skills. For example, naively placing a block held in the left hand into a container being moved by the right hand can cause a collision between the two robot arms. The arms can avoid
collision while performing the skills by properly adjusting their skill behaviors (e.g. the left arm
leaning to the left side while placing the block and the right arm leaning to the right side while
pushing the container) as shown in Figure 6.1. In our method, the meta policy learns to coordinate
multiple agents’ skills by manipulating the behavior embeddings (i.e. selecting a proper behavior
from diverse behaviors of each skill).
6.3.3 Training Agent-Specific Primitive Skills with Diverse Behaviors
To adjust a primitive skill to collaborate with other agents’ skills in a new environment, the skill
needs to support variations of skill behaviors when executed at a given state. Moreover, a behavioral
variation of a skill should be controllable by the meta policy for skill coordination. In order to make
our primitive skill policies generate diverse behaviors controlled by a latent vector z, we leverage
the entropy and mutual information maximization objective introduced in Eysenbach et al. (2019).
More specifically, a primitive policy of an agent i outputs an action a ∈ A conditioned on the current state s ∈ S and a latent behavior embedding z ∼ p(z), where the prior distribution p(z) is Gaussian (we omit the agent index i in this section for simplicity of notation). Diverse behaviors conditioned on a random sample z can be achieved by maximizing the mutual information between behaviors and states, MI(s, z), while minimizing the mutual information between behaviors and actions given the state, MI(a, z|s), together with maximizing the entropy of the policy H(a|s) to encourage diverse behaviors. The objective can be written as follows (we refer the readers to Eysenbach et al. (2019) for the derivation):

F(θ) = MI(s, z) − MI(a, z|s) + H(a|s) = H(a|s, z) − H(z|s) + H(z)    (6.1)
     = H(a|s, z) + E_{z∼p(z), s∼π(z)}[log p(z|s)] − E_{z∼p(z)}[log p(z)]    (6.2)
     ≥ H(a|s, z) + E_{z∼p(z), s∼π(z)}[log q_φ(z|s) − log p(z)],    (6.3)

where the learned discriminator q_φ(z|s) approximates the posterior p(z|s).

To achieve a primitive skill with diverse behaviors, we augment the environment reward with (6.3):

r_t + λ_1 H(a|s, z) + λ_2 E_{z∼p(z), s∼π(z)}[log q_φ(z|s) − log p(z)],    (6.4)

where λ_1 is the entropy coefficient and λ_2 is the diversity coefficient, which corresponds to the identifiability of behaviors. Maximizing (6.3) encourages multi-modal exploration strategies while maximizing the reward r_t forces the policy to achieve its own goal. Moreover, by maximizing the identifiability of behaviors, the latent vector z, named the behavior embedding, can represent a variation of the learned policy and thus can be used to control the behavior of the policy. For example, when training a robot to move an object, a policy learns to move the object quickly as well as slowly, and these diverse behaviors map to different latent vectors z. We empirically show in our experiments that the policies with diverse behaviors achieve better compositionality with other agents.
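To illustrate how the objective in Equation (6.4) can be added to the per-step reward, the sketch below scores the behavior embedding with a Gaussian posterior network q_φ(z|s); the network architecture, the PyTorch framing, and delegating the entropy term to SAC are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class BehaviorPosterior(nn.Module):
    # Discriminator q_phi(z | s) with a Gaussian head, used to measure how
    # identifiable the behavior embedding z is from the resulting state s.
    def __init__(self, state_dim, z_dim, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def log_prob(self, state, z):
        h = self.trunk(state)
        return Normal(self.mean(h), self.log_std(h).exp()).log_prob(z).sum(-1)

def sbd_reward(r_env, q_phi, next_state, z, prior, lam2=0.1):
    # Diversity-augmented reward of Eq. (6.4). The entropy term lam1 * H(a|s,z)
    # is omitted here because SAC already maximizes policy entropy.
    bonus = q_phi.log_prob(next_state, z) - prior.log_prob(z).sum(-1)
    return r_env + lam2 * bonus
```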
6.3.4 Composing Primitive Skills with Meta Policy
We denote the meta policy as π_meta(c^1, ..., c^N, z^1, ..., z^N | s_t), where c^i ∈ [1, m_i] represents a skill index of an agent i ∈ [1, N] and z^i ∈ R^{N_z} represents a behavior embedding of the skill. Every T_low time steps, the meta policy chooses one primitive skill π^i_{c^i} ∈ Π^i for each agent i. Also, the meta policy outputs a set of latent behavior embeddings (z^1, z^2, ..., z^N) and feeds them to the corresponding skills (i.e. π^i_{c^i}(a^i | s^i, z^i) for agent i). Once a set of primitive skills {π^1_{c^1}, ..., π^N_{c^N}} are chosen to be executed, each primitive skill generates an action a^i ∼ π^i_{c^i}(a^i | s^i, z^i) based on the current state s^i and the latent vector z^i. Algorithm 2 illustrates the overall rollout process.

Since there are a finite number of skills for each agent to execute, the meta action space for each agent, [1, m_i], is discrete, while the behavior embedding space for each agent, R^{N_z}, is continuous. Thus, the meta policy is modeled as a (2 × N)-head neural network where the first N heads represent m_i-way categorical distributions for skill selection and the last N heads represent N_z-dimensional Gaussian distributions for behavior control of the chosen skill.
Algorithm 2 ROLLOUT
1: Input: Meta policy π_meta, sets of primitive policies Π^1, ..., Π^N, and meta horizon T_low
2: Initialize an episode t ← 0 and receive initial state s_0
3: while episode is not terminated do
4:     Sample skill indexes and behavior embeddings (c^1_t, ..., c^N_t), (z^1_t, ..., z^N_t) ∼ π_meta(s_t)
5:     τ ← 0
6:     while τ < T_low and episode is not terminated do
7:         a_{t+τ} = (a^1_{t+τ}, ..., a^N_{t+τ}) ∼ (π^1_{c^1_t}(s_{t+τ}, z^1_t), ..., π^N_{c^N_t}(s_{t+τ}, z^N_t))
8:         s_{t+τ+1}, r_{t+τ} ← ENV(s_{t+τ}, a_{t+τ})
9:         τ ← τ + 1
10:    end while
11:    Add a transition (s_t, (c^1_t, ..., c^N_t), (z^1_t, ..., z^N_t), s_{t+τ}, r_{t:t+τ−1}) to the rollout buffer B
12:    t ← t + τ
13: end while
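In Python, the rollout procedure might be rendered as follows; the gym-style environment interface, the per-agent skill lookup, and the transition format are simplifying assumptions for the sketch.

```python
def hierarchical_rollout(meta_policy, skill_sets, env, t_low):
    # skill_sets[i][c] is primitive skill c of agent i; the meta policy returns
    # one skill index and one behavior embedding per agent (Algorithm 2).
    transitions, obs, done = [], env.reset(), False
    while not done:
        skill_ids, zs = meta_policy(obs)  # (c^1, ..., c^N), (z^1, ..., z^N)
        start_obs, meta_reward, tau = obs, 0.0, 0
        while tau < t_low and not done:
            # Each chosen primitive acts on the current observation and its
            # behavior embedding; all agents' actions are executed concurrently.
            actions = [skill_sets[i][c](obs, z)
                       for i, (c, z) in enumerate(zip(skill_ids, zs))]
            obs, reward, done, _ = env.step(actions)
            meta_reward += reward
            tau += 1
        transitions.append((start_obs, skill_ids, zs, obs, meta_reward))
    return transitions
```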
6.3.5 Implementation
We model the primitive policies and posterior distributions q_φ as neural networks. We train the primitive policies using soft actor-critic (Haarnoja et al., 2018b). When we train a primitive policy, we use a unit Gaussian distribution as the prior distribution of latent variables p(z). We use 5 as the size of the latent behavior embedding, N_z. Each primitive policy outputs the mean and standard deviation of a Gaussian distribution over the action space. For a primitive policy, we apply a tanh activation to normalize the action between [−1, 1]. We model the meta policy as a neural network with multiple heads that output the skill index c^i and behavior embedding z^i for each agent. The meta policy is trained using PPO (Schulman et al., 2017; 2016; Dhariwal et al., 2017). All policy networks in this chapter consist of 3 fully connected layers of 64 hidden units with ReLU nonlinearities. The discriminator q_φ in (6.4) is a 2-layer fully connected network with 64 hidden units.
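A minimal sketch of the multi-head meta policy described above is given below; the state-independent log-standard deviation for the Gaussian heads and the exact head layout are assumptions, not the precise architecture used in the experiments.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class MetaPolicy(nn.Module):
    # (2 x N)-head meta policy: for each agent, a categorical head over its m_i
    # primitive skills and a Gaussian head over the N_z-dim behavior embedding.
    def __init__(self, obs_dim, skills_per_agent, z_dim=5, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.skill_heads = nn.ModuleList([nn.Linear(hidden, m) for m in skills_per_agent])
        self.z_heads = nn.ModuleList([nn.Linear(hidden, z_dim) for _ in skills_per_agent])
        self.z_log_std = nn.Parameter(torch.zeros(len(skills_per_agent), z_dim))

    def forward(self, obs):
        h = self.body(obs)
        skill_ids, zs = [], []
        for i, (skill_head, z_head) in enumerate(zip(self.skill_heads, self.z_heads)):
            skill_ids.append(Categorical(logits=skill_head(h)).sample())
            zs.append(Normal(z_head(h), self.z_log_std[i].exp()).sample())
        return skill_ids, zs
```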
6.4 Experiments
To demonstrate the effectiveness of our framework, we compare our method to prior methods in
the field of multi-agent RL and ablate the components of our framework to understand their impor-
tance. We conducted experiments on a set of challenging robot control environments that require
(a) JACO PICK-PUSH-PLACE (b) JACO BAR-MOVING (c) ANT PUSH
Figure 6.4: The composite tasks pose a challenging combination of object manipulation and
locomotion skills, which requires coordination of multiple agents and temporally extended behaviors.
(a) The left Jaco arm needs to pick up a block while the right Jaco arm pushes a container, and then
it places the block into the container. (b) Two Jaco arms are required to pick and place a bar-shaped
block together. (c) Two ants push the red box to the goal location (green circle) together.
coordination of different agents to complete collaborative robotic manipulation and locomotion
tasks.
Through our experiments, we aim to answer the following questions: (1) can our framework
efficiently learn to combine primitive skills to execute a complicated task; (2) can our learned
agent exhibit collaborative behaviors during task execution; and (3) can our framework leverage the
controllable behavior variations of the primitive skills to achieve better coordination?
For details about environments and training, please refer to the supplementary material. As the
performance of training algorithms varies between runs, we train each method on each task with 6
different random seeds and report mean and standard deviation of each method’s success rate.
6.4.1 Baselines
We compare the performance of our method with various single- and multi-agent RL methods
illustrated in Figure 6.3:
Single-agent RL (RL) A vanilla RL method where a single policy takes as input the full observa-
tion and outputs all agents’ actions.
Multi-agent RL (MARL) A multi-agent RL method where each of N policies takes as input the
observation of the corresponding agent and outputs an action for that agent. All policies share the
global critic learned from a single task reward (Lowe et al., 2017).
Modular Framework (Modular) A modular framework composed of a meta policy and N sets
of primitive skills (i.e. one or more primitive skills per agent). Every T_low time steps, the meta policy selects a primitive skill for each agent based on the full observation. Then, the chosen skills are executed for T_low time steps.
Single-agent RL with Skill Behavior Diversification (RL-SBD) An RL method augmented
with the behavior diversification objective. A meta policy is employed to generate a behavior
embedding for a low-level policy, and the low-level policy outputs all agents’ actions conditioned
on the behavior embedding and the full observation for T_low time steps. The meta policy and the
low-level policy are jointly trained with the behavior diversification objective described in (6.4).
Multi-agent RL with Skill Behavior Diversification (MARL-SBD) A MARL method aug-
mented with the behavior diversification objective. A meta policy generates N behavior embeddings.
Then, each low-level policy outputs each agent’s action conditioned on its observation and behavior
embedding for T_low time steps. All policies are jointly trained to maximize (6.4).
Modular Framework with Skill Behavior Diversification (Modular-SBD, Ours) Our method
which coordinates primitive skills of multiple agents. The modular framework consists of a meta
policy and N sets of primitive skills, where each primitive skill is conditioned on a behavior
embedding z. The meta policy takes as input the full observation and selects both a primitive skill
and a behavior embedding for each agent. Then, each primitive skill outputs an action for each agent.
(a) JACO PICK-PUSH-PLACE (b) JACO BAR-MOVING (c) ANT PUSH
Figure 6.5: Success rates of our method (Modular-SBD) and baselines. For modular frameworks
(Modular and Modular-SBD), we shift the learning curves rightwards by the total number of environ-
ment steps the agent takes to learn the primitive skills (0.9 M, 1.2 M, and 2.0 M, respectively). Our
method substantially improves learning speed and performance on JACO PICK-PUSH-PLACE and
ANT PUSH. The shaded areas represent the standard deviation of results from six different seeds.
The curves are smoothed using moving average over 10 runs.
6.4.2 Jaco Pick-Push-Place
We developed JACO PICK-PUSH-PLACE and JACO BAR-MOVING environments using two
Kinova Jaco arms, where each Jaco arm is a 9 DoF robotic arm with 3 fingers. JACO PICK-PUSH-
PLACE starts with a block on the left and a container on the right. The robotic arms need to pick up
the block, push the container to the center, and place the block inside the container. For successful
completion of the task, the two Jaco arms have to concurrently execute their distinct sets of skills
and dynamically adjust their picking, pushing, and placing directions to avoid collision between
arms.
Primitive skills There are three primitive skills available to each arm: Picking up, Pushing, and
Placing to center (see Figure 6.4a). Picking up requires a robotic arm to pick up a small block,
which is randomly placed on the table. If the block is not picked up after a certain amount of time or
the arm drops the block, the agent fails. Pushing learns to push a big container to its opposite side (e.g. from the left to the center or from the right to the center). The agent fails if it cannot place the container at the center. Placing to center requires placing an object held in the gripper onto the table. The agent only
succeeds when it stably places the object at the desired location on the container.
Composite task Our method (Modular-SBD) can successfully perform JACO PICK-PUSH-PLACE
task while all baselines fail to compose primitive skills as shown in Figure 6.5a. The RL and
MARL baselines cannot learn the composite task, mainly because the agent is required to learn a combinatorial number of skill compositions and to solve the credit assignment problem across
multiple agents. Since the composite task requires multiple primitive skills of multiple agents to be
performed properly at the same time, a reward signal about a failure case cannot be assigned to the
correct agent or skill. By using pre-trained primitive skills, the credit assignment problem is relaxed
and all agents can perform their skills concurrently. Therefore, the Modular baseline learns to
achieve success but shows significantly lower performance than our method (Modular-SBD). This is
because the lack of skill behavior diversification makes it impossible to adjust pushing and placing
trajectories at skill composition time, resulting in frequent end-effector collisions.
6.4.3 Jaco Bar-Moving
In JACO BAR-MOVING, two Jaco arms need to pick up a long bar together, move the bar towards
a target location while maintaining its rotation, and place it on the table (see Figure 6.4b). The initial
position of the bar is randomly initialized every episode and an agent needs to find appropriate
coordination between two arms for each initialization. Compared to JACO PICK-PUSH-PLACE,
this task requires that the two arms synchronize their movements and perform more micro-level
adjustments to their behaviors.
Primitive skills There are two pre-trained primitive skills available to each arm: Picking up and Placing towards arm. Picking up is the same as described in Section 6.4.2. Placing towards arm learns
to move a small block (half size of the block used in the composite task) in the hand towards the
robotic arm and then place it on the table. The agent fails if it cannot place the block at the target
location.
Composite task The JACO BAR-MOVING task requires the two arms to work very closely together.
For example, the Picking up skill of both arms should be synchronized when they start to lift the
bar, and the two arms are required to lift the bar while maintaining the relative position between them since
they are connected by holding the bar. The modular framework without explicit coordination of
skills (Modular) can synchronize the execution of picking, moving, and placing. But the inability
to micro-adjust the movement of the other arm causes instability of bar picking and moving. This
results in degraded success rates compared to the modular framework with explicit coordination.
Meanwhile, all baselines without pre-defined primitive skills fail to learn JACO BAR-MOVING.
6.4.4 Ant Push
We developed a multi-ant environment, ANT PUSH, inspired by Nachum et al. (2019a),
simulated in the MuJoCo (Todorov et al., 2012) physics engine. We use the ant model in OpenAI
Gym (Brockman et al., 2016). In this environment, two ants need to push a large object toward a
green target place, collaborating with each other to keep the angle of the object as stable as possible
(see Figure 6.4c).
Primitive skills We train walking skills of an ant agent in 4 directions: up, down, left, and right.
During primitive skill training, a block (half size of the block used in the composite task) and an
ant agent are randomly placed. Pushing the block gives an additional reward to the agent, which
prevents the ant from avoiding the block. The learned primitive skills have different speeds and trajectories
conditioned on the latent behavior embedding.
Composite task Our method achieves 32.3% success rate on ANT PUSH task while all baselines
fail to compose primitive skills as shown in Figure 6.5c and Table 6.1. The poor performance
of RL, MARL, RL-SBD, and MARL-SBD baselines shows the difficulty of credit assignment
between agents, which leads to one of the ants moving toward the block and pushing it while the other ant does not move. Moreover, the Modular baseline with primitive skills also fails to learn the pushing
Table 6.1: Success rates for all tasks, comparing our method against baselines. Each entry in the
table represents average success rate and standard deviation over 100 runs. The baselines learning
from scratch fail to learn complex tasks with multiple agents.
Jaco Pick-Push-Place Jaco Bar-Moving Ant Push
RL 0.000± 0.000 0.000± 0.000 0.000± 0.000
MARL 0.000± 0.000 0.000± 0.000 0.000± 0.000
RL-SBD 0.000± 0.000 0.000± 0.000 0.000± 0.000
MARL-SBD 0.000± 0.000 0.000± 0.000 0.000± 0.000
Modular 0.324± 0.468 0.917± 0.276 0.003± 0.058
Modular-SBD (Ours) 0.902± 0.298 0.950± 0.218 0.323± 0.468
task. This result illustrates the importance of coordination between agents, which helps synchronize and control the velocities of both ant agents to push the block toward the goal position while
maintaining its rotation.
6.4.5 Effect of Diversity of Primitive Skills
To analyze the effect of the diversity of primitive skills, we compare our model with primitive skills trained with different diversity coefficients λ_2 = {0.0, 0.05, 0.1, 0.5, 1.0} in Equation (6.4) on ANT PUSH. Figure 6.6 shows that with small diversity coefficients λ_2 = {0.05, 0.1}, the agent can control detailed behaviors of primitive skills, while primitive skills without diversity (λ_2 = 0) cannot be coordinated. The meta policy tries to synchronize the two ant agents’ positions and velocities by switching primitive skills, but it cannot achieve proper coordination without diversified skills. On the other hand, large diversity coefficients λ_2 = {0.5, 1.0} make the primitive skills often focus on demonstrating diverse behaviors and fail to achieve the goals of the skills. Hence, these primitive skills do not have enough functionality to solve the target task. The diversity coefficient needs to be carefully chosen to acquire primitive skills with good performance as well as diverse behaviors.
6.4.6 Effect of Skill Selection Interval T_low
To analyze the effect of the skill selection interval hyperparameter T_low, we compare our method trained with T_low = {1, 2, 3, 5, 10} on the Jaco environments. The success rate curves in Figure 6.7
(a) Success rate (b) Reward
Figure 6.6: Learning curves of our method with different diversity coefficients λ_2 on ANT PUSH.

(a) PICK-PUSH-PLACE (b) BAR-MOVING
Figure 6.7: Success rates of our method with different T_low coefficients on Jaco environments.
demonstrate that smaller T_low values in the range [1, 3] lead to better performance. This can be because the agent can realize more flexible skill coordination by adjusting the behavior embedding frequently. In addition to the fixed T_low values, we also consider a variation of our method in which the skill behavior embedding is only sampled when the meta policy updates its skill selection. Concretely, we set the value of T_low to 1 but update (z^1_t, ..., z^N_t) only if (c^1_t, ..., c^N_t) ≠ (c^1_{t−1}, ..., c^N_{t−1}). We observe that in this setting, the meta policy at times switches back and forth between two skills in two consecutive time steps, leading to slightly worse performance compared to our method with small T_low values. This indicates that the meta policy needs to adjust the behavior embedding in order to optimally coordinate skills of the different agents.
6.5 Discussion
In this chapter, we propose a modular framework with skill coordination to tackle the challenges of composing sub-skills across multiple agents. Specifically, we use entropy maximization with
mutual information maximization to train controllable primitive skills with diverse behaviors. To
coordinate learned primitive skills, the meta policy predicts not only the skill to execute for each
agent (end-effector) but also the behavior embedding that controls the chosen primitive skill’s
behavior. The experimental results on robotic manipulation and locomotion tasks demonstrate
that the proposed framework is able to efficiently learn primitive skills with diverse behaviors and
coordinate multiple agents (end-effectors) to solve challenging cooperative control tasks. Acquiring
skills without supervision and extending this method to a visual domain are exciting directions for
future work.
Part III
Accelerating Reinforcement Learning with Learned Skills
Chapter 7
Reinforcement Learning with Learned Skills
7.1 Introduction
Intelligent agents are able to utilize a large pool of prior experience to efficiently learn how to
solve new tasks (Woodworth and Thorndike, 1901). In contrast, reinforcement learning (RL) agents
typically learn each new task from scratch, without leveraging prior experience. Consequently,
agents need to collect a large amount of experience while learning the target task, which is expensive,
especially in the real world. On the other hand, there is an abundance of collected agent experience
available in domains like autonomous driving (Caesar et al., 2020), indoor navigation (Mo et al.,
2018), or robotic manipulation (Dasari et al., 2019; Cabi et al., 2019). With the widespread
deployment of robots on streets or in warehouses, the available amount of data will further increase
in the future. However, the majority of this data is unstructured, without clear task or reward
definitions, making it difficult to use for learning new tasks. In this chapter, our aim is to devise
a scalable approach for leveraging such unstructured experience to accelerate the learning of new
downstream tasks.
One flexible way to utilize unstructured prior experience is by extracting skills, temporally
extended actions that represent useful behaviors, which can be repurposed to solve downstream
tasks. Skills can be learned from data without any task or reward information and can be transferred
to new tasks and even new environment configurations. Prior work has learned skill libraries from
data collected by humans (Schaal, 2006; Merel et al., 2019b; 2020; Shankar et al., 2020; Lynch
et al., 2020) or by agents autonomously exploring the world (Hausman et al., 2018; Sharma et al.,
2020). To solve a downstream task using the learned skills, these approaches train a high-level
policy whose action space is the set of extracted skills. The dimensionality of this action space
scales with the number of skills. Thus, the large skill libraries extracted from rich datasets can,
somewhat paradoxically, lead to worse learning efficiency on the downstream task, since the agent
needs to collect large amounts of experience to perform the necessary exploration in the space of
skills (Jong et al., 2008).
The key idea of this chapter is to learn a prior over skills along with the skill library to guide
exploration in skill space and enable efficient downstream learning, even with large skill spaces.
Intuitively, the prior over skills is not uniform: if the agent holds the handle of a kettle, it is more
promising to explore a pick-up skill than a sweeping skill (see Figure 7.1). To implement this idea,
we design a stochastic latent variable model that learns a continuous embedding space of skills
and a prior distribution over these skills from unstructured agent experience. We then show how
to naturally incorporate the learned skill prior into maximum-entropy RL algorithms for efficient
learning of downstream tasks. To validate the effectiveness of our approach, SPiRL (Skill-Prior
RL), we apply it to complex, long-horizon navigation and robot manipulation tasks. We show that
through the transfer of skills we can use unstructured experience for accelerated learning of new
downstream tasks and that learned skill priors are essential to efficiently utilize rich experience
datasets.
In summary, the contributions of this chapter are threefold: (1) we design a model for jointly
learning an embedding space of skills and a prior over skills from unstructured data, (2) we extend
maximum-entropy RL to incorporate learned skill priors for downstream task learning, and (3) we
show that learned skill priors accelerate learning of new tasks across three simulated navigation and
robot manipulation tasks.
Figure 7.1: Intelligent agents can use a large library of acquired skills when learning new tasks.
Instead of exploring skills uniformly, they can leverage priors over skills as guidance, based, e.g.,
on the current environment state. Such priors capture which skills are promising to explore, like
moving a kettle when it is already grasped, and which are less likely to lead to task success, like
attempting to open an already opened microwave. In this chapter, we propose to jointly learn an
embedding space of skills and a prior over skills from unstructured data to accelerate the learning of
new tasks.
7.2 Related Work
The goal of our work is to leverage prior experience for accelerated learning of downstream
tasks. Meta-learning approaches Finn et al. (2017); Rakelly et al. (2019) similarly aim to extract
useful priors from previous experience to improve the learning efficiency for unseen tasks. However,
they require a defined set of training tasks and online data collection during pre-training and
therefore cannot leverage large offline datasets. In contrast, our model learns skills fully offline
from unstructured data.
Approaches that operate on such offline data are able to leverage large existing datasets Dasari
et al. (2019); Cabi et al. (2019) and can be applied to domains where data collection is particularly
costly or safety critical Levine et al. (2020). A number of works have recently explored the offline
reinforcement learning setting Levine et al. (2020); Fujimoto et al. (2019); Kumar et al. (2019);
Wu et al. (2019), in which a task needs to be learned purely from logged agent experience without
any environment interactions. It has also been shown how offline RL can be used to accelerate
online RL Nair et al. (2020). However, these approaches require the experience to be annotated with
rewards for the downstream task, which are challenging to provide for large, real-world datasets,
especially when the experience is collected across a wide range of tasks. Our approach based on
skill extraction, on the other hand, does not require any reward annotation on the offline experience
data and, once extracted, skills can be reused for learning a wide range of downstream tasks.
More generally, the problem of inter-task transfer has been studied for a long time in the RL
community Taylor and Stone (2009). The idea of transferring skills between tasks dates back
at least to the SKILLS Thrun and Schwartz (1995) and PolicyBlocks Pickett and Barto (2002)
algorithms. Learned skills can be represented as sub-policies in the form of options Sutton et al.
(1999); Bacon et al. (2017), as subgoal setter and reacher functions Gupta et al. (2019); Mandlekar
et al. (2020a) or as discrete primitive libraries (Schaal, 2006; Lee et al., 2019b). Recently, a number
of works have explored the embedding of skills into a continuous skill space via stochastic latent
variable models (Hausman et al., 2018; Merel et al., 2019b; Kipf et al., 2019; Merel et al., 2020;
Shankar et al., 2020; Lynch et al., 2020). When using powerful latent variable models, these
approaches are able to represent a very large number of skills in a compact embedding space.
However, the exploration of such a rich skill embedding space can be challenging, leading to
inefficient downstream task learning Jong et al. (2008). Our work introduces a learned skill prior to
guide the exploration of the skill embedding space, enabling efficient learning on rich skill spaces.
Learned behavior priors are commonly used to guide task learning in offline RL approaches Fu-
jimoto et al. (2019); Wu et al. (2019) in order to avoid value overestimation for actions outside of the
training data distribution. Recently, action priors have been used to leverage offline experience for
learning downstream tasks Siegel et al. (2020). Crucially, our approach learns priors over temporally
extended actions (i.e., skills) allowing it to scale to complex, long-horizon downstream tasks.
7.3 Approach
Our goal is to leverage skills extracted from large, unstructured datasets to accelerate the learning
of new tasks. Scaling skill transfer to large datasets is challenging, since learning the downstream
task requires picking the appropriate skills from an increasingly large library of extracted skills. In
this chapter, we propose to use learned skill priors to guide exploration in skill space and allow for
efficient skill transfer from large datasets. We decompose the problem of prior-guided skill transfer
into two sub-problems: (1) the extraction of skill embedding and skill prior from offline data, and
(2) the prior-guided learning of downstream tasks with a hierarchical policy.
7.3.1 Problem Formulation
We assume access to a dataset $\mathcal{D}$ of pre-recorded agent experience in the form of state-action trajectories $\tau^i = \{(s_0, a_0), \dots, (s_{T_i}, a_{T_i})\}$. This data can be collected using previously trained agents
across a diverse set of tasks (Fu et al., 2020), through agents autonomously exploring their en-
vironment (Hausman et al., 2018; Sharma et al., 2020), via human teleoperation (Schaal et al.,
2005; Gupta et al., 2019; Mandlekar et al., 2018; Lynch et al., 2020) or any combination of these.
Crucially, we aim to leverage unstructured data that does not have annotations of tasks or sub-skills
and does not contain reward information to allow for scalable data collection on real world systems.
In contrast to imitation learning problems we do not assume that the training data contains complete
solutions for the downstream task. Hence, we focus on transferring skills to new problems.
The downstream learning problem is formulated as a Markov decision process (MDP) defined by
a tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \rho, \gamma\}$ of states, actions, transition probability, reward, initial state distribution, and discount factor. We aim to learn a policy $\pi_\theta(a|s)$ with parameters $\theta$ that maximizes the discounted sum of rewards $J(\theta) = \mathbb{E}_\pi\big[\sum_{t=0}^{T-1} \gamma^t r_t\big]$, where $T$ is the episode horizon.
7.3.2 Learning Continuous Skill Embedding and Skill Prior
We define a skill $\mathbf{a}^i$ as a sequence of actions $\{a_t^i, \dots, a_{t+H-1}^i\}$ with fixed horizon $H$. Using fixed-
length skills allows for scalable skill learning and has proven to be effective in prior works (Merel
et al., 2019b; 2020; Gupta et al., 2019; Mandlekar et al., 2020a). Other work has proposed to learn
semantic skills of flexible length (Kipf et al., 2019; Shankar et al., 2020; Pertsch et al., 2020b) and
our model can be extended to include similar approaches, but we leave this for future work.
Figure 7.2: Deep latent variable model for joint learning of skill embedding and skill prior. Given a state-action trajectory from the dataset, the skill encoder maps the action sequence to a posterior distribution $q(z|\mathbf{a}^i)$ over latent skill embeddings. The action trajectory gets reconstructed by passing a sample from the posterior through the skill decoder. The skill prior maps the current environment state to a prior distribution $p_{\mathbf{a}}(z|s_1)$ over skill embeddings. Colorful arrows indicate the propagation of gradients from the reconstruction, regularization, and prior training objectives.
To learn a low-dimensional skill embedding space $\mathcal{Z}$, we train a stochastic latent variable model $p(\mathbf{a}^i|z)$ of skills using the offline dataset (see Figure 7.2). We randomly sample $H$-step trajectories from the training sequences and maximize the following evidence lower bound (ELBO):
$$\log p(\mathbf{a}^i) \geq \mathbb{E}_q \Big[ \underbrace{\log p(\mathbf{a}^i|z)}_{\text{reconstruction}} - \beta \big( \underbrace{\log q(z|\mathbf{a}^i) - \log p(z)}_{\text{regularization}} \big) \Big]. \tag{7.1}$$
Here, $\beta$ is a parameter that is commonly used to tune the weight of the regularization term (Higgins et al., 2017). We optimize this objective using amortized variational inference with an inference network $q(z|\mathbf{a}^i)$ (Kingma and Welling, 2014). To learn a rich skill embedding space, we implement the skill encoder $q(z|\mathbf{a}^i)$ and decoder $p(\mathbf{a}^i|z)$ as deep neural networks that output the parameters of the Gaussian posterior and output distributions. The prior $p(z)$ is set to be a unit Gaussian $\mathcal{N}(0, I)$. Once trained, we can sample skills from our model by sampling latent variables $z \sim \mathcal{N}(0, I)$ and passing them through the decoder $p(\mathbf{a}^i|z)$. In Section 7.3.3 we show how to use this generative model of skills for learning hierarchical RL policies.
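To make the model concrete, the following is a minimal sketch of such a skill embedding model, assuming PyTorch. The layer sizes, the mean-squared-error reconstruction term (corresponding to a fixed-variance Gaussian decoder), and all names are illustrative assumptions rather than the exact architecture used in the experiments.

```python
# Minimal sketch of the skill embedding model of Eq. 7.1 (illustrative, not the exact implementation).
import torch
import torch.nn as nn

class SkillVAE(nn.Module):
    def __init__(self, action_dim, horizon, latent_dim=10, hidden=128):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        # Skill encoder q(z|a^i): maps an H-step action sequence to a Gaussian posterior.
        self.encoder = nn.Sequential(
            nn.Linear(horizon * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))          # outputs mean and log-variance
        # Skill decoder p(a^i|z): reconstructs the action sequence from the latent skill.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim))

    def forward(self, actions):                          # actions: [B, H, action_dim]
        flat = actions.flatten(1)
        mu, log_var = self.encoder(flat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization trick
        recon = self.decoder(z).view_as(actions)
        return recon, mu, log_var

def elbo_loss(model, actions, beta=5e-4):
    """Negative ELBO of Eq. 7.1: reconstruction plus beta-weighted KL to the unit Gaussian."""
    recon, mu, log_var = model(actions)
    rec = ((recon - actions) ** 2).sum(dim=[1, 2]).mean()       # Gaussian log-likelihood up to a constant
    kl = (-0.5 * (1 + log_var - mu.pow(2) - log_var.exp())).sum(-1).mean()
    return rec + beta * kl
```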
To better guide downstream learning, we learn a prior over skills along with the skill embedding model. We therefore introduce another component in our model: the skill prior $p_{\mathbf{a}}(z|\cdot)$. The conditioning of this skill prior can be adjusted to the environment and task at hand, but should be informative about the set of skills that are meaningful to explore in a given situation. Possible choices include the embedding of the last executed skill $z_{t-1}$ or the current environment state $s_t$. In this chapter we focus on learning a state-conditioned skill prior $p_{\mathbf{a}}(z|s_t)$. Intuitively, the current state should provide a strong prior over which skills are promising to explore and, importantly, which skills should not be explored in the current situation (see Figure 7.1).
To train the skill prior we minimize the Kullback-Leibler divergence between the predicted prior and the inferred skill posterior: $\mathbb{E}_{(s, \mathbf{a}^i) \sim \mathcal{D}} \, D_{KL}\big(q(z|\mathbf{a}^i), p_{\mathbf{a}}(z|s_t)\big)$. Using the reverse KL divergence $D_{KL}(q, p)$ instead of $D_{KL}(p, q)$ ensures that the learned prior is mode-covering (Bishop, 2006), i.e., represents all observed skills in the current situation. Instead of training the skill prior after training the skill embedding model, we can jointly optimize both models and ensure stable convergence by stopping gradients from the skill prior objective into the skill encoder. We experimented with different parameterizations for the skill prior distribution, in particular multi-modal distributions such as Gaussian mixture models and normalizing flows (Rezende and Mohamed, 2015; Dinh et al., 2017), but found simple Gaussian skill priors to work equally well in our experiments. For further implementation details, see appendix, Section B.
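A sketch of the corresponding prior-training step is given below, again assuming PyTorch. Detaching the posterior parameters stops gradients from flowing into the skill encoder, as described above; the `SkillPrior` architecture itself is an illustrative assumption.

```python
# Sketch of the joint skill-prior training step (illustrative architecture).
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class SkillPrior(nn.Module):
    """State-conditioned skill prior p_a(z|s_1) parameterized as a diagonal Gaussian."""
    def __init__(self, state_dim, latent_dim=10, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))

    def forward(self, s1):
        mu, log_var = self.net(s1).chunk(2, dim=-1)
        return Normal(mu, (0.5 * log_var).exp())

def prior_loss(prior_net, s1, post_mu, post_log_var):
    """Reverse KL D_KL(q(z|a) || p_a(z|s_1)); the posterior parameters are detached so that
    the prior objective does not backpropagate into the skill encoder."""
    q = Normal(post_mu.detach(), (0.5 * post_log_var.detach()).exp())
    p = prior_net(s1)
    return kl_divergence(q, p).sum(-1).mean()
```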
7.3.3 Skill Prior Regularized Reinforcement Learning
To use the learned skill embedding for downstream task learning, we employ a hierarchical policy learning scheme by using the skill embedding space as the action space of a high-level policy. Concretely, instead of learning a policy over actions $a \in \mathcal{A}$, we learn a policy $\pi_\theta(z|s_t)$ that outputs skill embeddings, which we decode into action sequences using the learned skill decoder
$\{a_t^i, \dots, a_{t+H-1}^i\} \sim p(\mathbf{a}^i|z)$¹. We execute these actions for $H$ steps before sampling the next skill from the high-level policy. This hierarchical structure allows for temporal abstraction, which facilitates long-horizon task learning (Sutton et al., 1999).

Algorithm 3 SPiRL: Skill-Prior RL
1: Inputs: $H$-step reward function $\tilde{r}(s_t, z_t)$, discount $\gamma$, target divergence $\delta$, learning rates $\lambda_\pi, \lambda_Q, \lambda_\alpha$, target update rate $\tau$.
2: Initialize replay buffer $\mathcal{D}$, high-level policy $\pi_\theta(z_t|s_t)$, critic $Q_\phi(s_t, z_t)$, target network $Q_{\bar{\phi}}(s_t, z_t)$
3: for each iteration do
4:   for every $H$ environment steps do
5:     $z_t \sim \pi(z_t|s_t)$   ▷ sample skill from policy
6:     $s_t' \sim p(s_{t+H}|s_t, z_t)$   ▷ execute skill in environment
7:     $\mathcal{D} \leftarrow \mathcal{D} \cup \{s_t, z_t, \tilde{r}(s_t, z_t), s_t'\}$   ▷ store transition in replay buffer
8:   end for
9:   for each gradient step do
10:    $\bar{Q} = \tilde{r}(s_t, z_t) + \gamma\big[Q_{\bar{\phi}}(s_t', \pi_\theta(z_t'|s_t')) - \alpha D_{KL}\big(\pi_\theta(z_t'|s_t'), p_{\mathbf{a}}(z_t'|s_t')\big)\big]$   ▷ compute Q-target
11:    $\theta \leftarrow \theta - \lambda_\pi \nabla_\theta\big[Q_\phi(s_t, \pi_\theta(z_t|s_t)) - \alpha D_{KL}\big(\pi_\theta(z_t|s_t), p_{\mathbf{a}}(z_t|s_t)\big)\big]$   ▷ update policy weights
12:    $\phi \leftarrow \phi - \lambda_Q \nabla_\phi\big[\tfrac{1}{2}\big(Q_\phi(s_t, z_t) - \bar{Q}\big)^2\big]$   ▷ update critic weights
13:    $\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha\big[\alpha \cdot \big(D_{KL}(\pi_\theta(z_t|s_t), p_{\mathbf{a}}(z_t|s_t)) - \delta\big)\big]$   ▷ update alpha
14:    $\bar{\phi} \leftarrow \tau\phi + (1 - \tau)\bar{\phi}$   ▷ update target network weights
15:  end for
16: end for
17: return trained policy $\pi_\theta(z_t|s_t)$
We can cast the problem of learning the high-level policy into a standard MDP by replacing the action space $\mathcal{A}$ with the skill space $\mathcal{Z}$, single-step rewards with $H$-step rewards $\tilde{r} = \sum_{t=1}^{H} r_t$, and single-step transitions with $H$-step transitions $s_{t+H} \sim p(s_{t+H}|s_t, z_t)$. We can then use conventional model-free RL approaches to maximize the return of the high-level policy $\pi_\theta(z|s_t)$.
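The sketch below illustrates this $H$-step formulation as a simple environment wrapper, assuming a gym-style interface where `step(a)` returns `(obs, reward, done, info)`. The wrapper class and the `skill_decoder` callable are illustrative stand-ins, not part of a released implementation; the $H$-step transition $s_{t+H} \sim p(s_{t+H}|s_t, z_t)$ is realized implicitly by executing the decoded skill.

```python
# Illustrative sketch of the H-step "skill MDP" seen by the high-level policy.
class SkillSpaceEnv:
    """Wraps a low-level environment so that one 'step' executes a decoded H-step skill."""
    def __init__(self, env, skill_decoder, horizon):
        self.env, self.decode, self.H = env, skill_decoder, horizon

    def reset(self):
        return self.env.reset()

    def step(self, z):
        actions = self.decode(z)                 # decode the latent skill into H primitive actions
        total_reward, done, obs = 0.0, False, None
        for t in range(self.H):                  # execute the skill and accumulate the H-step reward
            obs, reward, done, info = self.env.step(actions[t])
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, {}
```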
This naive approach struggles when training a policy on a very rich skill space $\mathcal{Z}$ that encodes many different skills. While the nominal dimensionality of the skill space might be small, its continuous nature allows the model to embed an arbitrary number of different behaviors. Therefore, the effective dimensionality of the high-level policy's action space scales with the number of embedded skills. When using a large offline dataset $\mathcal{D}$ with diverse behaviors, the number of embedded skills can grow rapidly, leading to a challenging exploration problem when training the high-level policy. For more efficient exploration, we propose to use the learned skill prior to guide the high-level policy. We will next show how the skill prior can be naturally integrated into maximum-entropy RL algorithms.

¹We also experimented with models that directly condition the decoder on the current state, but found downstream RL to be less stable (see appendix, Section D).
Maximum entropy RL (Ziebart, 2010; Levine, 2018) augments the training objective of the policy with a term that encourages maximization of the policy's entropy along with the return:
$$J(\theta) = \mathbb{E}_\pi \Big[ \sum_{t=1}^{T} \gamma^t r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(a_t|s_t)\big) \Big]. \tag{7.2}$$
The added entropy term is equivalent to the negated KL divergence between the policy and a uniform action prior $U(a_t)$: $\mathcal{H}(\pi(a_t|s_t)) = -\mathbb{E}_\pi\big[\log \pi(a_t|s_t)\big] \propto -D_{KL}\big(\pi(a_t|s_t), U(a_t)\big)$ up to a constant. However, in our case we aim to regularize the policy towards a non-uniform, learned skill prior to guide exploration in skill space. We can therefore replace the entropy term with the negated KL divergence from the learned prior, leading to the following objective for the high-level policy:
$$J(\theta) = \mathbb{E}_\pi \Big[ \sum_{t=1}^{T} \tilde{r}(s_t, z_t) - \alpha D_{KL}\big(\pi(z_t|s_t), p_{\mathbf{a}}(z_t|s_t)\big) \Big]. \tag{7.3}$$
We can modify state-of-the-art maximum-entropy RL algorithms, such as Soft Actor-Critic (SAC (Haarnoja et al., 2018b;c)), to optimize this objective. We summarize our approach in Algorithm 3 with changes to SAC marked in red. For a detailed derivation of the update rules, see appendix, Section A. Analogous to Haarnoja et al. (2018c), we can devise an automatic tuning strategy for the regularization weight $\alpha$ by defining a target divergence parameter $\delta$ (see Algorithm 3, Section A).
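As a concrete illustration, the sketch below shows how the Q-target and policy loss of Algorithm 3 (lines 10 and 11) can be written with PyTorch distributions. The critic, replay buffer, target-network bookkeeping, and $\alpha$ tuning are omitted, and all function names are illustrative assumptions rather than the released implementation.

```python
# Sketch of the prior-regularized SAC updates of Eq. 7.3 / Algorithm 3 (illustrative).
import torch
from torch.distributions import kl_divergence

def q_target(reward, next_policy_dist, next_prior_dist, target_critic, next_states,
             alpha, gamma, done):
    """Q-target of line 10: SAC's entropy bonus is replaced by -alpha * KL to the learned prior."""
    with torch.no_grad():
        z_next = next_policy_dist.sample()
        kl = kl_divergence(next_policy_dist, next_prior_dist).sum(-1)
        return reward + gamma * (1 - done) * (target_critic(next_states, z_next) - alpha * kl)

def actor_loss(policy_dist, prior_dist, critic, states, alpha):
    """Policy update of line 11: maximize Q(s, z) - alpha * KL(pi(.|s) || p_a(.|s))."""
    z = policy_dist.rsample()                            # reparameterized skill sample
    kl = kl_divergence(policy_dist, prior_dist).sum(-1)
    return (alpha * kl - critic(states, z)).mean()
```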
7.4 Experiments
Our experiments are designed to answer the following questions: (1) Can we leverage unstructured datasets to accelerate downstream task learning by transferring skills? (2) Can learned skill priors improve exploration during downstream task learning? (3) Are learned skill priors necessary to scale skill transfer to large datasets?

Figure 7.3: For each environment we collect a diverse dataset from a wide range of training tasks (examples on top) and test skill transfer to more complex target tasks (bottom), in which the agent needs to: navigate a maze (left), stack as many blocks as possible (middle), and manipulate a kitchen setup to reach a target configuration (right). All tasks require the execution of complex, long-horizon behaviors and need to be learned from sparse rewards.
7.4.1 Environments & Comparisons
We evaluate SPiRL on one simulated navigation task and two simulated robotic manipulation
tasks (see Fig. 7.3). For each environment, we collect a large and diverse dataset of agent experience
that allows us to extract a large number of skills. To test our method's ability to transfer to unseen
downstream tasks, we vary task and environment setup between training data collection and
downstream task.
Maze Navigation. A simulated maze navigation environment based on the D4RL maze environ-
ment (Fu et al., 2020). The task is to navigate a point mass agent through a maze between fixed
start and goal locations. We use a planner-based policy to collect 85000 goal-reaching trajectories
in randomly generated, small maze layouts and test generalization to a goal-reaching task in a
randomly generated, larger maze. The state is represented as an RGB top-down view centered around
the agent. For downstream learning the agent only receives a sparse reward when in close vicinity
to the goal. The agent can transfer skills, such as traversing hallways or passing through narrow
doors, but needs to learn to navigate a new maze layout for solving the downstream task.
Block Stacking. The goal of the agent is to stack as many blocks as possible in an environment
with eleven blocks. We collect 37000 training sequences with a noisy, scripted policy that randomly
stacks blocks on top of each other in a smaller environment with only five blocks. The state is
represented as a RGB front view centered around the agent and it receives binary rewards for
picking up and stacking blocks. The agent can transfer skills like picking up, carrying and stacking
blocks, but needs to perform a larger number of consecutive stacks than seen in the training data on
a new environment with more blocks.
Kitchen Environment. A simulated kitchen environment based on Gupta et al. (2019). We
use the training data provided in the D4RL benchmark (Fu et al., 2020), which consists of 400
teleoperated sequences in which the 7-DoF robot arm manipulates different parts of the environment
(e.g. open microwave, switch on stove, slide cabinet door). During downstream learning the agent
needs to execute an unseen sequence of multiple subtasks. It receives a sparse, binary reward for
each successfully completed manipulation. The agent can transfer a rich set of manipulation skills,
but needs to recombine them in new ways to solve the downstream task.
For further details on environment setup, data collection and training, see appendix, Sec-
tion B and Section C.
We compare the downstream task performance of SPiRL to several flat and hierarchical baselines
that test the importance of learned skill embeddings and skill prior:
• Flat Model-Free RL (SAC). Trains an agent from scratch with Soft Actor-Critic (SAC, (Haarnoja
et al., 2018b)). This comparison tests the benefit of leveraging prior experience.
Figure 7.4: Downstream task learning curves for our method and all comparisons on Maze Navigation, Block Stacking, and the Kitchen Environment. Both learned skill embedding and skill prior are essential for downstream task performance: single-action priors without temporal abstraction (Flat Prior) and learned skills without skill prior (SSP w/o Prior) fail to converge to good performance. Shaded areas represent standard deviation across three seeds.
• Behavioral Cloning w/ finetuning (BC + SAC). Trains a supervised behavioral cloning (BC)
policy from the offline data and finetunes it on the downstream task using SAC.
• Flat Behavior Prior (Flat Prior). Learns a single-step action prior on the primitive action
space and uses it to regularize downstream learning as described in Section 7.3.3, similar
to (Siegel et al., 2020). This comparison tests the importance of temporal abstraction through
learned skills.
• Hierarchical Skill-Space Policy (SSP). Trains a high-level policy on the skill-embedding
space of the model described in Section 7.3.2 but without skill prior, representative of (Merel
et al., 2019b; Kipf et al., 2019; Shankar et al., 2020). This comparison tests the importance of
the learned skill prior for downstream task learning.
7.4.2 Maze Navigation
We first evaluate SPiRL on the simulated maze navigation task. This task poses a hard explo-
ration problem since the reward feedback is very sparse: following the D4RL benchmark (Fu et al.,
2020) the agent receives a binary reward only when reaching the goal and therefore needs to explore
large fractions of the maze without reward feedback. We hypothesize that learned skills and a prior
that guides exploration are crucial for successful learning, particularly when external feedback is sparse.

Figure 7.5: Exploration behavior of our method vs. alternative transfer approaches and random action sampling on the downstream maze task. Through learned skill embeddings and skill priors our method can explore the environment more widely. We visualize positions of the agent during 1M steps of exploration rollouts in blue and mark episode start and goal positions in green and red, respectively.
In Figure 7.4 (left) we show that only SPiRL is able to successfully learn a goal-reaching policy
for the maze task; none of the baseline policies reaches the goal during training. To better understand
this result, we compare the exploration behaviors of our approach and the baselines in Figure 7.5:
we collect rollouts by sampling from our skill prior and the single-step action prior and record the
agent’s position in the maze. To visualize the exploration behavior of skill-space policies without
learned priors (“Skills w/o Prior”) we sample skills uniformly from the skill space.
Figure 7.5 shows that only SPiRL is able to explore large parts of the maze, since targeted
sampling of skills from the prior guides the agent to navigate through doorways and traverse
hallways. Random exploration in skill space, in contrast, does not lead to good exploration behavior
since the agent often samples skills that lead to collisions. The comparison to single-step action
priors (“Flat Prior”) shows that temporal abstraction is beneficial for coherent exploration.
Finally, we show reuse of a single skill prior for a variety of downstream goals in appendix,
Section H.
7.4.3 Robotic Manipulation
Next, we investigate the ability of SPiRL to scale to complex, robotic manipulation tasks
in the block stacking problem and in the kitchen environment. For both environments we find
that using learned skill embeddings together with the extracted skill prior is essential to solve
the task (see Figure 7.4, middle and right; appendix Figure E.2 for qualitative policy rollouts).
In contrast, using non-hierarchical action priors (“Flat Prior”) leads to performance similar to
behavioral cloning of the training dataset, but fails to solve longer-horizon tasks. This shows the
benefit of temporal abstraction through skills. The approach leveraging the learned skill space
without guidance from the skill prior (“SSP w/o Prior”) only rarely stacks blocks or successfully
manipulates objects in the kitchen environment. Due to the large number of extracted skills from
the rich training datasets, random exploration in skill space does not lead to efficient downstream
learning. Instead, performance is comparable to or worse than learning from scratch without skill
transfer. This underlines the importance of learned skill priors for scaling skill transfer to large
datasets. Similar to prior work (Gupta et al., 2019), we find that a policy initialized through
behavioral cloning is not amenable to efficient finetuning on complex, long-horizon tasks.
7.4.4 Ablation Studies
Figure 7.6: Ablation analysis of skill horizon $H$ and skill space dimensionality $|\mathcal{Z}|$ on the block stacking task.
We analyze the influence of the skill horizon $H$ and the dimensionality of the learned skill space $|\mathcal{Z}|$ on downstream performance in Figure 7.6. We see that too short skill horizons do not afford sufficient temporal abstraction. Conversely, too long horizons make the skill exploration problem harder, since a larger number of possible skills gets embedded in the skill space. Therefore, the policy converges more slowly.
We find that the dimensionality of the learned skill embedding space needs to be large enough
to represent a sufficient diversity of skills. Beyond that, $|\mathcal{Z}|$ does not have a major influence on the
downstream performance. We attribute this to the usage of the learned skill prior: even though the
nominal dimensionality of the high-level policy’s action space increases, its effective dimensionality
remains unchanged since the skill prior focuses exploration on the relevant parts of the skill space.
We further test the importance of prior initialization and regularization, as well as training priors
from sub-optimal data in appendix, Section E - Section G.
7.5 Discussion
We presented SPiRL, an approach for leveraging large, unstructured datasets to accelerate
downstream learning of unseen tasks. We propose a deep latent variable model that jointly learns an
embedding space of skills and a prior over these skills from offline data. We then extend maximum-
entropy RL algorithms to incorporate both skill embedding and skill prior for efficient downstream
learning. Finally, we evaluate SPiRL on challenging simulated navigation and robotic manipulation
tasks and show that both skill embedding and skill prior are essential for effective transfer from
rich datasets.
Future work can combine learned skill priors with methods for extracting semantic skills
of flexible length from unstructured data (Shankar et al., 2020; Pertsch et al., 2020b). Further,
skill priors are important in safety-critical applications, like autonomous driving, where random
exploration is dangerous. Skill priors learned, e.g., from human demonstrations, can guide exploration
to skills that do not endanger the learner or other agents.
Chapter 8
Demonstration-Guided Reinforcement Learning with Learned
Skills
8.1 Introduction
Figure 8.1: We leverage large, task-agnostic datasets collected across many different tasks for efficient demonstration-guided reinforcement learning by (1) acquiring a rich motor skill repertoire from such offline data and (2) understanding and imitating the demonstrations based on the skill repertoire.
Humans are remarkably efficient at acquir-
ing new skills from demonstrations: often a sin-
gle demonstration of the desired behavior and
a few trials of the task are sufficient to master
it (Bekkering et al., 2000; Al-Abood et al., 2001;
Hodges et al., 2007). To allow for such efficient
learning, we can leverage a large number of
previously learned behaviors (Al-Abood et al.,
2001; Hodges et al., 2007). Instead of imitat-
ing precisely each of the demonstrated muscle
movements, humans can extract the performed
skills and use the rich repertoire of already acquired skills to efficiently reproduce the desired
behavior.
Demonstrations are also commonly used in reinforcement learning (RL) to guide exploration
and improve sample efficiency (Vecerik et al., 2017; Hester et al., 2018; Rajeswaran et al., 2018;
Nair et al., 2018; Zhu et al., 2018). However, such demonstration-guided RL approaches attempt
to learn tasks from scratch: analogous to a human trying to imitate a completely unseen behavior
by following every demonstrated muscle movement, they try to imitate the primitive actions
performed in the provided demonstrations. As with humans, such step-by-step imitation leads to
brittle policies (Ross et al., 2011), and thus these approaches require many demonstrations and
environment interactions to learn a new task.
We propose to improve the efficiency of demonstration-guided RL by leveraging prior experience
in the form of an offline “task-agnostic” experience dataset, collected not on one but across many
tasks (see Figure 8.1). Given such a dataset, we extract reusable skills: robust short-horizon
behaviors that can be recombined to learn new tasks. Like a human imitating complex behaviors via
the chaining of known skills, we can use this repertoire of skills for efficient demonstration-guided
RL on a new task by guiding the policy using the demonstrated skills instead of the primitive actions.
Concretely, we propose Skill-based Learning with Demonstrations (SkiLD), a demonstration-
guided RL algorithm that learns short-horizon skills from offline datasets and leverages them for
following demonstrations of a new task. Across challenging navigation and robotic manipula-
tion tasks this significantly improves the learning efficiency over prior demonstration-guided RL
approaches.
In summary, the contributions of this chapter are threefold: (1) we introduce the problem of
leveraging task-agnostic offline datasets for accelerating demonstration-guided RL on unseen tasks,
(2) we propose SkiLD, a skill-based algorithm for efficient demonstration-guided RL and (3) we
show the effectiveness of SkiLD on a maze navigation and two complex robotic manipulation tasks.
8.2 Related Work
Imitation learning. Learning from Demonstration, also known as imitation learning, is a
common approach for learning complex behaviors by leveraging a set of demonstrations. Most
prior approaches for imitation learning are either based on behavioral cloning (BC, (Pomerleau,
1989)), which uses supervised learning to mimic the demonstrated actions, or inverse reinforcement
learning (IRL, (Abbeel and Ng, 2004; Ho and Ermon, 2016)), which infers a reward from the
demonstrations and then trains a policy to optimize it. However, BC commonly suffers from distri-
bution shift and struggles to learn robust policies (Ross et al., 2011), while IRL’s joint optimization
of reward and policy can result in unstable training.
Demonstration-guided RL. A number of prior works aim to mitigate these problems by
combining reinforcement learning with imitation learning. They can be categorized into three groups:
(1) approaches that use BC to initialize and regularize policies during RL training (Rajeswaran
et al., 2018; Nair et al., 2018), (2) approaches that place the demonstrations in the replay buffer
of an off-policy RL algorithm (Vecerik et al., 2017; Hester et al., 2018), and (3) approaches that
augment the environment rewards with rewards extracted from the demonstrations (Zhu et al., 2018;
Peng et al., 2018; Merel et al., 2017). While these approaches improve the efficiency of RL, they
treat each task as an independent learning problem and thus require many demonstrations to learn
effectively, which is especially expensive since a new set of demonstrations needs to be collected
for every new task.
Online RL with offline datasets. As an alternative to expensive task-specific demonstrations,
multiple recent works have proposed to accelerate reinforcement learning by leveraging task-
agnostic experience in the form of large datasets collected across many tasks (Pertsch et al., 2020a;
Siegel et al., 2020; Nair et al., 2020; Ajay et al., 2021; Singh et al., 2021; 2020). In contrast to
demonstrations, such task-agnostic datasets can be collected cheaply from a variety of sources like
autonomous exploration (Hausman et al., 2018; Sharma et al., 2020) or human tele-operation (Gupta
et al., 2019; Mandlekar et al., 2018; Lynch et al., 2020), but will lead to slower learning than
demonstrations since the data is not specific to the downstream task.
Skill-based RL. One class of approaches for leveraging such offline datasets that is particularly
suited for learning long-horizon behaviors is skill-based RL (Hausman et al., 2018; Merel et al.,
2019b; Kipf et al., 2019; Merel et al., 2020; Shankar et al., 2020; Gupta et al., 2019; Lee et al.,
2020; Lynch et al., 2020; Pertsch et al., 2020b;a). These methods extract reusable skills from
task-agnostic datasets and learn new tasks by recombining them. Yet, such approaches perform
reinforcement learning over the set of extracted skills to learn the downstream task. Although more efficient than RL over primitive actions, they still require many environment interactions to learn long-horizon tasks. In our work we combine the best of both worlds: by using large, task-agnostic datasets and a small number of task-specific demonstrations, we accelerate the learning of long-horizon tasks while reducing the number of required demonstrations.

Figure 8.2: Our approach, SkiLD, combines task-agnostic experience and task-specific demonstrations to efficiently learn target tasks in three steps: (1) extract a skill representation from task-agnostic offline data, (2) learn a task-agnostic skill prior from task-agnostic data and a task-specific skill posterior from demonstrations, and (3) learn a high-level skill policy for the target task using prior knowledge from both task-agnostic offline data and task-specific demonstrations. Left: Skill embedding model with skill extractor (yellow) and closed-loop skill policy (blue). Middle: Training of skill prior (green) from task-agnostic data and skill posterior (purple) from demonstrations. Right: Training of high-level skill policy (red) on a downstream task using the pre-trained skill representation and regularization via the skill prior and posterior, mediated by the demonstration discriminator $D(s)$.
8.3 Approach
Our goal is to use skills extracted from task-agnostic prior experience data to improve the
efficiency of demonstration-guided RL on a new task. We aim to leverage a set of provided
demonstrations by following the performed skills as opposed to the primitive actions. Therefore, we
need a model that can (1) leverage prior data to learn a rich set of skills and (2) identify the skills
performed in the demonstrations in order to follow them. Next, we formally define our problem,
summarize relevant prior work on RL with learned skills and then describe our demonstration-guided
RL approach.
8.3.1 Preliminaries
Problem Formulation. We assume access to two types of datasets: a large task-agnostic offline dataset and a small task-specific demonstration dataset. The task-agnostic dataset $\mathcal{D} = \{s_t, a_t, \dots\}$ consists of trajectories of meaningful agent behaviors, but includes no demonstrations of the target task. We only assume that its trajectories contain short-horizon behaviors that can be reused to solve the target task. Such data can be collected without a particular task in mind using a mix of sources, e.g., via human teleoperation, autonomous exploration, or through policies trained for other tasks. Since it can be used to accelerate many downstream tasks that utilize similar short-term behaviors we call it task-agnostic. In contrast, the task-specific data is a much smaller set of demonstration trajectories $\mathcal{D}_{demo} = \{s_t^d, a_t^d, \dots\}$ that are specific to a single target task.
The downstream learning problem is formulated as a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \rho, \gamma)$ of states, actions, transition probabilities, rewards, initial state distribution, and discount factor. We aim to learn a policy $\pi_\theta(a|s)$ with parameters $\theta$ that maximizes the discounted sum of rewards $J(\theta) = \mathbb{E}_\pi\big[\sum_{t=0}^{T-1} J_t\big] = \mathbb{E}_\pi\big[\sum_{t=0}^{T-1} \gamma^t r_t\big]$, where $T$ is the episode horizon.
Skill Prior RL. Our goal is to extract skills from task-agnostic experience data and reuse them for demonstration-guided RL. Prior work has investigated the reuse of learned skills for accelerating RL (Pertsch et al., 2020a). In this section, we will briefly summarize their proposed approach Skill Prior RL (SPiRL) and then describe how our approach improves upon it in the demonstration-guided RL setting.
SPiRL defines a skill as a sequence of $H$ consecutive actions $\mathbf{a} = \{a_t, \dots, a_{t+H-1}\}$, where the skill horizon $H$ is a hyperparameter. It uses the task-agnostic data to jointly learn (1) a generative model of skills $p(\mathbf{a}|z)$ that decodes latent skill embeddings $z$ into executable action sequences $\mathbf{a}$, and (2) a state-conditioned prior distribution $p(z|s)$ over skill embeddings. For learning a new downstream task, SPiRL trains a high-level skill policy $\pi_\theta(z|s)$ whose outputs get decoded into executable actions using the pre-trained skill decoder. Crucially, the learned skill prior is used to guide the policy during downstream RL by maximizing the following divergence-regularized RL objective:
$$J(\theta) = \mathbb{E}_{\pi_\theta} \Big[ \sum_{t=0}^{T-1} r(s_t, z_t) - \alpha D_{KL}\big(\pi_\theta(z_t|s_t), p(z_t|s_t)\big) \Big]. \tag{8.1}$$
Here, the KL-divergence term ensures that the policy remains close to the learned skill prior, guiding exploration during RL. By combining this guided exploration with temporal abstraction via the learned skills, SPiRL substantially improves the efficiency of RL on long-horizon tasks.
8.3.2 Skill Representation Learning
We leverage SPiRL's skill embedding model for learning our skill representation. We follow prior work on skill-based RL (Lynch et al., 2020; Ajay et al., 2021) and increase the expressiveness of the skill representation by replacing SPiRL's low-level skill decoder $p(\mathbf{a}|z)$ with a closed-loop skill policy $\pi_\phi(a|s, z)$ with parameters $\phi$ that is conditioned on the current environment state. In our experiments we found this closed-loop decoder to improve performance (see Section C for an empirical comparison).
Figure 8.2 (left) summarizes our skill learning model. It consists of two parts: the skill inference network $q_\omega(z|s_{0:H-1}, a_{0:H-2})$ with parameters $\omega$ and the closed-loop skill policy $\pi_\phi(a_t|s_t, z_t)$. Note that in contrast to SPiRL the skill inference network is state-conditioned to account for the state-conditioned low-level policy. During training we randomly sample an $H$-step state-action trajectory from the task-agnostic dataset and pass it to the skill inference network, which predicts the low-dimensional skill embedding $z$. This skill embedding is then input into the low-level policy $\pi_\phi(a_t|s_t, z)$ for every input state. The policy is trained to imitate the given action sequence, thereby learning to reproduce the behaviors encoded by the skill embedding $z$.
The latent skill representation is optimized using variational inference, which leads to the full skill learning objective:
$$\max_{\phi, \omega} \; \mathbb{E}_q \Big[ \underbrace{\textstyle\sum_{t=0}^{H-2} \log \pi_\phi(a_t|s_t, z)}_{\text{behavioral cloning}} - \beta \big( \underbrace{\log q_\omega(z|s_{0:H-1}, a_{0:H-2}) - \log p(z)}_{\text{embedding regularization}} \big) \Big]. \tag{8.2}$$
We use a unit Gaussian prior $p(z)$ and weight the embedding regularization term with a factor $\beta$ (Higgins et al., 2017).
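A minimal PyTorch sketch of this objective is shown below. The network architectures, layer sizes, and the mean-squared-error form of the behavioral cloning term (a fixed-variance Gaussian action distribution) are illustrative assumptions, not the exact implementation used in the experiments.

```python
# Sketch of the closed-loop skill learning objective of Eq. 8.2 (illustrative).
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class SkillInference(nn.Module):
    """q_w(z | s_{0:H-1}, a_{0:H-2}): encodes a state-action sub-trajectory into a Gaussian."""
    def __init__(self, state_dim, action_dim, horizon, latent_dim=10, hidden=128):
        super().__init__()
        in_dim = horizon * state_dim + (horizon - 1) * action_dim
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))

    def forward(self, states, actions):
        x = torch.cat([states.flatten(1), actions.flatten(1)], dim=-1)
        mu, log_var = self.net(x).chunk(2, dim=-1)
        return Normal(mu, (0.5 * log_var).exp())

class LowLevelPolicy(nn.Module):
    """pi_phi(a_t | s_t, z): closed-loop skill policy conditioned on the state and the skill."""
    def __init__(self, state_dim, action_dim, latent_dim=10, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

def skill_learning_loss(q_net, pi_net, states, actions, beta=5e-4):
    """Behavioral cloning of the H-1 actions plus beta-weighted KL to the unit Gaussian prior."""
    q = q_net(states, actions)                     # states: [B, H, Ds], actions: [B, H-1, Da]
    z = q.rsample()
    z_rep = z.unsqueeze(1).expand(-1, actions.shape[1], -1)
    pred = pi_net(states[:, :-1], z_rep)           # predict a_t from (s_t, z) for t = 0..H-2
    bc = ((pred - actions) ** 2).sum(dim=[1, 2]).mean()
    prior = Normal(torch.zeros_like(z), torch.ones_like(z))
    kl = kl_divergence(q, prior).sum(-1).mean()
    return bc + beta * kl
```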
8.3.3 Demonstration-Guided RL with Learned Skills
Figure 8.3: We leverage prior experience data $\mathcal{D}$ and demonstration data $\mathcal{D}_{demo}$. Our policy is guided by the task-specific skill posterior $q_\zeta(z|s)$ within the support of the demonstrations (green) and by the task-agnostic skill prior $p_{\mathbf{a}}(z|s)$ otherwise (red). The agent also receives a reward bonus for reaching states in the demonstration support.
To leverage the learned skills for accelerating
demonstration-guided RL on a new task, we use a hi-
erarchical policy learning scheme (see Figure 8.2, right):
a high-level policy $\pi_\theta(z|s)$ outputs latent skill embeddings
z that get decoded into actions using the pre-trained low-
level skill policy. We freeze the weights of the skill policy
during downstream training for simplicity.
Our goal is to leverage the task-specific demonstra-
tions to guide learning of the high-level policy on the new
task. In Section 8.3.1, we showed how SPiRL (Pertsch
et al., 2020a) leverages a learned skill prior p(z|s) to guide
exploration. However, this prior is task-agnostic, i.e., it
encourages exploration of all skills that are meaningful to be explored, independent of which task
the agent is trying to solve. Even though SPiRL’s objective makes learning with a large number
of skills more efficient, it encourages the policy to explore many skills that are not relevant to the
downstream task.
In this chapter, we propose to extend the skill prior guided approach and leverage target task demonstrations to additionally learn a task-specific skill distribution, which we call the skill posterior $q_\zeta(z|s)$ with parameters $\zeta$ (in contrast to the skill prior it is conditioned on the target task, hence “posterior”). We train this skill posterior by using the pre-trained skill inference model $q_\omega(z|s_{0:H-1}, a_{0:H-2})$ to extract the embeddings for the skills performed in the demonstration sequences (see Figure 8.2, middle):
$$\min_\zeta \; \mathbb{E}_{(s,a) \sim \mathcal{D}_{demo}} \, D_{KL}\big(q_\omega(z|s_{0:H-1}, a_{0:H-2}), q_\zeta(z|s_0)\big). \tag{8.3}$$
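The following sketch illustrates this distillation step, assuming PyTorch and a state-conditioned Gaussian posterior network of the same form as the skill prior sketched earlier; the function and argument names are illustrative.

```python
# Sketch of the skill-posterior training step of Eq. 8.3 (illustrative).
import torch
from torch.distributions import kl_divergence

def posterior_loss(q_inference, q_posterior, demo_states, demo_actions):
    """Distill the frozen inference network q_w into the state-conditioned posterior q_zeta."""
    with torch.no_grad():                                  # the skill inference network stays fixed
        target = q_inference(demo_states, demo_actions)    # q_w(z | s_{0:H-1}, a_{0:H-2})
    pred = q_posterior(demo_states[:, 0])                  # q_zeta(z | s_0)
    return kl_divergence(target, pred).sum(-1).mean()
```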
A naive approach for leveraging the skill posterior is to simply use it to replace the skill prior in Equation 8.1, i.e., to regularize the policy to stay close to the skill posterior in every state. However, the trained skill posterior is only accurate within the demonstration support $\lfloor \mathcal{D}_{demo} \rfloor$, because by definition it was only trained on demonstration sequences. Since $|\mathcal{D}_{demo}| \ll |\mathcal{D}|$ (see Figure 8.3), the skill posterior will often provide incorrect guidance in states outside the demonstrations' support.
Instead, we propose to use a three-part objective that guides the policy to (1) follow the skill posterior within the support of the demonstrations, (2) follow the skill prior outside the demonstration support, and (3) reach states within the demonstration support. To determine whether a given state is within the support of the demonstration data, we train a learned discriminator $D(s)$ as a binary classifier using samples from the demonstration and task-agnostic datasets, respectively.
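A minimal sketch of such a discriminator and its training loss is given below (PyTorch; the architecture and names are illustrative assumptions). States sampled from the demonstrations are labeled 1 and states from the task-agnostic data are labeled 0.

```python
# Sketch of the demonstration-support discriminator D(s) (illustrative).
import torch
import torch.nn as nn

class SupportDiscriminator(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return torch.sigmoid(self.net(s))          # D(s) in (0, 1)

def discriminator_loss(disc, demo_states, offline_states):
    """Standard binary cross-entropy: demo states -> 1, task-agnostic states -> 0."""
    bce = nn.BCELoss()
    d_demo, d_off = disc(demo_states), disc(offline_states)
    return bce(d_demo, torch.ones_like(d_demo)) + bce(d_off, torch.zeros_like(d_off))
```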
In summary, our algorithm pre-trains the following components: (1) the low-level skill policy $\pi_\phi(a|s,z)$, (2) the task-agnostic skill prior $p(z|s)$, (3) the task-specific skill posterior $q_\zeta(z|s)$, and (4) the learned discriminator $D(s)$. Only the latter two need to be re-trained for a new target task.
Once all components are pre-trained, we use the discriminator's output to weight terms in our objective that regularize the high-level policy $\pi_\theta(z|s)$ towards the skill prior or posterior. Additionally, we provide a reward bonus for reaching states which the discriminator classifies as
being within the demonstration support. This results in the following term $J_t$ for SkiLD's full RL objective:
$$J_t = \tilde{r}(s_t, z_t) - \alpha_q \underbrace{D_{KL}\big(\pi_\theta(z_t|s_t), q_\zeta(z_t|s_t)\big) \cdot D(s_t)}_{\text{posterior regularization}} - \alpha \underbrace{D_{KL}\big(\pi_\theta(z_t|s_t), p(z_t|s_t)\big) \cdot \big(1 - D(s_t)\big)}_{\text{prior regularization}},$$
$$\text{with} \quad \tilde{r}(s_t, z_t) = (1 - \kappa) \cdot r(s_t, z_t) + \kappa \cdot \underbrace{\big[\log D(s_t) - \log\big(1 - D(s_t)\big)\big]}_{\text{discriminator reward}}. \tag{8.4}$$
The weighting factor $\kappa$ is a hyperparameter; $\alpha$ and $\alpha_q$ are either constant or tuned automatically via dual gradient descent (Haarnoja et al., 2018c). The discriminator reward follows common formulations used in adversarial imitation learning (Finn et al., 2016a; Fu et al., 2018; Zhu et al., 2018; Kostrikov et al., 2019).¹ Our formulation combines IRL-like and BC-like objectives by using learned rewards and trying to match the demonstration's skill distribution.
For policy optimization, we use a modified version of the SPiRL algorithm (Pertsch et al.,
2020a), which itself is based on Soft Actor-Critic (Haarnoja et al., 2018b). Concretely, we replace
the environment reward with the discriminator-augmented reward and all prior divergence terms
with our new, weighted prior-posterior-divergence terms from equation 8.4 (for the full algorithm
see appendix, Section A).
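The sketch below illustrates how the discriminator output mediates the two divergence terms and augments the reward in Eq. 8.4. It assumes PyTorch Gaussian distributions for the policy, prior, and posterior; the function names and the numerical-stability epsilon are illustrative assumptions, and the returned divergence term would replace the single prior-KL term in the SPiRL-style update sketched in Chapter 7.

```python
# Sketch of the discriminator-weighted regularization and reward of Eq. 8.4 (illustrative).
import torch
from torch.distributions import kl_divergence

def weighted_divergence(policy_dist, prior_dist, posterior_dist, d_s, alpha, alpha_q):
    """Posterior regularization inside the demonstration support, prior regularization outside it."""
    kl_post = kl_divergence(policy_dist, posterior_dist).sum(-1)
    kl_prior = kl_divergence(policy_dist, prior_dist).sum(-1)
    return alpha_q * kl_post * d_s + alpha * kl_prior * (1.0 - d_s)

def augmented_reward(env_reward, d_s, kappa, eps=1e-6):
    """r~ = (1 - kappa) * r + kappa * (log D(s) - log(1 - D(s)))."""
    disc_reward = torch.log(d_s + eps) - torch.log(1.0 - d_s + eps)
    return (1.0 - kappa) * env_reward + kappa * disc_reward
```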
8.4 Experiments
In this chapter, we propose to leverage a large offline experience dataset for efficient demonstration-
guided RL. We aim to answer the following questions: (1) Can the use of task-agnostic prior
experience improve the efficiency of demonstration-guided RL? (2) Does the reuse of pre-trained
skills reduce the number of required target-specific demonstrations? (3) In what scenarios does the
combination of prior experience and demonstrations lead to the largest efficiency gains?
¹We found that using the pre-trained discriminator weights led to stable training, but it is possible to perform full adversarial training by finetuning $D(s)$ with rollouts from the downstream task training. We report results for initial experiments with discriminator finetuning in Section E and leave further investigation for future work.
8.4.1 Experimental Setup and Comparisons
To evaluate whether our method, SkiLD, can efficiently use task-agnostic data, we compare it
to prior demonstration-guided RL approaches on three complex, long-horizon tasks: a 2D maze
navigation task, a robotic kitchen manipulation task and a robotic office cleaning task (see Figure 8.4,
left).
Figure 8.4: Left: Test environments, top to bottom: 2D maze navigation, robotic kitchen manipulation, and robotic office cleaning. Right: Target task performance vs. environment steps. By using task-agnostic experience, our approach more efficiently leverages the demonstrations than prior demonstration-guided RL approaches across all tasks. The comparison to SPiRL shows that demonstrations improve efficiency even if the agent has access to large amounts of prior experience.
Maze Navigation. We adapt the maze navi-
gation task from Pertsch et al. (2020a) and in-
crease task complexity by adding randomness
to the agent’s initial position. The agent needs
to navigate through a maze for hundreds of time
steps using planar velocity commands to receive
a sparse binary reward upon reaching a fixed
goal position. We collect 3000 task-agnostic tra-
jectories using a motion planner that finds paths
between randomly sampled start-goal pairs. For
the target task we collect 5 demonstrations for
an unseen start-goal pair.
Robot Kitchen Environment. We use the en-
vironment of Gupta et al. (2019) in which a
7DOF robot arm needs to perform a sequence
of four subtasks, such as opening the microwave
or switching on the light, in the correct order.
The agent observes a low-dimensional state rep-
resentation and receives a binary reward upon
completion of each consecutive subtask. We use
603 teleoperated sequences performing various subtask combinations (from Gupta et al. (2019)) as
task-agnostic experience $\mathcal{D}$ and separate a set of 20 demonstrations for one particular sequence of
subtasks, which we define as our target task (see Figure 8.4, middle).
Robot Office Environment. A 5 DOF robot arm needs to clean an office environment by placing
objects in their target bins or putting them in a drawer. It observes the poses of its end-effector
and all objects in the scene and receives binary rewards for the completion of each subtask. We
collect 2400 training trajectories by perturbing the objects' initial positions and performing random
subtasks using scripted policies. We also collect 50 demonstrations for the unseen target task with
unseen object locations and subtask sequence.
We compare our approach to multiple prior demonstration-guided RL approaches that represent
the different classes of existing algorithms introduced in Section 8.2. In contrast to SkiLD, these
approaches are not designed to leverage task-agnostic prior experience: BC + RL initializes a policy
with behavioral cloning of the demonstrations, then continues to apply BC loss while finetuning
the policy with Soft Actor-Critic (SAC, (Haarnoja et al., 2018b)), representative of (Rajeswaran
et al., 2018; Nair et al., 2018). GAIL + RL (Zhu et al., 2018) combines rewards from the
environment and adversarial imitation learning (GAIL, (Ho and Ermon, 2016)), and optimizes the
policy using PPO (Schulman et al., 2017). Demo Replay initializes the replay buffer of an SAC
agent with the demonstrations and uses them with prioritized replay during updates, representative
of (Vecerik et al., 2017). We also compare our approach to RL-only methods to show the benefit
of using demonstration data: SAC (Haarnoja et al., 2018b) is a state-of-the-art model-free RL
algorithm, it neither uses offline experience nor demonstrations. SPiRL (Pertsch et al., 2020a)
extracts skills from task-agnostic experience and performs prior-guided RL on the target task (see
Section 8.3.1)
2
. Finally, Skill BC+RL combines skills learned from task-agnostic data with target
task demonstrations: it encodes the demonstrations with the pre-trained skill encoder and runs BC
2
We train SPiRL with the closed-loop policy representation from Section 8.3.2 for fair comparison and better
performance. For an empirical comparison of open and closed-loop skill representations in SPiRL, see Section C.
120
[Figure 8.5 panels: demonstrations; discriminator output p(demo | state); KL(policy ∥ posterior); KL(policy ∥ prior).]
Figure 8.5: Visualization of our approach on the maze navigation task (visualization states collected by rolling out the skill prior). Left: the given demonstration trajectories. Middle left: output of the demonstration discriminator D(s) (the greener, the higher the predicted probability that a state lies within the demonstration support; red indicates low probability). Middle right: policy divergence to the skill posterior. Right: divergence to the skill prior (blue indicates low and red high divergence). The discriminator accurately infers the demonstration support, and the policy follows the skill posterior only within the demonstration support and the skill prior otherwise.
for the high-level skill policy, then finetunes on the target task using SAC. For further details on the
environments, data collection, and implementation, see appendix Section B.
8.4.2 Demonstration-Guided RL with Learned Skills
Maze Navigation. Prior demonstration-guided RL approaches struggle on the task (see Figure 8.4,
right) since rewards are sparse and only five demonstrations are provided. With such small coverage,
behavioral cloning of the demonstrations’ primitive actions leads to brittle policies which are hard
to finetune. The Demo Replay agent improves over SAC without demonstrations and partly succeeds at
the task, but learning is slow. The GAIL+RL approach is able to follow part of the demonstrated
behavior, but fails to reach the final goal (see Figure F.1 for qualitative results). SPiRL and Skill
BC+RL leverage task-agnostic data to learn to occasionally solve the task, but train slowly: SPiRL’s
learned, task-agnostic skill prior and Skill BC+RL’s uniform skill prior during SAC finetuning
encourage the exploration of many task-irrelevant skills (the performance of SPiRL differs from Pertsch et al. (2020a) due to the increased task complexity; see Section B.4). In contrast, our approach SkiLD leverages the task-specific skill posterior to quickly explore the relevant skills, leading to significant efficiency gains (see Figure 8.5 for qualitative analysis and Figure F.2 for a comparison of SkiLD vs. SPiRL exploration).
Robotic Manipulation. We show the performance comparison on the robotic manipulation tasks
in Figure 8.4 (right); for qualitative robot manipulation videos, see https://sites.google.com/view/skill-demo-rl. Both tasks are more challenging since they require precise control of a
high-DOF manipulator. We find that approaches for demonstration-guided RL that do not leverage
task-agnostic experience struggle to learn either of the tasks since following the demonstrations
step-by-step is inefficient and prone to accumulating errors. SPiRL, in contrast, is able to learn
meaningful skills from the offline datasets, but struggles to explore the task-relevant skills and
therefore learns slowly. Worse yet, the uniform skill prior used in Skill BC+RL’s SAC finetuning is
even less suited for the target task and leads the policy to deviate from the BC initialization early
on in training, preventing the agent from learning the task altogether (for pure BC performance,
see appendix, Figure F.6). Our approach, however, uses the learned skill posterior to guide the
chaining of the extracted skills and thereby learns to solve the tasks efficiently, showing how
SkiLD effectively combines task-agnostic and task-specific data for demonstration-guided RL.
8.4.3 Ablation Studies
Figure 8.6: Ablation studies. We test the performance of SkiLD for different sizes of the demonstration dataset |D_demo| on the maze navigation task (left) and ablate the components of our objective on the kitchen manipulation task (right).
In Figure 8.6 (left) we test the robustness
of our approach to the number of demonstra-
tions in the maze navigation task and com-
pare to BC+RL, which we found to work best
across different demonstration set sizes. Both
approaches benefit from more demonstrations,
but our approach is able to learn with far
fewer demonstrations by using prior experience.
While BC+RL learns each low-level action from the demonstrations, SkiLD merely learns to recom-
bine skills it has already mastered using the offline data, thus requiring less dense supervision and
fewer demonstrations. We also ablate the components of our RL objective on the kitchen task
in Figure 8.6 (right). Removing the discriminator reward bonus ("no-GAIL") slows convergence since the agent lacks a dense reward signal. Naively replacing the skill prior in the SPiRL objective of Equation 8.1 with the learned skill posterior ("post-only") fails since the agent follows the skill posterior outside its support. Removing the skill posterior and optimizing a discriminator-bonus-augmented reward using SPiRL ("no-post") fails because the agent cannot efficiently explore the
rich skill space. Finally, we show the efficacy of our approach in the pure imitation setting, without
environment rewards, in appendix, Section E.
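The ablation above can be made concrete with a small sketch. The snippet below is an illustrative reconstruction of how the three ingredients interact (a discriminator reward bonus, a discriminator-weighted divergence to the skill posterior inside the demonstration support, and a divergence to the skill prior outside it); the function names, coefficients, and exact form of the terms are assumptions made for exposition, not the actual Equation 8.1 objective.

```python
import numpy as np

def skild_policy_objective(q_value, kl_to_posterior, kl_to_prior, d_demo,
                           alpha_post=1.0, alpha_prior=1.0):
    """Illustrative discriminator-weighted policy objective: follow the skill
    posterior where D(s) indicates demonstration support, the skill prior
    otherwise, while maximizing the critic value (to be maximized)."""
    weighted_div = (d_demo * alpha_post * kl_to_posterior
                    + (1.0 - d_demo) * alpha_prior * kl_to_prior)
    return q_value - weighted_div

def augmented_reward(env_reward, d_demo, bonus_weight=1.0, eps=1e-6):
    """GAIL-style discriminator reward bonus (the "no-GAIL" ablation removes this term)."""
    return env_reward + bonus_weight * np.log(d_demo + eps)

# Ablations from Figure 8.6 (right), expressed in this sketch's terms:
#  - "post-only": set d_demo = 1 everywhere (always follow the posterior).
#  - "no-post":   set alpha_post = 0 and keep only the prior term and the bonus.
```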
8.4.4 Robustness to Partial Demonstrations
Figure 8.7: Left: Robustness to partial demon-
strations. SkiLD can leverage partial demonstra-
tions by seamlessly integrating task-agnostic and
task-specific datasets (see Section 8.4.4). Right:
Analysis of data vs. task alignment. The bene-
fit of using demonstrations in addition to prior
experience diminishes if the prior experience is
closely aligned with the target task (solid), but
gains are high when data and task are not well-
aligned (dashed).
Most prior approaches that aim to follow
demonstrations of a target task assume that
these demonstrations show the complete exe-
cution of the task. However, we can often en-
counter situations in which the demonstrations
only show incomplete solutions, e.g. because
the agent’s and demonstration’s initial states
do not align or because we only have access
to demonstrations for a subtask within a long-
horizon task. Thus, SkiLD is designed to han-
dle such partial demonstrations: through the
discriminator weighting it relies on demonstrations only within their support and falls back to
following the task-agnostic skill prior otherwise. Thus it provides a robust framework that seam-
lessly integrates task-specific and task-agnostic data sources. We test this experimentally in the
kitchen environment: we train SkiLD with partial demonstrations in which we remove one of
the subskills. The results in Figure 8.7 (left) show that “SkiLD-Partial” is able to leverage the
partial demonstrations to improve efficiency over SPiRL, which does not leverage demonstrations.
Expectedly, using the full demonstrations in the SkiLD framework (“SkiLD-Full”) leads to even
higher learning efficiency.
8.4.5 Data Alignment Analysis
We aim to analyze in what scenarios the use of demonstrations in addition to task-agnostic
experience is most beneficial. In particular, we evaluate how the alignment between the distribution
of observed behaviors in the task-agnostic dataset and the target behaviors influences learning
efficiency. We choose two different target tasks in the kitchen environment, one with good and one
with bad alignment between the behavior distributions (see Section F), and compare our method,
which uses demonstrations, to SPiRL, which only relies on the task-agnostic data.
In the well-aligned case (Figure 8.7, right, solid lines), we find that both approaches learn
the task efficiently. Since the skill prior encourages effective exploration on the downstream task,
the benefit of the additional demonstrations leveraged in our method is marginal. In contrast, if
task-agnostic data and downstream task are not well-aligned (Figure 8.7, right, dashed), SPiRL
struggles to learn the task since it cannot maximize task reward and minimize divergence from the
mis-aligned skill prior at the same time. Our approach learns more reliably by encouraging the
policy to reach demonstration-like states and then follow the skill posterior, which by design is
well-aligned with the target task.
In summary, our analysis finds that approaches that leverage both task-agnostic data and
demonstrations improve over methods that use either of the data sources alone across all tested
tasks. We find that combining the data sources is particularly beneficial in two cases:
• Diverse Task-Agnostic Data. Demonstrations can focus exploration on task-relevant skills if
the task-agnostic skill prior explores too large a set of skills (see Section 8.4.2).
• Mis-Aligned Task-Agnostic Data. Demonstrations can compensate for mis-alignment between
task-agnostic data and the target task by guiding exploration with the skill posterior instead of the
mis-aligned prior.
8.5 Discussion
In this chapter, we proposed SkiLD, an approach for demonstration-guided RL that is able to
leverage task-agnostic experience datasets and task-specific demonstrations for accelerated learning
of unseen tasks. In three challenging environments, SkiLD learns new tasks more efficiently than both prior demonstration-guided RL approaches, which are unable to leverage task-agnostic data, and skill-based RL methods, which cannot effectively incorporate demonstrations. Future work
should combine task-agnostic data and demonstrations for efficient learning in the real world and
investigate domain-agnostic measures for data-task alignment to quantify the usefulness of prior
experience for target tasks.
Chapter 9
Skill-based Model-based Reinforcement Learning
9.1 Introduction
A key trait of human intelligence is the ability to plan abstractly for solving complex tasks (Legg
and Hutter, 2007). For instance, we perform cooking by imagining outcomes of high-level skills like
washing and cutting vegetables, instead of planning every muscle movement involved (Botvinick
and Weinstein, 2014). This ability to plan with temporally-extended skills helps to scale our internal
model to long-horizon tasks by reducing the search space of behaviors. To apply this insight
to artificial intelligence agents, we propose a novel skill-based and model-based reinforcement
learning (RL) method, which learns a model and a policy in a high-level skill space, enabling
accurate long-term prediction and efficient long-term planning.
Typically, model-based RL involves learning a flat single-step dynamics model, which pre-
dicts the next state from the current state and action. This model can then be used to simulate
“imaginary” trajectories, which significantly improves sample efficiency over their model-free alter-
natives (Hafner et al., 2019; Hansen et al., 2022). However, such model-based RL methods have
shown only limited success in long-horizon tasks due to inaccurate long-term prediction (Lu et al.,
2021a) and computationally expensive search (Lowrey et al., 2019; Janner et al., 2019; Argenson
and Dulac-Arnold, 2021).
Skill-based RL enables agents to solve long-horizon tasks by acting with multi-action subroutines
(skills) (Sutton et al., 1999; Lee et al., 2019b; 2020; Pertsch et al., 2020a; Lee et al., 2021c; Dalal
et al., 2021) instead of primitive actions. This temporal abstraction of actions enables systematic
long-range exploration and allows RL agents to plan farther into the future, while requiring a shorter
horizon for policy optimization, which makes long-horizon downstream tasks more tractable. Yet,
on complex long-horizon tasks, such as maze navigation and furniture assembly (Lee et al., 2021b),
skill-based RL still requires a few million to billion environment interactions to learn (Lee et al.,
2021c), which is impractical for real-world applications (e.g. robotics and healthcare).
To combine the best of both model-based RL and skill-based RL, we propose Skill-based
Model-based RL (SkiMo), which enables effective planning in the skill space using a skill dynamics
model. Given a state and a skill to execute, the skill dynamics model directly predicts the resultant
state after skill execution, without needing to model every intermediate step and low-level action
(Figure 9.1), whereas the flat dynamics model predicts the immediate next state after one action
execution. Thus, planning with skill dynamics requires fewer predictions than flat (single-step)
dynamics, resulting in more reliable long-term future predictions and plans.
Concretely, we first jointly learn the skill dynamics model and a skill repertoire from large
offline datasets collected across diverse tasks (Lynch et al., 2020; Pertsch et al., 2020a; 2021).
This joint training shapes the skill embedding space for easy skill dynamics prediction and skill
execution. Then, to solve a complex downstream task, we train a hierarchical task policy that acts
in the learned skill space. For more efficient policy learning and better planning, we leverage the
skill dynamics model to simulate skill trajectories.
The main contribution of this chapter is to propose Skill-based Model-based RL (SkiMo), a novel
sample-efficient model-based hierarchical RL algorithm that leverages task-agnostic data to extract
not only a reusable skill set but also a skill dynamics model. The skill dynamics model enables
efficient and accurate long-term planning for sample-efficient RL. The experiments show that
SkiMo outperforms the state-of-the-art skill-based and model-based RL algorithms on long-horizon
navigation and robotic manipulation tasks with sparse rewards.
[Figure 9.1 panels: (a) Flat dynamics model without skills; (b) Flat dynamics model with skills; (c) Skill dynamics model with skills.]
Figure 9.1: Intelligent agents can use their internal models to imagine potential futures for planning.
Instead of planning out every primitive action (black arrows in a), they aggregate action sequences
into skills (red and blue arrows in b). Further, instead of simulating each low-level step, they can
leap directly to the predicted outcomes of executing skills in sequence (red and blue arrows in
c), which leads to better long-term prediction and planning compared to predicting step-by-step
(blurriness of images represents the level of error accumulation in prediction). With the ability to
plan over skills, the agent can accurately imagine and efficiently plan for long-horizon tasks.
9.2 Related Work
Model-based RL leverages a learned dynamics model of the environment to plan a sequence of
actions that leads to the desired behavior. The dynamics model predicts the future state of the envi-
ronment, and optionally the associated reward, after taking a specific action for planning (Argenson
and Dulac-Arnold, 2021; Hansen et al., 2022) or subsequent policy search (Ha and Schmidhuber,
2018; Hafner et al., 2019; Mendonca et al., 2021; Hansen et al., 2022). By simulating candidate
behaviors in imagination instead of in the physical environment, model-based algorithms improve
the sample efficiency of RL agents (Hafner et al., 2019; Hansen et al., 2022). Typically, model-based
RL leverages the model for planning, e.g., CEM (Rubinstein, 1997) and MPPI (Williams et al.,
2015). On the other hand, the model can also be used to generate imaginary rollouts for policy
optimization (Hafner et al., 2019; Hansen et al., 2022). Yet, due to the accumulation of prediction
error at each step (Lu et al., 2021a) and the increasing search space, finding an optimal, long-horizon
plan is inaccurate and computationally expensive (Lowrey et al., 2019; Janner et al., 2019; Argenson
and Dulac-Arnold, 2021).
To facilitate learning of long-horizon behaviors, skill-based RL lets the agent act over temporally-
extended skills (i.e. options (Sutton et al., 1999) or motion primitives (Pastor et al., 2009)), which can
be represented as sub-policies or a coordinated sequence of low-level actions. Temporal abstraction
effectively reduces the task horizon for the agent and enables directed exploration (Nachum et al.,
2019b), a major challenge in RL. The reusable skills can be manually defined (Pastor et al., 2009;
Mülling et al., 2013; Lee et al., 2019b; 2020; Dalal et al., 2021), extracted from large offline
datasets (Shiarlis et al., 2018; Kipf et al., 2019; Lynch et al., 2020; Shankar and Gupta, 2020; Lu
et al., 2021b), discovered online in an unsupervised manner (Sharma et al., 2020; Eysenbach et al.,
2019), or acquired in the form of goal-reaching policies (Nachum et al., 2018; Gupta et al., 2019;
Mandlekar et al., 2020b;a; Mendonca et al., 2021). However, skill-based RL is still impractical
for real-world applications, requiring a few million to a billion environment interactions (Lee et al.,
2021c). In this chapter, we use model-based RL to guide the planning of skills to improve the
sample efficiency of skill-based approaches.
There have been attempts to plan over skills in model-based RL (Sharma et al., 2020; Wu
et al., 2021; Lu et al., 2021a; Xie et al., 2020; Shah et al., 2022). However, most of these ap-
proaches (Sharma et al., 2020; Lu et al., 2021a; Xie et al., 2020) still utilize the conventional
flat (single-step) dynamics model, which struggles at handling long-horizon planning due to error
accumulation. Wu et al. (2021) proposes to learn a temporally-extended dynamics model; however,
it conditions on low-level actions rather than skills and is only used for low-level planning. A
concurrent work, Shah et al. (2022), is most similar to our approach in that it learns a skill dynamics
model, but it uses a limited set of discrete, manually-defined skills. To fully unleash the potential of
temporally abstracted skills, we devise a skill-level dynamics model to provide accurate long-term
prediction, which is essential for solving long-horizon tasks. To the best of our knowledge, SkiMo
is the first work that jointly learns skills and a skill dynamics model from data for model-based RL.
[Figure 9.2 diagram: (1) pre-training of the skill encoder q(z|s,a), skill policy π_L(a_t|s_t,z), skill prior p(z|s_0), and skill dynamics D(s_H|s_0,z) from offline data; (2) downstream RL with the task policy π(z|s_t), reward R(r_t|s_t,z), value Q(s_t,z), and CEM planning in the skill space, where a skill is sampled every H steps.]
Figure 9.2: Our approach, SkiMo, combines model-based RL and skill-based RL for sample efficient
learning of long-horizon tasks. SkiMo consists of two phases: (1) learn a skill dynamics model and
a skill repertoire from offline task-agnostic data, and (2) learn a high-level policy for the downstream
task by leveraging the learned model and skills. We omit the encoded latent state h in the figure and
directly write observation s for clarity, but most modules take the encoded state h as input.
9.3 Method
In this chapter, we aim to improve the long-horizon learning capability of RL agents. To enable
accurate long-term prediction and efficient long-horizon planning for RL, we introduce SkiMo, a
novel skill-based and model-based RL method that combines the benefits of both frameworks. A key
change to prior model-based approaches is the use of a skill dynamics model that directly predicts the
outcome of a chosen skill, which enables efficient and accurate long-term planning. In this section,
we describe the overview of our approach which consists of two phases: (1) learning the skill
dynamics model and skills from an offline task-agnostic dataset (Section 9.3.3) and (2) downstream
task learning with the skill dynamics model (Section 9.3.4), as illustrated in Figure 9.2.
9.3.1 Preliminaries
Reinforcement Learning We formulate our problem as a Markov decision process (Sutton, 1984), which is defined by a tuple (S, A, R, P, ρ_0, γ) of the state space S, action space A, reward function R(s, a), transition probability P(s' | s, a), initial state distribution ρ_0, and discounting factor γ. We define a policy π(a | s) that maps from a state s to an action a. Our objective is to learn the optimal policy that maximizes the expected discounted return,

\mathbb{E}_{s_0 \sim \rho_0,\, (s_0, a_0, \ldots, s_{T_i}) \sim \pi} \Big[ \sum_{t=0}^{T_i - 1} \gamma^t R(s_t, a_t) \Big],

where T_i is the variable episode length.
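To make the return definition concrete, here is a minimal, purely illustrative Python helper that computes the discounted return of one episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode of variable length T_i."""
    g, ret = 1.0, 0.0
    for r in rewards:
        ret += g * r
        g *= gamma
    return ret

# e.g., a sparse-reward episode that only succeeds at the final (100th) step:
# discounted_return([0.0] * 99 + [100.0]) == 100.0 * 0.99**99 ≈ 36.97
```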
Unlabeled Offline Data We assume access to a reward-free task-agnostic dataset (Lynch et al., 2020; Pertsch et al., 2020a), which is a set of N state-action trajectories, D = {τ_1, ..., τ_N}. Since it is task-agnostic, this data can be collected from training data for other tasks, unsupervised exploration, or human teleoperation. We do not assume this dataset contains solutions for the downstream task; therefore, tackling the downstream task requires re-composition of skills learned from diverse trajectories.
Skill-based RL We define skills as a sequence of actions (a_0, ..., a_{H-1}) with a fixed horizon H and parameterize skills as a skill latent z and a skill policy π^L(a | s, z) that maps a skill latent and state to the corresponding action sequence. The skill latent and skill policy can be trained using a variational auto-encoder (VAE (Kingma and Welling, 2014)), where a skill encoder q(z | (s,a)_{0:H-1}) embeds a sequence of transitions into a skill latent z, and the skill policy decodes it back to the original action sequence. Following SPiRL (Pertsch et al., 2020a), we also learn a skill prior p(z | s), which is the skill distribution in the offline data, to guide the downstream task policy to explore promising skills over the large skill space. It is worth noting that our method is compatible with variable-length skills (Kipf et al., 2019; Shankar et al., 2020; Shankar and Gupta, 2020) and goal-conditioned skills (Mendonca et al., 2021) with minimal change; however, for simplicity, we adopt fixed-length skills of H = 10 in this chapter.
9.3.2 SkiMo Model Components
SkiMo consists of three major model components: the skill policy (π^L_θ), skill dynamics model (D_ψ), and task policy (π_φ), along with auxiliary components for representation learning and value estimation. A state encoder E_ψ first encodes an observation s into the latent state h. Then, given a skill z, the skill dynamics D_ψ predicts the skill effect in the latent space. The task policy π_φ, reward function R_φ, and value function Q_φ predict a skill, reward, and value on the (imagined) latent state, respectively. The following is a summary of the notations of our model components:
State encoder: h_t = E_ψ(s_t)
Observation decoder: ŝ_t = O_θ(h_t)
Skill prior: ẑ_t ∼ p_θ(s_t)
Skill encoder: z_t ∼ q_θ((s, a)_{t:t+H-1})
Skill policy: â_t = π^L_θ(s_t, z_t)
Skill dynamics: ĥ_{t+H} = D_ψ(h_t, z_t)
Task policy: ẑ_t ∼ π_φ(h_t)
Reward: r̂_t = R_φ(h_t, z_t)
Value: v̂_t = Q_φ(h_t, z_t)
(9.1)
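To make these interfaces concrete, the following is a minimal PyTorch-style sketch of the components in Equation (9.1); the MLP parameterization, layer sizes, and class names are illustrative assumptions rather than the exact SkiMo architecture.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Small MLP used as a stand-in for every network head in this sketch.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))

class SkiMoComponents(nn.Module):
    """Illustrative container for the modules listed in Equation (9.1)."""
    def __init__(self, state_dim, action_dim, latent_dim=128, skill_dim=10, horizon=10):
        super().__init__()
        self.H = horizon
        self.encoder = mlp(state_dim, latent_dim)                    # E_psi: s_t -> h_t
        self.decoder = mlp(latent_dim, state_dim)                    # O_theta: h_t -> s_t
        self.skill_prior = mlp(state_dim, 2 * skill_dim)             # p_theta(z | s_t), Gaussian params
        self.skill_encoder = mlp(horizon * (state_dim + action_dim),
                                 2 * skill_dim)                      # q_theta(z | (s, a)_{t:t+H-1})
        self.skill_policy = mlp(state_dim + skill_dim, action_dim)   # pi^L_theta(a_t | s_t, z)
        self.skill_dynamics = mlp(latent_dim + skill_dim, latent_dim)  # D_psi: (h_t, z) -> h_{t+H}
        self.task_policy = mlp(latent_dim, 2 * skill_dim)            # pi_phi(z | h_t), Gaussian params
        self.reward = mlp(latent_dim + skill_dim, 1)                 # R_phi(h_t, z)
        self.value = mlp(latent_dim + skill_dim, 1)                  # Q_phi(h_t, z)
```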
For convenience, we label the trainable parameters ψ, θ, φ of each component according to which phase they are trained on:
1. Learned from offline data and finetuned in downstream RL (ψ = {ψ_E, ψ_D}): The state encoder (E_ψ) and the skill dynamics model (D_ψ) are first trained on the offline task-agnostic data and then finetuned in downstream RL to account for unseen states and transitions.
2. Learned only from offline data (θ = {θ_O, θ_q, θ_p, θ_{π_L}}): The observation decoder (O_θ), skill encoder (q_θ), skill prior (p_θ), and skill policy (π^L_θ) are learned from the offline data.
3. Learned in downstream RL (φ = {φ_Q, φ_R, φ_π}): The value (Q_φ) and reward (R_φ) functions, and the high-level task policy (π_φ) are trained for the downstream task using environment interactions.
9.3.3 Pre-Training Skill Dynamics Model and Skills from Task-agnostic Data
Our method, SkiMo, consists of pre-training and downstream RL phases. In pre-training, SkiMo
leverages offline data to extract (1) skills for temporal abstraction of actions, (2) skill dynamics
for skill-level planning on a latent state space, and (3) a skill prior (Pertsch et al., 2020a) to guide
exploration. Specifically, we jointly learn a skill policy and skill dynamics model, instead of learning
them separately (Wu et al., 2021; Lu et al., 2021a; Xie et al., 2020), in a self-supervised manner.
The key insight is that this joint training could shape the latent skill space Z and the state embedding such that the skill dynamics model can easily predict the future.
In contrast to prior works that learn models completely online (Hafner et al., 2019; Sekar et al.,
2020; Hansen et al., 2022), we leverage existing offline task-agnostic datasets to pre-train a skill
dynamics model and skill policy. This offers the benefit that the model and skills are agnostic to
specific tasks so that they may be used in multiple tasks. Afterwards in the downstream RL phase,
the agent continues to finetune the skill dynamics model to accommodate task-specific trajectories.
To learn a low-dimensional skill latent space Z that encodes action sequences, we train a conditional VAE (Kingma and Welling, 2014) on the offline dataset that reconstructs the action sequence through a skill embedding given a state-action sequence, as in SPiRL (Pertsch et al., 2020a; 2021). Specifically, given H consecutive states and actions (s, a)_{0:H-1}, a skill encoder q_θ predicts a skill embedding z and a skill decoder π^L_θ (i.e. the low-level skill policy) reconstructs the original action sequence from z:

\mathcal{L}_{\text{VAE}} = \mathbb{E}_{(s,a)_{0:H-1} \sim \mathcal{D}} \bigg[ \underbrace{\frac{\lambda_{\text{BC}}}{H} \sum_{i=0}^{H-1} \big( \pi^L_\theta(s_i, z) - a_i \big)^2}_{\text{Behavioral cloning}} + \beta \cdot \underbrace{\mathrm{KL}\big( q_\theta(z \mid (s,a)_{0:H-1}) \,\|\, p(z) \big)}_{\text{Embedding regularization}} \bigg], \qquad (9.2)

where z is sampled from q_θ and λ_BC, β are weighting factors for regularizing the skill latent z distribution to a prior of a tanh-transformed unit Gaussian distribution, Z ∼ tanh(N(0, 1)).
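A minimal PyTorch-style sketch of the objective in Equation (9.2) is given below; the diagonal-Gaussian encoder, the plain unit-Gaussian prior (omitting the tanh transform for brevity), and the tensor shapes are simplifying assumptions.

```python
import torch
from torch.distributions import Normal, kl_divergence

def skill_vae_loss(skill_encoder, skill_policy, states, actions,
                   lambda_bc=1.0, beta=1e-4):
    """states: (B, H, state_dim), actions: (B, H, action_dim)."""
    B, H, _ = actions.shape
    # q_theta(z | (s, a)_{0:H-1}) as a diagonal Gaussian over the skill latent.
    mu, log_std = skill_encoder(
        torch.cat([states, actions], dim=-1).flatten(1)).chunk(2, dim=-1)
    q_z = Normal(mu, log_std.exp())
    z = q_z.rsample()  # reparameterized sample

    # Behavioral cloning: decode each action from (s_i, z) with the skill policy.
    z_rep = z.unsqueeze(1).expand(B, H, -1)
    pred_actions = skill_policy(torch.cat([states, z_rep], dim=-1))
    bc_loss = lambda_bc / H * ((pred_actions - actions) ** 2).sum(dim=(1, 2)).mean()

    # Embedding regularization toward a unit-Gaussian prior p(z).
    p_z = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    kl_loss = beta * kl_divergence(q_z, p_z).sum(dim=-1).mean()
    return bc_loss + kl_loss
```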
To ensure the latent skill space is suited for long-term prediction, we jointly train a skill dynamics model with the VAE above. The skill dynamics model learns to predict h_{t+H}, the latent state H steps ahead conditioned on a skill z, for N sequential skill transitions using the latent state consistency loss (Hansen et al., 2022). To prevent a trivial solution and encode rich information from observations, we additionally train an observation decoder O_θ using the observation reconstruction loss. Altogether, the skill dynamics D_ψ, state encoder E_ψ, and observation decoder O_θ are trained on the following objective:

\mathcal{L}_{\text{REC}} = \mathbb{E}_{(s,a)_{0:NH} \sim \mathcal{D}} \sum_{i=0}^{N-1} \bigg[ \underbrace{\lambda_O \, \| s_{iH} - O_\theta(E_\psi(s_{iH})) \|_2^2}_{\text{Observation reconstruction}} + \underbrace{\lambda_L \, \| D_\psi(\hat{h}_{iH}, z_{iH}) - E_{\psi^-}(s_{(i+1)H}) \|_2^2}_{\text{Latent state consistency}} \bigg], \qquad (9.3)

where λ_O, λ_L are weighting factors and \hat{h}_0 = E_ψ(s_0) and \hat{h}_{(i+1)H} = D_ψ(\hat{h}_{iH}, z_{iH}) such that gradients are back-propagated through time. For stable training, we use a target network whose parameter ψ^- is slowly soft-copied from ψ.
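Below is a simplified sketch of Equation (9.3): the predicted latent ĥ is rolled forward through the skill dynamics for N skill steps and matched against the target encoder, while the decoder reconstructs the observed states. The batching conventions and per-term reductions are assumptions for illustration.

```python
import torch

def skill_model_pretrain_loss(encoder, decoder, skill_dynamics, target_encoder,
                              states, skills, lambda_o=1.0, lambda_l=1.0):
    """states: (B, N+1, state_dim) sub-sampled every H steps; skills: (B, N, skill_dim)."""
    N = skills.shape[1]
    loss = 0.0
    h_hat = encoder(states[:, 0])  # \hat{h}_0 = E_psi(s_0)
    for i in range(N):
        # Observation reconstruction at the current (true) state.
        h_i = encoder(states[:, i])
        loss = loss + lambda_o * ((decoder(h_i) - states[:, i]) ** 2).sum(-1).mean()
        # Latent state consistency against the slowly-updated target encoder,
        # with gradients flowing through the predicted latents (BPTT).
        with torch.no_grad():
            h_next_target = target_encoder(states[:, i + 1])
        h_hat = skill_dynamics(torch.cat([h_hat, skills[:, i]], dim=-1))
        loss = loss + lambda_l * ((h_hat - h_next_target) ** 2).sum(-1).mean()
    return loss
```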
Furthermore, to guide the exploration for downstream RL, we also extract a skill prior (Pertsch et al., 2020a) from offline data that predicts the skill distribution for any state. The skill prior is trained by minimizing the KL divergence between the output distributions of the skill encoder q_θ and the skill prior p_θ:

\mathcal{L}_{\text{SP}} = \mathbb{E}_{(s,a)_{0:H-1} \sim \mathcal{D}} \Big[ \lambda_{\text{SP}} \cdot \mathrm{KL}\big( \mathrm{sg}(q_\theta(z \mid s_{0:H-1}, a_{0:H-1})) \,\|\, p_\theta(z \mid s_0) \big) \Big], \qquad (9.4)

where λ_SP is a weighting factor and sg denotes the stop-gradient operator. Combining the objectives above, we jointly train the policy, model, and prior, which leads to a well-shaped skill latent space that is optimized for both skill reconstruction and long-term prediction:

\mathcal{L} = \mathcal{L}_{\text{VAE}} + \mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{SP}}. \qquad (9.5)
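A short sketch of the skill-prior objective in Equation (9.4), assuming both the encoder posterior and the prior are parameterized as diagonal Gaussians; the stop gradient corresponds to detaching the encoder outputs, and the total pre-training loss would simply add this term to the VAE and reconstruction losses sketched above.

```python
import torch
from torch.distributions import Normal, kl_divergence

def skill_prior_loss(skill_prior, q_mu, q_log_std, first_states, lambda_sp=1.0):
    """Match the prior p_theta(z | s_0) to the (detached) encoder posterior q_theta.

    q_mu, q_log_std: Gaussian parameters from the skill encoder (gradients stopped here).
    first_states:    s_0 of each training sub-sequence, shape (B, state_dim).
    """
    q_z = Normal(q_mu.detach(), q_log_std.detach().exp())   # sg(q_theta(...))
    p_mu, p_log_std = skill_prior(first_states).chunk(2, dim=-1)
    p_z = Normal(p_mu, p_log_std.exp())
    return lambda_sp * kl_divergence(q_z, p_z).sum(dim=-1).mean()
```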
9.3.4 Downstream Task Learning with Learned Skill Dynamics Model
To accelerate downstream RL with the learned skill repertoire, SkiMo learns a high-level task policy π_φ(z_t | h_t) that outputs a latent skill embedding z_t, which is then translated into a sequence of H actions using the pre-trained skill policy π^L_θ to act in the environment (Pertsch et al., 2020a; 2021).
To further improve the sample efficiency, we propose to use model-based RL in the skill space
by leveraging the skill dynamics model. The skill dynamics model and task policy can generate
imaginary rollouts in the skill space by repeating (1) sampling a skill, z_t ∼ π_φ(h_t), and (2) predicting the H-step future after executing the skill, h_{t+H} = D_ψ(h_t, z_t). Our skill dynamics model requires only 1/H of the dynamics predictions and action selections of flat model-based RL approaches (Hafner
et al., 2019; Hansen et al., 2022), resulting in more efficient and accurate long-horizon imaginary
rollouts (see Appendix, Figure G.5).
Following TD-MPC (Hansen et al., 2022), we leverage these imaginary rollouts both for
planning (Algorithm 11) and policy optimization (Equation (9.7)), significantly reducing the number
of necessary environment interactions. During rollout, we perform Model Predictive Control (MPC),
which re-plans every step using CEM and executes the first skill of the skill plan (see Appendix,
Section C for more details).
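The following sketch illustrates skill-space imagination and a bare-bones CEM planner in the spirit of the MPC procedure described above; the population sizes, the Gaussian sampling distribution over skill sequences, the deterministic use of the task policy, and the return estimate (predicted H-step rewards plus a terminal value) are illustrative assumptions, not SkiMo's exact planner.

```python
import torch

@torch.no_grad()
def imagine_rollout(skill_dynamics, reward_fn, h0, skill_seq):
    """Roll out a sequence of skills in latent space; skill_seq: (T, skill_dim)."""
    h, total_reward = h0, 0.0
    for z in skill_seq:
        total_reward = total_reward + reward_fn(torch.cat([h, z], dim=-1))
        h = skill_dynamics(torch.cat([h, z], dim=-1))  # leap H environment steps at once
    return h, total_reward

@torch.no_grad()
def cem_plan(skill_dynamics, reward_fn, value_fn, task_policy, h0,
             plan_horizon=3, skill_dim=10, pop=256, elites=32, iters=5):
    """Return the first skill of the best skill sequence found by CEM."""
    mean = torch.zeros(plan_horizon, skill_dim)
    std = torch.ones(plan_horizon, skill_dim)
    for _ in range(iters):
        # Sample candidate skill sequences around the current distribution.
        cand = mean + std * torch.randn(pop, plan_horizon, skill_dim)
        returns = []
        for seq in cand:
            h_T, ret = imagine_rollout(skill_dynamics, reward_fn, h0, seq)
            # task_policy here is assumed to output a skill embedding (e.g., the mean of pi_phi).
            returns.append(ret + value_fn(torch.cat([h_T, task_policy(h_T)], dim=-1)))
        returns = torch.stack(returns).squeeze(-1)
        elite = cand[returns.topk(elites).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-6
    return mean[0]  # execute only the first skill; re-plan every step (MPC)
```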
To evaluate imaginary rollouts, we train a reward function R_φ(h_t, z_t) that predicts the sum of H-step rewards, r_t = \sum_{i=0}^{H-1} r_{t+i}, and a Q-value function Q_φ(h_t, z_t). We also finetune the skill dynamics model D_ψ and state encoder E_ψ on the downstream task to improve the model prediction:

\mathcal{L}'_{\text{REC}} = \mathbb{E}_{s_t, z_t, s_{t+H}, r_t \sim \mathcal{D}} \Big[ \underbrace{\lambda_L \, \| D_\psi(\hat{h}_t, z_t) - E_{\psi^-}(s_{t+H}) \|_2^2}_{\text{Latent state consistency}} + \underbrace{\lambda_R \, \| r_t - R_\phi(\hat{h}_t, z_t) \|_2^2}_{\text{Reward prediction}} + \underbrace{\lambda_V \, \| r_t + \gamma Q_{\phi^-}(\hat{h}_{t+H}, \pi_\phi(\hat{h}_{t+H})) - Q_\phi(\hat{h}_t, z_t) \|_2^2}_{\text{Value prediction}} \Big]. \qquad (9.6)
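A single-transition sketch of the finetuning objective in Equation (9.6); the tensor shapes, reductions, and target-network handling are simplifying assumptions, and the task policy is assumed to return a skill embedding directly.

```python
import torch

def downstream_model_loss(skill_dynamics, encoder, target_encoder,
                          reward_fn, q_fn, target_q_fn, task_policy,
                          s_t, z_t, s_next, r_t, gamma=0.99,
                          lambda_l=1.0, lambda_r=1.0, lambda_v=1.0):
    """s_next is the state H steps later and r_t the summed H-step reward, shape (B, 1)."""
    h_hat = encoder(s_t)
    h_next_pred = skill_dynamics(torch.cat([h_hat, z_t], dim=-1))
    with torch.no_grad():
        h_next_target = target_encoder(s_next)                        # E_{psi^-}(s_{t+H})
        td_target = r_t + gamma * target_q_fn(
            torch.cat([h_next_pred, task_policy(h_next_pred)], dim=-1))  # Q_{phi^-}
    consistency = lambda_l * ((h_next_pred - h_next_target) ** 2).sum(-1).mean()
    reward_loss = lambda_r * ((r_t - reward_fn(torch.cat([h_hat, z_t], dim=-1))) ** 2).mean()
    value_loss = lambda_v * ((td_target - q_fn(torch.cat([h_hat, z_t], dim=-1))) ** 2).mean()
    return consistency + reward_loss + value_loss
```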
Finally, we train a high-level task policy π_φ to maximize the estimated Q-value while regularizing it to the pre-trained skill prior p_θ (Pertsch et al., 2020a), which helps the policy output plausible skills:

\mathcal{L}_{\text{RL}} = \mathbb{E}_{s_t \sim \mathcal{D}} \Big[ -Q_\phi\big(\hat{h}_t, \pi_\phi(\mathrm{sg}(\hat{h}_t))\big) + \alpha \cdot \mathrm{KL}\big( \pi_\phi(z_t \mid \mathrm{sg}(\hat{h}_t)) \,\|\, p_\theta(z_t \mid s_t) \big) \Big]. \qquad (9.7)
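A corresponding sketch of the task-policy objective in Equation (9.7), assuming Gaussian outputs for both the task policy and the pre-trained skill prior; the regularization weight α and the parameterization are assumptions.

```python
import torch
from torch.distributions import Normal, kl_divergence

def task_policy_loss(task_policy, skill_prior, q_fn, encoder, s_t, alpha=0.1):
    """Maximize Q while staying close to the pre-trained skill prior."""
    h = encoder(s_t).detach()                         # sg(\hat{h}_t)
    mu, log_std = task_policy(h).chunk(2, dim=-1)
    pi = Normal(mu, log_std.exp())
    z = pi.rsample()
    q_term = -q_fn(torch.cat([h, z], dim=-1)).mean()  # -Q_phi(h_t, pi_phi(sg(h_t)))
    p_mu, p_log_std = skill_prior(s_t).chunk(2, dim=-1)
    prior = Normal(p_mu, p_log_std.exp())
    kl_term = alpha * kl_divergence(pi, prior).sum(-1).mean()
    return q_term + kl_term
```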
Figure 9.3: We evaluate our method on four long-horizon, sparse reward tasks. (a) The green point
mass navigates the maze to reach the goal (red). (b, c) The robot arm in the kitchen must complete
four tasks in the correct order (Microwave - Kettle - Bottom Burner - Light and Microwave - Light -
Slide Cabinet - Hinge Cabinet). (d) The robot arm needs to compose skills learned from extremely
task-agnostic data (Open Drawer - Turn on Lightbulb - Move Slider Left - Turn on LED).
For both Equation (9.6) and Equation (9.7), consecutive skill-level transitions can be sampled
together, so that the models can be trained using backpropagation through time, similar to Equa-
tion (9.3).
9.4 Experiments
In this chapter, we propose a model-based RL approach that can efficiently and accurately plan
long-horizon trajectories over the skill space, rather than the primitive action space, by leveraging
the skills and skill dynamics model learned from an offline task-agnostic dataset. Hence, in our
experiments, we aim to answer the following questions: (1) Can the use of the skill dynamics model
improve the efficiency of RL training for long-horizon tasks? and (2) Is the joint training of skills
and the skill dynamics model essential for efficient model-based learning?
9.4.1 Tasks
To evaluate whether our method can efficiently learn temporally-extended tasks with sparse
rewards, we compare it to prior model-based RL and skill-based RL approaches on four tasks: 2D
maze navigation, two robotic kitchen manipulation tasks, and a robotic tabletop manipulation task, as
illustrated in Figure 9.3. More details about environments, tasks, and offline data can be found in
Section C.
Maze We use the maze navigation task from Pertsch et al. (2021), where a point mass agent
is initialized randomly near the green region and needs to reach the fixed goal region in red
(Figure 9.3a). The agent observes its 2D position and 2D velocity, and controls its (x, y)-velocity.
The agent receives a sparse reward of 100 only when it reaches the goal. The task-agnostic offline
dataset from Pertsch et al. (2021) consists of 3,046 trajectories between randomly sampled initial
and goal positions.
Kitchen We use the FrankaKitchen environment and 603 teleoperation trajectories from D4RL (Fu
et al., 2020). The 7-DoF Franka Emika Panda arm needs to perform four sequential sub-tasks
(Microwave - Kettle - Bottom Burner - Light). In Mis-aligned Kitchen, we also test another task
sequence (Microwave - Light - Slide Cabinet - Hinge Cabinet), which has a low sub-task transition
probability in the offline data distribution (Pertsch et al., 2021). The agent observes 11D robot state
and 19D object state, and uses 9D joint velocity control. The agent receives a reward of 1 for every
sub-task completion in order.
CALVIN We adapt the CALVIN benchmark (Mees et al., 2022) to have the target task Open
Drawer - Turn on Lightbulb - Move Slider Left - Turn on LED and the 21D observation space of
robot and object states. It also uses a Panda arm, but with 7D end-effector pose control. We use
the play data of 1,239 trajectories from Mees et al. (2022) as our offline data. The agent receives a
reward of 1 for every sub-task completion in the correct order.
9.4.2 Baselines and Ablated Methods
We compare our method to the state-of-the-art model-based RL (Hafner et al., 2019; Hansen
et al., 2022), skill-based RL (Pertsch et al., 2020a), and combinations of them, as well as three
ablations of our method, as summarized in Appendix, Table G.1.
[Figure 9.4 panels: (a) Maze (average success), (b) Kitchen, (c) Mis-aligned Kitchen, (d) CALVIN (average subtasks), each plotted against environment steps (1M), for Dreamer, SPiRL, SPiRL+Dreamer, TD-MPC, SPiRL+TD-MPC, DADS, LSP, and SkiMo (Ours).]
Figure 9.4: Learning curves of our method and baselines. All averaged over 5 random seeds.
• Dreamer (Hafner et al., 2019) and TD-MPC (Hansen et al., 2022) learn a flat (single-step)
dynamics and train a policy using latent imagination to achieve a high sample efficiency.
• DADS (Sharma et al., 2020) discovers skills and learns a dynamics model through unsupervised
learning.
• LSP (Xie et al., 2020) plans in the skill space, but using a single-step dynamics model from
Dreamer.
• SPiRL (Pertsch et al., 2020a) learns skills and a skill prior, and guides a high-level policy
using the learned prior.
• SPiRL + Dreamer and SPiRL + TD-MPC pre-train the skills using SPiRL and learn a policy
and model in the skill space (instead of the action space) using Dreamer and TD-MPC,
respectively. In contrast to SkiMo, these baselines do not jointly train the model and skills.
• SkiMo w/o joint training learns the latent skill space using only the VAE loss in Equation (9.2).
• SkiMo + SAC uses model-free RL (SAC (Haarnoja et al., 2018b)) to train a policy in the skill
space.
• SkiMo w/o CEM selects skills with the policy alone, without planning using the learned model.
9.4.3 Results
Maze Maze navigation poses a hard exploration problem due to the sparsity of the reward: the
agent only receives reward after taking 1,000+ steps to reach the goal. Figure 9.4a shows that only
SkiMo is able to consistently succeed in long-horizon navigation, whereas baselines struggle to
learn a policy or an accurate model due to the challenges in sparse feedback and long-term planning.
To better understand the result, we qualitatively analyze the behavior of each agent in Appendix,
Figure G.4. Dreamer and TD-MPC have a small coverage of the maze, since it is challenging
to coherently explore for 1,000+ steps to reach the goal using primitive actions. SPiRL is
able to explore a large fraction of the maze, but it does not learn to consistently find the goal due
to difficult policy optimization in long-horizon tasks. On the other hand, SPiRL + Dreamer and
SPiRL + TD-MPC fail to learn an accurate model and often collide with walls.
Kitchen Figure 9.4b demonstrates that SkiMo reaches the same performance (above 3 sub-tasks)
with 5x fewer environment interactions than SPiRL. This improvement in sample efficiency is
crucial in real-world robot learning. In contrast, Dreamer and TD-MPC rarely succeed on the first
sub-task due to the difficulty in long-horizon learning with primitive actions. SPiRL + Dreamer
and SPiRL + TD-MPC perform better than flat model-based RL by leveraging skills, yet the
independently trained model and policy are not accurate enough to consistently achieve more than
two sub-tasks.
Mis-aligned Kitchen The mis-aligned target task makes the downstream learning harder, because
the skill prior, which reflects the offline data distribution, offers less meaningful regularization to the
policy. Figure 9.4c shows that SkiMo still performs well despite the mis-alignment between the
offline data and downstream task. This demonstrates that the skill dynamics model is able to adapt
to the new distribution of behaviors, which might greatly deviate from the distribution in the offline
dataset.
CALVIN One of the major challenges in CALVIN is that the offline data is much more task-
agnostic. Any particular sub-task transition has probability lower than 0.1% on average, resulting in
a large number of plausible sub-tasks from any state. This setup mimics real-world large-scale robot
learning, where the robot may not receive a carefully curated dataset. Figure 9.4d demonstrates
that SkiMo can learn faster than the model-free baseline, SPiRL, which supports the benefit of
using our skill dynamics model. Meanwhile, Dreamer performs better in CALVIN than in Kitchen
because objects in CALVIN are more compactly located and easier to manipulate; thus, it becomes
viable to accomplish initial sub-tasks through random exploration. Yet, it falls short in composing
coherent action sequences to achieve a longer task sequence due to the lack of temporally-extended
reasoning.
In summary, we show the synergistic benefit of temporal abstraction in both the policy and
dynamics model. SkiMo achieves at least 5x higher sample efficiency than all baselines in robotic
domains, and is the only method that consistently solves the long-horizon maze navigation. Our
results also demonstrate the importance of algorithmic design choices (e.g. skill-level planning,
joint training of a model and skills) as naive combinations (SPiRL + Dreamer, SPiRL + TD-MPC)
fail to learn.
9.4.4 Ablation Studies
Model-based vs. Model-free In Figure 9.5, SkiMo achieves better asymptotic performance and
higher sample efficiency across all tasks than SkiMo + SAC, which directly uses the high-level task
policy to select skills instead of using the skill dynamics model to plan. Since the only difference is
in the use of the skill dynamics model for planning, this suggests that the task policy can make more
informative decisions by leveraging accurate long-term predictions of the skill dynamics model.
Joint training of skills and skill dynamics model Figure 9.5 shows that the joint training is
crucial for Maze and CALVIN while the difference is marginal in the Kitchen tasks. This suggests
[Figure 9.5 panels: (a) Maze (average success), (b) Kitchen, (c) Mis-aligned Kitchen, (d) CALVIN (average subtasks), each plotted against environment steps (1M), for SkiMo (Ours), SkiMo+SAC, SkiMo w/o joint training, and SkiMo w/o CEM.]
Figure 9.5: Learning curves of our method and ablated models. All averaged over 5 random seeds.
that the joint training is essential especially in more challenging scenarios, where the agent needs to
generate accurate long-term plans (for Maze) or the skills are very diverse (in CALVIN).
CEM planning As shown in Figure 9.5, SkiMo learns significantly better and faster in Kitchen,
Mis-aligned Kitchen, and CALVIN than SkiMo w/o CEM, indicating that CEM planning can
effectively find a better plan. On the other hand, in Maze, SkiMo w/o CEM learns twice as fast.
We find that action noise for exploration in CEM leads the agent to drift away from the skill prior
support and get stuck at walls and corners. We believe that with a careful tuning of action noise,
SkiMo can solve Maze much more efficiently. Meanwhile, the fast learning of SkiMo w/o CEM in
Maze confirms the advantage of policy optimization using imaginary rollouts generated by our skill
dynamics model.
For further ablations and discussion on skill horizon and planning horizon, see Appendix,
Section A.
9.4.5 Long-Horizon Prediction with Skill Dynamics Model
To assess the accuracy of long-term prediction of our proposed skill dynamics over flat dynamics,
we visualize imagined trajectories in Appendix, Figure G.5a, where the ground truth initial state
and a sequence of 500 actions (50 skills for SkiMo) are given. Dreamer struggles to make accurate
long-horizon predictions due to error accumulation. In contrast, SkiMo is able to reproduce the
ground truth trajectory with little prediction error even when traversing through hallways and
doorways. This is mainly because SkiMo allows temporal abstraction in the dynamics model,
thereby enabling temporally-extended prediction and reducing step-by-step prediction error.
9.5 Discussion
In this chapter, we proposed SkiMo, an intuitive instantiation of saltatory model-based hierarchi-
cal RL (Botvinick and Weinstein, 2014) that combines skill-based and model-based RL approaches.
Our experiments demonstrate that (1) a skill dynamics model reduces the long-term prediction
error, improving the performance of prior model-based RL and skill-based RL; (2) it leads to
temporal abstraction in both the policy and dynamics model, so the downstream RL can do effi-
cient, temporally-extended reasoning without needing step-by-step modeling and planning; and (3) joint
training of the skill dynamics and skill representations further improves the sample efficiency by
learning skills whose consequences are easy to predict. We believe that the ability to learn and utilize a
skill-level model holds the key to unlocking the sample efficiency and widespread use of RL agents,
and our method takes a step toward this direction.
Limitations and future work While our method extracts fixed-length skills from offline data,
the lengths of semantic skills may vary based on the contexts and goals. Future work can learn
variable-length semantic skills to improve long-term prediction and planning. Further, although we
only experimented on state-based inputs, SkiMo is a general framework that can be extended to
RGB, depth, and tactile observations. Thus, we would like to apply this sample-efficient approach
to real robots where the sample efficiency is crucial.
Chapter 10
Conclusion
In this thesis, we considered the problem of scaling a robot’s learning capability to complex
long-horizon tasks. To start, we developed benchmarks for evaluating a long-horizon task learning
capability in the domain of furniture assembly. For cheap and easy benchmarking, we first developed
the simulated environment, IKEA Furniture Assembly Environment. We then developed the real-
world furniture assembly benchmark with a reproducible experimental setup and simplified 3D
printed furniture parts, aiming to investigate challenges in learning long-horizon tasks in the real
world. We hope our furniture assembly benchmarks can contribute to pushing the boundary of robot
learning research in solving complex long-horizon tasks.
After presenting furniture assembly benchmarks as the next milestone, we considered a number
of skill chaining approaches to tackle these challenging tasks with pre-defined skills. First, we
showed that naive skill execution can fail due to the mismatch between the terminal state distribution
of one skill and the initial state distribution of the following skill. Thus, we proposed to bridge this
gap by learning transition policies between skills. We also considered an approach that directly
finetunes skills to fit their terminal state distributions into the initial state distributions of the
following skills, and this approach succeeded in learning furniture assembly in simulation for the first
time. Finally, we further extended skill chaining to multi-agent setups, where two arms (or two
agents) can collaborate for long-horizon tasks by explicitly coordinating skills of two arms.
To improve the flexibility of robots, instead of relying on a handful of manually-defined
skills, we proposed to learn a rich skill repertoire from large-scale task-agnostic data. To efficiently
use such a large number of skills, we presented an approach for learning skill priors and following
them as exploration guidance. Then, we combined this skill-based approach with model-based
reinforcement learning, resulting in a significant improvement in sample efficiency.
The previous chapters showed that harnessing skills is key to enabling complex task learning.
Yet, their success was limited to relatively simple behaviors in simulated environments, mainly due
to the huge demand for online interactions. The ultimate goal in robot learning is to enable general-
purpose robots that can automate dangerous or tedious jobs in our daily lives. To realize generalist
robots that can learn diverse complex tasks in the real world, robot learning algorithms need to
be scalable in three dimensions: (1) complex long-horizon tasks, (2) high-complexity (high-DoF)
robot systems, and (3) diversity of tasks. Below, we discuss some of the open questions and future
directions that we believe to be the most compelling.
Unsupervised skill discovery This thesis assumes that a set of skills or a dataset including
reusable skills are given. Manually defining complex skills is costly and does not guarantee optimal
skills. One way of acquiring diverse skills is to extract skills from trajectory data (Shiarlis et al.,
2018; Kipf et al., 2019; Shankar et al., 2020; Shankar and Gupta, 2020; Lee et al., 2021d) or to learn
a goal-reaching policy (Pathak* et al., 2018; Lee et al., 2019a). But, collecting high-quality data
including sufficient, reusable skills is also not trivial. Thus, autonomous skill discovery (Sekar et al.,
2020; Eysenbach et al., 2019; Sharma et al., 2020; Park et al., 2021; Mendonca et al., 2021) is an
appealing research direction for acquiring a rich skill repertoire, resulting in efficient downstream
learning. But, how to learn skills without prior knowledge remains an open question, which we
believe to be crucial for building general-purpose agents.
Scaling robot learning to complex robot systems In the field of robotics, many complicated
skills can only be achieved with the simultaneous actuation of multiple arms or robots. Thus,
we need an approach that scales from a simple robotic arm to a complicated humanoid robot to
multiple such robots. However, training such complex robotic systems is challenging due to the
exploration burden and difficult credit assignment over the large action space (Lee et al., 2020).
Similar to temporally segmenting a long-horizon task with skills, morphologically splitting an
agent’s control into modular policies (Wang et al., 2018; Pathak et al., 2019; Huang et al., 2020) can
be a plausible direction for a complex robot system. By using modular policies for different parts of
the agent, each policy can focus on learning part of the task while ignoring other parts of the task.
By adding a coordination module on top of all low-level policies, the skills learned by the modular
policies can be effectively managed to solve a complex task. Generally, the space of learning with
complex agents has been explored much less than that of learning for long-horizon tasks, leaving
significant room for future work.
Scaling robot learning to general-purpose robots While my research has focused on learning
individual long-horizon tasks, the ultimate goal is to make robots capable of performing diverse
tasks in any environment. But, collecting robot experience for training such versatile and robust
robots is very challenging. Instead, we can utilize large, diverse human data (e.g. YouTube videos,
online texts). It is not trivial to train a robot from human data due to the embodiment mismatch
and lack of expert actions. However, we can extract abstract and semantic behavioral information
(e.g. skills) from human data, which can be better transferred to robots. In this way, we believe a
robot can indirectly learn diverse tasks in a variety of environments from its augmented experience.
References
Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning.
In International Conference on Machine Learning, 2004.
Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=V69LGwJ0lIN.
Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur
Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s
cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
Saleh A Al-Abood, Keith Davids, and Simon J Bennett. Specificity of task constraints and
effects of visual demonstrations and verbal instructions in directing learners’ search during
skill acquisition. Journal of motor behavior, 33(3):295–305, 2001.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning
with policy sketches. In International Conference on Machine Learning, 2017.
Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël
Marinier, Leonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain
Gelly, and Olivier Bachem. What matters for on-policy deep actor-critic methods? a
large-scale study. In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=nIAxjsniDzg.
Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. In International
Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=OMNB1G5xzd4.
Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Associa-
tion for the Advancement of Artificial Intelligence , pages 1726–1734, 2017.
Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In
International Conference on Learning Representations, 2020.
Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Pushmeet Kohli, and Edward
Grefenstette. Learning to understand goal specifications by modelling reward. In Interna-
tional Conference on Learning Representations, 2019.
Stefan Bauer, Manuel Wüthrich, Felix Widmaier, Annika Buchholz, Sebastian Stark, Anirudh
Goyal, Thomas Steinbrenner, Joel Akpo, Shruti Joshi, Vincent Berenz, Vaibhav Agrawal,
Niklas Funk, Julen Urain De Jesus, Jan Peters, Joe Watson, Claire Chen, Krishnan Srinivasan,
Junwu Zhang, Jeffrey Zhang, Matthew Walter, Rishabh Madan, Takuma Yoneda, Denis
Yarats, Arthur Allshire, Ethan Gordon, Tapomayukh Bhattacharjee, Siddhartha Srinivasa,
Animesh Garg, Takahiro Maeda, Harshit Sikchi, Jilong Wang, Qingfeng Yao, Shuyu Yang,
Robert McCarthy, Francisco Sanchez, Qiang Wang, David Bulens, Kevin McGuinness,
Noel O'Connor, Redmond Stephen, and Bernhard Schölkopf. Real robot challenge: A
robotics competition in the cloud. In Proceedings of the NeurIPS 2021 Competitions and
Demonstrations Track, volume 176 of Proceedings of Machine Learning Research, pages
190–204. PMLR, 06–14 Dec 2022a.
Stefan Bauer, Manuel Wüthrich, Felix Widmaier, Annika Buchholz, Sebastian Stark, Anirudh
Goyal, Thomas Steinbrenner, Joel Akpo, Shruti Joshi, Vincent Berenz, et al. Real robot
challenge: A robotics competition in the cloud. In NeurIPS 2021 Competitions and Demon-
strations Track, pages 190–204. PMLR, 2022b.
Harold Bekkering, Andreas Wohlschlager, and Merideth Gattis. Imitation of gestures in
children is goal-directed. The Quarterly Journal of Experimental Psychology: Section A, 53
(1):153–164, 2000.
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment:
An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:
253–279, June 2013.
Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi
Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural
Information Processing Systems, 2016.
Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
Matthew Botvinick and Ari Weinstein. Model-based hierarchical reinforcement learning
and human action control. Philosophical Transactions of the Royal Society B: Biological
Sciences, 369(1655), 2014.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and
Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL
http://github.com/google/jax.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie
Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog,
Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say:
Grounding language in robotic affordances. In Conference on Robot Learning, 2022.
Serkan Cabi, Sergio Gomez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott
Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov,
David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling
data-driven robotics with reward sketching and batch reinforcement learning. In Robotics:
Science and Systems, 2019.
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu,
Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal
dataset for autonomous driving. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 11621–11631, 2020.
Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan
Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization
with real world experience. In IEEE International Conference on Robotics and Automation,
2019.
Alexander Clegg, Wenhao Yu, Jie Tan, C Karen Liu, and Greg Turk. Learning to dress:
Synthesizing human dressing motion via deep reinforcement learning. ACM Transactions on
Graphics (TOG), 2018.
John Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and
Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning
with trajectory embeddings. In International Conference on Machine Learning, 2018.
Murtaza Dalal, Deepak Pathak, and Ruslan Salakhutdinov. Accelerating robotic reinforce-
ment learning via parameterized action primitives. In Neural Information Processing Systems,
2021.
Christian Daniel, Herke Van Hoof, Jan Peters, and Gerhard Neumann. Probabilistic inference
for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357,
2016.
Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeck-
peper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot
learning. In Conference on Robot Learning, 2019.
Brian Delhaisse, Leonel Rozo, and Darwin G Caldwell. Pyrobolearn: A python framework
for robot learning practitioners. In Conference on Robot Learning, 2019.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern
Recognition, 2009.
Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec
Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines.
https://github.com/openai/baselines, 2017.
Nat Dilokthanakul, Christos Kaplanis, Nick Pawlowski, and Murray Shanahan. Feature
control as intrinsic motivation for hierarchical reinforcement learning. IEEE transactions on
neural networks and learning systems, 30(11):3409–3418, 2019.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp.
In International Conference on Learning Representations, 2017.
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun.
CARLA: An open urban driving simulator. In Conference on Robot Learning, pages 1–16,
2017.
Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya
Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Ad-
vances in Neural Information Processing Systems, pages 1087–1098, 2017.
Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you
need: Learning skills without a reward function. In International Conference on Learning
Representations, 2019. URL https://openreview.net/forum?id=SJx63jRqFm.
Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between
generative adversarial networks, inverse reinforcement learning, and energy-based models.
arXiv preprint arXiv:1611.03852, 2016a.
Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel.
Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on
Robotics and Automation, pages 512–519. IEEE, 2016b.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In International Conference on Machine Learning, pages
1126–1135, 2017.
Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon
Whiteson. Counterfactual multi-agent policy gradients. In Association for the Advancement
of Artificial Intelligence, 2018.
Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning
shared hierarchies. In International Conference on Learning Representations, 2018. URL
https://openreview.net/forum?id=SyX0IeWAW.
Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse
reinforcement learning. In International Conference on Learning Representations, 2018.
URL https://openreview.net/forum?id=rkHywl-A-.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets
for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning
without exploration. In International Conference on Machine Learning, pages 2052–2062,
2019.
Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-
and-conquer reinforcement learning. In International Conference on Learning Representa-
tions, 2018. URL https://openreview.net/forum?id=rJwelMbR-.
Anirudh Goyal, Shagun Sodhani, Jonathan Binas, Xue Bin Peng, Sergey Levine, and Yoshua
Bengio. Reinforcement learning with competitive ensembles of information-constrained
primitives. In International Conference on Learning Representations, 2020. URL
https://openreview.net/forum?id=ryxgJTEYDr.
Aditya Gudimella, Ross Story, Matineh Shaker, Ruofan Kong, Matthew Brown, Victor
Shnayder, and Marcos Campos. Deep reinforcement learning for dexterous manipulation
with concept networks. arXiv preprint arXiv:1709.06977, 2017.
Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Re-
lay policy learning: Solving long-horizon tasks via imitation and reinforcement learning.
Conference on Robot Learning, 2019.
Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control
using deep reinforcement learning. In International Conference on Autonomous Agents and
Multi-Agent Systems, pages 66–83, 2017.
David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
Huy Ha and Shuran Song. Flingbot: The unreasonable effectiveness of dynamic manipulation
for cloth unfolding. In Conference on Robot Learning, pages 24–33, 2021.
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning
with deep energy-based policies. In International Conference on Machine Learning, pages
1352–1361, 2017.
Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey
Levine. Composable deep reinforcement learning for robotic manipulation. In IEEE Interna-
tional Conference on Robotics and Automation, pages 6244–6251, 2018a.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. In International
Conference on Machine Learning, pages 1856–1865, 2018b.
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan,
Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms
and applications. arXiv preprint arXiv:1812.05905, 2018c.
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to con-
trol: Learning behaviors by latent imagination. In International Conference on Learning
Representations, 2019.
Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model
predictive control. In International Conference on Machine Learning, 2022.
Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller.
Learning an embedding space for transferable robot skills. In International Conference on
Learning Representations, 2018.
Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval
Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin A. Riedmiller, and David Silver.
Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286,
2017.
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Hor-
gan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations.
In Association for the Advancement of Artificial Intelligence, 2018.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew
Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual
concepts with a constrained variational framework. In International Conference on Learning
Representations, 2017.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in
Neural Information Processing Systems, pages 4565–4573, 2016.
Nicola J Hodges, A Mark Williams, Spencer J Hayes, and Gavin Breslin. What is modelled
during observational learning? Journal of sports sciences, 25(5):531–545, 2007.
Wenlong Huang, Igor Mordatch, and Deepak Pathak. One policy to control them all:
Shared modular policies for agent-agnostic control. In International Conference on Machine
Learning, pages 4455–4464. PMLR, 2020.
Divye Jain, Andrew Li, Shivam Singhal, Aravind Rajeswaran, Vikash Kumar, and Emanuel
Todorov. Learning deep visuomotor policies for dexterous hand manipulation. In IEEE
International Conference on Robotics and Automation, pages 3636–3643. IEEE, 2019.
Nick Jakobi, Phil Husbands, and Inman Harvey. Noise and the reality gap: The use of
simulation in evolutionary robotics. In European Conference on Artificial Life, pages 704–
720. Springer, 1995.
Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. Rlbench: The
robot learning benchmark & learning environment. IEEE Robotics and Automation Letters,
2020.
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey
Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning.
In Conference on Robot Learning, pages 991–1002, 2021.
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model:
Model-based policy optimization. In Neural Information Processing Systems, 2019.
Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent
cooperation. In Neural Information Processing Systems, pages 7254–7264, 2018.
Nicholas K Jong, Todd Hester, and Peter Stone. The utility of temporal abstraction in
reinforcement learning. In International Conference on Autonomous Agents and Multi-Agent
Systems, pages 299–306. Citeseer, 2008.
Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mat-
tar, and Danny Lange. Unity: A general platform for intelligent agents. arXiv preprint
arXiv:1809.02627, 2018.
Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang,
Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable
deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot
Learning, pages 651–673, 2018.
Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In
International Conference on Machine Learning, volume 80, pages 2469–2478, 2018.
H-J Kang and Robert A Freeman. Joint torque optimization of redundant manipulators
via the null space damping method. In IEEE International Conference on Robotics and
Automation, pages 520–521, 1992.
Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski.
ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE
Conference on Computational Intelligence and Games, pages 341–348, 2016.
Oussama Khatib. A unified approach for motion and force control of robot manipulators:
The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53,
1987.
Kenneth Kimble, Karl Van Wyk, Joe Falco, Elena Messina, Yu Sun, Mizuho Shibata, Wataru
Uemura, and Yasuyoshi Yokokohji. Benchmarking protocols for evaluating small parts
robotic assembly systems. IEEE robotics and automation letters, 5(2):883–889, 2020.
Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
International Conference on Learning Representations, 2015.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International
Conference on Learning Representations, 2014.
Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Alvaro Sanchez-Gonzalez, Edward
Grefenstette, Pushmeet Kohli, and Peter Battaglia. Compile: Compositional imitation
learning and execution. In International Conference on Machine Learning, 2019.
Ross A Knepper, Todd Layton, John Romanishin, and Daniela Rus. Ikeabot: An autonomous
multi-robot coordinated furniture assembly system. In IEEE International Conference on
Robotics and Automation, pages 855–862, 2013.
Jens Kober, Katharina Mülling, Oliver Krömer, Christoph H Lampert, Bernhard Schölkopf,
and Jan Peters. Movement templates for learning of hitting and batting. In IEEE International
Conference on Robotics and Automation, pages 853–858, 2010.
Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi.
Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474,
2017.
George Konidaris and Andrew Barto. Skill discovery in continuous reinforcement learning
domains using skill chaining. In Advances in Neural Information Processing Systems,
volume 22, pages 1015–1023, 2009.
George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Robot learning
from demonstration by constructing skill trees. The International Journal of Robotics
Research, 31(3):360–375, 2012.
Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan
Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in
adversarial imitation learning. In International Conference on Learning Representations,
2019.
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit
q-learning. In International Conference on Learning Representations, 2022. URL
https://openreview.net/forum?id=68n2s9ZJWF8.
Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical
deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In
Advances in Neural Information Processing Systems, pages 3675–3683, 2016.
Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation
for legged robots. In Robotics: Science and Systems, 2021.
Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing
off-policy q-learning via bootstrapping error reduction. In Neural Information Processing
Systems, pages 11784–11794, 2019.
Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas.
Reinforcement learning with augmented data. In Neural Information Processing Systems,
2020.
Hoang Le, Nan Jiang, Alekh Agarwal, Miroslav Dudik, Yisong Yue, and Hal Daumé III.
Hierarchical imitation and reinforcement learning. In International Conference on Machine
Learning, 2018.
Alex X Lee, Coline Manon Devin, Yuxiang Zhou, Thomas Lampe, Konstantinos Bousmalis,
Jost Tobias Springenberg, Arunkumar Byravan, Abbas Abdolmaleki, Nimrod Gileadi, David
Khosid, et al. Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In
Conference on Robot Learning, 2021a.
Youngwoon Lee, Edward S. Hu, Zhengyu Yang, and Joseph J. Lim. To follow or not to
follow: Selective imitation learning from observations. In Conference on Robot Learning,
2019a.
Youngwoon Lee, Shao-Hua Sun, Sriram Somasundaram, Edward S. Hu, and Joseph J.
Lim. Composing complex skills by learning transition policies. In International Confer-
ence on Learning Representations, 2019b. URL https://openreview.net/forum?id=
rygrBhC5tQ.
Youngwoon Lee, Jingyun Yang, and Joseph J. Lim. Learning to coordinate manipulation
skills via skill behavior diversification. In International Conference on Learning Representa-
tions, 2020.
Youngwoon Lee, Edward S Hu, and Joseph J Lim. IKEA furniture assembly environment
for long-horizon complex manipulation tasks. In IEEE International Conference on Robotics
and Automation, 2021b. URL https://clvrai.com/furniture.
Youngwoon Lee, Joseph J. Lim, Anima Anandkumar, and Yuke Zhu. Adversarial skill
chaining for long-horizon robot manipulation via terminal state regularization. In Conference
on Robot Learning, 2021c.
Youngwoon Lee, Andrew Szot, Shao-Hua Sun, and Joseph J. Lim. Generalizable imitation
learning from observation via inferring goal proximity. In Neural Information Processing
Systems, 2021d.
Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence.
Minds and machines, 17(4):391–444, 2007.
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and
review. arXiv preprint arXiv:1805.00909, 2018.
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep
visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning:
Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical actor-critic. arXiv preprint
arXiv:1712.00948, 2017.
Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level
hierarchies with hindsight. In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?id=ryzECoAcY7.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.
In International Conference on Learning Representations, 2016.
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao,
and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International
Conference on Learning Representations, 2020.
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch.
Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in
Neural Information Processing Systems, pages 6379–6390, 2017.
Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch.
Plan online, learn offline: Efficient learning and exploration via model-based control. In
International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Byey7n05FQ.
Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning
with skill-space planning. In International Conference on Learning Representations, 2021a.
Yuchen Lu, Yikang Shen, Siyuan Zhou, Aaron Courville, Joshua B Tenenbaum, and Chuang
Gan. Learning task decomposition with ordered memory policy network. In International
Conference on Learning Representations, 2021b.
Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine,
and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning,
pages 1113–1132. PMLR, 2020.
Pattie Maes and Rodney A Brooks. Learning to coordinate behaviors. In Association for the
Advancement of Artificial Intelligence, volume 90, pages 796–802, 1990.
Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian
Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. Roboturk:
A crowdsourcing platform for robotic skill learning through imitation. In Conference on
Robot Learning, pages 879–893, 2018.
Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei-Fei, Animesh Garg, and Dieter Fox. Iris:
Implicit reinforcement without interaction at scale for learning control from offline robot
manipulation data. In IEEE International Conference on Robotics and Automation, 2020a.
Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Silvio Savarese, and Li Fei-Fei. Gti:
Learning to generalize across long-horizon tasks from human demonstrations. In Robotics:
Science and Systems, 2020b.
Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans,
Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv
Batra. Habitat: A platform for embodied ai research. In IEEE International Conference on
Computer Vision, 2019.
Jiayuan Mao, Honghua Dong, and Joseph J. Lim. Universal agent for disentangling envi-
ronments and tasks. In International Conference on Learning Representations, 2018. URL
https://openreview.net/forum?id=B1mvVm-C-.
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smol-
ley. Least squares generative adversarial networks. In IEEE International Conference on
Computer Vision, pages 2794–2802, 2017.
Jarryd Martin, S Suraj Narayanan, Tom Everitt, and Marcus Hutter. Count-based exploration
in feature space for reinforcement learning. In Association for the Advancement of Artificial
Intelligence, 2017.
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark
for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE
Robotics and Automation Letters, 2022.
Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak.
Discovering and achieving goals via world models. In Neural Information Processing
Systems, 2021.
Josh Merel, Yuval Tassa, Dhruva TB, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg
Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial
imitation. arXiv preprint arXiv:1707.02201, 2017.
Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala,
Nicolas Heess, and Greg Wayne. Hierarchical visuomotor control of humanoids. In Interna-
tional Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=BJfYvo09Y7.
Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne,
Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid
control. In International Conference on Learning Representations, 2019b.
Josh Merel, Saran Tunyasuvunakool, Arun Ahuja, Yuval Tassa, Leonard Hasenclever,
Vu Pham, Tom Erez, Greg Wayne, and Nicolas Heess. Catch & carry: Reusable neu-
ral controllers for vision-guided whole-body tasks. ACM Transactions on Graphics (TOG),
2020.
Takahiro Miki, Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco
Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science
Robotics, 7(62):eabk2822, 2022.
Kaichun Mo, Haoxiang Li, Zhe Lin, and Joon-Young Lee. The adobeindoornav dataset:
Towards deep reinforcement learning based real-world indoor robot visual navigation. arXiv
preprint arXiv:1802.08824, 2018.
Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and
generalize striking movements in robot table tennis. The International Journal of Robotics
Research, 32(3):263–279, 2013.
Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchi-
cal reinforcement learning. In Advances in Neural Information Processing Systems, pages
3303–3313, 2018.
Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multi-agent
manipulation via locomotion using hierarchical sim2real. In Conference on Robot Learning,
2019a.
Ofir Nachum, Haoran Tang, Xingyu Lu, Shixiang Gu, Honglak Lee, and Sergey Levine.
Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint
arXiv:1909.10618, 2019b.
Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel.
Overcoming exploration in reinforcement learning with demonstrations. In IEEE Interna-
tional Conference on Robotics and Automation, pages 6292–6299, 2018.
Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online
reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A
universal visual representation for robot manipulation. In Conference on Robot Learning,
2022.
Yashraj Narang, Kier Storey, Iretiayo Akinola, Miles Macklin, Philipp Reist, Lukasz Wawrzy-
niak, Yunrong Guo, Adam Moravanszky, Gavriel State, Michelle Lu, et al. Factory: Fast
contact for robotic assembly. In Robotics: Science and Systems, 2022.
Soroush Nasiriany, Vitchyr H. Pong, Ashvin Nair, Alexander Khazatsky, Glen Berseth, and
Sergey Levine. Disco rl: Distribution-conditioned reinforcement learning for general-purpose
policies. In IEEE International Conference on Robotics and Automation, pages 6635–6641,
2021.
Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In
International Conference on Machine Learning, volume 1, page 2, 2000.
Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transfor-
mations: Theory and application to reward shaping. In International Conference on Machine
Learning, pages 278–287, 1999.
Scott Niekum, Sachin Chitta, Andrew G Barto, Bhaskara Marthi, and Sarah Osentoski.
Incremental semantically grounded learning from demonstration. In Robotics: Science and
Systems, 2013.
Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generaliza-
tion with multi-task deep reinforcement learning. In International Conference on Learning
Representations, pages 2661–2670, 2017.
Edwin Olson. Apriltag: A robust and flexible visual fiducial system. In IEEE International
Conference on Robotics and Automation, pages 3400–3407, 2011.
OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob
McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al.
Learning dexterous in-hand manipulation. The International Journal of Robotics Research,
39(1):3–20, 2020.
Seohong Park, Jongwook Choi, Jaekyeom Kim, Honglak Lee, and Gunhee Kim. Lipschitz-
constrained unsupervised skill discovery. In International Conference on Learning Represen-
tations, 2021.
Peter Pastor, Heiko Hoffmann, Tamim Asfour, and Stefan Schaal. Learning and generaliza-
tion of motor skills by learning from demonstration. In IEEE International Conference on
Robotics and Automation, pages 763–768, 2009.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary
DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differenti-
ation in PyTorch. In NIPS Autodiff Workshop, 2017.
Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven
exploration by self-supervised prediction. In International Conference on Machine Learning,
2017.
Deepak Pathak*, Parsa Mahmoudieh*, Michael Luo*, Pulkit Agrawal*, Dian Chen, Fred
Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, and Trevor Darrell. Zero-shot
visual imitation. In International Conference on Learning Representations, 2018.
Deepak Pathak, Christopher Lu, Trevor Darrell, Phillip Isola, and Alexei A Efros. Learning
to control self-assembling morphologies: a study of generalization via modularity. In Neural
Information Processing Systems, volume 32, 2019.
Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun
Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination
in learning to play starcraft combat games. arXiv preprint arXiv:1703.10069, 2017a.
Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. Deeploco: Dynamic
locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on
Graphics (TOG), 36(4):41, 2017b.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic:
Example-guided deep reinforcement learning of physics-based character skills. ACM Trans-
actions on Graphics (TOG), 37(4):1–14, 2018.
Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, and Sergey Levine. Mcp:
Learning composable hierarchical control with multiplicative compositional policies. In
Neural Information Processing Systems, 2019.
Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Ad-
versarial motion priors for stylized physics-based character control. ACM Transactions on
Graphics (TOG), 40(4):1–20, 2021.
Karl Pertsch, Youngwoon Lee, and Joseph J. Lim. Accelerating reinforcement learning with
learned skill priors. In Conference on Robot Learning, 2020a.
Karl Pertsch, Oleh Rybkin, Jingyun Yang, Shenghao Zhou, Konstantinos Derpanis, Kostas
Daniilidis, Joseph Lim, and Andrew Jaegle. Keyframing the future: Keyframe discovery
for visual prediction and planning. In Learning for Dynamics and Control, pages 969–979.
PMLR, 2020b.
Karl Pertsch, Youngwoon Lee, Yue Wu, and Joseph J. Lim. Demonstration-guided reinforce-
ment learning with learned skills. In Conference on Robot Learning, 2021.
Marc Pickett and Andrew G Barto. Policyblocks: An algorithm for creating useful macro-
actions in reinforcement learning. In International Conference on Machine Learning, vol-
ume 19, pages 506–513, 2002.
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances
in Neural Information Processing Systems, pages 305–313, 1989.
Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation.
Neural computation, 3(1):88–97, 1991.
Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio
Torralba. Virtualhome: Simulating household activities via programs. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018.
Ahmed H. Qureshi, Jacob J. Johnson, Yuzhe Qin, Taylor Henderson, Byron Boots, and
Michael C. Yip. Composing task-agnostic policies with deep reinforcement learning. In
International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1ezFREtwH.
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman,
Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep
reinforcement learning and demonstrations. In Robotics: Science and Systems, 2018.
Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient
off-policy meta-reinforcement learning via probabilistic context variables. In International
Conference on Machine Learning, 2019.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows.
In International Conference on Machine Learning, 2015.
Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom
Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving
sparse reward tasks from scratch. In International conference on machine learning, pages
4344–4353. PMLR, 2018.
David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou.
Recogym: A reinforcement learning environment for the problem of product recommendation
in online advertising. arXiv preprint arXiv:1808.00720, 2018.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and
structured prediction to no-regret online learning. In International Conference on Artificial
Intelligence and Statistics, pages 627–635, 2011.
Reuven Y Rubinstein. Optimization of computer simulation models with rare events. Euro-
pean Journal of Operational Research, 99(1):89–112, 1997.
Stefan Schaal. Learning from demonstration. In Advances in Neural Information Processing
Systems, pages 1040–1046, 1997.
Stefan Schaal. Dynamic movement primitives-a framework for motor control in humans and
humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer,
2006.
Stefan Schaal, Jan Peters, Jun Nakanishi, and Auke Ijspeert. Learning movement primitives.
In Robotics research. the eleventh international symposium, pages 561–572. Springer, 2005.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay.
In International Conference on Learning Representations, 2016.
Jürgen Schmidhuber. Towards compositional learning with dynamic neural networks. Inst.
für Informatik, 1990.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust
region policy optimization. In International Conference on Machine Learning, pages 1889–
1897, 2015.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. In International
Conference on Learning Representations, 2016.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak
Pathak. Planning to explore via self-supervised world models. In International Conference
on Machine Learning, pages 8583–8592. PMLR, 2020.
Dhruv Shah, Alexander T Toshev, Sergey Levine, and brian ichter. Value function
spaces: Skill-centric state abstractions for long-horizon reasoning. In International
Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vgqS1vkkCbE.
Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual
and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017.
Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational
inference. In International Conference on Machine Learning, 2020.
Tanmay Shankar, Shubham Tulsiani, Lerrel Pinto, and Abhinav Gupta. Discovering mo-
tor programs by recomposing demonstrations. In International Conference on Learning
Representations, 2020.
Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-
aware unsupervised discovery of skills. In International Conference on Learning Represen-
tations, 2020.
Lucy Xiaoyang Shi, Joseph J. Lim, and Youngwoon Lee. Skill-based model-based reinforce-
ment learning. In Conference on Robot Learning, 2022.
Kyriacos Shiarlis, Markus Wulfmeier, Sasha Salter, Shimon Whiteson, and Ingmar Posner.
Taco: Learning task decomposition via temporal alignment for control. In International
Conference on Machine Learning, pages 4654–4663. PMLR, 2018.
Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael
Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing
what worked: Behavior modelling priors for offline reinforcement learning. In International
Conference on Learning Representations, 2020.
Avi Singh, Albert Yu, Jonathan Yang, Jesse Zhang, Aviral Kumar, and Sergey Levine. Cog:
Connecting new skills to past experience with offline reinforcement learning. In Conference
on Robot Learning, 2020.
Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine.
Parrot: Data-driven behavioral priors for reinforcement learning. In International Conference
on Learning Representations, 2021.
Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation
learning from reinforcement learning. In International Conference on Machine Learning,
pages 9870–9879. PMLR, 2021.
Francisco Suárez-Ruiz and Quang-Cuong Pham. A framework for fine robotic assembly. In
IEEE International Conference on Robotics and Automation, pages 421–426, 2016.
Francisco Suárez-Ruiz, Xian Zhou, and Quang-Cuong Pham. Can robots assemble an IKEA
chair? Science Robotics, 3(17), 2018.
Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with back-
propagation. In Advances in Neural Information Processing Systems, pages 2244–2252,
2016.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zam-
baldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al.
Value-decomposition networks for cooperative multi-agent learning based on team reward. In
International Conference on Autonomous Agents and Multi-Agent Systems, pages 2085–2087,
2018.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press,
2018.
Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A
framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112
(1-2):181–211, 1999.
Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. PhD thesis,
University of Massachusetts Amherst, 1984.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas,
David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and
Martin A. Riedmiller. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
Yuval Tassa, Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez,
Josh Merel, Tom Erez, Timothy Lillicrap, and Nicolas Heess. dm_control: Software and
tasks for continuous control. arXiv preprint arXiv:2006.12983, 2020.
Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A
survey. Journal of Machine Learning Research, 10(7), 2009.
Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning. In
Advances in Neural Information Processing Systems, 1995.
Tianmin Shu, Caiming Xiong, and Richard Socher. Hierarchical and interpretable skill acquisition
in multi-task reinforcement learning. International Conference on Learning Representations,
2018. URL https://openreview.net/forum?id=SJJQVZW0b.
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based
control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages
5026–5033, 2012.
Yusuke Urakami, Alec Hodgkinson, Casey Carlin, Randall Leu, Luca Rigazio, and Pieter
Abbeel. Doorgym: A scalable door opening environment and baseline agent. arXiv preprint
arXiv:1908.01887, 2019.
Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas
Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations
for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint
arXiv:1707.08817, 2017.
Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg,
David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement
learning. In International Conference on Machine Learning, pages 3540–3549. PMLR,
2017.
Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets,
Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al.
Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782,
2017.
Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. Nervenet: Learning structured
policy with graph neural networks. In International Conference on Learning Representations,
2018.
Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas
Heess. Robust imitation of diverse behaviors. In Advances in Neural Information Processing
Systems, 2017.
Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral
control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149,
2015.
Robert S Woodworth and E. L. Thorndike. The influence of improvement in one mental
function upon the efficiency of other functions. (I). Psychological Review, 8(3):247, 1901.
Bohan Wu, Suraj Nair, Li Fei-Fei, and Chelsea Finn. Example-driven model-based reinforce-
ment learning for solving long-horizon visuomotor tasks. In Conference on Robot Learning,
2021.
Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement
learning. arXiv preprint arXiv:1911.11361, 2019.
Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese.
Gibson env: Real-world perception for embodied agents. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 9068–9079, 2018.
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu,
Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su.
SAPIEN: A simulated part-based interactive environment. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 11097–11107, 2020.
Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, and Florian Shkurti.
Latent skill planning for exploration and transfer. In International Conference on Learning
Representations, 2020.
Danfei Xu, Suraj Nair, Yuke Zhu, Julian Gao, Animesh Garg, Li Fei-Fei, and Silvio Savarese.
Neural task programming: Learning to generalize across hierarchical tasks. In IEEE Interna-
tional Conference on Robotics and Automation, pages 1–8. IEEE, 2018.
Jun Yamada, Youngwoon Lee, Gautam Salhotra, Karl Pertsch, Max Pflueger, Gaurav S.
Sukhatme, Joseph J. Lim, and Peter Englert. Motion planner augmented reinforcement
learning for obstructed environments. In Conference on Robot Learning, 2020.
Brian Yang, Jesse Zhang, Vitchyr Pong, Sergey Levine, and Dinesh Jayaraman. Replab:
A reproducible low-cost arm benchmark platform for robotic learning. arXiv preprint
arXiv:1905.07447, 2019.
Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regu-
larizing deep reinforcement learning from pixels. In International Conference on Learning
Representations, 2021. URL https://openreview.net/forum?id=GY6-6sTvGaf.
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn,
and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta
reinforcement learning. In Conference on Robot Learning, 2019.
Grace Zhang, Linghan Zhong, Youngwoon Lee, and Joseph J. Lim. Policy transfer across
visual and dynamics domain gaps via iterative grounding. In Robotics: Science and Systems,
2021.
Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool,
János Kramár, Raia Hadsell, Nando de Freitas, and Nicolas Heess. Reinforcement
and imitation learning for diverse visuomotor skills. In Robotics: Science and Systems, 2018.
Yuke Zhu, Josiah Wong, Ajay Mandlekar, and Roberto Martín-Martín. robosuite: A modular
simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293,
2020.
Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum
causal entropy. Carnegie Mellon University, 2010.
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy
inverse reinforcement learning. In Association for the Advancement of Artificial Intelligence,
volume 8, pages 1433–1438, 2008.
Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gomez Colmenarej, David Budden,
Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. Task-relevant adversarial
imitation learning. In Conference on Robot Learning, 2020.
Appendices
Appendix A
Real-World Furniture Assembly Environment
A Environment Details
A.1 Environment Setup
Figure A.1: (a) Environment overview. (b) Camera pose. (c) Marker and obstacle positions.
As can be seen in Figure A.1c, we attach the “base” AprilTag (Olson, 2011) markers to the
center of the table, 23 cm away from the robot base. These base tags define the origin of the world
coordinate frame. The front-view camera, which provides a visual observation for a learning agent,
can be adjusted to the exact camera pose using a GUI tool provided in our code. The left and right
cameras are only used to improve AprilTag detection in the reset and reward functions; thus, their
poses can vary across environments. To reduce background discrepancy, we surround the table with
a green cloth. While collecting data, we varied lighting positions, brightness, and color using two
ring lights.
A.2 Multi-Camera Pose Estimation using AprilTag
Figure A.2: For easy and accurate marker placement, all 3D models have AprilTag placeholders on
their surfaces with corresponding AprilTag IDs.
AprilTag pose estimation from a single camera is often inaccurate and can even fail to
detect object poses when markers are occluded or viewed at a steep angle. To compensate for
detection failures and inaccurate pose estimates, we run three AprilTag detection modules in
parallel, one for each camera (front, left, and right, as illustrated in Figure 3.4a), and aggregate
the estimated poses from the three cameras. As a result, pose estimates can be acquired faster
(10 Hz) and more accurately. To handle noise in AprilTag pose estimation (e.g., axis flipping),
we filter out outliers that deviate from the last five detection results of each camera.
For reliable and accurate pose estimation, we attach multiple markers to each furniture part.
For each furniture part, the poses of the detected markers are converted to the canonical furniture
part pose, where the position is defined as its center of mass and the orientation is the same as the
orientation of the marker with the smallest ID. Then, the furniture pose is estimated by averaging
the canonical poses computed from all detected markers.
In summary, we improve the stability and accuracy of pose estimation by (1) removing outliers,
(2) averaging the estimated poses of multiple markers on each furniture part, and (3) averaging
the estimated poses from the three cameras.
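As a concrete illustration, the following is a minimal sketch of this aggregation step, assuming per-camera detections have already been converted to canonical part poses; the class and function names, the position-based outlier test, and the simple sign-aligned quaternion averaging are illustrative assumptions rather than the exact implementation.

import numpy as np
from collections import deque

class PartPoseAggregator:
    # Aggregates AprilTag detections of one furniture part from several cameras.
    # Outliers are rejected against the last five detections of each camera, and
    # the remaining canonical poses are averaged, as described in the text.

    def __init__(self, history_len=5, pos_tol=0.05):
        self.history = {}            # camera id -> deque of recent positions
        self.history_len = history_len
        self.pos_tol = pos_tol       # meters; assumed outlier threshold

    def _is_outlier(self, cam, pos):
        hist = self.history.setdefault(cam, deque(maxlen=self.history_len))
        outlier = (len(hist) == self.history_len and
                   np.linalg.norm(pos - np.mean(hist, axis=0)) > self.pos_tol)
        if not outlier:
            hist.append(pos)
        return outlier

    def aggregate(self, detections):
        # detections: list of (camera_id, position (3,), quaternion (4,)) tuples.
        positions, quats = [], []
        for cam, pos, quat in detections:
            pos, quat = np.asarray(pos, float), np.asarray(quat, float)
            if self._is_outlier(cam, pos):
                continue
            if quats and np.dot(quat, quats[0]) < 0:
                quat = -quat             # q and -q represent the same rotation
            positions.append(pos)
            quats.append(quat)
        if not positions:
            return None                  # no reliable detection at this step
        mean_pos = np.mean(positions, axis=0)
        mean_quat = np.mean(quats, axis=0)
        mean_quat /= np.linalg.norm(mean_quat)   # renormalize the averaged quaternion
        return mean_pos, mean_quat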
For accurate and easy marker placement, all 3D models have placeholders of markers with their
AprilTag IDs, as shown in Figure A.2.
A.3 Reset Function
To make our benchmark easy to use, we provide an automatic reset function that can initialize
the environment from the end of the previous episode such that a human operator does not need to
manually reset the environment every episode.
The reset function iterates through each part of the furniture model and checks whether all parts
are graspable (e.g., the gripper cannot grasp a flipped tabletop due to its size and orientation).
The robot then clears out the location where each piece will be reset and places the piece back in
its proper starting pose using the furniture part poses estimated from AprilTag detection. Although
our hard-coded reset function cannot cover all possible cases, e.g., a furniture part falling off the
workspace, it is still useful as it eliminates human intervention in the most common cases. We
describe our environment reset algorithm in Algorithm 4.
A.4 Reward Function
The reward function is another critical component for reinforcement learning approaches as
well as for evaluating imitation learning algorithms. Together with the automatic reset function
(Section A.3), our reward function allows autonomous training and evaluation without a human
annotator.
For all furniture models, we provide a reward of +1 for assembling a pair of furniture parts.
Thus, a successful furniture assembly achieves a total reward of N − 1, where N is the number
of furniture parts. Our environment computes this reward based on the relative poses
between parts (the part poses are estimated from AprilTag). We pre-define ground-truth relative
Algorithm 4 RESET FUNCTION
Require: Reset pose of all furniture parts r, real-time part poses p
1: for each furniture part i do
2:     if not DETECTED(p_i) or not IS RESET FEASIBLE(p_i) then
3:         call human operator for manual reset
4:     end if
5:     if not IS IN RESET POSE(p_i, r_i) then
6:         EMPTY RESET PLACE(i, r_i, p)    ▷ Empty the place to avoid collisions
7:         PICK AND PLACE(i, r_i)          ▷ Pick up the part i, place it in the reset pose r_i
8:         ADJUST(i, r_i)                  ▷ Reorient part i to match the reset pose r_i
9:     end if
10: end for
11: for each furniture part i do
12:     if not IS IN RESET POSE(p_i, r_i) then
13:         call human operator for manual reset
14:     end if
15: end for
Algorithm 5 EMPTY RESET PLACE(i, r_i, p)
Require: target furniture part i, target reset pose r_i, real-time part poses p
1: for each furniture part j other than i do
2:     if IS CLOSE(p_j, r_i) then
3:         t_j ← GET TEMP POSE()      ▷ Find a temporary place for the part j
4:         PICK AND PLACE(j, t_j)     ▷ Move the part j to t_j
5:     end if
6: end for
poses between pairs of connected parts. We consider two parts assembled when their relative
pose remains within a threshold (cosine similarity larger than 0.95 and distance smaller than 7 mm)
for five consecutive timesteps. Each pair of parts is rewarded only once.
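To make this check concrete, here is a minimal sketch of such a reward computation, assuming target relative poses are stored per pair of connected parts; the class name, the quaternion-based cosine similarity, and the data layout are illustrative assumptions, not the exact implementation.

import numpy as np

class AssemblyRewardTracker:
    # Gives +1 when a pair of parts stays within the relative-pose thresholds
    # for five consecutive timesteps; each pair is rewarded only once.

    def __init__(self, target_rel_poses, cos_thresh=0.95, dist_thresh=0.007, hold_steps=5):
        self.targets = target_rel_poses    # {(part_a, part_b): (rel_pos (3,), rel_quat (4,))}
        self.cos_thresh = cos_thresh
        self.dist_thresh = dist_thresh     # 7 mm
        self.hold_steps = hold_steps
        self.counters = {pair: 0 for pair in target_rel_poses}
        self.assembled = set()

    def step(self, rel_poses):
        # rel_poses: current relative poses estimated from AprilTag detections,
        # in the same {(part_a, part_b): (rel_pos, rel_quat)} format as the targets.
        reward = 0.0
        for pair, (tgt_pos, tgt_quat) in self.targets.items():
            if pair in self.assembled:
                continue
            rel_pos, rel_quat = rel_poses[pair]
            dist = np.linalg.norm(np.asarray(rel_pos) - np.asarray(tgt_pos))
            cos_sim = abs(float(np.dot(rel_quat, tgt_quat)))   # sign-invariant quaternion similarity
            if dist < self.dist_thresh and cos_sim > self.cos_thresh:
                self.counters[pair] += 1
            else:
                self.counters[pair] = 0
            if self.counters[pair] >= self.hold_steps:
                self.assembled.add(pair)
                reward += 1.0
        return reward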
Figure A.4 illustrates example assembly procedures for all furniture models; the red boxes
indicate that two parts are assembled, i.e., when the +1 reward is provided. The assembly order shown
in Figure A.4 is not the only solution, so the agent can achieve rewards with different assembly
orders as long as it does not violate the strict ordering constraints of assembly (e.g., the lamp bulb
must be assembled before placing the hood).
Phase The proposed furniture assembly task is complex and long-horizon; even obtaining the first
reward requires 416 demonstrator actions on average. Therefore, instead of benchmarking with
Algorithm 6 PICK AND PLACE(i, g)
Require: target furniture part i, goal pose g
1: if not IS REACHABLE(p_i) then
2:     call human operator for manual reset
3: end if
4: h ← FIND GRASP POSE(i, p_i)    ▷ Compute where to grasp given the pose of part i and the robot's joint limits
5: PICK UP(h)    ▷ Move the end-effector to h, close the gripper, and lift the part i
6: PLACE(g)      ▷ Move the end-effector to g and release the gripper
reward, we propose to measure more fine-grained progress, which we refer to as a “phase”. In
contrast to the automated reward function, the phase measure is evaluated by a human operator. Note
that this phase information is not used for training an agent, but only for evaluating and analyzing a
learned agent. Detailed phase definitions can be found in Figure A.4.
A.5 Sensory Inputs
The available sensory inputs from our environment are listed below:
Table A.1: Sensory inputs available in our environment.

Name                               Shape           Unit
Front-view image                   (224, 224, 3)   pixel value (0-255)
Depth image                        (224, 224)      depth in mm (0-65536)
End-effector position              (3,)            m
End-effector orientation           (4,)            quaternion
End-effector linear velocity       (3,)            m/s
End-effector rotational velocity   (3,)            rad/s
Joint positions                    (7,)            rad
Joint velocities                   (7,)            rad/s
Joint torques                      (7,)            N·m
Gripper width                      (1,)            m
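For reference, an observation specification mirroring Table A.1 could be written as follows; the dictionary keys, the gym-style spaces, and the gripper-width bound are illustrative assumptions rather than the environment's actual API.

import numpy as np
from gym import spaces

# Illustrative observation specification matching Table A.1 (key names and
# bounds are assumptions, not the exact environment interface).
observation_space = spaces.Dict({
    "color_image": spaces.Box(0, 255, shape=(224, 224, 3), dtype=np.uint8),
    "depth_image": spaces.Box(0, 65535, shape=(224, 224), dtype=np.uint16),     # depth in mm
    "ee_pos": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),        # m
    "ee_quat": spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32),             # quaternion
    "ee_lin_vel": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),    # m/s
    "ee_rot_vel": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),    # rad/s
    "joint_pos": spaces.Box(-np.inf, np.inf, shape=(7,), dtype=np.float32),     # rad
    "joint_vel": spaces.Box(-np.inf, np.inf, shape=(7,), dtype=np.float32),     # rad/s
    "joint_torque": spaces.Box(-np.inf, np.inf, shape=(7,), dtype=np.float32),  # N*m
    "gripper_width": spaces.Box(0.0, 0.08, shape=(1,), dtype=np.float32),       # m (assumed max)
})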
A.6 Action Space
We use an 8D action space, which consists of the change in end-effector position (x, y, z) in meters
(3D), the relative quaternion to the target orientation (4D), and the gripper action (1D). The action
space is bounded between −1 and +1. To make policy learning efficient, we keep the quaternion scalar
part always positive.
Although the gripper action is binary (−1 for opening the gripper and +1 for closing the
gripper), we allow a policy to output a continuous value between −1 and +1. To prevent the gripper
from repeatedly opening and closing, the gripper action is only applied when its absolute
value is greater than a threshold (0.009 in this paper).
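A minimal sketch of how a raw policy output could be mapped to a command under these conventions is given below; the function name, the (x, y, z, w) quaternion ordering, and the string gripper commands are illustrative assumptions, not the environment's actual interface.

import numpy as np

def process_action(action, gripper_threshold=0.009):
    # Split an 8D policy output into a delta end-effector pose and a gripper command.
    # The quaternion part is renormalized with a non-negative scalar component, and
    # the gripper only toggles when the last dimension exceeds the threshold.
    action = np.clip(np.asarray(action, dtype=np.float64), -1.0, 1.0)
    delta_pos = action[:3]                        # change in end-effector position (m)
    quat = action[3:7]
    quat = quat / (np.linalg.norm(quat) + 1e-8)   # relative rotation to the target orientation
    if quat[3] < 0:                               # keep the scalar part positive (assumes x, y, z, w order)
        quat = -quat
    grip = action[7]
    if grip > gripper_threshold:
        gripper_cmd = "close"
    elif grip < -gripper_threshold:
        gripper_cmd = "open"
    else:
        gripper_cmd = None                        # ignore small gripper actions
    return delta_pos, quat, gripper_cmd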
A.7 Hardware Details
Here is the list of products used to build our benchmark system:
• Franka Emika Panda (link) and bench clamp (link)
• 3x Intel RealSense D435 (link)
• Oculus Quest 2 (link)
• 3x Manfrotto Variable Friction Arm with Camera Bracket (link)
• 3x Manfrotto Super Clamp (link)
• Background frame (190 cm × 150 cm, W × H)
• Green cloth (280 cm × 400 cm, W × H)
• 2x LED ring light (supports warm, natural, and cool white colors; brightness adjustable)
• 3x USB-C type 3.1 cables (link)
• Client workstation with AMD Ryzen 9 5950X and one NVIDIA RTX 3090 GPU: for our
environment interface with sensors and an agent
• Server workstation with AMD Ryzen 5 5600X: for real-time robot control
B Furniture Models
Our benchmark features 8 different furniture models, each of which is modeled after an existing
piece of IKEA furniture, as shown in Figure A.3. Due to the limitations of having one robotic arm,
we modify some furniture pieces so that a robot can assemble them with one hand. To make this
benchmark easy-to-use and reproducible, we use 3D-printed furniture models, which are manually
3D modelled using AUTODESK FUSION 360.
The blueprints of furniture models are shown in Figure A.5. Some small or thin parts are
designed thicker than 2.6 cm to afford space for an AprilTag marker. The largest dimension is
smaller than 21 cm, so that our furniture models are compatible with most 3D printers. The potential
grasping regions are at most 6 cm wide, which is smaller than the open gripper width in our setting
(7.5 cm).
In our experiments, the 3D models are printed using a FLASHFORGE GUIDER IIS 3D printer
and white PLA filaments. Printing a small furniture model takes approximately 15 hours, while
large furniture models, such as desk, need two passes of 3D printing, which takes about 21 hours.
Figure A.3: Furniture 3D models: (a) IKEA INGOLF chair, (b) IKEA LAMPAN lamp, (c) IKEA LACK
square table, (d) IKEA LAGKAPTEN desk, (e) IKEA GURSKEN drawer, (f) IKEA HAVSTA cabinet,
(g) IKEA INGATORP round table, (h) IKEA KYRRE stool. (left) IKEA model furniture, (middle) 3D
furniture model, (right) 3D printed furniture model.
(a) Assembly procedure of lamp: 0 initial state; 1 reach base; 2 push it to corner; 3 reach to bulb;
4 pick up bulb; 5 insert bulb; 6 screw bulb; 7 pick up hood; 8 place on the top of base.
(b) Assembly procedure of square table: 0 initial state; 1 reach tabletop; 2 pick up tabletop;
3 place it to corner; 4 pick up leg; 5 insert the leg into screw hole; 6 screw leg; 7 repeat 4-6 with
second leg; 8 repeat 4-6 with third leg; 9 repeat 4-6 with fourth leg.
(c) Assembly procedure of desk: (0) initial state; (1) reach tabletop; (2) pick up tabletop; (3) place it to corner; (4) pick up leg; (5) insert leg into screw hole; (6) screw leg; (7) repeat 4-6 with second leg; (8) repeat 4-6 with third leg; (9) repeat 4-6 with fourth leg.
Figure A.4: Assembly procedures: (a) lamp, (b) square table, (c) desk.
(d) Assembly procedure of drawer: (0) initial state; (1) reach drawer box; (2) push box to corner; (3) pick up container; (4) insert container to box; (5) push container to be fully inserted; (6) repeat 3-5 with another container.
(e) Assembly procedure of cabinet: (0) initial state; (1) reach cabinet box; (2) grasp box; (3) place box to corner; (4) pick up door; (5) insert door; (6) slide in door; (7) repeat 4-6 with another door; (8) grasp cabinet box; (9) make box stand up; (10) grasp cabinet top; (11) insert cabinet top; (12) screw cabinet top.
Figure A.4: Assembly procedures: (d) drawer, (e) cabinet.
(f) Assembly procedure of round table: (0) initial state; (1) reach tabletop; (2) pick up tabletop; (3) place it to corner; (4) pick up round leg; (5) insert it into screw hole; (6) screw round leg; (7) pick up base; (8) insert it into screw hole; (9) screw base.
(g) Assembly procedure of stool: (0) initial state; (1) reach stool seat; (2) push it to corner; (3) pick up leg; (4) insert it into screw hole; (5) screw leg; (6) repeat 3-5 with second leg; (7) repeat 3-5 with third leg.
Figure A.4: Assembly procedures: (f) round table, (g) stool.
(h) Assembly procedure of chair: (0) initial state; (1) reach seat; (2) pick up seat; (3) place seat to corner; (4) pick up leg; (5) insert leg into seat; (6) screw leg; (7) repeat 4-6 with another leg; (8) flip seat; (9) pick up seat back; (10) insert seat back into seat; (11) pick up nut; (12) insert nut into seat; (13) screw nut; (14) repeat 11-13 with another nut.
Figure A.4: Assembly procedures: (h) chair.
Figure A.5: Blueprints of furniture models: (a) lamp, (b) square table, (c) drawer, (d) cabinet, (e) round table, (f) desk, (g) stool, (h) chair. The leftmost column shows the final configuration (top left) and how furniture parts are assembled (bottom right). The remaining columns illustrate the dimensions of all furniture parts.
Figure A.6: Sequences of policy evaluation on desk assembly ((a) BC sequence, (b) IQL sequence) and round table assembly ((c) BC sequence, (d) IQL sequence). In desk assembly, (a) the BC policy can reach the tabletop but fails to grasp it, while (b) the IQL policy can reach the tabletop, pick it up, and place it at the corner; however, after placing it, the robot picks up the tabletop again and drops it. In round table assembly, (c) the BC agent can reach the round table and push the obstacle, while (d) the IQL agent can also grasp the leg but fails to insert it.
Appendix B
Composing Skills via Transition Policies
A Acquiring Primitive Policies
The modular framework proposed in this chapter allows a primitive policy to be any of a
pre-trained neural network, inverse kinematics module, or hard-coded policy. In this chapter, we use
neural networks trained with TRPO (Schulman et al., 2015) on dedicated environments as primitive
policies (see Section C for the details of environments and reward functions). All policy networks
we used consist of 2 layers of 32 hidden units with tanh nonlinearities and predict the mean and standard deviation of a Gaussian distribution over an action space. We trained all primitive policies until the total return converged (up to 10,000 iterations).
Given a state, a primitive policy outputs an action as well as a termination signal indicating
whether the execution is done and if the skill was successfully performed (see Section C for details
on primitive skills and termination conditions).
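For illustration, a primitive policy of this form could be implemented as the following PyTorch sketch. The class and attribute names are ours, and the state-independent log standard deviation is an assumption (a common choice for TRPO-trained policies), not a detail stated in the text.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """2-layer tanh MLP that outputs a diagonal Gaussian over actions."""

    def __init__(self, obs_dim, action_dim, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        # State-independent log standard deviation (assumed).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, obs):
        h = self.body(obs)
        mean = self.mean_head(h)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

# Usage: dist = policy(obs); action = dist.sample()
```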
B Training Details
B.1 Implementation Details
For the TRPO and PPO implementation, we used OpenAI baselines (Dhariwal et al., 2017)
with default hyperparameters including learning rate, KL penalty, and entropy coefficients unless
specified below.
Hyperparameter        Transition policy   Proximity predictor   Primitive policy    TRPO                PPO
Learning rate         1e-4                1e-4                  1e-3 (for critic)   1e-3 (for critic)   1e-4
# Mini-batches        150                 150                   32                  150                 150
Mini-batch size       64                  64                    64                  64                  64
Learning rate decay   no                  no                    no                  no                  linear decay
Table B.1: Hyperparameter values for the transition policy, proximity predictor, and primitive policy as well as the TRPO and PPO baselines.
For all networks, we use the Adam optimizer with a mini-batch size of 64. We use 4 workers for rollouts and parameter updates. The rollout size for each update is 10,000 steps. We limit the maximum length of a transition trajectory to 100.
B.2 Replay Buffers
A success buffer B_S contains states and their proximity to the corresponding initiation set in successful transitions. On the other hand, a failure buffer B_F contains states in failure transitions. Both buffers are FIFO (i.e., new items are added at one end and, once a buffer is full, a corresponding number of items are discarded from the opposite end). For all experiments, we use buffers B_S and B_F with a capacity of one million states.
For efficient training of the proximity predictors, we collect successful trajectories of the primitive skills, which can be sampled during primitive skill training. We run 1,000 episodes for each primitive and put the first 10-20% of states in each trajectory into the success buffer as an initiation set. While initiation sets can be discovered via random exploration, we found that this initialization of the success buffers improves training efficiency by providing initial training data for the proximity predictors.
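A minimal sketch of such a FIFO buffer of (state, proximity) pairs is shown below. The class and method names are ours and only illustrate the bookkeeping described above.

```python
import random
from collections import deque

class ProximityBuffer:
    """FIFO buffer of (state, proximity) pairs with a fixed capacity."""

    def __init__(self, capacity=1_000_000):
        # deque with maxlen drops the oldest items automatically once full.
        self.data = deque(maxlen=capacity)

    def add_trajectory(self, states, proximities):
        self.data.extend(zip(states, proximities))

    def sample(self, batch_size):
        idx = random.sample(range(len(self.data)), batch_size)
        return [self.data[i] for i in idx]

# One success buffer (discounted proximities) and one failure buffer are kept per primitive skill.
```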
B.3 Proximity Reward
Transition policies receive rewards based on the outputs of the proximity predictors. Before computing the reward at every time step, we clip the output of the proximity predictor P by clip(P(s), 0, 1), which indicates how close the state s is to the initiation set of the following primitive (higher values correspond to closer states). We define the proximity of a state to an initiation set as an exponentially discounted function δ^step, where step is the shortest number of timesteps required to get to a state in the initiation set. We use δ = 0.95 for all experiments. To make the reward denser, for every timestep t we provide the increase in proximity, P(s_{t+1}) − P(s_t), as the reward for the transition policy.

Figure B.1: Success count curves of our model with an exponentially discounted proximity function and a linearly discounted proximity function over training on Obstacle course (left) and Repetitive catching (right).
Using a linearly discounted proximity function, 1 − δ·step, is also a valid choice. We compare the two proximity functions on a manipulation task (Repetitive catching) and a locomotion task (Obstacle course), as shown in Figure B.1, where δ for exponential decay and linear decay are 0.95 and 0.01, respectively. The results demonstrate that our model is able to learn well with both proximity functions, and they perform similarly.
Originally, we opted for the exponential proximity function with the intuition that the faster
initial decay near the initiation set would help the policy discriminate successful states from failing
states near the initiation set. Also, in our experiments, as we use 0.95 as a decaying factor, the
proximity is still reasonably large (e.g., 0.35 for 20 time-steps and 0.07 for 50 time-steps). In this
chapter, we use the exponential proximity function for all experiments.
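The proximity labels and the resulting transition reward can be computed as in the following sketch. The function names are ours, and labeling failed transitions with proximity 0 is our reading of the description above rather than a stated detail.

```python
import numpy as np

DELTA = 0.95  # exponential decay factor used in this chapter

def proximity_targets(num_steps, success):
    """Proximity labels for a transition trajectory of length num_steps.

    For a successful transition, the last state lies in the initiation set
    (proximity 1) and earlier states decay as DELTA**steps_to_go.
    Failed transitions are labeled with proximity 0 (our assumption).
    """
    if not success:
        return np.zeros(num_steps)
    steps_to_go = np.arange(num_steps - 1, -1, -1)
    return DELTA ** steps_to_go

def transition_reward(proximity_predictor, s_t, s_next):
    """Dense reward: increase in (clipped) predicted proximity."""
    p_t = np.clip(proximity_predictor(s_t), 0.0, 1.0)
    p_next = np.clip(proximity_predictor(s_next), 0.0, 1.0)
    return p_next - p_t
```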
B.4 Proximity Predictors
A proximity predictor takes a state as input, which includes joint state information, joint acceleration, and any task specification, such as ceiling and curb information. A proximity predictor consists of 2 fully connected layers of 96 hidden units with ReLU nonlinearities and predicts the proximity to the initiation set based on the states sampled from the success and failure buffers. Each training iteration consists of 10 epochs over a batch size of 64 with a learning rate of 1 × 10^-4. The predictor optimizes the loss in Equation (4.1), similar to the LSGAN loss (Mao et al., 2017).
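Equation (4.1) is given in the main chapter; as a rough least-squares sketch consistent with the description above (regress success-buffer states toward their proximity labels and failure-buffer states toward zero, in the spirit of the LSGAN objective), one could write the following. All names are ours and the exact form of Equation (4.1) is not reproduced here.

```python
import torch

def proximity_predictor_loss(predictor, success_states, success_proximity, failure_states):
    """Least-squares objective over success- and failure-buffer samples (sketch)."""
    pred_success = predictor(success_states).squeeze(-1)
    pred_failure = predictor(failure_states).squeeze(-1)
    loss_success = ((pred_success - success_proximity) ** 2).mean()
    loss_failure = (pred_failure ** 2).mean()  # failure states should map to proximity 0
    return loss_success + loss_failure
```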
B.5 Transition Policies
The observation space of a transition policy consists of joint state information and joint acceleration. A transition policy consists of 2 fully connected layers of 32 hidden units with tanh nonlinearities and predicts the mean and standard deviation of a Gaussian distribution over an action space. A 2-way softmax layer follows the last fully connected layer to predict whether to terminate the current transition or not. We train all transition policies using PPO (Schulman et al., 2017) since PPO is robust to smaller batch sizes and the set of transition states collected for each update is much smaller than the size of a rollout. Each training iteration consists of 5 epochs over a batch.
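A sketch of such a transition policy with an additional 2-way termination head is shown below; the names are ours and the Gaussian action head mirrors the primitive-policy sketch above.

```python
import torch
import torch.nn as nn

class TransitionPolicy(nn.Module):
    """Gaussian action head plus a 2-way termination head on a shared tanh MLP."""

    def __init__(self, obs_dim, action_dim, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.term_head = nn.Linear(hidden, 2)  # [continue, terminate] logits

    def forward(self, obs):
        h = self.body(obs)
        action_dist = torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
        term_dist = torch.distributions.Categorical(logits=self.term_head(h))
        return action_dist, term_dist
```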
Algorithm 7 TRAIN
1: Input: Primitive policies {π^p_1, ..., π^p_n}.
2: Initialize success buffers {B^S_1, ..., B^S_n} with successful trajectories of primitive policies.
3: Initialize failure buffers {B^F_1, ..., B^F_n}.
4: Randomly initialize parameters of transition policies {φ_1, ..., φ_n} and proximity predictors {ω_1, ..., ω_n}.
5: repeat
6:   Initialize rollout buffers {R_1, ..., R_n}.
7:   Collect trajectories using ROLLOUT.
8:   for i = 1 to n do
9:     Update P_{ω_i} to minimize Equation (4.1) using B^S_i and B^F_i.
10:    Update π_{φ_i} to maximize Equation (4.2) using R_i.
11:  end for
12: until convergence
Algorithm 8 ROLLOUT
1: Input: Meta policy π_meta, primitive policies {π^p_1, ..., π^p_n}, transition policies {π_{φ_1}, ..., π_{φ_n}}, and proximity predictors {P_{ω_1}, ..., P_{ω_n}}.
2: Initialize an episode and receive initial state s_0.
3: t ← 0
4: while episode is not terminated do
5:   c ∼ π_meta(s_t)
6:   Initialize a rollout buffer B.
7:   while episode is not terminated do
8:     a_t, τ_trans ∼ π_{φ_c}(s_t)
9:     Terminate the transition policy if τ_trans = terminate.
10:    s_{t+1}, τ_env ← ENV(s_t, a_t)
11:    r_t ← P_{ω_c}(s_{t+1}) − P_{ω_c}(s_t)
12:    Store (s_t, a_t, r_t, τ_env, s_{t+1}) in B
13:    t ← t + 1
14:  end while
15:  while episode is not terminated do
16:    a_t, τ^p_c ∼ π^p_c(s_t)
17:    Terminate the primitive policy if τ^p_c ≠ continue.
18:    s_{t+1}, τ_env ← ENV(s_t, a_t)
19:    t ← t + 1
20:  end while
21:  Compute the discounted proximity v of each state s in B.
22:  Add pairs of (s, v) to B^S_c or B^F_c according to τ^p_c.
23:  Add B to the rollout buffer R_c.
24: end while
B.6 Scalability
Each sub-policy requires its corresponding transition policy, proximity predictor, and two buffers.
Hence, both the time and memory complexities of our method are linearly dependent on the number
of sub-policies. The memory overhead is affordable since a transition policy (2 layers of 32 hidden
units), a proximity predictor (2 layers of 96 hidden units), and replay buffers (1M states) are small.
C Environment Descriptions
For every task, we add a control penalty, −0.001·∥a∥², to regularize the magnitude of actions, where a is a torque action performed by the agent. Note that all measures are in meters, and we omit the units here for clarity of presentation.
C.1 Robotic Manipulation
In object manipulation tasks, a 9-DOF Jaco robotic arm¹ is used as the agent, and a cube with side length 0.06 m is used as the target object. We follow the tasks and environment settings proposed in Ghosh et al. (2018). The observation consists of the position of the base of the Jaco arm, joint angles, and angular velocities, as well as the position, rotation, velocity, and angular velocity of the cube. The action space is torque control on the 9 joints.
¹http://www.mujoco.org/forum/index.php?resources/kinova-arms.12/
C.1.1 Reward Design and Termination Condition
Picking up: In the Picking up task, the position of the box is randomly initialized within a square region of size 0.1 m × 0.1 m with center (0.5, 0.2). There is an initial guide reward to guide the arm to the box. There is also an over reward to guide the hand directly over the box. When the arm is not picking up the box, there is a pick reward to incentivize the arm to pick the box up. There is an additional hold reward that makes the arm hold the box in place after picking it up. Finally, there is a success reward given after the arm has held the box for 50 frames. The success reward is scaled with the number of timesteps to encourage the arm to succeed as quickly as possible.

R(s) = λ_guide · 1[box not picked and box on ground] + λ_pick · 1[box in hand and not picked] + λ_hold · 1[box picked and near hold point],
with λ_guide = 2, λ_pick = 100, λ_hold = 0.1.
Catching: The position of the box is initialized at (0, 2.0, 1.5), and a directional force of size 110 is applied to throw the box toward the agent with randomness (0.1 m × 0.1 m).

R(s) = 1[box in air and box within 0.06 of the Jaco end-effector]
Tossing: The box is randomly initialized on the ground at (0.4, 0.3, 0.05) within a 0.005 × 0.005 square region. A guide reward is given to guide the arm to the top of the box. A pick reward is then given to lift the box up to a specified release height. A release reward is given if the box is no longer in the hand. A stable reward is given to minimize variation in the box's x and y directions. An up reward is given while the box is traveling upwards in the air, up until the box reaches a specified z height. Finally, a success reward of +100 is given based on the landing position of the box relative to the specified landing position.
Hitting: The box is randomly initialized above the arm at (0.4, 0.3, 1.2) within a 0.005 × 0.005 m square region. The box falls, and the arm is given a hit reward of +10 for hitting the box. Once the box has been hit, a target reward is given based on how close the box is to the target.
Repetitive picking up: The Repetitive picking up task has two reward variants. The sparse version gives a reward of +1 for every successful pick. The dense reward version gives a guide reward toward the box after each successful pick, following the reward for the Picking up task.
Repetitive catching: The Repetitive catching task gives a reward of +1 for every successful catch. For dense reward, it uses the same reward function as that of the Catching task.
Serve: The Serve task gives a toss reward of +1 for a successful toss and a target reward of +1 for successfully hitting the target. The dense reward setting provides the Tossing and Hitting rewards according to the box position.
C.2 Locomotion
A 9-DOF bipedal planar walker is used for simulating locomotion tasks. The observation
consists of the position and velocity of the torso, joint angles, and angular velocities. The action
space is torque control on the 6 joints.
C.2.1 Reward Design
Different locomotion tasks share many components of the reward design, such as velocity, stability, and posture. We use the same form of reward function, but with different hyperparameters for each task. The basic form of the reward function is as follows:

R(s) = λ_vel · |v_x − v_target| + λ_alive − λ_height · |1.1 − min(1.1, ∆h)| + λ_angle · cos(angle) − λ_foot · (v_rightfoot + v_leftfoot),

where v_x, v_rightfoot, and v_leftfoot are the forward velocity, right foot angular velocity, and left foot angular velocity; and ∆h and angle are the distance between the foot and the torso and the angle of the torso, respectively. The foot velocity terms help the agent move its feet naturally. ∆h and angle are used to maintain the height of the torso and encourage an upright pose.
Forward: The Forward task requires the walker agent to walk forward for 20 meters. To make
the agent robust, we apply a random force with arbitrary magnitude and direction to a randomly
selected joint every 10 timesteps.
λ_vel = 2, λ_alive = 1, λ_height = 2, λ_angle = 0.1, λ_foot = 0.01, and v_target = 3
Backward: Similar to Forward, the Backward task requires the walker to walk backward for
20 meters under random forces.
λ_vel = 2, λ_alive = 1, λ_height = 2, λ_angle = 0.1, λ_foot = 0.01, and v_target = −3
Balancing: In the Balancing task, the agent learns to balance under strong random forces for
1000 timesteps. Similar to other tasks, the random forces are applied to a random joint every 10
timesteps, but with magnitude 5 times larger.
λ_vel = 1, λ_alive = 1, λ_height = 0.5, λ_angle = 0.1, λ_foot = 0, and v_target = 0
Crawling: In the Crawling task, a ceiling of height 1.0 and length 16 is located in front of the agent, and the agent is required to crawl under the ceiling without touching it. If the agent touches the ceiling, we terminate the episode. The task is completed when the agent passes a point 1.5 beyond the ceiling, at which point the agent receives an additional reward of 100.
λ_vel = 2, λ_alive = 1, λ_height = 0, λ_angle = 0.1, λ_foot = 0.01, and v_target = 3
Jumping: In the Jumping task, a curb of height 0.4 and length 0.2 is located in front of the walker agent. The observation contains the distance to the curb in addition to the 17-dimensional joint information, where the distance is clipped at 3. The x location of the curb is randomly chosen from [2.5, 5.5]. In addition to the reward function above, the agent gets an additional reward of 100 for passing the curb and 200·v_y when it passes the front, middle, and end slices of the curb, where v_y is the y-velocity. If the agent touches the curb, it gets a −10 penalty and the episode is terminated.
λ_vel = 2, λ_alive = 1, λ_height = 2, λ_angle = 0.1, λ_foot = 0.01, and v_target = 3
Patrol: The Patrol task is repetitive running forward and backward between two goals at x = −2 and x = 2. Once the agent touches a goal, the target is changed to the other goal and a sparse reward of +1 is given. The dense reward alternates between the reward functions of Forward and Backward: the agent gets the Forward reward when it is heading toward x = 2 and the Backward reward otherwise.
Hurdle: The Hurdle environment consists of 5 curbs positioned at x = {8, 18, 28, 38, 48} and requires repetitive walking and jumping behaviors. The position of each curb is randomized with a value uniformly sampled from [−0.5, 0.5]. A sparse reward of +1 is given when the agent jumps over a curb (i.e., passes a point 1.5 beyond the curb).
The dense reward for Hurdle is the same as for Jumping and has 8 reward components to guide the agent toward the desired behavior. By extensively designing dense rewards, it is possible to solve complex tasks. In comparison, our proposed method learns from sparse rewards by re-using prior knowledge and does not require reward shaping.
Obstacle Course: The Obstacle Course environment replaces two curbs in Hurdle with a ceiling of height 1.0 and length 3. A sparse reward of +1 is given when the agent jumps over a curb or passes through a ceiling (i.e., passes a point 1.5 beyond the curb or ceiling). The dense reward alternates between Jumping before a curb and Crawling before a ceiling.
C.2.2 Termination Condition
Locomotion tasks except Crawling fail if h < 0.8, and Crawling fails if h < 0.3. The Forward and Backward tasks are considered successful when the walker reaches the target or a point 5 in front of the obstacles. The Balancing task is considered successful when the agent does not fail for 50 timesteps. The agent succeeds in Jumping and Crawling if it passes the obstacles by a distance of 1.5.
Appendix C
Skill Chaining via Terminal State Regularization
A Environment Details
We choose two tasks, TABLE LACK and CHAIR INGOLF, from the IKEA furniture assembly
environment (Lee et al., 2021b) for our experiments. For reinforcement learning, we use a heavily
shaped multi-phase dense reward from (Lee et al., 2021b). For the robotic agent, we use the 7-DoF
Rethink Sawyer robot operated via joint velocity control. For imitation learning, we collected
200 demonstrations for each furniture part assembly with a programmatic assembly policy. Each demonstration for single-part assembly is around 200-900 steps long due to the long-horizon nature of the task.
The observation space includes robot observations (29 dim), object observations (35 dim), and
task phase information (8 dim). The robot observations consist of robot joint angles (7 dim), joint
velocities (7 dim), gripper state (2 dim), gripper position (3 dim), gripper quaternion (4 dim),
gripper velocity (3 dim), and gripper angular velocity (3 dim). The object observations contain the
positions (3 dim) and quaternions (4 dim) of all five furniture pieces in the scene. In addition, the 8-dimensional task information is a one-hot embedding representing the phase of the state (e.g., reaching, grasping, lifting, moving, aligning).
In our experiments, we define a subtask as assembling one part to another; thus, we have four subtasks for each task. Subtasks are independently trained on initial states sampled from the environment with random noise of [−2 cm, 2 cm] and [−3°, 3°] in the (x, y)-plane. Moreover, this subtask decomposition is given, i.e., the environment can be initialized for each subtask and the agent is informed when the subtask is completed. Once two parts are attached, the corresponding subtask is completed and the robot arm moves back to its initial pose in the center of the workspace.
B Network Architectures
For fair comparison, we use the same network architecture for our method and the baseline
methods. The policy and critic networks consist of two fully connected layers of 128 and 256
hidden units with ReLU nonlinearities, respectively. The output layer of the actor network outputs
an action distribution, which consists of the mean and standard deviation of a Gaussian distribution.
The critic network outputs only a single critic value. The discriminator for GAIL (Ho and Ermon, 2016) and the initiation set discriminator for our proposed method use a two-layer fully connected network with 256 hidden units, with tanh as the nonlinear activation function. These discriminators' outputs are clipped to [0, 1] following Peng et al. (2021).
C Training Details
During policy learning, we sample a batch of 128 elements from the expert (for GAIL) and agent experience buffers. To acquire subtask policies for the policy sequencing baseline (Clegg et al., 2018) and our method, we run 20M to 25M steps depending on the difficulty of each subtask. We train all methods except BC for 48 hours, which corresponds to 200M steps for PPO, GAIL, and GAIL + PPO, and 5M steps for SPiRL. We train the policy sequencing baseline and our method for an additional 100M steps; thus, 200M steps in total. For the final evaluation, we evaluate a trained policy every 50 updates and report the best-performing checkpoints due to unstable adversarial training. Most of the runs converge within these training steps.
Table C.1: PPO hyperparameters.
Hyperparameter Value
# Workers 16
Rollout Size 1,024
Learning Rate 0.0003
Learning Rate Decay Linear decay
Mini-batch Size 128
# Epochs per Update 10
Discount Factor 0.99
Entropy Coefficient 0.003
Reward Scale 0.05
State Normalization True
BC (Pomerleau, 1989): The policy is trained with batch
size 256 for 1000 epochs using the Adam optimizer with
learning rate 1e-4 and momentum (0.9, 0.999).
PPO (Schulman et al., 2017): Any reinforcement learning algorithm can be used for policy optimization, but we choose to use Proximal Policy Optimization (PPO) (Schulman et al., 2017), and we use the default hyperparameters of PPO (see Table C.1).
GAIL (Ho and Ermon, 2016): We tested different variants of the GAIL implementation and found the AMP (Peng et al., 2021) formulation to be the most stable. Please refer to Peng et al. (2021) for implementation details about GAIL training and the GAIL reward. In this chapter, we specifically use agent states s to discriminate agent and expert trajectories, instead of state-action pairs (s, a).
GAIL + PPO: We use the same setup as the GAIL baseline, but with the environment reward. We use λ_1 = λ_2 = 0.5.
SPiRL (Pertsch et al., 2020a): We use the official implementation of the original paper.
Policy Sequencing (Clegg et al., 2018): We use the same implementation and hyperparameters as ours, but without the terminal state regularization. To prevent catastrophic forgetting, we sample an initial state from the environment 20% of the time and from the Gaussian distribution 80% of the time.
T-STAR (Ours): We use λ_3 = 10000 for the terminal state regularization. Similar to PS, we sample an initial state from the environment 20% of the time and from the terminal state buffer of the previous skill 80% of the time.
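The 20%/80% initial-state sampling shared by the policy sequencing baseline and our method can be sketched as follows. The function name is ours, and env.reset_to is a hypothetical helper for restoring a stored simulator state, not an actual API of the environment.

```python
import random

def sample_initial_state(env, terminal_state_buffer, p_env=0.2):
    """Mixed initial-state sampling for skill chaining (sketch).

    With probability p_env we reset from the environment's own initial-state
    distribution (to prevent catastrophic forgetting); otherwise we start from a
    terminal state of the previous skill stored in terminal_state_buffer.
    """
    if random.random() < p_env or len(terminal_state_buffer) == 0:
        return env.reset()
    state = random.choice(terminal_state_buffer)
    return env.reset_to(state)  # hypothetical helper that restores a saved state
```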
Appendix D
Coordinating Skills via Skill Behavior Diversification
A Environment Details
The details of observation spaces, action spaces, number of agents, and episode lengths are
described in Table D.1. All units in this section are in meters unless otherwise specified.
Table D.1: Environment details.

                         Jaco Pick-Push-Place   Jaco Bar-Moving   Ant Push
Observation Space        88                     88                100
 - Robot observation     62                     62                82
 - Object observation    26                     26                18
Action Space             18                     18                16
Number of Agents         2                      2                 2
Episode Length           150                    100               200
A.1 Environment Descriptions
In both Jaco environments, the robot works on a table of size (1.6, 1.6) with top center position (0, 0, 0.82). The two Jaco arms are initialized at positions (−0.16, −0.16, 1.2) and (−0.16, 0.24, 1.2). The left arm and right arm objects are initialized around (0.3, 0.2, 0.86) and (0.3, −0.2, 0.86), respectively, in all primitive training and composite task training environments, with small random position and rotation perturbations.
In the Jaco Pick-Push-Place task, the right Jaco arm needs to pick up the object and place it into the container initialized on the other side of the table. Success is defined by contact between the object and the inner top side of the container.
In the Jaco Bar-Moving task, the two Jaco arms need to jointly pick the long bar up to a height of 0.7, move it toward the arms by a distance of 0.15, and place it back on the table. Success is defined by (1) the bar being placed within 0.04 of the desired destination both in height and in xy-position and (2) the bar having been lifted 0.7 above the table.
In the Ant Push task, the two ant agents need to push a big box together to the goal position. The box has a size of 8.0 × 1.6 × 1.6. The distance between the ants and the box is 20 cm, and the distance between the box and the goal is 30 cm. Initial positions have 1 cm of randomness, and the agent has a randomness of 0.01 in each joint. The task is considered successful when the distances between both the left and right ends of the box and the goal are within 5 cm.
A.2 Reward Design
For every task, we add a control penalty, −0.001·∥a∥², to regularize the magnitude of actions, where a is a torque action performed by the agent.
Jaco Pick: To help the agent learn to reach, pick, and hold the picked object, we provide a dense reward defined by the weighted sum of a pick reward, gripper-to-cube distance reward, cube position and quaternion stability reward, hold duration reward, success reward, and robot control reward. More concretely,

R(s) = λ_pick · (z_box − z_init) + λ_dist · dist(p_gripper, p_box) + λ_pos · dist(p_box, p_init) + λ_quat · |∆_quat| + λ_hold · t_hold + λ_success · 1[success] + λ_ctrl · ∥a∥²,

where λ_pick = 500, λ_dist = 100, λ_pos = 1000, λ_quat = 1000, λ_hold = 10, λ_success = 100, λ_ctrl = 1 × 10^-4.
Jaco Place: The reward for the place primitive is defined by the weighted sum of an xy-distance reward, a height reward (larger when the cube is close to the floor), a success reward, and a robot control reward:

R(s) = λ_xy · dist_xy(p_box, p_goal) + λ_z · |z_box − z_goal| + λ_success · 1[success] + λ_ctrl · ∥a∥²,

where λ_xy = 500, λ_z = 500, λ_success = 500, λ_ctrl = 1 × 10^-4.
Jaco Push: The reward for the push primitive is defined by the weighted sum of a gripper reaching reward, box-to-destination distance reward, quaternion stability reward, hold duration reward, success reward, and robot control reward:

R(s) = λ_reaching · dist(p_gripper, p_box) + λ_pos · dist(p_box, p_dest) + λ_quat · |∆_quat| + λ_hold · t_hold + λ_success · 1[success] + λ_ctrl · ∥a∥²,

where λ_reaching = 100, λ_pos = 500, λ_quat = 30, λ_hold = 10, λ_success = 1000, λ_ctrl = 1 × 10^-4.
Jaco Pick-Push-Place: The reward for Pick-Push-Place is defined by the weighted sum of a gripper contact reward, per-stage reach/pick/push/place rewards, a success reward, and a control reward. We tune the reward carefully for all baselines.

R(s) = λ_contact · (1[left gripper touches container] + 1[right gripper touches box]) + λ_reach · 1_reach · (dist(p_leftgripper, p_container) + dist(p_rightgripper, p_box)) + λ_pick · 1_pick · dist(p_box, p_boxtarget) + λ_place · 1_place · dist(p_box, p_boxtarget) + λ_push · dist(p_container, p_containertarget) + λ_success · 1[success] + λ_ctrl · ∥a∥²,

where λ_reach = 10, λ_contact = 10, λ_pick = λ_place = λ_push = 10, λ_success = 50, λ_ctrl = 0, and 1_reach, 1_pick, and 1_place are indicator functions specifying whether the agent is in the reaching, pick, or place stage. Agent stages are determined by how many multiples of 25 steps the agent has stepped through in the environment.
Jaco Bar-Moving: The reward for Bar-Moving is defined by the weighted sum of per-stage reach/pick/move/place rewards, a success reward, and a control reward:

R(s) = λ_reach · 1_reach · (dist(p_leftgripper, p_lefthandle) + dist(p_rightgripper, p_righthandle)) + λ_pick · 1_pick · dist(p_bar, p_bartarget) + λ_move · 1_place · dist_xy(p_bar, p_bartarget) + λ_place · 1_place · dist_z(p_bar, p_bartarget) + λ_success · 1[success] + λ_ctrl · ∥a∥²,

where λ_reach = 10, λ_pick = 30, λ_move = 100, λ_place = 100, λ_success = 100, λ_ctrl = 1 × 10^-4, and 1_pick and 1_place are indicator functions specifying whether the agent is in the pick or place stage. Agent stages are determined by whether the pick objective has been fulfilled or not.
Ant Push & Ant Moving: The reward for ANT PUSH is defined by upright posture and velocity toward the desired direction. We provide a dense reward to encourage the desired locomotion behavior using velocity, stability, and posture, as follows:

R(s) = λ_vel · |∆x_ant| + λ_boxvel · |∆x_box| + λ_upright · cos(θ) − λ_height · |0.6 − h| + λ_goal · dist(p_goal, p_box),

where λ_vel = 50, λ_boxvel = 20, λ_upright = 1, λ_height = 0.5. For ANT PUSH, we provide an additional reward based on the distance between the box and the goal position with λ_goal = 200.
B Experiment Details
We use PyTorch (Paszke et al., 2017) for our implementation and all experiments are conducted
on a workstation with Intel Xeon Gold 6154 CPU and 4 NVIDIA GeForce RTX 2080 Ti GPUs.
B.1 Hyperparameters
Table D.2: Hyperparameters.

Parameter                              Value
Learning rate                          3e-4
Gradient steps                         50
Batch size                             256
Discount factor                        0.99
Target smoothing coefficient           0.005
Reward scale (SAC)                     1.0
Experience buffer size (# episodes)    1000
T_low                                  1 for JACO, 5 for ANT
N_z (dimensionality of z)              5
B.2 Network Architectures
Actor Networks: In all experiments, we model our actor network for each primitive skill as a
3-layer MLP with hidden layer size 64. The last layer of the MLP is two-headed – one for the mean
of the action distribution and the other for the standard deviation of it. We use ReLU as activation
function in hidden layers. We do not apply any activation function for the final output layer. The
action distribution output represents per-dimension normal distribution, from which single actions
can be sampled and executed in the environment.
Critic Networks: The critic network for each primitive skill and meta policy is modeled as a
2-layer MLP with hidden layer size 128. ReLU is used as an activation function in the hidden layers.
The critic network output is used to assess the value of a given state-action pair and is trained by fitting its outputs to the target Q-value clamped to ±100.
Meta Policy: The meta policy is modeled as a 3-layer MLP with a hidden layer size of 64. Since the actions of the meta policy are sampled from N categorical distributions (one per end-effector/agent) and N normal distributions for the behavior embeddings, the output dimension of the meta policy is ∑_{i=1}^{N} (m_i + N_z). The meta policy uses ReLU as the activation function for all layers except the final output layer.
B.3 Training Details
For all baselines, we train the meta policies using PPO and the low-level policies using SAC. We use the same environment configurations, composite task reward definitions, and value of T_low across all baselines.
For Jaco tasks, we train a total of 4 primitive skills – right arm pick, right arm place-to-center,
right arm place-towards-arm, and left arm push – to be composed by meta-policy. For Jaco Pick-
Push-Place, we provide the meta-policy with right arm pick and right arm place-to-center as right
arm primitives and left arm push as left arm primitives; for Jaco Bar-Moving, we provide the meta-
policy with right arm pick and right arm place-towards-arm as both right and left arm primitives and
left arm pick and right arm place-towards-arm as left arm primitives. We obtain left arm primitives
for bar-moving task by using the learned right arm primitives directly.
To obtain the 4 primitive skills described above, we train right arm pick with diversity coefficient λ_2 = 0.01 and the other three primitives with λ_2 = 0.1. The destination of right arm push is set to (0.3, −0.03, 0.86), which is slightly left of the center of the table. After the pick primitive is trained, we train the two right arm place primitives, where episodes are initialized with intermediate states of successful right arm pick episodes in which the height of the box is larger than 0.94 (0.01 higher than the target pick height). The place destinations for the towards-arm and to-center primitives are (0.15, −0.2, 0.86) and (0.3, −0.02, 0.86), respectively.
For non-modular baselines that incorporate skill behavior diversification, we use λ_2 = 0.01 for both Jaco Pick-Push-Place and Jaco Bar-Moving because both tasks require picking skills, which can only be trained with a small value of λ_2.
Appendix E
Reinforcement Learning with Learned Skills
A Action-prior Regularized Soft Actor-Critic
The original derivation of the SAC algorithm assumes a uniform prior over actions. We extend
the formulation to the case with a non-uniform action prior p(a|·), where the dot indicates that the
prior can be non-conditional or conditioned on, e.g., the current state or the previous action. Our
derivation closely follows Haarnoja et al. (2018b) and Levine (2018) with the key difference that we
replace the entropy maximization in the reward function with a term that penalizes divergence from
the action prior. We derive the formulation for single-step action priors below; the extension to skill priors is straightforward by replacing actions a_t with skill embeddings z_t.
We adopt the probabilistic graphical model (PGM) described in (Levine, 2018), which includes optimality variables O_{1:T}, whose distribution is defined as p(O_t | s_t, a_t) = exp( r(s_t, a_t) ), where r(s_t, a_t) is the reward. We treat O_{1:T} = 1 as evidence in our PGM and obtain the following conditional trajectory distribution:

p(τ | O_{1:T}) = p(s_1) ∏_{t=1}^{T} p(O_t | s_t, a_t) p(s_{t+1} | s_t, a_t) p(a_t | ·)
             = [ p(s_1) ∏_{t=1}^{T} p(s_{t+1} | s_t, a_t) p(a_t | ·) ] · exp( ∑_{t=1}^{T} r(s_t, a_t) )

Crucially, in contrast to (Levine, 2018), we did not omit the action prior p(a_t | ·) since we assume it to be generally not uniform.
Our goal is to derive an objective for learning a policy that induces such a trajectory distribution.
Following (Levine, 2018) we will cast this problem within the framework of structured variational
inference and derive an expression for an evidence lower bound (ELBO).
We define a variational distribution q(a_t | s_t) that represents our policy. It induces a trajectory distribution q(τ) = p(s_1) ∏_{t=1}^{T} p(s_{t+1} | s_t, a_t) q(a_t | s_t). We can derive the ELBO as:

log p(O_{1:T}) ≥ −D_KL( q(τ) || p(τ | O_{1:T}) )
             ≥ E_{τ∼q(τ)}[ log p(s_1) + ∑_{t=1}^{T} ( log p(s_{t+1} | s_t, a_t) + log p(a_t | ·) ) + ∑_{t=1}^{T} r(s_t, a_t)
                           − log p(s_1) − ∑_{t=1}^{T} ( log p(s_{t+1} | s_t, a_t) + log q(a_t | s_t) ) ]
             ≥ E_{τ∼q(τ)}[ ∑_{t=1}^{T} r(s_t, a_t) + log p(a_t | ·) − log q(a_t | s_t) ]
             ≥ E_{τ∼q(τ)}[ ∑_{t=1}^{T} r(s_t, a_t) − D_KL( q(a_t | s_t) || p(a_t | ·) ) ]

Note that in the case of a uniform action prior the KL divergence is equivalent to the negative entropy −H(q(a_t | s_t)). Substituting the KL divergence with the entropy recovers the ELBO derived in (Levine, 2018).
To maximize this ELBO with respect to the policy q(a_t | s_t), (Levine, 2018) proposes to use an inference procedure based on a message passing algorithm. Following this derivation for the "messages" V(s_t) and Q(s_t, a_t) (Levine (2018), Section 4.2), but substituting the policy entropy −log q(a_t | s_t) with the prior divergence D_KL( q(a_t | s_t) || p(a_t | ·) ), the modified Bellman backup operator can be derived as:

T^π Q(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1} ∼ p}[ V(s_{t+1}) ],
where V(s_t) = E_{a_t ∼ π}[ Q(s_t, a_t) − D_KL( π(a_t | s_t) || p(a_t | ·) ) ]
To show convergence of this operator to the optimal Q-function, we follow the proof of (Haarnoja et al., 2018b) in Appendix B1 and introduce a divergence-augmented reward:

r_π(s_t, a_t) = r(s_t, a_t) − E_{s_{t+1} ∼ p}[ D_KL( π(a_{t+1} | s_{t+1}) || p(a_{t+1} | ·) ) ].

Then we can recover the original Bellman update:

Q(s_t, a_t) ← r_π(s_t, a_t) + γ E_{s_{t+1} ∼ p, a_{t+1} ∼ π}[ Q(s_{t+1}, a_{t+1}) ],

for which the known convergence guarantees hold (Sutton and Barto, 2018).

The modifications to the messages Q(s_t, a_t) and V(s_t) directly lead to the following modified policy improvement operator:

argmin_θ  E_{s_t ∼ D, a_t ∼ π}[ D_KL( π(a_t | s_t) || p(a_t | ·) ) − Q(s_t, a_t) ]
Finally, practical implementations of SAC introduce a temperature parameter α that trades off between the reward and the entropy term in the original formulation, and between the reward and the divergence term in our formulation. Haarnoja et al. (2018c) propose an algorithm to automatically adjust α by formulating policy learning as a constrained optimization problem. In our formulation we derive a similar update mechanism for α. We start by formulating the following constrained optimization problem:

max_{x_{1:T}}  E_{p_π}[ ∑_{t=1}^{T} r(s_t, a_t) ]   s.t.   D_KL( π(a_t | s_t) || p(a_t | ·) ) ≤ δ   ∀t

Here δ is a target divergence between policy and action prior, similar to the target entropy H̄ in the original SAC formulation. We can formulate the dual problem by introducing the temperature α:

min_{α>0} max_π  ∑_{t=1}^{T} r(s_t, a_t) + α·( δ − D_KL( π(a_t | s_t) || p(a_t | ·) ) )

This leads to the modified update objective for α:

argmin_{α>0}  E_{a_t ∼ π}[ αδ − α·D_KL( π(a_t | s_t) || p(a_t | ·) ) ]

We combine the modified objectives for the Q-value function, policy, and temperature α in the skill-prior regularized SAC algorithm, summarized in Algorithm 3.
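As a concrete illustration, a minimal PyTorch-style sketch of the resulting critic target is given below: relative to standard SAC, the entropy bonus is replaced by the KL divergence between the policy and the learned prior. All function names and signatures are our own assumptions, not the released implementation.

```python
import torch

def prior_regularized_target(q_target_net, policy, prior, reward, next_state, done,
                             alpha, gamma=0.99):
    """Q-target of prior-regularized SAC (sketch).

    Standard SAC uses V(s') = Q(s', a') - alpha * log pi(a'|s');
    here the entropy term is replaced by -alpha * KL(pi(.|s') || prior(.|s')).
    policy and prior are assumed to return torch.distributions.Normal objects.
    """
    with torch.no_grad():
        next_dist = policy(next_state)
        prior_dist = prior(next_state)
        next_action = next_dist.rsample()
        kl = torch.distributions.kl_divergence(next_dist, prior_dist).sum(-1)
        next_v = q_target_net(next_state, next_action) - alpha * kl
        return reward + gamma * (1.0 - done) * next_v
```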
B Implementation Details
B.1 Model Architecture and Training Objective
We instantiate the skill embedding model described in Section 7.3.2 with deep neural networks. The skill encoder is implemented as a one-layer LSTM with 128 hidden units. After processing the full input action sequence, it outputs the parameters (µ_z, σ_z) of the Gaussian posterior distribution in the 10-dimensional skill embedding space Z. The skill decoder mirrors the encoder's architecture and is unrolled for H steps to produce the H reconstructed actions. The sampled skill embedding z is passed as input at every step.
The skill prior is implemented as a 6-layer fully connected network with 128 hidden units per layer. It parametrizes the Gaussian skill prior distribution N(µ_p, σ_p). For image-based state inputs in the maze and block stacking environments, we first pass the state through a convolutional encoder network with three layers, a kernel size of three, and (8, 16, 32) channels, respectively. The resulting feature map is flattened to form the input to the skill prior network.
We use leaky-ReLU activations and batch normalization throughout our architecture. We optimize our model using the RAdam optimizer with β_1 = 0.9 and β_2 = 0.999, batch size 16, and learning rate 1 × 10^-3. Training on a single NVIDIA Titan X GPU takes approximately 8 hours. Assuming a unit-variance Gaussian output distribution, our full training objective is:
L = ∑_{i=1}^{H} (a_i − â_i)²  (reconstruction)  − β·D_KL( N(µ_z, σ_z) || N(0, I) )  (regularization)  + D_KL( N(⌊µ_z⌋, ⌊σ_z⌋) || N(µ_p, σ_p) )  (prior training).   (B.1)

Here ⌊·⌋ indicates that gradients flowing through these variables are stopped. For Gaussian distributions the KL divergence can be computed analytically. For non-Gaussian prior parametrizations (e.g., with Gaussian mixture model or normalizing flow priors) we found that sampling-based estimates also suffice to achieve reliable, albeit slightly slower, convergence. We tune the weighting parameter β separately for each environment and use β = 1 × 10^-2 for maze and block stacking and β = 5 × 10^-4 for the kitchen environment.
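For intuition, the three terms of this objective can be sketched in PyTorch as below. The sketch is written in the usual β-VAE convention (reconstruction plus β-weighted KL, plus the prior-training KL on a detached posterior); all names are ours and this is not the released SPiRL implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def skill_model_loss(actions, recon_actions, mu_z, sigma_z, mu_p, sigma_p, beta):
    """Sketch of the skill embedding + prior training terms of Eq. (B.1).

    actions, recon_actions: (batch, H, action_dim)
    (mu_z, sigma_z): skill posterior parameters from the encoder
    (mu_p, sigma_p): skill prior parameters predicted from the first state
    """
    # Reconstruction: unit-variance Gaussian log-likelihood reduces to squared error.
    recon = F.mse_loss(recon_actions, actions, reduction="none").sum(dim=(1, 2)).mean()

    posterior = Normal(mu_z, sigma_z)
    unit_prior = Normal(torch.zeros_like(mu_z), torch.ones_like(sigma_z))
    reg = kl_divergence(posterior, unit_prior).sum(-1).mean()

    # Prior training: match the learned prior to the (gradient-stopped) posterior.
    detached_posterior = Normal(mu_z.detach(), sigma_z.detach())
    learned_prior = Normal(mu_p, sigma_p)
    prior_loss = kl_divergence(detached_posterior, learned_prior).sum(-1).mean()

    # Combined here in the standard beta-VAE style.
    return recon + beta * reg + prior_loss
```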
B.2 Reinforcement Learning Setup
The architectures of the policy and critic mirror that of the skill prior network. The policy outputs the parameters of a Gaussian action distribution, while the critic outputs a single Q-value estimate. Empirically, we found it important to initialize the weights of the policy with the pre-trained skill prior weights in addition to regularizing towards the prior (see Section F).
We use the hyperparameters of the standard SAC implementation (Haarnoja et al., 2018b) with batch size 256, replay buffer capacity of 1 × 10^6, and discount factor γ = 0.99. We collect 5,000 warmup rollout steps to initialize the replay buffer before training. We use the Adam optimizer with β_1 = 0.9, β_2 = 0.999, and learning rate 3 × 10^-4 for updating the policy, critic, and temperature α. Analogous to SAC, we train two separate critic networks and compute the Q-value as the minimum over both estimates to stabilize training. The corresponding target networks are updated at a rate of τ = 5 × 10^-3. The policy's actions are limited to the range [−2, 2] by a tanh "squashing function" (see Haarnoja et al. (2018b), Appendix C).
We tune the target divergence δ separately for each environment and use δ = 1 for the maze navigation task and δ = 5 for both robot manipulation tasks.
C Environments and Data Collection
Figure E.1: Image-based state representation for the maze (left) and block stacking (right) environments (downsampled to 32 × 32 px for the policy).
Maze Navigation. The maze navigation environment is based on the maze environment in the D4RL benchmark (Fu et al., 2020). Instead of using a single, fixed layout, we generate random layouts for training data collection by placing walls with doorways at randomly sampled positions. For each collected training sequence we sample a new maze layout and randomly sample start and goal positions for the agent. Following Fu et al. (2020), we collect goal-reaching examples through a combination of a high-level planner with access to a map of the maze and a low-level controller that follows the plan.
For the downstream task we randomly sample a maze that is four times larger than the training data layouts. We keep the maze layout, as well as the start and goal locations of the agent, fixed throughout downstream learning. The policy outputs (x, y)-velocities for the agent. The state is represented as a local top-down view around the agent (see Fig. E.1). To represent the agent's velocity, we stack two consecutive 32 × 32 px observations as input to the policy. The agent receives a per-timestep binary reward when the distance to the goal is below a threshold.
Block Stacking. The block stacking environment is simulated using the MuJoCo physics engine. For data collection, we initialize the five blocks in the environment to be randomly stacked on top of each other or placed at random locations in between. We use a hand-coded data collection policy to generate trajectories with up to three consecutive stacking manipulations. The locations of the blocks and the movement of the agent are limited to a 2D plane, and a barrier prevents the agent from leaving the table. To increase the support of the collected trajectories we add noise to the hard-coded policy by placing pseudo-random subgoals in between and within stacking sequences.
The downstream task of the agent is to stack as many blocks as possible in a larger version of the environment with 11 blocks. The environment state is represented through a front view of the agent (see Fig. E.1). The policy's input is a stack of two 32 × 32 px images, and it outputs (x, z)-displacements for the robot as well as a continuous action in the range [0, 1] that represents the opening degree of the gripper. The agent receives per-timestep binary rewards for lifting a block from the ground and moving it on top of another block. It further receives a reward proportional to the height of the highest stacked tower.
Kitchen environment. We use the kitchen environment from the D4RL benchmark (Fu et al.,
2020) which was originally published by Gupta et al. (2019). For training we use the data provided
in D4RL (dataset version ”mixed”). It consists of trajectories collected via human tele-operation
that each perform four consecutive manipulations of objects in the environment. There are seven
manipulatable objects in the environment. The downstream task of the agent consists of performing
an unseen sequence of four manipulations - while the individual manipulations have been observed
in the training data, the agent needs to learn to recombine these skills in a new way to solve the task.
The state is a 30-dimensional vector representing the agent’s joint velocities as well as poses of the
manipulatable objects. The agent outputs 7-dimensional joint velocities for robot control as well as
a 2-dimensional continuous gripper opening/closing action. It receives a one-time reward whenever
fulfilling one of the subtasks.
D State-Conditioned Skill Decoder
Following prior works (Merel et al., 2019b; Kipf et al., 2019), we experimented with conditioning the skill decoder on the current environment state s_1. Specifically, the current environment state is passed through a neural network that predicts the initial hidden state of the decoding LSTM. We found that conditioning the skill decoder on the state did not improve downstream learning and can even lead to worse performance. In particular, it does not improve the exploration behavior of the skill-space policy without a learned prior (see Fig. E.3, left); a learned skill prior is still necessary for efficient exploration on the downstream task.
Additionally, we found that conditioning the skill decoder on the state can reduce downstream learning performance (see Fig. E.3, right). We hypothesize that state-conditioning can make the learning problem for the high-level policy more challenging: due to the state-conditioning, the same high-level action z can result in different decoded action sequences depending on the current state, making the high-level policy's action space dynamics more complex. As a result, downstream learning can be less stable.

Figure E.2: Comparison of policy execution traces on the kitchen environment (rows: Ours, Flat Prior, SSP w/o Prior; x-axis: time). Following Fu et al. (2020), the agent's task is to (1) open the microwave, (2) move the kettle backwards, (3) turn on the burner, and (4) switch on the light. Red frames mark the completion of subtasks. Our skill-prior guided agent (top) is able to complete all four subtasks. In contrast, the agent using a flat single-action prior (middle) only learns to solve two subtasks, but lacks temporal abstraction and hence fails to solve the complete long-horizon task. The skill-space policy without prior guidance (bottom) cannot efficiently explore the skill space and gets stuck in a local optimum in which it solves only a single subtask. Best viewed electronically and zoomed in. For videos, see: clvrai.com/spirl.

Figure E.3: Results for the state-conditioned skill decoder network ("Skills + Prior (Ours)" vs. "Skills w/o Prior + conditioned decoder"). Left: Exploration visualization as in Fig. 7.5. Even with a state-conditioned skill decoder, exploration without the skill prior is not able to cover a large fraction of the maze. In contrast, skills sampled from the learned skill prior lead to wide-ranging exploration when using the state-conditioned decoder. Right: Downstream learning performance of our approach and the skill-space policy w/o learned skill prior, with vs. without state-conditioning for the skill decoder. Only guidance through the learned skill prior enables learning success. A state-conditioned skill decoder can make the downstream learning problem more challenging, leading to lower performance ("ours" vs. "ours w/ state cond.").

Figure E.4: Ablation of prior regularization during downstream RL training (stacked blocks vs. environment steps; "Ours" vs. "SSP PriorInit + SAC"). Initializing the high-level policy with the learned prior but finetuning with conventional SAC is not sufficient to learn the task well.

Figure E.5: Ablation of prior initialization. Initializing the downstream task policy with the prior network improves training stability and convergence speed. However, the "w/o Init" runs demonstrate that the tasks can also be learned with prior regularization only.
E Prior Regularization Ablation
We ablate the influence of the prior regularization during downstream learning as described
in Section 7.3.3. Specifically, we compare to an approach that initializes the high-level policy
with the learned skill prior, but then uses conventional SAC (Haarnoja et al., 2018b) (with uniform
skill prior) to finetune on the downstream task. Fig. E.4 shows that the prior regularization during
downstream learning is essential for good performance: conventional maximum-entropy SAC (”SSP
PriorInit + SAC”) quickly leads the prior-initialized policy to deviate from the learned skill prior
by encouraging it to maximize the entropy of the distribution over skills, slowing the learning
substantially.
F Prior Initialization Ablation
For the RL experiments in Section 7.4 we initialize the weights of the high-level policy with the
learned skill prior network. Here, we compare the performance of this approach to an ablation that
does not perform the initialization. We find that prior initialization improves convergence speed and
stability of training (see Fig. E.5).
We identify two major challenges that make training policies ”from scratch”, without initializa-
tion, challenging: (1) sampling-based divergence estimates between the randomly initialized policy
and the prior distribution can be inaccurate in early phases of training, (2) learning gets stuck in
local optima where the divergence between policy and prior is minimized on a small subset of the
state space, but the policy does not explore beyond this region.
Since both our learned prior and the high-level policy are parametrized with Gaussian output
distributions, we can analytically compute the KL divergence to get a more stable estimate and
alleviate problem (1). To address problem (2) when training from scratch, we encourage exploration
by sampling a fractionω of the rollout steps during experience collection directly from the prior
instead of the policy. For the ”w/o Init” experiments in Fig. E.5 we setω = 1.0 for the first 500 k
steps (i.e. always sample from the prior) and then anneal it to 0 (i.e. always use the policy for
rollout collection) over the next 500 k steps. Note that policies trained with prior initialization do
not require these additions and still converge faster.
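The annealing schedule for ω can be sketched as a simple piecewise function; the function name and the linear annealing shape are our assumptions.

```python
def prior_sampling_fraction(step, warmup_steps=500_000, anneal_steps=500_000):
    """Fraction of rollout steps sampled from the prior instead of the policy.

    omega = 1.0 for the first warmup_steps, then annealed (here: linearly) to 0.
    """
    if step < warmup_steps:
        return 1.0
    progress = (step - warmup_steps) / anneal_steps
    return max(0.0, 1.0 - progress)
```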
G Training with Sub-Optimal Data
We investigate the influence of the training data on the performance of our approach. In
particular, we test whether it is possible to learn effective skill priors from heavily sub-optimal data.
For the experiments in Section 7.4 we trained the skill prior from high-quality experience
collected using expert-like policies, albeit on tasks that differ from the downstream task (see
Section C). However, in many practical cases the quality of training data can be mixed. In this
Section we investigate two scenarios for training from sub-optimal data.
Mixed Expert and Random Data. We assume that a large fraction of the data is collected by inexperienced users, leading to very low quality trajectories, while another part of the data is collected by experts and has high quality. We emulate this in the block stacking environment by combining rollouts collected by executing random actions with parts of the high-quality rollouts we used for the experiments in Section 7.4.3.

Table E.1: Number of blocks stacked vs. fraction of random training data.

% Random Data       0%    50%   75%
# Blocks Stacked    3.5   2.0   1.0

The results are shown in Table E.1. Our approach achieves good performance even when half of the training data consists of very low quality rollouts, and can learn meaningful skills for stacking blocks when 75% of the data is of low quality. The best baseline was only able to stack 1.5 blocks on average, even though it was trained on only high-quality data (see Fig. 7.4, middle).
Figure E.6: Success rate on the maze environment with sub-optimal training data ("Ours (BC-Data)" vs. "SSP w/o Prior"). Our approach, using a prior learned from sub-optimal data generated with the BC policy, is able to reliably learn to reach the goal, while the baseline that does not use the learned prior fails.
Figure E.7: Reuse of one learned skill prior for multiple downstream tasks. We train a single skill embedding and skill prior model and then use it to guide downstream RL for multiple tasks. Left: We test prior reuse on three different maze navigation tasks in the form of different goals that need to be reached. (1)-(3): Agent rollouts during training; the darker the rollout paths, the later during training they were collected. The same prior enables efficient exploration for all three tasks, but allows for convergence to task-specific policies that reach each of the goals upon convergence.
Only Non-Expert Data. We assume access to a dataset of only mediocre quality demonstrations,
without any expert-level trajectories. We generate this dataset in the maze environment by training
a behavior cloning (BC) policy on expert-level trajectories and using it to collect a new dataset. Due
to the limited capacity of the BC policy, this dataset is of substantially lower quality than the expert
dataset, e.g., the agent collides with walls on average 2.5 times per trajectory while it never collides
in expert trajectories.
While we find that a skill-prior-regularized agent trained on the mediocre data explores the maze
less widely than one trained on the expert data, it still works substantially better than the baseline
that does not use the skill prior, achieving a 100% success rate of reaching a faraway goal after <1M environment steps, while the baseline does not reach the goal even after 3M environment steps.
Both scenarios show that our approach can learn effective skill embeddings and skill priors even
from substantially sub-optimal data.
H Reuse of Learned Skill Priors
Our approach has two separate stages: (1) learning of skill embedding and skill prior from offline
data and (2) prior-regularized downstream RL. Since the learning of the skill prior is independent of
the downstream task, we can reuse the same skill prior for guiding learning on multiple downstream
tasks. To test this, we learn a single skill prior on the maze environment depicted in Fig. 7.3 (left)
and use it to train multiple downstream task agents that reach different goals.
In Fig. E.7 we show a visualization of the training rollouts in a top-down view, similar to the
visualization in Fig. 7.5; darker trajectories are more recent. We can see that the same prior is
able to guide downstream agents to efficiently learn to reach diverse goals. All agents achieve
~100% success rate upon convergence. Intuitively, the prior captures the knowledge that it is more meaningful to, e.g., cross doorways instead of bumping into walls, which helps exploration in the
maze independent of the goal position.
Appendix F
Demonstration-Guided Reinforcement Learning with Learned
Skills
A Full Algorithm
We detail our full SkiLD algorithm for demonstration-guided RL with learned skills in Algorithm 9. It is based on the SPiRL algorithm for RL with learned skills (Pertsch et al., 2020a), which in turn builds on Soft Actor-Critic (Haarnoja et al., 2018b), an off-policy model-free RL algorithm. We mark changes of our algorithm with respect to SPiRL and SAC in red in Algorithm 9.
The hyperparameters α and α_q can either be constant, or they can be automatically tuned using dual gradient descent (Haarnoja et al., 2018c; Pertsch et al., 2020a). In the latter case, we need to define a set of target divergences δ, δ_q. The parameters α and α_q are then optimized to ensure that the expected divergence between the policy and the skill prior and posterior distributions equals the chosen target divergence (see Algorithm 9).
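To illustrate how the discriminator output D(s) gates the two divergence terms and how the combined reward r_Σ is formed (lines 10-12 of Algorithm 9), here is a minimal sketch; the function and argument names are ours, and gamma_w denotes the reward weight γ (the discount is η in the algorithm).

```python
import torch

def skild_reward_and_penalty(env_reward, d_s, kl_posterior, kl_prior,
                             gamma_w, alpha, alpha_q):
    """Reward mixing and divergence weighting used by SkiLD (sketch).

    env_reward:   H-step environment reward r~(s, z)
    d_s:          discriminator output D(s) in (0, 1), probability that s lies
                  within the demonstration support
    kl_posterior: D_KL(pi(z|s), q_zeta(z|s)), divergence to the skill posterior
    kl_prior:     D_KL(pi(z|s), p(z|s)), divergence to the skill prior
    """
    # GAIL-style discriminator reward, mixed with the environment reward.
    disc_reward = torch.log(d_s) - torch.log(1.0 - d_s)
    r_sigma = (1.0 - gamma_w) * env_reward + gamma_w * disc_reward

    # Inside the demonstration support (D(s) near 1) regularize toward the skill
    # posterior; outside (D(s) near 0) fall back to the task-agnostic skill prior.
    divergence_penalty = alpha_q * kl_posterior * d_s + alpha * kl_prior * (1.0 - d_s)
    return r_sigma, divergence_penalty
```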
B Implementation and Experimental Details
B.1 Implementation Details: Pre-Training
We introduce our objective for learning the skill inference network q_ω(z|s, a) and low-level
skill policy π_φ(a_t|s_t, z) in Section 8.3.2.
Algorithm 9 SkiLD (Skill-based Learning with Demonstrations)
1: Inputs: H-step reward function r̃(s_t, z_t), reward weight γ, discount η, target divergences δ, δ_q, learning rates λ_π, λ_Q, λ_α, target update rate τ.
2: Initialize replay buffer D, high-level policy π_θ(z_t|s_t), critic Q_φ(s_t, z_t), target network Q_φ̄(s_t, z_t)
3: for each iteration do
4:   for every H environment steps do
5:     z_t ∼ π(z_t|s_t)                                              ▷ sample skill from policy
6:     s_t' ∼ p(s_{t+H}|s_t, z_t)                                    ▷ execute skill in environment
7:     D ← D ∪ {s_t, z_t, r̃(s_t, z_t), s_t'}                        ▷ store transition in replay buffer
8:   end for
9:   for each gradient step do
10:    r_Σ = (1 − γ) · r̃(s_t, z_t) + γ · [log D(s_t) − log(1 − D(s_t))]          ▷ compute combined reward
11:    Q̄ = r_Σ + η [ Q_φ̄(s_t', π_θ(z_t'|s_t')) − ( α_q D_KL(π_θ(z_t'|s_t'), q_ζ(z_t'|s_t')) · D(s_t')
12:                                                + α D_KL(π_θ(z_t'|s_t'), p(z_t'|s_t')) · (1 − D(s_t')) ) ]   ▷ compute Q-target
13:    θ ← θ − λ_π ∇_θ [ Q_φ(s_t, π_θ(z_t|s_t)) − ( α_q D_KL(π_θ(z_t|s_t), q_ζ(z_t|s_t)) · D(s_t)
14:                                                + α D_KL(π_θ(z_t|s_t), p(z_t|s_t)) · (1 − D(s_t)) ) ]        ▷ update policy weights
15:    φ ← φ − λ_Q ∇_φ [ ½ (Q_φ(s_t, z_t) − Q̄)² ]                                 ▷ update critic weights
16:    α ← α − λ_α ∇_α [ α · (D_KL(π_θ(z_t|s_t), p(z_t|s_t)) − δ) ]                ▷ update alpha
17:    α_q ← α_q − λ_α ∇_{α_q} [ α_q · (D_KL(π_θ(z_t|s_t), q_ζ(z_t|s_t)) − δ_q) ]  ▷ update alpha-q
18:    φ̄ ← τ φ + (1 − τ) φ̄                                                        ▷ update target network weights
19:  end for
20: end for
21: return trained policy π_θ(z_t|s_t)
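As a rough illustration of lines 10-14 of Algorithm 9, the discriminator-weighted reward and KL regularization terms could be computed as sketched below. This is a minimal PyTorch sketch with assumed interfaces (distribution objects for the policy, posterior, and prior; a pre-sigmoid discriminator output), not the exact implementation; the reward weight is written γ in Algorithm 9 and κ elsewhere in this chapter, and the sketch uses `kappa` to avoid clashing with the discount factor.

```python
import torch
from torch.distributions import kl_divergence

def skild_reward(env_reward, d_logit, kappa):
    """Combined reward of Algorithm 9, line 10:
    (1 - kappa) * r_env + kappa * (log D(s) - log(1 - D(s))).
    With D(s) = sigmoid(d_logit), the log-ratio equals the logit itself,
    which is the numerically stable way to evaluate it."""
    return (1.0 - kappa) * env_reward + kappa * d_logit

def weighted_kl_regularizer(pi, q_posterior, p_prior, d_prob, alpha, alpha_q):
    """Discriminator-weighted KL terms of lines 11-14: stay close to the
    task-specific skill posterior q_zeta inside the demonstration support
    (D(s) near 1) and to the task-agnostic skill prior p outside of it."""
    kl_post = kl_divergence(pi, q_posterior)
    kl_prior = kl_divergence(pi, p_prior)
    return alpha_q * d_prob * kl_post + alpha * (1.0 - d_prob) * kl_prior
```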
In practice, we instantiate all model components with deep neural networks Q_ω and Π_φ, respectively, and optimize the full model using back-propagation.
We also jointly train our skill prior network P. We follow the common assumption of Gaussian,
unit-variance output distributions for low-level policy actions, leading to the following network
loss:
L = Σ_{t=0}^{H−2} ||a_t − Π_φ(s_t, z)||²  +  β D_KL( Q_ω(s_{0:H−1}, a_{0:H−2}) || N(0, I) )    [skill representation training]
      + D_KL( ⌊Q_ω(s_{0:H−1}, a_{0:H−2})⌋ || P(s_0) ).                                         [skill prior training]
Here ⌊·⌋ indicates that we stop gradient flow from the prior training objective into the skill
inference network for improved training stability. After training the skill inference network with
the above objective, we train the skill posterior network Q_ζ by minimizing the KL divergence to the skill
inference network's output on trajectories sampled from the demonstration data. We minimize the
following objective:
L_post = D_KL( ⌊Q_ω(s_{0:H−1}, a_{0:H−2})⌋ || Q_ζ(s_0) ).
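For concreteness, this posterior training step can be sketched as follows. The sketch assumes that both networks output the mean and standard deviation of diagonal Gaussians over the skill embedding z; the function and argument names are ours, not the thesis code.

```python
import torch
from torch.distributions import Normal, kl_divergence

def skill_posterior_loss(inference_net, posterior_net, demo_states, demo_actions):
    """Match the skill posterior Q_zeta(z|s_0) to the frozen skill inference
    network's output Q_omega(z|s_{0:H-1}, a_{0:H-2}) on demonstration
    trajectories, i.e. minimize D_KL( stopgrad(Q_omega) || Q_zeta )."""
    with torch.no_grad():  # the floor brackets: no gradients into Q_omega
        mu, std = inference_net(demo_states, demo_actions)
    target = Normal(mu, std)
    mu_post, std_post = posterior_net(demo_states[:, 0])  # conditioned on s_0 only
    posterior = Normal(mu_post, std_post)
    return kl_divergence(target, posterior).sum(dim=-1).mean()
```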
We use a 1-layer LSTM with 128 hidden units for the inference network and 3-layer MLPs
with 128 hidden units in each layer for the low-level policy. We encode skills of horizon 10 into
10-dimensional skill representations z. Skill prior and posterior networks are implemented as 5-layer
MLPs with 128 hidden units per layer. They both parametrize mean and standard deviation of
Gaussian output distributions. All networks use batch normalization after every layer and leaky
ReLU activation functions. We tune the regularization weight β to be 1×10⁻³ for the maze and
5×10⁻⁴ for the kitchen and office environments.
For the demonstration discriminator D(s) we use a 3-layer MLP with only 32 hidden units per
layer to avoid overfitting. It uses a sigmoid activation function on the final layer and leaky ReLU
activations otherwise. We train the discriminator with binary cross-entropy loss on samples from
task-agnostic and demonstration datasets:

L_D = −(1/N) · [ Σ_{i=1}^{N/2} log D(s_i^d)   (demonstrations)   +   Σ_{j=1}^{N/2} log(1 − D(s_j))   (task-agnostic data) ].
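This is a standard binary classification objective; a minimal PyTorch sketch is given below, assuming `discriminator` is the sigmoid-terminated 3-layer MLP described above (the interface is an assumption).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, demo_states, task_agnostic_states):
    """Binary cross-entropy loss L_D: demonstration states are labeled 1,
    task-agnostic states are labeled 0."""
    d_demo = discriminator(demo_states)             # probabilities in (0, 1)
    d_other = discriminator(task_agnostic_states)   # probabilities in (0, 1)
    return F.binary_cross_entropy(d_demo, torch.ones_like(d_demo)) + \
           F.binary_cross_entropy(d_other, torch.zeros_like(d_other))
```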
We optimize all networks using the RAdam optimizer (Liu et al., 2020) with parameters β₁ = 0.9
and β₂ = 0.999, batch size 128, and learning rate 1×10⁻³. On a single NVIDIA Titan X GPU
we can train the skill representation and skill prior in approximately 5 hours, the skill posterior in
approximately 3 hours, and the discriminator in approximately 3 hours.
B.2 Implementation Details: Downstream RL
The architecture of the policy mirrors that of the skill prior and posterior networks. The
critic is a simple 2-layer MLP with 256 hidden units per layer. The policy outputs the parameters of
a Gaussian action distribution while the critic outputs a single Q-value estimate. We initialize the
policy with the weights of the skill posterior network.

We use the hyperparameters of the standard SAC implementation (Haarnoja et al., 2018b)
with batch size 256, replay buffer capacity of 1×10⁶, and discount factor γ = 0.99. We collect
5000 warmup rollout steps to initialize the replay buffer before training. We use the Adam optimizer (Kingma and Ba, 2015) with β₁ = 0.9, β₂ = 0.999, and learning rate 3×10⁻⁴ for updating the
policy, critic, and temperatures α and α_q. Analogous to SAC, we train two separate critic networks
and compute the Q-value as the minimum over both estimates to stabilize training. The corresponding target networks get updated at a rate of τ = 5×10⁻³. The policy's actions are limited to the
range [−2, 2] by a tanh "squashing function" (see Haarnoja et al. (2018b), appendix C).

We use automatic tuning of α and α_q in the maze navigation task and set the target divergences
to 1 and 10, respectively. In the kitchen and office environments we obtained the best results by using
constant values of α = α_q = 1×10⁻¹. In all experiments we set κ = 0.9.

For all RL results we average the results of three independently seeded runs and display the mean
and standard deviation across seeds.
B.3 Implementation Details: Comparisons
BC+RL. This comparison is representative of demonstration-guided RL approaches that use BC
objectives to initialize and regularize the policy during RL (Rajeswaran et al., 2018; Nair et al.,
2018). We pre-train a BC policy on the demonstration dataset and use it to initialize the RL policy.
We use SAC to train the policy on the target task. Similar to Nair et al. (2018) we augment the
policy update with a regularization term that minimizes the L2 loss between the predicted mean of
the policy's output distribution and the output of the BC pre-trained policy.¹

¹ We also tried sampling action targets directly from the demonstration replay buffer, but found using a BC policy as target more effective on the tested tasks.

Figure F.1: Qualitative results for GAIL+RL on maze navigation. Even though it makes progress towards the goal (red), it fails to ever obtain the sparse goal-reaching reward.

Figure F.2: We compare the exploration behavior in the maze. We roll out skills sampled from SPiRL's task-agnostic skill prior (left) and our task-specific skill posterior (right) and find that the latter leads to more targeted exploration towards the goal (red).
Demo Replay. This comparison is representative of approaches that initialize the replay buffer of
an off-policy RL agent with demonstration transitions (Vecerik et al., 2017; Hester et al., 2018). In
practice we use SAC and initialize a second replay buffer with the demonstration transitions. Since
the demonstrations do not come with reward, we heuristically set the reward of each demonstration
trajectory to be a high value (100 for the maze, 4 for the robotic environments) on the final transition
and zero everywhere else. During each SAC update, we sample half of the training mini-batch from
the normal SAC replay buffer and half from the demonstration replay buffer. All other aspects of
SAC remain unchanged.
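A minimal sketch of the mixed sampling used by this comparison is shown below; the buffer objects and their `sample` interface (returning a dictionary of batched tensors) are assumptions made for illustration.

```python
import torch

def sample_mixed_batch(online_buffer, demo_buffer, batch_size):
    """Draw half of each SAC training mini-batch from the regular replay
    buffer and half from the demonstration replay buffer."""
    half = batch_size // 2
    online = online_buffer.sample(half)
    demo = demo_buffer.sample(batch_size - half)
    return {key: torch.cat([online[key], demo[key]], dim=0) for key in online}
```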
B.4 Environment Details
Figure F.3: Office cleanup task. The robot agent needs to place three randomly sampled objects
(1-7) inside randomly sampled containers (a-c). During task-agnostic data collection we apply
random noise to the initial position of the objects.
Maze Navigation. We adapt the maze navigation task from Pertsch et al. (2020a) which extends
the maze navigation tasks from the D4RL benchmark (Fu et al., 2020). The starting position is
sampled uniformly from a start region and the agent receives a one-time sparse reward of 100 when
reaching the fixed goal position, which also ends the episode. The 4D observation space contains
the 2D position and velocity of the agent. The agent is controlled via 2D velocity commands.
Robot Kitchen Environment. We use the
kitchen environment from Gupta et al. (2019).
For solving the target task, the agent needs to
execute a fixed sequence of four subtasks by
controlling a Franka Emika Panda 7-DOF robot via joint velocity and continuous gripper actuation
commands. The 30-dimensional state space contains the robot's joint angles as well as
object-specific features that characterize the position of each of the manipulatable objects. We
use 20 state-action sequences from the dataset of Gupta et al. (2019) as demonstrations. Since
the dataset does not have large variation within the demonstrations for one task, the support of
those demonstrations is very narrow. We collect a demonstration dataset with widened support by
initializing the environment at states along the demonstrations and rolling out a random policy for
10 steps.
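This support-widening procedure can be sketched as follows. The sketch is hypothetical: the `reset_to_state` helper, the demonstration format, and the number of resets per demonstration are assumptions, not the actual data-collection code.

```python
import numpy as np

def widen_demo_support(env, demonstrations, rollout_len=10, resets_per_demo=20):
    """Broaden the state support of the demonstration data by resetting the
    environment to states along each demonstration and rolling out a random
    policy for a few steps."""
    widened = []
    for demo_states in demonstrations:  # each demo: array of states along the trajectory
        idxs = np.random.randint(len(demo_states), size=resets_per_demo)
        for i in idxs:
            obs = env.reset_to_state(demo_states[i])  # assumed helper
            trajectory = [obs]
            for _ in range(rollout_len):
                action = env.action_space.sample()    # random policy
                obs, reward, done, info = env.step(action)
                trajectory.append(obs)
                if done:
                    break
            widened.append(trajectory)
    return widened
```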
Robot Office Environment. We create a novel office cleanup task in which a 5-DOF WidowX
robot needs to place a number of objects into designated containers, requiring the execution
of a sequence of pick, place and drawer open and close subtasks (see Figure F.3). The agent
controls position and orientation of the end-effector and a continuous gripper actuation, resulting
in a 7-dimensional action space. For simulating the environment we build on the Roboverse
framework (Singh et al., 2020). During collection of the task-agnostic data we randomly sample
a subset of three of the seven objects as well as a random order of target containers and use
scripted policies to execute the task. We only save successful executions. For the target task we
fix object positions and require the agent to place three objects in fixed target containers. The
97-dimensional state space contains the agent’s end-effector position and orientation as well as
position and orientation of all objects and containers.
Differences to Pertsch et al. (2020a). While both maze navigation and kitchen environment are
based on the tasks in Pertsch et al. (2020a), we made multiple changes to increase task complexity,
resulting in the lower absolute performance of the SPiRL baseline in Figure 8.4. For the maze
navigation task we added randomness to the starting position and terminate the episode upon
reaching the goal position, reducing the max. reward obtainable for successfully solving the
task. We also switched to a low-dimensional state representation for simplicity. For the kitchen
environment, the task originally used in Gupta et al. (2019) as well as Pertsch et al. (2020a) was
well aligned with the training data distribution and there were no demonstrations available for this
task. In our evaluation we use a different downstream task (see section F) which is less well-aligned
with the training data and therefore harder to learn. This also allows us to use sequences from the
dataset of Gupta et al. (2019) as demonstrations for this task.
C Skill Representation Comparison
In Section 8.3.2 we described our skill representation based on a closed-loop low-level policy
as a more powerful alternative to the open-loop action decoder-based representation of Pertsch
et al. (2020a). To compare the performance of the two representations we perform rollouts with the
learned skill prior: we sample a skill from the prior and roll out the low-level policy for H steps. We
repeat this until the episode terminates and visualize the results for multiple episodes in the maze and
kitchen environments in Figure F.4.

Figure F.4: Comparison of our closed-loop skill representation with the open-loop representation
of Pertsch et al. (2020a). Top: Skill prior rollouts for 100k steps in the maze environment. Bottom:
Subtask success rates for prior rollouts in the kitchen environment.
In Figure F.4 (top) we see that both representations lead to effective exploration in the maze
environment. Since the 2D maze navigation task does not require control in high-dimensional action
spaces, both skill representations are sufficient to accurately reproduce behaviors observed in the
task-agnostic training data.
In contrast, the results on the kitchen environment (Figure F.4, bottom) show that the closed-loop
skill representation is able to more accurately control the high-DOF robotic manipulator and reliably
solve multiple subtasks per rollout episode.² We hypothesize that the closed-loop skill policy is able
to learn more robust skills from the task-agnostic training data, particularly in high-dimensional
control problems.

² See https://sites.google.com/view/skill-demo-rl for skill prior rollout videos with both skill representations in the kitchen environment.

Figure F.5: Downstream task performance for prior demonstration-guided RL approaches with
combined task-agnostic and task-specific data. All prior approaches are unable to leverage the
task-agnostic data, showing a performance decrease when attempting to use it.
D Demonstration-Guided RL Comparisons with Task-Agnostic Experience
In Section 8.4.2 we compared our approach to prior demonstration-guided RL approaches which
are not designed to leverage task-agnostic datasets. We applied these prior works in the setting they
were designed for: using only task-specific demonstrations of the target task. Here, we conduct
experiments in which we run these prior works using the combined task-agnostic and task-specific
datasets to give them access to the same data that our approach used.
From the results in Figure F.5 we can see that none of the prior works is able to effectively
leverage the additional task-agnostic data. In many cases the performance of the approaches is worse
than when only using task-specific data (see Figure 8.4). Since prior approaches are not designed to
leverage task-agnostic data, applying them in the combined-data setting can hurt learning on the
target task. In contrast, our approach can effectively leverage the task-agnostic data for accelerating
demonstration-guided RL.
Figure F.6: Imitation learning performance on maze navigation and kitchen tasks (compared methods:
SkiLD (Demo-RL), SkiLD (Imitation), SkiLD (Imitation) w/ D-finetuning, BC, GAIL). Compared to
prior imitation learning methods, SkiLD can leverage prior experience to enable the imitation of
complex, long-horizon behaviors. Finetuning the pre-trained discriminator D(s) further improves
performance on more challenging control tasks like in the kitchen environment.
E Skill-Based Imitation Learning
We ablate the influence of the environment reward feedback on the performance of our approach
by setting the reward weight κ = 1.0, thus relying solely on the learned discriminator reward.
Our goal is to test whether our approach SkiLD is able to leverage task-agnostic experience to
improve the performance of pure imitation learning, i.e., learning to follow demonstrations without
environment reward feedback.
We compare SkiLD to common approaches for imitation learning: behavioral cloning (BC,
Pomerleau (1989)) and generative adversarial imitation learning (GAIL, Ho and Ermon (2016)). We
also experiment with a version of our skill-based imitation learning approach that performs online
finetuning of the pre-trained discriminator D(s) using data collected during training of the imitation
policy.
We summarize the results of the imitation learning experiments in Figure F.6. Learning purely
by imitating the demonstrations, without additional reward feedback, is generally slower than
demonstration-guided RL on tasks that require more challenging control, like in the kitchen environment, where the pre-trained discriminator does not capture the desired trajectory distribution
accurately. Yet, we find that our approach is able to leverage task-agnostic data to effectively
imitate complex, long-horizon behaviors while prior imitation learning approaches struggle. Further,
online finetuning of the learned discriminator improves imitation learning performance when the
pre-trained discriminator is not accurate enough.

Figure F.7: Subtask transition probabilities in the kitchen environment's task-agnostic training
dataset from Gupta et al. (2019). Each dataset trajectory consists of four consecutive subtasks, of
which we display three (yellow: first, green: second, grey: third subtask). The transition probability
to the fourth subtask is always near 100%. In Section 8.4.5 we test our approach on a target task
with good alignment to the task-agnostic data (Microwave - Kettle - Light Switch - Hinge Cabinet)
and a target task which is mis-aligned to the data (Microwave - Light Switch - Slide Cabinet - Hinge
Cabinet).
In the maze navigation task the pre-trained discriminator represents the distribution of solution
trajectories well, so pure imitation performance is comparable to demonstration-guided RL. We find
that finetuning the discriminator on the maze “sharpens” the decision boundary of the discriminator,
i.e., increases its confidence in correctly estimating the demonstration support. Yet, this does not
lead to faster overall convergence since the pre-trained discriminator is already sufficiently accurate.
F Kitchen Data Analysis
For the kitchen manipulation experiments we use the dataset provided by Gupta et al. (2019)
as task-agnostic pre-training data. It consists of 603 teleoperated sequences, each of which shows
the completion of four consecutive subtasks. In total there are seven possible subtasks: opening
the microwave, moving the kettle, turning on top and bottom burner, flipping the light switch and
opening a slide and a hinge cabinet.
In Figure F.7 we analyze the transition probabilities between subtasks in the task-agnostic
dataset. We can see that these transition probabilities are not uniformly distributed, but instead
certain transitions are more likely than others, e.g., it is much more likely to sample a training
trajectory in which the agent first opens the microwave than one in which it starts by turning on the
bottom burner.
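The transition statistics visualized in Figure F.7 can be reproduced with a simple counting script of the following form; this is a sketch that assumes each trajectory is annotated with an ordered list of subtask labels.

```python
from collections import Counter, defaultdict

def subtask_transition_probabilities(subtask_sequences):
    """Estimate P(next subtask | current subtask) from trajectories annotated
    with their ordered subtask labels."""
    counts = defaultdict(Counter)
    for sequence in subtask_sequences:          # e.g. ["microwave", "kettle", ...]
        for current, nxt in zip(sequence[:-1], sequence[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(nxt_counts.values()) for nxt, n in nxt_counts.items()}
        for current, nxt_counts in counts.items()
    }
```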
In Section 8.4.5 we test the effect this bias in transition probabilities has on the learning of
target tasks. Concretely, we investigate two cases: good alignment between task-agnostic data and
target task and mis-alignment between the two. In the former case we choose the target task Kettle -
Bottom Burner - Top Burner - Slide Cabinet, since the required subtask transitions are likely under
the training data distribution. For the mis-aligned case we choose Microwave - Light Switch - Slide
Cabinet - Hinge Cabinet as target task, since particularly the transition from opening the microwave
to flipping the light switch is very unlikely to be observed in the training data.
Appendix G
Skill-based Model-based Reinforcement Learning
A Further Ablation Studies
We include additional ablations on the Maze and Kitchen tasks to further investigate the influence
of the skill horizon H and the planning horizon N, which are important for skill learning and planning.
A.1 Skill Horizon
Figure G.1: Ablation analysis on skill horizon H. (a) Maze: average success rate vs. environment steps. (b) Kitchen: average number of completed subtasks vs. environment steps. The compared values are H = 1, 5, 10, 15, 20.

In both Maze and Kitchen, we find that a too short skill horizon (H = 1, 5) is unable to yield
sufficient temporal abstraction. A longer skill horizon (H = 15, 20) has little influence in Kitchen,
but makes the downstream performance much worse in Maze. This is because with too long-horizon
skills, skill dynamics prediction becomes too difficult and stochastic, and composing multiple skills
is not as flexible as with short-horizon skills. The inaccurate skill dynamics makes long-term planning
harder, which is already a major challenge in maze navigation.
A.2 Planning Horizon
Figure G.2: Ablation analysis on planning horizon N. (a) Maze: average success rate vs. environment steps. (b) Kitchen: average number of completed subtasks vs. environment steps. The compared values are N = 1, 5, 10, 15, 20.

In Figure G.2b, we see that a short planning horizon makes learning slower in the beginning,
because it does not effectively leverage the skill dynamics model to plan further ahead. Conversely,
if the planning horizon is too long, the performance becomes worse due to the difficulty in modeling
every step accurately. Indeed, a planning horizon of 20 corresponds to 200 low-level steps, while the
episode length in Kitchen is 280, demanding that the agent plan for nearly the entire episode. The
performance is not sensitive to intermediate planning horizons. On the other hand, the effect of the
planning horizon differs in Maze due to distinct environment characteristics. We find that a very long
planning horizon (e.g., 20) and a very short planning horizon (e.g., 1) perform similarly in Maze
(Figure G.2a). This could be because the former creates useful long-horizon plans, while the latter
avoids error accumulation altogether.
A.3 Fine-Tuning Model
Figure G.3: Ablation analysis on fine-tuning the model. (a) Maze: average success rate vs. environment steps. (b) Kitchen: average number of completed subtasks vs. environment steps. The compared methods are SkiMo (Ours) and SkiMo w/ frozen dynamics.

We freeze the skill dynamics model together with the state encoder to gauge the effect of
fine-tuning after pre-training. Figure G.3 shows that without fine-tuning the model, the agent
performs worse due to the discrepancy between the distributions of the offline data and the downstream
task. We hypothesize that fine-tuning is necessary when the agent needs to adapt to a different task
and state distribution after pre-training.
B Qualitative Analysis on Maze
B.1 Exploration and Exploitation
To gauge the agent's ability to explore and exploit, we visualize the replay buffer of
each method in Figure G.4. In this visualization, we represent early trajectories in the replay buffer
with light blue dots and recent trajectories with dark blue dots. In Figure G.4a, the replay buffer of
SkiMo (ours) contains early explorations that span most corners of the maze. After it finds the
goal, it exploits this knowledge and commits to paths between the start location and the goal
(in dark blue). This explains why our method can quickly learn and consistently accomplish the
task. Dreamer and TD-MPC only explore a small fraction of the maze, because they are prone to getting
stuck at corners or walls without guided exploration from skills and skill priors. SPiRL + Dreamer,
SPiRL + TD-MPC, and SkiMo w/o joint training explore better than Dreamer and TD-MPC, but all
fail to find the goal. This is because without the joint training of the model and policy, the skill space
is only optimized for action reconstruction, not for planning, which makes long-horizon exploration
and exploitation harder.

On the other hand, SkiMo + SAC and SPiRL are able to explore most of the maze,
but the coverage is too wide to enable efficient learning. That is, even after the agent
finds the goal through exploration, it continues to explore and does not exploit this experience to
accomplish the task consistently (darker blue). This can be attributed to the difficult long-horizon
credit assignment problem, which makes policy learning slow, and to the reliance on the skill prior,
which encourages exploration. On the contrary, our skill dynamics model effectively absorbs prior
experience to generate goal-achieving imaginary rollouts for the actor and critic to learn from,
which makes task learning more efficient. In essence, we find the skill dynamics model useful in
guiding the agent to explore coherently and exploit efficiently.
Figure G.4: Exploration and exploitation behaviors of our method and baseline approaches: (a) SkiMo (Ours), (b) Dreamer, (c) TD-MPC, (d) SPiRL + Dreamer, (e) SPiRL + TD-MPC, (f) SkiMo w/o Joint Training, (g) SkiMo + SAC, (h) SPiRL. We
visualize trajectories in the replay buffer at 1.5M training steps in blue (light blue for early trajectories
and dark blue for recent trajectories). Our method shows wide coverage of the maze at the early
stage of training, and fast convergence to the solution.
B.2 Long-Horizon Prediction
To compare the long-term prediction ability of the skill dynamics and flat dynamics, we visualize
imagined trajectories by sampling trajectory clips of 500 timesteps from the agent’s replay buffer
(the maximum episode length in Maze is 2,000), and predicting the latent state 500 steps ahead
(which will be decoded using the observation decoder) given the initial state and 500 ground-truth
actions (50 skills for SkiMo). The similarity between the imagined trajectory and the ground
truth trajectory can indicate whether the model can make accurate predictions far into the future,
producing useful imaginary rollouts for policy learning and planning.
SkiMo is able to reproduce the ground truth trajectory with little prediction error even when
traversing through hallways and doorways while Dreamer struggles to make accurate long-horizon
predictions due to error accumulation. This is mainly because SkiMo allows temporal abstraction in
the dynamics model, thereby enabling temporally-extended prediction and reducing step-by-step
prediction error.
Figure G.5: Prediction results over 500 timesteps using (a) a flat single-step model (Dreamer) and (b) our skill dynamics
model (SkiMo), starting from the ground truth initial state and 500 ground-truth actions (50 skills for SkiMo). Each panel shows the ground truth, the prediction, and both overlaid. The
predicted states from the flat model deviate from the ground truth trajectory quickly while the
prediction of our skill dynamics model has little error.
C Implementation Details
C.1 Computing Resources
Our approach and all baselines are implemented in PyTorch (Paszke et al., 2017). All experi-
ments are conducted on a workstation with an Intel Xeon E5-2640 v4 CPU and a NVIDIA Titan Xp
GPU. Pre-training of the skill policy and skill dynamics model takes around 10 hours. Downstream
RL for 2M timesteps takes around 18 hours. The policy and model update frequency is the same
for all algorithms except Dreamer (Hafner et al., 2019). Since only Dreamer trains on primitive
actions, it performs model and policy updates 10 times more frequently than the skill-based algorithms, which
leads to slower training (about 52 hours).
C.2 Algorithm Implementation Details
For the baseline implementations, we use the official code for SPiRL and re-implemented
Dreamer and TD-MPC in PyTorch, which are verified on DM control tasks (Tassa et al., 2018). The
table below (Table G.1) compares key components of SkiMo with model-based and skill-based
baselines and ablated methods.
Table G.1: Comparison to prior work and ablated methods.

Method                                   Skill-based   Model-based   Joint training
Dreamer and TD-MPC                       ✗             ✓             ✗
SPiRL                                    ✓             ✗             ✗
SPiRL + Dreamer and SPiRL + TD-MPC       ✓             ✓             ✗
SkiMo w/o joint training                 ✓             ✓             ✗
SkiMo + SAC                              ✓             ✗             ✓
SkiMo (Ours) and SkiMo w/o CEM           ✓             ✓             ✓
Dreamer (Hafner et al., 2019) We use the same hyperparameters as the official implementation.
TD-MPC (Hansen et al., 2022) We use the same hyperparameters as the official implementation, except that we do not use prioritized experience replay (Schaul et al., 2016). The
same implementation is used for the SPiRL + TD-MPC baseline and our method, with only minor
modifications.
SPiRL (Pertsch et al., 2020a) We use the official implementation of the original paper and use
the hyperparameters suggested in the official implementation.
SPiRL + Dreamer (Pertsch et al., 2020a) We use our implementation of Dreamer and simply
replace the action space with the latent skill space of SPiRL. We use the same pre-trained SPiRL
skill policy and skill prior networks as the SPiRL baseline. Initializing the high-level downstream
task policy with the skill prior, which is critical for downstream learning performance (Pertsch
et al., 2020a), is not possible due to the policy network architecture mismatch between Dreamer and
SPiRL. Thus, we only use the prior divergence to regularize the high-level policy instead. Directly
pre-training the high-level policy did not lead to better performance, but it might have worked better
with more tuning.
SPiRL + TD-MPC (Hansen et al., 2022) Similar to SPiRL + Dreamer, we use our implementation
of TD-MPC and replace the action space with the latent skill space of SPiRL. The initialization of
the task policy is also not available due to the different architecture used for TD-MPC.
SkiMo (Ours) The skill-based RL part of our method is inspired by Pertsch et al. (2020a) and the
model-based component is inspired by Hansen et al. (2022) and Hafner et al. (2019). We elaborate
our skill and skill dynamics learning in Algorithm 10, the planning algorithm in Algorithm 11, and
model-based RL in Algorithm 12. Table G.2 lists all the hyperparameters we used.
Algorithm 10 SkiMo RL (skill and skill dynamics learning)
Require: D: offline task-agnostic data
1: Randomly initialize θ, ψ
2: ψ⁻ ← ψ                                            ▷ initialize target network
3: for each iteration do
4:   Sample mini-batch B = (s, a)_(0:NH) ∼ D
5:   [θ, ψ] ← [θ, ψ] − λ_[θ,ψ] ∇_[θ,ψ] L(B)          ▷ L from Equation (9.5)
6:   ψ⁻ ← (1 − τ)ψ⁻ + τψ                             ▷ update target network
7: end for
8: return θ, ψ, ψ⁻
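A compact sketch of this pre-training loop is given below; the joint loss is assumed to be implemented elsewhere as in Equation (9.5), and the optimizer choice and parameter-update helper are our own illustrative assumptions.

```python
import copy
import torch

def pretrain_skimo(model, data_loader, joint_loss_fn, lr=1e-3, tau=0.01, epochs=100):
    """Jointly optimize the skill policy/encoder (theta) and the skill dynamics
    and state encoder (psi) on offline task-agnostic data while maintaining an
    exponential-moving-average target copy of the parameters."""
    target_model = copy.deepcopy(model)                      # psi^- <- psi
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in data_loader:                            # (s, a)_{0:NH} segments
            loss = joint_loss_fn(model, target_model, batch) # L from Eq. (9.5), assumed
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():                            # psi^- <- (1 - tau) psi^- + tau psi
                for p_tgt, p in zip(target_model.parameters(), model.parameters()):
                    p_tgt.mul_(1.0 - tau).add_(tau * p)
    return model, target_model
```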
Algorithm 11 SkiMo RL (CEM planning)
Require: θ, ψ, φ: learned parameters, s_t: current state
1: µ⁰, σ⁰ ← 0, 1                                                        ▷ initialize sampling distribution
2: for i = 1, ..., N_CEM do
3:   Sample N_sample trajectories of length N from N(µ^{i−1}, (σ^{i−1})²)  ▷ sample skill sequences from normal distribution
4:   Sample N_π trajectories of length N using π_φ, D_ψ                    ▷ sample skill sequences via imaginary rollouts
5:   Estimate N-step returns of the N_sample + N_π trajectories using R_φ, Q_φ
6:   Compute µ^i, σ^i with the top-k return trajectories                   ▷ update parameters for next iteration
7: end for
8: Sample a skill z ∼ N(µ^{N_CEM}, (σ^{N_CEM})²)
9: return z
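A minimal sketch of this planning loop in skill space is shown below. The `encode`, `dynamics`, `policy`, `reward_fn`, and `value_fn` callables are assumed interfaces over batched latent states; the planning discount and the std bounds from Table G.2 are omitted for brevity. This is an illustration, not the thesis implementation.

```python
import torch

def cem_plan(encode, dynamics, policy, reward_fn, value_fn, state,
             horizon=10, skill_dim=10, n_iter=6, n_sample=512, n_pi=25, k=64):
    """Cross-entropy method in skill space (Algorithm 11): sample skill
    sequences, roll them out in the latent skill dynamics, score them with the
    reward and value heads, and refit a Gaussian to the top-k sequences."""
    mu = torch.zeros(horizon, skill_dim)
    std = torch.ones(horizon, skill_dim)
    with torch.no_grad():
        h0 = encode(state).reshape(1, -1)                      # latent state, shape (1, d)
        for _ in range(n_iter):
            # skill sequences sampled from the current Gaussian
            samples = mu + std * torch.randn(n_sample, horizon, skill_dim)
            # skill sequences proposed by the task policy via imaginary rollouts
            h = h0.expand(n_pi, -1)
            pi_seq = []
            for _ in range(horizon):
                z = policy(h).sample()
                pi_seq.append(z)
                h = dynamics(h, z)
            candidates = torch.cat([samples, torch.stack(pi_seq, dim=1)], dim=0)
            # estimate N-step returns in imagination, bootstrapped with the value
            h = h0.expand(candidates.shape[0], -1)
            returns = torch.zeros(candidates.shape[0])
            for t in range(horizon):
                returns = returns + reward_fn(h, candidates[:, t])
                h = dynamics(h, candidates[:, t])
            returns = returns + value_fn(h)
            # refit the sampling distribution to the top-k skill sequences
            elites = candidates[returns.topk(k).indices]
            mu, std = elites.mean(dim=0), elites.std(dim=0)
    return torch.normal(mu[0], std[0])                         # execute the first skill
```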
Algorithm 12 SkiMo RL (downstream task learning)
Require: θ, ψ, ψ⁻: pre-trained parameters
1: B ← ∅                                                   ▷ initialize replay buffer
2: Randomly initialize φ
3: φ⁻ ← φ                                                  ▷ initialize target network
4: π_φ ← p_θ                                               ▷ initialize task policy with skill prior
5: for not converged do
6:   t ← 0, s_0 ∼ ρ_0                                      ▷ initialize episode
7:   for episode not done do
8:     z_t ∼ CEM(s_t)                                      ▷ MPC with CEM planning in Algorithm 11
9:     s, r_t ← s_t, 0
10:    for H steps do
11:      s, r ← ENV(s, π^L_θ(E_ψ(s), z_t))                 ▷ rollout low-level skill policy
12:      r_t ← r_t + r
13:    end for
14:    B ← B ∪ (s_t, z_t, r_t)                             ▷ collect H-step environment interaction
15:    t ← t + H
16:    s_t ← s
17:    Sample mini-batch B = (s, z, r)_(0:N) ∼ B
18:    [φ, ψ] ← [φ, ψ] − λ_[φ,ψ] ∇_[φ,ψ] L'_REC(B)         ▷ L'_REC from Equation (9.6)
19:    φ_π ← φ_π − λ_φ ∇_{φ_π} L_RL(B)                     ▷ L_RL from Equation (9.7); update only policy parameters
20:    ψ⁻ ← (1 − τ)ψ⁻ + τψ                                 ▷ update target network
21:    φ⁻ ← (1 − τ)φ⁻ + τφ                                 ▷ update target network
22:  end for
23: end for
24: return ψ, φ
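The H-step skill execution in lines 10-13 of Algorithm 12 amounts to a simple wrapper around the environment; a sketch is given below with assumed interfaces for the encoder, low-level policy, and environment (not the thesis code).

```python
def execute_skill(env, state_encoder, low_level_policy, obs, skill, horizon=10):
    """Roll out the low-level skill policy for H environment steps and
    accumulate the reward for the resulting high-level transition."""
    skill_reward, done = 0.0, False
    for _ in range(horizon):
        action = low_level_policy(state_encoder(obs), skill)
        obs, reward, done, info = env.step(action)
        skill_reward += reward
        if done:
            break
    return obs, skill_reward, done
```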
C.3 Environments and Data
Maze (Fu et al., 2020; Pertsch et al., 2021) Since our goal is to leverage offline data collected
from diverse tasks in the same environment, we use a variant of the maze environment (Fu et al.,
2020), suggested in Pertsch et al. (2021). The maze is of size 40×40; an initial state is randomly
sampled near a pre-defined region (the green circle in Figure 9.3a), and the goal position is fixed,
shown as the red circle in Figure 9.3a. The observation consists of the agent's 2D position and
velocity. The agent moves around the maze by controlling the continuous value of its (x, y) velocity.
The maximum episode length is 2,000, but an episode also terminates when the agent reaches
the circle of radius 2 around the goal. A reward of 100 is given at task completion. We use
the offline data of 3,046 trajectories, collected from randomly sampled start and goal state pairs
from Pertsch et al. (2021). Thus, the offline data and downstream task share the same environment,
but have different start and goal states (i.e. different tasks). This data can be used to extract
short-horizon skills like navigating hallways or passing through narrow doors.
Kitchen (Gupta et al., 2019; Fu et al., 2020) The 7-DoF Franka Panda robot arm needs to
perform four sequential tasks (open microwave, move kettle, turn on bottom burner, and flip
light switch). The agent has a 30D observation space (11D robot proprioceptive state and 19D
object states), which removes a constant 30D goal state in the original environment, and 9D action
space (7D joint velocity and 2D gripper velocity). The agent receives a reward of 1 for every
sub-task completion. The episode length is 280 and an episode also ends once all four sub-tasks
are completed. The initial state is set with a small noise in every state dimension. We use 603
trajectories collected by teleoperation from Gupta et al. (2019) as the offline task-agnostic data.
The data involves interaction with all seven manipulatable objects in the environment, but during
downstream learning the agent needs to execute an unseen sequence of four subtasks. That is, the
agent can transfer a rich set of manipulation skills, but needs to recombine them in new ways to
solve the task.
Mis-aligned Kitchen (Pertsch et al., 2021) The environment and task-agnostic data are the same
as in Kitchen, but we use a different downstream task (open microwave, flip light switch, slide
cabinet door, and open hinge cabinet, as illustrated in Figure 9.3c). This task ordering is not aligned
with the sub-task transition probabilities of the task-agnostic data, which leads to challenging
exploration following the prior from data. This is because the transition probabilities in the Kitchen
human-teleoperated dataset are not uniformly distributed; instead certain transitions are more likely
than others. For example, the first transition in our target task — from opening the microwave to
flipping the light switch — is very unlikely to be observed in the training data. This simulates the
real-world scenario where the large offline dataset may not be meticulously curated for the target
task.
CALVIN (Mees et al., 2022) We adapt the CALVIN environment (Mees et al., 2022) for long-
horizon learning with the state observation. The CALVIN environment uses a Franka Emika Panda
robot arm with 7D end-effector pose control (relative 3D position, 3D orientation, 1D gripper action).
The 21D observation space consists of the 15D proprioceptive robot state and 6D object state. We
use the teleoperated play data (Task D→Task D) of 1,239 trajectories from Mees et al. (2022) as
our task-agnostic data. The agent receives a sparse reward of 1 for every sub-task completion in
the correct order. The episode length is 360 and an episode also ends once all four sub-tasks are
completed. In the data, there are 34 available target sub-tasks, and each sub-task can transition to any
other sub-task, which makes each transition probability lower than 0.1% on average. Most of the
subtask transitions in our downstream task occupy less than 0.08% of all transitions in the CALVIN
task-agnostic dataset.
Table G.2: SkiMo hyperparameters.

Hyperparameter                                            Maze          FrankaKitchen   CALVIN
Model architecture
  # Layers of O_θ, p_θ, π^L_θ, E_ψ, D_ψ, π_φ, R_φ, Q_φ    5             5               5
  Activation function                                      elu           elu             elu
  Hidden dimension                                         128           128             256
  State embedding dimension                                128           256             256
  Skill encoder (q_θ)                                      5-layer MLP   LSTM            LSTM
  Skill encoder hidden dimension                           128           128             128
Pre-training
  Pre-training batch size                                  512           512             512
  # Training mini-batches per update                       5             5               5
  Model-actor joint learning rate (λ_[θ,ψ])                0.001         0.001           0.001
  Encoder KL regularization (β)                            0.0001        0.0001          0.0001
  Reconstruction loss coefficient (λ_O)                    1             1               1
  Consistency loss coefficient (λ_L)                       2             2               2
  Low-level actor loss coefficient (λ_BC)                  2             2               2
  Planning discount (ρ)                                    0.5           0.5             0.5
  Skill prior loss coefficient (λ_SP)                      1             1               1
Downstream RL
  Model, actor learning rate                               0.001         0.001           0.001
  Skill dimension                                          10            10              10
  Skill horizon (H)                                        10            10              10
  Planning horizon (N)                                     10            3               1
  Batch size                                               128           256             256
  # Training mini-batches per update                       10            10              10
  State normalization                                      True          False           False
  Prior divergence coefficient (α)                         1             0.5             0.1
  Alpha learning rate                                      0.0003        0               0
  Target divergence                                        3             N/A             N/A
  # Warmup steps                                           50,000        5,000           5,000
  # Environment steps per update                           500           500             500
  Replay buffer size                                       1,000,000     1,000,000       1,000,000
  Target update frequency                                  2             2               2
  Target update tau (τ)                                    0.01          0.01            0.01
  Discount factor (γ)                                      0.99          0.99            0.99
  Reward loss coefficient (λ_R)                            0.5           0.5             0.5
  Value loss coefficient (λ_Q)                             0.1           0.1             0.1
CEM
  CEM iterations (N_CEM)                                   6             6               6
  # Sampled trajectories (N_sample)                        512           512             512
  # Policy trajectories (N_π)                              25            25              25
  # Elites (k)                                             64            64              64
  CEM momentum                                             0.1           0.1             0.1
  CEM temperature                                          0.5           0.5             0.5
  Maximum std                                              0.5           0.5             0.5
  Minimum std                                              0.01          0.01            0.01
  Std, CEM planning horizon decay step                     100,000       25,000          25,000
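For reference, the Maze column of Table G.2 could be expressed as a flat configuration dictionary of the following form; the key names are hypothetical and only a subset of rows is shown, so this is purely an illustrative sketch rather than the actual configuration file.

```python
# Hypothetical flat config for the Maze column of Table G.2 (subset of entries).
skimo_maze_config = {
    # model architecture
    "num_layers": 5,
    "activation": "elu",
    "hidden_dim": 128,
    "state_embed_dim": 128,
    # pre-training
    "pretrain_batch_size": 512,
    "joint_lr": 1e-3,
    "encoder_kl_beta": 1e-4,
    # downstream RL
    "skill_dim": 10,
    "skill_horizon": 10,
    "planning_horizon": 10,
    "batch_size": 128,
    "prior_divergence_alpha": 1.0,
    "num_warmup_steps": 50_000,
    "discount": 0.99,
    # CEM
    "cem_iterations": 6,
    "cem_num_samples": 512,
    "cem_num_elites": 64,
}
```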
Figure G.6: Illustration of SkiMo. (a) In pre-training, SkiMo leverages offline task-agnostic data to extract skill dynamics and a skill repertoire. Unlike prior works that keep the model and skill policy training separate, we propose to jointly train them to extract a skill space that is conducive to planning. (b) In downstream RL, we learn a high-level task policy in the skill space (skill-based RL) and leverage the skill dynamics model to generate imaginary rollouts for policy optimization and planning (model-based RL).
Abstract
Humans are remarkably efficient at learning new complex long-horizon tasks, such as cooking and furniture assembly, from only a few trials. To do so, we leverage our prior experience by building up a rich repertoire of skills and knowledge about the world and similar tasks. This enables us to efficiently reason, plan, and perform a wide variety of tasks. However, in most reinforcement learning and imitation learning approaches, every task is learned from scratch, requiring a large amount of interaction data and limiting its scalability to realistic, complex tasks. In this thesis, we propose novel benchmarks and skill-based learning approaches for scaling robot learning from simple short-horizon tasks to complex long-horizon tasks faced in our daily lives -- consisting of multiple subtasks and requiring high dexterity skills. To study the problem of scaling robot learning to complex long-horizon tasks, we first develop both simulated and real-world "furniture assembly" benchmarks, which require reasoning over long horizons and dexterous manipulation skills. We then present a series of skill chaining algorithms that solve such long-horizon tasks by executing a sequence of pre-defined skills, which need to be smoothly connected, like catching and shooting in basketball. Finally, we extend skill-based learning to efficiently leverage very diverse skills learned from large-scale task-agnostic agent experience allowing for scalable skill reuse for complex tasks. We hope our benchmarks and skill-based learning approaches can help robot learning researchers to transition to solving long-term tasks, easily compare the performance of their approaches with prior work, and eventually solve more complicated long-horizon tasks in the real world.