ACCELERATING ROBOT MANIPULATION USING DEMONSTRATIONS
by
Gautam Vijay Salhotra
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Gautam Vijay Salhotra
Acknowledgements
This thesis is the culmination of collaborations in robotics and AI, guided and supported by exceptional
individuals.
First and foremost, I would like to thank my PhD advisor, Prof. Gaurav Sukhatme. His unwavering support throughout the program has been instrumental in shaping me into the researcher I am today. I especially
appreciate his encouragement to let me explore different avenues of research, his belief in my ideas, and his
commitment to fostering a supportive research environment that allowed me to pursue independent thought.
I would also like to express a sincere thanks to my PhD thesis committee members, Prof. Somil Bansal and
Prof. Daniel Seita, for their invaluable guidance and advice throughout the process.
I would like to thank my collaborators, without whom these endeavours to produce quality research
would not have been possible. Specifically, I would like to thank I-Chun Arthur Liu, Peter Englert, Jun
Yamada, Karl Pertsch, Youngwoon Lee, Marcus Dominguez-Kuhne, and Max Pflueger. Beyond the contributions reflected in this thesis, I am thankful for the collaborations that have significantly shaped my
doctoral experience. These include Chris Denniston, Karkala Shashank Hegde, Sumeet Batra, Eric Heiden,
David Millard, Ryan Julian, Aravind Kumaraguru, my academic lab RESL, and team CoSTAR at NASA
JPL led by Ali Agha. They helped create a stimulating research environment, facilitate discussions, and
open many possibilities for collaboration. The insights and experiences gained through these have been
truly memorable.
I am also grateful for the opportunity to have worked with experienced and thoughtful collaborators during my internships at Bosch Research (Karsten Knese and Gerd Schmidt) and Amazon Robotics (Manikantan Nambi and Chaitanya Mitash), as well as my residency at Alphabet (Stefan Schaal, Ajinkya Jain, Keegan
Go, Quan Vuong, Sergey Levine, Yevgen Chebotar, and Alex Herzog). I learned a great deal from each of
these experiences and appreciate the support and guidance I received.
Finally, I am grateful to my partner and friends for their unwavering support and encouragement throughout my PhD journey. Most importantly, I am grateful to my parents, who have sacrificed everything to help
me get to where I am today.
Thank you.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chapter 2: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Motion Planning as Pseudo Demonstrations . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 LfD and Reinforcement Learning for Deformable Object Manipulation . . . . . . . . . . . . 6
2.2.1 Learning from Demonstrations (LfD) . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Reinforcement Learning (RL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Combining Reinforcement Learning and Expert Guidance . . . . . . . . . . . . . . 8
2.3 Learning from Demonstrations across Morphologies . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Imitation Learning and Reinforcement Learning with Demonstrations (RLfD) . . . . 8
2.3.2 Imitation from Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Trajectory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.4 Learned Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Visual Servoing for Robotic Assembly with Precision . . . . . . . . . . . . . . . . . . . . . 10
2.4.1 Classical Methods for Robotic Assembly . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.2 Learned Models for Robot Assembly . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3 Multi-task Visual Servoing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 3: Using Motion Planners as Pseudo Demonstrations in Learning . . . . . . . . . . . . 14
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Motion Planner Augmented Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Motion Planner Augmented Reinforcement Learning . . . . . . . . . . . . . . . . . 18
3.2.3 Action Space Rescaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.4 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.5 Motion Planner Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.3 Efficient RL with Motion Planner Augmented Action Spaces . . . . . . . . . . . . . 24
3.3.4 Safe Policy Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.5 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Chapter 4: Learning Deformable Object Manipulation from Demonstrations . . . . . . . . . . 28
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Formulation and Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Tasks and Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 Performance Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.3 Real Robot Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.4 Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Chapter 5: Learning from Demonstrations across Morphologies . . . . . . . . . . . . . . . . . 46
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Formulation and Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.2 Learned Spatio-temporal Dynamics Model . . . . . . . . . . . . . . . . . . . . . . 50
5.2.3 Indirect Trajectory Optimization with Learned Dynamics . . . . . . . . . . . . . . . 51
5.2.4 Learning from the Optimized Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.2 SOTA Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.2.1 Real-world results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.3 Generalizability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 6: LfD for Visual Servoing with Precision . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3.2 Demonstration Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3.4 Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 7: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A Additional Ablations for Motion Planner augmented RL . . . . . . . . . . . . . . . . . . . 87
A.1 Reuse of Motion Plan Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A.2 Further Study on Action Space Rescaling . . . . . . . . . . . . . . . . . . . . . . . 88
A.3 Performance in Uncluttered Environments . . . . . . . . . . . . . . . . . . . . . . . 88
A.4 Handling of Invalid Target Joint States for Motion Planning . . . . . . . . . . . . . 89
A.5 Ablation of Motion Planning Algorithms . . . . . . . . . . . . . . . . . . . . . . . 90
A.6 Ablation of Model-free RL Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 90
B Additional Experimental Details for Motion Planner augmented RL . . . . . . . . . . . . . 90
B.1 Environment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
B.1.1 2D Push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
B.1.2 Sawyer Push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
B.1.3 Sawyer Lift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
B.1.4 Sawyer Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
B.2 Training Details for Section 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
B.2.1 Wall-clock Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
C Ablation Studies for LfD across Morphologies . . . . . . . . . . . . . . . . . . . . . . . . . 95
C.1 Ablate the Method for Creating Optimized Dataset DStudent . . . . . . . . . . . . . 96
C.2 Ablate the Dynamics Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
C.3 Compare Performance of Optimized Dataset D^1p_Optim . . . . . . . . . . . . . . . . . 100
C.4 Ablate Modality of Demonstrations . . . . . . . . . . . . . . . . . . . . . . . . . . 101
C.5 Ablate Reference State Initialization in DMfD . . . . . . . . . . . . . . . . . . . . 102
C.6 Ablate the Effect of Cross-morphology on LfD Baselines . . . . . . . . . . . . . . . 103
C.7 Ablate the Effect of Environment Difficulty on LfD Baselines . . . . . . . . . . . . 104
D Additional Experimental Details for LfD across morphologies . . . . . . . . . . . . . . . . 105
D.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
D.2 Hyperparameter Choices for MAIL . . . . . . . . . . . . . . . . . . . . . . . . . . 106
D.3 Performance Metrics for Real-world Cloth Experiments . . . . . . . . . . . . . . . 106
D.4 Collected Dataset of Teacher Demonstrations . . . . . . . . . . . . . . . . . . . . . 108
D.5 Random Actions Dataset used for Training the Dynamics Model . . . . . . . . . . . 109
List of Tables
4.1 Performance metric with normalized performance pˆ(H) for state-based environments.
We obtain the models at the end of training for each method, for a total of 5 seeds. For
each method, we evaluate it over 100 episodes and obtain performance statistics on the
accumulated results. We show the mean and variance as well as the median, 25th&75th
percentiles of performance. Experts are oracles with state information. . . . . . . . . . . . . 39
4.2 Performance metric with normalized performance pˆ(H) for image-based environments.
We obtain the models at the end of training for each method, for a total of 5 seeds. For
each method, we evaluate it over 100 episodes and obtain performance statistics on the
accumulated results. We show the mean & variance as well as the median, 25th&75th
percentiles of performance. Experts are oracles with state information. We do not highlight
their performance because they are not image-based. . . . . . . . . . . . . . . . . . . . . . 39
6.1 Summary of the key differences between our approach and other robotic transformer methods 64
6.2 Ablation to vary length of sequence of previous observations sent to the transformer . . . . . 68
6.3 Ablation to vary number of camera images that we feed to our model . . . . . . . . . . . . . 69
B.1 Environment specific parameters for MoPA-SAC . . . . . . . . . . . . . . . . . . . . . . . 91
B.2 SAC hyperparameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
B.3 Comparison of the wall-clock training time in hours . . . . . . . . . . . . . . . . . . . . . . 95
C.4 Ablation results for MAIL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
C.5 Ablation of GAIfO on the effect of cross-morphology. We compare the normalized
performance, measured at the end of the task. . . . . . . . . . . . . . . . . . . . . . . . . . 104
C.6 Measuring performance on the easy cloth task, CLOTH FOLD DIAGONAL PINNED. We
compare the normalized performance, measured at the end of the task. . . . . . . . . . . . . 105
D.7 Hyper-parameters for training the forward dynamics model. . . . . . . . . . . . . . . . . . . 106
D.8 CEM hyper-parameters tested for tuning the trajectory optimization. We conducted ten
rollouts for each parameter set and used the set with the highest average normalized
performance on the teacher demonstrations. Population size is determined by the number of
environment interactions. The number of elites for each CEM iteration is 10% of population
size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
D.9 Hyper-parameters used in the LfD method (DMfD). . . . . . . . . . . . . . . . . . . . . . . 109
List of Figures
3.1 Our framework extends an RL policy with a motion planner. If the predicted action by the
policy is above a threshold ∆qstep, the motion planner is called; otherwise, it is directly
executed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Learning manipulation skills in an obstructed environment is challenging due to frequent
collisions and narrow passages amongst obstacles. For example, when a robot moves the
table leg to assemble, it is likely to collide with other legs and get stuck between legs.
Moreover, once the table leg is moved, the robot is tasked to insert the leg into the hole,
which requires contact-rich interactions. To learn both collision-avoidance and contact-rich
skills, our method (MoPA-RL) combines motion planning and model-free RL. In the
images, the green robot visualizes the target state provided by the policy. Initially, motion
planning can be used to navigate to target states a1, a2, and a3 while avoiding collision.
Once the arm passes over other legs, a sequence of primitive actions a4 − a10 are directly
executed to assemble the leg and tabletop. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Manipulation tasks in obstructed environments. (a) 2D Push: The 2D reacher agent has to
push the green box to the goal (black circle). (b) Sawyer Push: Sawyer arm should push
the red box toward the goal (green circle). (c) Sawyer Lift: Sawyer arm takes out the can
from the long box. (d) Sawyer Assembly: Sawyer arm moves and inserts the table leg into
the hole in the table top. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 To balance the chance of choosing direct action execution and motion planning during
exploration, we increase the action space for direct action execution A. . . . . . . . . . . . 20
3.5 Success rates of our MoPA-SAC (green) and baselines averaged over 4 seeds. All methods
are trained for the same number of environment steps. MoPA-SAC can solve all four tasks
by leveraging the motion planner and learn faster than the baselines. . . . . . . . . . . . . . 24
3.6 End-effector positions of SAC (left) and MoPA-SAC (right) after the first 100k training
environment steps in 2D Push are plotted in blue dots. The use of motion planning allows
the agent to explore the environment more widely early on in training. . . . . . . . . . . . . 25
3.7 (a) Averaged contact force in an episode over 7 executions in 2D Push. Leveraging a
motion planner, all variants of our method naturally learn collision-safe trajectories. (b)
Comparison of our model with different action range values ∆qMP on Sawyer Lift. (c)
Comparison of our model w/ and w/o action rescaling or w/o direct action execution on
Sawyer Lift. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Learning deformable manipulation. For our method DMfD, we describe a learned agent
that achieves state-of-the-art performance among methods that use expert demonstrations,
for solving difficult deformable manipulation tasks such as straightening 1D ropes and
folding 2D cloths based on scene images. We set a new benchmark on the Straighten
Rope (Figure 4.1a) task which requires the agent to straighten a rope with two end effectors,
shown as white spheres, and on the Cloth Fold (Figure 4.1b) task which requires the agent
to fold a flattened cloth into half, along an edge. Both tasks are from the SoftGym suite [79].
Additionally, we introduce and solve a new task constrained to a single end effector - the
Cloth Fold Diagonal task, which requires an agent to fold a square cloth along a diagonal.
In the pinned version (Figure 4.1c) of this task, the cloth is clamped to the table at a corner;
in the unpinned version (Figure 4.1d) it is not. Figure 4.1e shows the unpinned version
being executed on a real robot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Schematic of our method. The agent obtains observations from the environment (during
experience collection) or the replay buffer B (during training). Pre-populated expert
demonstrations in the replay buffer are shown in Green. The training pipeline works with
state-based or image-based observations. With state-based observations, the actor and critic
get an encoding of the system state (oQ = oπ = os), shown as Black and Blue arrows. With
image-based observations, the actor gets an encoding of the image whereas the critic gets
encodings of both the state and the image (oπ = oimage, oQ = os ∪ oimage), denoted by
Black and Red arrows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Performance comparisons. Learning curves of the normalized performance pˆ(H)
for all environments during training. The first column (4.3a & 4.3d) shows SoftGym
state-based environments. The second column (4.3b & 4.3e) shows SoftGym image-based
environments, and the third column (4.3c & 4.3f) shows our new Cloth Fold Diagonal
environments. All environments were trained until convergence. State-based DMfD is
in light blue, and the image-based agent is in dark blue. The expert performance is the
solid black line. We compare against the baselines described in Section 4.3.2. Behavioural
Cloning does not train online, its results are shown in Table 4.1 and Table 4.2. We plot
the mean µ of the curves as a solid line, and shade one standard deviation (µ ± σ).
DMfD consistently beats the baselines, with comparable or better variance. For a detailed
discussion see Section 4.3.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Pipeline for real robot experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Ablation studies. Ablations were performed on the Straighten Rope environment, to
verify the necessity for each feature used. State-based DMfD is shown in light blue, and
image-based DMfD is in dark blue. Entropy regularization (Figure 4.5a) and one ablation
for reference state initialization (Figure 4.5b) were run on the state-based environment.
The other ablations (Figure 4.5c, Figure 4.5d, Figure 4.5e, and Figure 4.5f) require an
image-based environment. We plot the mean µ of the curves as a solid line, and shade one
standard deviation (µ ± σ). Detailed discussion of these features is in Section 4.3.5. In
Figure 4.5e, ‘100 episodes∗’ refers to 100 episodes of data copied 80 times to mimic the
buffer of 8000 episodes without actually creating as many expert demonstrations. . . . . . . 40
5.1 MAIL generalizes LfD to large morphological mismatches between teacher and student in
difficult manipulation tasks. We show an example task: hang a cloth to dry on a plank (DRY
CLOTH). The demonstrations are bimanual, yet the robot learns to execute the task with a
single arm and gripper. The learned policy transfers to the real world and is robust to object
variations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 An example cloth folding task with demonstrations from a teacher with n = 2 end-effectors,
deployed on a Franka Panda with m = 1 end-effector (parallel-jaw gripper). We train a
network to predict the forward dynamics of the object being manipulated in simulation,
using a random action dataset DRandom. For every state transition, we match the predicted
particle displacements from our model, ∆Ppred, to that of the simulator, ∆Psim. Given this
learned dynamics model and the teacher demonstrations, we use indirect trajectory optimization
to find student actions that solve the task. The optimization objective is to match
the object states in the demonstration. Finally, we pass the optimized dataset DStudent to
a downstream LfD method to get a final policy π that generalizes to task variations and
extends task learning to image space, enabling real-world deployment. . . . . . . . . . . . 49
5.3 SOTA performance comparisons. For each training run, we used the best model in each
seed’s training run, and evaluated using 100 rollouts across 5 seeds, different from the
training seed. Bar height denotes the mean, error bars indicate the standard deviation.
MAIL outperforms all baselines, in some cases by as much as 24%. . . . . . . . . . . . . . 54
5.4 Sample trajectories of the THREE BOXES task. A three-picker teacher trajectory to reach
the goal state (Figure 5.4a). Final policies of the two-picker and one-picker agent, and
real-world execution of the one-picker agent. . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Real-world results for CLOTH FOLD and DRY CLOTH. . . . . . . . . . . . . . . . . . . . . 59
6.1 Sample camera images used during training and inference. . . . . . . . . . . . . . . . . . . 65
6.2 Workspace image, which includes the workspace camera, wrist camera, space mouse, and
the NIST board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.1 Learning curves of ablated models on Sawyer Assembly. (a) Comparison of our MoPA-SAC
with different number of samples reused from motion plan trajectories. (b) Comparison of
our MoPA-SAC with different action space rescaling parameter ω. . . . . . . . . . . . . . . 88
A.2 Success rate on (a) Sawyer Lift w/o box and (b) Sawyer Assembly w/o legs. . . . . . . . . . 89
A.3 Ablation of invalid target handling on (a) Sawyer Lift and (b) Sawyer Assembly. . . . . . . . 90
A.4 Learning curves of ablated models on Sawyer Assembly. (a) Comparison of our model
with different motion planner algorithms. (b) Comparison of our model with different RL
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.5 Environments used in our experiments, with one end-effector. The end-effectors are
pickers (white spheres). In CLOTH FOLD (left) the robot has to fold the cloth (orange and
pink) along an edge (inspired by the SoftGym [79] two-picker cloth fold task). In DRY
CLOTH (middle) the robot has to hang the cloth (orange and pink) on the drying rack
(brown plank). In THREE BOXES (right), the robot has to rearrange three rigid boxes along
a line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
C.6 Ablations to MAIL components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
C.7 Predictions of the learned spatio-temporal dynamics model Tψ and the FleX simulator.
Predictions are made for the same state and action, shown for both cloth tasks. The learned
model supports optimization approximately 50x faster than the simulator, albeit at the cost
of accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
C.8 Performance comparison between DMfD trained on DStudent obtained using different
learned dynamics models: 1D CNN-LSTM and Perceiver IO. For each training run, we
used the best model in each seed’s training run, and evaluated using 100 rollouts across 5
seeds, different from the training seed. Bar height denotes the mean, error bars indicate the
standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
C.9 Ablation on the modality of demonstrations on LfD performance. Similar performance
shows that MAIL can learn from a wide variety of demonstrations, or even a mixture of
them, without loss in performance. See Section C.4. . . . . . . . . . . . . . . . . . . . . . 103
C.10 Ablation on the effect of reference state initialization (RSI) and imitation reward (IR)
on LfD performance. RSI is not helpful here because our tasks are not as dynamic or long
horizon as DeepMimic [99]. See Section C.5. . . . . . . . . . . . . . . . . . . . . . . . . . 103
D.11 Performance function for CLOTH FOLD on the real robot. At time t, we measure the
fraction of pixels visible to the maximum number of pixels visible ftop = pixtop,t/pixmax
and fbot = pixbot,t/pixmax. Performance for the top of the cloth should be 1 when it is
not visible, p(top) = 1 − ftop. Performance for the bottom of the cloth should be 1 when
it is exactly half-folded on top of the top side, p(bot) = min [2 (1 − fbot), 2fbot]. Final
performance is the average of both metrics, p(st) = [p(top) + p(bot)]/2. Note that the
cloth is flattened at the start, thus pixmax = pixtop,0. . . . . . . . . . . . . . . . . . . . . . 107
Abstract
Robot manipulation of complex objects, such as cloth, is challenging due to difficulties in perceiving and exploring the environment. Pure reinforcement learning (RL) is difficult in this setting, as it requires extensive
exploration of the state space, which can be inefficient and dangerous. Demonstrations from humans can
alleviate the need for exploration, but collecting good demonstrations can be time-consuming and expensive.
Therefore, a good balance between perception, exploration, and imitation is needed to solve manipulation
of complex objects.
This dissertation focuses on dexterous manipulation of complex objects, using images and without assuming full state information during inference. It also aims to achieve efficient learning by reducing interactions with the environment during exploration and reducing the overhead of collecting demonstrations.
To achieve these goals, we present (i) a learning algorithm that uses a motion planner in the loop to enable efficient long-horizon exploration; (ii) a framework for visual manipulation of complex deformable
objects using demonstrations from a set of agents with different embodiments; and (iii) an LfD algorithm for
dexterous tasks with rigid objects, such as high-precision peg insertion, using images and a multi-task
attention-based architecture.
These contributions enable robots to manipulate complex objects efficiently and with high precision,
using images alone. This opens up new possibilities for robots to be used in a wider range of applications,
such as manufacturing, logistics, and healthcare.
Chapter 1
Introduction
The integration of intelligent robot manipulation is crucial for realizing the next generation of automation solutions. However, current applications are largely confined to robots operating within enclosed environments
and performing predefined movements in industrial settings. This limited scope restricts the deployment of
automation solutions to their full potential and hinders human productivity by preventing seamless collaboration between humans and robots. Extensive research is needed to address these challenges and further
advance industry automation, household robotic services, and collaborative human-robot tasks. This dissertation addresses low-level robot manipulation, tackling challenges such as grasping household objects that
are hard to reason about (such as cloth), and executing precision automated industrial assembly tasks. These
advancements can be seamlessly integrated with high-level methods, such as task planners and foundation
models, to enhance the effectiveness of complex multi-step tasks. Moreover, our work can capitalize on
the latest robotic hardware advancements, incorporating additional sensory inputs like touch and enabling
dynamic motions to effectively solve manipulation tasks.
Researchers have employed a diverse range of methodologies to address low-level manipulation challenges. Traditional methods, including motion planning, trajectory optimization, and model predictive control (MPC), offer high efficiency but may struggle with high-dimensional inputs, such as images. Additionally, they often rely on accurate system models. Learning-based methods, such as supervised learning
and reinforcement learning (RL), can handle high-dimensional inputs and typically do not require system
models. However, they are difficult to tune, may not be robust to out-of-distribution inputs, and are sample-inefficient to train due to their vast number of parameters. Despite these challenges, learning-based methods
have demonstrated remarkable success in various domains, including vision and natural language processing, often excelling despite perverse and baffling training schemes that incorporate large datasets.
1.1 Contributions
This dissertation aims to enhance learning-based methods by incorporating traditional methods in the form
of guiding demonstrations, known as Learning from Demonstrations (LfD). LfD involves feeding demonstrations, derived from methods like motion planners, trajectory optimizers, or hand-engineered motions,
into the training schema of learning-based methods. We make the following contributions:
1. Combining Motion Planning with RL: We introduce an algorithm that places a motion planner inside a reinforcement learning policy, which we call Motion Planner Augmented Reinforcement Learning
(MoPA-RL). This combination enables the policy to effectively explore long horizons during training.
MoPA-RL represents a departure from traditional LfD and demonstrates the potential of combining
traditional methods with learning paradigms. For more details, refer to Chapter 3 and the corresponding paper published in the Conference on Robot Learning (CoRL) 2020.
2. Providing demonstrations for challenging cloth manipulation: We present an LfD method to solve
deformable manipulation tasks using states or images as inputs, given expert demonstrations. We call
it Deformable Manipulation from Demonstrations (DMfD). DMfD effectively balances the trade-off
between online exploration and guidance from expert demonstrations, enabling efficient exploration
of high-dimensional spaces. We evaluate DMfD on a set of representative manipulation tasks for a
1-dimensional rope and a 2-dimensional cloth from the SoftGym suite of tasks. Additionally, we
create two challenging environments for folding a 2D cloth using image-based observations. DMfD
demonstrates robust performance when deployed on a real robot, with minimal loss in performance
compared to simulation. For more details, refer to Chapter 4 and the corresponding paper published
in the IEEE International Conference on Intelligent Robots and Systems (IROS) 2022.
3. Framework for cross-morphological LfD: We propose a framework that extends LfD methods like
DMfD to utilize demonstrations across different robots, called Morphological Adaptation in Imitation
Learning (MAIL). MAIL bridges the robot morphology gap, allowing an agent to learn from demonstrations by other agents with significantly different morphologies. MAIL can learn from suboptimal
demonstrations, provided they offer some guidance towards the desired solution. We demonstrate
MAIL’s effectiveness on manipulation tasks involving rigid and deformable objects, including 3D
cloth manipulation interacting with rigid obstacles. We train a visual control policy for a robot with
one end-effector using demonstrations from a simulated agent with two end-effectors. The trained
policy is deployed to a real Franka Panda robot and successfully handles multiple variations in object
properties (size, rotation, translation) and cloth-specific properties (color, thickness, size, material).
For more details, refer to Chapter 5 and the corresponding paper published in the Conference on Robot
Learning (CoRL) 2023.
4. Multi-task visual control for precision assembly: We introduce a method to create a multi-task
visual servoing policy for precise last-inch peg-in-hole insertion. Our method builds upon the RT-1
architecture, incorporating modifications supported by ablation studies and justifications. We demonstrate the effectiveness of our approach on various real-world last-inch tasks and exhibit its ability to
generalize to unseen peg-hole combinations. Our method lays the foundation for developing a generalized manipulation model capable of handling both coarse and fine motion, including position and
orientation control. For more details, refer to Chapter 6.
Through these contributions, we aim to advance the frontier of robot manipulation research and equip
robots with the tools necessary to achieve general manipulation skills comparable to those of humans. By
enabling robots to manipulate objects with dexterity and precision, we pave the way for the widespread
adoption of robots in various domains, revolutionizing automation and human-robot collaboration.
Chapter 2
Related Work
2.1 Motion Planning as Pseudo Demonstrations
Controlling robots to solve complex manipulation tasks using deep RL [28, 73, 74, 29] has been an active
research area. Specifically, model-free RL [28, 73, 29] has been well studied for robotic manipulation tasks,
such as picking and placing objects [168], in-hand dexterous manipulation [4, 108], and peg insertion [13].
To tackle long-horizon tasks, hierarchical RL (HRL) approaches have extended RL algorithms with temporal abstractions, such as options [133, 7], modular networks [3, 70, 72], and goal-conditioned low-level
policies [91]. However, these approaches are typically tested in controlled, uncluttered environments and
often require a large number of samples.
On the other hand, motion planning [2, 51, 96, 64, 63, 49], a cornerstone of robot motion generation, can
generate a collision-free path in cluttered environments using explicit models of the robot and environment.
Probabilistic roadmaps (PRMs) [2, 51, 96] and rapidly-exploring random trees (RRTs) [64, 63, 49] are two
common sampling-based motion planning techniques. These planning approaches can effectively compute
a collision-free path in static environments; however, these methods have difficulties handling dynamic
environments and contact-rich interactions, which frequently occur in object manipulation tasks.
There are several works that combine motion planning and reinforcement learning to get the advantages
of both approaches. A typical approach is to decompose the problem into the parts that can and cannot be
solved with MP. Then, RL is used to learn the part that the planner cannot handle [20, 147, 124, 14, 157].
However, this separation often relies on task-specific heuristics that are only valid for a limited task range.
MP can be incorporated with RL in the form of a modular framework with a task-specific module switching
rule [67] and HRL with pre-specified goals for the motion planner [5]. [157] uses a motion planner as a low-level controller and learns an RL policy in a high-level action (subgoal) space, which limits the capability of
learning contact-rich skills. Instead, we propose to learn how to balance the use of the motion planner and
primitive action (direct action execution) using model-free RL with minimal task-specific knowledge.
2.2 LfD and Reinforcement Learning for Deformable Object Manipulation
Deformable object manipulation has been a challenge in robotics with many real-world applications, such as
folding clothes [155], cooking food [54], or assisting humans [47]. Its high-dimensional state representation
and complex dynamics make manipulation tasks significantly more difficult than rigid body manipulation.
Traditionally, analytical methods have been employed to solve deformable object manipulation tasks. Methods such as the Finite Element Method [153] are used to model object dynamics. Control methods such as
trajectory optimization [169] and model predictive control [1] use these models to specify control inputs
and manipulate the object. Although these have proven to be successful under certain conditions, it is difficult to generalize them to perturbations or variations in the environment. Recently, data-driven methods
have gained popularity in solving manipulation tasks [106], such as Imitation Learning (IL) [60, 140, 35],
Reinforcement Learning (RL) [90, 58, 158], and combining IL with RL [36, 81]. However, most of the
successes have been in rigid body manipulation. The low observability and controllability of deformable
objects, coupled with the typically high dimensionality of the parameter space in learning methods, make it
challenging for learning alone to solve these tasks. Here, we focus on deformable object manipulation with
a novel expert-guided RL method.
2.2.1 Learning from Demonstrations (LfD)
Two common methods of learning from demonstrations include IL and Offline RL. IL is a powerful machine
learning technique used to imitate expert demonstrations. IL has been applied to soft body manipulation:
e.g., DART [119] has been used for bed making, where human demonstrations are used on a robot, and
the Transporter Network [118] with goal conditioning has been used for manipulating beads, cloths, and
bags. Dynamic Movement Primitives have been used [47] to learn cloth manipulation from demonstrations.
The common issue with these methods is that they tend to fail when encountering a new state due to the
accumulating errors from covariate shift [112]. Moreover, these methods’ performance is often bounded
by the quality of expert demonstrations. Similar to IL, Offline RL [59, 111] generally learns from past
demonstrations without online environment interactions. Offline RL has two properties: all transitions are
stored in an offline dataset, and network updates occur on the entire batch of transitions. In particular,
Offline RL can handle large, diverse datasets which produce more generalizable policies [111]. But they
often achieve sub-optimal performance when used in online fine-tuning, discussed in [92].
2.2.2 Reinforcement Learning (RL)
Reinforcement learning enables an agent to learn in an interactive environment via trial and error. RL has
been applied to manipulation problems [47, 110]; additionally [158] empowers RL agents with motion
planning techniques to manipulate cubes and assemble furniture and [81] extends it to the visual domain.
However, most of these apply to rigid body manipulation. A limited number of RL methods have been used
for deformable manipulation [79, 89, 155], some of them, e.g., CURL [62] and DrQ [161], using vision. [95] provides
a thorough overview of reinforcement learning techniques used for robotic manipulation tasks.
2.2.3 Combining Reinforcement Learning and Expert Guidance
IL techniques are trained to perform a task from demonstrations by learning the mapping between observations and actions. Hence, when demonstrations can be easily given for a problem, IL is a preferred method.
RL, in contrast, is suitable when a reward function can be easily specified and the environment can be easily
explored. However, it is time-consuming to naively explore the state space without expert demonstrations.
Thus, there have been several studies [36, 93, 167] focusing on how to combine IL and RL effectively,
gaining the advantages of both IL, where the agent explores by learning from expert demonstrations, and RL,
where the agent learns to improve the policy further. DeepMimic [99] uses reference state initialization (RSI) to address exploration cost
by initializing from past high-value states, since some high reward states may be difficult to reach but valuable for exploration. Advantage Weighted Actor Critic (AWAC) [92] is another method that utilizes expert
demonstrations. It proposes an implicit policy constraint to efficiently train an off-policy RL algorithm to
learn from offline data followed by online fine-tuning. Although these methods are not specifically designed
for deformable object manipulation, they have shown significant performance improvements in other areas.
In this thesis, we demonstrate that RL combined with appropriate use of expert data can greatly improve
deformable object manipulation, among other tasks.
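For concreteness, the implicit policy constraint used by AWAC [92] can be sketched as weighting the log-likelihood of buffered actions by their exponentiated advantage (the notation here is ours and simplified from the original paper; λ is a temperature hyperparameter):

\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{(s,a) \sim \mathcal{B}} \Big[ \log \pi_{\theta}(a \mid s) \, \exp\!\Big( \tfrac{1}{\lambda} A^{\pi_k}(s, a) \Big) \Big]

where B is a replay buffer containing both expert demonstrations and online experience, and A^{πk} is the advantage under the current critic.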
2.3 Learning from Demonstrations across Morphologies
2.3.1 Imitation Learning and Reinforcement Learning with Demonstrations (RLfD)
Imitation learning methods [98, 22, 18, 61, 35] and methods that combine reinforcement learning and
demonstrations [109, 145, 101, 81] have shown excellent results in learning a mapping between observations
and actions from demonstrations. However, their objective function requires access to the demonstrator’s
ground truth actions for optimization. This is infeasible for cross-morphology transfer due to action space
mismatch. To work around this, prior works have proposed systems for teachers to provide demonstrations
in the students’ morphology [88] which limits the ability of teachers to efficiently provide data. Similar
to imitation learning, offline RL [75, 59, 24] learns from demonstrations stored in a dataset without online
environment interactions. While offline RL can work with large datasets of diverse rollouts to produce generalizable policies [57, 111], it requires the availability of rollouts that have the same action space as the
learning agent. MAIL learns across morphologies and is not affected by this limitation.
2.3.2 Imitation from Observation
Imitation from observation (IfO) methods [107, 98, 142, 139, 131, 141, 125] learn from the states of the
demonstration; they do not use state-action pairs. In [159], an approach is proposed to learn repetitive
actions using Dynamic Movement Primitives [40] and Bayesian optimization to maximize the similarity
between human demonstrations and robot actions. Many IfO methods [107, 139, 131, 31] assume that the
student can take a single action to transition from the demonstration’s current state to the next state. Some
methods [107, 139] use this to train an inverse dynamics model to infer actions. Others extract keypoints
from the observations and compute actions by subtracting consecutive keypoint vectors. XIRL [162] uses
temporal cycle consistency between demonstrations to learn task progress as a reward function, which is
then fed to RL methods. However, when the student has a different action space than the teacher, it may
require more than one action for the student to reach consecutive demonstration states. For example, in an
object rearrangement task, a two-picker teacher agent can move two objects with one pick-place action. But
a one-picker student will need two or more actions to achieve the same result. Zero-shot visual imitation [98]
assumes that the statistics of visual observations and the agent's observations will be similar. However, when
solving a task with different numbers of arms, some intermediate states will not be seen in teacher demonstrations. State-of-the-art learning from observation methods [141, 71] have made significant advancements
in exploiting information between states. However, their tasks have much longer horizons, hence more states
and learning signals than ours. Whether these methods work well on short-horizon, difficult manipulation
tasks is uncertain. To address this and provide a meaningful comparison, we conducted experiments to
compare our proposed method with these (Chapter 5).
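Returning to the inverse-dynamics variant mentioned above, a minimal sketch (notation ours) fits a model gθ on the student's own transitions and then reads actions off consecutive demonstration observations:

\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{D}_{\text{student}}} \big\| a_t - g_{\theta}(o_t, o_{t+1}) \big\|^2, \qquad \hat{a}_t = g_{\theta^{*}}\big(o_t^{\text{demo}}, o_{t+1}^{\text{demo}}\big)

As noted above, this breaks down when the student requires more than one action to bridge consecutive demonstration states.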
2.3.3 Trajectory Optimization
Trajectory optimization algorithms [10, 9, 52] optimize a trajectory by minimizing a cost function, subject
to a set of constraints. Trajectory optimization has been used for manipulation of rigid and deformable objects [45], even through
contact [105] using complementarity constraints [86]. Indirect trajectory optimization only optimizes the
actions of a trajectory and uses a simulator for the dynamics instead of adding dynamics constraints at every
step.
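As a sketch of this indirect formulation (notation ours), only the action sequence is a decision variable, and the simulator supplies the dynamics during each rollout:

\min_{a_{1:T}} \; \sum_{t=1}^{T} c(s_t, a_t) \quad \text{where} \quad s_{t+1} = f_{\text{sim}}(s_t, a_t), \;\; s_1 \ \text{given}

so feasibility with respect to the dynamics is enforced implicitly by rolling out the simulator rather than by adding per-step dynamics constraints to the optimization problem.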
2.3.4 Learned Dynamics
Learning dynamics models is useful when there is no simulator, or if the simulator is too slow or too inaccurate. Learned models have been used with Model-Predictive Control (MPC) to speed up prediction
times [151, 146, 137]. A common use case is model-based RL [103], where learning the dynamics is part of
the algorithm and has been shown to learn dynamics from states and pixels [32] and applied to real-world
tasks [154].
2.4 Visual Servoing for Robotic Assembly with Precision
2.4.1 Classical Methods for Robotic Assembly
Automated assembly has been an open challenge in robotics for decades, with a wide range of applications in
manufacturing and other industries. The NIST Assembly Task Board challenge [53] serves as a benchmark
for the latest advances in assembly tasks.
Classical methods for robot assembly typically rely on geometric models of the parts and the environment, as well as feedback from encoders and force-torque sensors. These methods have been used to develop
analytical solutions to assembly problems, such as peg-in-hole insertion. Remote center-of-compliance
(RCC) [17] reduces reliance on visual or force-based perception by using a spring-loaded mechanism to
keep the robot end-effector aligned with the assembly goal. Compliant motion planning [82] uses planning
algorithms to generate trajectories that minimize the risk of damaging the parts or the robot in the event
of unexpected contact. [38] uses vision sensors to compensate for initial misalignment of the parts before
contact, and [135] uses force-torque sensors to monitor the contact forces and adjust the robot’s motion
accordingly.
While classical methods have been successful in many assembly applications, they can be highly sensitive to errors in modeling or perception, environmental disturbances, and extension to unseen parts. They
rely heavily on accurate models of the parts and the environment. Any errors in these models can lead to
assembly failures. They can be sensitive to environmental disturbances, such as perturbations of objects or
fixtures. They can be difficult to extend to new assembly tasks, as they require careful design and tuning.
These limitations have motivated the development of new approaches to robot assembly, such as learning-based methods.
2.4.2 Learned Models for Robot Assembly
Learning-based methods have gained popularity in the robotics community as a way to address the limitations of classical assembly methods. Reinforcement learning (RL) is a particularly popular approach, with
many works focusing on model-based algorithms such as guided policy search (GPS) [136] due to their sample efficiency. However, discontinuous contact dynamics have made it challenging to extend these methods
to contact-rich manipulation tasks [129].
More recent works have used model-free, off-policy RL, including Q-learning [41], deep-Q networks
(DQN) [165], soft actor-critic (SAC) [8], deep deterministic policy gradients (DDPG) [6, 84, 83], and hierarchical RL [37]. On-policy RL has also been explored, with methods such as trust region policy optimization
(TRPO) [68], proximal policy optimization (PPO) [134, 33, 94, 126, 148], and asynchronous advantage
actor-critic (A3C) [121]. However, all of these methods are significantly less sample-efficient than classical methods.
This limitation is being addressed by RL works that use demonstrations, either from a human or a
pre-programmed agent. Some examples of these methods include interleaving SAC with Riemannian Motion Policies (RMP) [67], DDPG from demonstration (DDPGfD) [83, 85], Advantage-weighted actor-critic
(AWAC) [166], and Inverse RL [156].
Another approach to reducing the sample complexity of RL for assembly is to use inductive bias on the
tasks via a residual policy. Residual policy methods [46, 67] use a fixed base policy for manipulation, and
only employ RL to learn a correction over this base policy. Learning from demonstration (LfD) may also be
used to derive a baseline policy [16, 144].
In addition to RL, there are a number of non-RL learning approaches that have been proposed for
robot assembly. These include learning from video demonstrations [149], demonstration-initialized self-supervised learning [23, 127, 128], and using human intervention during robot exploration [128].
Our method is a pure supervised learning approach, learning from pre-programmed demonstrations to
perform the canonical peg-in-hole insertion task, using visual servoing.
Sim-to-Real for Assembly
In an effort to alleviate the sample complexity issues of learning on the real robot, there have been a
number of impressive efforts in sim-to-real (sim2real) for assembly [16, 116, 33]. These works have used
on-policy [134, 55] and off-policy RL methods [85, 6] to solve NIST-style tasks [8, 16, 165]. Many of these
approaches use force/torque (F/T) sensors to set thresholds. However, most of them require fine-tuning in
the real world [8] and use state information [134].
Overall, learned methods have made significant progress in robot assembly in recent years. However,
there are still a number of challenges to be addressed, such as sample complexity, sim2real transfer, and
robustness to environmental disturbances.
2.4.3 Multi-task Visual Servoing
Transformer-based robotic policies have recently gained popularity due to their ability to learn complex
tasks from language descriptions [15, 123]. RT-1, in particular, models the task as a sequence modeling
problem and uses a transformer to learn a mapping from language and vision observations to robot actions.
Existing Transformer-based robotic manipulation methods for real-world applications have focused on
efficiently learning tasks from a set of demonstrations per task [123]. However, recent works such as Behavior Transformer [120], RT-X [15], and Robocat [11] have advocated for training a single model on
large-scale robotic datasets.
While the use of high-capacity Transformer models to learn robotic control policies is a relatively new
development, multi-task and language-conditioned learning have been explored in robotics for many years.
Our method explores this paradigm for dexterous tasks.
Chapter 3
Using Motion Planners as Pseudo Demonstrations in Learning
Deep reinforcement learning (RL) agents are able to learn contact-rich manipulation tasks by maximizing a
reward signal, but require large amounts of experience, especially in environments with many obstacles that
complicate exploration. In contrast, motion planners use explicit models of the agent and environment to
plan collision-free paths to faraway goals, but suffer from inaccurate models in tasks that require contacts
with the environment. To combine the benefits of both approaches, we propose motion planner augmented
RL (MoPA-RL) which augments the action space of an RL agent with the long-horizon planning capabilities
of motion planners. Based on the magnitude of the action, our approach smoothly transitions between
directly executing the action and invoking a motion planner. We evaluate our approach on various simulated
manipulation tasks and compare it to alternative action spaces in terms of learning efficiency and safety.
The experiments demonstrate that MoPA-RL increases learning efficiency, leads to faster exploration,
and results in safer policies that avoid collisions with the environment. Videos and code are available at
https://clvrai.com/mopa-rl.
3.1 Introduction
This chapter is based on Jun Yamada et al. “Motion Planner Augmented Reinforcement Learning for Robot Manipulation in
Obstructed Environments”. In: Conference on Robot Learning. 2020.
Figure 3.1: Our framework extends an RL policy with a motion planner. If the magnitude of the action
predicted by the policy exceeds a threshold ∆qstep, the motion planner is invoked; otherwise, the action is executed directly.
In recent years, deep reinforcement learning (RL) has shown
promising results in continuous control problems [77, 117, 29,
70, 72]. Driven by rewards, robotic agents can learn tasks such
as grasping [76, 48] and peg insertion [21]. However, prior
works mostly operated in controlled and uncluttered environments, whereas in real-world environments, it is common to have
many objects unrelated to the task, which makes exploration challenging. This problem is exacerbated in situations where feedback is scarce and considerable exploration is required before a
learning signal is received.
Motion planning (MP) is an alternative for performing robot tasks in complex environments, and has
been widely studied in the robotics literature [2, 65, 50, 19]. MP methods, such as RRT [65] and PRM [2],
can find a collision-free path between two robot states in an obstructed environment using explicit models
of the robot and the environment. However, MP struggles on tasks that involve rich interactions with objects
or other agents since it is challenging to obtain accurate contact models. Furthermore, MP methods cannot
generate plans for complex manipulation tasks (e.g., object pushing) that cannot be simply specified by a
single goal state.
In this chapter, we propose motion planner augmented RL (MoPA-RL) which combines the strengths of
both MP and RL by augmenting the action space of an RL agent with the capabilities of a motion planner.
Concretely, our approach trains a model-free RL agent that controls a robot by predicting state changes in
joint space, where an action with a large joint displacement is realized using a motion planner while a small
action is directly executed. By predicting a small action, the agent can perform sophisticated and contact-rich manipulation. On the other hand, a large action allows the agent to efficiently explore an obstructed
environment with collision-free paths computed by MP.
Figure 3.2: Learning manipulation skills in an obstructed environment is challenging due to frequent collisions and narrow passages
amongst obstacles. For example, when a robot moves the table leg to assemble, it is likely to collide with other legs and get stuck
between legs. Moreover, once the table leg is moved, the robot is tasked to insert the leg into the hole, which requires contact-rich
interactions. To learn both collision-avoidance and contact-rich skills, our method (MoPA-RL) combines motion planning and
model-free RL. In the images, the green robot visualizes the target state provided by the policy. Initially, motion planning can be
used to navigate to target states a1, a2, and a3 while avoiding collision. Once the arm passes over other legs, a sequence of primitive
actions a4 − a10 are directly executed to assemble the leg and tabletop.
Our approach has three benefits: (1) MoPA-RL can add motion planning capabilities to any RL agent
with joint space control as it does not require changes to the agent’s architecture or training algorithm; (2)
MoPA-RL allows an agent to freely switch between MP and direct action execution by controlling the scale
of action; and (3) the agent naturally learns trajectories that avoid collisions by leveraging motion planning,
allowing for safe execution even in obstructed environments.
The main contribution of this chapter is a framework augmenting an RL agent with a motion planner,
which enables effective and safe exploration in obstructed environments. In addition, we propose three
challenging robotic manipulation tasks with the additional challenges of collision-avoidance and exploration
in obstructed environments. We show that the proposed MoPA-RL learns to solve manipulation tasks in these
obstructed environments while model-free RL agents suffer from local optima and difficult exploration.
3.2 Motion Planner Augmented Reinforcement Learning
In this chapter, we address the problem of solving manipulation tasks in the presence of obstacles. Exploration by deep reinforcement learning (RL) approaches for robotic control mostly relies on small perturbations in the action space. However, RL agents struggle to find a path to the goal in obstructed environments
(a) 2D Push (b) Sawyer Push (c) Sawyer Lift (d) Sawyer Assembly
Figure 3.3: Manipulation tasks in obstructed environments. (a) 2D Push: The 2D reacher agent has to push the green box to the
goal (black circle). (b) Sawyer Push: Sawyer arm should push the red box toward the goal (green circle). (c) Sawyer Lift: Sawyer
arm takes out the can from the long box. (d) Sawyer Assembly: Sawyer arm moves and inserts the table leg into the hole in the table
top.
due to collisions and narrow passages. Therefore, we propose to harness motion planning (MP) techniques
for RL agents by augmenting the action space with a motion planner. In Section 3.2.2, we describe our
framework, MoPA-RL, in detail. Afterwards, we elaborate on the motion planner implementation and RL
agent training.
3.2.1 Preliminaries
We formulate the problem as a Markov decision process (MDP) defined by a tuple (S, A, P, R, ρ0, γ) consisting of states s ∈ S, actions a ∈ A, transition function P(s′ | s, a), reward R(s, a), initial state distribution ρ0, and discount factor γ ∈ [0, 1]. The agent's action distribution at time step t is represented by a policy πϕ(at | st) with state st ∈ S and action at ∈ A, where ϕ denotes the parameters of the policy. Once the agent executes the action at, it receives a reward rt = R(st, at). The performance of the agent is evaluated using the discounted sum of rewards \sum_{t=0}^{T−1} γ^t R(st, at), where T denotes the episode horizon.
In continuous control, the action space can be defined as the joint displacement at = ∆qt, where qt represents the robot joint angles. To prevent collision and reduce control errors, the action space is constrained to be small, A = [−∆qstep, ∆qstep]^d, where ∆qstep represents the maximum joint displacement for a direct action execution [28] and d denotes the dimensionality of the action space.
Algorithm 1 Motion Planner Augmented RL (MoPA-RL)
Require: Motion planner MP, augmented MDP M̃(S, Ã, P̃, R̃, ρ0, γ), action limits ∆qstep and ∆qMP, number of reused trajectories M
 1: Initialize policy πϕ and replay buffer D
 2: for i = 1, 2, . . . do
 3:     Initialize episode s0 ∼ ρ0, t̃ ← 0, t ← 0
 4:     while episode not terminated do
 5:         ãt̃ ∼ πϕ(ãt̃ | st)
 6:         if ||ãt̃||∞ > ∆qstep then
 7:             Ht̃, τ0:Ht̃ ← MP(qt, qt + ãt̃)                         ▷ Motion planner execution
 8:             st+Ht̃, r̃t̃ ← P̃(st, ∆τ0:Ht̃), R̃(st, ∆τ0:Ht̃)
 9:             for j = 1, . . . , M do
10:                 Sample intermediate transitions τn:m from τ0:Ht̃
11:                 ã, r̃ ← ∆τn:m, R̃(st+n, ∆τn:m)
12:                 D ← D ∪ {(st+n, ã, r̃, st+m, m − n)}              ▷ Reuse motion plan trajectories
13:             end for
14:         else
15:             Ht̃ ← 1
16:             st+Ht̃, r̃t̃ ← P̃(st, ãt̃), R̃(st, ãt̃)                    ▷ Direct action execution
17:         end if
18:         D ← D ∪ {(st, ãt̃, r̃t̃, st+Ht̃, Ht̃)}
19:         t ← t + Ht̃, t̃ ← t̃ + 1
20:         Update πϕ using model-free RL
21:     end while
22: end for
On the other hand, a kinematic motion planner MP(qt, gt) computes a collision-free path τ0:H = (qt, qt+1, . . . , qt+H = gt) from a start joint state qt to a goal joint state gt, where H is the number of states in the path. The sequence of actions at:t+H−1 that realize the path τ0:H can be obtained by computing the displacement between consecutive joint states, ∆τ0:H = (∆qt, . . . , ∆qt+H−1).
3.2.2 Motion Planner Augmented Reinforcement Learning
To efficiently learn a manipulation task in an obstructed environment, we propose motion planner augmented
RL (MoPA-RL). Our method harnesses a motion planner for controlling a robot toward a faraway goal
without colliding with obstacles, while directly executing small actions for sophisticated manipulation. By
utilizing MP, the robot can effectively explore the environment avoiding obstacles and passing through
narrow passages. For contact-rich tasks where MP often fails due to an inaccurate contact model, actions
can be directly executed instead of calling a planner.
As illustrated in Figure 3.1, our framework consists of two components: an RL policy πϕ(a|s) and a
motion planner MP(q, g). In our framework, the motion planner is integrated into the RL policy by enlarging
its action space. The agent directly executes an action if it is in the original action space. If an action is
sampled from outside of the original action space, which requires a large movement of the agent, the motion
planner is called and computes a path to realize the large joint displacement.
To integrate the motion planner with an MDP M, we first define an augmented MDP M̃(S, Ã, P̃, R̃, ρ0, γ), where Ã = [−∆qMP, ∆qMP]^d is an enlarged action space with ∆qMP > ∆qstep, P̃(s′ | s, ã) denotes the augmented transition function, and R̃(s, ã) is the augmented reward function. Since one motion planner call can execute a sequence of actions in the original MDP M, the augmented MDP M̃ can be considered as a semi-MDP [133], where an option ã executes an action sequence ∆τ0:H computed by the motion planner. For simplicity of notation, we use ã and ∆τ0:H interchangeably. The augmented transition function P̃(s′ | s, ã) = P̃(s′ | s, ∆τ0:H) is the state distribution after taking a sequence of actions, and the augmented reward function R̃(s, ã) = R̃(s, ∆τ0:H) is the discounted sum of rewards along the path.
On the augmented MDP M̃, the policy πϕ(ã | s) chooses an action ã, which represents a change in the joint state ∆q. The decision whether to call the motion planner or directly execute the predicted action is based on its maximum magnitude ||ã||∞, i.e., the maximum size of the predicted displacement, as illustrated in Figure 3.1. If the joint displacement is larger than an action threshold ∆qstep for any joint (i.e., ||ã||∞ > ∆qstep), which is likely to lead to collisions, the motion planner is used to compute a collision-free path τ0:H towards the goal g = q + ã. To follow the path, the agent executes the action sequence ∆τ0:H over H time steps. Otherwise, i.e., ||ã||∞ ≤ ∆qstep, the action is directly executed using a feedback controller for a single time step, as is common practice in model-free RL. This procedure is repeated until the episode is terminated. Then, the policy πϕ is trained to maximize the expected returns E_{πϕ}[ \sum_{t̃=0}^{T̃} γ^t R̃(st̃, ãt̃) ], where T̃ is the episode horizon on M̃ and t = \sum_{i=0}^{t̃−1} Hi is the number of primitive actions executed before time step t̃. The complete RL training with the motion planner augmented agent is described in Algorithm 1.
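To make the switching rule concrete, the following is a minimal Python sketch of an augmented-MDP step that dispatches on the infinity norm of the predicted displacement. The `motion_planner` callable, the gym-style `env.step` interface, and the `env.gamma` attribute are illustrative assumptions, not part of the released implementation.

```python
import numpy as np

def mopa_step(env, q_t, a_tilde, delta_q_step, motion_planner):
    """One augmented-MDP step: direct execution or a motion-planner call.

    a_tilde is the policy's joint-displacement action in the enlarged action
    space; delta_q_step is the direct-execution threshold.
    """
    if np.linalg.norm(a_tilde, ord=np.inf) > delta_q_step:
        # Large displacement: plan a collision-free path to q_t + a_tilde and
        # execute the resulting sequence of small joint displacements.
        path = motion_planner(q_t, q_t + a_tilde)            # waypoints q_t, ..., q_t + a_tilde
        deltas = np.diff(np.asarray(path), axis=0)           # per-step displacements along the path
        total_reward, obs, done = 0.0, None, False
        for k, dq in enumerate(deltas):
            obs, reward, done, _ = env.step(dq)
            total_reward += (env.gamma ** k) * reward        # discounted sum along the path
            if done:
                break
        return obs, total_reward, done, len(deltas)
    # Small displacement: execute directly with the low-level feedback controller.
    obs, reward, done, _ = env.step(a_tilde)
    return obs, reward, done, 1
```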
The proposed method has three advantages. First, our method gives the policy the freedom to choose
whether to call the motion planner or directly execute actions by predicting large or small actions, respectively. Second, the agent naturally produces trajectories that avoid collisions by leveraging MP, allowing
for safe policy execution even in obstructed environments. The third advantage is that MoPA-RL can add
motion planning capabilities to any RL algorithm with joint space control as it does not require changes to
the agent’s architecture or training procedure.
3.2.3 Action Space Rescaling
Figure 3.4: To balance the chance of choosing direct
action execution and motion planning during exploration, we increase the action space for direct action
execution A.
The proposed motion planner augmented action space Ã = [−∆qMP, ∆qMP]^d extends the typical action space for model-free RL, A = [−∆qstep, ∆qstep]^d. An action ã from the original action space A is directly executed with a feedback controller. On the other hand, an action from outside of A is handled by the motion planner. However, in practice, ∆qMP is much larger than ∆qstep, which results in a drastic difference between the proportions of the action spaces for direct action execution and motion planning. Especially with high-dimensional action spaces, this leads to a very low probability (∆qstep/∆qMP)^d of selecting direct action execution during exploration. Hence, this naive action space partitioning biases exploration toward motion planning over direct action execution and leads to failures in learning contact-rich manipulation tasks.
To circumvent this issue, we balance the ratio of sampling actions for direct action execution ã ∈ A and motion plan actions ã ∈ Ã \ A by rescaling the action space. Figure 3.4 illustrates the distribution of direct action and motion plan action in Ã (y-axis) and the desired distribution (x-axis). To increase the portion of direct action execution, we apply a piecewise linear function f to the policy output u ∈ [−1, 1]^d and get joint displacements ∆q, as shown by the red line in Figure 3.4. From the policy output u, the action (joint displacement) of the i-th joint can be computed by

    ãi = f(ui) = (∆qstep/ω) ui                                                        if |ui| ≤ ω
                 sign(ui) [ ∆qstep + (∆qMP − ∆qstep)(|ui| − ω)/(1 − ω) ]              otherwise,        (3.1)

where ω ∈ [0, 1] determines the desired ratio between the sizes of the two action spaces and sign(·) is the sign function.
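A minimal NumPy sketch of Equation 3.1, applied elementwise to the squashed policy output, could look as follows; the function and argument names are illustrative.

```python
import numpy as np

def rescale_action(u, delta_q_step, delta_q_mp, omega):
    """Piecewise-linear rescaling of Equation 3.1, applied elementwise.

    u is the squashed policy output in [-1, 1]^d; omega in [0, 1] sets the
    fraction of the output range mapped to direct action execution.
    """
    u = np.asarray(u, dtype=float)
    direct = (delta_q_step / omega) * u                          # |u_i| <= omega
    planned = np.sign(u) * (
        delta_q_step
        + (delta_q_mp - delta_q_step) * (np.abs(u) - omega) / (1.0 - omega)
    )                                                            # |u_i| > omega
    return np.where(np.abs(u) <= omega, direct, planned)
```

For example, with ω = 0.7, 70% of each output dimension's range is mapped to direct action execution even though ∆qMP is much larger than ∆qstep.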
3.2.4 Training Details
We model the policy πϕ as a neural network. The policy and critic networks consist of 3 fully connected
layers of 256 hidden units with ReLU nonlinearities. The policy outputs the mean and standard deviation
of a Gaussian distribution over an action space. To bound the policy output u in [−1, 1], we apply tanh
activation to the policy output. Before executing the action in the environment, we transform the policy
output with the action rescaling function f(u) described in Equation 3.1. The policy is trained using a
model-free RL method, Soft Actor-Critic [29].
To improve sample efficiency, we randomly sample M intermediate transitions of a path from the motion
planner, and store the sampled transitions in the replay buffer. By making use of these additional transitions,
the agent experience can cover a wider region in the state space during training (see Appendix Section A.1).
For hyperparameters and more details about training, refer to Appendix Section B.2.
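A minimal sketch of this transition reuse (mirroring Algorithm 1, lines 9-13) is shown below; the replay buffer interface and the per-waypoint bookkeeping are assumptions for illustration.

```python
import numpy as np

def add_path_transitions(buffer, env_states, joint_path, rewards, gamma, num_reuse, rng=None):
    """Store num_reuse random sub-segments of an executed motion-plan path.

    env_states[k] and joint_path[k] are the environment state and joint
    configuration at the k-th waypoint; rewards[k] is the reward received
    when moving from waypoint k to k + 1.
    """
    if rng is None:
        rng = np.random.default_rng()
    H = len(joint_path) - 1
    for _ in range(num_reuse):
        n, m = sorted(rng.choice(H + 1, size=2, replace=False))
        action = joint_path[m] - joint_path[n]                           # net joint displacement
        reward = sum(gamma ** i * rewards[n + i] for i in range(m - n))  # discounted segment reward
        buffer.add(env_states[n], action, reward, env_states[m], horizon=m - n)
```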
3.2.5 Motion Planner Implementation
Our method seamlessly integrates model-free RL and MP through the augmented action space. Our method
is agnostic to the choice of MP algorithm. Specifically, we use RRT-Connect [56] from the open motion
planning library (OMPL) [130] due to its fast computation time. After the motion planning, the resulting
path is smoothed using a shortcutting algorithm [26]. For collision checking, we use the collision checking
function provided by the MuJoCo physics engine [138].
The expensive computation performed by the motion planner can be a major bottleneck for training.
Thus, we design an efficient MP procedure with several features. First, we reduce the number of costly
MP executions by using a simpler motion planner that attempts to linearly interpolate between the initial
and goal states instead of the sampling-based motion planner. If the interpolated path is collision-free, our
method uses this path for execution and skips calling the expensive MP algorithm. If the path has a collision,
then RRT-Connect is used to find a collision-free path amongst obstacles.
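The interpolation-first shortcut can be sketched as follows; `is_collision_free` stands in for the MuJoCo collision check and `rrt_connect` for the sampling-based planner call, both assumed wrappers.

```python
import numpy as np

def plan_path(q_start, q_goal, is_collision_free, rrt_connect, num_checks=50):
    """Try a straight-line joint-space path first; fall back to RRT-Connect."""
    # Densely interpolate between start and goal and check each waypoint.
    alphas = np.linspace(0.0, 1.0, num_checks)
    waypoints = [(1 - a) * q_start + a * q_goal for a in alphas]
    if all(is_collision_free(q) for q in waypoints):
        return waypoints                       # cheap straight-line path suffices
    return rrt_connect(q_start, q_goal)        # otherwise invoke the full planner
```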
Moreover, the RL policy can predict a goal joint state that is in collision or not reachable. A simple way
to resolve it is to ignore the current action and sample a new action. However, it slows down training because
the policy can repeatedly output invalid actions, especially in an environment with many obstacles. Thus,
our method finds an alternative collision-free goal joint state by iteratively reducing the action magnitude
and checking collision, similar to [163]. This strategy prevents the policy from being stuck or wasting
samples, which results in improved training efficiency. Finally, we allow the motion planner to derive plans
while grasping an object by considering the object as a part of the robot once it holds the object.
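The goal-adjustment strategy can be sketched as a simple shrinking loop; `is_valid` is an assumed wrapper around the collision and reachability checks, and the shrink factor and iteration cap are illustrative.

```python
def find_valid_goal(q_t, a_tilde, is_valid, shrink=0.8, max_iters=10):
    """Shrink the displacement until the goal joint state is valid,
    instead of discarding the sampled action."""
    scale = 1.0
    for _ in range(max_iters):
        goal = q_t + scale * a_tilde
        if is_valid(goal):
            return goal
        scale *= shrink          # reduce the action magnitude and retry
    return None                  # no valid goal found; skip the planner call
```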
3.3 Experiments
We design our experimental evaluation to answer the following questions: (1) Can MoPA-RL solve complex manipulation tasks in obstructed environments more efficiently than conventional RL algorithms? (2) Is
MoPA-RL better able to explore the environment? (3) Does MoPA-RL learn policies that are safer to execute?
3.3.1 Environments
To answer these questions, we conduct experiments on the following hard-exploration tasks in obstructed
settings, simulated using the MuJoCo physics engine [138] (see Figure 3.3 for visualizations):
• 2D Push: A 4-joint 2D reacher needs to push an object into a goal location in the presence of multiple
obstacles.
• Sawyer Push: A Rethink Sawyer robot arm with 7 DoF needs to push an object into a goal position,
both of which are inside a box.
• Sawyer Lift: The Sawyer robot arm needs to grasp and lift a block out of a deep box.
• Sawyer Assembly: The Sawyer arm needs to move a leg attached to its gripper towards a mounting
location while avoiding other legs, and assemble the table by inserting the leg into the hole. The
environment is built upon the IKEA furniture assembly environment [69].
We train 2D Push, Sawyer Push and Sawyer Assembly using sparse rewards: when close to the object or
goal the agent receives a reward proportional to the distance between end-effector and object or object and
goal; otherwise there is no reward signal. For Sawyer Lift, we use a shaped reward function, similar to [21].
In all tasks, the agent receives a sparse completion reward upon solving the tasks. Further details about the
environments can be found in Appendix Section B.1.
3.3.2 Baselines
We compare the performance of our method with the following approaches:
• SAC: A policy trained to predict displacements in the robot’s joint angles using Soft Actor-Critic
(SAC, [29]), a state-of-the-art model-free RL algorithm.
[Plots of success rate vs. environment steps (1M) for (a) 2D Push, (b) Sawyer Push, (c) Sawyer Lift, and (d) Sawyer Assembly; legend: SAC, SAC Large, SAC IK, MoPA-SAC, MoPA-SAC discrete, MoPA-SAC IK.]
Figure 3.5: Success rates of our MoPA-SAC (green) and baselines averaged over 4 seeds. All methods are trained for the same
number of environment steps. MoPA-SAC can solve all four tasks by leveraging the motion planner and learn faster than the
baselines.
• SAC Large: A variant of SAC that predicts joint displacements in the extended action space Ã. To realize a large joint displacement, a large action ã is linearly interpolated into a sequence of small actions {a1, . . . , am} such that ã = \sum_{i=1}^{m} ai and ||ai||∞ ≤ ∆qstep. This baseline tests the importance of collision-free MP for a large action space.
• SAC IK: A policy trained with SAC which outputs displacement in Cartesian space instead of the
joint space. For 3D manipulation tasks, the policy also outputs the displacement in end-effector
orientation as a quaternion. Given the policy output, a target joint state is computed using inverse
kinematics and the joint displacement is applied to the robot.
• MoPA-SAC (Ours): Our method predicts a joint displacement in the extended action space.
• MoPA-SAC Discrete: Our method with an additional discrete output that explicitly chooses between
MP and RL, which replaces the need for a threshold ω.
• MoPA-SAC IK: Our method with end-effector space control instead of joint space displacements.
Again, inverse kinematics is used to obtain a target joint state, which is either directly executed or
planned towards with the motion planner.
3.3.3 Efficient RL with Motion Planner Augmented Action Spaces
We compare the learning performance of all approaches on four tasks in Figure 3.5. Only our MoPA-SAC
is able to learn all four tasks, while other methods converge more slowly or struggle to obtain any rewards.
The difference is especially large in the Sawyer Lift and Sawyer Assembly tasks. This is because precise
movements are required to maneuver the robot arm to reach inside the box that surrounds the objects or
avoid the other table legs while moving the leg in the gripper to the hole. While conventional model-free
RL agents struggle to learn such complex motions from scratch, our approach can leverage the capabilities
of the motion planner to successfully learn to produce collision-free movements.
Figure 3.6: End-effector positions of SAC (left) and
MoPA-SAC (right) after the first 100k training environment steps in 2D Push are plotted in blue dots. The use
of motion planning allows the agent to explore the environment more widely early on in training.
To further analyze why augmenting RL agents with
MP capability improves learning performance, we compare the exploration behavior in the first 100k training
steps of our MoPA-SAC agent and the conventional SAC
agent on the 2D Push task in Figure 3.6. The SAC agent
initially explores only in close proximity to its starting
position as it struggles to find valid trajectories between
the obstacles. In contrast, the motion planner augmented
agent explores a wider range of states by using MP to find collision-free trajectories to faraway goal states.
This allows the agent to quickly learn the task, especially in the presence of many obstacles. Efficient exploration is even more challenging in the obstructed 3D environments. Therefore, only the MoPA-SAC agent
that leverages the motion planner for efficient exploration is able to learn the manipulation tasks.
The comparison between different action spaces for our method in Figure 3.5 shows that directly predicting joint angles and using the motion planner based on action magnitude leads to the best learning performance (MoPA-SAC (Ours)). In contrast, computing the target joint angles for the motion planner using
inverse kinematics (MoPA-SAC IK) often produces configurations that are in collision with the environment,
especially when manipulations need to be performed in narrow spaces. MoPA-SAC Discrete needs to jointly
learn how and when to use MP by predicting a discrete switching variable. We find that this approach rarely
uses MP, leading to worse performance.
[Plots: (a) contact force for MoPA-SAC (Ours), MoPA-SAC IK, MoPA-SAC Discrete, SAC, and SAC IK; (b) success rate vs. environment steps for action ranges ∆qMP ∈ {0.1, 0.3, 0.5, 1.0}; (c) success rate for MoPA-SAC (Ours), MoPA-SAC w/o rescaling, and MoPA-SAC w/o direct.]
Figure 3.7: (a) Averaged contact force in an episode over 7 executions in 2D Push. Leveraging a motion planner, all variants of our
method naturally learn collision-safe trajectories. (b) Comparison of our model with different action range values ∆qMP on Sawyer
Lift. (c) Comparison of our model w/ and w/o action rescaling or w/o direct action execution on Sawyer Lift.
3.3.4 Safe Policy Execution
The ability to execute safe collision-free trajectories in obstructed environments is important for the application of RL in the real world. We hypothesize that the MoPA-RL agents can leverage MP to learn trajectories
that avoid unnecessary collisions. To validate this, we report the average contact force of all robot joints on
successful rollouts from the trained policies in Figure 3.7a. The MoPA-RL agents show low average contact
forces that are mainly the result of the necessary contacts with the objects that need to be pushed or lifted.
Crucially, these agents are able to perform the manipulations safely while avoiding collisions with obstacles.
In contrast, conventional RL agents are unable to effectively avoid collisions in the obstructed environments,
leading to high average contact forces.
3.3.5 Ablation Studies
Action range: We analyze the influence of the action range ∆qMP on task performance in Figure 3.7b. We
find that for too small action ranges the policy cannot efficiently explore the environment and does not learn
the task. Yet, for too large action ranges the number of possible actions the agent needs to explore is large,
leading to slow convergence. In between, our approach is robust to the choice of action range and able to
learn the task efficiently.
Action rescaling and direct action execution: In Figure 3.7c we ablate the action space rescaling introduced in Section 3.2.3. We find that action space rescaling improves learning performance by encouraging
balanced exploration of both single-step and motion planner action spaces. Even more crucial, however, is our
hybrid action space formulation with direct and MP action execution: MoPA-SAC trained without direct
action execution struggles on contact-rich tasks, since it is challenging to use the motion planner for solving
contact-rich object manipulations.
Additional ablation studies are available in Appendix Section A.
3.4 Conclusion
In this chapter, we proposed a flexible framework that combines the benefits of both motion planning and
reinforcement learning for sample-efficient learning of continuous robot control in obstructed environments.
Specifically, we augment a model-free RL agent with a sampling-based motion planner, requiring minimal task-specific
knowledge. The RL policy can learn when to use the motion planner and when to take a single-step action
directly through reward maximization. The experimental results show that our approach improves the training efficiency over conventional RL methods, especially on manipulation tasks in the presence of many
obstacles. These results are promising and motivate future work on using more advanced motion planning
techniques in the action space of reinforcement learning. Another interesting direction is the transfer of our
framework to real robot systems.
Chapter 4
Learning Deformable Object Manipulation from Demonstrations
We present a novel Learning from Demonstration (LfD) method, Deformable Manipulation from Demonstrations (DMfD), to solve deformable manipulation tasks using states or images as inputs, given expert
demonstrations. Our method uses demonstrations in three different ways, and balances the trade-off between exploring the environment online and using guidance from experts to explore high dimensional spaces
effectively. We test DMfD on a set of representative manipulation tasks for a 1-dimensional rope and a 2-
dimensional cloth from the SoftGym suite of tasks, each with state and image observations. Our method
exceeds baseline performance by up to 12.9% for state-based tasks and up to 33.44% on image-based tasks,
with comparable or better robustness to randomness. Additionally, we create two challenging environments
for folding a 2D cloth using image-based observations, and set a performance benchmark for them. We
deploy DMfD on a real robot with a minimal loss in normalized performance during real-world execution
compared to simulation (∼ 6%). Videos and code are available at uscresl.github.io/dmfd.
4.1 Introduction
Manipulating deformable objects is a formidable challenge: extracting state information and modeling are
both difficult problems and the task is dauntingly high-dimensional with a large action space. Learning to
This chapter is based on Gautam Salhotra et al. “Learning Deformable Object Manipulation From Expert Demonstrations”.
In: IEEE Robotics and Automation Letters 7.4 (2022), pp. 8775–8782. DOI: 10.1109/LRA.2022.3187843.
manipulate deformable objects from expert demonstrations may offer a way forward to alleviate some of
these problems.
We present a new Learning from Demonstration (LfD) method – Deformable Manipulation from Demonstrations (DMfD) – that works with high-dimensional state or image observations. It absorbs expert guidance, whether from human execution or hand-engineered, while learning online to solve challenging deformable manipulation tasks such as cloth folding. DMfD is an asymmetric actor-critic method that uses
expert data in three ways: 1. the replay buffer is pre-populated with expert trajectories before training, 2.
during training, we leverage an advantage-weighted loss, where the replay buffer samples are weighted to
encourage the policy to stay close to the stored expert actions, and 3. during experience collection, we use reference state initialization. Our results show that this non-trivial and novel combination of the three equips the
agent with the ability to explore high dimensional spaces effectively while leveraging guidance from expert
demonstrations. Our contributions are as follows.
1. To encourage wide exploration, we add an exploration term (a soft state value function) to the
advantage-weighted loss. This term samples actions according to the current policy instead of actions from the replay buffer. This is an improvement over the original advantage-weighted formulation [100, 92], which only samples actions from the replay buffer to update its policy. To deploy our
methods in real-world settings, we extend the advantage-weighted framework to the image domain,
using CNNs and data augmentation (random crops [161]) to prevent overfitting.
2. During experience collection, we introduce probabilistic reference state initialization (RSI). Instead
of always resetting the agent to the states seen by the expert [99], we invoke RSI probabilistically.
This promotes exploration and learning in states that are difficult to reach (for example, due to high
dimensionality or the dynamics of the environment) while the agent has the opportunity to learn from
previously seen states.
(a) Straighten Rope
(b) Cloth Fold
(c) Cloth Fold Diagonal Pinned
(d) Cloth Fold Diagonal Unpinned
(e) Cloth Fold Diagonal Unpinned on real robot
Figure 4.1: Learning deformable manipulation. Our method DMfD is a learned agent that achieves state-of-the-art performance among methods that use expert demonstrations for solving difficult deformable manipulation tasks, such as straightening 1D ropes and folding 2D cloths based on scene images. We set a new benchmark on the Straighten Rope (Figure 4.1a) task
which requires the agent to straighten a rope with two end effectors, shown as white spheres, and on the Cloth Fold (Figure 4.1b)
task which requires the agent to fold a flattened cloth into half, along an edge. Both tasks are from the SoftGym suite [79]. Additionally, we introduce and solve a new task constrained to a single end effector - the Cloth Fold Diagonal task, which requires an
agent to fold a square cloth along a diagonal. In the pinned version (Figure 4.1c) of this task, the cloth is clamped to the table at a
corner; in the unpinned version (Figure 4.1d) it is not. Figure 4.1e shows the unpinned version being executed on a real robot.
3. We create two new environments (2D deformables, image-based observations), with one robot arm.
We deploy DMfD on a real robot with a minimal sim2real gap (∼6%), indicating that it can work in
real-world settings.
4. DMfD outperforms LfD and non-LfD baselines on both state-based environments (by up to 12.9%
median performance) and on image-based environments (by up to 33.44% median performance).
Sample rollouts of our method for difficult image-based manipulation tasks and real robot experiments
can be seen in Figure 4.1.
4.2 Formulation and Approach
We formulate the deformable manipulation problem as a partially observable Markov decision process (POMDP). Consider a POMDP with state space S, action space A, observation space O, discount factor γ, horizon H, dynamics function T : S × A → S and reward function r : S × A → R. At time t, the agent is at state s ∈ S, gets an observation o ∈ O and takes action a ∈ A. It reaches state s′, and gets back an observation o′ and reward rt = r(st, at). The discounted reward from time t is given by Rt = \sum_{i=t}^{H} γ^i ri. We generalize a single task over a family of variants V that determine properties of the object to be manipulated. The initial state is a function of the variant, s0(v), v ∼ V.
The problem reduces to finding the best policy π ∈ Π that maximizes the expected discounted reward J(π) of an episode, over task variants v and the distribution induced by the policy:

    J(π) = E_{τ∼π(τ), v∼V} [R0]    (4.1)

subject to st+1 = T(st, at) and initial state s0(v). Here π(τ) is the likelihood of trajectory τ = (s0, a0, s1, a1, . . . , sH) under policy π and initial condition s0.
Figure 4.2: Schematic of our method. The agent obtains observations from the environment (during experience collection) or
the replay buffer B (during training). Pre-populated expert demonstrations in the replay buffer are shown in Green. The training
pipeline works with state-based or image-based observations. With state-based observations, the actor and critic get an encoding
of the system state (oQ = oπ = os), shown as Black and Blue arrows. With image-based observations, the actor gets an encoding
of the image whereas the critic gets encodings of both the state and the image (oπ = oimage, oQ = os ∪ oimage), denoted by
Black and Red arrows.
We assume the availability of expert data, which may be hand-engineered solutions, demonstrations
by a human expert, or any other method of procuring trajectories that solve the task. Thus, we have a
demonstration dataset that we wish to learn from, in addition to the agent’s rollouts during experience
collection. We choose to maximize the expected advantage A^π(st, at) instead of the return Rt because it is an unbiased estimator of the expected return with lower variance [152]. We maximize this advantage over a sampling of transitions from a replay buffer B of a mixture of policies, using a sampling policy πB. This formulation is similar to Advantage-Weighted Regression (AWR) [100] with experience replay over a mixture of policies. Our policy optimization problem can be defined as maximizing advantage while remaining close to the sampling policy:

    π* = argmax_{π∈Π} E_{s∼dπ(s)} E_{a∼π(·|s)} [A^π(s, a)]    (4.2)
    s.t. DKL(π(·|s) ∥ πB(·|s)) ≤ ϵ    (4.3)
Algorithm 2 Deformable Manipulation from Demonstrations
Require: Task distribution V, expert trajectories E
 1: Initialize replay buffer B = E
 2: Initialize πθ, Qϕ
 3: for iteration i = 1, 2, . . . do
 4:     Sample batch (s, o, aB, o′, r, d) ∼ B
 5:     Get current policy action aπ ∼ πθ(o)
 6:     Compute critic loss LQ as in Equation 4.5
 7:     ϕ ← OPT(ϕ, ∇LQ)                  ▷ Optimize critic
 8:     Compute actor loss Lπ as in Equation 4.8
 9:     θ ← OPT(θ, ∇Lπ)                  ▷ Optimize actor
10:     τ1, τ2, . . . , τK ∼ πθ(τ)        ▷ Experience collection
11:     B ← B ∪ {τ1, τ2, . . . , τK}
12: end for
13: Return πθ
where dπ(s) is a state distribution induced by π and DKL is the KL divergence. Following AWR, we reduce the objective and constraint to an advantage-weighted objective for a policy with parameters θ:

    LA = E_{s,a∼B} [ log πθ(a|s) exp( (1/λ) A^π(s, a) ) ]    (4.4)

where λ is a temperature parameter (see [100] for a complete derivation). The loss function LQ for the critic Qϕ (with parameters ϕ) is based on the error between the estimated Q-value qϕ,B and the Bellman update b:

    LQ = E_B [ ||qϕ,B − b||^2 ]    (4.5)

where b = r + γ E[Qϕ(s′, a′)] during the episode and b = r at the last timestep t = H. Since state estimation is difficult for deformable manipulation, we extend this formulation to the partially observable case. Thus, the policy acts on the observation πθ(a|o) instead of the state πθ(a|s).
Our (actor-critic) method learns from an expert dataset, while having access to online interaction with the environment. Before training, we populate the replay buffer B with expert trajectories E (replay-buffer spiking [80]). This is known to improve performance (even with few episodes), since it shows the existence
of a good policy with large reward. It helps the algorithm realize good actions early on (Section 4.3.4). Unlike offline RL, we have easy access to the simulator, giving the agent the ability to explore the environment to find potentially better trajectories than the offline expert dataset. To promote this, we update the replay buffer during training, thus updating the mixture of policies that make up the sampling policy. Thus, we have the ability to learn from, and even exceed, expert data in the environment. We add entropy regularization to the actor, to balance exploration and exploitation [30]. We require that the policy maximize an entropy-regularized version of the value function

    V(s) = E_{a∼π(·|s)} [ Q(s, a) − α log πθ(a|o) ]    (4.6)

where α is a weighting hyper-parameter and a is sampled from the current policy. We propose an entropy loss term to minimize,

    LE = E_{s,a,o∼B} [ α log πθ(a|o) − Q(s, a) ]    (4.7)

Our policy loss is a wE-weighted linear combination,

    Lπ = (1 − wE) LA + wE LE,  0 ≤ wE ≤ 1    (4.8)

While this does not have a tractable closed-form solution, we can optimize it numerically with gradient steps. As is typical, we alternate gradient steps for the actor and critic respectively. The algorithm is shown in Algorithm 2.
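To make the loss structure concrete, below is a minimal PyTorch-style sketch of the actor update combining the advantage-weighted term (Equation 4.4) and the entropy term (Equation 4.7) into Equation 4.8. The `policy` and `q_fn` interfaces, the exponentiated-weight clamp, and the use of a single policy sample to estimate the value baseline are illustrative assumptions rather than the exact released implementation.

```python
import torch

def dmfd_actor_loss(policy, q_fn, batch, w_e=0.1, alpha=0.5, lam=1.0):
    """Actor loss of Equation 4.8: (1 - w_E) * L_A + w_E * L_E."""
    s, o, a_buf = batch["state"], batch["obs"], batch["action"]

    # Advantage-weighted term (Eq. 4.4): weight replay-buffer actions by
    # exp(A / lambda), approximating the value baseline with a Q-value at a policy sample.
    with torch.no_grad():
        a_pi_baseline = policy.sample(o)
        adv = q_fn(s, a_buf) - q_fn(s, a_pi_baseline)
        weights = torch.exp(adv / lam).clamp(max=20.0)   # clamp for numerical stability
    loss_a = -(weights * policy.log_prob(o, a_buf)).mean()

    # Entropy term (Eq. 4.7): actions sampled from the current policy,
    # not from the replay buffer, which encourages wider exploration.
    a_pi, log_pi = policy.rsample_with_log_prob(o)       # reparameterized sample
    loss_e = (alpha * log_pi - q_fn(s, a_pi)).mean()

    return (1.0 - w_e) * loss_a + w_e * loss_e
```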
During experience collection, with a tuned probability pη, we reset the robot to some environment state
that the expert was in. We then compare the trajectory of the agent with the trajectory of the expert and
provide an imitation reward based on the states achieved. Reference state initialization (RSI) [99] was introduced for dynamic tasks. It helps to explore and learn in high-dimensional states that are difficult to reach.
However, always using RSI (i.e., always reset to a state the expert has seen) prevents the agent from exploring the environment freely and may lead to overfitting to those demonstrations. As Section 4.3.4 discusses,
both 0% and 100% RSI are worse than probabilistic RSI, implying that expert guidance helps when applied
sparingly. Thus, once again, we have the ability to learn from and exceed the expert. Probabilistic RSI is
similar to replay buffer spiking [80] referenced above, in that knowing the existence of some good actions
and rewards (without using only those) is beneficial. Further, this decreased dependence on experts allows
us to work with suboptimal experts, potentially reducing the burden on the human expert.
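As a sketch of how probabilistic RSI could be wired into experience collection, the snippet below resets to an expert-visited state only with probability pη; `env.reset_to_state` and the expert trajectory format are assumed interfaces.

```python
import random

def reset_with_probabilistic_rsi(env, expert_trajectories, p_rsi=0.3):
    """With probability p_rsi, reset to a state visited by the expert (RSI);
    otherwise reset normally so the agent can explore freely."""
    if random.random() < p_rsi:
        traj = random.choice(expert_trajectories)
        t = random.randrange(len(traj["states"]))
        obs = env.reset_to_state(traj["states"][t])   # assumed environment API
        # Returning the reference trajectory and index allows an imitation
        # reward to be computed against the expert's subsequent states.
        return obs, traj, t
    return env.reset(), None, None
```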
Our state encoder network is composed of multi-layer perceptrons with tanh activation, as we normalize
our actions to [−1, 1]. Our image encoder network is a Convolutional Neural Network (CNN) to process
images. We also augment the input image with random crops, a known improvement for vision-based
reinforcement learning [161]. Figure 4.2 shows these architectures. Note that the critic receives state input in
addition to the observation, whereas the actor only gets the type of observation chosen for the environment.
This asymmetry has been shown to be useful for stabilising the critic [102], and is justified in Section 4.3.4.
Network specifics are given in Section 5.3.
4.3 Experiments
4.3.1 Tasks and Experimental Setup
We use four different tasks and two different observation types in our experiments (below), all of which
are conducted in the SoftGym suite [79]. We encode object states with an object-specific reduced-state
that SoftGym provides, and use it to train all methods that require object state as input. Details for the
reduced-state representation for each task are given below. The image observation is a 32x32 RGB image
of the environment showing the object and robot end-effector. Each task has a number of deformable object
property variants for effective domain randomization.
[Plots of normalized performance vs. million steps for (a) Straighten Rope State, (b) Straighten Rope Image, (c) Cloth Fold Diagonal Pinned Image, (d) Cloth Fold State, (e) Cloth Fold Image, and (f) Cloth Fold Diagonal Unpinned Image; legends include DMfD (Ours), SAC, AWAC, SAC-LfD, SAC-CURL, SAC-DrQ, SAC-BC, and Expert.]
Figure 4.3: Performance comparisons. Learning curves of the normalized performance pˆ(H) for all environments during training.
The first column (4.3a & 4.3d) shows SoftGym state-based environments, the second column (4.3b & 4.3e) shows SoftGym image-based environments, and the third column (4.3c & 4.3f) shows our new Cloth Fold Diagonal environments. All environments were trained until convergence. State-based DMfD is in light blue, and the image-based agent is in dark blue. The expert performance is the solid black line. We compare against the baselines described in Section 4.3.2. Behavioural Cloning does not train online; its
results are shown in Table 4.1 and Table 4.2. We plot the mean µ of the curves as a solid line, and shade one standard deviation
(µ ± σ). DMfD consistently beats the baselines, with comparable or better variance. For a detailed discussion see Section 4.3.5.
Straighten Rope: The objective is to stretch the ends of the rope a fixed distance from each other, to force
the rope to be straightened. The reduced state is the (x, y, z) coordinates of 10 equidistant keypoints along
the rope, including the endpoints. Performance is measured by comparing the distance between endpoints
to a fixed length parameter.
Cloth Fold: The objective is to fold a flattened cloth into half, along an edge, using two end-effectors. The
reduced state is the (x, y, z) coordinates of each corner. Performance is measured by comparing how close
the left and right corners are to each other.
Cloth Fold Diagonal Pinned: The objective is to fold the cloth along a specified diagonal of a square cloth,
with a single end-effector. One corner of the cloth is pinned to the table by a heavy block. The reduced state
is the (x, y, z) coordinates of each corner. Performance is measured by comparing how close the bottom-left
and top-right corners are to each other. This is a new task introduced in this chapter.
Cloth Fold Diagonal Unpinned: The objective is to fold the cloth along a specified diagonal of a square
cloth, with a single end-effector. The cloth is free to move on the table top. The reduced state is the (x, y, z)
coordinates of each corner. Performance is measured by comparing how close the bottom-left and top-right
corners are to each other. This is a new task we introduce in this chapter.
In each task, image-based environments were observed to be more difficult to solve than state-based
environments; thus for the new Cloth Fold Diagonal tasks we focus on the more difficult (image-based)
setting. This produces 6 environments (4 from SoftGym: state- and image-based settings for Straighten
Rope and Cloth Fold) and 2 newly introduced here (both image-based settings for Cloth Fold Diagonal
Pinned and Cloth Fold Diagonal Unpinned). We create demonstrations using hand-engineered solutions,
where the expert is an oracle with access to the full state and dynamics.
The following subsections compare the performance of agents in each task, as measured by a normalized metric (in [0, 1]) described in SoftGym. The normalized performance at time t, p̂(t), is given by p̂(t) = (p(st) − p(s0)) / (popt − p(s0)), where p is the environment-specific performance function of state st at time t and popt is the best possible performance on the task. As in SoftGym, we use the normalized performance at the end of the episode, p̂(H).
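For reference, the metric can be computed directly from the environment's performance function; the snippet below is a one-line restatement of the formula above with an illustrative numerical example.

```python
def normalized_performance(p_t, p_0, p_opt):
    """p_hat(t) = (p(s_t) - p(s_0)) / (p_opt - p(s_0))."""
    return (p_t - p_0) / (p_opt - p_0)

# Example: p(s_0) = 0.1, p(s_H) = 0.55, p_opt = 1.0  ->  p_hat(H) = 0.5
```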
We used an actor-critic model with the actor and critic networks both having 2 hidden 1024-wide layers with tanh activations. For vision input, we use a convolutional neural network with 4 convolution layers, each with 32 channels, single stride, a 3x3 kernel, and LeakyReLU activation functions, followed by 2 1024-wide dense layers. Our RSI probability pη was 0.2 for state and 0.3 for image observations. Our entropy regularization weight was wE = 0.1, with coefficient α = 0.5 in entropy regularization, and we used a discount factor of γ = 0.9. Our expert dataset was tuned to hold 8,000 episodes, each with an episode horizon of 75 steps.
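The stated layer sizes translate into roughly the following PyTorch modules; output dimensions, padding, and the exact ordering of the dense layers are assumptions beyond what the text specifies.

```python
import torch.nn as nn

def mlp_head(in_dim, out_dim):
    # Actor/critic trunk: two hidden 1024-wide layers with tanh activations.
    return nn.Sequential(
        nn.Linear(in_dim, 1024), nn.Tanh(),
        nn.Linear(1024, 1024), nn.Tanh(),
        nn.Linear(1024, out_dim),
    )

class ImageEncoder(nn.Module):
    # Vision encoder: four 32-channel 3x3 convolutions (stride 1) with
    # LeakyReLU, followed by two 1024-wide dense layers.
    def __init__(self, in_channels=3, feature_dim=1024):
        super().__init__()
        convs, c = [], in_channels
        for _ in range(4):
            convs += [nn.Conv2d(c, 32, kernel_size=3, stride=1), nn.LeakyReLU()]
            c = 32
        self.convs = nn.Sequential(*convs)
        # A 32x32 input passed through four valid 3x3 convs gives a 24x24 map (assumed, no padding).
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 24 * 24, 1024), nn.LeakyReLU(),
            nn.Linear(1024, feature_dim),
        )

    def forward(self, x):
        return self.fc(self.convs(x))
```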
We ran our experiments on a server with Intel Xeon CPU cores (3.00GHz) and NVIDIA GeForce RTX
2080 Ti GPUs. Our experiments ran with 16 CPUs and 1 GPU allocated. We ran image-based methods for
1M steps, as in SoftGym experiments, but were able to run state-based methods for 3M steps as they were
much faster to train, due to the low dimensional reduced-state input. With high dimensional inputs such as
images, even elements such as the replay buffer, expert dataset, and vision encoder need a lot more memory
and compute, which slows down training. For example, we observed that when running our experiment on
the Cloth Fold environment, it took 34 hours to run 1M steps on the image-based version, and 39 hours to
run 3M steps on the state-based version.
4.3.2 Performance Comparisons
We compare our method with these Non-LfD baselines:
• SAC: A SOTA off-policy actor-critic RL algorithm.
• SAC-CURL: A SOTA off-policy image-based RL algorithm using contrastive learning [62].
• SAC-DrQ: A SOTA off-policy image-based RL algorithm using data augmentations and regularized
Q-function [161].
                        SAC                 AWAC                Expert              BC                  DMfD (ours)
Straighten Rope State
  µ ± σ                 0.7006 ± 0.2460     0.6229 ± 0.2977     0.8285 ± 0.0986     0.7307 ± 0.2113     0.9023 ± 0.1232
  25th%                 0.5932              0.3288              0.7512              0.6321              0.8778
  median                0.7608              0.7078              0.8205              0.8057              0.9347
  75th%                 0.8978              0.8983              0.9107              0.8646              0.9702
Cloth Fold State
  µ ± σ                 -0.2769 ± 0.7194    0.5992 ± 0.2461     0.7060 ± 0.1585     0.2119 ± 0.4308     0.7708 ± 0.1168
  25th%                 -0.5379             0.5062              0.6367              -0.0827             0.7203
  median                -0.1452             0.6686              0.7259              0.0018              0.7764
  75th%                 0.0380              0.7743              0.8147              0.6606              0.8458
Table 4.1: Performance metric with normalized performance p̂(H) for state-based environments. We obtain the models at the end of training for each method, for a total of 5 seeds. For each method, we evaluate it over 100 episodes and obtain performance statistics on the accumulated results. We show the mean and variance as well as the median and 25th & 75th percentiles of performance. Experts are oracles with state information.
                        CURL                DrQ                 Expert              BC                  DMfD (ours)
Straighten Rope Image
  µ ± σ                 0.5512 ± 0.2617     0.5355 ± 0.2589     0.8285 ± 0.0986     0.4536 ± 0.2558     0.6684 ± 0.2563
  25th%                 0.3105              0.3241              0.7512              0.2424              0.4708
  median                0.5819              0.5271              0.8205              0.3643              0.7189
  75th%                 0.7620              0.7402              0.9107              0.6585              0.8881
Cloth Fold Image
  µ ± σ                 -0.021 ± 0.2373     -0.5297 ± 0.6045    0.7060 ± 0.1585     0.1374 ± 0.0959     0.3948 ± 0.3176
  25th%                 -0.001              -0.7893             0.6367              0.0902              0.0000
  median                0.000               -0.3502             0.7259              0.1586              0.4930
  75th%                 0.000               -0.0392             0.8147              0.2099              0.6682
Cloth Fold Diagonal Pinned Image
  µ ± σ                 0.6788 ± 0.0442     0.7745 ± 0.0346     0.9055 ± 0.0085     0.5703 ± 0.3492     0.8952 ± 0.0099
  25th%                 0.6572              0.7633              0.8980              0.2763              0.8923
  median                0.6784              0.7749              0.9047              0.7433              0.8957
  75th%                 0.6947              0.7853              0.9138              0.8956              0.8991
Cloth Fold Diagonal Unpinned Image
  µ ± σ                 0.7894 ± 0.0363     0.8349 ± 0.0469     0.9272 ± 0.0107     0.9049 ± 0.0094     0.9403 ± 0.0345
  25th%                 0.7709              0.8107              0.9179              0.9031              0.9117
  median                0.7844              0.8457              0.9302              0.9074              0.9505
  75th%                 0.8114              0.8696              0.9369              0.9105              0.9745
Table 4.2: Performance metric with normalized performance p̂(H) for image-based environments. We obtain the models at the end of training for each method, for a total of 5 seeds. For each method, we evaluate it over 100 episodes and obtain performance statistics on the accumulated results. We show the mean and variance as well as the median and 25th & 75th percentiles of performance. Experts are oracles with state information; we do not highlight their performance because they are not image-based.
[Diagram: raw RGB image → preprocessed image → DMfD vision encoder → actor π(a|oπ) → robot execution.]
Figure 4.4: Pipeline for real robot experiments.
[Plots of performance vs. million steps for (a) Entropy regularization, (b) RSI with state-based observations, (c) RSI with image-based observations, (d) Random Crop, (e) Expert Dataset Size, and (f) Critic inputs.]
Figure 4.5: Ablation studies. Ablations were performed on the Straighten Rope environment, to verify the necessity for each
feature used. State-based DMfD is shown in light blue, and image-based DMfD is in dark blue. Entropy regularization (Figure 4.5a) and one ablation for reference state initialization (Figure 4.5b) were run on the state-based environment. The other ablations (Figure 4.5c, Figure 4.5d, Figure 4.5e, and Figure 4.5f) require an image-based environment. We plot the mean µ of the curves as a solid line, and shade one standard deviation (µ ± σ). Detailed discussion of these features is in Section 4.3.5. In Figure 4.5e, '100 episodes*' refers to 100 episodes of data copied 80 times to mimic the buffer of 8000 episodes without actually creating as many expert demonstrations.
We also compare with these LfD baselines:
• AWAC: A SOTA off-policy RL algorithm that learns from offline data followed by online finetuning [92].
• BC-State: A behavior cloning policy trained on state-action pairs [140].
• SAC-LfD: SAC with pre-populated expert data in the replay buffer.
• BC-Image: A behavior cloning policy trained on image-action pairs [140].
• SAC-BC: SAC with initialized actor networks from pre-trained BC-Image on expert demonstrations.
We use Softgym’s implementations and hyperparameters for baselines where applicable, taken from the
official implementations of the algorithms cited. We did not include the PlaNet [32] baseline as it did not
beat the other image-based baselines. Figure 4.3 shows training curves and Table 4.2 shows the comparison
at the end of training. DMfD outperforms all baselines. For both state- and image-based environments
as the tasks get more difficult, DMfD outperforms baselines by higher margins. A detailed discussion is
in Section 4.3.5.
4.3.3 Real Robot Experiments
Setup: We use the DMfD model trained in simulation to perform the Cloth Fold Diagonal Unpinned task
on a Franka Emika Panda robot arm and the default gripper. An Intel RealSense camera is used to capture
RGB images of a top-down view of the cloth. To obtain real-world images that resemble simulated images,
we center crop the original RGB images from the camera, segment the cloth from the background, and fill
the cloth and the background with colors from the simulated cloth and table, ensuring robustness to different
colors of the cloth and background in the real-world setup (Figure 4.4). Our method does not require any
training or fine-tuning in the physical setting.
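The preprocessing described above could look roughly like the following OpenCV/NumPy sketch; the segmentation threshold and the simulator color constants are illustrative assumptions, not the exact values used.

```python
import cv2
import numpy as np

def preprocess_real_image(rgb, out_size=32,
                          sim_cloth_rgb=(200, 50, 50), sim_table_rgb=(120, 90, 60)):
    """Map a real camera image to the simulated observation distribution."""
    # Center crop to a square region around the cloth.
    h, w, _ = rgb.shape
    side = min(h, w)
    y0, x0 = (h - side) // 2, (w - side) // 2
    crop = rgb[y0:y0 + side, x0:x0 + side]

    # Segment the cloth from the background (illustrative HSV threshold).
    hsv = cv2.cvtColor(crop, cv2.COLOR_RGB2HSV)
    mask = cv2.inRange(hsv, (0, 60, 60), (20, 255, 255))

    # Recolor cloth and background with simulator colors.
    out = np.empty_like(crop)
    out[:] = sim_table_rgb
    out[mask > 0] = sim_cloth_rgb

    # Resize to the 32x32 observation used in simulation.
    return cv2.resize(out, (out_size, out_size), interpolation=cv2.INTER_AREA)
```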
Results: We evaluate our method on ten rollouts. Each rollout has a different cloth orientation (ranging
from -19° to +25°). We use a checkpoint from 60,000 environment steps in simulation to initialize the actor
of our agent. In simulation, our policy has a mean accuracy of 91.28% over ten rollouts. On the real robot,
we obtain a mean accuracy of 85.58%.
4.3.4 Ablations
We test our ablations over 3 seeds each, and plot the mean and variance of performance during training.
We run these ablations in the Straighten Rope environment, with state- and image-based observations as
applicable.
Entropy Regularization: Figure 4.5a shows that using entropy regularization enables the agent to explore
the environment further, surpassing its initial performances of learning from the expert data in the replay
buffer. We see high variance in the baseline, indicating less robustness to randomness (e.g., seed, task
variants, etc.) and unstable training performance.
Probabilistic Reference State Initialization (RSI): Figure 4.5b and Figure 4.5c show ablations for using
RSI. With the default configuration of RSI (RSI+IR 100%), the agent shows worse performance than not
using RSI. In other words, simply applying RSI in deformable object manipulation may lead to poor results
due to constantly resetting the agent to the predefined states, which prevents the agent from freely exploring
the environment. However, the agent can benefit from expert demonstrations without limiting exploration
by invoking RSI probabilistically.
Random Crops of Image Observations: Figure 4.5d shows that using random crops as an augmentation
technique improves performance. This confirms that employing random crops stabilizes visual RL training
which would otherwise overfit.
Number of Expert Trajectories: We estimate how much expert data is optimal for our agent (Figure 4.5e).
Using 1000 expert episodes is noticeably worse than 4000 episodes. However, the difference between
4000 and 8000 episodes is small, indicating that the marginal utility of adding additional expert trajectories
reduces with the number of trajectories. To test the sample efficiency of DMfD, we propose to use only
100 episodes of data but duplicate them to fill the replay buffer. As shown, it is possible to achieve similar
performance as by using larger amounts of data, but we have found it to be less robust to environment
variations and training seeds. We conclude it is best to use as many expert demonstrations as possible when
expert demonstrations are easily obtainable. However, when they are not readily available, duplicating
expert data to fill up the replay buffer is a viable way to learn from expert demonstrations.
Critic Inputs: In Figure 4.5f, we examine the effect of different types of inputs to the critics. Having
state information as input to the critics is essential for obtaining higher performance. This is because states contain valuable information vital to the tasks but not readily interpretable in images (e.g., the true coordinates of cloth corners). However, the addition of images to states has the best performance (Figure 4.5f),
likely because the critics are able to see what the actor sees, and can provide a better guiding value estimate.
4.3.5 Discussion
Compared to experts, Table 4.1 shows that our state-based agent beats the expert in both state environments.
The image-based solutions are comparable to the expert at best, as they do not have privileged state information. When comparing with baselines, we see that the performance gap between our method and baselines
increases with respect to the difficulty of the task. In easier tasks, our method’s capabilities are not fully
utilized. We observe this for both state- and image-based environments. For example, in a harder task like
Cloth Fold Image Figure 4.3e, the baseline methods are at or below 0 performance at the end of training.
Because we use expert data in multiple ways, our state-based method outperforms SAC, a baseline
that does not use expert data. The lack of expert data severely affects the performance of SAC on the
difficult state-based task, Cloth Fold. The benefits of using expert data, in all environments, are shown in
Figure 4.5b, Figure 4.5c and Figure 4.5e. Moreover, given a pre-populated replay buffer, we can think of
RSI giving DMfD an extra boost essentially for ‘free’ (since we reuse the same expert data). Conversely,
AWAC achieves better performance on difficult tasks with the help of expert data. However, a lack of entropy
regularization means that it is more prone to reaching a local optimum during training. This can be seen in
the higher variance than DMfD during training, indicating lower robustness to randomness. In fact, in the
Straighten Rope Image experiment, this high variance after 1M steps eventually leads to a deterioration in
performance.
Image-based environments are harder and this is where DMfD outperforms the baselines even further.
In image-based environments, LfD baselines outperform non-LfD baselines in the Straighten Rope, Cloth
Fold, and Cloth Fold Diagonal Unpinned environments. However, non-LfD methods have more consistent
performance than LfD methods. This implies that LfD baselines are not as robust as the non-LfD methods,
and they may require more sophisticated solutions for consistently better performance. In other words,
designing a robust LfD method in these environments is nontrivial. As shown in Figure 4.3 and Table 4.2,
DMfD consistently outperforms all baselines. It is adept at learning these challenging tasks while leveraging
expert demonstrations. The experiments provide strong evidence that DMfD consistently performs as well as or better than the baselines across all environments, while being robust to noise.
4.4 Conclusion
We describe a new reinforcement learning-based method - Deformable Manipulation from Demonstrations
(DMfD) - that leverages expert demonstrations and outperforms state-of-the-art Learning from Demonstration (LfD) methods for representative manipulation tasks on 1D (rope) and 2D (cloths) deformable objects.
For both state-based and image-based inputs, DMfD effectively leverages expert demonstrations as follows:
1. we pre-populate the replay buffer with expert trajectories before training, 2. during training, we improve
on the standard advantage-weighted loss by adding an exploration term (and extending it to image-based
inputs), and 3. during experience collection we improve on reference state initialization by using it probabilistically. For image-based inputs, we use an asymmetric actor-critic architecture, where the actor acts
based solely on environment images while the critics learn from both image and state information. To make
our policy more robust to different variations of the environments, we applied random cropping to sampled images during the actor-critic updates. We demonstrate the effectiveness of DMfD on two challenging
deformable object manipulation tasks from the SoftGym suite. We also create two new challenging environments for folding a 2D cloth using image-based observations, and set a performance benchmark for them.
We show a consistent and noticeable performance improvement over baselines in state-based environments
(up to 12.9% on median) and an even higher improvement on tougher image-based environments (up to
33.44% on median). We also observe comparable or lower variance than the baselines, indicating higher
robustness to noise. To validate the feasibility of DMfD in real-world settings, we conducted real robot
experiments and achieved a minimal sim2real gap (∼6%) in normalized performance.
Chapter 5
Learning from Demonstrations across Morphologies
Some Learning from Demonstrations (LfD) methods handle small mismatches in the action spaces of the
teacher and student. Here we address the case where the teacher’s morphology is substantially different
from that of the student. Our framework, Morphological Adaptation in Imitation Learning (MAIL), bridges
this gap allowing us to train an agent from demonstrations by other agents with significantly different morphologies. MAIL learns from suboptimal demonstrations, so long as they provide some guidance towards a
desired solution. We demonstrate MAIL on manipulation tasks with rigid and deformable objects including
3D cloth manipulation interacting with rigid obstacles. We train a visual control policy for a robot with one
end-effector using demonstrations from a simulated agent with two end-effectors. MAIL shows up to 24%
improvement in a normalized performance metric over LfD and non-LfD baselines. It is deployed on a real Franka Panda robot and handles multiple variations in object properties (size, rotation, translation) and cloth-specific properties (color, thickness, size, material). An overview is available at uscresl.github.io/mail.
5.1 Introduction
Learning from Demonstration (LfD) [101, 81] is a set of supervised learning methods where a teacher (often,
but not always, a human) demonstrates a task, and a student (usually a robot) uses this information to learn to
This chapter is based on Gautam Salhotra, I-Chun Arthur Liu, and Gaurav S. Sukhatme. “Learning Robot Manipulation from
Cross-Morphology Demonstration”. In: Conference on Robot Learning. 2023.
perform the same task. Some LfD methods cope with small morphological mismatches between the teacher
and student [107, 160] (e.g., five-fingered hand to two-fingered gripper). However, they typically fail for a
large mismatch (e.g., bimanual human demonstration to a robot arm with one gripper). The key difference is that, to reproduce the transition from one demonstration state to the next, no single student action suffices; a sequence of actions may be needed.
Supervised methods are appealing where demonstration-free methods [30] do not converge or underperform [34] and purely analytical approaches are computationally infeasible [45, 9]. In such settings, human
demonstrations of complex tasks are often readily available; e.g., it is straightforward for a human to show a
robot how to fold a cloth. An LfD-based imitation learning approach is appealing in such settings provided
we allow the human demonstrator to use their body in the way they find most convenient (e.g., using two
hands to hang a cloth on a clothesline to dry). This requirement induces a potentially large morphology
mismatch - we want to learn and execute complex tasks with deformable objects on a single manipulator
robot using natural human demonstrations.
We propose a framework, Morphological Adaptation in Imitation Learning (MAIL), to bridge this mismatch. We focus on cases where the number of end-effectors is different from teacher to student, although
the method may be extended to other forms of morphological differences. MAIL enables policy learning for
a robot with m end-effectors from teachers with n end-effectors. It does not require demonstrator actions,
only the states of the objects in the environment, making it potentially useful for a variety of end-effectors
(pickers, suction gripper, two-fingered grippers, or even hands). It uses trajectory optimization to convert
state-based demonstrations into (suboptimal) trajectories in the student’s morphology. The optimization
uses a learned (forward) dynamics model to trade accuracy for speed, especially useful for tasks with highdimensional state and observation spaces. The trajectories are then used by an LfD method, optionally with
exploration components like reinforcement learning, which is adapted to work with sub-optimal demonstrations and improve upon them by interacting with the environment.
Figure 5.1: MAIL generalizes LfD to large morphological mismatches between teacher and student in difficult manipulation tasks. We show an example task: hang a cloth to dry on a plank (DRY CLOTH). The demonstrations are bimanual (n = 2 end-effectors), yet the robot learns to execute the task with a single arm and gripper (m = 1). The learned policy transfers to the real world and is robust to object variations.
Though the original demonstrations
contain states, we generalize the solution
to work with image observations in the final policy. We showcase our method on
challenging cloth manipulation tasks (Section 5.3.1) for a robot with one end-effector,
using image observations, shown in Figure 5.1. This setting is challenging for multiple reasons. First, cloth manipulation is
easy for bimanual human demonstrators but challenging for a one-handed agent (even humans find cloth
manipulation non-trivial with one hand). Second, deformable objects exist in a continuous state space;
image observations in this setting are also high-dimensional. Third, the cloth being manipulated involves a large number of contacts (hundreds) that are made/broken per time step. These can significantly slow down
simulation, and consequently learning and optimization. We make the following contributions:
1. We propose a novel framework, MAIL, that bridges the large morphological mismatch in LfD. MAIL
trains a robot with m end-effectors to learn manipulation from demonstrations with a different (n) number
of end-effectors.
2. We demonstrate MAIL on challenging cloth manipulation tasks on a robot with one end-effector. Our
tasks have a high-dimensional (> 15000) state space, with several hundred contacts being made/broken per
step, and are non-trivial to solve with one end-effector. Our learned agent outperforms baselines by up
to 24% on a normalized performance metric and transfers zero-shot to a real robot. We introduce a new
variant of 3D cloth manipulation with obstacles - DRY CLOTH.
3. We illustrate MAIL providing different instances of end-effector transfer, such as a 3-to-2, 3-to-1, and 2-
to-1 end-effector transfer, using a simple rearrangement task with three rigid bodies in simulation and the
Figure 5.2: An example cloth folding task with demonstrations from a teacher with n = 2 end-effectors, deployed on a Franka Panda with m = 1 end-effector (parallel-jaw gripper). We train a network to predict the forward dynamics of the object being manipulated in simulation, using a random-action dataset DRandom. For every state transition, we match the predicted particle displacements from our model, ∆Ppred, to those of the simulator, ∆Psim. Given this learned dynamics model and the teacher demonstrations DTeacher, we use indirect trajectory optimization to find student actions that solve the task; the optimization objective is to match the object states in the demonstration. Finally, we pass the optimized dataset DStudent to a downstream LfD method to get a final policy π that generalizes to task variations and extends task learning to image space, enabling real-world deployment.
real world. We further explain how MAIL can potentially handle more instances of n-to-m end-effector
transfer.
5.2 Formulation and Approach
5.2.1 Preliminaries
We formulate the problem as a partially observable Markov Decision Process (POMDP) with state s ∈ S,
action a ∈ A, observation o ∈ O, transition function T : S × A → S, horizon H, discount factor γ
and reward function r : S × A → R. The discounted return at time t is Rt = Σ_{i=t}^{H} γ^i r(si, ai), with si ∼ T(si−1, ai−1). A task is instantiated with a variant sampled from the task distribution, v ∼ V. The initial environment state depends on the task variant, s0(v), v ∼ V. We train a policy πθ to maximize the expected reward of an episode over task variants, J(πθ) = Ev∼V[R0], subject to the initial state s0(v) and the dynamics T.
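As a concrete illustration of these quantities, the following minimal Python sketch computes the discounted return and a Monte Carlo estimate of J(πθ) over sampled task variants. The `env` and `policy` interfaces (reset, step, a callable policy) are illustrative assumptions for this sketch and are not part of MAIL.

    import numpy as np

    def discounted_return(rewards, gamma):
        # R_0 = sum_{i=0}^{H} gamma^i * r(s_i, a_i)
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    def estimate_expected_reward(policy, env, task_variants, gamma, horizon):
        # Monte Carlo estimate of J(pi_theta) = E_{v ~ V}[R_0]
        returns = []
        for v in task_variants:
            obs = env.reset(variant=v)            # initial state s_0(v)
            rewards = []
            for _ in range(horizon):
                action = policy(obs)              # a_t sampled from pi_theta
                obs, reward, done = env.step(action)
                rewards.append(reward)
                if done:
                    break
            returns.append(discounted_return(rewards, gamma))
        return float(np.mean(returns))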
For an agent with morphology M, we differentiate between datasets available as demonstrations (D^M_Demo) and those that are optimized (D^M_Optim). For our cloth environments, the teacher morphology is two pickers (M = 2p) and the student morphology is one picker (M = 1p). We assume the demonstrations come from teachers whose morphology can differ from the student's (and from each other's). We refer to these as teacher demonstrations, DTeacher, to emphasize that they do not necessarily come from an expert or an oracle; further, they can be suboptimal. The demonstrations are state trajectories τT = (s0, . . . , sH−1). The teacher dataset is made up of KT such trajectories, DTeacher = {τT,i}, i = 1, . . . , KT, generated using a few task variations vd ∼ V from the task distribution.
We now discuss the components of MAIL, shown in Figure 5.2. The user provides teacher demonstrations DTeacher. First, we create a dataset of random actions, DRandom, and use it to train a dynamics
model, Tψ. Tψ reduces computational cost when dealing with contact-rich simulations like cloth manipulation (Section 5.3.1). Next, we convert each teacher demonstration to a trajectory suitable for the student’s
morphology. For our tasks, we find gradient-free indirect trajectory optimization [52] performs the best
(Appendix Section C.1). We use Tψ for this optimization as it provides the appropriate speed-accuracy trade-off. The optimization objective is to match the object states in the demonstration (we cannot match
demonstration actions across morphologies). We combine these optimized trajectories to create a dataset
DStudent for the student. Finally, we pass DStudent to a downstream LfD method to learn a policy π that
generalizes from the task variations in DT eacher to the task distribution V. It also extends π to use image
observations and deploys zero-shot on a real robot (rollouts in Figure 5.5).
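A compact way to see how these components fit together is the following Python-style sketch of the MAIL pipeline. Each stage is passed in as a callable so the sketch stays agnostic to the concrete implementations described in Sections 5.2.2-5.2.4; all function names here are placeholders for illustration, not the actual code.

    def mail_pipeline(teacher_demos, simulator,
                      collect_random_transitions,   # builds DRandom in the student morphology
                      train_dynamics_model,         # fits the CNN-LSTM model T_psi (Section 5.2.2)
                      optimize_student_actions,     # indirect trajectory optimization (Section 5.2.3)
                      rollout_in_simulator,         # replays optimized actions in the true simulator
                      train_lfd_policy):            # downstream LfD method, e.g. DMfD (Section 5.2.4)
        # 1. Random pick-and-place interactions for learning dynamics
        d_random = collect_random_transitions(simulator)
        dynamics_model = train_dynamics_model(d_random)

        # 2. Convert each teacher state trajectory into student-morphology actions
        d_student = []
        for demo in teacher_demos:
            actions = optimize_student_actions(demo, dynamics_model)
            # Re-execute in the true simulator to reduce learned-model error
            d_student.append(rollout_in_simulator(simulator, actions, demo))

        # 3. Train a policy on the optimized dataset; it generalizes to task
        #    variations and to image observations
        return train_lfd_policy(d_student, simulator)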
5.2.2 Learned Spatio-temporal Dynamics Model
MAIL uses trajectory optimization to convert demonstrations into (suboptimal) trajectories in the student’s
morphology. This can be prohibitively slow for large state spaces and complex tasks such as cloth manipulation. Robotic simulators have come a long way in advancing fidelity and speed, but simulating complex
deformable objects and contact-rich manipulation still requires significant computation, making optimization intractable for challenging simulations. We use the NVIDIA FleX simulator, which is based on extended position-based dynamics [87]. We learn a CNN-LSTM based spatio-temporal forward dynamics model with
parameters ψ, Tψ, to approximate the cloth dynamics T. This offers a speed-accuracy trade-off with a tractable
computation time in environments with large state spaces and complex dynamics. The states of objects are
represented as N particle positions: s = P = {pi}, i = 1, . . . , N. Each particle state consists of its x, y, and z coordinates. For each task, we generate a corpus of random pick-and-place actions and store them in the dataset DRandom = {di}, where i = 1, . . . , KR and di = (Pi, ai, P′i). For each datum i, we feed Pi to the CNN network to extract features of particle connectivity. These features are concatenated with ai and input to the LSTM model to extract features based on the previous particle positions. A fully connected layer followed by layer normalization and tanh activation is used to learn the non-linear combinations of features. The outputs are the predicted particle displacements. The objective function is the distance between predicted and ground-truth particle displacements, ∥∆Psim − ∆Ppred∥2. Here ∆Psim = {∆pi}, i = 1, . . . , N, is obtained from the simulator, where ∆pi = p′i − pi is the displacement of particle i between the current and next states.
Due to its simplicity, the CNN-LSTM dynamics model provides fast inference, compared to a simulator
which may have to perform many collision checks at any time step. This speedup is crucial when optimizing
over a large state space, as long as the errors in particle positions are tolerable. In our experiments, we were
able to get 162 fps with Tψ, compared to 3.4 fps with the FleX simulator, a 50x speed up (Figure C.7).
However, this stage is optional if the environment is low-dimensional, or if the simulation speed-up from
inference is not significant. Simulation accuracy is important when training a final policy, to provide accurate pick-place locations for execution on a real robot. Hence, the learned dynamics model is not used for
training in the downstream LfD method.
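A minimal PyTorch sketch of such a CNN-LSTM dynamics model and its displacement-matching objective is shown below. The layer sizes, kernel widths, and pooling choice are illustrative assumptions; only the overall structure (CNN over particles, LSTM over time, fully connected head with layer normalization and tanh, squared-error loss on displacements) follows the description above.

    import torch
    import torch.nn as nn

    class ParticleDynamicsModel(nn.Module):
        """Predicts per-particle displacements from particle positions and actions."""

        def __init__(self, num_particles, action_dim, hidden_dim=256):
            super().__init__()
            # 1D CNN over the particle axis to extract connectivity features
            self.cnn = nn.Sequential(
                nn.Conv1d(3, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),   # pool over particles -> one feature vector
            )
            # LSTM over the time axis, input = [particle features ; action]
            self.lstm = nn.LSTM(64 + action_dim, hidden_dim, batch_first=True)
            # Fully connected head with layer normalization and tanh activation
            self.head = nn.Sequential(
                nn.Linear(hidden_dim, num_particles * 3),
                nn.LayerNorm(num_particles * 3),
                nn.Tanh(),
            )
            self.num_particles = num_particles

        def forward(self, particles, actions):
            # particles: (B, T, N, 3); actions: (B, T, action_dim)
            B, T, N, _ = particles.shape
            x = particles.reshape(B * T, N, 3).permute(0, 2, 1)       # (B*T, 3, N)
            feats = self.cnn(x).squeeze(-1).reshape(B, T, -1)         # (B, T, 64)
            h, _ = self.lstm(torch.cat([feats, actions], dim=-1))
            return self.head(h).reshape(B, T, N, 3)                   # predicted displacements

    def dynamics_loss(model, particles, actions, next_particles):
        # Squared-error surrogate for || dP_sim - dP_pred ||_2
        delta_pred = model(particles, actions)
        delta_sim = next_particles - particles
        return ((delta_sim - delta_pred) ** 2).sum(dim=-1).mean()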
5.2.3 Indirect Trajectory Optimization with Learned Dynamics
We use indirect trajectory optimization [52] to find the open-loop action trajectory to match the teacher
state trajectory, τT. This optimizes for the student's actions while propagating the state with a simulator.
We use the learned dynamics Tψ to give us fast, approximate optimized trajectories. This is in contrast
to direct trajectory optimization (or collocation) that optimizes both states and actions at every time step.
Direct trajectory optimization requires dynamics constraints to ensure consistency among states being optimized, which can be challenging for discontinuous dynamics. We use the Cross-Entropy Method (CEM)
for optimization, and compare this against other methods, such as SAC (Appendix C.1). Optimization hyperparameters are described in Appendix D.8. The optimization objective is to match the object's goal state sgoal in
the demonstration with the same task variant vd. Formally, the problem is defined as:
min_{at}  ∥sgoal − sH∥2   subject to   s0 = s0(vd) and st+1 = T(st, at), ∀ t = 0, . . . , H − 1        (5.1)
where sH is the predicted final state. Note that if τT has a longer time horizon, it would help to match
intermediate states and use multiple-shooting methods. After optimizing the action trajectories for each
demonstration τT,i ∈ DTeacher, we use them with the simulator to obtain the optimized trajectories in the student's morphology. These are combined to create the student dataset, DStudent = {τ1, τ2, τ3, . . . }, where τi = (st, ot, at, st+1, ot+1, rt, d), ∀ t = 1, . . . , H − 1. For generalizability and real-world capabilities, we
train an LfD method using DStudent. Note that we use the learned dynamics model at this stage, trading some accuracy in the learned model for faster simulation. This is also partially responsible for why
DStudent contains suboptimal demonstrations. To reduce the effect of learned model errors, once we obtain
the optimized actions, we perform a rollout with the true simulator to get the demonstration data.
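The following sketch shows the gradient-free optimization step in isolation: a basic Cross-Entropy Method over an open-loop action sequence, scored by how closely the predicted final state (obtained by rolling out the learned dynamics model) matches the demonstrated goal state. The population size, elite count, and iteration count here are illustrative; the values we actually used are listed in Appendix D.8.

    import numpy as np

    def cem_optimize_actions(rollout_fn, goal_state, horizon, action_dim,
                             iterations=20, population=64, num_elites=8, init_std=0.5):
        """rollout_fn(actions) -> predicted final particle state s_H, using T_psi from s_0(v_d)."""
        dim = horizon * action_dim
        mean = np.zeros(dim)
        std = np.full(dim, init_std)
        for _ in range(iterations):
            # Sample candidate open-loop action sequences
            samples = np.random.randn(population, dim) * std + mean
            # Score each candidate by its distance to the demonstrated goal state
            costs = np.array([
                np.linalg.norm(goal_state - rollout_fn(s.reshape(horizon, action_dim)))
                for s in samples
            ])
            elites = samples[np.argsort(costs)[:num_elites]]
            # Refit the sampling distribution to the elite candidates
            mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
        return mean.reshape(horizon, action_dim)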
5.2.4 Learning from the Optimized Dataset
Our chosen LfD method is DMfD [113], an off-policy RLfD actor-critic method that utilizes expert demonstrations as well as rollouts from its own exploration. It learns using an advantage-weighted formulation [92]
balanced with an exploration component [30]. As mentioned above, we use the simulator instead of the
learned dynamics model Tψ at this stage, because accuracy is important in the final reactive policy. Hence,
we cannot take the speed-accuracy tradeoff that Tψ provides. However, one may choose to use other LfD
methods that do not need to interact with the environment [12], in which case neither a simulator nor learned
dynamics are needed.
As part of tuning, we employ 100 demonstrations, about two orders of magnitude fewer than the 8000
recommended by the original method in Chapter 4. To prevent the policy from overfitting to suboptimal
demonstrations in DStudent, we disable demonstration-state matching, i.e., resetting the agent to demonstration states and applying imitation reward (see Appendix C.5). These were originally proposed [99] as
reference state initialization (RSI). These modifications are essential for our LfD implementation, where the
demonstrations do not come from an expert.
From DMfD, the policy π is parameterized by parameters θ and learns from data collected in a replay buffer B. The policy loss contains an advantage-weighted loss LA, in which actions are weighted by the advantage function Aπ(s, a) = Qπ(s, a) − Vπ(s), with temperature parameter λ. It also contains an entropy component LE to promote exploration during data collection. The final policy loss Lπ is a combination of these terms (Equation 5.2):

LA = E_{s,a,o∼B} [ log πθ(a|o) exp( (1/λ) Aπ(s, a) ) ]
LE = E_{s,a,o∼B} [ α log πθ(a|o) − Q(s, a) ]
Lπ = (1 − wE) LA + wE LE,   0 ≤ wE ≤ 1        (5.2)

where wE is a tunable hyper-parameter. The resulting policy is denoted πθ. We pre-populate the buffer B with DStudent. Using LfD, we extend from state inputs to image observations, and generalize from vd to any variation sampled from V.
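The sketch below illustrates how the two terms of Equation 5.2 can be combined in code. The `policy.log_prob` / `policy.sample` interface and the sign conventions (the advantage-weighted term is negated so that both terms are minimized) are our assumptions for illustration; the exact implementation follows DMfD (Chapter 4).

    import torch

    def dmfd_policy_loss(policy, q_fn, v_fn, batch, lam=1.0, alpha=0.2, w_e=0.1):
        # Batch of (s, o, a) tuples sampled from the replay buffer B
        s, o, a = batch["state"], batch["obs"], batch["action"]

        # Advantage-weighted term L_A: weight logged actions by exp(A(s, a) / lambda)
        advantage = q_fn(s, a) - v_fn(s)
        weights = torch.exp(advantage / lam).detach()
        loss_a = -(weights * policy.log_prob(o, a)).mean()

        # Entropy term L_E: SAC-style actor objective on freshly sampled actions
        a_pi, log_pi = policy.sample(o)
        loss_e = (alpha * log_pi - q_fn(s, a_pi)).mean()

        # L_pi = (1 - w_E) L_A + w_E L_E
        return (1.0 - w_e) * loss_a + w_e * loss_e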
5.3 Experiments
Our experiments are designed to answer the following: (1) How does MAIL compare to state-of-the-art (SOTA) methods? (Section 5.3.2) (2) How well can MAIL solve tasks in the real world? (Section 5.3.2.1) (3) Does MAIL generalize to different n-to-m end-effector transfers? (Section 5.3.3) Additional experiments demonstrating how different MAIL components affect performance are in Appendix C.
5.3.1 Tasks
We experiment with cloth manipulation tasks that are easy for humans to demonstrate but difficult to perform
on a robot. We also discuss a simpler rearrangement task with rigid bodies to illustrate generalizability. The
tasks are shown in Appendix Figure C.5. We choose a 6-dimensional pick-and-place action space, with
xyz positions for pick and place. The end-effectors are pickers in simulation, and a two-finger parallel jaw
gripper on the real robot.
CLOTH FOLD: Fold a square cloth in half, along a specified line. DRY CLOTH: Pick up a square cloth
from the ground and hang it on a plank to dry, a variant of [89]. THREE BOXES: A simple environment with three boxes along a line that need to be rearranged to designated goal locations. For details on metrics and
task variants, see Appendix Section D.1.
Figure 5.3: SOTA performance comparisons on CLOTH FOLD and DRY CLOTH (normalized performance of GNS, SAC-CURL, SAC-DrQ, GPIL, GAIfO, SAC-DrQ-IR, and MAIL). For each training run, we used the best model in each seed's training run, and evaluated using 100 rollouts across 5 seeds, different from the training seed. Bar height denotes the mean; error bars indicate the standard deviation. MAIL outperforms all baselines, in some cases by as much as 24%.
We use particle positions as the state for training dynamics models and trajectory optimization.
We use a 32x32 RGB image as the visual observation, where applicable. We record pre-programmed
demonstrations for the teacher dataset for each task.
Details of the datasets to train the LfD method
and the dynamics model are in the Appendix (Section D.4 and Section D.5). The instantaneous reward, used in learning the policy, is the task performance metric at a given state. Further details on
architecture and training are in the supplementary material. In all experiments, we compare each method’s
normalized performance, measured at the end of the task and given by p̂(t) = (p(st) − p(s0)) / (popt − p(s0)), where p is the performance metric of state st at time t, and popt is the best performance achievable for the task. We use p̂(H) at the end of the episode (t = H).
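For reference, the normalized performance metric can be computed as follows; the example numbers here are purely illustrative.

    def normalized_performance(p_t, p_0, p_opt):
        # p_hat(t) = (p(s_t) - p(s_0)) / (p_opt - p(s_0))
        return (p_t - p_0) / (p_opt - p_0)

    # Example: an episode that starts with performance 0.2, ends at 0.7, and has a
    # best achievable performance of 0.9 gives p_hat(H) = 0.5 / 0.7 ~= 0.714.
    print(normalized_performance(0.7, 0.2, 0.9))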
5.3.2 SOTA Comparisons
Many LfD baselines (Section 2.3) are not directly applicable, as they do not handle large differences in
action space due to different morphologies. We compare MAIL with those LfD baselines that produce a
policy with image observations, given demonstrations without actions.
1. SAC-CURL [62]: An image-based RL algorithm that uses contrastive learning and SAC [30] as the
underlying RL algorithm. It does not require demonstrations.
2. SAC-DrQ [161]: An image-based RL algorithm that uses a regularized Q-function, data augmentations,
and SAC as the underlying RL algorithm. It does not require demonstrations.
3. GNS [115]: A SOTA method that represents cloth as a graph and predicts dynamics using a graph
neural network (GNN). It does not require demonstrations but learns dynamics from the random-actions dataset. We run this learned model with a planner [78], given full state information.
4. SAC-DrQ-IR: A custom variant of SAC [30] that uses DrQ-based [161] image encoding and a state-only
imitation reward (IR) to reach the desired state of the object to be manipulated. It does not imitate actions,
as they are unavailable.
5. GAIfO [141]: An adversarial imitation learning algorithm that trains a discriminator on state-state pairs
(s, s′
) from both the demonstrator and agent. This is a popular extension of GAIL [35] that learns the
same from state-action pairs (s, a).
6. GPIL [71]: A goal-directed LfD method that uses demonstrations and agent interactions to learn a goal
proximity function. This function provides a dense reward to train a policy.
Figure 5.3 has performance comparisons against all baselines. In each environment, the first three
columns are demonstration-free baselines, and the last four are LfD methods. MAIL outperforms all baselines, in some cases by as much as 24%. For the easier CLOTH FOLD task, the SAC-DrQ baseline came
within 11% of MAIL.
Figure 5.4: Sample trajectories of the THREE BOXES task. (a) Teacher demonstration with three pickers. (b) Final policy: two pickers. (c) Final policy: one picker. (d) Final policy: one Franka Panda robot. A three-picker teacher trajectory reaches the goal state (a); final policies of the two-picker and one-picker agents, and real-world execution of the one-picker agent.
However, none of the baselines performs well in the more difficult DRY CLOTH task. RL methods fail because, without guidance from demonstrations, they have not explored the parameter space enough. Our custom LfD baseline, SAC-DrQ-IR, has reasonable performance, but the results show that naive imitation alone is not a good form of guidance to solve it. The other LfD baselines, GAIfO and GPIL, have poor performance in both environments. The primary reason is the effect of cross-morphological demonstrations: they perform significantly better with student-morphology demonstrations, even if those are suboptimal. Moreover, environment difficulty also plays an important part in the final performance. These and other ablations are in Appendix C.
Surprisingly, the GNS baseline with structured dynamics does not perform well, even though it has
been used for cloth modeling [39]. This is because it is designed to learn particle dynamics via small
displacements, but our pick-and-place action space enables large displacements. Similar to [78], we break
down each pick-and-place action into 100 delta actions to work with the small displacements that GNS is
trained on. Thus, planning will accumulate errors from the 100 GNS steps for every action of the planner,
which can grow superlinearly due to compounding errors. This makes it difficult to solve the task. It is
especially seen in DRY CLOTH, where the displacements required to move the entire cloth over the plank
are much higher than the displacements needed for CLOTH FOLD. The rollouts of MAIL on DRY CLOTH
show the agent following the demonstrated guidance: it learned to hang the cloth over the plank. It also displayed an emergent behavior of straightening the cloth on the plank to spread it out and achieve higher
performance. This was not seen in the two-picker teacher demonstrations.
5.3.2.1 Real-world results
For DRY CLOTH and CLOTH FOLD tasks, we deploy the learned policies on a Franka Panda robot with a
single parallel-jaw gripper (Figure 5.5, statistics over 10 rollouts). We test the policies with many different
variations of square cloth (size, rotation, translation, thickness, color, and material). See Section D.3 for performance metrics. The policies achieve ∼ 80% performance, close to the average simulation performance,
for both tasks.
5.3.3 Generalizability
We show examples of how MAIL learns from a demonstrator with a different number of end-effectors,
in a simple THREE BOXES task (Figure 5.4). Consider a three-picker agent that solves the task in one
pick-place action. Given teacher demonstrations DTeacher, we transfer them into one-picker or two-picker
demonstrations using indirect trajectory optimization and the learned dynamics model. These are the optimized datasets that are fed to a downstream LfD method. In both cases, the LfD method learns a model,
specific to that morphology, to solve the task. It generalizes from state inputs in the demonstrations to the
image inputs received from the environment. Figure 5.4 shows the three-picker demonstration and the 3-to-2 and 3-to-1 end-effector transfers. We have also done this for the 2-to-1 case (omitted here for brevity). These examples illustrate n-to-m end-effector transfer with n > m; it is trivial to perform the transfer for n-to-m with n ≤ m by simply padding the teacher's action space with m − n arms that perform no operations, as sketched below.
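A minimal sketch of this padding for the n ≤ m case is given below, assuming the 6-dimensional pick-and-place block per end-effector used in our tasks; the particular no-op encoding (a zero block here) is an illustrative assumption.

    import numpy as np

    def pad_with_noop_arms(teacher_action, n_teacher, m_student, noop_block=None):
        """Lift an n-end-effector pick-and-place action to m >= n end-effectors."""
        assert m_student >= n_teacher
        if noop_block is None:
            noop_block = np.zeros(6)   # placeholder no-op pick/place block
        blocks = [teacher_action[6 * k: 6 * (k + 1)] for k in range(n_teacher)]
        blocks += [noop_block] * (m_student - n_teacher)
        return np.concatenate(blocks)

    # Example: lift a two-picker action (12-D) to a three-picker action space (18-D).
    two_picker_action = np.arange(12, dtype=float)
    three_picker_action = pad_with_noop_arms(two_picker_action, n_teacher=2, m_student=3)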
5.3.4 Limitations
MAIL requires object states in demonstrations and during simulation training; however, full state information is not needed at deployment time. It has been tested only with the pick-and-place action space, and only on cases where the number of end-effectors differs between teacher and student. While it works for high-frequency actions (Appendix C.7), it will likely be difficult to optimize actions to create the student dataset for high-dimensional actions. This is because the curse of dimensionality will apply for larger action spaces when optimizing for DStudent. The state-visitation distribution of demonstration trajectories must
overlap with that of the student agent; this overlap must contain the equilibrium states of the demonstration.
For example, a one-gripper agent cannot reach a demonstration state where two objects are moving simultaneously, but it can reach a state where both objects are stable at their goal locations (equilibrium). MAIL
cannot work when the student robot is unable to reach the goal or intermediate states in the demonstration.
For example, in trying to open a flimsy bag with two handles, both end-effectors may simultaneously be
needed to keep the bag open. When we discuss generalizability for the case n ≤ m, our chosen method to
tackle morphological mismatch is to use fewer arms on the student robot, in lieu of trajectory optimization.
This is an inefficient approach since we ignore some arms of the student robot. MAIL builds a separate
policy for each student robot morphology and each task. While it is possible to train a multi-task policy
conditioned on a given task (provided as an embedding or a natural language instruction), extending MAIL
to output policies for a variable number of end-effectors would require more careful consideration. Subsequent work could learn a single policy conditioned on the desired morphology - another way to think about
a base model for generalized LfD.
5.4 Conclusion
We presented MAIL, a framework that enables LfD across morphologies. Our framework enables learning
from demonstrations where the number of end-effectors is different from teacher to student. This enables
Figure 5.5: Real-world results for CLOTH FOLD (performance 0.818) and DRY CLOTH (spread metric 8/10).
teachers to record demonstrations in the setting of their own morphology, and vastly expands the set of
demonstrations to learn from. We show an improvement of up to 24% over SOTA baselines and discuss
other baselines that are unable to handle a large mismatch between teacher and student. Our experiments
are on challenging household cloth manipulation tasks performed by a robot with one end-effector based
on bimanual demonstrations. We showed that our policy can be deployed zero-shot on a real Franka Panda
robot, and generalizes across cloths of varying size, color, material, thickness, and robustness to cloth rotation and translation. We further showed examples of LfD generalizability with instances of transfer from
n-to-m end-effectors, with multiple rigid objects. We believe that this is an important step towards allowing LfD to train a robot to learn from any robot demonstrations, regardless of robot morphology, expert
knowledge, or the medium of demonstration.
Chapter 6
LfD for Visual Servoing with Precision
This chapter presents a novel multi-task visual servoing policy for precise last-inch peg-in-hole insertion.
Our method builds upon the RT-1 architecture, incorporating modifications supported by ablation studies
and justifications. We demonstrate the effectiveness of our approach on various real-world last-inch tasks
and exhibit its ability to generalize to unseen peg-hole combinations. We lay the foundation for developing
a generalized manipulation model capable of handling both coarse and fine motion, including position and orientation control. Such a model would be a significant step towards a comprehensive manipulation system capable of addressing diverse manipulation challenges.
6.1 Introduction
Robotic assembly is a challenging task that requires precise and accurate manipulation, as well as the ability
to adapt to diverse objects, poses, and environments. While robotic assembly is ubiquitous in many industries, traditional assembly robots are expensive and require meticulous engineering to achieve precision.
This can lead to solutions implemented in hardware, such as fixtures and adapters, rather than in intelligence. The result is often a system that is highly sensitive to perturbations of the robotic workcell, or a strictly controlled environment with no scope for working alongside a human.
One promising approach to addressing these challenges is to use reinforcement learning (RL). RL methods have been shown to be effective at learning complex tasks, including robotic manipulation. However,
RL can be finicky in the real world, especially for contact-rich tasks such as assembly. This is because collecting data on a real robot can be extremely expensive, and unpredictable moves created by initial policies
can endanger the robot and its equipment.
Another approach is to use simulation to train RL agents. However, simulation has had a limited impact on robotic assembly due to the large gap between simulation and reality. Some works avoid higher-dimensional inputs (such as camera images) by training state-based policies in simulation for contact-rich
manipulation [134]. Simulation of contact-rich tasks involves complex meshes (especially for geometrically
complex objects), computing mechanics for possibly thousands of contact points across rigid bodies at every timestep, handling inter-penetration of meshes, etc. Hence, this gap is particularly large for precise and
contact-rich manipulation tasks.
One way to avoid the sim2real gap is to train on the real robot, using visual servoing. Visual servoing
is a technique that uses visual feedback to control a robot. This allows the robot to adapt to unexpected
changes in the environment without the need for expensive fixtures or precise state estimation.
However, training a separate visual servoing model for each task can be inefficient. To address this, we
propose using a high-capacity multi-task architecture based on transformers. Specifically, we use the RT-1
architecture [12] with language-conditioned tasks, which has been shown to be effective for a variety of
robotic tasks.
We extend RT-1 to work with precise manipulation in the following ways:
1. We reduce the history sequence length.
2. We change the EfficientNet architecture to work with two images: a dynamic image that is zoomed in
to the robot end-effector, and a static image of the entire workspace.
3. We increase the resolution of both images.
We evaluate our model on peg insertion tasks on the NIST board 1, achieving a precision of 2mm. This
demonstrates the potential of our approach for precise and robust robotic assembly.
6.2 Method
6.2.1 Preliminaries
Robot learning: We desire a robot policy to solve a language-conditioned task from vision, by framing it
as a sequential decision-making environment. In each episode, at timestep t = 0, the policy π is presented
with a language instruction i and an initial set of image observations o0 which consists of workspace camera
image ows,0 and a wrist camera image owr,0. At any time t, the policy produces an action distribution
from which an action is sampled, at ∼ π(· | i, ot), and applied to the robot. This process continues, with the policy iteratively producing actions at by sampling from a learned distribution π(· | i, {oj}_{j=0}^{t}) and applying those actions to the robot. The episode ends when a termination condition is achieved. A binary
reward r ∈ {0, 1} is provided to the agent at the end of the episode, where a reward of 1 indicates success
and 0 indicates failure at the task described by i. The goal is to learn policy π that maximizes the average
reward, in expectation over a distribution of instructions, starting states o0, and transition dynamics.
Transformers. Our method uses a Transformer [143] to parameterize the policy π. A Transformer is a sequence model that maps an input sequence {ξh}_{h=0}^{H} to an output sequence {yk}_{k=0}^{K} using combinations of self-attention layers and fully-connected neural networks. Transformers were originally designed for text sequences, where each input ξj and output yk represents a text token; they have since been extended to images [97] as well as other modalities [66]. As detailed in the next section, we parameterize π by first mapping the inputs i, {oj}_{j=0}^{t} to a sequence {ξh}_{h=0}^{H} and the action outputs at to a sequence {yk}_{k=0}^{K}, before using a Transformer to learn the mapping {ξh}_{h=0}^{H} → {yk}_{k=0}^{K}.
Imitation learning. Imitation learning methods train the policy π on a dataset D of demonstrations [104,
164, 44]. Specifically, we assume access to a dataset D = {(i^{(n)}, {(x_t^{(n)}, a_t^{(n)})}_{t=0}^{T^{(n)}})}_{n=0}^{N} of episodes, all of which are successful (i.e., have a final reward of 1). We learn π using behavioral cloning [104], which optimizes π by minimizing the negative log-likelihood of the actions at given the images and language instructions.
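In code, the behavioral cloning objective is simply the negative log-likelihood of the demonstrated actions; the `policy.log_prob` interface below is an assumed placeholder for the transformer policy, not its actual API.

    import torch

    def behavioral_cloning_loss(policy, batch):
        # log pi(a_t | i, o_0..t) for each demonstration step in the batch
        log_probs = policy.log_prob(batch["instruction"], batch["images"], batch["actions"])
        return -log_probs.mean()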
6.2.2 System Overview
The method in this chapter builds upon the existing robotic transformer method RT-1 [12], to work with
precise manipulation tasks such as those on the NIST board 1 [53]. Specifically, we focus on enhancing
the model’s output resolution, allowing for finer control over robot actions. In contrast, recent extensions to
RT-1 have focused on the input or dataset side, assuming that the task dexterity is the same.
For example, RT-Trajectory [27] adds to the model input an image of a desired trajectory to follow, and RT-Sketch [132] adds a sketch image that serves as a goal. However, these images are static throughout the episode, similar to the task's language instruction. Similarly, RT-X [15] augments the original dataset with demonstration data across various embodiments.
Our proposed modifications to RT-1 achieve finer dexterous control through the following enhancements:
1. Wrist-Camera Input: We augment the observation space with an additional image from a wrist camera that is zoomed in on the end-effector and the task. This non-stationary image provides more detailed, task-specific information than a static goal image.
2. Higher Image Resolution: We input larger images of resolution 512x640, in contrast to the default 256x320 for RT-1, to capture the finer details needed for precise manipulation.
3. Reduced Sequence Length: Since this task relies primarily on the current observation, we reduce the sequence length compared to RT-1. However, we maintain a minimal history sequence to account for potential occlusions in the current observation.
4. Fine-grained Action Mapping: We create a fine-grained mapping from tokens to the real world, with each bin having a width of 0.3 mm, in contrast to 7.84 mm for RT-1. This enables more precise control over robot movements (see the sketch below).
5. Action Repeat: We introduce an action repeat hyperparameter, similar to the idea for fine actions in Reinforcement Learning [122].
Table 6.1 compares the input and output specifications of our method to those of other robotic transformer methods.

Method             Input                                             Output
RT-1, RT-2, RT-X   Language instruction, robot image                 Action tokens
RT-Trajectory      Language instruction, robot image, goal image     Action tokens
RT-Sketch          Language instruction, robot image, sketch image   Action tokens
Our method         Language instruction, robot image, wrist image    Fine-grained action tokens
Table 6.1: Summary of the key differences between our approach and other robotic transformer methods.
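A sketch of the fine-grained token mapping is shown below, assuming 256 bins per action dimension (as in RT-1); the resulting per-step ranges are illustrative consequences of the bin width, not measured values.

    import numpy as np

    def discretize_displacement(delta_mm, bin_width_mm=0.3, num_bins=256):
        """Map a continuous end-effector displacement (mm) to an action token index."""
        half_range = num_bins * bin_width_mm / 2.0     # ~38.4 mm with 0.3 mm bins
        clipped = np.clip(delta_mm, -half_range, half_range - 1e-9)
        return int((clipped + half_range) // bin_width_mm)

    def detokenize_displacement(token, bin_width_mm=0.3, num_bins=256):
        """Return the center of the selected bin, in mm."""
        half_range = num_bins * bin_width_mm / 2.0
        return (token + 0.5) * bin_width_mm - half_range

    # With 0.3 mm bins, a one-token error corresponds to 0.3 mm of end-effector
    # motion, versus 7.84 mm at RT-1's default discretization.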
6.3 Experiments
6.3.1 Tasks
We evaluated our method on peg insertion tasks on the NIST board, with a precision of 2mm. Specifically,
we used two tasks for training:
1. Insert a 12mm diameter peg into a 16mm diameter hole. The corresponding language instruction was
"Insert medium peg in large hole".
Figure 6.1: Sample camera images used during training and inference. (a) Stationary workspace camera. (b) Wrist camera, which moves with the robot end-effector.
2. Insert an 8mm diameter peg into a 12mm diameter hole. The corresponding language instruction was
"Insert small peg in medium hole".
For each task, we provided the robot with two camera images as input: a stationary workspace camera
image and a wrist camera image that moves along with the end-effector. Figure 6.1 shows samples of these
camera images. We also provided the robot with a language instruction describing the task. The pegs are
initially already grasped by the robot end-effector, with a fixed pose of the peg with respect to the robot
end-effector. The holes are on the NIST board, which is attached rigidly to a table next to the robot base.
The peg insertion tasks we used are challenging for robotic manipulation due to the tight tolerances and
the need for precise control of the robot’s end-effector. These tasks are representative of the types of tasks
that our method is designed to solve.
6.3.2 Demonstration Collection
To ensure robustness to various lighting conditions, we collected demonstrations at all times of day and
night. Additionally, we applied photometric variations to the data during training to further enhance the
model’s resilience to different environmental lighting scenarios.
We collected peg insertion demonstrations with varying starting locations for a given hole position in
the robot’s frame, shown in Figure 6.2. The starting locations were varied within a 12cm x 12cm x 8cm
cuboid around the hole, providing sufficient variability for last-inch insertion.
Note that the model does not utilize proprioceptive information (joint angles) as it is designed to learn
from robot-agnostic data. Consequently, the agent does not automatically learn to simply match joint angles
or absolute positions of the end-effector. This allows us to keep the hole fixed in the robot’s frame of
reference during data collection.
We compared various methods for collecting demonstrations of the task. One option is to use keyboard
teleoperation to achieve fine-grained control over the end-effector’s position. However, keyboard control
proved to be unintuitive and resulted in slow operation.
Another option is robot teleoperation using a 3D Connexion Spacemouse, which offered intuitiveness
and speed. However, it did not provide the desired level of precision for the insertion task, often leading to
overshooting the goal position and subsequent corrections by the teleoperator.
Finally, we opted for a fully scripted demonstration approach, which provided the ideal combination of
speed, reliability, robustness, and accuracy. Each demonstration took approximately 10 seconds to complete,
and this setup enabled us to collect thousands of demos overnight.
6.3.3 Results
Single task. We achieved a 92% success rate when training a model for task 1. We observed that using only free-space demonstrations did not provide the desired success rate. To address this, we added demonstrations
that involved recovery from failure, simulating scenarios where the insertion is incorrect and the peg hits the
board.
By incorporating these failure recovery demonstrations, we also achieved a 93% success rate with a
model trained for task 2.
Figure 6.2: Workspace image, which includes the workspace camera, wrist camera, space mouse, and the NIST board.
                 Train    Test
Seq. length 2    92%      60%
Seq. length 6    60%      30%
Table 6.2: Ablation on the length of the sequence of previous observations sent to the transformer.
Multitask. Training a multitask model resulted in a 93.33% success rate. We then evaluated the model on an unseen task, namely inserting an 8mm diameter peg into a 16mm diameter hole, a combination of the seen tasks 1 and 2. The model achieved an 80% success rate on this unseen task.
6.3.4 Ablations
We conducted ablation studies to validate our design choices.
Sequence Length. We evaluated different values for the sequence length. As shown in Table 6.2, a sequence
length of 2 outperforms a sequence length of 6. This is because our task is highly state-based, and we
can obtain state information from both camera observations. When using a single camera, vision might be
occluded, necessitating a history of observations to infer what lies behind the occluded region. With two
cameras, this requirement is reduced, allowing us to approximate a state-only representation and reduce
the sequence length to 2. Moreover, a smaller sequence length results in a lower-dimensional observation
vector, requiring less training data. This proved beneficial in our case.
Image Resolution. We tested different values for the resolution of the image observations. A higher image resolution improved performance. This is not entirely unexpected, since a higher-resolution image captures finer detail of the scene, but it is particularly crucial for precision manipulation tasks. At the new resolution, each pixel on the static camera corresponds to 1mm in the real world, so the 16mm hole for task 1 is approximately 16 pixels wide in the 512x640 image. This change provides the observational detail needed for precise manipulation.
Number of Cameras
                        Train    Test
Single camera (RT-1)    64%      20%
Two cameras             92%      60%
Table 6.3: Ablation on the number of camera images that we feed to our model.
We tested the performance with varying numbers of cameras. As shown in Table 6.3, using two camera
images outperforms the single camera setting that RT-1 uses. This aligns with expectations for last-inch
tasks.
6.4 Conclusion
This chapter presented a visual servoing policy for precise last-inch peg insertion. Our method builds
upon the RT-1 architecture, incorporating modifications for enhanced performance in high-precision tasks.
Through ablation studies, we validated the effectiveness of our design choices with justification for each
model parameter modification. We demonstrated the capability of our approach on multiple last-inch tasks
and exhibited a degree of generalization to unseen scenarios.
Future work will focus on expanding the model’s capabilities beyond last-inch insertion to encompass
entire rollouts starting from a distant position. This will necessitate handling both small and large end-effector displacements, encompassing both coarse and fine motion within a unified framework. Additionally,
incorporating rectangular pegs and other NIST board configurations demands the inference of end-effector
orientation, further broadening the scope of tasks the model can handle.
By accommodating both coarse and fine motion, we aim towards a generalized manipulation model
capable of learning both existing coarse tasks mastered by robotic transformer models (e.g., "pick up the
soda can") and the newly introduced precise manipulation tasks. This represents a significant step toward a
comprehensive model capable of handling diverse manipulation challenges.
Chapter 7
Conclusions
Dexterous manipulation of complex objects, such as cloth, poses significant challenges due to difficulties
in perception and exploration. Traditional reinforcement learning (RL) approaches are often inefficient
and impractical in this setting, as they require extensive exploration of the state space, which can be time-consuming and potentially hazardous. While demonstrations from humans can alleviate the need for extensive exploration, collecting high-quality demonstrations can be labor-intensive and expensive. Therefore,
achieving efficient and successful robot manipulation necessitates a delicate balance between perception,
exploration, and imitation.
This dissertation addresses these challenges by presenting advancements in dexterous low-level manipulation tasks using guiding demonstrations. We have developed techniques that promote efficient learning by
reducing interactions with the environment during exploration and by minimizing the overhead associated
with collecting demonstrations. Our contributions include:
1. A reinforcement learning algorithm that integrates a motion planner into the learning process, enabling
efficient exploration of long horizons.
2. A LfD algorithm for visual manipulation of complex deformable objects using demonstrations.
3. A framework for cross-morphological LfD, to enable visual manipulation of objects using demonstrations from a set of agents with diverse embodiments.
4. A method for creating a multi-task visual servoing policy that achieves precise last-inch peg-in-hole
insertion using a multi-task attention-based architecture.
Through these contributions, we hope to advance the field of robot manipulation research and equip
robots with the tools necessary to achieve manipulation skills comparable to those of humans. By enabling
robots to manipulate objects with dexterity and precision, we pave the way for their widespread adoption in
various domains, revolutionizing automation and human-robot collaboration.
7.1 Future Directions
Building upon the current contributions, there are numerous promising directions to further enhance low-level robot manipulation skills. These directions complement advances in high-level planning and improved sensing and actuation, which, when combined with dexterous low-level manipulation, can
significantly boost a robot’s overall agility and dexterity.
1. Harnessing Demonstration Observations: The cross-morphology LfD framework utilized state-based demonstrations from a simulated agent in SoftGym, with the robot abstracted away. To further expand this framework, we can focus on improving the quality of recorded demonstrations
and streamlining the recording process. For instance, we can incorporate video demonstrations from
the real world as a learning source. This would involve either employing vision techniques to perform
state estimation and obtain the state-based demonstration or extending the current framework to learn
directly from observations. This approach was partially explored for image-based observations in
Chapter 6.
2. Adapting to Student Robot Limitations: The current MAIL framework is ineffective when the
student robot is unable to replicate the goal or intermediate states demonstrated by the teacher robot.
For example, opening a flimsy plastic bag with two handles requires simultaneous manipulation of
both handles. Using only one hand would result in the bag collapsing due to the inability to maintain a
firm grasp on both handles. To address this challenge, we can analyze the state-visitation distributions
of the teacher and student robots to identify and mitigate such limitations.
3. Universal Deployment to Different Morphology or Embodiment: Our cross-morphology LfD
framework represents a step towards enabling robots to learn from any robotic agent. However, it
currently requires a separate policy for each robot morphology. Future work could extend our method
to develop a single policy conditioned on the desired morphology, bringing us closer to achieving a
truly universal foundation model. Our contributions to the RT-X project and the Open X-Embodiment
Dataset [15] represent initial steps in this direction.
4. Scalable Actuation across Displacement Scales: We aim to develop a single model capable of
executing both coarse and fine motions. This would enable training a single model on a range of
tasks, from grasping large objects like a coke can [12] to performing precise manipulations like USB
insertion. This would require updating the current supervised learning paradigm for training attention-based models. Alternatively, one could train with supervised learning and then fine-tune with reinforcement learning. However, fine-tuning on specific tasks could affect the success rate of completing
other tasks.
Bibliography
[1] Frank Allgöwer and Alex Zheng. Nonlinear model predictive control. Vol. 26. Birkhäuser, 2012.
[2] Nancy M Amato and Yan Wu. “A randomized roadmap method for path and manipulation
planning”. In: IEEE International Conference on Robotics and Automation. 1996.
[3] Jacob Andreas, Dan Klein, and Sergey Levine. “Modular multitask reinforcement learning with
policy sketches”. In: International Conference on Machine Learning. 2017, pp. 166–175.
[4] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew,
Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. “Learning
dexterous in-hand manipulation”. In: arXiv preprint arXiv:1808.00177 (2018).
[5] Daniel Angelov, Yordan Hristov, Michael Burke, and Subramanian Ramamoorthy. “Composing
Diverse Policies for Temporally Extended Tasks”. In: IEEE Robotics and Automation Letters 5.2
(2020), pp. 2658–2665.
[6] Aleksandra Anna Apolinarska, Matteo Pacher, Hui Li, Nicholas Cote, Rafael Pastrana,
Fabio Gramazio, and Matthias Kohler. “Robotic assembly of timber joints using reinforcement
learning”. In: Automation in Construction (2021).
[7] Pierre-Luc Bacon, Jean Harb, and Doina Precup. “The option-critic architecture”. In: Association
for the Advancement of Artificial Intelligence. 2017.
[8] Cristian C. Beltran-Hernandez, Damien Petit, Ixchel G. Ramirez-Alpizar, and Kensuke Harada.
“Variable compliance control for robotic peg-in-hole assembly: A deep-reinforcement-learning
approach”. In: Applied Sciences (2020).
[9] James M Bern, Pol Banzet, Roi Poranne, and Stelian Coros. “Trajectory Optimization for
Cable-Driven Soft Robot Locomotion.” In: Robotics: Science and Systems. Vol. 1. 2019.
[10] P. T. de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. “A Tutorial on the
Cross-Entropy Method”. In: Annals of Operations Research 134 (2005), pp. 19–67.
[11] Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza,
Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. “RoboCat: A Self-Improving
Foundation Agent for Robotic Manipulation”. In: arXiv preprint arXiv:2306.11706 (2023).
[12] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn,
Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. “Rt-1: Robotics
transformer for real-world control at scale”. In: arXiv preprint arXiv:2212.06817 (2022).
[13] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff,
and Dieter Fox. “Closing the sim-to-real loop: Adapting simulation randomization with real world
experience”. In: IEEE International Conference on Robotics and Automation. 2019,
pp. 8973–8979.
[14] Hao-Tien Lewis Chiang, Aleksandra Faust, Marek Fiser, and Anthony Francis. “Learning
navigation behaviors end-to-end with autorl”. In: IEEE Robotics and Automation Letters 4.2
(2019), pp. 2007–2014.
[15] Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic Learning Datasets and
RT-X Models. https://arxiv.org/abs/2310.08864. 2023.
[16] Todor Davchev, Kevin Sebastian Luck, Michael Burke, Franziska Meier, Stefan Schaal, and
Subramanian Ramamoorthy. “Residual learning from demonstration: Adapting DMPs for
contact-rich manipulation”. In: IEEE Robotics and Automation Letters 7.2 (2022), pp. 4488–4495.
[17] Samuel Hunt Drake. “Using compliance in lieu of sensory feedback for automatic assembly”.
PhD thesis. Massachusetts Institute of Technology, 1978.
[18] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever,
Pieter Abbeel, and Wojciech Zaremba. “One-Shot Imitation Learning”. In: Proceedings of the 31st
International Conference on Neural Information Processing Systems. NIPS’17. Long Beach,
California, USA: Curran Associates Inc., 2017, pp. 1087–1098. ISBN: 9781510860964.
[19] Mohamed Elbanhawi and Milan Simic. “Sampling-based robot motion planning: A review”. In:
IEEE Access (2014).
[20] Peter Englert and Marc Toussaint. “Learning Manipulation Skills from a Single Demonstration”.
In: International Journal of Robotics Research 37.1 (2018), pp. 137–154.
[21] Linxi Fan, Yuke Zhu, Jiren Zhu, Zihua Liu, Orien Zeng, Anchit Gupta, Joan Creus-Costa,
Silvio Savarese, and Li Fei-Fei. “SURREAL: Open-Source Reinforcement Learning Framework
and Robot Manipulation Benchmark”. In: Conference on Robot Learning. 2018, pp. 767–782.
[22] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. “One-Shot Visual
Imitation Learning via Meta-Learning”. In: Proceedings of the 1st Annual Conference on Robot
Learning. Ed. by Sergey Levine, Vincent Vanhoucke, and Ken Goldberg. Vol. 78. Proceedings of
Machine Learning Research. PMLR, 13–15 Nov 2017, pp. 357–368. URL:
https://proceedings.mlr.press/v78/finn17a.html.
[23] Letian Fu, Huang Huang, Lars Berscheid, Hui Li, Ken Goldberg, and Sachin Chitta. “Safely
learning visuo-tactile feedback policies in real for industrial insertion”. In: arXiv preprint
arXiv:2210.01340 (2022).
[24] Florian Fuchs, Yunlong Song, Elia Kaufmann, Davide Scaramuzza, and Peter Dürr. “Super-human
performance in gran turismo sport using deep reinforcement learning”. In: IEEE Robotics and
Automation Letters 6.3 (2021), pp. 4257–4264.
[25] Scott Fujimoto, Herke van Hoof, and David Meger. “Addressing Function Approximation Error in
Actor-Critic Methods”. In: International Conference on Machine Learning. 2018.
[26] Roland Geraerts and Mark H Overmars. “Creating high-quality paths for motion planning”. In: The
International Journal of Robotics Research 26.8 (2007), pp. 845–863.
[27] Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao,
Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu,
Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, and Ted Xiao. RT-Trajectory: Robotic Task
Generalization via Hindsight Trajectory Sketches. 2023. arXiv: 2311.01977 [cs.RO].
[28] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. “Deep reinforcement learning for
robotic manipulation with asynchronous off-policy updates”. In: IEEE International Conference on
Robotics and Automation. 2017.
[29] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft Actor-Critic: Off-Policy
Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”. In: International
Conference on Machine Learning. 2018.
[30] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor”. In: International
conference on machine learning. PMLR. 2018, pp. 1861–1870.
[31] Firas Al-Hafez, Davide Tateo, Oleg Arenz, Guoping Zhao, and Jan Peters. “LS-IQ: Implicit Reward
Regularization for Inverse Reinforcement Learning”. In: Eleventh International Conference on
Learning Representations (ICLR). 2023. URL: https://openreview.net/pdf?id=o3Q4m8jg4BR.
[32] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
James Davidson. “Learning latent dynamics for planning from pixels”. In: International conference
on machine learning. PMLR. 2019.
[33] Marius Hebecker, Jens Lambrecht, and Markus Schmitz. “Towards real-world force-sensitive
robotic assembly through deep reinforcement learning in simulations”. In: IEEE/ASME
International Conference on Advanced Intelligent Mechatronics (AIM). 2021.
[34] Julius Hietala, David Blanco–Mulero, Gokhan Alcan, and Ville Kyrki. “Learning Visual Feedback
Control for Dynamic Cloth Folding”. In: 2022 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS). 2022, pp. 1455–1462. DOI: 10.1109/IROS47612.2022.9981376.
[35] Jonathan Ho and Stefano Ermon. “Generative Adversarial Imitation Learning”. In: Advances in
Neural Information Processing Systems. Ed. by D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and
R. Garnett. Vol. 29. Curran Associates, Inc., 2016. URL:
https://proceedings.neurips.cc/paper/2016/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf.
[36] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt,
and David Silver. “Distributed Prioritized Experience Replay”. In: 6th International Conference on
Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Conference Track Proceedings. OpenReview.net, 2018. URL:
https://openreview.net/forum?id=H1Dy---0Z.
[37] Zhimin Hou, Jiajun Fei, Yuelin Deng, and Jing Xu. “Data-efficient hierarchical reinforcement
learning for robotic assembly control applications”. In: IEEE Transactions on Industrial
Electronics (2021).
[38] Shouren Huang, Kenichi Murakami, Yuji Yamakawa, Taku Senoo, and Masatoshi Ishikawa. “Fast
peg-and-hole alignment using visual compliance”. In: IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS). 2013.
[39] Zixuan Huang, Xingyu Lin, and David Held. “Mesh-based Dynamics with Occlusion Reasoning
for Cloth Manipulation”. In: Proceedings of Robotics: Science and Systems. New York City, NY,
USA, June 2022. DOI: 10.15607/RSS.2022.XVIII.011.
[40] Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. “Learning rhythmic movements by
demonstration using nonlinear oscillators”. In: IEEE/RSJ International Conference on Intelligent
Robots and Systems. Vol. 1. 2002, pp. 958–963. DOI: 10.1109/IRDS.2002.1041514.
[41] Tadanobu Inoue, Giovanni De Magistris, Asim Munawar, Tsuyoshi Yokoya, and Ryuki Tachibana.
“Deep reinforcement learning for high precision assembly tasks”. In: IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS). 2017.
[42] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu,
David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. “Perceiver IO:
A General Architecture for Structured Inputs & Outputs”. In: International Conference on
Learning Representations. 2021.
[43] Eric Jang, Shixiang Gu, and Ben Poole. “Categorical Reparameterization with Gumbel-Softmax”.
In: International Conference on Learning Representations. 2017.
[44] Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch,
Sergey Levine, and Chelsea Finn. “Bc-z: Zero-shot task generalization with robotic imitation
learning”. In: Conference on Robot Learning. PMLR. 2022, pp. 991–1002.
[45] Shiyu Jin, Diego Romeres, Arvind Ragunathan, Devesh K Jha, and Masayoshi Tomizuka.
“Trajectory optimization for manipulation of deformable objects: Assembly of belt drive units”. In:
2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2021,
pp. 10002–10008.
[46] Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll,
Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. “Residual reinforcement learning for
robot control”. In: International Conference on Robotics and Automation (ICRA). 2019.
[47] Ravi P Joshi, Nishanth Koganti, and Tomohiro Shibata. “Robotic cloth manipulation for clothing
assistance task using dynamic movement primitives”. In: Proceedings of the Advances in Robotics.
2017, pp. 1–6.
[48] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang,
Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. “Scalable Deep
Reinforcement Learning for Vision-Based Robotic Manipulation”. In: Conference on Robot
Learning. 2018, pp. 651–673.
[49] Sertac Karaman and Emilio Frazzoli. “Sampling-based algorithms for optimal motion planning”.
In: International Journal of Robotics Research 30.7 (2011), pp. 846–894.
[50] Sertac Karaman and Emilio Frazzoli. “Sampling-based algorithms for optimal motion planning”.
In: International Journal of Robotics Research 30.7 (2011), pp. 846–894.
[51] Lydia Kavraki and Jean-Claude Latombe. “Randomized preprocessing of configuration for fast
path planning”. In: IEEE International Conference on Robotics and Automation. 1994.
[52] Matthew Kelly. “An introduction to trajectory optimization: How to do your own direct
collocation”. In: SIAM Review 59.4 (2017), pp. 849–904.
[53] Kenneth Kimble, Karl Van Wyk, Joe Falco, Elena Messina, Yu Sun, Mizuho Shibata,
Wataru Uemura, and Yasuyoshi Yokokohji. “Benchmarking protocols for evaluating small parts
robotic assembly systems”. In: IEEE Robotics and Automation Letters (2020).
[54] Shishir Kolathaya, William Guffey, Ryan W. Sinnet, and Aaron D. Ames. “Direct collocation for
dynamic behaviors with nonprehensile contacts: Application to flipping burgers”. In: IEEE
Robotics and Automation Letters (2018).
[55] Shir Kozlovsky, Elad Newman, and Miriam Zacksenhouse. “Reinforcement learning of impedance
policies for peg-in-hole tasks: Role of asymmetric matrices”. In: IEEE Robotics and Automation
Letters (2022).
[56] James J Kuffner and Steven M LaValle. “RRT-connect: An efficient approach to single-query path
planning”. In: IEEE International Conference on Robotics and Automation. Vol. 2. IEEE. 2000,
pp. 995–1001.
[57] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. “Conservative q-learning for
offline reinforcement learning”. In: Advances in Neural Information Processing Systems 33 (2020),
pp. 1179–1191.
[58] Andrey Kurenkov, Ajay Mandlekar, Roberto Martin-Martin, Silvio Savarese, and Animesh Garg.
“AC-Teach: A Bayesian Actor-Critic Method for Policy Learning with an Ensemble of Suboptimal
Teachers”. In: Conference on Robot Learning. PMLR. 2020, pp. 717–734.
[59] Sascha Lange, Thomas Gabel, and Martin Riedmiller. “Batch reinforcement learning”. In:
Reinforcement learning. Springer, 2012, pp. 45–73.
[60] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. “Dart: Noise injection
for robust imitation learning”. In: Conference on robot learning. PMLR. 2017, pp. 143–156.
[61] Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dragan, and Ken Goldberg. “DART: Noise
Injection for Robust Imitation Learning”. In: 1st Annual Conference on Robot Learning, CoRL
2017, Mountain View, California, USA, November 13-15, 2017, Proceedings. Vol. 78. Proceedings
of Machine Learning Research. PMLR, 2017, pp. 143–156. URL:
http://proceedings.mlr.press/v78/laskey17a.html.
[62] Michael Laskin, Aravind Srinivas, and Pieter Abbeel. “CURL: Contrastive Unsupervised
Representations for Reinforcement Learning”. In: Proceedings of the 37th International
Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of
Machine Learning Research. PMLR, 13–18 Jul 2020, pp. 5639–5650. URL:
https://proceedings.mlr.press/v119/laskin20a.html.
[63] Steven M LaValle, James J Kuffner, BR Donald, et al. “Algorithmic and computational robotics:
new directions”. In: AK Peters, 2001. Chap. Rapidly-exploring random trees: Progress and
prospects, pp. 293–308.
[64] Steven M. LaValle. Rapidly-exploring random trees: A new tool for path planning. Tech. rep. TR
98-11. Computer Science Department, Iowa State University, 1998.
[65] Steven M. Lavalle. Rapidly-Exploring Random Trees: A New Tool for Path Planning. Tech. rep.
Iowa State University, 1998.
[66] Kuang-Huei Lee, Ofir Nachum, Mengjiao Sherry Yang, Lisa Lee, Daniel Freeman,
Sergio Guadarrama, Ian Fischer, Winnie Xu, Eric Jang, Henryk Michalewski, et al. “Multi-game
decision transformers”. In: Advances in Neural Information Processing Systems 35 (2022),
pp. 27921–27936.
[67] Michelle A Lee, Carlos Florensa, Jonathan Tremblay, Nathan Ratliff, Animesh Garg, Fabio Ramos,
and Dieter Fox. “Guided uncertainty-aware policy optimization: Combining learning and
model-based strategies for sample-efficient policy learning”. In: IEEE International Conference on
Robotics and Automation (ICRA). 2020.
[68] Michelle A. Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese,
Li Fei-Fei, Animesh Garg, and Jeannette Bohg. “Making sense of vision and touch: Learning
multimodal representations for contact-rich tasks”. In: IEEE Transactions on Robotics (2020).
[69] Youngwoon Lee, Edward S Hu, Zhengyu Yang, Alex Yin, and Joseph J Lim. “IKEA Furniture
Assembly Environment for Long-Horizon Complex Manipulation Tasks”. In: arXiv preprint
arXiv:1911.07246 (2019).
[70] Youngwoon Lee, Shao-Hua Sun, Sriram Somasundaram, Edward Hu, and Joseph J. Lim.
“Composing Complex Skills by Learning Transition Policies”. In: International Conference on
Learning Representations. 2019.
[71] Youngwoon Lee, Andrew Szot, Shao-Hua Sun, and Joseph J. Lim. “Generalizable Imitation
Learning from Observation via Inferring Goal Proximity”. In: Advances in Neural Information
Processing Systems. 2021.
[72] Youngwoon Lee, Jingyun Yang, and Joseph J. Lim. “Learning to Coordinate Manipulation Skills
via Skill Behavior Diversification”. In: International Conference on Learning Representations.
2020.
[73] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. “End-to-End Training of Deep
Visuomotor Policies”. In: Journal of Machine Learning Research (2016).
[74] Sergey Levine and Vladlen Koltun. “Guided policy search”. In: International Conference on
Machine Learning. 2013.
[75] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. “Offline Reinforcement Learning:
Tutorial, Review, and Perspectives on Open Problems”. In: arXiv e-prints (2020), arXiv–2005.
[76] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. “Learning Hand-Eye
Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection”. In:
International Symposium on Experimental Robotics. 2016.
[77] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. “Continuous control with deep reinforcement learning”. In:
International Conference on Learning Representations (2016).
[78] Xingyu Lin, Yufei Wang, Zixuan Huang, and David Held. “Learning visible connectivity dynamics
for cloth smoothing”. In: Conference on Robot Learning. PMLR. 2022, pp. 256–266.
[79] Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. “SoftGym: Benchmarking Deep
Reinforcement Learning for Deformable Object Manipulation”. In: Conference on Robot Learning
(CoRL) (2020).
[80] Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. “Bbq-networks:
Efficient exploration in deep reinforcement learning for task-oriented dialogue systems”. In:
Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.
[81] I-Chun Arthur Liu, Shagun Uppal, Gaurav S. Sukhatme, Joseph J Lim, Peter Englert, and
Youngwoon Lee. “Distilling Motion Planner Augmented Policies into Visual Control Policies for
Robot Manipulation”. In: Proceedings of the 5th Conference on Robot Learning. Ed. by
Aleksandra Faust, David Hsu, and Gerhard Neumann. Vol. 164. Proceedings of Machine Learning
Research. PMLR, Aug. 2022, pp. 641–650. URL:
https://proceedings.mlr.press/v164/liu22b.html.
[82] Tomas Lozano-Pérez, Matthew T Mason, and Russell H Taylor. “Automatic synthesis of
fine-motion strategies for robots”. In: The International Journal of Robotics Research (1984).
[83] Jianlan Luo, Oleg Sushkov, Rugile Pevceviciute, Wenzhao Lian, Chang Su, Mel Vecerik, Ning Ye,
Stefan Schaal, and Jon Scholz. “Robust multi-modal policies for industrial assembly via
reinforcement learning and demonstrations: A large-scale study”. In: arXiv preprint
arXiv:2103.11512 (2021).
[84] Jieliang Luo and Hui Li. “A learning approach to robot-agnostic force-guided high precision
assembly”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2021.
[85] Jieliang Luo and Hui Li. “Dynamic experience replay”. In: Conference on Robot Learning (CoRL).
2019.
[86] Zhi-Quan Luo, Jong-Shi Pang, and Daniel Ralph. Mathematical programs with equilibrium
constraints. Cambridge University Press, 1996.
[87] Miles Macklin, Matthias Müller, and Nuttapong Chentanez. “XPBD: position-based simulation of
compliant constrained dynamics”. In: Proceedings of the 9th International Conference on
Motion in Games (2016).
[88] Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu,
Animesh Garg, Silvio Savarese, and Li Fei-Fei. “Scaling robot supervision to hundreds of hours
with roboturk: Robotic manipulation dataset through human reasoning and dexterity”. In: 2019
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2019,
pp. 1048–1055.
[89] Jan Matas, Stephen James, and Andrew J Davison. “Sim-to-real reinforcement learning for
deformable object manipulation”. In: Conference on Robot Learning. PMLR. 2018, pp. 734–743.
[90] Michael Danielczuk, Andrey Kurenkov, Ashwin Balakrishna, Matthew Matl, David Wang,
Roberto Martin-Martin, Animesh Garg, Silvio Savarese, and Ken Goldberg. “Mechanical Search:
Multi-Step Retrieval of a Target Object Occluded by Clutter”. In: International Conference on
Robotics and Automation (ICRA). 2019.
[91] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. “Data-efficient hierarchical
reinforcement learning”. In: Neural Information Processing Systems. 2018, pp. 3303–3313.
[92] Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. “Awac: Accelerating online
reinforcement learning with offline datasets”. In: arXiv preprint arXiv:2006.09359 (2020).
[93] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel.
“Overcoming exploration in reinforcement learning with demonstrations”. In: 2018 IEEE
international conference on robotics and automation (ICRA). IEEE. 2018, pp. 6292–6299.
[94] Yashraj Narang, Kier Storey, Iretiayo Akinola, Miles Macklin, Philipp Reist, Lukasz Wawrzyniak,
Yunrong Guo, Adam Moravanszky, Gavriel State, Michelle Lu, et al. “Factory: Fast contact for
robotic assembly”. In: Robotics: Science and Systems. 2022.
[95] Hai Nguyen and Hung La. “Review of deep reinforcement learning for robot manipulation”. In:
2019 Third IEEE International Conference on Robotic Computing (IRC). IEEE. 2019, pp. 590–595.
[96] M.H. Overmars. A random approach to motion planning. Tech. rep. RUU-CS-92-32. Department
of Computer Science, Utrecht University, 1992.
[97] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and
Dustin Tran. “Image transformer”. In: International conference on machine learning. PMLR. 2018,
pp. 4055–4064.
[98] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu,
Evan Shelhamer, Jitendra Malik, Alexei A. Efros, and Trevor Darrell. “Zero-Shot Visual
Imitation”. In: ICLR. 2018.
[99] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. “Deepmimic:
Example-guided deep reinforcement learning of physics-based character skills”. In: ACM
Transactions on Graphics (TOG) (2018).
[100] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. “Advantage-Weighted Regression:
Simple and Scalable Off-Policy Reinforcement Learning”. In: CoRR abs/1910.00177 (2019).
arXiv: 1910.00177.
[101] Karl Pertsch, Youngwoon Lee, Yue Wu, and Joseph J. Lim. “Demonstration-Guided Reinforcement
Learning with Learned Skills”. In: 5th Conference on Robot Learning (2021).
[102] Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel.
“Asymmetric Actor Critic for Image-Based Robot Learning”. In: Proceedings of Robotics: Science
and Systems. Pittsburgh, Pennsylvania, June 2018. DOI: 10.15607/RSS.2018.XIV.008.
[103] Athanasios S Polydoros and Lazaros Nalpantidis. “Survey of model-based reinforcement learning:
Applications on robotics”. In: Journal of Intelligent & Robotic Systems 86.2 (2017), pp. 153–173.
[104] Dean A Pomerleau. “Alvinn: An autonomous land vehicle in a neural network”. In: Advances in
neural information processing systems 1 (1988).
[105] Michael Posa, Cecilia Cantu, and Russ Tedrake. “A direct method for trajectory optimization of
rigid bodies through contact”. In: The International Journal of Robotics Research 33.1 (2014),
pp. 69–81.
[106] James A. Preiss, David Millard, Tao Yao, and Gaurav S. Sukhatme. “Tracking Fast Trajectories
with a Deformable Object using a Learned Model”. In: IEEE International Conference on Robotics
and Automation (ICRA). 2022.
[107] Ilija Radosavovic, Xiaolong Wang, Lerrel Pinto, and Jitendra Malik. “State-only imitation learning
for dexterous manipulation”. In: 2021 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). IEEE. 2021, pp. 7865–7871.
[108] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman,
Emanuel Todorov, and Sergey Levine. “Learning Complex Dexterous Manipulation with Deep
Reinforcement Learning and Demonstrations”. In: Robotics: Science and Systems. 2018.
[109] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman,
Emanuel Todorov, and Sergey Levine. “Learning Complex Dexterous Manipulation with Deep
Reinforcement Learning and Demonstrations”. In: Proceedings of Robotics: Science and Systems
(RSS). 2018.
[110] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman,
Emanuel Todorov, and Sergey Levine. “Learning complex dexterous manipulation with deep
reinforcement learning and demonstrations”. In: Robotics: Science and Systems (2017).
[111] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. “Bridging offline
reinforcement learning and imitation learning: A tale of pessimism”. In: Advances in Neural
Information Processing Systems (2021).
[112] Stephane Ross and Drew Bagnell. “Efficient Reductions for Imitation Learning”. In: Proceedings
of the Thirteenth International Conference on Artificial Intelligence and Statistics. Ed. by
Yee Whye Teh and Mike Titterington. Vol. 9. Proceedings of Machine Learning Research. PMLR,
2010.
[113] Gautam Salhotra, I-Chun Arthur Liu, Marcus Dominguez-Kuhne, and Gaurav S. Sukhatme.
“Learning Deformable Object Manipulation From Expert Demonstrations”. In: IEEE Robotics and
Automation Letters 7.4 (2022), pp. 8775–8782. DOI: 10.1109/LRA.2022.3187843.
[114] Gautam Salhotra, I-Chun Arthur Liu, and Gaurav S. Sukhatme. “Learning Robot Manipulation
from Cross-Morphology Demonstration”. In: Conference on Robot Learning. 2023.
[115] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and
Peter Battaglia. “Learning to simulate complex physics with graph networks”. In: International
conference on machine learning. PMLR. 2020, pp. 8459–8468.
[116] Gerrit Schoettler, Ashvin Nair, Juan Aparicio Ojea, Sergey Levine, and Eugen Solowjow.
“Meta-reinforcement learning for robotic industrial insertion tasks”. In: IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS). 2020.
[117] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal
policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017).
[118] Daniel Seita, Pete Florence, Jonathan Tompson, Erwin Coumans, Vikas Sindhwani, Ken Goldberg,
and Andy Zeng. “Learning to rearrange deformable cables, fabrics, and bags with goal-conditioned
transporter networks”. In: 2021 IEEE International Conference on Robotics and Automation
(ICRA). IEEE. 2021, pp. 4568–4575.
[119] Daniel Seita, Nawid Jamali, Michael Laskey, Ajay Kumar Tanwani, Ron Berenstein,
Prakash Baskaran, Soshi Iba, John Canny, and Ken Goldberg. “Deep transfer learning of pick
points on fabric for robot bed-making”. In: The International Symposium of Robotics Research
(2019), pp. 275–290.
[120] Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. “Behavior
Transformers: Cloning k modes with one stone”. In: Advances in neural information processing
systems 35 (2022), pp. 22955–22968.
[121] Lin Shao, Toki Migimatsu, and Jeannette Bohg. “Learning to scaffold the development of robotic
manipulation skills”. In: IEEE International Conference on Robotics and Automation (ICRA).
2020.
[122] Sahil Sharma, Aravind Srinivas, and Balaraman Ravindran. “Learning to repeat: Fine grained
action repetition for deep reinforcement learning”. In: arXiv preprint arXiv:1702.06054 (2017).
[123] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. “Perceiver-Actor: A Multi-Task Transformer for
Robotic Manipulation”. In: Proceedings of the 6th Conference on Robot Learning (CoRL). 2022.
[124] Satinder P Singh, Andrew G Barto, Roderic Grupen, and Christopher Connolly. “Robust
reinforcement learning in motion planning”. In: Advances in Neural Information Processing
Systems. 1994, pp. 655–662.
[125] Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, and Sergey Levine. “AVID: Learning
Multi-Stage Tasks via Pixel-Level Translation of Human Videos”. In: Proceedings of Robotics:
Science and Systems. Corvalis, Oregon, USA, July 2020. DOI: 10.15607/RSS.2020.XVI.024.
[126] Dongwon Son, Hyunsoo Yang, and Dongjun Lee. “Sim-to-real transfer of bolting tasks with tight
tolerance”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2020.
[127] Oren Spector and Dotan Di Castro. “InsertionNet: A scalable solution for insertion”. In: IEEE
Robotics and Automation Letters (2021).
[128] Oren Spector, Vladimir Tchuiev, and Dotan Di Castro. “Insertionnet 2.0: Minimal contact
multi-step insertion using multimodal multiview sensory input”. In: 2022 International Conference
on Robotics and Automation (ICRA). IEEE. 2022, pp. 6330–6336.
[129] Oren Spector and Miriam Zacksenhouse. “Deep reinforcement learning for contact-rich skills using
compliant movement primitives”. In: arXiv:2008.13223 [cs] (2020).
[130] Ioan A. Şucan, Mark Moll, and Lydia E. Kavraki. “The Open Motion Planning Library”. In: IEEE
Robotics & Automation Magazine 19.4 (2012), pp. 72–82.
[131] Yin-Tung Albert Sun, Hsin-Chang Lin, Po-Yen Wu, and Jung-Tang Huang. “Learning by Watching
via Keypoint Extraction and Imitation Learning”. In: Machines 10.11 (2022). ISSN: 2075-1702.
DOI: 10.3390/sun22KeypointExtraction.
[132] Priya Sundaresan, Quan Vuong, Jiayuan Gu, Peng Xu, Ted Xiao, Sean Kirmani, Tianhe Yu,
Michael Stark, Ajinkya Jain, Karol Hausman, Dorsa Sadigh, Jeannette Bohg, and Stefan Schaal.
RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches. 2024. arXiv:
2403.02709 [cs.RO].
[133] Richard S Sutton, Doina Precup, and Satinder Singh. “Between MDPs and semi-MDPs: A
framework for temporal abstraction in reinforcement learning”. In: Artificial intelligence 112.1-2
(1999), pp. 181–211.
[134] Bingjie Tang, Michael A. Lin, Iretiayo Akinola, Ankur Handa, Gaurav S. Sukhatme, Fabio Ramos,
Dieter Fox, and Yashraj Narang. “IndustReal: Transferring Contact-Rich Assembly Tasks from
Simulation to Reality”. In: Proceedings of Robotics: Science and Systems (RSS). 2023.
[135] Te Tang, Hsien-Chung Lin, Yu Zhao, Wenjie Chen, and Masayoshi Tomizuka. “Autonomous
alignment of peg and hole by force/torque measurement for robotic assembly”. In: IEEE
International Conference on Automation Science and Engineering (CASE). 2016.
[136] Garrett Thomas, Melissa Chien, Aviv Tamar, Juan Aparicio Ojea, and Pieter Abbeel. “Learning
robotic assembly from CAD”. In: IEEE International Conference on Robotics and Automation
(ICRA). 2018.
[137] Thomas George Thuruthel, Egidio Falotico, Federico Renda, and Cecilia Laschi. “Learning
dynamic models for open loop predictive control of soft robotic manipulators”. In: Bioinspiration
& Biomimetics 12.6 (Oct. 2017), p. 066003. DOI: 10.1088/1748-3190/aa839f.
[138] Emanuel Todorov, Tom Erez, and Yuval Tassa. “Mujoco: A physics engine for model-based
control”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2012,
pp. 5026–5033.
[139] Faraz Torabi, Garrett Warnell, and Peter Stone. “Behavioral Cloning from Observation”. In:
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence,
IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, July 2018,
pp. 4950–4957. DOI: 10.24963/ijcai.2018/687.
[140] Faraz Torabi, Garrett Warnell, and Peter Stone. “Behavioral cloning from observation”. In: arXiv
preprint arXiv:1805.01954 (2018).
[141] Faraz Torabi, Garrett Warnell, and Peter Stone. “Generative adversarial imitation from
observation”. In: arXiv preprint arXiv:1807.06158 (2018).
[142] Faraz Torabi, Garrett Warnell, and Peter Stone. “Recent Advances in Imitation Learning from
Observation”. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, July
2019, pp. 6325–6331. DOI: 10.24963/ijcai.2019/882.
[143] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural
information processing systems 30 (2017).
[144] Mel Vecerik, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon Scholz. “A
Practical Approach to Insertion with Variable Socket Position Using Deep Reinforcement
Learning”. In: 2019 International Conference on Robotics and Automation (ICRA). 2019,
pp. 754–760. DOI: 10.1109/ICRA.2019.8794074.
[145] Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot,
Nicolas Manfred Otto Heess, Thomas Rothörl, Thomas Lampe, and Martin A. Riedmiller.
“Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse
Rewards”. In: ArXiv abs/1707.08817 (2017).
[146] Arun Venkatraman, Roberto Capobianco, Lerrel Pinto, Martial Hebert, Daniele Nardi, and
J Andrew Bagnell. “Improved learning of dynamics models for control”. In: 2016 International
Symposium on Experimental Robotics. Springer. 2017, pp. 703–713.
[147] Rok Vuga, Bojan Nemec, and Ales Ude. “Enhanced Policy Adaptation Through Directed
Explorative Learning”. In: International Journal of Humanoid Robotics 12.3 (2015).
[148] Nghia Vuong, Hung Pham, and Quang-Cuong Pham. “Learning sequences of manipulation
primitives for robotic assembly”. In: IEEE International Conference on Robotics and Automation
(ICRA). 2021.
[149] Bowen Wen, Wenzhao Lian, Kostas Bekris, and Stefan Schaal. “You only demonstrate once:
Category-level manipulation from single visual demonstration”. In: arXiv preprint
arXiv:2201.12716 (2022).
[150] Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou.
“Aggressive driving with model predictive path integral control”. In: 2016 IEEE International
Conference on Robotics and Automation (ICRA). IEEE. 2016, pp. 1433–1440.
[151] Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and
Evangelos A Theodorou. “Information theoretic MPC for model-based reinforcement learning”. In:
2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2017,
pp. 1714–1721.
[152] Ronald J Williams. “Simple statistical gradient-following algorithms for connectionist
reinforcement learning”. In: Machine learning 8.3 (1992), pp. 229–256.
[153] Peter Wriggers. Nonlinear finite element methods. Springer Science & Business Media, 2008.
[154] Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and Pieter Abbeel.
“DayDreamer: World Models for Physical Robot Learning”. In: Conference on Robot Learning
(2022).
[155] Yilin Wu, Wilson Yan, Thanard Kurutach, Lerrel Pinto, and Pieter Abbeel. “Learning to
Manipulate Deformable Objects without Demonstrations”. In: Proceedings of Robotics: Science
and Systems. Corvalis, Oregon, USA, July 2020. DOI: 10.15607/RSS.2020.XVI.065.
[156] Zheng Wu, Wenzhao Lian, Vaibhav Unhelkar, Masayoshi Tomizuka, and Stefan Schaal. “Learning
dense rewards for contact-rich manipulation tasks”. In: IEEE International Conference on Robotics
and Automation (ICRA). 2021.
[157] Fei Xia, Chengshu Li, Roberto Martín-Martín, Or Litany, Alexander Toshev, and Silvio Savarese.
“ReLMoGen: Leveraging Motion Generation in Reinforcement Learning for Mobile
Manipulation”. In: arXiv preprint arXiv:2008.07792 (2020).
[158] Jun Yamada, Youngwoon Lee, Gautam Salhotra, Karl Pertsch, Max Pflueger, Gaurav S. Sukhatme,
Joseph J. Lim, and Peter Englert. “Motion Planner Augmented Reinforcement Learning for Robot
Manipulation in Obstructed Environments”. In: Conference on Robot Learning. 2020.
[159] Jingyun Yang, Junwu Zhang, Connor Settle, Akshara Rai, Rika Antonova, and Jeannette Bohg.
“Learning Periodic Tasks from Human Demonstrations”. In: IEEE International Conference on
Robotics and Automation (ICRA) (2022).
[160] Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. “Robot Learning Manipulation
Action Plans by “Watching” Unconstrained Videos from the World Wide Web”. In: Proceedings of
the AAAI Conference on Artificial Intelligence 29.1 (Mar. 2015). DOI: 10.1609/aaai.v29i1.9671.
[161] Denis Yarats, Ilya Kostrikov, and Rob Fergus. “Image Augmentation Is All You Need:
Regularizing Deep Reinforcement Learning from Pixels”. In: 9th International Conference on
Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net,
2021. URL: https://openreview.net/forum?id=GY6-6sTvGaf.
[162] Kevin Zakka, Andy Zeng, Pete Florence, Jonathan Tompson, Jeanette Bohg, and
Debidatta Dwibedi. “XIRL: Cross-embodiment Inverse Reinforcement Learning”. In: Conference
on Robot Learning (CoRL) (2021).
[163] Liangjun Zhang and Dinesh Manocha. “An efficient retraction-based RRT planner”. In: IEEE
International Conference on Robotics and Automation. 2008, pp. 3743–3750.
[164] Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and
Pieter Abbeel. “Deep imitation learning for complex manipulation tasks from virtual reality
teleoperation”. In: 2018 IEEE International Conference on Robotics and Automation (ICRA).
IEEE. 2018, pp. 5628–5635.
[165] Xiang Zhang, Shiyu Jin, Changhao Wang, Xinghao Zhu, and Masayoshi Tomizuka. “Learning
insertion primitives with discrete-continuous hybrid action space for robotic assembly tasks”. In:
arXiv:2110.12618 [cs] (2021).
[166] Tony Z Zhao, Jianlan Luo, Oleg Sushkov, Rugile Pevceviciute, Nicolas Heess, Jon Scholz,
Stefan Schaal, and Sergey Levine. “Offline meta-reinforcement learning for industrial insertion”.
In: International Conference on Robotics and Automation (ICRA). 2022.
[167] Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool,
János Kramár, Raia Hadsell, Nando de Freitas, and Nicolas Heess. “Reinforcement and
Imitation Learning for Diverse Visuomotor Skills”. In: Proceedings of Robotics: Science and
Systems. Pittsburgh, Pennsylvania, June 2018. DOI: 10.15607/RSS.2018.XIV.009.
[168] Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool,
János Kramár, Raia Hadsell, Nando de Freitas, and Nicolas Heess. “Reinforcement and Imitation
Learning for Diverse Visuomotor Skills”. In: Robotics: Science and Systems. 2018.
[169] Simon Zimmermann, Roi Poranne, and Stelian Coros. “Dynamic manipulation of deformable
objects with implicit integration”. In: IEEE Robotics and Automation Letters 6.2 (2021),
pp. 4209–4216.
Appendices
A Additional Ablations for Motion Planner augmented RL
We introduced Motion Planner augmented reinforcement learning (MoPA-RL) in Chapter 3. In this appendix section, we provide further analysis to study (1) the effect of reusing motion planning trajectories
to augment training data (Section A.1); (2) the ablation on the action space rescaling (Section A.2); (3)
the performance of our approach in uncluttered environments compared to baselines (Section A.3); (4) the
ablation of invalid target joint state handling (Section A.4); (5) the ablation of motion planner algorithms
(Section A.5); and (6) the ablation of RL algorithms (Section A.6).
A.1 Reuse of Motion Plan Trajectories
As mentioned in Section 3.2.4, to improve the sample efficiency of motion plan actions, we sample M intermediate sub-trajectories of the motion plan trajectory $\tau_{0:H} = (q_t, q_{t+1}, \dots, q_{t+H})$ and augment the replay buffer with sub-sampled motion plan transitions $(s_{t+a_i}, \Delta\tau_{a_i:b_i}, s_{t+b_i}, \tilde{R}(s_{t+a_i}, \Delta\tau_{a_i:b_i}))$, where $a_i < b_i \in [0, H]$ and $i \in [1, M]$ (see Algorithm 1). Figure A.1a shows the success rates of our model with different values of M, the number of sub-sampled motion plan transitions per motion plan action. Reusing the motion plan trajectory in this way improves sample efficiency, as the success rate starts increasing earlier than without reuse (M = 0). However, augmenting too many samples (M = 30, 45) degrades performance, since it biases the distribution of the replay buffer toward motion plan actions and reduces the sample efficiency of direct action executions, which results in slow learning of contact-rich skills. This biased distribution of transitions leads to convergence towards sub-optimal solutions, while the unbiased model (M = 0) eventually finds a better solution.
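The sub-sampling step itself is simple. The sketch below illustrates one way to implement it, assuming a replay buffer with an `add(state, action, reward, next_state)` interface and per-step rewards recorded while executing the plan; both the interface and the reward bookkeeping are illustrative, and the exact accumulation of $\tilde{R}$ follows Algorithm 1 rather than this sketch.

```python
import numpy as np

def augment_with_motion_plan(replay_buffer, plan_states, plan_joints, step_rewards, M):
    """Add M sub-sampled transitions from one executed motion-plan trajectory.

    plan_states:  environment states s_t, ..., s_{t+H} visited along the plan
    plan_joints:  joint configurations q_t, ..., q_{t+H} (numpy arrays)
    step_rewards: rewards collected at each of the H executed steps
    """
    H = len(plan_states) - 1
    for _ in range(M):
        # Pick a random sub-segment [a, b] of the executed plan.
        a, b = np.sort(np.random.choice(H + 1, size=2, replace=False))
        delta_q = plan_joints[b] - plan_joints[a]           # segment action: joint displacement
        segment_return = float(np.sum(step_rewards[a:b]))   # cumulative reward over the segment
        replay_buffer.add(plan_states[a], delta_q, segment_return, plan_states[b])
```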
A.2 Further Study on Action Space Rescaling
In Section 3.2.3, we propose action space rescaling to balance the sampling ratio between direct action
execution and motion planning. As illustrated in Figure A.1b, our method without action space rescaling
(ω = 0.1) fails to solve Sawyer Assembly while the policy with action space rescaling learns to solve
the task. This failure is mainly because direct action execution is crucial for inserting the table leg and
the policy without action space rescaling rarely explores the direct action execution space, which makes
the agent struggle to solve the contact-rich task. We also find that performance is not sensitive to the value of ω in Sawyer Assembly, as different ω values achieve similar success rates.
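As a rough illustration of the rescaling, the sketch below maps a normalized policy action to either a small direct-execution displacement or a larger motion-planner target, using the threshold ω and the displacement limits ∆q_step and ∆q_MP from Table B.1. This is only one way to realize the piecewise mapping described in Section 3.2.3; the exact boundary handling follows the main text.

```python
import numpy as np

def rescale_action(a, omega, dq_step, dq_mp):
    """Piecewise-linear rescaling of a normalized action a in [-1, 1]^d (sketch).

    Components with |a_i| <= omega map to small displacements in [0, dq_step]
    (direct execution); larger components map to [dq_step, dq_mp] and are
    handed to the motion planner.
    """
    mag = np.abs(a)
    direct = mag / omega * dq_step
    planned = dq_step + (mag - omega) / (1.0 - omega) * (dq_mp - dq_step)
    return np.sign(a) * np.where(mag <= omega, direct, planned)
```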
Figure A.1: Learning curves (success rate vs. environment steps, 1M) of ablated models on Sawyer Assembly. (a) Comparison of our MoPA-SAC with different numbers M of sub-sampled motion plan transitions reused from motion plan trajectories. (b) Comparison of our MoPA-SAC with different action space rescaling parameters ω.
A.3 Performance in Uncluttered Environments
We further verify that our method does not degrade the performance of model-free RL in uncluttered environments. To this end, we remove obstacles, such as the box on the table in Sawyer Lift and the three other table legs in Sawyer Assembly. Figure A.2a and Figure A.2b show that our method is as sample efficient as the baseline SAC, and is even better in Sawyer Lift w/o box because our method does not need to learn arm control for the reaching skill.
Figure A.2: Success rate (vs. environment steps, 1M) of SAC and MoPA-SAC (our method) on (a) Sawyer Lift w/o box and (b) Sawyer Assembly w/o legs.
A.4 Handling of Invalid Target Joint States for Motion Planning
When a predicted target joint state $g = q + \tilde{a}$ for motion planning is in collision with obstacles, instead of penalizing or using the invalid action $\tilde{a}$, we search for a valid action by iteratively moving the target joint state towards the current joint state and executing the new valid action, as described in Section 3.2.5.
We investigate the importance of this handling by comparing against a naive approach in which the robot does not execute any action and a transition $(s_t, a_t, r_t, s_{t+1})$ is added to the replay buffer, where $s_{t+1} = s_t$ and $r_t$ is the reward of remaining at the current state. Figure A.3a and Figure A.3b show that MoPA-SAC with the naive handling of invalid states cannot learn to solve the tasks, which implies that our proposed handling of invalid target states is crucial for training MoPA-SAC agents. A reason for this is that, with our handling, the agent can still explore the state space even when an invalid target joint state is given.
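A minimal sketch of this handling is shown below, assuming an `is_valid` callable that wraps the motion planner's state-validity (collision) checker; the shrinking factor and iteration limit are illustrative rather than the values used in our implementation.

```python
import numpy as np

def make_target_valid(q_current, a_tilde, is_valid, shrink=0.8, max_iters=20):
    """Move an invalid target joint state toward the current joint state (sketch).

    q_current: current joint configuration q
    a_tilde:   predicted joint displacement, so the target is g = q + a_tilde
    is_valid:  collision checker for a candidate joint state
    """
    delta = np.asarray(a_tilde, dtype=float)
    for _ in range(max_iters):
        target = q_current + delta
        if is_valid(target):
            return target            # first collision-free target found
        delta = shrink * delta       # pull the target toward the current state
    return np.asarray(q_current, dtype=float)  # fall back to the current (valid) configuration
```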
Figure A.3: Ablation of invalid target handling (success rate vs. environment steps, 1M), comparing MoPA-SAC with and without invalid target handling on (a) Sawyer Lift and (b) Sawyer Assembly.
A.5 Ablation of Motion Planning Algorithms
We test whether our framework is compatible with different motion planning algorithms. Figure A.4a shows
the comparison of our method using RRT-Connect and RRT* [50]. MoPA-SAC with RRT* learns to solve
tasks less efficiently than MoPA-SAC with RRT-Connect since, in our experiments, RRT-Connect finds
better paths than RRT* within the limited time given to both planners.
A.6 Ablation of Model-free RL Algorithms
To verify the compatibility of our method with different RL algorithms, we replace SAC with TD3 [25] and compare the learning performance. As illustrated in Figure A.4b, MoPA-TD3 shows unstable training, though the best-performing seed achieves a success rate of around 1.0.
B Additional Experimental Details for Motion Planner augmented RL
B.1 Environment Details
All environments in Section 3.3 are simulated in the MuJoCo physics engine [138]. The positions of the
end-effector, object, and goal are defined as peef, pobj, and pgoal, respectively. T is the maximum episode
horizon.
Figure A.4: Learning curves (success rate vs. environment steps, 1M) of ablated models on Sawyer Assembly. (a) Comparison of our model with different motion planner algorithms (MoPA-SAC with RRT-Connect vs. RRT*). (b) Comparison of our model with different RL algorithms (MoPA-SAC vs. MoPA-TD3).
Table B.1: Environment-specific parameters for MoPA-SAC

Environment        Action dimension   Reward scale   ∆q_step   ∆q_MP   ω     M    T
2D Push            4                  0.2            0.1       1.0     0.7   30   400
Sawyer Push        7                  1.0            0.05      0.5     0.7   15   250
Sawyer Lift        8                  0.5            0.05      0.5     0.5   15   250
Sawyer Assembly    7                  1.0            0.05      0.5     0.7   15   250
B.1.1 2D Push
A 2D-reacher agent with 4 joints needs to first reach an object while avoiding obstacles and then push the
object to the goal region.
Success criteria: ||pgoal − pobj||2 ≤ 0.05.
Initialization: The x and y positions of the goal and box are randomly sampled from U(−0.35, −0.24) and U(0.13, 0.2), respectively. Moreover, random noise sampled from U(−0.02, 0.02) is added to the agent's
initial pose.
Observation: The observation consists of (sin θ, cos θ) for each joint angle θ, angular joint velocity,
the box position pobj = (xobj, yobj), the box velocity, the goal position pgoal, and end-effector position
peef = (xeef, yeef).
Rewards: Instead of defining a dense reward over all states, which can cause sub-optimal solutions, we define the reward function such that the agent receives a signal only when the end-effector is close to the object (i.e., ||peef − pobj||2 ≤ 0.1). The reward function consists of rewards for reaching the box and pushing
the box to the goal region.
$$
R_{\text{push}} = 0.1 \cdot \mathbb{1}_{\|p_{\text{eef}} - p_{\text{obj}}\|_2 \le 0.1} \left(1 - \tanh(5 \cdot \|p_{\text{eef}} - p_{\text{obj}}\|_2)\right) + 0.3 \cdot \mathbb{1}_{\|p_{\text{obj}} - p_{\text{goal}}\|_2 \le 0.1} \left(1 - \tanh(5 \cdot \|p_{\text{obj}} - p_{\text{goal}}\|_2)\right) + 150 \cdot \mathbb{1}_{\text{success}} \tag{B.1}
$$
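For concreteness, the reward in Eq. (B.1) can be computed as in the short sketch below; it is a direct transcription of the equation, with the indicator terms gating each shaping component.

```python
import numpy as np

def push_reward(p_eef, p_obj, p_goal, success):
    """2D Push reward from Eq. (B.1)."""
    d_reach = np.linalg.norm(p_eef - p_obj)   # end-effector-to-object distance
    d_push = np.linalg.norm(p_obj - p_goal)   # object-to-goal distance
    r = 0.0
    if d_reach <= 0.1:                        # reaching term, active only near the object
        r += 0.1 * (1.0 - np.tanh(5.0 * d_reach))
    if d_push <= 0.1:                         # pushing term, active only near the goal
        r += 0.3 * (1.0 - np.tanh(5.0 * d_push))
    return r + 150.0 * float(success)
```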
B.1.2 Sawyer Push
The Sawyer Push task requires the agent to reach an object in a box and push the object toward a goal region.
Success criteria: ||pgoal − pobj||2 ≤ 0.05.
Initialization: Random noise sampled from N(0, 0.02) is added to the goal position and the initial
pose of the Sawyer arm.
Observation: The observation consists of each joint state (sin θ, cos θ), angular joint velocity, the goal
position pgoal, the object position and quaternion, end-effector coordinates peef, the distance between the
end-effector and object, and the distance between the object and target.
Rewards:
$$
R_{\text{push}} = 0.1 \cdot \mathbb{1}_{\|p_{\text{eef}} - p_{\text{obj}}\|_2 \le 0.1} \left(1 - \tanh(5 \cdot \|p_{\text{eef}} - p_{\text{obj}}\|_2)\right) + 0.3 \cdot \mathbb{1}_{\|p_{\text{obj}} - p_{\text{goal}}\|_2 \le 0.1} \left(1 - \tanh(5 \cdot \|p_{\text{obj}} - p_{\text{goal}}\|_2)\right) + 150 \cdot \mathbb{1}_{\text{success}} \tag{B.2}
$$
B.1.3 Sawyer Lift
In Sawyer Lift, the agent has to pick up an object inside a box. To lift the object, the Sawyer arm first needs
to get into the box, grasp the object, and lift the object above the box.
Success criteria: The goal criterion is to lift the object above the box height.
Initialization: Random noise sampled from N(0, 0.02) is added to the initial position of the Sawyer arm.
The target position is always above the height of the box.
Observation: The observation consists of each joint state (sin θ, cos θ), angular joint velocity, the goal
position, the object position and quaternion, end-effector coordinates, the distance between the end-effector
and object.
Rewards: This task can be decomposed into three stages: reach, grasp, and lift. We define a reward for each stage, and the agent receives the maximum reward over the three values. A grasp is considered successful when both fingers touch the object.
$$
R_{\text{lift}} = \max\Big( \underbrace{0.1 \cdot \big(1 - \tanh(10 \cdot \|p_{\text{eef}} - p_{\text{obj}}\|_2)\big)}_{\text{reach}},\ \underbrace{0.35 \cdot \mathbb{1}_{\text{grasp}}}_{\text{grasp}},\ \underbrace{0.35 \cdot \mathbb{1}_{\text{grasp}} + 0.15 \cdot \big(1 - \tanh(15 \cdot \max(p^{z}_{\text{goal}} - p^{z}_{\text{obj}}, 0))\big)}_{\text{lift}} \Big) + 150 \cdot \mathbb{1}_{\text{success}} \tag{B.3}
$$
B.1.4 Sawyer Assembly
The Sawyer Assembly task is to assemble the last table leg onto the table top, where the other three legs are already assembled. The Sawyer arm needs to avoid the other table legs while moving the leg in its gripper to the hole, since collision with the other legs can move the table. Note that the table leg that the agent manipulates is attached to the gripper; therefore, it does not need to learn how to grasp the leg.
Success criteria: The task is considered successful when the table leg is inserted into the hole. The goal position is at the bottom of the hole, and the success criterion is ||pgoal − pleg-head||2 ≤ 0.05, where pleg-head is the position of the head of the table leg.
Initialization: Random noise sampled from N(0, 0.02) is added to the initial position of the Sawyer
arm. The pose of the table top is fixed.
Observation: The observation consists of each joint state (sin θ, cos θ), angular joint velocity, the hole
position pgoal, positions of two ends of the leg in hand pleg-head, pleg-tail, and quaternion of the leg.
Rewards:
$$
R_{\text{assembly}} = 0.4 \cdot \mathbb{1}_{\|p_{\text{leg-head}} - p_{\text{goal}}\|_2 \le 0.3} \left(1 - \tanh(15 \cdot \|p_{\text{leg-head}} - p_{\text{goal}}\|_2)\right) + 150 \cdot \mathbb{1}_{\text{success}} \tag{B.4}
$$
B.2 Training Details for Section 3.3
For the reward scale in our baseline, we use 10 for all environments. In our method, each reward can be much larger than in the baseline because we use the cumulative reward along a motion plan trajectory whenever the motion planner is called. Therefore, a larger reward scale degrades performance in our method, and using a small reward scale of 0.1 to 0.5 enables the agent to solve the tasks. Moreover, the entropy coefficient α in SAC is tuned automatically. To train a policy over discrete actions with SAC, we use the Gumbel-Softmax distribution [43] for categorical reparameterization, with a temperature of 1.0.
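For illustration, this reparameterization is available directly in PyTorch; the snippet below samples a differentiable discrete choice from policy logits with a temperature of 1.0. The logits here are placeholders, not the actual policy head used in our implementation.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 2, requires_grad=True)         # placeholder logits from a policy head
soft = F.gumbel_softmax(logits, tau=1.0, hard=False)   # relaxed, fully differentiable sample
hard = F.gumbel_softmax(logits, tau=1.0, hard=True)    # one-hot forward pass, soft gradients in backward
```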
Table B.2: SAC hyperparameters

Parameter                                   Value
Optimizer                                   Adam
Learning rate                               3e-4
Discount factor (γ)                         0.99
Replay buffer size                          10^6
Number of hidden layers for all networks    2
Number of hidden units for all networks     256
Minibatch size                              256
Nonlinearity                                ReLU
Target smoothing coefficient (τ)            0.005
Target update interval                      1
Network updates per environment step        1
Target entropy                              −dim(A)
B.2.1 Wall-clock Time
The wall-clock time of our method depends on various factors, such as the computation time of an MP path and the number of policy updates. As Table B.3 shows, MoPA-RL trains faster in wall-clock time than SAC for 1.5M environment steps. This is because SAC updates the policy once for every action taken, and our method requires fewer policy actions to complete an episode; as a result, our method performs fewer costly policy updates. Moreover, while a single call to the motion planner can be computationally expensive (roughly 0.3 seconds in our case), we need to invoke it less frequently since it produces a multi-step plan (40 steps on average in our experiments). We further increased the efficiency of our method by introducing a simplified interpolation planner.
Table B.3: Comparison of wall-clock training time (hours)

Method      Sawyer Push   Sawyer Lift   Sawyer Assembly
MoPA-SAC    15            17            14
SAC         24            24            24
C Ablation Studies for LfD across Morphologies
This appendix section presents ablation studies to further investigate the method presented in Chapter 5.
We use the DRY CLOTH task for all ablations unless specified; it is the most challenging of our tasks.
We provide detailed answers to the following questions. Figure C.6 illustrates the ablations corresponding
to each part of the overall method. (1) How do different methods perform in creating optimized dataset
DStudent? (2) What is the best architecture to learn the task dynamics? (3) How good is DStudent compared
to the recorded demonstrations? (4) How well does the downstream LfD method handle different kinds of
demonstrations? (5) How does the use of expert state matching affect the downstream LfD? (6) How do the
baselines perform across related morphologies and environment?
We discovered that the Cross-Entropy Method (CEM) is the most effective optimizer for generating $D_{Student}$ from demonstrations. When combined with CEM, the 1D CNN-LSTM architecture produces the best results for trajectory optimization. Our optimized $D_{Student}$ performs similarly to the pre-programmed $D^{1p}_{Demo}$, which has access to full state information of the environment. By utilizing our chosen downstream LfD method, we can successfully complete tasks with a variety of demonstrations and achieve superior performance compared to both $D_{Student}$ and $D_{Teacher}$. Expert state matching negatively impacts the performance of DMfD. Lastly, we found that GAIfO trained on our $D_{Student}$ outperforms GAIfO trained on $D_{Teacher}$, and the difficulty of the environment significantly influences the performance of GAIfO and GPIL.
C.1 Ablate the Method for Creating Optimized Dataset $D_{Student}$
We answer the question: how do different methods perform in creating the optimized dataset $D_{Student}$? We ablate the optimizer used to create $D_{Student}$ from the demonstrations, labeled ABL1 in Figure C.6, and compare the following methods, given state inputs from $D_{Teacher}$:
• Random: A trivial random guesser that serves as a lower benchmark.
• SAC: An RL algorithm that tries to reach the goal states of the demonstrations.
• Covariance Matrix Adaptation Evolution Strategy (CMA-ES): An evolutionary strategy that samples optimization parameters from a multivariate Gaussian, and updates the mean and covariance at each iteration.
• Model Predictive Path Integral (MPPI): An information theoretic MPC algorithm that can support
learned dynamics and complex cost criteria [151, 150].
• Cross-Entropy Method (CEM, ours): A well-known gradient-free optimizer, where we assume a
Gaussian distribution for optimization parameters.
Figure C.5: Environments used in our experiments, with one end-effector. The end-effectors are pickers (white spheres). In
CLOTH FOLD (left) the robot has to fold the cloth (orange and pink) along an edge (inspired by the SoftGym [79] two-picker cloth
fold task). In DRY CLOTH (middle) the robot has to hang the cloth (orange and pink) on the drying rack (brown plank). In THREE
BOXES (right), the robot has to rearrange three rigid boxes along a line.
Figure C.6: Ablations to MAIL components.
We did not use gradient-based trajectory optimizers, since the contact-rich simulation gives rise to discontinuous dynamics and noisy gradients. As shown in Table C.4a, SAC is unable to improve upon the random baseline, likely because of the very large state space of our environment (> 15000 states for > 5000 cloth particles) and error accumulation from the imprecision of the learned dynamics model. Trajectory optimizers achieve the highest performance, and we chose CEM as the best optimizer based on the performance of the optimized trajectory.
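For reference, the sketch below shows the shape of the CEM loop we have in mind: a diagonal Gaussian over the action-sequence parameters is refit to the elite (lowest-cost) samples at every iteration. The cost function is assumed to roll out a candidate through the learned dynamics model and return the negative task performance; the population size and elite fraction are illustrative, not the exact settings used in our experiments.

```python
import numpy as np

def cem(cost_fn, dim, iters=10, pop=64, elite_frac=0.1, init_std=0.5):
    """Cross-Entropy Method over a dim-dimensional action-sequence parameterization (sketch)."""
    mean, std = np.zeros(dim), np.full(dim, init_std)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, dim)            # sample candidate parameters
        costs = np.array([cost_fn(s) for s in samples])             # evaluate via learned dynamics rollouts
        elites = samples[np.argsort(costs)[:n_elite]]               # keep the lowest-cost candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit the Gaussian
    return mean
```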
C.2 Ablate the Dynamics Model
We answer the question: what is the best architecture to learn the task dynamics? We ablate the learned dynamics model $T_\psi$, labeled ABL2 in Figure C.6. The environment state is the state from $D_{Teacher}$, i.e., the positions of the cloth particles. This is a structured but large state space, since the cloth is discretized into > 5000 particles.
Table C.4b shows the performance of trajectories achieved by using the dynamics models. We see
that CNN-LSTM models work better than models that contain only CNNs, graph networks (GNS [115]),
transformers (Perceiver IO [42]), or LSTMs. We hypothesize that this is because the model must capture both the spatial structure of the cloth and a temporal element across the whole trajectory, since particle velocity is not captured in the state. Further, a 1D CNN works better because the cloth state can be simply represented as a 2D array (N × 3, representing the xyz coordinates of N particles), which is easier to learn from than the 3D tensor fed into 2D CNNs.
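A minimal sketch of such a 1D CNN-LSTM dynamics model is shown below in PyTorch. The layer sizes, pooling, and single next-state prediction head are illustrative choices, not the exact architecture from Chapter 5; the key idea is that a 1D convolution runs over the particle axis (with xyz as channels) and an LSTM aggregates the per-step features over the trajectory.

```python
import torch
import torch.nn as nn

class CNNLSTMDynamics(nn.Module):
    """Sketch: predict next particle positions from a history of cloth states and actions."""

    def __init__(self, n_particles, action_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(                       # 1D conv over the particle axis
            nn.Conv1d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(16), nn.Flatten(),         # -> (B*T, 64 * 16)
        )
        self.lstm = nn.LSTM(64 * 16 + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_particles * 3)

    def forward(self, states, actions):
        # states: (B, T, N, 3) particle positions; actions: (B, T, action_dim)
        B, T, N, _ = states.shape
        x = states.reshape(B * T, N, 3).transpose(1, 2)         # (B*T, 3, N): xyz as channels
        feats = self.encoder(x).reshape(B, T, -1)               # per-step spatial features
        h, _ = self.lstm(torch.cat([feats, actions], dim=-1))   # temporal aggregation
        return self.head(h[:, -1]).reshape(B, N, 3)             # predicted next particle positions
```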
GNS also performs poorly due to error accumulation from large displacements, discussed in Section 5.3.2. Although Perceiver IO did not perform as well as CNN-LSTM, it did not affect the downstream performance of the LfD method. We conducted an experiment to compare DMfD performance when trained on the $D_{Student}$ obtained from Perceiver IO and from CNN-LSTM, and found that they had comparable results, shown in Figure C.8. This indicates that MAIL is adaptable to different $D_{Student}$ and capable of learning from suboptimal demonstrations.

Figure C.7: Predictions of the learned spatio-temporal dynamics model $T_\psi$ and the FleX simulator, made for the same state and action and shown for both cloth tasks. The learned model supports optimization approximately 50x faster than the simulator, albeit at the cost of accuracy.
Our learned dynamics model $T_\psi$ was significantly faster than the simulator. We tested it on a simple
training run of SAC [30], without parallelization. Our learned dynamics gave 162 fps, about 50x faster
than the 3.4 fps with the simulator. However, the dynamics error was not insignificant. We compute the
state changes in cloth by considering the cloth particles as a point cloud, and computing distances between
point clouds using the chamfer distance. We then executed actions on the cloth for the DRY CLOTH task,
comparing the cloth state before an action with the model’s predicted state and the simulator’s true state
after the action. Over 100 state transitions, we observed a cloth movement of 0.67 m in the true simulator,
and an error of 0.17 m between the true and predicted state of the cloth. This accuracy was tolerable for
trajectory optimization, qualitatively shown in Figure C.7, where we did not need optimal demonstrations.
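The chamfer distance used for this error measurement can be computed as in the following sketch; this is a brute-force pairwise version that is adequate for clouds of a few thousand points, and a KD-tree-based nearest-neighbor query would be used for larger clouds.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric chamfer distance between point clouds p (N, 3) and q (M, 3) (sketch)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()          # mean nearest-neighbor distance, both directions
```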
Figure C.8: Normalized performance on Cloth Fold and Dry Cloth of DMfD trained on $D_{Student}$ obtained using different learned dynamics models: 1D CNN-LSTM and Perceiver IO. For each training run, we used the best model in each seed's training run, and evaluated using 100 rollouts across 5 seeds, different from the training seed. Bar height denotes the mean; error bars indicate the standard deviation.
C.3 Compare Performance of Optimized Dataset $D^{1p}_{Optim}$
We answer the question: how good is $D_{Student}$ compared to the recorded demonstrations? This ablation gauges the performance of the optimized dataset that we used as the student dataset for LfD, $D_{Student} = D^{1p}_{Optim}$. We compare this to other relevant datasets for solving the task, as shown in Table C.4c. It is labeled ABL3 in Figure C.6. The two-picker demonstrations $D^{2p}_{Demo}$ are recorded for an agent with two pickers as end-effectors; this is used as the teacher demonstrations in our experiment, $D_{Teacher} = D^{2p}_{Demo}$. The one-picker demonstrations $D^{1p}_{Demo}$ are recorded for an agent with one picker as an end-effector, to contrast against the optimized demonstrations in the same morphology, $D^{1p}_{Optim}$. The random-action trajectories are collected with a one-picker agent and added as a lower performance benchmark; they are the same random trajectories used to train the spatio-temporal dynamics model $T_\psi$. Naturally, the teacher dataset is the best, as it is trivial to do this task with two pickers. The one-picker dataset has about the same performance as the optimized dataset $D^{1p}_{Optim}$, both of which are suboptimal. It can be inferred that it is not trivial to manipulate cloth with one hand. This is the kind of task we wish to unlock with this method: tasks that are easy for teachers in one morphology but difficult to program or record demonstrations for in the student's morphology. Note that $D^{1p}_{Optim}$ has been optimized on the fast but inaccurate learned dynamics model, which is one reason for its reduced performance. This is why the downstream LfD method uses the simulator, as accuracy is very important in the final policy.
Method             25th%    µ ± σ          median   75th%
Random             0.000    0.003±0.088    0.000    0.000
SAC                0.000    0.000±0.006    0.000    0.000
CMA-ES             0.104    0.270±0.258    0.286    0.489
MPPI               0.070    0.289±0.264    0.275    0.474
CEM                0.351    0.502±0.242    0.501    0.702
(a) Ablation on the method chosen for creating demonstrations.

Method                25th%    µ ± σ          median   75th%
Perceiver IO          0.305    0.450±0.258    0.486    0.628
GNS                  -0.182    0.002±0.223   -0.042    0.149
2D CNN, LSTM          0.157    0.376±0.305    0.382    0.602
No CNN, LSTM          0.327    0.465±0.213    0.463    0.595
1D CNN, No LSTM       0.202    0.407±0.237    0.387    0.587
1D CNN, LSTM (ours)   0.351    0.502±0.242    0.501    0.702
(b) Ablation on the dynamics network architecture.

Dataset               25th%    µ ± σ          median   75th%
$D_{Random}$          0.000    0.003±0.088    0.000    0.000
$D^{1p}_{Demo}$       0.344    0.484±0.169    0.446    0.641
$D^{2p}_{Demo}$       0.696    0.744±0.068    0.724    0.785
$D^{1p}_{Optim}$      0.351    0.502±0.242    0.501    0.702
(c) Comparison of the performance of the optimized dataset.

Table C.4: Ablation results for MAIL.
C.4 Ablate Modality of Demonstrations
We answer the question: how well does the downstream LfD method handle different kinds of demonstrations? This ablates the composition of the student dataset fed into LfD, and is labeled ABL4 in Figure C.6.
We compare the following datasets for $D_{Student}$, using the notation for datasets explained in Section 5.2.1:
• Demonstrations in the one-picker morphology, $D^{1p}_{Demo}$: These are non-trivial to create and are thus not as performant, as discussed above. Creating them becomes increasingly difficult as the task becomes more challenging.
• Optimized demos, $D^{1p}_{Optim}$: These are optimized from the two-picker teacher demonstrations ($D_{Teacher} = D^{2p}_{Demo}$), which are easy to collect since the task is trivial with two pickers.
• 50% $D^{1p}_{Demo}$ and 50% $D^{1p}_{Optim}$: A mix of trajectories from the two cases above. This is an example of handling multiple demonstrators with different morphologies.
Figure C.9 illustrates that all three variants achieve similar final performance. This demonstrates that the downstream LfD method is capable of solving the task with a variety of suboptimal demonstrations, whether from one dataset of demonstrations or from a combination of datasets obtained from a heterogeneous set of teachers.
Interestingly, comparing Figure C.9 and Table C.4c, we see that the final policy is better than the suboptimal demonstrations by a considerable margin, and also slightly improves upon the performance of the teacher demonstrations. This improvement comes from the LfD method's ability to effectively utilize demonstrations and generalize across task variations. This result, combined with the ablation in Section 5.3.2 showing that demonstrations are needed, shows that our downstream LfD method is well adapted to work with suboptimal demonstrations to solve a task.
C.5 Ablate Reference State Initialization in DMfD
We answer the question: how does the use of demonstration state matching affect the downstream LfD? An
improvement we made over the original DMfD algorithm is to disable matching with expert states, known
as RSI-IR, first proposed in [99]. We justify this improvement in this ablation, labeled ABL5 in Figure C.6.
[Plot: LfD performance vs. training steps for three student datasets: 100% one-picker demos, 50% one-picker demos and 50% D_{Student}, and 100% D_{Student}.]
Figure C.9: Ablation on the modality of demonstrations and its effect on LfD performance. Similar performance across variants shows that MAIL can learn from a wide variety of demonstrations, or even a mixture of them, without loss in performance. See Section C.4.
[Plot: LfD performance vs. training steps for DMfD with RSI-IR and DMfD without RSI-IR.]
Figure C.10: Ablation on the effect of reference state initialization (RSI) and imitation reward (IR) on LfD performance. RSI is not helpful here because our tasks are not as dynamic or long-horizon as those in DeepMimic [99]. See Section C.5.
As shown in Figure C.10, removing RSI and IR has a net positive effect throughout training, improving the
final policy performance by around 10%. This means that matching expert states exactly via the imitation reward
does not help, even during the initial stages of training when the policy is randomly initialized. We believe
this is because RSI helps when there are hard-to-reach intermediate states that the policy cannot reach
during the initial stages of training. This is true for dynamic or long-horizon tasks, such as karate chops
and roundhouse kicks. However, our tasks are quasi-static, and also have a short horizon of 3 for the cloth
tasks. In other words, removing this technique allows the policy to freely explore the state space while the
demonstrations can still guide the RL policy learning via the advantage-weighted loss from DMfD.
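To make the disabled mechanism concrete, the sketch below shows one way reference state initialization could be gated at episode reset. The environment and demonstration-state interfaces are hypothetical placeholders rather than the actual DMfD implementation; we simply set the RSI probability to 0, as in Table D.9.

```python
import random

def reset_episode(env, demo_states, p_rsi=0.0, rng=random):
    """Reset the environment, optionally to a demonstration state (RSI).

    With p_rsi = 0 (our setting), RSI is disabled: every episode starts from the task's
    standard initial-state distribution, and demonstrations guide learning only through
    the advantage-weighted loss rather than through explicit state matching.
    """
    if rng.random() < p_rsi:
        state = rng.choice(demo_states)   # a stored expert state (hypothetical buffer)
        return env.reset_to_state(state)  # hypothetical environment API
    return env.reset()                    # standard reset
```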
C.6 Ablate the Effect of Cross-morphology on LfD Baselines
We answer the question: how do established LfD baselines perform across morphologies? This ablation
studies the effect of cross-morphology in the demonstrations, where we compare the performance of GAIfO,
when provided with demonstrations from the teacher dataset D_{Teacher} and the (suboptimal) student dataset D_{Student},
for the DRY CLOTH task.
As we can see in Table C.5, there is a 36% performance improvement when using D_{Student} instead of
D_{Teacher}. The primary difference that the agent sees during training is the richness of demonstration states,
as the demonstration actions are not available to learn from. Since the student morphology has only one
picker, any demonstration for the task (DRY CLOTH) includes multiple intermediate states of the cloth in
various conditions of being partially hung for drying. By contrast, the teacher requires fewer pick-place
steps to complete the task, and thus there are fewer intermediate states in the demonstrations.
C.7 Ablate the Effect of Environment Difficulty on LfD Baselines
We answer the question: how do established LfD baselines perform across environments? Given the subpar
performance of the LfD baselines GAIfO and GPIL on our SOTA environments, we ablated the effect of
environment difficulty. We took the easy cloth environment (CLOTH FOLD) and used an easier variant of
it, CLOTH FOLD DIAGONAL PINNED [113]. In this variant, the agent has to fold cloth along a diagonal,
which can be done by manipulating only one corner of the cloth. Moreover, one corner of the cloth is pinned
to prevent sliding, making it easier to perform. We used state-based observations and a small-displacement
action space, where the agent outputs incremental picker displacements instead of pick-and-place locations.
We can see in Table C.6 that the same baselines are able to perform significantly better in this environment.
Hence, we believe that manipulating with long-horizon pick-and-place actions from image observations makes it
challenging for the baselines to perform the cloth manipulation tasks described in Section 5.3.1 and Section D.1.
Demonstrations 25th% µ ± σ median 75th%
D_{Teacher} -0.198 -0.055±0.183 -0.043 0.078
D_{Student} 0.199 0.363±0.245 0.409 0.528
Table C.5: Ablation of GAIfO on the effect of cross-morphology. We compare the normalized performance, measured at the end
of the task.
Method 25th% µ ± σ median 75th%
GPIL 0.356 0.427±0.162 0.487 0.553
GAIfO 0.115 0.374±0.267 0.471 0.592
Table C.6: Measuring performance on the easy cloth task, CLOTH FOLD DIAGONAL PINNED. We compare the normalized
performance, measured at the end of the task.
D Additional Experimental Details for LfD across morphologies
D.1 Tasks
Here we give more details about the tasks, including the performance functions, teacher dataset, and sample
images. Figure C.5 shows images of all simulation environments used for SOTA comparisons and generalizability, with one end-effector. In each environment, the end-effectors are pickers (white spheres). In
cloth-based environments, the cloth is discretized into an 80x80 grid of particles, giving a total of 6400
particles.
1. CLOTH FOLD: Fold a square cloth in half, along a specified line. The performance metric is the distance of the cloth particles left of the folding line to those on the right of the folding line; a fully folded cloth should have these two halves virtually overlap. Teacher demonstrations are from an agent with two pickers (i.e., D_{Teacher} = D^{2p}_{Demo}); we solve the task on a student agent with one picker. Task variations are in cloth rotation.
2. DRY CLOTH: Pick up a square cloth from the ground and hang it on a plank to dry, a variant of [89]. The performance metric is the number of cloth particles (in simulation) on either side of the plank and above the ground. Teacher demonstrations are from an agent with two pickers (i.e., D_{Teacher} = D^{2p}_{Demo}); we solve the task on a student agent with one picker. Task variations are in cloth rotations and translations with respect to the plank. A hedged sketch of these two particle-based performance metrics is given after this list.
3. THREE BOXES: A simple environment with three boxes along a line that need to be rearranged to designated goal locations. Teacher demonstrations are from an agent with three pickers (i.e., D_{Teacher} = D^{3p}_{Demo}); we solve the task on student agents with one picker and two pickers. Performance is measured by the distance of each object from its goal location. This task is used to illustrate the generalizability of MAIL with various n-to-m end-effector transfers, and is not used in the SOTA comparisons.
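The particle-based metrics described for the cloth tasks translate directly into code. The following is a minimal sketch under stated assumptions: the cloth particle positions are available as an (N, 3) array, the fold line and the plank are axis-aligned, and the mirroring and counting shown here are one plausible reading of the descriptions above rather than the benchmark's exact normalization.

```python
import numpy as np

def cloth_fold_metric(particles):
    """CLOTH FOLD: negative mean distance between the two mirrored halves of the cloth.

    particles: (6400, 3) array of cloth particle positions on an 80x80 grid.
    A perfect fold places each particle left of the fold line on top of its mirrored
    counterpart on the right, so the distance (and hence the penalty) goes to zero.
    """
    grid = particles.reshape(80, 80, 3)
    left, right = grid[:40], grid[40:][::-1]            # mirror the right half onto the left
    return -np.linalg.norm(left - right, axis=-1).mean()

def dry_cloth_metric(particles, plank_x=0.0, ground_y=0.01):
    """DRY CLOTH: fraction of particles hanging above the ground on each side of the plank."""
    above = particles[:, 1] > ground_y
    left = np.mean(above & (particles[:, 0] < plank_x))
    right = np.mean(above & (particles[:, 0] >= plank_x))
    return left + right
```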
D.2 Hyperparameter Choices for MAIL
In this section, Table D.7 shows the hyperparameters chosen for training the forward dynamics model Tψ.
Table D.8 shows the details of CEM hyperparameter choices. Table D.9 shows the hyperparameters for our
chosen LfD method (DMfD).
Parameter | Description
CNN | 4 layers, 32 channels, 3x3 kernel, leaky ReLU activation; stride = 2 for the first layer, stride = 1 for subsequent layers
LSTM | One layer, hidden size = 32
Other parameters | Learning rate α = 1e-5, batch size = 128
Table D.7: Hyper-parameters for training the forward dynamics model.
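As a concrete illustration of Table D.7, the following is a minimal PyTorch sketch of a 1D-CNN + LSTM forward dynamics model with the listed layer sizes. The input dimensions, how the state-action history is arranged, and the prediction head are assumptions for illustration, not the exact architecture used.

```python
import torch
import torch.nn as nn

class ForwardDynamics(nn.Module):
    """Predicts the next (reduced) state from a history of states and actions."""

    def __init__(self, in_dim, state_dim, hidden=32):
        super().__init__()
        layers = []
        for i in range(4):  # 4 conv layers, 32 channels, kernel size 3 (1D analogue of 3x3)
            layers += [nn.Conv1d(in_dim if i == 0 else 32, 32, kernel_size=3,
                                 stride=2 if i == 0 else 1, padding=1),
                       nn.LeakyReLU()]
        self.cnn = nn.Sequential(*layers)
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, state_dim)

    def forward(self, x):
        # x: (batch, time, features), e.g. a concatenated state-action history
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # convolve over the time axis
        out, _ = self.lstm(z)
        return self.head(out[:, -1])  # predict the next state from the last hidden state

# Training setup from Table D.7: learning rate 1e-5, batch size 128 (Adam assumed).
```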
D.3 Performance Metrics for Real-world Cloth Experiments
In this section, we explain the metrics used to measure cloth manipulation performance in the real world, which underpin the sim2real results discussed in Section 5.3.2.1.
For the CLOTH FOLD task, we measure performance at time t by the number of visible pixels of the cloth's top color, pix_{top,t}, and bottom color, pix_{bot,t}, compared to the maximum number of pixels of the flattened cloth, pix_max (Figure D.11).
[Plot: performance vs. fraction of visible pixels f_top, f_bot, showing the top metric p(top) and the bottom metric p(bot).]
Figure D.11: Performance function for CLOTH FOLD on the real robot. At time t, we measure the fraction of visible pixels relative to the maximum number of visible pixels, f_top = pix_{top,t}/pix_max and f_bot = pix_{bot,t}/pix_max. Performance for the top of the cloth should be 1 when it is not visible, p(top) = 1 − f_top. Performance for the bottom of the cloth should be 1 when it is exactly half-folded on top of the top side, p(bot) = min[2(1 − f_bot), 2 f_bot]. The final performance is the average of both metrics, p(s_t) = (p(top) + p(bot))/2. Note that the cloth is flattened at the start, thus pix_max = pix_{top,0}.
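The caption above maps directly to a small function. Below is a minimal sketch of the real-world CLOTH FOLD metric, assuming the top- and bottom-color pixel counts have already been obtained by color segmentation of the camera image.

```python
def cloth_fold_performance(pix_top_t, pix_bot_t, pix_max):
    """Real-world CLOTH FOLD performance at time t (Figure D.11).

    pix_top_t, pix_bot_t: visible pixel counts of the top / bottom cloth color at time t.
    pix_max: pixel count of the flattened cloth at t = 0 (equal to pix_top at the start).
    """
    f_top = pix_top_t / pix_max
    f_bot = pix_bot_t / pix_max
    p_top = 1.0 - f_top                            # the top side should be fully hidden
    p_bot = min(2.0 * (1.0 - f_bot), 2.0 * f_bot)  # the bottom side should cover exactly half
    return 0.5 * (p_top + p_bot)
```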
Set | Planning horizon | Number of optimization iterations | Number of env interactions
1 1 2 21,000
2 2 2 15,000
3 2 2 21,000
4 2 2 31,000
5 2 2 34,000
6 2 10 21,000
7 2 1 21,000
8 2 1 15,000
9 2 1 32,000
10 3 2 21,000
11 3 10 21,000
12 4 2 21,000
13 4 10 21,000
Table D.8: CEM hyper-parameters tested for tuning the trajectory optimization. We conducted ten rollouts for each parameter set
and used the set with the highest average normalized performance on the teacher demonstrations. Population size is determined by
the number of environment interactions. The number of elites for each CEM iteration is 10% of population size.
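For reference, the sketch below shows the CEM loop that these hyper-parameters configure: a population of open-loop action sequences is sampled, scored (e.g., by rolling them out through the learned dynamics model), the elite 10% are kept, and the sampling distribution is refit. The scoring function and the Gaussian parameterization are assumptions for illustration.

```python
import numpy as np

def cem_plan(score_fn, action_dim, horizon=2, iters=2, population=500, elite_frac=0.1, seed=0):
    """Cross-entropy method over action sequences of length `horizon`.

    score_fn: maps an action sequence of shape (horizon, action_dim) to a scalar score,
              e.g. the predicted task performance under the learned dynamics model.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    n_elite = max(1, int(elite_frac * population))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(population, horizon, action_dim))
        scores = np.array([score_fn(a) for a in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]   # keep the top 10% of the population
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # mean of the final elite distribution as the planned action sequence
```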
For the DRY CLOTH task, it is challenging to measure pixels on the sides and top of the plank. Moreover, we could double-count pixels that are visible in both the side and top views. Hence, we measure the cloth directly to determine whether the length of cloth draped over the top of the plank is equal to or greater than the side length of the square cloth. We call this the spread metric.
The policies achieve ∼ 80% performance, which is about the average performance of our method in
simulation, for both tasks. However, since these performance metrics are different in the simulation and real
world, we cannot quantify the sim2real gap through these numbers.
D.4 Collected Dataset of Teacher Demonstrations
We have 100 demonstrations provided by the teacher, as mentioned in Section 5.2.4. The diversity of the task
comes from the initial conditions for these demonstrations, which are sampled from the task distribution
v_d ∼ V. This variability in the initial state adds diversity to the dataset. The quality and performance of
these teacher demonstrations were briefly discussed in the ablations (Section C.4).
Parameter | Description
State encoding | Fully connected network (FCN): 2 hidden layers of 1024, ReLU activation
Image encoding | 32x32 RGB input, with random crops. CNN: 4 layers, 32 channels, stride 1, 3x3 kernel, leaky ReLU activation. FCN: 1 layer of 1024 neurons, tanh activation
Actor | Fully connected network: 2 hidden layers of 1024, leaky ReLU activation
Critic | Fully connected network: 2 hidden layers of 1024, leaky ReLU activation
Other parameters | Discount factor γ = 0.9; entropy loss weight w_E = 0.1; entropy regularizer coefficient α = 0.5; batch size = 256; replay buffer size = 600,000; RSI-IR probability = 0 (disabled)
Table D.9: Hyper-parameters used in the LfD method (DMfD).
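As a concrete reading of the image-encoding row in Table D.9, the sketch below builds a matching PyTorch encoder. The flattening step and how the resulting embedding feeds the actor and critic are assumptions for illustration, not the exact DMfD implementation.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Encodes a 32x32 RGB observation (random crops are assumed to be applied beforehand)."""

    def __init__(self, out_dim=1024):
        super().__init__()
        layers = []
        for i in range(4):  # 4 conv layers, 32 channels, 3x3 kernel, stride 1
            layers += [nn.Conv2d(3 if i == 0 else 32, 32, kernel_size=3, stride=1, padding=1),
                       nn.LeakyReLU()]
        self.cnn = nn.Sequential(*layers)
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(32 * 32 * 32, out_dim),  # 32 channels x 32x32 spatial
                                nn.Tanh())

    def forward(self, img):
        # img: (batch, 3, 32, 32)
        return self.fc(self.cnn(img))
```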
All demonstrations come from a scripted policy. For ClothFold, the teacher has two end-effectors and
picks two corners of the cloth to move them towards the other two corners. For DryCloth, the teacher has two
end-effectors and picks two corners of the cloth to move them to the other side of the rack. The two pickers maintain
the same distance from each other during the move to ensure the cloth is spread out when it hangs on
the rack. For ThreeBoxes, the teacher has three end-effectors. It picks up all the boxes simultaneously and
places them in their respective goals.
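To illustrate the scripted teachers described above, here is a minimal sketch of the two-picker ClothFold teacher. The corner-lookup and pick-and-place action format are hypothetical placeholders for the environment interface.

```python
import numpy as np

def clothfold_teacher_action(cloth_corners):
    """Two-picker scripted teacher for ClothFold.

    cloth_corners: (4, 3) array of corner positions ordered [c0, c1, c2, c3], where the
    edge (c0, c1) is to be folded onto the edge (c3, c2).
    Returns one pick-and-place action per picker, shape (2, 6) = [pick_xyz, place_xyz].
    """
    picks = cloth_corners[[0, 1]]    # each picker grasps one corner of the near edge
    places = cloth_corners[[3, 2]]   # and moves it onto the opposite corner
    return np.concatenate([picks, places], axis=1)
```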
D.5 Random Actions Dataset used for Training the Dynamics Model
We trained the dynamics model on random actions from various states, to cover the state-action distributions
our tasks would operate under.
For CLOTH FOLD, our random action policy is to pick a random cloth particle and move the particle to a
random goal location within the action space. For DRY CLOTH, the random action policy is to pick a random
cloth particle, and move it to a random goal location around the drying rack, to learn cloth interactions
around the rack. For completeness, we also trained a forward dynamics model for the THREE BOXES task.
Here, the random action policy is to pick the boxes in order and sample a random place location within the
action space.
Each task’s episode horizon is 3. Our actions are pick-and-place actions, and the action space is in the
full range of visibility of the camera. For DRY CLOTH, this limit is [−0.5, 0, −0.5] to [0.5, 0.7, 0.5]. For
CLOTH FOLD it is [−0.9, 0, −0.9] to [0.9, 0.7, 0.9]. For THREE BOXES it is −0.1 to 1.35.
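A minimal sketch of the random pick-and-place data collection described above, using the DRY CLOTH workspace bounds quoted in this section; the environment interface (particle accessor, step signature) is a hypothetical placeholder.

```python
import numpy as np

# DRY CLOTH workspace bounds (x, y, z) from Section D.5.
LOW, HIGH = np.array([-0.5, 0.0, -0.5]), np.array([0.5, 0.7, 0.5])

def collect_random_episode(env, horizon=3, rng=np.random.default_rng(0)):
    """Roll out random pick-and-place actions and record (state, action, next_state) tuples."""
    transitions, obs = [], env.reset()
    for _ in range(horizon):
        pick = rng.choice(env.cloth_particle_positions())  # hypothetical accessor, (N, 3) array
        place = rng.uniform(LOW, HIGH)                     # random goal within the workspace
        action = np.concatenate([pick, place])
        next_obs, _, _, _ = env.step(action)
        transitions.append((obs, action, next_obs))
        obs = next_obs
    return transitions
```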
Abstract
Robot manipulation of complex objects, such as cloth, is challenging due to difficulties in perceiving and exploring the environment. Pure reinforcement learning (RL) is difficult in this setting, as it requires extensive exploration of the state space, which can be inefficient and dangerous. Demonstrations from humans can alleviate the need for exploration, but collecting good demonstrations can be time-consuming and expensive. Therefore, a good balance between perception, exploration, and imitation is needed to solve manipulation of complex objects.
This dissertation focuses on dexterous manipulation of complex objects, using images and without assuming full state information during inference. It also aims to achieve efficient learning by reducing interactions with the environment during exploration and reducing the overhead of collecting demonstrations. To achieve these goals, we present (i) a learning algorithm that uses a motion planner in the loop to enable efficient long-horizon exploration, (ii) a framework for visual manipulation of complex deformable objects using demonstrations from a set of agents with different embodiments, and (iii) an LfD algorithm for dexterous tasks with rigid objects, such as high-precision peg insertion, using images and a multi-task attention-based architecture.
These contributions enable robots to manipulate complex objects efficiently and with high precision, using images alone. This opens up new possibilities for robots to be used in a wider range of applications, such as manufacturing, logistics, and healthcare.