Data-Driven Acquisition of Closed-Loop Robotic Skills
by
Yevgen Chebotar
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
Committee:
Prof. Gaurav S. Sukhatme (Chair) Computer Science
Prof. Heather Culbertson Computer Science
Prof. Satyandra K. Gupta Mechanical Engineering
May 2019
Copyright 2019 Yevgen Chebotar
Table of Contents
1 Introduction
1.1 Thesis Outline
I Training Decomposition
2 Path Integral Guided Policy Search
2.1 Introduction
2.2 Related Work
2.3 Background
2.3.1 Guided Policy Search
2.3.2 Policy Improvement with Path Integrals
2.4 Path Integral Guided Policy Search
2.4.1 PI² for Guided Policy Search
2.4.2 Global Policy Sampling
2.4.3 Global Policy Initialization
2.4.4 Learning Visuomotor Policies
2.5 Experiments
2.5.1 Experimental Setup
2.5.2 Single-Instance Comparisons
2.5.3 Evaluating Generalization
2.6 Conclusions
3 Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning
3.1 Introduction
3.2 Related Work
3.3 Preliminaries
3.3.1 Model-Based Optimization of TVLG Policies
3.3.2 Policy Improvement with Path Integrals
3.4 Integrating Model-Based Updates into PI²
3.4.1 Two-Stage PI² update
3.4.2 Model-Based Substitution with LQR-FLM
3.4.3 Optimizing Cost Residuals with PI²
3.4.4 Summary of PILQR algorithm
3.5 Training Parametric Policies with GPS
3.6 Experimental Evaluation
3.6.1 Simulation Experiments
3.6.2 Real Robot Experiments
3.7 Conclusions
II Self-Supervision
4 Self-Supervised Regrasping using Reinforcement Learning
4.1 Introduction
4.2 Related Work
4.3 Grasp Stability Estimation
4.3.1 Biomimetic Tactile Sensor
4.3.2 Hierarchical Matching Pursuit
4.3.3 Spatio-Temporal HMP
4.4 Self-Supervised Reinforcement Learning for Regrasping
4.4.1 Mapping from Tactile Features to Grasp Adjustments
4.4.2 Policy Search for Learning Mapping Parameters
4.4.3 Learning a General Regrasping Policy
4.5 Experimental Results
4.5.1 Evaluation of Grasp Stability Prediction
4.5.2 Learning Individual Linear Regrasping Policies
4.5.3 Evaluation of General Regrasping Policy
4.6 Conclusions
5 Time-Contrastive Networks: Self-Supervised Learning from Video
5.1 Introduction
5.2 Related Work
5.3 Imitation with Time-Contrastive Networks
5.3.1 Training Time-Contrastive Networks
5.3.2 Learning Robotic Behaviors with Reinforcement Learning
5.3.3 Direct Human Pose Imitation
5.4 Experiments
5.4.1 Discovering Attributes from General Representations
5.4.2 Learning Object Interaction Skills
5.4.3 Self-Regression for Human Pose Imitation
5.5 Conclusions
III Imitation and Transfer Learning
6 Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets
6.1 Introduction
6.2 Related Work
6.3 Preliminaries
6.4 Multi-Modal Imitation Learning
6.4.1 Relation to InfoGAN
6.5 Implementation
6.6 Experiments
6.6.1 Task Setup
6.6.2 Multi-Target Imitation Learning
6.6.3 Multi-Task Imitation Learning
6.7 Conclusions
7 Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience
7.1 Introduction
7.2 Related Work
7.3 Closing the Sim-to-Real Loop
7.3.1 Simulation Randomization
7.3.2 Learning Simulation Randomization
7.3.3 Implementation
7.4 Experiments
7.4.1 Tasks
7.4.2 Simulation Engine
7.4.3 Comparison to Standard Domain Randomization
7.4.4 Real Robot Experiments
7.4.5 Comparison to Trajectory-Based Parameter Learning
7.5 Conclusions
8 Conclusions and Future Work
8.1 Future Work
A Appendix
A.1 Appendix: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning
A.1.1 Derivation of LQR-FLM
A.1.2 PI² Update through Constrained Optimization
A.1.3 Detailed Experimental Setup
A.1.4 Additional Simulation Results
A.2 Appendix: Time-Contrastive Networks: Self-Supervised Learning from Video
A.2.1 Objects Interaction Analysis
A.2.2 Pose Imitation Analysis
A.2.3 Imitation Examples
A.2.4 Imitation Invariance Analysis
A.3 Appendix: Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience
A.3.1 Simulation Parameters
A.3.2 SimOpt Parameters
References
List of Figures
2.1 Door opening and pick-and-place using our path integral guided policy search
algorithm. Door opening can handle variability in the door pose, while the
pick-and-place policy can handle various initial object poses.
2.2 The architecture of our neural network policy.
2.3 Task setup and execution. Left: door opening task. Right: pick-and-place task.
For both tasks, the pose of the object of interest (door or bottle) is randomized,
and the robot must perform the task using monocular camera images from the
camera mounted over the robot’s shoulder.
2.4 Task adaptation success rates over the course of training with PI² and LQR for
single instances of door opening and pick-and-place tasks. Each iteration consists
of 10 trajectory samples.
2.5 Robot RGB camera images used for controlling the robot. Top: door opening
task. Bottom: pick-and-place task.
2.6 Training curves for PIGPS, PIGPSW, and REPS on a simulated point mass
task. Left: Runs chosen based on lowest mean cost at iteration 50; Right: Runs
chosen based on lowest mean cost across all iterations. Each iteration consists of
30 trajectory samples for a single task instance.
2.7 Success rates during training generalized policies with global policy sampling
for door opening (left) and pick-and-place (right). Each iteration consists of 50
trajectory samples: 10 samples of each of the 5 random task instances. Dashed
lines: success rates after local policy sampling training.
3.1 Real robot tasks used to evaluate our method. Left: The hockey task, which
involves discontinuous dynamics. Right: The power plug task, which requires
a high level of precision. Both of these tasks are learned from scratch without
demonstrations.
3.2 We evaluate on a set of simulated robotic manipulation tasks with varying
difficulty. Left to right, the tasks involve pushing a block, reaching for a target,
and opening a door in 3D.
3.3 Left: Average final distance from the block to the goal on one condition of the
gripper pusher task. This condition is difficult due to the block being initialized
far away from the gripper and the goal area, and only PILQR is able to succeed
in reaching the block and pushing it toward the goal. Results for additional
conditions are available in Appendix A.1.4, and the supplementary video
demonstrates the final behavior of each learned policy. Right: Final distance from the
reacher end effector to the target averaged across 300 random test conditions per
iteration. MDGPS with LQR-FLM, MDGPS with PILQR, TRPO, and DDPG all
perform competitively. However, as the log scale for the x axis shows, TRPO and
DDPG require orders of magnitude more samples. MDGPS with PI² performs
noticeably worse.
3.4 Left: Minimum angle in radians of the door hinge (lower is better) averaged
across 100 random test conditions per iteration. MDGPS with PILQR
outperforms all other methods we compare against, with orders of magnitude fewer
samples than DDPG and TRPO, which is the only other successful algorithm.
Right: Single condition comparison of the hockey task performed on the real
robot. Costs lower than the dotted line correspond to the puck entering the goal.
3.5 Left: Experimental setup of the hockey task and the success rate of the final
PILQR-MDGPS policy. Red and blue: goal positions used for training, green:
new goal position. Right: Single condition comparison of the power plug task
performed on the real robot. Note that costs above the dotted line correspond to
executions that did not actually insert the plug into the socket. Only our method
(PILQR) was able to consistently insert the plug all the way into the socket by
the final iteration.
4.1 Regrasping scenario: the robot partially misses the object with one of the fingers
during the initial grasp (left), predicts that the current grasp will be unstable,
places the object down, and adjusts the hand configuration to form a firm grasp
of the object using all of its fingers (right).
4.2 The schematic of the electrode arrangements on the BioTac sensor (left).
Tactile image used for the ST-HMP features (right). The X values are the
reference electrodes. The 19 BioTac electrodes are measured relative to these 4
reference electrodes. V1 and V2 are created by taking an average response
of the neighboring electrodes: V1 = avg(E17, E18, E12, E2, E13, E3) and
V2 = avg(E17, E18, E15, E5, E13, E3).
4.3 Objects and experimental setup used for learning the grasp stability predictor
and the regrasping behavior. If an object falls out of the hand it returns to its
initial position due to the shape of the bowl. Top-left: the cylinder. Top-right:
the box. Bottom-left: the ball. Bottom-right: the novel object.
4.4 Top left: schematic of the electrode arrangements on the BioTac sensor and the
corresponding tactile image used for the ST-HMP features. V1, V2 and V3 are
computed by averaging the neighboring electrode values. Top right, bottom left,
bottom right: reinforcement learning curves for regrasping individual objects
using REPS. Policy updates are performed every 100 regrasps.
5.1 Time-Contrastive Networks (TCN): Anchor and positive images taken from
simultaneous viewpoints are encouraged to be close in the embedding space,
while distant from negative images taken from a different time in the same
sequence. The model trains itself by trying to answer the following questions
simultaneously: What is common between the different-looking blue frames?
What is different between the similar-looking red and blue frames? The resulting
embedding can be used for self-supervised robotics in general, but can also
naturally handle 3rd-person imitation.
5.2 Single-view TCN: positives are selected within a small window around anchors,
while negatives are selected from distant timesteps in the same sequence.
5.3 Training signals for pose imitation: time-contrastive, self-regression and human
supervision. The time-contrastive signal lets the model learn rich representations
of humans or robots individually. Self-regression allows the robot to predict its
own joints given an image of itself. The human supervision signal is collected
from humans attempting to imitate robot poses.
5.4 Simulated dish rack task. Left: Third-person VR demonstration of the dish
rack task. Middle: View from the robot camera during training. Right: Robot
executing the dish rack task.
5.5 Real robot pouring task. Left: Third-person human demonstration of the pouring
task. Middle: View from the robot camera during training. Right: Robot
executing the pouring task.
5.6 Learning progress of the pouring task for two different demonstration videos
(left and right). For each training, only a single 3rd-person human demonstration
is used, as shown in Fig. 5.5. This graph reports the weight in grams measured
from the target recipient after each pouring action (maximum weight is 189g)
along with the standard deviation of all 10 rollouts per iteration.
5.7 TCN for self-supervised human pose imitation: architecture, training and
imitation. The embedding is trained unsupervised with the time-contrastive loss,
while the joints decoder can be trained with self-supervision, human supervision
or both. Output joints can be used directly by the robot planner to perform the
imitation. Human pose is never explicitly represented.
5.8 t-SNE embedding colored by agent for model "TC+Self". We show that images
are locally coherent with respect to pose while being invariant to agent or viewpoint.
6.1 Left: Walker2D running forwards, running backwards, jumping. Right:
Humanoid running forwards, running backwards, balancing.
6.2 Left: Reacher with 2 targets: random initial state, reaching one target, reaching
another target. Right: Gripper-pusher: random initial state, grasping policy,
pushing (when grasped) policy.
6.3 Results of the imitation GAN with (top row) and without (bottom row) the
latent intention cost. Left: Reacher with 2 targets (crosses): final positions of the
reacher (circles) for categorical (1) and continuous (2) latent intention variable.
Right: Reacher with 4 targets (crosses): final positions of the reacher (circles) for
categorical (3) and continuous (4) latent intention variable.
6.4 Left: Rewards of different Reacher policies for 2 targets for different intention
values over the training iterations with (1) and without (2) the latent intention
cost. Right: Two examples of a heatmap for 1 target Reacher using two latent
intentions each.
6.5 Top: Rewards of Walker2D policies for different intention values over the
training iterations with (left) and without (right) the latent intention cost. Bottom:
Rewards of Humanoid policies for different intention values over the training
iterations with (left) and without (right) the latent intention cost.
6.6 Time-lapse of the learned Gripper-pusher policy. The intention variable is
changed manually in the fifth screenshot, once the grasping policy has grasped
the block.
7.1 Policies for opening a cabinet drawer and swing-peg-in-hole tasks trained by
alternately performing reinforcement learning with multiple agents in simulation
and updating the simulation parameter distribution using a few real-world policy
executions.
7.2 The pipeline for optimizing the simulation parameter distribution. After training
a policy on the current distribution, we sample the policy both in the real world and
for a range of parameters in simulation. The discrepancy between the simulated
and real observations is used to update the simulation parameter distribution in
SimOpt.
7.3 An example of a wide distribution of simulation parameters in the
swing-peg-in-hole task where it is not possible to find a solution for many of the task
instances.
7.4 Left: Performance of the policy training with domain randomization for different
variances of the distribution of the cabinet position along the X-axis in the
drawer opening task. Right: Initial distribution of the cabinet position in the
source environment, located at extreme left, slowly starts to change to the target
environment distribution as a function of running 5 iterations of SimOpt.
7.5 Policy performance in the target drawer opening environment trained on
randomized simulation parameters at different iterations of SimOpt. As the source
environment distribution gets adjusted, the policy transfer improves until the
robot can successfully solve the task in the fourth SimOpt iteration.
7.6 Running policies trained in simulation at different iterations of SimOpt for
real-world swing-peg-in-hole and drawer opening tasks. Left: SimOpt adjusts the
physical parameter distribution of the soft rope, peg and the robot, which results
in a successful execution of the task on a real robot after two SimOpt iterations.
Right: SimOpt adjusts the physical parameter distribution of the robot and the drawer.
Before updating the parameters, the robot pushes too much on the drawer handle
with one of its fingers, which leads to opening the gripper. After one SimOpt
iteration, the robot can better control its gripper orientation, which leads to an
accurate task execution.
7.7 Covariance matrix heat maps over 3 SimOpt updates of the swing-peg-in-hole
task beginning with the initial covariance matrix.
A.1 The initial conditions for the gripper pusher task that we train TVLG policies on.
The top left and bottom right conditions are more difficult due to the distance
from the block to the goal and the configuration of the arm. The top left condition
results are reported in Section 3.6.1.
A.2 Top-left, top-right, bottom-left: single condition comparisons of the gripper
pusher task in three additional conditions, which correspond to the top-right,
bottom-right, and bottom-left conditions depicted in Figure A.1, respectively.
The PILQR method outperforms other baselines in two out of the three conditions.
The conditions presented in the top and middle figure are significantly easier
than the other conditions presented in the work. Bottom right: Additional results
on the door opening task.
A.3 t-SNE colored by attribute combinations: TCN (left) does a better job than
ImageNet-Inception (right) at separating combinations of attributes.
A.4 Label-free pouring imitation: nearest neighbors (right) for each reference image
(left) for different models (multi-view TCN, Shuffle & Learn and
ImageNet-Inception). These pouring test images show that the TCN model can distinguish
different hand poses and amounts of poured liquid simply from unsupervised
observation while being invariant to viewpoint, background, objects and subjects,
motion-blur and scale.
A.5 Label-free pose imitation: nearest neighbors (right) for each reference frame
(left) for each row. Although only trained with self-supervision (no human
labels), the multi-view TCN can understand correspondences between humans
and robots for poses such as crouching, reaching up and others while being
invariant to viewpoint, background, subjects and scale.
A.6 Varying the amount of unsupervised data: increasing the number of unsupervised
sequences decreases the imitation error for both models.
A.7 L2 robot error breakdown by robot joints. From left to right, we report errors
for the 8 joints of the Fetch robot, followed by the joints average, followed by
the joints average excluding the "shoulder pan" joint.
A.8 t-SNE embedding before (left) and after (right) training, colored by view. Before
training, we observe concentrated clusters of the same color, indicating that the
manifold is organized in a highly view-specific way, while after training each
color is spread over the entire manifold.
A.9 t-SNE embedding before (left) and after (right) training, colored by agent. Before
training, we observe concentrated clusters of the same color, indicating that the
manifold is organized in a highly agent-specific way, while after training each
color is spread over the entire manifold.
A.10 Testing TC+Human+Self model for orientation invariance: while the error
increases for viewpoints not seen during training (30°, 90° and 150°), it remains
competitive.
A.11 Testing for scale invariance: while the error increases when decreasing the
distance of the camera to the subject (about half way compared to training), it
remains competitive and lower than the human-supervised baseline.
A.12 Self-supervised imitation examples. Although not trained using any human
supervision (model "TC+Self"), the TCN is able to approximately imitate human
subjects unseen during training. Note from rows (1, 2) that the TCN discovered
the mapping between the robot’s torso joint (up/down) and the complex set
of human joints commanding crouching. In rows (3, 4), we change the capture
conditions compared to training (see rows 1 and 2) by using a free-form camera
motion, a close-up scale and introducing some motion-blur, and observe that
imitation is still reasonable.
List of Tables
4.1 Performance of the individual and combined regrasping policies.
5.1 Pouring alignment and classification errors: all models are selected at their lowest
validation loss. The classification error considers 5 classes related to pouring
detailed in Table 5.3.
5.2 Pouring alignment and classification errors: these models are selected using
the classification score on a small labeled validation set, then run on the full test
set. We observe that multi-view TCN outperforms other models with 15x shorter
training time. The classification error considers 5 classes related to pouring: "hand
contact with recipient", "within pouring distance", "container angle", "liquid is
flowing" and "recipient fullness".
5.3 Detailed attributes classification errors, for model selected by validation loss.
5.4 Imitation error for different combinations of supervision signals. The error reported
is the joints distance between prediction and ground-truth. Note perfect imitation is
not possible.
A.1 Drawer opening: simulation parameter distribution.
A.2 Swing-peg-in-hole: simulation parameter distribution.
A.3 Drawer opening: SimOpt parameters.
A.4 Swing-peg-in-hole: SimOpt parameters.
Abstract
As robots enter our daily lives they will have to perform a wide variety of complex tasks, make
sense of their high-dimensional sensory inputs, and cope with highly unstructured, uncertain and
dynamic real-world environments. In order to address these challenges, the robots have to be able
to not only improve their existing skills but also acquire new skills when presented with novel tasks.
In this thesis, we develop several data-driven robot learning approaches and demonstrate their
ability to learn challenging tasks using various sources of training signals and sensory information.
First, this work presents a method for decomposing the training of high-dimensional visuomotor
policies into efficient learning of local components for tasks with non-linear and discontinuous
dynamics. Next, we develop an algorithm for combining model-based optimization with
model-free learning that removes the bias of the simplified models while still maintaining high
sample efficiency for learning tasks with complex dynamics. In the second part of the thesis, the
concept of self-supervision is used to employ external sources of information, such as pre-trained
reward estimators and reward signals from different domains, to better automate the policy training.
In particular, we present a self-supervised framework for learning grasp correction policies using
tactile data and reinforcement learning, and show how robots can learn from third-person human
video demonstrations in a self-supervised manner by building a common representation model
of human videos and robot camera images. In the last part of the thesis, we first concentrate on
imitation learning as a way to accelerate the acquisition of robotic skills, and present a method
for training multi-modal stochastic policies from unstructured and unlabelled demonstrations that
employs recent developments in generative adversarial learning. Finally, we show how to reduce
the number of real-world trials needed by moving a large portion of policy training into simulated
environments and using only small amounts of real-world experience to adjust the distribution of
simulated scenarios for better transferability to the real world.
Chapter 1
Introduction
There are numerous advantages to introducing robots into various areas of our society, such as
elderly care, medicine, construction, and logistics. However, as we start deploying robots in
unstructured real-world environments, they will have to perform a large number of complex skills
and understand their uncertain environment using high-dimensional sensory data. Tedious manual
engineering of robotic skills does not seem to offer a scalable solution that can generalize across
multiple applications where adaptation and learning from experience are required. In order to
address these challenges, robots have to make use of the various sources of data that they encounter
during their operation. They need to be able to use these data to improve their existing skills and
learn new skills when facing previously unseen tasks. In this thesis, we develop a range of methods
for efficient robot learning and employ them to learn complex sensory-based robotic policies.
An important component of a robust robotic skill is the ability to incorporate sensory
information to quickly react to changes in the environment. Furthermore, in many situations it is
not sufficient to only use sensory information at the initial stage of the movement planning, as
the errors in the perceptual system or robot control tend to accumulate over the course of the
movement and result in an inexact or incorrect execution. Sensory feedback needs to be continually
evaluated during the task execution, resulting in closed-loop robotic policies. This not only helps to
continuously correct the robot motion but also relaxes the requirements on the precision of the
models used for planning the movement, or even allows learning behaviors in a model-free manner.
The methods presented in this thesis make use of visual and tactile information to train closed-loop
robotic policies that can generalize across various conditions of a task.
In recent years, machine learning has achieved impressive results in a wide range of
domains, such as computer vision, speech recognition, and healthcare. A specific area that has
wide applications in robotics is reinforcement learning (RL) (R. S. Sutton & Barto, 1998). RL
allows us to concentrate more on designing the goals of the tasks rather than the actual details
of the robot behavior, which is in many cases a much easier and more feasible task for a human.
However, this convenience often comes at the price of a high sample complexity, i.e. a high number
of robot trials required to learn a behavior, as the robot might need to explore a large space of
motions in order to optimize the user’s reward or cost function. Furthermore, automation of the
data collection is a challenging task in itself, as the robot experiments might require a significant
amount of human supervision. In this thesis, we use several techniques to address these challenges.
In the first part of the thesis, we develop methods for improving the efficiency of robot learning
by decomposing the complex policy training. First, we show how to decompose the learning
of complex high-dimensional robot policies that can cope with complex system dynamics into
efficient training of simpler lower-dimensional controllers, for such tasks as door opening and
pick-and-place. Next, we show how to combine a highly efficient model-based optimization with
the more general model-free training and significantly speed up the learning of tasks with mixed
linear and discontinuous dynamics, such as hockey playing.
Although reward functions provide an intuitive way to specify a task, designing suitable ones still
might require in-depth knowledge of the problem and, more critically, often needs specific
instrumentation of the environment that is not realistic in the out-of-the-lab world. In the second
part of this thesis, we employ the concept of self-supervision to show how to use external sources
of information such as pre-trained reward estimators and reward signals from different domains
to mitigate the issue of explicitly specifying and instrumenting the reward function during the
robot operation. First, we employ tactile data to learn a grasp stability predictor that can be used
for learning regrasping policies that adjust the robot hand configuration in order to perform grasp
corrections and improve the grasping performance. Next, we show how a robot can employ external
sources of data, such as third-person videos of human demonstrations of a task, for creating a
training signal that can be used to train a robot policy. This is achieved by constructing a common
embedding space for the human videos and robot camera images that uses multi-view video capture
to provide an unsupervised training signal and learn a model that is invariant to the view angle and
appearance of the objects.
Even the most elaborate RL algorithms often have to explore a large number of robot interactions
in order to achieve the required performance. One way to reduce the amount of needed exploration
is to initialize the robot behaviors using provided demonstrations, e.g. from a human. Learning
from demonstration or imitation learning (IL) has been widely used in previous research (Schaal,
1999; Billard et al., 2008; Argall et al., 2009). However, the traditional IL methods have been
mostly constrained to using isolated demonstrations of a particular skill, often provided in the
form of kinesthetic teaching. In the real world, the demonstrations usually come unstructured,
where particular skills should be first identified and disentangled from each other. In the third
part of this thesis, we start by developing a method that addresses this challenge by learning a
multimodal policy distribution that can disentangle unstructured demonstrations and represent
skills for multiple tasks. We use a generative adversarial framework (Goodfellow et al., 2014) to
train a probabilistic generator policy, and develop an information-theoretic extension that ensures
multimodality and alleviates the problem of mode collapse of the policy.
Finally, the last part of the thesis presents a method for reducing the needed amount of
real-world robot trials by moving a large portion of policy training into simulated environments.
While the collection of real-world robotic data is laborious and expensive, simulators offer several
advantages as they can run faster than real-time and allow for acquiring a large diversity of training
data. However, due to the imprecise simulation models, behaviors learned in simulations often
cannot be directly applied on real-world systems, a problem known as the reality gap (Jakobi et
al., 1995). In this work, we develop a method that learns policies entirely in simulation using a
distribution of simulated scenarios and reduces the reality gap by collecting a small amount of
real-world experience to adjust this distribution for a better transferability to the real world.
1.1 Thesis Outline
This thesis is organized as follows:
• Chapter 2 presents a policy search method for learning complex policies that map directly
from sensory inputs to motor torques, for tasks with nonlinear and discontinuous dynamics.
First, low-dimensional local policies are learned for single task instances using policy
improvement with path integrals (PI²) (Theodorou et al., 2010). Afterwards, a high-dimensional
global policy is trained using guided policy search (Levine & Koltun, 2013).
• Chapter 3 introduces the PILQR algorithm for combining fast model-based optimization
using iterative linear-quadratic regulators (ILQR) (Tassa et al., 2012) and corrective
model-free updates with the PI² framework (Theodorou et al., 2010). The training is performed by
first optimizing the policy using ILQR with fitted linear models (Levine & Abbeel, 2014)
and subsequently minimizing the residual error with PI².
• In Chapter 4, we present a framework for learning regrasping behaviors using tactile data
combined with self-supervision and reinforcement learning. We first learn a grasp stability
predictor using tactile data and then use the output of the predictor as the reward function
for learning the regrasping policy through trial and error. Additionally, we show how to
combine simple linear regrasping policies for single objects into a general high-dimensional
policy with supervised learning.
• Chapter 5 shows a self-supervised method for learning an embedding of human video
demonstrations and robot camera images that can be used for third-person robotic imitation learning.
The embedding model is first learned from multi-view video data with a time-contrastive loss.
Afterwards, the robot learns to imitate by minimizing the distance between the embedding
of the demonstration and its own camera images.
• In Chapter 6, we present our work on imitation learning from unstructured demonstrations.
Multimodal policies are learned from unlabelled demonstrations of multiple tasks using
the generative adversarial framework (Goodfellow et al., 2014) combined with
information-theoretic regularization that ensures the multimodality.
• Chapter 7 describes our approach for improving simulation-to-reality robotic skill transfer by
learning the randomization of simulation parameters. The policies are first trained entirely in
simulation on a distribution of simulated scenarios. Afterwards, we execute a few robot trials
in the real world and use the resulting data to update the simulation parameter distribution in
order to achieve an improved policy transfer.
• In Chapter 8, we conclude and present ideas for future work.
Part I
Training Decomposition
Chapter 2
Path Integral Guided Policy Search
We present a policy search method for learning complex feedback control policies that map from
high-dimensional sensory inputs to motor torques, for manipulation tasks with discontinuous
contact dynamics. We build on a prior technique called guided policy search (GPS), which
iteratively optimizes a set of local policies for specific instances of a task, and uses these to train a
complex, high-dimensional global policy that generalizes across task instances. We extend GPS in
the following ways: (1) we propose the use of a model-free local optimizer based on path integral
stochastic optimal control (PI²), which enables us to learn local policies for tasks with highly
discontinuous contact dynamics; and (2) we enable GPS to train on a new set of task instances
in every iteration by using on-policy sampling: this increases the diversity of the instances that
the policy is trained on, and is crucial for achieving good generalization. We show that these
contributions enable us to learn deep neural network policies that can directly perform torque
control from visual input. We validate the method on a challenging door opening task and a
pick-and-place task, and we demonstrate that our approach substantially outperforms the prior
LQR-based local policy optimizer on these tasks. Furthermore, we show that on-policy sampling
significantly increases the generalization ability of these policies.
2.1 Introduction
Reinforcement learning (RL) and policy search methods have shown considerable promise for
enabling robots to automatically learn a wide range of complex skills (Tedrake et al., 2004;
Kohl & Stone, 2004; Kober et al., 2008; M. Deisenroth et al., 2011), and recent results in deep
reinforcement learning suggest that this capability can be extended to learn nonlinear policies that
integrate complex sensory information and dynamically choose diverse and sophisticated control
strategies (Lillicrap et al., 2016; Levine et al., 2016). However, applying direct deep reinforcement
learning to real-world robotic tasks has proven challenging due to the high sample complexity
of these methods. An alternative to direct deep reinforcement learning is to use guided policy
search (GPS) methods, which use a set of local policies optimized on specific instances of a
task (such as different positions of a target object) to train a global policy that generalizes across
instances (Levine et al., 2016). In this setup, reinforcement learning is used only to train simple
local policies, while the high-dimensional global policy, which might be represented by a deep
neural network, is only trained with simple and scalable supervised learning methods.
Figure 2.1: Door opening and pick-and-place using our path integral guided policy search algorithm.
The door opening policy can handle variability in the door pose, while the pick-and-place policy can
handle various initial object poses.
The GPS framework can in principle use any learner to optimize the local policies. Prior
implementations generally use a model-based method with local time-varying linear models and
a local policy optimization based on linear-quadratic regulators (LQR) (Levine et al., 2016).
We find that this procedure fails to optimize policies on tasks that involve complex contact
switching discontinuities, such as door opening or picking and placing objects. In this work,
we present a method for local policy optimization using policy improvement with path integrals
(PI²) (Theodorou et al., 2010), and demonstrate the integration of this method into the GPS
framework. We then enable GPS to train on new task instances in every iteration by extending
the on-policy sampling approach proposed in recent work on mirror descent guided policy search
(MDGPS) (Montgomery & Levine, 2016). This extension enables robots to continually learn and
improve policies on new task instances as they are experienced in the world, rather than training
on a fixed set of instances in every iteration. This increases the diversity of experience and leads to
substantially improved generalization, as demonstrated in our experimental evaluation.
We present real-world results for localizing and opening a door, as well as localizing and
grasping an object and then placing it upright at a desired target location, shown in Figure 2.1.
Both tasks are initialized from demonstration and learned with our proposed path integral guided
policy search algorithm, using deep visual features fed directly into the neural network policy. Our
experimental results demonstrate that the use of stochastic local optimization with PI² enables our
method to handle the complex contact structure of these tasks, and that the use of random instance
sampling from the global policy enables superior generalization, when compared to prior guided
policy search methods.
2.2 Related Work
Policy search methods have been used in robotics for a variety of tasks, such as manipulation
(Pastor et al., 2009; M. Deisenroth et al., 2011; Chebotar et al., 2014), playing table tennis (Kober
et al., 2011) and ball-in-a-cup (Kober et al., 2008) games, regrasping (Chebotar, Hausman, Su,
Sukhatme, & Schaal, 2016), and locomotion (Kohl & Stone, 2004; Tedrake et al., 2004; Endo et
al., 2008). Most of these works use carefully designed, specialized policies that either employ
domain knowledge, or have a low number of parameters. It has been empirically observed that
training high-dimensional policies, such as deep neural networks, becomes exceedingly difficult
with standard model-free policy search methods (M. Deisenroth et al., 2013). Although deep
reinforcement learning methods have made considerable progress in this regard in recent years
(Schulman et al., 2015; Lillicrap et al., 2016), their high sample complexity has limited their
application to real-world robotic learning problems.
Guided policy search (GPS) (Levine & Koltun, 2013) seeks to address this challenge by
decomposing policy search into trajectory optimization and supervised learning of a general
high-dimensional policy. GPS was applied to various robotic tasks (Levine & Abbeel, 2014; Levine
et al., 2015, 2016). However, the use of a model-based “teacher” to supervise the policy has
placed considerable limitations on such methods. Most prior work has used LQR with fitted local
time-varying linear models as the teacher (Levine & Abbeel, 2014), which can handle unknown
dynamics, but struggles with problems that are inherently discontinuous, such as door opening: if
the robot misses the door handle, it is difficult for a smooth LQR-based optimizer to understand
how to improve the behavior. We extend GPS to tasks with highly discontinuous dynamics and
non-differentiable costs by replacing the model-based LQR supervisor with PI², a model-free
reinforcement learning algorithm based on stochastic optimal control (Theodorou et al., 2010).
PI² has been successfully used to learn parameters of trajectory-centric policies such as dynamic
movement primitives (Ijspeert et al., 2002) in grasping (Stulp et al., 2011), pick-and-place (Stulp,
Theodorou, & Schaal, 2012) and variable impedance control tasks (Stulp, Buchli, et al., 2012).
Compared to policy gradient methods, PI² does not compute a gradient, which is often sensitive to
noise and large derivatives of the expected cost (M. Deisenroth et al., 2013). In (Kalakrishnan et
al., 2011), PI² is used to learn force/torque profiles for door opening and picking tasks to improve
imperfect kinesthetic demonstrations. The policy is represented as an end-effector trajectory, and
the motion is initialized from demonstration. In contrast, we use PI² to learn feedforward joint
torque commands of time-varying linear-Gaussian controllers. In (Englert & Toussaint, 2016), the
robot also learns a policy for the door opening task initialized from demonstration. The training
is divided into analytically learning a high-dimensional projection of the policy using provided
models and Bayesian learning of a lower-dimensional projection for black-box objectives, such
as binary success outcome of the task. In this work, we use the GPS framework to learn the
high-dimensional policies using PI² as the teacher, which can also handle discontinuous cost
functions. Furthermore, we use the controls learned by PI² on several task instances (e.g. several
door positions) to supervise the training of a single deep neural network policy that can succeed
for various door poses using visual input from the robot’s camera.
Stochastic policy search methods can be improved by limiting the information loss between
updates, by means of a KL-divergence constraint (Peters et al., 2010). In this work, we similarly
constrain the KL-divergence between PI² updates, in a framework similar to (Lioutikov et al., 2014)
and (Gómez et al., 2014). In (van Hoof et al., 2015), the authors propose to learn high-dimensional
sensor-based policies through supervised learning using the relative entropy method to reweight
state-action samples of the policy. While the goal of learning high-dimensional nonlinear policies
is similar to our work, we optimize the individual instance trajectories separately, and then combine
them into a single policy with supervised learning. As shown in our simulated experimental
evaluation, this substantially improves the effectiveness of the method and allows us to tackle more
complex tasks. We also extend guided policy search by choosing new random instances at each
iteration, based on the on-policy sampling technique proposed in (Montgomery & Levine, 2016),
which substantially improves the generalization of the resulting policy.
Our deep neural network policies directly use visual features from the robot’s camera to perform
the task. The features are learned automatically on a pose detection proxy task, using an improved
version of the spatial feature points architecture (Levine et al., 2016) based on convolutional neural
networks (CNNs) (Fukushima, 1980; Schmidhuber, 2015). In (Koutník et al., 2014), visual and
control layers of a racing video game policy are learned separately using neuroevolution. Using
pre-trained visual features enables efficient learning of smaller controllers with RL. In our work,
visual features are pre-trained based on object and robot end-effector poses. By combining visual
pre-training, initialization from kinesthetic demonstrations, and global policy sampling with PI²,
we are able to learn complex visuomotor behaviors for contact-rich tasks with discontinuous
dynamics, such as door opening and pick-and-place.
2.3 Background
In this section, we provide background on guided policy search (GPS) (Levine & Abbeel, 2014)
and describe the general framework of model-free stochastic trajectory optimization using policy
improvement with path integrals (PI²) (Theodorou et al., 2010), which serve as the foundations for
our algorithm.
2.3.1 Guided Policy Search
The goal of policy search methods is to optimize the parameters $\theta$ of a policy $\pi_\theta(u_t \mid x_t)$, which defines
a probability distribution over robot actions $u_t$ conditioned on the system state $x_t$ at each time step
$t$ of a task execution. Let $\tau = (x_1, u_1, \ldots, x_T, u_T)$ be a trajectory of states and actions. Given
a task cost function $l(x_t, u_t)$, we define the trajectory cost $l(\tau) = \sum_{t=1}^{T} l(x_t, u_t)$. The policy
optimization is performed with respect to the expected cost of the policy:
$$J(\theta) = E_{\pi_\theta}[l(\tau)] = \int l(\tau)\, p_{\pi_\theta}(\tau)\, d\tau,$$
where $p_{\pi_\theta}(\tau)$ is the policy trajectory distribution given the system dynamics $p(x_{t+1} \mid x_t, u_t)$:
$$p_{\pi_\theta}(\tau) = p(x_1) \prod_{t=1}^{T} p(x_{t+1} \mid x_t, u_t)\, \pi_\theta(u_t \mid x_t).$$
Standard policy gradient methods optimize $J(\theta)$ directly with respect to the parameters $\theta$ by
estimating the gradient $\nabla_\theta J(\theta)$. The main disadvantage of this approach is that it requires
large amounts of policy samples and it is prone to fall into poor local optima when learning
high-dimensional policies with a large number of parameters (M. Deisenroth et al., 2013).
Guided policy search (GPS) introduces a two-step approach for learning high-dimensional
policies by leveraging advantages of trajectory optimization and supervised learning. Instead of
directly learning the policy parameters with reinforcement learning, a trajectory-centric algorithm
is first used to learn simple controllers $p(u_t \mid x_t)$ for trajectories with various initial conditions of
the task. We refer to these controllers as local policies. In this work, we employ time-varying
linear-Gaussian controllers of the form $p(u_t \mid x_t) = \mathcal{N}(K_t x_t + k_t, C_t)$ to represent these local
policies.
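As an illustration of the controller form above, the following minimal sketch evaluates the mean action $K_t x_t + k_t$ and draws a sample from it. It is not the implementation used in this work: the function names are hypothetical, the covariance $C_t$ is assumed to be diagonal so that only the standard library is needed, and all numbers are made up.

```python
import random

def controller_mean(K_t, k_t, x_t):
    """Mean action K_t x_t + k_t for one time step (K_t is a list of rows)."""
    return [sum(K_t[r][c] * x_t[c] for c in range(len(x_t))) + k_t[r]
            for r in range(len(k_t))]

def sample_action(K_t, k_t, std_t, x_t, rng):
    """Draw u_t ~ N(K_t x_t + k_t, C_t), ASSUMING C_t = diag(std_t squared)."""
    mean = controller_mean(K_t, k_t, x_t)
    return [rng.gauss(m, s) for m, s in zip(mean, std_t)]

# Toy usage: 2-D state, 1-D action; the values are illustrative only.
rng = random.Random(0)
u = sample_action(K_t=[[0.5, -1.0]], k_t=[0.2], std_t=[0.1], x_t=[1.0, 2.0], rng=rng)
```

A full covariance would require a Cholesky factor of $C_t$ in place of the per-dimension standard deviations.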
After optimizing the local policies, the optimized controls from these policies are used to
create a training set for learning a complex high-dimensional global policy in a supervised manner.
Hence, the final global policy generalizes to the initial conditions of multiple local policies and
can contain thousands of parameters, which can be efficiently learned with supervised learning.
Furthermore, while trajectory optimization might require the full state $x_t$ of the system to be
known, it is possible to only use the observations $o_t$ of the full state for training a global policy
$\pi_\theta(u_t \mid o_t)$. In this way, the global policy can predict actions from raw observations at test time
(Levine et al., 2016).
Supervised training does not guarantee that the global and local policies match as they might
have different state distributions. In this work, we build on MDGPS, which uses a constrained
formulation of the local policy objective (Montgomery & Levine, 2016):
$$\min_{\theta, p} E_p[l(\tau)] \quad \text{s.t.} \quad D_{KL}(p(\tau) \,\|\, \pi_\theta(\tau)) \le \epsilon,$$
where $D_{KL}(p(\tau) \,\|\, \pi_\theta(\tau))$ can be computed as:
$$D_{KL}(p(\tau) \,\|\, \pi_\theta(\tau)) = \sum_{t=1}^{T} E_p\left[D_{KL}(p(u_t \mid x_t) \,\|\, \pi_\theta(u_t \mid x_t))\right].$$
The MDGPS algorithm alternates between solving the constrained optimization with respect to the
local policies $p$, and minimizing the KL-divergence with respect to the parameters $\theta$ of the global
policy by training the global policy on samples from the local policies with supervised learning.
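When both the local and the global policy output Gaussian action distributions, each per-time-step KL term in the sum above has a closed form. The sketch below is illustrative only: it assumes diagonal covariances, the function names are hypothetical, and the expectation over states is replaced by whatever per-step action distributions the caller supplies.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, diag(var_p)) || N(mu_q, diag(var_q))), diagonal covariances assumed."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def trajectory_kl(local_steps, global_steps):
    """Sum the per-time-step KL terms; each step is a (mean, variance) pair of lists."""
    return sum(kl_diag_gaussians(mp, vp, mq, vq)
               for (mp, vp), (mq, vq) in zip(local_steps, global_steps))
```

For example, two unit-variance one-dimensional Gaussians whose means differ by 1 give a KL-divergence of 0.5 per time step.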
In prior GPS work, the local policies were learned by iteratively fitting time-varying linear
dynamics to data, and then improving the local policies with a KL-constrained variant of LQR.
While this allows for rapid learning of complex trajectories, the smooth LQR method performs
poorly in the presence of severe discontinuities. In this work, we instead optimize the local policies
using the model-free PI² algorithm, which is described in the next section.
2.3.2 Policy Improvement with Path Integrals
PI² is a model-free RL algorithm based on stochastic optimal control and statistical estimation
theory. Its detailed derivation can be found in (Theodorou et al., 2010). We outline the method
below and describe its application to learning feedforward commands of time-varying
linear-Gaussian controllers.
The time-varying linear-Gaussian controllers we use to represent the local policies are given
by $p(u_t \mid x_t) = \mathcal{N}(K_t x_t + k_t, C_t)$. They are parameterized by the feedback gain matrix $K_t$, the
feedforward controls $k_t$, and the covariance $C_t$. In this work, we employ PI² to learn only the
feedforward terms and covariances. After initialization, e.g. from human demonstrations, the
feedback part is kept fixed throughout the optimization. In this manner, we can keep the number of
learned parameters relatively small and learn the local policies with a low number of
policy samples.
Each iteration of PI² involves generating $N$ samples by running the current policy on the robot.
After that, we compute the cost-to-go $S_{i,t}$ and probabilities $P_{i,t}$ for each sample $i \in 1 \ldots N$ and for
each time step $t$:
$$S_{i,t} = S(\tau_{i,t}) = \sum_{j=t}^{T} l(x_{i,j}, u_{i,j}), \qquad P_{i,t} = \frac{e^{-\frac{1}{\eta} S_{i,t}}}{\sum_{i=1}^{N} e^{-\frac{1}{\eta} S_{i,t}}},$$
where $l(x_{i,j}, u_{i,j})$ is the cost of sample $i$ at time $j$. The probabilities follow from the Feynman-Kac
theorem applied to stochastic optimal control (Theodorou et al., 2010). The intuition is that the
trajectories with lower costs receive higher probabilities, and the policy distribution shifts towards a
lower cost trajectory region. The costs are scaled by $\eta$, which can be interpreted as the temperature
of a soft-max distribution.
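The two quantities above can be computed directly from a table of per-step sample costs. The following is a minimal sketch, not the implementation used in this work: the function names and the nested-list layout `costs[i][t]` are assumptions, and subtracting the per-step minimum cost before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math

def cost_to_go(costs):
    """costs[i][t] = l(x_{i,t}, u_{i,t}); returns S[i][t] = sum over j >= t of costs[i][j]."""
    S = []
    for sample in costs:
        running, row = 0.0, []
        for c in reversed(sample):
            running += c
            row.append(running)
        S.append(row[::-1])
    return S

def pi2_probabilities(S, eta):
    """P[i][t]: soft-max over samples i of -S[i][t] / eta, computed per time step t."""
    N, T = len(S), len(S[0])
    P = [[0.0] * T for _ in range(N)]
    for t in range(T):
        s_min = min(S[i][t] for i in range(N))  # shift for numerical stability
        w = [math.exp(-(S[i][t] - s_min) / eta) for i in range(N)]
        z = sum(w)
        for i in range(N):
            P[i][t] = w[i] / z
    return P
```

Lowering `eta` concentrates the probability mass on the cheapest samples, while raising it moves the weights towards uniform.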
After computing the new probabilities $P_{i,t}$, the policy is updated according to a weighted
maximum-likelihood estimate of the policy parameters, which are the means and covariances of
the sampled feedforward controls $k_{i,t}$:
$$k_t = \sum_{i=1}^{N} P_{i,t}\, k_{i,t}, \qquad C_t = \sum_{i=1}^{N} P_{i,t}\, (k_{i,t} - k_t)(k_{i,t} - k_t)^{\top}.$$
Here, we adapt the approach described in (Stulp & Sigaud, 2012) and update the covariance
matrices based on the sample probabilities. This has the advantage of automatically determining
the exploration magnitude for each time step, instead of setting it to a constant as in the original
PI² derivation.
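The weighted maximum-likelihood update for a single time step can be sketched as follows. This is an illustration under assumptions rather than the implementation used in this work: the function name is hypothetical, `k_samples[i]` holds the sampled feedforward vector of sample $i$ at the time step in question, and `P_t[i]` holds the corresponding probability $P_{i,t}$.

```python
def pi2_update(k_samples, P_t):
    """Probability-weighted mean and covariance of sampled feedforward controls."""
    N, d = len(k_samples), len(k_samples[0])
    # Weighted mean: k_t = sum_i P_{i,t} k_{i,t}
    k_new = [sum(P_t[i] * k_samples[i][a] for i in range(N)) for a in range(d)]
    # Weighted covariance: C_t = sum_i P_{i,t} (k_{i,t} - k_t)(k_{i,t} - k_t)^T
    C_new = [[sum(P_t[i]
                  * (k_samples[i][a] - k_new[a])
                  * (k_samples[i][b] - k_new[b]) for i in range(N))
              for b in range(d)]
             for a in range(d)]
    return k_new, C_new
```

With two equally weighted samples `[0, 0]` and `[2, 4]`, the update returns the mean `[1, 2]` and the covariance `[[1, 2], [2, 4]]`.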
2.4 Path Integral Guided Policy Search
In this section, we first describe how to use PI² as the local policy optimizer in guided policy
search, and then introduce global policy sampling as a way to train policies on randomized task
instances using raw observations. Finally, we describe the neural network architecture used to
represent our visuomotor global policies and the pre-training procedure that we use to learn visual
features.
2.4.1 PI² for Guided Policy Search
Using the time-varying linear-Gaussian representation of the local policies makes it straightforward
to incorporate PI² into the GPS framework from Section 2.3.1.
In previous work (Levine & Abbeel, 2014), the optimization of local policies with LQR was
constrained by the KL-divergence between the old and updated trajectory distributions to ensure
steady policy improvement. Similar types of constraints were proposed in prior policy search work
(Peters et al., 2010). When optimizing with PI², we similarly want to limit the change of the policy
and hence avoid sampling too far into unexplored policy regions, so that the policy gradually
converges towards the optimal policy. The policy update step can be controlled by varying the
temperature $\eta$ of the soft-max probabilities $P_{i,t}$. If the temperature is too low, the policy might
converge quickly to a suboptimal solution. If $\eta$ is too high, the policy will converge slowly and the
learning can require a large number of iterations. Prior work set this parameter empirically
(Theodorou et al., 2010).
In this work, we employ an approach similar to relative entropy policy search (REPS) (Peters
et al., 2010) to determine $\eta$ based on the bounded information loss between the old and updated
policies. For that, the optimization goal of the local policy $p(u_t \mid x_t)$ is augmented with a
KL-divergence constraint against the old policy $\bar{p}(u_t \mid x_t)$:
$$\min_{p} E_p[l(\tau)] \quad \text{s.t.} \quad D_{KL}(p(u_t \mid x_t) \,\|\, \bar{p}(u_t \mid x_t)) \le \epsilon,$$
where $\epsilon$ is the maximum KL-divergence between the old and new policies. The Lagrangian of this
problem depends on $\eta$:
$$L_p(p, \eta) = E_p[l(\tau)] + \eta \left[ D_{KL}(p(u_t \mid x_t) \,\|\, \bar{p}(u_t \mid x_t)) - \epsilon \right].$$
We compute the temperatures $\eta_t$ separately for each time step by optimizing the dual function
$g(\eta_t)$ with respect to the cost-to-go of the policy samples:
$$g(\eta_t) = \eta_t \epsilon + \eta_t \log \frac{1}{N} \sum_{i=1}^{N} e^{-\frac{1}{\eta_t} S_{i,t}}.$$
The derivation of the dual function follows (Peters et al., 2010), but is performed at each time step
independently as in (Lioutikov et al., 2014).
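To make the role of the dual function concrete, the sketch below evaluates $g(\eta_t)$ for one time step and minimizes it numerically. It is a simplified illustration, not the optimizer used in this work: the function names, the search bounds, and the choice of a ternary search over $\log \eta_t$ are all assumptions made to keep the example dependency-free.

```python
import math

def dual(eta, S_t, epsilon):
    """g(eta) = eta*epsilon + eta*log((1/N) sum_i exp(-S_i/eta)), via a stable log-sum-exp."""
    s_min = min(S_t)
    lse = math.log(sum(math.exp(-(s - s_min) / eta) for s in S_t) / len(S_t))
    return eta * epsilon + eta * lse - s_min

def solve_temperature(S_t, epsilon, lo=1e-3, hi=1e3, iters=200):
    """Minimize g on [lo, hi] by ternary search in log-space (the bounds are assumptions)."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1, m2 = a + (b - a) / 3.0, b - (b - a) / 3.0
        if dual(math.exp(m1), S_t, epsilon) < dual(math.exp(m2), S_t, epsilon):
            b = m2
        else:
            a = m1
    return math.exp((a + b) / 2.0)
```

When the minimizer lies strictly inside the search interval, the soft-max probabilities computed with the resulting temperature have a KL-divergence of approximately $\epsilon$ from the uniform weighting of the samples, which is how the bound on the policy change is enforced.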
By replacing $\bar{p}(u_t \mid x_t)$ with the current global policy $\pi_\theta(u_t \mid x_t)$, we obtain the mirror gradient
descent formulation of GPS (MDGPS) (Montgomery & Levine, 2016):
$$\min_{p} E_p[l(\tau)] \quad \text{s.t.} \quad D_{KL}(p(u_t \mid x_t) \,\|\, \pi_\theta(u_t \mid x_t)) \le \epsilon.$$
We can therefore use the same relative entropy optimization procedure for $\eta$ to limit deviation
of the updated local policy from the old global policy, which is also used in our global policy
sampling scheme as described in the next section.
Algorithm 1: MDGPS with PI² and Global Policy Sampling
1: for iteration $k \in \{1, \ldots, K\}$ do
2:   Generate samples $\mathcal{D} = \{\tau_i\}$ by running noisy $\pi_\theta$ on each randomly sampled instance
3:   Perform one step of optimization with PI² independently on each instance:
       $\min_{p} E_p[l(\tau)]$ s.t. $D_{KL}(p(u_t \mid x_t) \,\|\, \pi_\theta(u_t \mid x_t)) \le \epsilon$
4:   Train global policy with optimized controls using supervised learning:
       $\theta \leftarrow \arg\min_\theta \sum_{i,t} D_{KL}(\pi_\theta(u_t \mid x_{i,t}) \,\|\, p(u_t \mid x_{i,t}))$
5: end for
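The control flow of Algorithm 1 can be sketched as a short loop. This is only a structural outline under assumptions: every callable passed in (`sample_instance`, `rollout`, `pi2_step`, `fit_global`) is a hypothetical placeholder for a component described in the text, not an interface from this work.

```python
def run_pigps(theta, sample_instance, rollout, pi2_step, fit_global,
              num_iters, instances_per_iter, samples_per_instance):
    """Outer loop of MDGPS with PI2 and global policy sampling (structure only)."""
    for _ in range(num_iters):
        training_set = []
        for _ in range(instances_per_iter):
            instance = sample_instance()               # new random task instance
            samples = [rollout(theta, instance)        # noisy global-policy rollouts
                       for _ in range(samples_per_instance)]
            # One KL-constrained PI2 step, run independently for this instance
            training_set.append(pi2_step(samples, theta))
        # Supervised learning of the global policy on the optimized controls
        theta = fit_global(theta, training_set)
    return theta
```

Because each iteration draws fresh instances, no local policy has to be stored between iterations; only the global policy parameters are carried over.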
2.4.2 Global Policy Sampling
In the standard GPS framework, the robot trajectories are always sampled from the local policies.
This has the limitation of being constrained to a fixed set of task instances and initial conditions
for the entire learning process. However, if we use PI² with a constraint against the previous
global policy, we in fact do not require a previous local policy for the same instance, and can
therefore sample new instances at each iteration. This helps us to train global policies with a better
generalization to various task conditions.
The approach, summarized in Algorithm 1, is similar to MDGPS and inherits its theoretical
underpinnings. However, unlike standard MDGPS, we sample new instances (e.g. new poses of
the door) at each iteration. We then perform a number of rollouts for each instance using the global
policy $\pi_\theta(u_t \mid o_t)$, with added noise. These samples are used to perform a single optimization
step with PI², independently for each instance. As we use samples from the global policy, the
optimization is now constrained against KL-divergence to the old global policy and can be solved
through optimization of the temperature, as described in the previous section. After performing
one step of local policy optimization with PI², the updated controls are fed back into the global
policy with supervised learning. The covariance of the global policy noise is updated by averaging
the covariances of the local policies at each iteration: $C_{\pi_\theta} = \sum_{i,t} C_{p_i,t} / (NT)$.
Our approach is different from direct policy gradient on the parameters of the global policy.
Although we sample from a high-dimensional global policy, the optimization is still performed in
a lower-dimensional action space with a trajectory-centric algorithm that can take advantage of
the local trajectory structure. Consequently, the global policy is guided by this optimization with
supervised learning.
2.4.3 Global Policy Initialization
It is important to note that, especially for real robot tasks, it is often not safe or efficient to start
with an uninitialized or randomly initialized global policy, particularly when using very general
representations like neural networks. Therefore, in our work we initialize the global policies by
performing several iterations of standard GPS with local policy sampling using PI² on a fixed set
of task instances. In this case, the cost-to-go $S_{i,t}$ in PI² is augmented with a KL-divergence penalty
against the global policy as described in (Levine et al., 2016) (Appendix A), but the optimization is
performed using the KL-divergence constraint against the previous local policy. We also initialize
the local policies with kinesthetic teaching, to provide the algorithm with the overall structure
of the task at the start of training. After initialization, the global policy can already occasionally
perform the task, but is often overfitted to the initial task instances. By training further with global
policy sampling on random task instances, we can greatly increase the generalization capability of
the policy, as demonstrated in our experimental evaluation.
[Figure 2.2 diagram. Labels recovered from the figure: input image (256 x 320 x 3); convolutional
feature maps with 16 and 32 channels; spatial softmax (expected 2D location) giving 64 feature point
values; pre-training output of object and robot pose (18); robot state (33); fully connected layers
(40, 40); output joint torques (7).]
Figure 2.2: The architecture of our neural network policy. The input RGB image is passed through
a 3x3 convolution with stride 2 to generate 16 features at a lower resolution. The next 5 layers are
3x3 convolutions followed by 2x2 max-pooling, each of which outputs 32 features at successively
reduced resolutions and increased receptive field. The outputs of these 5 layers are recombined by
passing each of them into a 1x1 convolution, converting them to a size of 125x157 by using
nearest-neighbor upscaling, and summation (similar to (Tompson et al., 2014)). A final 1x1 convolution is
used to generate 32 feature maps. The spatial soft-argmax operator (Levine et al., 2016) computes
the expected 2D image coordinates of each feature. A fully connected layer is used to compute the
object and robot pose from these expected 2D feature coordinates for pre-training the vision layers.
The feature points for the current image are concatenated with feature points from the image at the
first timestep as well as the 33-dimensional robot state vector, before being passed through two
fully connected layers to produce the output joint torques.
2.4.4 Learning Visuomotor Policies
Our goal is to use the proposed path integral guided policy search algorithm to learn policies for
complex object manipulation skills. Besides learning how to perform the physical motions, these
policies must also interpret visual inputs to localize the objects of interest. To that end, we use
learned deep visual features as observations for the policy. The visual features are produced by
a convolutional neural network, and the entire policy architecture, including the visual features
and the fully connected motor control layers, is shown in Figure 2.2. Our architecture resembles
prior work (Levine et al., 2016), with the visual features represented by feature points produced
via a spatial softmax applied to the last convolutional response maps. Unlike prior work, our
convolutional network includes pooling and skip connections, which allow the visual features
to incorporate information at various scales: low-level, high-resolution, local features as well as
higher-level features with larger spatial context. This serves to limit the amount of computation
performed at high resolutions while still generating high-resolution features, enabling evaluation
of this deep model at camera frame rates.

Figure 2.3: Task setup and execution. Left: door opening task. Right: pick-and-place task. For both
tasks, the pose of the object of interest (door or bottle) is randomized, and the robot must perform
the task using monocular camera images from the camera mounted over the robot’s shoulder.
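To make the feature-point computation concrete, the following is a minimal NumPy sketch of a spatial soft-argmax over a stack of convolutional response maps. It is an illustration only: the normalization of image coordinates to [-1, 1], the absence of a learned temperature, and the function name `spatial_soft_argmax` are assumptions for this example rather than details taken from the thesis.

```python
import numpy as np

def spatial_soft_argmax(feature_maps):
    """Expected 2D location of each feature map.

    feature_maps: array of shape (C, H, W).
    Returns an array of shape (C, 2) with the expected (x, y) position of each
    channel, in coordinates normalized to [-1, 1] (an assumed convention).
    """
    c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax over all pixels
    probs = probs.reshape(c, h, w)

    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    expected_x = (probs.sum(axis=1) * xs).sum(axis=1)  # marginal over rows
    expected_y = (probs.sum(axis=2) * ys).sum(axis=1)  # marginal over columns
    return np.stack([expected_x, expected_y], axis=1)

# Example: 32 response maps with a sharp peak in the top-right corner.
maps = np.zeros((32, 5, 7))
maps[:, 0, 6] = 50.0
points = spatial_soft_argmax(maps)  # shape (32, 2)
feature_vector = points.reshape(-1)
```

With 32 feature maps, flattening the output gives a 64-dimensional vector, which matches the 64 feature points indicated in Figure 2.2.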
We train the network in two stages. First, the convolutional layers are pretrained with a proxy
pose detection objective. To create data for this pretraining phase, we collect camera images while
manually moving the object of interest (the door or the bottle for the pick-and-place task) into
various poses, and automatically label each image by using a geometry-based pose estimator based
on the point pair feature (PPF) algorithm (Hinterstoisser et al., 2016). We also collect images of
the robot learning the task with PI² (without vision), and label these images with the pose of the
robot end-effector obtained from forward kinematics. Each pose is represented as a 9-DoF vector,
containing the positions of three points rigidly attached to the object (or robot), represented in
the world frame. The convolutional layers are trained using stochastic gradient descent (SGD)
with momentum to predict the end-effector and object poses, using a standard Euclidean loss. The
fully connected layers of the network are then trained using path integral guided policy search
to produce the joint torques, while the weights in the convolutional layers are frozen. Since our
model does not use memory, we include the first camera image feature points in addition to the
current image to allow the policy to remember the location of the object in case of occlusions
by the arm. In future work, it would be straightforward to also fine-tune the convolutional layers
end-to-end with guided policy search as in prior work (Levine et al., 2016), but we found that we
could obtain satisfactory performance on the door and pick-and-place tasks without end-to-end
training of the vision layers.
2.5 Experiments
The first goal of our experiments is to compare the performance of guided policy search with PI²
to the previous LQR-based variant on real robotic manipulation tasks with discontinuous dynamics
and non-differentiable costs. We evaluate the algorithms on door opening and pick-and-place tasks.
The second goal is to evaluate the benefits of global policy sampling, where new random instances
(e.g. new door poses) are chosen at each iteration. To that end, we compare the generalization
capabilities of policies trained with and without resampling of new instances at each iteration.
Finally, we present simulated comparisons to evaluate the design choices in our method, including
the particular variant of PI² proposed in this work.
2.5.1 Experimental Setup
We use a lightweight 7-DoF robotic arm with a two-finger gripper, and a camera mounted behind
the arm looking over the shoulder (see Figure 2.3). The input to the policy consists of monocular
RGB images, with depth data used only for ground truth pose detection during pretraining. The
robot is controlled at a frequency of 20 Hz by directly sending torque commands to all seven joints.
We use different fingers for the door opening and pick-and-place tasks.
Door opening
In the door opening task, the goal of the robot is to open the door as depicted in Figure 2.3 on the
left. The cost is based on an IMU sensor attached to the door handle. The desired IMU readings,
which correspond to a successfully opened door, are recorded during kinesthetic teaching of the
opening motion from human demonstration. We additionally add joint velocity and control costs
multiplied by a small constant to encourage smooth motions.
Pick-and-place
The goal of the pick-and-place task is to pick up a bottle and place it upright at a goal position, as
shown in Figure 2.3 on the right. The cost is based on the deviation of the final pose of the bottle
from the goal pose. We use object detection with PPF (Hinterstoisser et al., 2016) to determine
the final pose to evaluate the cost. This pose is only used for cost evaluation, and is not provided
to the policy. The goal pose is recorded during the human demonstration of the task. The cost is
based on the quadratic distance between three points projected into the goal and final object pose
transformations, and we again add small joint velocity and control costs.
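The three-point pose cost described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the reference points `OBJECT_POINTS`, the pose representation as a (rotation matrix, translation) pair, and the function names are hypothetical, and the small joint velocity and control terms are omitted.

```python
import numpy as np

# Hypothetical reference points in the object frame (meters); the thesis does
# not list the actual points that were used.
OBJECT_POINTS = np.array([
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1],
])

def transform_points(rotation, translation, points=OBJECT_POINTS):
    """Map the object-frame points into the world frame for a given pose."""
    return points @ rotation.T + translation

def pose_cost(final_pose, goal_pose):
    """Sum of squared distances between the three points under both poses."""
    final_pts = transform_points(*final_pose)
    goal_pts = transform_points(*goal_pose)
    return float(np.sum((final_pts - goal_pts) ** 2))

# Example: final pose shifted by 5 cm along x relative to the goal pose.
goal = (np.eye(3), np.zeros(3))
final = (np.eye(3), np.array([0.05, 0.0, 0.0]))
cost = pose_cost(final, goal)
pose_vector = transform_points(*goal).reshape(-1)  # 9 numbers per pose
```

Using three non-collinear points captures both position and orientation errors in a single quadratic term, and flattening the transformed points gives a 9-dimensional pose vector of the kind used as the pretraining target.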
2.5.2 Single-Instance Comparisons
Door opening
We first evaluate how our method performs on a single task instance (a single position of the door)
without vision, so as to construct a controlled comparison against prior work. After recording the
door opening motion from demonstration, the door is displaced by 5 cm away from the robot. The
robot must adapt the demonstrated trajectory to the new door position and recover the opening
motion. The linear-Gaussian policy has 100 time steps with torque commands for each of the joints.
We add control noise at each of the time steps to start exploration around the initial trajectory. The
amount of the initial noise is set such that the robot can touch the door handle at least 10% of the
time, since the only feedback about the success of the task comes from the IMU readings on the
door handle.
We compare the PI² variant of GPS to the more standard LQR version (Levine et al., 2016)
over 10 iterations, with 10 sampled trajectories at each iteration. Figure 2.4 shows the success
rates for opening the displaced door using LQR and PI². In the initial trials, the robot is able to
either open the door once or touch the door handle a few times. After the first two policy updates,
PI² could open the door in 50% of the samples, and after three updates the door could be opened
consistently. LQR could not handle the nonlinearity of the dynamics due to the contacts with
the door and discontinuity in the cost function between moving and not moving the door handle.
This shows that the model-free PI² method is better suited for tasks with intermittent contacts and
non-differentiable costs.

Figure 2.4: Task adaptation success rates over the course of training with PI² and LQR for single
instances of door opening and pick-and-place tasks. Each iteration consists of 10 trajectory
samples. (Plot "Task adaptation": success rate (%) versus iteration; curves: Door opening LQR,
Door opening PI², Pick-and-place PI².)
Pick-and-place
In the pick-and-place task adaptation experiment, the goal of the robot is to adapt the demonstrated
trajectory to a displaced and rotated object. The bottle is displaced by 5 cm and rotated by 30
degrees from its demonstrated position. The local policy consists of 200 time steps. The initial
noise is set such that the robot is able to grasp or partially grasp the bottle at its new position at
least 10% of the time. Figure 2.4 shows performance of PI² on recovering the pick-and-place
behavior for the displaced bottle. We do not compare this performance to LQR as our cost function
could not be modeled with a linear-quadratic approximation. The robot is able to achieve a 100%
success rate at placing the bottle upright at the target after 8 iterations. The learning is slower than
in the door task, and more exploration is required to adapt the demonstration, since the robot has to
not only grasp the object but also place it into a stable upright position to succeed. This is difficult,
as the bottle might rotate in hand during grasping, or might tip over when the robot releases it at
the target. Hence, both stages of the task must be executed correctly in order to achieve success.
We noticed that the robot learned to grasp the bottle slightly above its center of mass, which made
stable placement easier. Furthermore, this made the bottle rotate in the hand towards a vertical
orientation when placing it at the goal position.
Simulation
The goal of this experiment is to compare the use of PI² with GPS and global policy sampling
(called PI-GPS), with an approach based on relative entropy policy search (REPS) (Peters et
al., 2010), wherein the global policy is directly trained using the samples reweighted by their
probabilities $P_{i,t}$ as in (van Hoof et al., 2015), without first fitting a local policy. We also evaluate a
hybrid algorithm which uses PI² to optimize a local policy, but reuses the per-sample probabilities
$P_{i,t}$ as weights for training the global policy (called PI-GPS-W). For these evaluations, we simulate
a 2-dimensional point mass system with second-order dynamics, where the task involves moving
the point mass from the start state to the goal. The state space consists of positions and velocities
($\mathbb{R}^4$), and the action space corresponds to accelerations ($\mathbb{R}^2$). We use a fully connected neural
network consisting of two hidden layers of 40 units each with ReLU activations as the global
policy. Each algorithm is tuned for optimal performance using a grid search over hyperparameters.

Figure 2.5: Robot RGB camera images used for controlling the robot. Top: door opening task.
Bottom: pick-and-place task.

The results of this experiment are shown in Figure 2.6. PI-GPS achieves lower costs at
convergence, and with fewer samples. We also observe that PI-GPS-W and REPS have a tendency
to become unstable without achieving convergence when training complex neural network policies. We
found that this effect disappears when training a linear policy, which may suggest that training
a nonlinear policy with higher representation power is more stable when the training examples
are consistent (like those generated from an optimized local policy in PI-GPS), rather than noisy
reweighted samples (like the ones used in REPS).
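For readers who want to reproduce a comparable test bed, a 2-dimensional point mass with second-order dynamics can be sketched as below. The integration scheme (semi-implicit Euler), the time step `dt`, and the function name are assumptions for illustration; the thesis does not specify the simulator details.

```python
import numpy as np

def point_mass_step(state, action, dt=0.05):
    """One step of a 2-D point mass with second-order dynamics.

    state  = [x, y, vx, vy]  (positions and velocities, in R^4)
    action = [ax, ay]        (accelerations, in R^2)
    dt is an assumed integration step, not a value from the thesis.
    """
    state = np.asarray(state, dtype=float)
    action = np.asarray(action, dtype=float)
    pos, vel = state[:2], state[2:]
    new_vel = vel + dt * action   # integrate acceleration
    new_pos = pos + dt * new_vel  # integrate updated velocity
    return np.concatenate([new_pos, new_vel])

# Example: accelerate from rest along x for one step of 0.1 s.
next_state = point_mass_step(np.zeros(4), np.array([1.0, 0.0]), dt=0.1)
```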
2.5.3 Evaluating Generalization
In this section, we evaluate the generalization performance of guided policy search with PI² on
randomized instances (e.g. random door poses or bottle positions), so as to determine the degree
to which global policy sampling improves generalization. We learn visuomotor neural network
policies for door opening and pick-and-place tasks that map RGB images from the camera to the
torque commands.
Figure 2.6: Training curves for PI-GPS, PI-GPS-W, and REPS on a simulated point mass task
(cost ± standard deviation versus iteration).
Left: Runs chosen based on lowest mean cost at iteration 50; Right: Runs chosen based on lowest
mean cost across all iterations. Each iteration consists of 30 trajectory samples for a single task
instance.
Door opening
The goal of applying global policy sampling in the door task is to teach the robot to open the door
placed at any pose inside the training area. The variation of the door position is 16 cm in the $x$
direction and 8 cm in the $y$ direction. The orientation variation is 60° (±30° from the parallel
orientation of the door with respect to the table edge). Figure 2.5 (top) shows examples of the
images from the robot’s camera during these tasks. Training is initialized from 5 demonstrations,
corresponding to 5 door poses. The convolutional layers of the network are trained using 1813
images of the door in different poses, and an additional 15150 images of the robot as described in
Section 2.4.4.
We compare the performance of the global policy trained with the standard local policy
sampling method on the demonstrated door poses (as in prior work (Levine et al., 2016)) to global
policy sampling, where new random door poses are sampled for each iteration. We first perform
two iterations of standard GPS with local policy sampling to initialize the global policy, since
sampling from an untrained neural network would not produce reasonable behavior. In the global
policy sampling mode, we then train for 5 iterations with random door poses each time, and
compare generalization performance against a version that instead is trained for 5 more standard
local policy sampling iterations. In both cases, 7 total iterations are performed, with each iteration
consisting of 10 trajectory samples for each of 5 instances (for a total of 50 samples per iteration).
PI² is used to update each instance independently using the corresponding 10 samples. During
global policy sampling, we increase the control noise after initialization to add enough exploration
to touch the handle of a randomly placed door in at least 10% of the rollouts. Figure 2.7 (left)
shows the average success rates during training.
To test each policy, we evaluate it on 30 random door poses in the training area. When trained
using only local policy sampling on the same set of 5 instances at each iteration, the robot is able
to successfully open the door in 43.3% of the test instances. When trained with global policy
sampling, with a new set of random instances at each iteration, the final policy is able to open the
door in 93.3% of the test instances.

Figure 2.7: Success rates during training generalized policies with global policy sampling for
door opening (left) and pick-and-place (right). Each iteration consists of 50 trajectory samples:
10 samples of each of the 5 random task instances. Dashed lines: success rates after local policy
sampling training. (Axes in both plots: success rate (%) versus iteration.)

It is important to note that, during global policy sampling, we had to use relatively low learning
rates to train the neural network. In particular, after initialization we had to reduce the learning
rate by a factor of 5, from $5 \times 10^{-3}$ to $10^{-3}$. Otherwise, we faced the problem of the policy
“forgetting” the old task instances and performing poorly on them when trained on the new random
set of instances each iteration. This could be mitigated in future work by using experience replay
and reusing old instances for training.
Pick-and-place
In the pick-and-place global sampling experiments, we teach the robot to pick up a bottle placed
at any position inside a predefined rectangular training area of the table and place it vertically at
the goal position. The size of the training area is 30 x 40 cm and the orientation variation of the
bottle is 120°. Similar to the door opening task, training is initialized from 5 demonstrations, and the
convolutional layers are trained using 2708 images with object poses and 14070 task execution
images with end-effector poses. Example camera images are shown in Figure 2.5 (bottom).
As in the door task, the policy is initialized with 2 local sampling iterations. After that, we run
20 iterations of global policy sampling, each consisting of 10 trajectory samples from each of 5
random instances. During training, we gradually increased the training area by moving away from
the demonstrated bottle poses. This continuation method was necessary to ensure that the bottle
could be grasped at least 10% of the time during the trials without excessive amounts of added
noise, since the initial policy did not generalize effectively far outside of the demonstration region.
We evaluated the global policy on 30 random bottle poses. After the two initialization iterations,
the global policy is able to successfully place the bottle on the goal position with a success rate of
40%. After finishing the training with global policy sampling, the robot succeeds 86.7% of the
time. Figure 2.7 (right) shows the average success rates over the course of the training. Similar to
the task adaptation experiments, the learning of the pick-and-place behavior is slower than on the
door opening task, and requires more iterations. Because the size of the training region increased
over the course of training, the performance does not improve steadily: the training instances
are harder (i.e., more widely distributed) in the later iterations.
We noticed that the final policy had less variation of the gripper orientation during grasping
than in the demonstrated instances. The robot exploited the compliance of the gripper to grasp the
bottle with only a slight change of the orientation. In addition, the general speed of the motion
decreased over the course of learning, such that the robot could place the object more carefully on
the goal position.
During the global policy sampling phase, we had to reduce the learning rate of the SGD training
even more than in the door task by setting it to $10^{-4}$ (compared to $5 \times 10^{-3}$ for local policy
sampling) to avoid forgetting old task instances.
2.6 Conclusions
We presented the path integral guided policy search algorithm, which combines stochastic policy
optimization based on PI² with guided policy search for training high-dimensional, nonlinear
neural network policies for vision-based robotic manipulation skills. The main contributions of
our work include a KL-constrained PI² method for local policy optimization, as well as a global
policy sampling scheme for guided policy search that allows new task instances to be sampled
at each iteration, so as to increase the diversity of the data for training the global policy and
thereby improve generalization. We evaluated our method on two challenging manipulation tasks
characterized by intermittent and variable contacts and discontinuous cost functions: opening a
door and picking and placing a bottle. Our experimental evaluation shows that PI² outperforms
the prior LQR-based local policy optimization method, and that global policy sampling greatly
improves the generalization capabilities of the learned policy.
Chapter 3
Combining Model-Based and Model-Free Updates for
Trajectory-Centric Reinforcement Learning
Reinforcement learning algorithms for real-world robotic applications must be able to handle
complex, unknown dynamical systems while maintaining data-efficient learning. These requirements
are handled well by model-free and model-based RL approaches, respectively. In this work, we
aim to combine the advantages of these approaches. By focusing on time-varying linear-Gaussian
policies, we enable a model-based algorithm based on the linear-quadratic regulator that can be
integrated into the model-free framework of path integral policy improvement. We can further
combine our method with guided policy search to train arbitrary parameterized policies such as
deep neural networks. Our simulation and real-world experiments demonstrate that this method
can solve challenging manipulation tasks with comparable or better performance than model-free
methods while maintaining the sample efficiency of model-based methods.
3.1 Introduction
Reinforcement learning (RL) aims to enable automatic acquisition of behavioral skills, which
can be crucial for robots and other autonomous systems to behave intelligently in unstructured
real-world environments. However, real-world applications of RL have to contend with two
often opposing requirements: data-efficient learning and the ability to handle complex, unknown
dynamical systems that might be difficult to model. Real-world physical systems, such as robots,
are typically costly and time consuming to run, making it highly desirable to learn using the lowest
possible number of real-world trials. Model-based methods tend to excel at this (M. Deisenroth et
al., 2013), but suffer from significant bias, since complex unknown dynamics cannot always be
modeled accurately enough to produce effective policies. Model-free methods have the advantage
of handling arbitrary dynamical systems with minimal bias, but tend to be substantially less
sample-efficient (Kober et al., 2013; Schulman et al., 2015). Can we combine the efficiency of
model-based algorithms with the final performance of model-free algorithms in a method that we
can practically use on real-world physical systems?
As we will discuss in Section 3.2, many prior methods that combine model-free and model-based
techniques achieve only modest gains in efficiency or performance (Heess et al., 2015; Gu et
al., 2016). In this work, we aim to develop a method in the context of a specific policy
representation: time-varying linear-Gaussian controllers. The structure of these policies provides us with an
effective option for model-based updates via iterative linear-Gaussian dynamics fitting (Levine
& Abbeel, 2014), as well as a simple option for model-free updates via the path integral policy
improvement (PI²) algorithm (Theodorou et al., 2010).

Figure 3.1: Real robot tasks used to evaluate our method. Left: The hockey task which involves
discontinuous dynamics. Right: The power plug task which requires high level of precision. Both
of these tasks are learned from scratch without demonstrations.

Although time-varying linear-Gaussian (TVLG) policies are not as powerful as representations
such as deep neural networks (Mnih et al., 2013; Lillicrap et al., 2016) or RBF networks
(M. Deisenroth et al., 2011), they can represent arbitrary trajectories in continuous state-action spaces.
Furthermore, prior work on guided policy search (GPS) has shown that TVLG policies can be used to train
general-purpose parameterized policies, including deep neural network policies, for tasks involving
complex sensory inputs such as vision (Levine & Abbeel, 2014; Levine et al., 2016). This yields a
general-purpose RL procedure with favorable stability and sample complexity compared to fully
model-free deep RL methods (Montgomery et al., 2017).
The main contribution of this work is a procedure for optimizing TVLG policies that integrates
both fast model-based updates via iterative linear-Gaussian model fitting and corrective model-free
updates via the PI² framework. The resulting algorithm, which we call PILQR, combines the
efficiency of model-based learning with the generality of model-free updates and can solve complex
continuous control tasks that are infeasible for either linear-Gaussian models or PI² by itself, while
remaining orders of magnitude more efficient than standard model-free RL. We integrate this
approach into GPS to train deep neural network policies and present results both in simulation and
on a real robotic platform. Our real-world results demonstrate that our method can learn complex
tasks, such as hockey and power plug plugging (see Figure 3.1), each with less than an hour of
experience and no user-provided demonstrations.
3.2 Related Work
The choice of policy representation has often been a crucial component in the success of an RL
procedure (M. Deisenroth et al., 2013; Kober et al., 2013). Trajectory-centric representations, such
as splines (Peters & Schaal, 2008), dynamic movement primitives (Schaal et al., 2003), and TVLG
controllers (Lioutikov et al., 2014; Levine & Abbeel, 2014) have proven particularly popular in
robotics, where they can be used to represent cyclical and episodic motions and are amenable to a
range of efficient optimization algorithms. In this work, we build on prior work in trajectory-centric
RL to devise an algorithm that is both sample-efficient and able to handle a wide class of tasks, all
while not requiring human demonstration initialization.
More general representations for policies, such as deep neural networks, have grown in
popularity recently due to their ability to process complex sensory input (Mnih et al., 2013; Lillicrap
et al., 2016; Levine et al., 2016) and represent more complex strategies that can succeed from a
variety of initial conditions (Schulman et al., 2015, 2016). While trajectory-centric representations
are more limited in their representational power, they can be used as an intermediate step toward
efficient training of general parameterized policies using the GPS framework (Levine et al., 2016).
Our proposed trajectory-centric RL method can also be combined with GPS to supervise the
training of complex neural network policies. Our experiments demonstrate that this approach is
several orders of magnitude more sample-efficient than direct model-free deep RL algorithms.
Prior algorithms for optimizing trajectory-centric policies can be categorized as model-free
methods (Theodorou et al., 2010; Peters et al., 2010), methods that use global models
(M. Deisenroth et al., 2014; Pan & Theodorou, 2014), and methods that use local models (Levine & Abbeel,
2014; Lioutikov et al., 2014; Akrour et al., 2016). Model-based methods typically have the
advantage of being fast and sample-efficient, at the cost of making simplifying assumptions about
the problem structure such as smooth, locally linearizable dynamics or continuous cost functions.
Model-free algorithms avoid these issues by not modeling the environment explicitly and instead
improving the policy directly based on the returns, but this often comes at a cost in sample
efficiency. Furthermore, many of the most popular model-free algorithms for trajectory-centric
policies use example demonstrations to initialize the policies, since model-free methods require a
large number of samples to make large, global changes to the behavior (Theodorou et al., 2010;
Peters et al., 2010; Pastor et al., 2009).
Prior work has sought to combine model-based and model-free learning in several ways.
(Farshidian et al., 2014) also use LQR and PI², but do not combine these methods directly into one
algorithm, instead using LQR to produce a good initialization for PI². Their work assumes the
existence of a known model, while our method uses estimated local models. A number of prior
methods have also looked at incorporating models to generate additional synthetic samples for
model-free learning (R. Sutton, 1990; Gu et al., 2016), as well as using models for improving the
accuracy of model-free value function backups (Heess et al., 2015). Our work directly combines
model-based and model-free updates into a single trajectory-centric RL method without using
synthetic samples that degrade with modeling errors.
3.3 Preliminaries
The goal of policy search methods is to optimize the parameters $\theta$ of a policy $p(u_t \mid x_t)$, which
defines a probability distribution over actions $u_t$ conditioned on the system state $x_t$ at each time
step $t$ of a task execution. Let $\tau = (x_1, u_1, \ldots, x_T, u_T)$ be a trajectory of states and actions. Given
a cost function $c(x_t, u_t)$, we define the trajectory cost as $c(\tau) = \sum_{t=1}^{T} c(x_t, u_t)$. The policy is
optimized with respect to the expected cost of the policy

$$J(\theta) = \mathbb{E}_{p}[c(\tau)] = \int c(\tau)\, p(\tau)\, d\tau,$$

where $p(\tau)$ is the policy trajectory distribution given the system dynamics $p(x_{t+1} \mid x_t, u_t)$:

$$p(\tau) = p(x_1) \prod_{t=1}^{T} p(x_{t+1} \mid x_t, u_t)\, p(u_t \mid x_t).$$

One policy class that allows us to employ an efficient model-based update is the TVLG controller
$p(u_t \mid x_t) = \mathcal{N}(K_t x_t + k_t, \Sigma_t)$. In this section, we present the model-based and model-free
algorithms that form the constituent parts of our hybrid method. The model-based method is an
extension of a KL-constrained LQR algorithm (Levine & Abbeel, 2014), which we shall refer to
as LQR with fitted linear models (LQR-FLM). The model-free method is a PI² algorithm with
per-time-step KL-divergence constraints that is derived in previous work (Chebotar, Kalakrishnan,
et al., 2017).
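As a concrete reading of these definitions, the sketch below samples an action from a TVLG controller at one time step and accumulates a trajectory cost. The function names, the quadratic example cost, and the numerical values are assumptions chosen only for illustration.

```python
import numpy as np

def sample_tvlg_action(K_t, k_t, Sigma_t, x_t, rng):
    """Draw u_t ~ N(K_t x_t + k_t, Sigma_t) for a single time step."""
    mean = K_t @ x_t + k_t
    return rng.multivariate_normal(mean, Sigma_t)

def trajectory_cost(cost_fn, states, actions):
    """c(tau) = sum over t of c(x_t, u_t)."""
    return sum(cost_fn(x, u) for x, u in zip(states, actions))

# Example with a 2-D state and a 1-D action (all values hypothetical).
rng = np.random.default_rng(0)
K = np.array([[1.0, -1.0]])      # feedback gains K_t
k = np.array([0.5])              # feedforward term k_t
Sigma = 1e-12 * np.eye(1)        # nearly deterministic controller
x = np.array([2.0, 1.0])
u = sample_tvlg_action(K, k, Sigma, x, rng)

# Hypothetical quadratic cost on states and actions.
quadratic_cost = lambda x_t, u_t: float(x_t @ x_t + u_t @ u_t)
states = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
actions = [np.array([1.0]), np.array([0.0])]
total = trajectory_cost(quadratic_cost, states, actions)
```

In a full rollout, the gains `K_t`, offsets `k_t`, and covariances `Sigma_t` would differ at every time step, which is what makes the controller time-varying.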
3.3.1 Model-Based Optimization of TVLG Policies
The model-based method we use is based on the iterative linear-quadratic regulator (iLQR) and
builds on prior work (Levine & Abbeel, 2014; Tassa et al., 2012). We provide a full description
and derivation in Appendix A.1.1.
We use samples to fit a TVLG dynamics model
$p(x_{t+1} \mid x_t, u_t) = \mathcal{N}(f_{x,t} x_t + f_{u,t} u_t, F_t)$ and
assume a twice-differentiable cost function. As in (Tassa et al., 2012), we can compute a
second-order Taylor approximation of our Q-function and optimize this with respect to $u_t$ to find the
optimal action at each time step $t$. To deal with unknown dynamics, (Levine & Abbeel, 2014)
impose a KL-divergence constraint between the updated policy $p^{(i)}$ and previous policy $p^{(i-1)}$
to stay within the space of trajectories where the dynamics model is approximately correct. We
similarly set up our optimization as

$$\min_{p^{(i)}} \mathbb{E}_{p^{(i)}}[Q(x_t, u_t)] \quad \text{s.t.} \quad \mathbb{E}_{p^{(i)}}\left[D_{\mathrm{KL}}\left(p^{(i)} \,\|\, p^{(i-1)}\right)\right] \leq \epsilon_t. \tag{3.1}$$

The main difference from (Levine & Abbeel, 2014) is that we enforce separate KL constraints for
each linear-Gaussian policy rather than a single constraint on the induced trajectory distribution
(i.e., compare Eq. (3.1) to the first equation in Section 3.1 of (Levine & Abbeel, 2014)).
LQR-FLM has substantial efficiency benefits over model-free algorithms. However, as our
experimental results in Section 3.6 show, the performance of LQR-FLM is highly dependent on
being able to model the system dynamics accurately, causing it to fail for more challenging tasks.
3.3.2 Policy Improvement with Path Integrals
PI² is a model-free RL algorithm based on stochastic optimal control. A detailed derivation of this
method can be found in (Theodorou et al., 2010).
Each iteration of PI
2
involves generatingN trajectories by running the current policy. Let
S(x
i;t
;u
i;t
) = c(x
i;t
;u
i;t
)+∑
T
j=t+1
c(x
i;j
;u
i;j
) be the costtogo of trajectory i ∈ {1;:::;N}
starting in state x
i;t
by performing action u
i;t
and following the policyp(u
t
Sx
t
) afterwards. Then,
we can compute probabilitiesP(x
i;j
;u
i;j
) for each trajectory starting at time stept
P(x
i;t
;u
i;t
)=
exp−
1
t
S(x
i;t
;u
i;t
)
∫
exp−
1
t
S(x
i;t
;u
i;t
)du
i;t
: (3.2)
The probabilities follow from the Feynman-Kac theorem applied to stochastic optimal control (Theodorou et al., 2010). The intuition is that the trajectories with lower costs receive higher probabilities, and the policy distribution shifts towards a lower cost trajectory region. The costs are scaled by $\eta_t$, which can be interpreted as the temperature of a softmax distribution. This is similar to the dual variables $\eta_t$ in LQR-FLM in that they control the KL step size; however, they are derived and computed differently. After computing the new probabilities $P$, we update the policy distribution by reweighting each sampled control $u_{i,t}$ by $P(x_{i,t}, u_{i,t})$ and updating the policy parameters by a maximum likelihood estimate (Chebotar, Kalakrishnan, et al., 2017).
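The reweighting step can be illustrated with a minimal sketch. It assumes a sample-based reading of Eq. (3.2), in which the integral in the denominator is replaced by a sum over the N sampled trajectories, and it uses scalar controls and made-up numbers; the function names `pi2_weights` and `weighted_mean` are hypothetical and not taken from the thesis.

```python
import math

def pi2_weights(costs_to_go, eta):
    """Softmax weights exp(-S_i / eta), normalized over the sampled trajectories."""
    s_min = min(costs_to_go)  # shift by the minimum for numerical stability
    unnorm = [math.exp(-(s - s_min) / eta) for s in costs_to_go]
    total = sum(unnorm)
    return [w / total for w in unnorm]

def weighted_mean(controls, weights):
    """Weighted maximum-likelihood estimate of the mean of scalar controls."""
    return sum(w * u for w, u in zip(weights, controls))

# Illustrative values only: three sampled trajectories at one time step.
costs = [4.0, 1.0, 2.5]
controls = [0.3, -0.1, 0.2]
weights = pi2_weights(costs, eta=1.0)
new_mean = weighted_mean(controls, weights)
```

A smaller temperature `eta` concentrates the weights on the lowest-cost samples, while a larger one keeps the update closer to the plain sample mean.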
To relate PI² updates to LQR-FLM optimization of a constrained objective, which is necessary for combining these methods, we can formulate the following theorem.
Theorem 1. The PI² update corresponds to a KL-constrained minimization of the expected cost-to-go $S(x_t, u_t) = \sum_{j=t}^{T} c(x_j, u_j)$ at each time step $t$
$$\min_{p^{(i)}} E_{p^{(i)}}\left[S(x_t, u_t)\right] \quad \text{s.t.} \quad E_{p^{(i-1)}}\left[D_{KL}\left(p^{(i)} \,\|\, p^{(i-1)}\right)\right] \le \epsilon,$$
where $\epsilon$ is the maximum KL-divergence between the new policy $p^{(i)}(u_t \mid x_t)$ and the old policy $p^{(i-1)}(u_t \mid x_t)$.
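Proof. The Lagrangian of this problem is given by
$$\mathcal{L}(p^{(i)}, \eta_t) = E_{p^{(i)}}\left[S(x_t, u_t)\right] + \eta_t \left( E_{p^{(i-1)}}\left[D_{KL}\left(p^{(i)} \,\|\, p^{(i-1)}\right)\right] - \epsilon \right).$$
By minimizing the Lagrangian with respect to $p^{(i)}$ we can find its relationship to $p^{(i-1)}$ (see Appendix A.1.2), given by
$$p^{(i)}(u_t \mid x_t) \propto p^{(i-1)}(u_t \mid x_t)\, E_{p^{(i-1)}}\left[\exp\left(-\frac{1}{\eta_t} S(x_t, u_t)\right)\right]. \tag{3.3}$$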
This gives us an update rule for $p^{(i)}$ that corresponds exactly to reweighting the controls from the previous policy $p^{(i-1)}$ based on their probabilities $P(x_t, u_t)$ described earlier. The temperature $\eta_t$ now corresponds to the dual variable of the KL-divergence constraint.
The temperature $\eta_t$ can be estimated at each time step separately by optimizing the dual function
$$g(\eta_t) = \eta_t \epsilon + \eta_t \log E_{p^{(i-1)}}\left[\exp\left(-\frac{1}{\eta_t} S(x_t, u_t)\right)\right], \tag{3.4}$$
with derivation following from (Peters et al., 2010).
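As a rough illustration of how the temperature could be obtained from Eq. (3.4), the sketch below minimizes a sample-based version of the dual with a golden-section search over the logarithm of the temperature. The sample average in place of the expectation, the search bounds, the optimizer, and all names (`dual`, `optimize_eta`, `sample_kl`) are assumptions made for this example; the thesis does not specify them here.

```python
import math

def dual(eta, costs, epsilon):
    """Sample-based dual: eta*epsilon + eta*log(mean_i exp(-S_i/eta))."""
    scaled = [-s / eta for s in costs]
    m = max(scaled)  # log-sum-exp shift for numerical stability
    log_mean = m + math.log(sum(math.exp(v - m) for v in scaled) / len(costs))
    return eta * epsilon + eta * log_mean

def optimize_eta(costs, epsilon, lo=1e-3, hi=1e3, iters=60):
    """Golden-section search for the minimizer of the dual over log(eta)."""
    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if dual(math.exp(c), costs, epsilon) < dual(math.exp(d), costs, epsilon):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp(0.5 * (a + b))

def sample_kl(costs, eta):
    """KL divergence between the reweighted samples and uniform sample weights."""
    scaled = [-s / eta for s in costs]
    m = max(scaled)
    unnorm = [math.exp(v - m) for v in scaled]
    total = sum(unnorm)
    n = len(costs)
    return sum((w / total) * math.log(n * w / total) for w in unnorm if w > 0.0)

# Illustrative costs-to-go of five samples at one time step.
costs = [4.0, 1.0, 2.5, 3.0, 0.5]
epsilon = 0.1
eta = optimize_eta(costs, epsilon)
kl = sample_kl(costs, eta)
```

At an interior minimizer of this sample-based dual, the KL divergence of the reweighted samples from the original uniform weighting matches the bound `epsilon`, which is why the temperature can be read as the dual variable of the KL constraint.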
PI² was used by (Chebotar, Kalakrishnan, et al., 2017) to solve several challenging robotic tasks such as door opening and pick-and-place, where they achieved better final performance than LQR-FLM. However, due to its greater sample complexity, PI² required initialization from demonstrations.
3.4 Integrating Model-Based Updates into PI²
Both PI² and LQR-FLM can be used to learn TVLG policies and both have their strengths and weaknesses. In this section, we first show how the PI² update can be broken up into two parts, with one part using a model-based cost approximation and another part using the residual cost error after this approximation. Next, we describe our method for integrating model-based updates into PI² by using our extension of LQR-FLM to optimize the linear-quadratic cost approximation and performing a subsequent update with PI² on the residual cost. We demonstrate in Section 3.6 that our method combines the strengths of PI² and LQR-FLM while compensating for their weaknesses.
3.4.1 Two-Stage PI² update
To integrate a model-based optimization into PI², we can divide it into two steps. Given an approximation $\hat{c}(x_t, u_t)$ of the real cost $c(x_t, u_t)$ and the residual cost $\tilde{c}(x_t, u_t) = c(x_t, u_t) - \hat{c}(x_t, u_t)$, let $\hat{S}_t = \hat{S}(x_t, u_t)$ be the approximated cost-to-go of a trajectory starting with state $x_t$ and action $u_t$, and $\tilde{S}_t = \tilde{S}(x_t, u_t)$ be the residual of the real cost-to-go $S(x_t, u_t)$ after approximation. We can rewrite the PI² policy update rule from Eq. (3.3) as
$$\begin{aligned} p^{(i)}(u_t \mid x_t) &\propto p^{(i-1)}(u_t \mid x_t)\, E_{p^{(i-1)}}\left[\exp\left(-\frac{1}{\eta_t}\left(\hat{S}_t + \tilde{S}_t\right)\right)\right] \\ &\propto \hat{p}(u_t \mid x_t)\, E_{p^{(i-1)}}\left[\exp\left(-\frac{1}{\eta_t} \tilde{S}_t\right)\right], \end{aligned} \tag{3.5}$$
where $\hat{p}(u_t \mid x_t)$ is given by
$$\hat{p}(u_t \mid x_t) \propto p^{(i-1)}(u_t \mid x_t)\, E_{p^{(i-1)}}\left[\exp\left(-\frac{1}{\eta_t} \hat{S}_t\right)\right]. \tag{3.6}$$
Hence, by decomposing the cost into its approximation and the residual approximation error, the PI² update can be split into two steps: (1) update using the approximated costs $\hat{c}(x_t, u_t)$ and samples from the old policy $p^{(i-1)}(u_t \mid x_t)$ to get $\hat{p}(u_t \mid x_t)$; (2) update $p^{(i)}(u_t \mid x_t)$ using the residual costs $\tilde{c}(x_t, u_t)$ and samples from $\hat{p}(u_t \mid x_t)$.
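The factorization behind Eqs. (3.5) and (3.6) can be checked numerically on sample weights. The sketch below is illustrative only: it uses made-up scalar costs-to-go, treats the expectations as per-sample exponential weights, and uses hypothetical variable names. It does not reproduce the LQR-FLM step that the method later substitutes for the first stage.

```python
import math

def normalize(ws):
    """Scale a list of non-negative weights so that they sum to one."""
    total = sum(ws)
    return [w / total for w in ws]

eta = 2.0
s_hat = [3.0, 1.5, 2.0]     # approximated costs-to-go (illustrative)
s_tilde = [0.5, -0.2, 1.0]  # residuals, so the real cost-to-go is s_hat + s_tilde

# Single-stage weights computed from the full cost-to-go.
one_stage = normalize([math.exp(-(a + r) / eta) for a, r in zip(s_hat, s_tilde)])

# Stage 1: weights from the approximation only.
stage1 = normalize([math.exp(-a / eta) for a in s_hat])
# Stage 2: reweight the stage-1 result with the residuals.
two_stage = normalize([w * math.exp(-r / eta) for w, r in zip(stage1, s_tilde)])
```

Because the exponential of a sum is the product of exponentials, `one_stage` and `two_stage` agree up to floating-point error, which is the property the two-step split relies on.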
3.4.2 Model-Based Substitution with LQR-FLM
We can use Theorem (1) to rewrite Eq. (3.6) as a constrained optimization problem
$$\min_{\hat{p}} E_{\hat{p}}\left[\hat{S}(x_t, u_t)\right] \quad \text{s.t.} \quad E_{p^{(i-1)}}\left[D_{KL}\left(\hat{p} \,\|\, p^{(i-1)}\right)\right] \le \epsilon.$$
Thus, the policy $\hat{p}(u_t \mid x_t)$ can be updated using any algorithm that can solve this optimization problem. By choosing a model-based approach for this, we can speed up the learning process significantly. Model-based methods are typically constrained to some particular cost approximation; however, PI² can accommodate any form of $\tilde{c}(x_t, u_t)$ and thus will handle arbitrary cost residuals.
LQR-FLM solves the type of constrained optimization problem in Eq. (3.1), which matches the optimization problem needed to obtain $\hat{p}$, where the cost-to-go $\hat{S}$ is approximated with a quadratic cost and a linear-Gaussian dynamics model.* We can thus use LQR-FLM to perform our first update, which enables greater efficiency but is susceptible to modeling errors when the fitted local dynamics are not accurate, such as in discontinuous systems. We can use a PI² optimization on the residuals to correct for this bias.
* In practice, we make a small modification to the problem in Eq. (3.1) so that the expectation in the constraint is evaluated with respect to the new distribution $\hat{p}(x_t)$ rather than the previous one $p^{(i-1)}(x_t)$. This modification is heuristic and no longer aligns with Theorem (1), but works better in practice.
3.4.3 Optimizing Cost Residuals with PI²
In order to perform a PI² update on the residual costs-to-go $\tilde{S}$, we need to know what $\hat{S}$ is for each sampled trajectory. That is, what is the cost-to-go that is actually used by LQR-FLM to make its update? The structure of the algorithm implies a specific cost-to-go formulation for a given trajectory: namely, the sum of quadratic costs obtained by running the same policy under the TVLG dynamics used by LQR-FLM. A given trajectory can be viewed as being generated by a deterministic policy conditioned on a particular noise realization $\xi_{i,1}, \dots, \xi_{i,T}$, with actions given by
$$u_{i,t} = K_t x_{i,t} + k_t + \sqrt{\Sigma_t}\, \xi_{i,t}, \tag{3.7}$$
where $K_t$, $k_t$, and $\Sigma_t$ are the parameters of $p^{(i-1)}$. We can therefore evaluate $\hat{S}(x_t, u_t)$ by simulating this deterministic controller from $(x_t, u_t)$ under the fitted TVLG dynamics and evaluating its time-varying quadratic cost, and then plugging these values into the residual cost.
In addition to the residual costs $\tilde{S}$ for each trajectory, the PI² update also requires control samples from the updated LQR-FLM policy $\hat{p}(u_t \mid x_t)$. Although we have the updated LQR-FLM policy, we only have samples from the old policy $p^{(i-1)}(u_t \mid x_t)$. However, we can apply a form of the reparametrization trick (Kingma & Welling, 2013) and again use the stored noise realization $\xi_{i,t}$ of each trajectory to evaluate what the control would have been for that sample under the LQR-FLM policy $\hat{p}$. The expectation of the residual cost-to-go in Eq. (3.5) is taken with respect to the old policy distribution $p^{(i-1)}$. Hence, we can reuse the states $x_{i,t}$ and their corresponding noise $\xi_{i,t}$ that was sampled while rolling out the previous policy $p^{(i-1)}$ and evaluate the new controls according to $\hat{u}_{i,t} = \hat{K}_t x_{i,t} + \hat{k}_t + \sqrt{\hat{\Sigma}_t}\, \xi_{i,t}$. This linear transformation on the sampled control provides unbiased samples from $\hat{p}(u_t \mid x_t)$. After transforming the control samples, they are reweighted according to their residual costs and plugged into the PI² update in Eq. (3.2).
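The noise-reuse step can be sketched for a single scalar control as follows. The scalar setting, the numeric values, and the helper names `recover_noise` and `transform_control` are assumptions made for illustration; in the multivariate case the square root of the covariance would be a matrix factor such as a Cholesky factor.

```python
import math

def recover_noise(u, x, K, k, sigma):
    """Invert u = K*x + k + sqrt(sigma)*xi to recover the noise xi (scalar case)."""
    return (u - (K * x + k)) / math.sqrt(sigma)

def transform_control(x, xi, K_hat, k_hat, sigma_hat):
    """Evaluate the control the updated policy would produce for the same noise."""
    return K_hat * x + k_hat + math.sqrt(sigma_hat) * xi

# A sample rolled out under the old policy (illustrative numbers).
x, xi = 0.8, -1.3
K, k, sigma = 0.5, 0.1, 0.04
u = K * x + k + math.sqrt(sigma) * xi

# Reuse the stored state and noise under hypothetical updated parameters.
xi_rec = recover_noise(u, x, K, k, sigma)
u_hat = transform_control(x, xi_rec, K_hat=0.7, k_hat=-0.2, sigma_hat=0.01)
```

If the noise is stored during the rollout, as the text describes, `recover_noise` is unnecessary; it is included only to show that the noise is determined by the recorded state, control, and old policy parameters.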
3.4.4 Summary of PILQR algorithm
Algorithm 2 summarizes our method for combining LQR-FLM and PI² to create a hybrid model-based and model-free algorithm. After generating a set of trajectories by running the current policy (line 2), we fit TVLG dynamics and compute the quadratic cost approximation $\hat{c}(x_t, u_t)$ and approximation error residuals $\tilde{c}(x_t, u_t)$ (lines 3, 4). In order to improve the convergence behavior of our algorithm, we adjust the KL-step $\epsilon_t$ of the LQR-FLM optimization in Eq. (3.1) based inversely on the proportion of the residual costs-to-go to the sampled costs-to-go (line 5). In particular, if the ratio between the residual and the overall cost is sufficiently small or large, we increase or decrease, respectively, the KL-step $\epsilon_t$. We then continue with optimizing for the temperature $\eta_t$ using the dual function from Eq. (3.4) (line 6). Finally, we perform an LQR-FLM update on the cost approximation (line 7) and a subsequent PI² update using the cost residuals (line 8). As PILQR combines LQR-FLM and PI² updates in sequence in each iteration, its computational complexity can be determined as the sum of both methods. Due to the properties of PI², the covariance of the optimized TVLG controllers decreases each iteration and the method eventually converges to a single solution.
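The KL-step adjustment described above can be sketched as a simple threshold rule. The text only states that the step grows when the residual-to-total ratio is sufficiently small and shrinks when it is sufficiently large; the thresholds, the multiplicative factor, the clipping bounds, the use of absolute values, and the function name `adjust_kl_step` below are all hypothetical choices for illustration, not the values used in the thesis.

```python
def adjust_kl_step(epsilon, residual_costs_to_go, sampled_costs_to_go,
                   low=0.1, high=0.5, factor=1.5, eps_min=1e-3, eps_max=10.0):
    """Scale the KL step inversely to the share of cost left unexplained by the model."""
    num = sum(abs(r) for r in residual_costs_to_go)
    den = sum(abs(s) for s in sampled_costs_to_go)
    ratio = num / den if den > 0.0 else 0.0
    if ratio < low:
        epsilon *= factor   # model explains the cost well: allow a larger step
    elif ratio > high:
        epsilon /= factor   # large residuals: take a smaller model-based step
    return min(max(epsilon, eps_min), eps_max)

# Illustrative calls with made-up costs-to-go.
grown = adjust_kl_step(1.0, [0.1, 0.1], [10.0, 10.0])
shrunk = adjust_kl_step(1.0, [8.0, 8.0], [10.0, 10.0])
unchanged = adjust_kl_step(1.0, [3.0, 3.0], [10.0, 10.0])
```

With a rule of this shape, the model-based step is trusted less when the residuals dominate, so more of the update falls to the model-free PI² step.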
3.5 Training Parametric Policies with GPS
PILQR offers an approach to perform trajectory optimization of TVLG policies. In this work, we employ mirror descent guided policy search (MDGPS) (Montgomery & Levine, 2016) in order to use PILQR to train parametric policies, such as neural networks. Instead of directly learning the parameters of a high-dimensional parametric or “global policy” with RL, we first learn simple TVLG policies, which we refer to as “local policies” $p(u_t \mid x_t)$ for various initial conditions of the task. After optimizing the local policies, the optimized controls from these policies are used to create a training set for learning the global policy $\pi_\theta$ in a supervised manner. Hence, the final global policy generalizes across multiple local policies.
Algorithm 2: PILQR algorithm
1: for iteration $k \in \{1, \dots, K\}$ do
2: Generate trajectories $D = \{\tau_i\}$ by running the current linear-Gaussian policy $p^{(k-1)}(u_t \mid x_t)$.
3: Fit TVLG dynamics $\hat{p}(x_{t+1} \mid x_t, u_t)$.
4: Estimate cost approximation $\hat{c}(x_t, u_t)$ using fitted dynamics and compute cost residuals: $\tilde{c}(x_t, u_t) = c(x_t, u_t) - \hat{c}(x_t, u_t)$.
5: Adjust LQR-FLM KL step $\epsilon_t$ based on ratio of residual costs-to-go $\tilde{S}$ and sampled costs-to-go $S$.
6: Compute $\eta_t$ using dual function from Eq. (3.4).
7: Perform LQR-FLM update to compute $\hat{p}(u_t \mid x_t)$: $\min_{p^{(i)}} E_{p^{(i)}}[Q(x_t, u_t)]$ s.t. $E_{p^{(i)}}[D_{KL}(p^{(i)} \,\|\, p^{(i-1)})] \le \epsilon_t$.
8: Perform PI² update using cost residuals and LQR-FLM actions to compute the new policy: $p^{(k)}(u_t \mid x_t) \propto \hat{p}(u_t \mid x_t)\, E_{p^{(i-1)}}[\exp(-\frac{1}{\eta_t} \tilde{S}_t)]$.
9: end for
Using the TVLG representation of the local policies makes it straightforward to incorporate PILQR into the MDGPS framework. Instead of constraining against the old local TVLG policy as in Theorem (1), each instance of the local policy is now constrained against the old global policy
$$\min_{p^{(i)}} E_{p^{(i)}}\left[S(x_t, u_t)\right] \quad \text{s.t.} \quad E_{p^{(i-1)}}\left[D_{KL}\left(p^{(i)} \,\|\, \pi_\theta^{(i-1)}\right)\right] \le \epsilon.$$
The two-stage update proceeds as described in Section 3.4.1, with the change that the LQR-FLM policy is now constrained against the old global policy $\pi_\theta^{(i-1)}$.
3.6 Experimental Evaluation
Our experiments aim to answer the following questions: (1) How does our method compare to other trajectory-centric and deep RL algorithms in terms of final performance and sample efficiency? (2) Can we utilize linear-Gaussian policies trained using PILQR to obtain robust neural network policies using MDGPS? (3) Is our proposed algorithm capable of learning complex manipulation skills on a real robotic platform? We study these questions through a set of simulated comparisons against prior methods, as well as real-world tasks using a PR2 robot. The performance of each method can be seen in our supplementary video.† Our focus in this work is specifically on robotics tasks that involve manipulation of objects, since such tasks often exhibit elements of continuous and discontinuous dynamics and require sample-efficient methods, making them challenging for both model-based and model-free methods.
† https://sites.google.com/site/icml17pilqr
Figure 3.2: We evaluate on a set of simulated robotic manipulation tasks with varying difficulty.
Left to right, the tasks involve pushing a block, reaching for a target, and opening a door in 3D.
3.6.1 Simulation Experiments
We evaluate our method on three simulated robotic manipulation tasks, depicted in Figure 3.2 and
discussed below:
Gripper pusher. This task involves controlling a 4 DoF arm with a gripper to push a white block
to a red goal area. The cost function is a weighted combination of the distance from the gripper to
the block and from the block to the goal.
Reacher. The reacher task from OpenAI gym (Brockman et al., 2016) requires moving the end of a 2 DoF arm to a target position. This task is included to provide comparisons against prior methods. The cost function is the distance from the end effector to the target. We modify the cost function slightly: the original task uses an $\ell_2$ norm, while we use a differentiable Huber-style loss, which is more typical for LQR-based methods (Tassa et al., 2012).
Door opening. This task requires opening a door with a 6 DoF 3D arm. The arm must grasp the handle and pull the door to a target angle, which can be particularly challenging for model-based methods due to the complex contacts between the hand and the handle, and the fact that a contact must be established before the door can be opened. The cost function is a weighted combination of the distance of the end effector to the door handle and the angle of the door.
Additional experimental setup details, including the exact cost functions, are provided in Appendix A.1.3.
We first compare PILQR to LQR-FLM and PI² on the gripper pusher and door opening tasks. Figure 3.3 (left) details performance of each method on the most difficult condition for the gripper pusher task. Both LQR-FLM and PI² perform significantly worse on the two more difficult conditions of this task. While PI² improves in performance as we provide more samples, LQR-FLM is bounded by its ability to model the dynamics, and thus predict the costs, at the moment when the gripper makes contact with the block. Our method solves all four conditions with 400 total episodes per condition and, as shown in the supplementary video, is able to learn a diverse set of successful behaviors including flicking, guiding, and hitting the block. On the door opening task, PILQR trains TVLG policies that succeed at opening the door from each of the four initial robot positions. While the policies trained with LQR-FLM are able to reach the handle, they fail to open the door.
[Figure 3.3 plots not reproduced. Left panel: average final distance versus number of samples for PI², LQR-FLM, and PILQR.]
Figure 3.3: Left: Average final distance from the block to the goal on one condition of the gripper pusher task. This condition is difficult due to the block being initialized far away from the gripper and the goal area, and only PILQR is able to succeed in reaching the block and pushing it toward the goal. Results for additional conditions are available in Appendix A.1.4, and the supplementary video demonstrates the final behavior of each learned policy. Right: Final distance from the reacher end effector to the target averaged across 300 random test conditions per iteration. MDGPS with LQR-FLM, MDGPS with PILQR, TRPO, and DDPG all perform competitively. However, as the log scale for the x axis shows, TRPO and DDPG require orders of magnitude more samples. MDGPS with PI² performs noticeably worse.
Next we evaluate neural network policies on the reacher task. Figure 3.3 (right) shows results for MDGPS with each local policy method, as well as two prior deep RL methods that directly learn neural network policies: trust region policy optimization (TRPO) (Schulman et al., 2015) and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2016). MDGPS with LQR-FLM and MDGPS with PILQR perform competitively in terms of the final distance from the end effector to the target, which is unsurprising given the simplicity of the task, whereas MDGPS with PI² is again not able to make much progress. On the reacher task, DDPG and TRPO use 25 and 150 times more samples, respectively, to achieve approximately the same performance as MDGPS with LQR-FLM and PILQR. For comparison, amongst previous deep RL algorithms that combined model-based and model-free methods, SVG and NAF with imagination rollouts reported using approximately up to five times fewer samples than DDPG on a similar reacher task (Heess et al., 2015; Gu et al., 2016). Thus we can expect that MDGPS with our method is about one order of magnitude more sample-efficient than SVG and NAF. While this is a rough approximation, it demonstrates a significant improvement in efficiency.
Finally, we compare the same methods for training neural network policies on the door opening
task, shown in Figure 3.4 (left). TRPO requires 20 times more samples than MDGPS with PILQR
to learn a successful neural network policy. The other three methods were unable to learn a policy
that opens the door despite extensive hyperparameter tuning. We provide additional simulation
results in Appendix A.1.4.
[Figure 3.4 plots not reproduced. Right panel: average cost versus number of samples for PI², LQR-FLM, and PILQR, with a dotted line marking the cost below which the puck is in the goal.]
Figure 3.4: Left: Minimum angle in radians of the door hinge (lower is better) averaged across 100 random test conditions per iteration. MDGPS with PILQR outperforms all other methods we compare against, with orders of magnitude fewer samples than DDPG and TRPO, which is the only other successful algorithm. Right: Single condition comparison of the hockey task performed on the real robot. Costs lower than the dotted line correspond to the puck entering the goal.
3.6.2 Real Robot Experiments
To evaluate our method on a real robotic platform, we use a PR2 robot (see Figure 3.1) to learn the
following tasks:
Hockey. The hockey task requires using a stick to hit a puck into a goal 1.4 m away. The cost function consists of two parts: the distance between the current position of the stick and a target pose that is close to the puck, and the distance between the position of the puck and the goal. The puck is tracked using a motion capture system. Although the cost provides some shaping, this task presents a significant challenge due to the difference in outcomes based on whether or not the robot actually strikes the puck, making it challenging for prior methods, as we show below.
Power plug plugging. In this task, the robot must plug a power plug into an outlet. The cost function is the distance between the plug and a target location inside the outlet. This task requires fine manipulation to fully insert the plug. Our TVLG policies consist of 100 time steps and we control our robot at a frequency of 20 Hz. For further details of the experimental setup, including the cost functions, we refer the reader to Appendix A.1.3.
Both of these tasks have difficult, discontinuous dynamics at the contacts between the objects, and both require a high degree of precision to succeed. In contrast to prior works (Daniel et al., 2013) that use kinesthetic teaching to initialize a policy that is then fine-tuned with model-free methods, our method does not require any human demonstrations. The policies are randomly initialized using a Gaussian distribution with zero mean. Such initialization does not provide any information about the task to be performed. In all of the real robot experiments, policies are updated every 10 rollouts and the final policy is obtained after 20-25 iterations, which corresponds to mastering the skill with less than one hour of experience.
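The experience budget can be sanity-checked with a short calculation from the figures stated above: 100 time steps per rollout at 20 Hz, 10 rollouts per policy update, and an iteration count read here as the range 20 to 25. The sketch counts only policy execution time; time for resets between rollouts is not given in the text and is left out, and the names below are hypothetical.

```python
TIME_STEPS = 100             # time steps per TVLG policy rollout
CONTROL_HZ = 20              # control frequency in Hz
ROLLOUTS_PER_ITERATION = 10  # rollouts between policy updates

def execution_minutes(iterations):
    """Minutes of robot execution for a given number of iterations."""
    seconds_per_rollout = TIME_STEPS / CONTROL_HZ  # 5 seconds per rollout
    return iterations * ROLLOUTS_PER_ITERATION * seconds_per_rollout / 60.0

low = execution_minutes(20)   # 200 rollouts
high = execution_minutes(25)  # 250 rollouts
```

Under these assumptions the execution time comes to roughly 17 to 21 minutes, which is consistent with the stated bound of less than one hour of experience.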
[Figure 3.5 images not reproduced. Right panel: average cost versus number of samples for PI², LQR-FLM, and PILQR, with a dotted line marking the cost below which the plug is plugged in.]
Figure 3.5: Left: Experimental setup of the hockey task and the success rate of the final PILQR-MDGPS policy. Red and blue: goal positions used for training, green: new goal position. Right: Single condition comparison of the power plug task performed on the real robot. Note that costs above the dotted line correspond to executions that did not actually insert the plug into the socket. Only our method (PILQR) was able to consistently insert the plug all the way into the socket by the final iteration.
In the first set of experiments, we aim to learn a policy that is able to hit the puck into the goal for a single position of the goal and the puck. The results of this experiment are shown in Figure 3.4 (right). In the case of the prior PI² method (Theodorou et al., 2010), the robot was not able to hit the puck. Since the puck position has the largest influence on the cost, the resulting learning curve shows little change in the cost over the course of training. The policy to move the arm towards the recorded arm position that enables hitting the puck turned out to be too challenging for PI² in the limited number of trials used for this experiment. In the case of LQR-FLM, the robot was able to occasionally hit the puck in different directions. However, the resulting policy could not capture the complex dynamics of the sliding puck or the discrete transition, and was unable to hit the puck toward the goal. The PILQR method was able to learn a robust policy that consistently hits the puck into the goal. Using the step adjustment rule described in Section 3.4.4, the algorithm would shift towards model-free updates from the PI² method as the TVLG approximation of the dynamics became less accurate. Using our method, the robot was able to get to the final position of the arm using fast model-based updates from LQR-FLM and learn the puck-hitting policy, which is difficult to model, by automatically shifting towards model-free PI² updates.
In our second set of hockey experiments, we evaluate whether we can learn a neural network policy using the MDGPS-PILQR algorithm that can hit the puck into different goal locations. The goals were spaced 0.5 m apart (see Figure 3.5 left). The strategies for hitting the puck into different goal positions differ substantially, since the robot must adjust the arm pose to approach the puck from the right direction and aim toward the target. This makes it quite challenging to learn a single policy for this task. We performed 30 rollouts for three different positions of the goal (10 rollouts each), two of which were used during training. The neural network policy was able to hit the puck into the goal in 90% of the cases (see Figure 3.5 left). This shows that our method can learn high-dimensional neural network policies that generalize across various conditions.
The results of the plug experiment are shown in Figure 3.5 (right). PI² alone was unable to reach the socket. The LQR-FLM algorithm succeeded only 60% of the time at convergence. In contrast to the peg insertion-style tasks evaluated in prior work that used LQR-FLM (Levine et al., 2015), this task requires very fine manipulation due to the small size of the plug. Our method was able to converge to a policy that plugged in the power plug on every rollout at convergence. The supplementary video illustrates the final behaviors of each method for both the hockey and power plug tasks.‡
3.7 Conclusions
We presented an algorithm that combines elements of model-free and model-based RL, with the aim of combining the sample efficiency of model-based methods with the ability of model-free methods to improve the policy even in situations where the model’s structural assumptions are violated. We show that a particular choice of policy representation, TVLG controllers, is amenable to fast optimization with model-based LQR-FLM and model-free PI² algorithms using sample-based updates. We propose a hybrid algorithm based on these two components, where the PI² update is performed on the residuals between the true sample-based cost and the cost estimated under the local linear models. This algorithm has a number of appealing properties: it naturally trades off between model-based and model-free updates based on the amount of model error, can easily be extended with a KL-divergence constraint for stable learning, and can be effectively used for real-world robotic learning. We further demonstrate that, although this algorithm is specific to TVLG policies, it can be integrated into the GPS framework in order to train arbitrary parameterized policies, including deep neural networks.
We evaluated our approach on a range of challenging simulated and real-world tasks. The results show that our method combines the efficiency of model-based learning with the ability of model-free methods to succeed on tasks with discontinuous dynamics and costs. We further illustrate in direct comparisons against state-of-the-art model-free deep RL methods that, when combined with the GPS framework, our method achieves substantially better sample efficiency.
‡ https://sites.google.com/site/icml17pilqr
Part II
Self-Supervision
Chapter 4
Self-Supervised Regrasping using Reinforcement Learning
We introduce a framework for learning regrasping behaviors based on tactile data. First, we present a grasp stability predictor that uses spatio-temporal tactile features collected from the early-object-lifting phase to predict the grasp outcome with a high accuracy. Next, the trained predictor is used to supervise and provide feedback to a reinforcement learning algorithm that learns the required linear grasp-adjustment policies based on tactile feedback, with a small number of parameters, for single objects. Finally, a general high-dimensional regrasping policy is learned in a supervised manner by using the outputs of the individual policies.
Our results gathered over more than 50 hours of real robot experiments indicate that the robot is able to predict the grasp outcome with 93% accuracy. The robot is able to improve the grasp success rate from 42% when randomly grasping an object to up to 97% when allowed to regrasp the object in case of a predicted failure. In addition, our experiments indicate that the general high-dimensional policy learned using our method is able to outperform the respective linear policies on each of the single objects that they were trained on. Moreover, the general policy is able to generalize to a novel object that was not present during training.
4.1 Introduction
Autonomous grasping of unknown objects is a fundamental requirement for service robots performing manipulation tasks in real world environments. Even though there has been a lot of progress in the area of grasping, it is still considered an open challenge and even the state-of-the-art grasping methods may result in failures. Two questions arise immediately: i) how to detect these failures early, and ii) how to adjust a grasp to avoid failure and improve stability.
Early grasp stability assessment is particularly important in the regrasping scenario, where, in case of predicted failure, the robot must be able to place the object down in the same position in order to regrasp it later. In many cases, early detection of grasping failures cannot be performed using a vision system as they occur at the contact points and involve various tactile events such as incipient slip. Recent developments of tactile sensors (Wettels et al., 2008) and spatio-temporal classification algorithms (Madry et al., 2014) enable us to use informative tactile feedback to advance grasping-failure-detection methods.
Figure 4.1: Regrasping scenario: the robot partially misses the object with one of the fingers during the initial grasp (left), predicts that the current grasp will be unstable, places the object down, and adjusts the hand configuration to form a firm grasp of the object using all of its fingers (right).
In order to correct the grasp, the robot has to be able to process and use the information on why the grasp has failed. Tactile sensors are one source of this information. Our grasp adjustment approach aims to use this valuable tactile information in order to infer a local change of the gripper configuration that will improve the grasp stability. In this work, we jointly address the problems of grasp-failure detection and grasp adjustment, i.e. regrasping using tactile feedback. In particular, we use a failure detection method to guide and self-supervise the regrasping behavior using reinforcement learning. In addition to the regrasping approach, we present an extensive evaluation of a spatio-temporal grasp stability predictor that is used on a biomimetic tactile sensor. This prediction is then used as a reward signal that supervises the regrasping reinforcement learning process. An example of a regrasping behavior is depicted in Fig. 4.1.
Hereby, we show that simple regrasping strategies can be learned using linear policies if enough data is provided. However, these strategies do not generalize well to other classes of objects than those they were trained on. The main reason for this shortcoming is that the policies are not representative enough to capture the richness of different shapes and physical properties of the objects. A potential solution to learn a more complex and generalizable regrasping strategy is to employ a more complex policy class and gather a lot of real-robot data with a variety of objects to learn the policy parameters. The main weakness of such a solution is that, in addition to requiring large amounts of data, these complex policies often result in the learner becoming stuck in poor local optima (Levine & Koltun, 2013; M. Deisenroth et al., 2013). We propose learning a complex high-dimensional regrasping policy in a supervised fashion. Our method uses simple linear policies to guide the general policy to avoid poor local minima and to learn the general policy from smaller amounts of data.
4.2 Related Work
The problem of grasp stability assessment has been addressed previously in various ways. (Dang
& Allen, 2014) utilize tactile sensory data to estimate grasp stability. However, their approach
does not take into account the temporal properties of the data and uses the tactile features only
from one time step at the end of the grasping process. There have also been other approaches that
model the entire time series of tactile data. In (Bekiroglu et al., 2011) and (Bekiroglu et al., 2010),
the authors train a Hidden Markov Model to represent the sequence of tactile data and then use
it for grasp stability assessment. The newest results in the analysis of the tactile time series data
for grasp stability assessment were presented in (Madry et al., 2014). The authors show that the
unsupervised feature learning approach presented in their work achieves better results than any of
the previously mentioned methods, including (Bekiroglu et al., 2011) and (Bekiroglu et al., 2010).
In this work, we show how the spatiotemporal tactile features developed in (Madry et al., 2014)
can be applied to the stateoftheart biomimetic tactile sensor, which enables us to better exploit
the capabilities of this advanced tactile sensor.
Biomimetic tactile sensors and humaninspired algorithms have been used previously for
grasping tasks. In (Wettels et al., 2009), the authors show an approach to control grip force while
grasping an object using the same biomimetic tactile sensor that is used in this work. (Su et al.,
2015) present a grasping controller that uses estimation of various tactile properties to gently pick
up objects of various weights and textures using a biomimetic tactile sensor. All of these works
tackle the problem of humaninspired grasping. In this work, however, we focus on the regrasping
behavior and grasp stability prediction. We also compare the grasp stability prediction results using
the finger forces estimated by the method from (Su et al., 2015) to the methods described in this
work.
Reinforcement learning has enjoyed success in many different applications including manipu
lation tasks such as playing table tennis (Kober et al., 2011) or executing a pool stroke (Pastor et
al., 2011). Reinforcement learning methods have also been applied to grasping tasks. (Kroemer et
al., 2010) propose a hierarchical controller that determines where and how to grasp an unknown
object. The authors use joint positions of the hand in order to determine the reward used for
optimizing the policy. Another approach was presented by (Montesano & Lopes, 2012), where the
authors address the problem of actively learning good grasping points from visual features using
reinforcement learning. In this work, we present a different approach to the problem of grasping
using reinforcement learning, i.e. regrasping. In addition, we use a tactile-feature-based reward
function to improve the regrasping performance.
Tactile features have rarely been used in reinforcement learning manipulation scenarios. (Pastor
et al., 2011) use pressure sensor arrays mounted on the robot’s fingers to learn a manipulation
skill of flipping a box using chopsticks. Similarly to (Pastor et al., 2011), (Chebotar et al.,
2014) use dynamic movement primitives and reinforcement learning to learn the task of gentle
surface scraping with a spatula. In both of these cases, the robot was equipped with tactile
matrix arrays. In this work, we apply a reinforcement learning approach to the task of regrasping
using an advanced biomimetic tactile sensor together with state-of-the-art spatio-temporal feature
descriptors.
The works most related to this work have focused on the problem of grasp adjustment based
on sensory data. (Dang & Allen, 2014) tackle the regrasping problem by searching for the closest
stable grasp in the database of all the previous grasps performed by the robot. A similar approach
is presented by (M. Li et al., 2014), where the authors propose a grasp adaptation strategy that
searches a database for a similar tactile sensing experience in order to correct the grasp. The
authors introduce an object-level impedance controller whose parameters are adapted based on
the current grasp stability estimates. In that case, the grasp adaptation is focused on in-hand
adjustments rather than placing the object down and regrasping it. The main differences between
these approaches and ours are twofold: i) we employ spatio-temporal features and use tactile data
from the beginning of the lifting phase, which enables us to achieve high accuracy in a short time
for grasp stability prediction, and ii) we use reinforcement learning with spatio-temporal features,
which is supervised by the previously learned grasp stability predictor. This allows us to learn the
regrasping behavior in an autonomous and efficient way.
The idea of using supervised learning in policy search has been used in (Levine et al., 2015),
where the authors use trajectory optimization to direct the policy learning process and apply the
learned policies to various manipulation tasks. A similar approach was proposed in (Finn et al.,
2015a), where the authors use deep spatial autoencoders to learn the state representation and unify
a set of linear Gaussian controllers to generalize for the unseen situations. In our work, we use the
idea of unifying simple strategies to generate a complex generic policy. Here, however, we use
simple linear policies learned through reinforcement learning rather than optimized trajectories as
the examples that the general policy can learn from.
4.3 Grasp Stability Estimation
The first component of our regrasping framework is the grasp stability predictor that provides early
detection of grasp failures. We use short sequences of tactile data to extract necessary information
about the grasp quality. This requires processing tactile data both in the spatial and the temporal
domains. In this section, we first present the biomimetic tactile sensor used in this work and how
its readings are adapted for the feature extraction method. To the best of our knowledge, this is
the first time that the spatio-temporal tactile features are used for this advanced biomimetic tactile
sensor. Next, we describe the spatio-temporal method for learning a sparse representation of the
tactile sequence data and how it is used for the grasp stability prediction.
4.3.1 Biomimetic Tactile Sensor
In this work, we use a haptically-enabled robot equipped with 3 biomimetic tactile sensors,
BioTacs (Wettels et al., 2008). Each BioTac consists of a rigid core housing an array of 19
electrodes surrounded by an elastic skin. The skin is inflated with an incompressible and conductive
liquid. When the skin is in contact with an object, the liquid is displaced, resulting in distributed
impedance changes in the electrode array on the surface of the rigid core. The impedance of
each electrode tends to be dominated by the thickness of the liquid between the electrode and the
immediately overlying skin.
Figure 4.2: The schematic of the electrode arrangements on the BioTac sensor (left). Tactile image
used for the ST-HMP features (right). The X values are the reference electrodes. The 19 BioTac
electrodes are measured relative to these 4 reference electrodes. V1 and V2 are created by taking
an average response of the neighboring electrodes: $V_1 = \mathrm{avg}(E_{17}, E_{18}, E_{12}, E_{2}, E_{13}, E_{3})$ and
$V_2 = \mathrm{avg}(E_{17}, E_{18}, E_{15}, E_{5}, E_{13}, E_{3})$.
In order to apply feature extraction methods from computer vision, as described in the next
section, the BioTac electrode readings have to be represented as tactile images. To form such a 2D
tactile image, the BioTac electrode values are laid out according to their spatial arrangement on the
sensor as depicted in Fig. 4.2.
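As a small illustration of the averaging stated in the caption of Fig. 4.2, the two virtual values V1 and V2 could be computed as sketched below. The function name, the 0-based array layout (E1 stored at index 0) and the toy readings are assumptions made for this example; the placement of the values inside the 2D tactile image is not reproduced here.

```python
import numpy as np

# Electrode numbers (1-based) averaged for each virtual value, as listed for Fig. 4.2.
V1_NEIGHBORS = (17, 18, 12, 2, 13, 3)
V2_NEIGHBORS = (17, 18, 15, 5, 13, 3)

def virtual_electrodes(electrodes):
    """Return (V1, V2) for a length-19 array holding E1..E19 in order."""
    e = np.asarray(electrodes, dtype=float)
    v1 = e[[k - 1 for k in V1_NEIGHBORS]].mean()
    v2 = e[[k - 1 for k in V2_NEIGHBORS]].mean()
    return v1, v2

# Toy reading in which electrode Ek simply reads the value k (hypothetical data).
v1, v2 = virtual_electrodes(np.arange(1, 20))
```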
4.3.2 Hierarchical Matching Pursuit
To describe a time series of tactile data, we employ a spatio-temporal feature descriptor, Spatio-Temporal
Hierarchical Matching Pursuit (ST-HMP). We choose ST-HMP features as they have
been shown to have high performance in temporal tactile data classification tasks (Madry et al.,
2014).
ST-HMP is based on Hierarchical Matching Pursuit (HMP), which is an unsupervised feature
extraction method used for images (Bo et al., 2011). HMP creates a hierarchy of several layers of
simple descriptors using a matching pursuit encoder and spatial pooling.
To encode data as sparse codes, HMP learns a dictionary of codewords. The dictionary is
learned using the common codebook-learning method K-SVD (Aharon et al., 2006). Given a
set of $N$ $H$-dimensional observations (e.g. image patches) $Y = [y_1, \ldots, y_N] \in \mathbb{R}^{H \times N}$, HMP
learns an $M$-word dictionary $D = [d_1, \ldots, d_M] \in \mathbb{R}^{H \times M}$ and the corresponding sparse codes
$X = [x_1, \ldots, x_N] \in \mathbb{R}^{M \times N}$ that minimize the reconstruction error between the original and the
encoded data:
\[ \min_{D, X} \|Y - DX\|_F^2 \quad \text{s.t.} \quad \forall m \; \|d_m\|_2 = 1 \;\; \text{and} \;\; \forall i \; \|x_i\|_0 \le K, \]
where $\|\cdot\|_F$ is the Frobenius norm, $x_i$ are the sparse vectors, $\|\cdot\|_0$ is the zero-norm that counts the number
of nonzero elements in a vector, and $K$ is the sparsity level that limits the number of nonzero
elements in the sparse codes.
The optimization problem of minimizing the reconstruction error can be solved in an alternating
manner. First, the dictionary $D$ is fixed and each of the sparse codes is optimized:
\[ \min_{x_i} \|y_i - D x_i\|^2 \quad \text{s.t.} \quad \|x_i\|_0 \le K. \]
Next, given the sparse code matrix $X$, the dictionary is recomputed by solving the following
optimization problem for each codeword in the dictionary:
\[ \min_{d_m} \|Y - DX\|_F^2 \quad \text{s.t.} \quad \|d_m\|_2 = 1. \]
The first step is combinatorial and NP-hard but there exist algorithms to solve it approximately.
HMP uses orthogonal matching pursuit (OMP) (Pati et al., 1993) to approximate the sparse codes.
OMP proceeds iteratively by selecting the codeword that is best correlated with the current residual
of the reconstruction error. Thus, at each iteration, it selects the codeword that maximally reduces
the error. After selecting a new codeword, observations are orthogonally projected into the space
spanned by all the previously selected codewords. The residual is then recomputed and a new
codeword is selected again. This process is repeated until the desired sparsity level $K$ is reached.
The dictionary optimization step can be performed using a standard gradient descent algorithm.
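The sparse coding step can be illustrated with a minimal NumPy sketch of orthogonal matching pursuit. This is not the HMP implementation used in this work: the function name `omp`, the stopping tolerance and the identity dictionary in the toy example are assumptions chosen only to make the greedy selection and the orthogonal projection explicit.

```python
import numpy as np

def omp(D, y, K):
    """Greedily approximate argmin ||y - D x||^2 subject to ||x||_0 <= K.

    D: (H, M) dictionary with unit-norm columns; y: (H,) observation.
    """
    y = np.asarray(y, dtype=float)
    x = np.zeros(D.shape[1])
    residual = y.copy()
    support = []
    for _ in range(K):
        # Codeword best correlated with the current residual.
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick a codeword twice
        k = int(np.argmax(correlations))
        if correlations[k] < 1e-12:  # residual is already explained
            break
        support.append(k)
        # Orthogonal projection of y onto the span of the selected codewords.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(D.shape[1])
        x[support] = coef
        residual = y - D @ x
    return x

# Toy example with an orthonormal (identity) dictionary, where OMP is exact.
D = np.eye(4)
y = np.array([0.0, 3.0, 0.0, -2.0])
x1 = omp(D, y, K=1)  # keeps only the largest component
x2 = omp(D, y, K=2)  # reconstructs y with two nonzero entries
```

With a learned, non-orthogonal dictionary the result is only an approximation of the optimal sparse code, which is the reason the text describes the first step as solved approximately.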
4.3.3 Spatio-Temporal HMP
In ST-HMP, the tactile information is aggregated both in the spatial and the temporal domains.
This is achieved by constructing a pyramid of spatio-temporal features at different coarseness
levels, which provides invariance to spatial and temporal deviations of the tactile signal. In the
spatial domain, the dictionary is learned and the sparse codes are extracted from small tactile image
patches. As the next step, the sparse codes are aggregated using spatial max-pooling. In particular,
the image is divided into spatial cells and each cell’s features are computed by max-pooling all the
sparse codes inside the spatial cell:
\[ F(C_s) = \left[ \max_{j \in C_s} |x_{j1}|, \ldots, \max_{j \in C_s} |x_{jm}| \right], \]
where $j \in C_s$ indicates that the image patch is inside the spatial cell $C_s$ and $m$ is the dimensionality
of the sparse codes. The HMP features of each tactile image in the time sequence are computed
by performing max-pooling on various levels of the spatial pyramid, i.e. for different sizes of the
spatial cells.
After computing the HMP features for all tactile images in the time series, the pooling is
performed on the temporal level by constructing a temporal pyramid. The tactile sequence is
divided into subsequences of different lengths. For all subsequences, the algorithm performs
max-pooling of the HMP features resulting in a single feature descriptor for each subsequence.
Combined with spatial pooling, this results in a spatio-temporal pooling of the sparse codes:
\[ F(C_{st}) = \left[ \max_{j \in C_{st}} |\hat{x}_{j1}|, \ldots, \max_{j \in C_{st}} |\hat{x}_{jm}| \right], \]
where $\hat{x}_j$ is an HMP feature vector that is inside the spatio-temporal cell $C_{st}$.
Finally, the features of all the spatio-temporal cells are concatenated to create a single feature
vector $F_P$ for the complete tactile sequence:
\[ F_P = [C_{11}, \ldots, C_{ST}], \]
where $S$ is the number of the spatial cells and $T$ is the number of the temporal cells. Hence, the
total dimensionality of the ST-HMP descriptor is $S \times T \times M$.
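The pooling scheme described above can be sketched as follows. This is a simplified illustration, not the original ST-HMP code: the pyramid is built over a flat list of patches rather than a 2D grid, and the function names, pyramid levels and random sparse codes are assumptions. It does show how max-pooling absolute values first over spatial cells and then over temporal cells yields a descriptor of length S × T × M.

```python
import numpy as np

def pyramid_max_pool(codes, levels):
    """Max-pool absolute values over a 1-D pyramid of cells.

    codes: (n_items, M) array; levels: cells per pyramid level, e.g. [1, 2, 4].
    Assumes n_items >= max(levels). Returns a vector of length sum(levels) * M.
    """
    pooled = []
    for n_cells in levels:
        for cell in np.array_split(np.abs(codes), n_cells, axis=0):
            pooled.append(cell.max(axis=0))
    return np.concatenate(pooled)

def st_pool(codes, spatial_levels, temporal_levels):
    """codes: (n_frames, n_patches, M) sparse codes of a tactile sequence."""
    # Spatial pooling per tactile image gives one HMP feature vector per frame.
    frame_features = np.stack([pyramid_max_pool(f, spatial_levels) for f in codes])
    # Temporal pooling over subsequences of frames gives the final descriptor.
    return pyramid_max_pool(frame_features, temporal_levels)

rng = np.random.default_rng(0)
codes = rng.standard_normal((8, 4, 3))  # 8 frames, 4 patches, M = 3 (toy sizes)
descriptor = st_pool(codes, spatial_levels=[1, 2], temporal_levels=[1, 2, 4])
# S = 1 + 2 = 3 spatial cells, T = 1 + 2 + 4 = 7 temporal cells, M = 3.
```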
After extracting the ST-HMP feature descriptor from the tactile sequence, we use a Support
Vector Machine (SVM) with a linear kernel to learn a classifier for the grasp stability prediction as
described in (Madry et al., 2014). The tactile sequences used in this work consist of tactile data
shortly before and after starting to pick up a grasped object. This process is described in more
detail in Sec. 4.5.
In this work, we also compare features extracted only from tactile sensors with combinations of
tactile and non-tactile features, such as force-torque sensors, strain gages, hand orientation, finger
angles, etc. To achieve temporal invariance, we apply temporal max-pooling to these features
with the same temporal pyramid as for the tactile features. By doing so, we can combine tactile
ST-HMP and non-tactile temporally pooled features by concatenating their feature vectors.
4.4 Self-Supervised Reinforcement Learning for Regrasping
Once a grasp is predicted as a failure by the grasp stability predictor, the robot has to place the
object down and regrasp it using the information acquired during the initial grasp. In order to
achieve this goal, we learn a mapping from the tactile features of the initial grasp to the grasp
adjustment. The parameters of this mapping function are learned using a reinforcement learning
approach. In the following, we explain how this mapping is computed. In addition, we describe
the policy search method and our approach to reducing the dimensionality of the problem.
4.4.1 Mapping from Tactile Features to Grasp Adjustments
Similarly to the grasp stability prediction, we use the ST-HMP features as the tactile data representation
to compute the adjustment of the unsuccessful grasp. In particular, we use a linear
combination of the ST-HMP features to compute the change of the grasp pose. The weights of this
combination are learned using a policy search algorithm described in the next section.
Using multiple levels in the spatial and temporal pyramids of ST-HMP increases the dimensionality
of tactile features substantially. This leads to a large number of parameters to learn for
the mapping function, which is usually a hard task for policy search algorithms (M. Deisenroth
et al., 2013). Therefore, we perform principal component analysis (PCA) (Jolliffe, 1986) on the
ST-HMP features and use only the largest principal components to compute the mapping.
The grasp pose adjustment is represented by the 3-dimensional change of the gripper’s position
and 3 Euler angles describing the change of the gripper’s orientation. Each adjustment dimension
is computed separately using the largest principal components of the ST-HMP features:
\[
\begin{pmatrix} x \\ y \\ z \\ \alpha \\ \beta \\ \gamma \end{pmatrix}
=
\begin{pmatrix}
w_{x,1} & \cdots & w_{x,n} \\
w_{y,1} & \cdots & w_{y,n} \\
\vdots & \ddots & \vdots \\
w_{\gamma,1} & \cdots & w_{\gamma,n}
\end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_n \end{pmatrix},
\]
where $w_{i,j}$ are the weights of the tactile features $\phi_j$ and $n$ is the number of principal components
that are used for the mapping.
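A minimal sketch of this mapping is given below, assuming PCA is computed with a singular value decomposition of the centered feature matrix. The function names, the random stand-in features and the small random weights are illustrative assumptions; in this work the weights are learned by policy search rather than drawn at random.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return the feature mean and the leading principal directions (as rows)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def grasp_adjustment(W, mean, components, feature_vector):
    """Map one tactile descriptor to a 6-D grasp pose adjustment."""
    phi = components @ (feature_vector - mean)  # n principal-component features
    return W @ phi                              # position change and Euler angles

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 40))      # stand-in for 50 tactile descriptors
mean, components = fit_pca(features, n_components=5)
W = rng.normal(scale=0.01, size=(6, 5))   # 6 x n weight matrix (hypothetical values)
adjustment = grasp_adjustment(W, mean, components, features[0])
```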
4.4.2 Policy Search for Learning Mapping Parameters
The linear combination weights of the mapping from the tactile features to the grasp adjustments
$w_{i,j}$ are learned using a reinforcement learning technique described in the following. All the
combination weights are first concatenated to form a single parameter vector:
\[ \theta = (w_{x,1}, \ldots, w_{x,n}, \ldots, w_{\gamma,n}). \]
We define the policy $\pi(\theta)$ as a Gaussian distribution over $\theta$ with a mean $\mu$ and a covariance
matrix $\Sigma$. In order to find good regrasping candidates, the parameter vector $\theta$ is sampled from
this distribution. In the next step, we compute the reward $R(\theta)$ by estimating the success of the
adjusted grasp using the grasp stability predictor described in Sec. 4.3. After a number of trials is
collected, the current policy is optimized and the process repeats. For policy optimization we use
the relative entropy policy search (REPS) algorithm (Peters et al., 2010). The main advantage of
this method is that, in the process of reward maximization, the loss of information during a policy
update is bounded, which leads to better convergence behavior.
The goal of REPS is to maximize the expected reward $J(\pi)$ of the policy $\pi$ subject to bounded
information loss between the previous and updated policy. Information loss is defined as the
Kullback-Leibler (KL) divergence between the two policies. Bounding the information loss limits
the change of the policy and hence avoids sampling too far into unexplored policy regions.
Let $q(\theta)$ be the old policy and $\pi(\theta)$ be the new policy after the policy update. Using this
notation, we can formulate a constrained optimization problem:
\[
\begin{aligned}
\max_{\pi} \quad & J(\pi) = \int \pi(\theta) R(\theta) \, d\theta \\
\text{s.t.} \quad & \int \pi(\theta) \log \frac{\pi(\theta)}{q(\theta)} \, d\theta \le \epsilon, \\
& \int \pi(\theta) \, d\theta = 1,
\end{aligned}
\]
where, as mentioned before, $J(\pi)$ is the total expected reward of using the policy $\pi(\theta)$. The first
constraint bounds the KL-divergence between the policies with the maximum information lost set
to $\epsilon$. The second constraint ensures that $\pi(\theta)$ forms a proper probability distribution.
Solving the optimization problem with Lagrange multipliers results in the following dual
function:
\[ g(\eta) = \eta \epsilon + \eta \log \int q(\theta) \exp\left( \frac{R(\theta)}{\eta} \right) d\theta, \]
where the integral term can be approximated from the samples using the maximum-likelihood
estimate of the expected value. Furthermore, from the Lagrangian it can be derived that:
\[ \pi(\theta) \propto q(\theta) \exp\left( \frac{R(\theta)}{\eta} \right). \]
Therefore, we are able to compute the new policy parameters with a weighted maximum-likelihood
solution. The weights are equal to $\exp(R(\theta)/\eta)$, where the rewards are scaled by the parameter $\eta$.
By decreasing $\eta$ one gives larger weights to the high-reward samples. An increase of $\eta$ results in
more uniform weights. The parameter $\eta$ is computed according to the optimization constraints by
solving the dual problem.
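Given a set of policy parameters $\{\theta_1, \ldots, \theta_N\}$ and corresponding episode rewards, the policy
update rules for $\mu$ and $\Sigma$ can be formulated as follows (M. Deisenroth et al., 2013):
\[
\mu = \frac{\sum_{i=1}^{N} d_i \theta_i}{\sum_{i=1}^{N} d_i}, \qquad
\Sigma = \frac{\sum_{i=1}^{N} d_i (\theta_i - \mu)(\theta_i - \mu)^{\top}}{\sum_{i=1}^{N} d_i},
\]
with weights $d_i = \exp(R(\theta_i)/\eta)$.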
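The update can be sketched in a few lines of NumPy, writing the sampled parameter vectors as `thetas`, the KL bound as `epsilon` and the dual variable as `eta`. Two choices below are assumptions of this sketch rather than part of the method as described: the dual is minimized by a simple search over a logarithmic grid of `eta` values, and the maximum reward is subtracted inside the exponentials for numerical stability (this rescales all weights by the same factor and leaves the mean and covariance unchanged).

```python
import numpy as np

def reps_dual(eta, rewards, epsilon):
    """Sample-based dual g(eta) = eta*epsilon + eta*log(mean(exp(R/eta)))."""
    r_max = rewards.max()  # factored out of the exponentials for stability
    return eta * epsilon + r_max + eta * np.log(np.mean(np.exp((rewards - r_max) / eta)))

def reps_update(thetas, rewards, epsilon, eta_grid=None):
    """Weighted maximum-likelihood update of a Gaussian policy."""
    if eta_grid is None:
        eta_grid = np.logspace(-3, 3, 2000)  # hypothetical search range
    duals = [reps_dual(eta, rewards, epsilon) for eta in eta_grid]
    eta = float(eta_grid[int(np.argmin(duals))])
    d = np.exp((rewards - rewards.max()) / eta)  # weights, up to a common factor
    mu = d @ thetas / d.sum()
    diff = thetas - mu
    sigma = (d[:, None] * diff).T @ diff / d.sum()
    return mu, sigma, eta

# Toy problem: rewards are higher for parameters close to (1, 1, 1).
rng = np.random.default_rng(0)
thetas = rng.normal(size=(100, 3))
rewards = -np.sum((thetas - 1.0) ** 2, axis=1)
mu, sigma, eta = reps_update(thetas, rewards, epsilon=0.5)
```

A small `epsilon` keeps the new policy close to the sampling distribution, while a larger value allows the update to concentrate on the highest-reward samples.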
4.4.3 Learning a General Regrasping Policy
After the individual linear policies have been learned, we train a larger high-dimensional policy in a
supervised manner using the outputs of the individual policies. This is similar to the guided policy
search approach proposed in (Levine & Koltun, 2013). In our case, the guidance of the general
policy comes from the individual policies that can be efficiently learned for separate objects. As the
general policy class we choose a neural network with a large number of parameters. Such a policy
has enough representational richness to incorporate regrasping behavior for many different objects.
However, learning its parameters directly requires a very large number of experiments, whereas
supervised learning with already learned individual policies speeds up the process significantly.
To generate training data for learning the general policy, we sample grasp corrections from the
already learned individual policies using previously collected data. Input features and resulting
grasp corrections are combined in a “transfer” dataset, which is used to transfer the behaviors to the
general policy. In order to increase the amount of information provided to the general policy, we
increase the number of its input features by extracting a larger number of PCA components from
the ST-HMP features. Using different features in the general policy than in the original individual
policies is one of the advantages of our setting. The individual policies provide outputs of the
desired behavior, while the general policy can have a different set of input features.
Figure 4.3: Objects and experimental setup used for learning the grasp stability predictor and the
regrasping behavior. If an object falls out of the hand it returns to its initial position due to the
shape of the bowl. Top-left: the cylinder. Top-right: the box. Bottom-left: the ball. Bottom-right:
the novel object.
To train the neural network, we employ the mean-squared error loss function and the Levenberg-Marquardt
optimization algorithm (Hagan & Menhaj, 1994). In the hidden layers, we use neurons
with the hyperbolic tangent sigmoid transfer function:
\[ a(x) = \frac{2}{1 + \exp(-2x)} - 1. \]
For the activation of the output layer we use a linear transfer function, i.e. the output is a linear
combination of the inputs from the previous layer. In order to avoid overfitting of the training
data we employ the early stopping technique during training (Yao et al., 2007). The data set is
divided into mutually exclusive training, validation and test sets. While the network parameters are
optimized on the training set, the training stops once the performance on the validation set starts
decreasing.
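Two details of this training setup can be checked with a short sketch: the tangent sigmoid transfer function is numerically identical to the hyperbolic tangent, and early stopping amounts to keeping the parameters from the epoch with the best validation loss. The helper `early_stopping_epoch`, its `patience` argument and the list of validation losses are hypothetical and serve only as an illustration of the stopping rule.

```python
import numpy as np

def tansig(x):
    """Hyperbolic tangent sigmoid transfer function a(x) = 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def early_stopping_epoch(val_losses, patience=1):
    """Return the epoch whose parameters are kept when training stops early."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation performance got worse
                break
    return best_epoch

xs = np.linspace(-3.0, 3.0, 13)
same_as_tanh = np.allclose(tansig(xs), np.tanh(xs))
kept_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.5])
```

With `patience=1`, training stops at the first epoch where the validation loss rises, so the later improvement to 0.5 in the toy list is never reached.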
4.5 Experimental Results
4.5.1 Evaluation of Grasp Stability Prediction
In our experiments, we use a Barrett arm and hand that is equipped with three biomimetic tactile
sensors (BioTacs) (Wettels et al., 2008). We use bowls (see Fig. 4.3) to bring the objects upright if
they fall out of the gripper during the extensive shaking motions that are performed later in the
experiment. This experimental setup enables us to fully automate the learning process and let the
robot run for many hours to autonomously learn the grasp stability predictor.
The experiment proceeds as follows. The robot reaches for the object to perform a randomly
generated top grasp. The randomness stems from white noise added to the top grasp. Standard
deviation of the noise is ±10 deg in roll and pitch of the gripper, ±60 deg in yaw, and ±1 cm in all
translational directions. These parameters are tuned such that there is always at least one finger
touching the object. After approaching and grasping the object using the force grip controller (Su
et al., 2015), the robot lifts the object and performs extensive shaking motions in all directions
to ensure that the grasp is stable. The shaking motions are performed by rapidly changing the
end-effector’s orientation by ±15 deg and position by ±3 cm in all directions multiple times. If the
object is still in the hand after the shaking motions, we consider it to be a successful grasp. The
wrist-mounted force-torque sensor is used to determine if the object is still in the hand at the end
of the experiment.
The ST-HMP features use a temporal window of 650 ms before and 650 ms after the robot starts
picking up the object. Our goal is to determine early in the lifting phase if the grasp is going to
fail. In this manner, the robot can stop the motion early enough to avoid displacing the object, and
hence, it can regrasp it later. We evaluate our approach on three objects: a cylindrical object, a box
and a ball. We perform a 5-fold cross-validation on 500 grasp samples for each object. The robot
achieves a grasp classification accuracy of 90.7% on the cylinder, 82.4% on the box and 86.4% on
the ball.
4.5.2 Learning Individual Linear Regrasping Policies
After learning the grasp stability predictor, we evaluate the regrasping algorithm for individual
policies. The experimental setup for this scenario is similar to the one for the grasp stability
predictor. The robot uses the stability prediction to self-supervise the learning process. In this
manner, we are able to let the robot run for many hours for each object to autonomously learn the
regrasping behavior.
As described in Sec. 4.4.1, we apply PCA and extract five principal components from the
ST-HMP features for learning individual policies. As a result, linear policies contain only 30
parameters (5 for each of the 6 grasp adjustment dimensions). This makes the policy search feasible
using relatively small amounts of data.
Figure 4.4: Top left: schematic of the electrode arrangements on the BioTac sensor and the
corresponding tactile image used for the ST-HMP features. V1, V2 and V3 are computed by
averaging the neighboring electrode values. Top right, bottom left, bottom right: reinforcement
learning curves for regrasping individual objects using REPS. Policy updates are performed every
100 regrasps.
[Plot panels: Regrasping: Cylinder, Regrasping: Box, Regrasping: Ball. Horizontal axis: Nr. of Policy Updates. Vertical axis: Avg. Reward.]
We evaluate the individual policies learned for the cylinder, box and ball objects. We perform
multiple policy updates for each object until the policy converges. For each update, we collect 100
regrasping samples. First, we perform a randomly generated top grasp. If the grasp is predicted
to fail, the algorithm samples the current regrasping policy and the robot performs up to three
regrasps. If one of the regrasps is successful, the robot stops regrasping and performs the next
random grasp. The rewards for the reinforcement learning are specified as follows.
0.0: The grasp is predicted unsuccessful by the grasp stability predictor. We do not perform any
additional actions.
0.5: The grasp is predicted successful by the stability predictor. However, the object falls out of
the hand after additional extensive shaking motions.
1.0: The grasp is predicted successful and the object is still in the hand after the shaking motions.
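The three reward levels can be written as a small function. The function and argument names are hypothetical; the two boolean inputs stand for the output of the grasp stability predictor and for the force-torque check after the shaking motions.

```python
def regrasp_reward(predicted_stable: bool, held_after_shaking: bool) -> float:
    """Reward of one regrasp attempt, following the three cases listed above."""
    if not predicted_stable:
        # Predicted failure: no lifting or shaking is performed.
        return 0.0
    # Predicted success: full reward only if the object survives the shaking.
    return 1.0 if held_after_shaking else 0.5

rewards = [
    regrasp_reward(False, False),
    regrasp_reward(True, False),
    regrasp_reward(True, True),
]
```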
Fig. 4.4 shows the average reward values after each policy update for all the objects. The robot
is able to improve its regrasping behavior significantly. To evaluate the results of the policy search,
we perform 100 random grasps using the final policies on each of the objects that they were learned
on. The robot has three attempts to regrasp each object using the learned policy. Table 4.1 shows
the percentage of successful grasps on each object after each regrasp. Already after one regrasp,
the robot is able to correct the majority of the failed grasps by increasing the success rate of the
grasps from 41.8% to 83.5% on the cylinder, from 40.7% to 85.4% on the box and from 52.9%
to 84.8% on the ball. Moreover, allowing additional regrasps increases this value to 90.3% for
two and 97.1% for three regrasps on the cylinder, to 93.7% and 96.8% on the box, and to 91.2% and
95.1% on the ball. These results indicate that the robot is able to learn a tactile-based regrasping
strategy for individual objects.
4.5.3 Evaluation of General Regrasping Policy
After training individual policies we create a “transfer” dataset with grasp corrections obtained
from the individual linear regrasping policies for all objects. For each set of tactile features, we
query the respective previously learned linear policy for the corresponding grasp correction. We
take the input features for the individual policies from the failed grasps in the open-source BiGS*
dataset (Chebotar, Hausman, Su, Molchanov, et al., 2016). The grasps in BiGS were collected
in an analogous experimental setup and can directly be used for creating the “transfer” dataset.
In total, the training set contains 3351 examples: 1380 for the cylinder, 1035 for the box and
936 for the ball. We use supervised learning with the obtained dataset to learn a combined policy
that mimics the behavior of the individual policies. In this work, we employ a neural network to
achieve this task.
To find the optimal architecture of the neural network, we evaluated different networks with
various depths and numbers of neurons to learn the nonlinear policy. The best performance is
achieved by using 20 ST-HMP PCA features as inputs. We have not observed any improvement
of the approximation accuracy when using more than one hidden layer. This indicates that the
ST-HMP algorithm already extracts the most distinctive features from the tactile data and we do not
require a deeper network architecture for our task. The final neural network consists of one
hidden layer of 20 hidden units with tangent sigmoid activation functions, 20 input features and
6 outputs for grasp position and orientation adjustments. The resulting number of parameters in
the generalized policy is 546. Such a high-dimensional policy would be hard to learn by directly
employing reinforcement learning. Our formulation as supervised learning, however, simplifies
this problem and makes the learning process with relatively small amounts of data more feasible.
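The parameter count follows directly from the layer sizes: 20 × 20 weights plus 20 biases in the hidden layer, and 20 × 6 weights plus 6 biases in the output layer. The sketch below reproduces this count and a forward pass of such a network; the random weights and the function name `general_policy` are placeholders for illustration, not the trained policy.

```python
import numpy as np

N_IN, N_HIDDEN, N_OUT = 20, 20, 6

# Hidden layer (weights + biases) plus linear output layer (weights + biases).
n_params = (N_IN * N_HIDDEN + N_HIDDEN) + (N_HIDDEN * N_OUT + N_OUT)

def general_policy(phi, W1, b1, W2, b2):
    """One hidden tanh layer followed by a linear output layer."""
    hidden = np.tanh(W1 @ phi + b1)
    return W2 @ hidden + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(N_HIDDEN, N_IN)), np.zeros(N_HIDDEN)
W2, b2 = rng.normal(size=(N_OUT, N_HIDDEN)), np.zeros(N_OUT)
correction = general_policy(rng.normal(size=N_IN), W1, b1, W2, b2)
```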
Table 4.1 shows performance of the generalized policy on the single objects. Since the combined
policy is deterministic, we only evaluate a single regrasp for each failed grasp. Interestingly,
the combined policy is able to achieve better performance on each of the single objects than the
respective linear policies learned specifically for these objects after one regrasp. Furthermore, in
cases of the cylinder and the ball, the performance of the generalized policy is better than the
linear policies evaluated after two regrasps. This shows that the general policy generalizes well
between the single policies. In addition, by utilizing the knowledge obtained from single policies,
the generalized policy is able to perform better on the objects that the single policies were trained
on.
The performance of the generalized policy on the box object is slightly worse than on the two
other objects. A notable difference in this case is the increased importance of the gripper yaw angle
with respect to the grasp performance. The individual policy learned on the box learns to correct
the grasp such that the robot aligns its fingers with the box sides while regrasping. However, this is
not important for the cylinder and the ball objects due to their symmetric shapes. Therefore, the
regrasping policy for the box could not benefit from the two other policies when adjusting grasp in
the yaw direction.
* http://bigs.robotics.usc.edu/

                 Individual policies                               Combined policy
Object           No regrasps  1 regrasp  2 regrasps  3 regrasps
Cylinder         41.8         83.5       90.3        97.1          92.3
Box              40.7         85.4       93.7        96.8          87.6
Ball             52.9         84.8       91.2        95.1          91.4
New object       40.1         -          -           -             80.7

Table 4.1: Performance of the individual and combined regrasping policies. Values are percentages
of successful grasps.
We test performance of the generalized policy on a novel, more complex object (see the
bottom-right corner in Fig. 4.3), which was not present during learning. It is worth noting that
the novel object combines features of the three objects that the policies were trained on. The
generalized policy is able to improve the grasping performance significantly, which shows its
ability to generalize to more complex objects. Nevertheless, there are some difficulties when
the robot performs a regrasp on a part of the object that is different from the initial grasp, such as
switching from the round lid to the bottom part of the object, which is of a box form. In this case,
the regrasp is incorrect for the new part of the object, i.e. the yaw adjustment is suboptimal for the
box part of the object due to the round grasping surface (the lid) in the initial grasp. The reason is
the lack of this data point in the previously encountered situations in the training dataset.
During the experiments, we were able to observe many intuitive corrections made by the robot
using the learned regrasping policy. The robot was able to identify if one of the fingers was only
barely touching the object’s surface, causing the object to rotate in the hand. In this case, the
regrasp resulted in either rotating or translating the gripper such that all of its fingers were firmly
touching the object. Another noticeable trend learned through reinforcement learning was that the
robot would regrasp the middle part of the object which was closer to the center of mass, hence,
more stable for grasping. On the box object, the robot learned to change its grasp such that its
fingers were aligned with the box’s sides. These results indicate that not only can the robot learn a
set of linear regrasping policies for individual objects, but also that it can use them as the basis for
guiding the generalized regrasping behavior.
4.6 Conclusions
In this work, we presented a framework for learning regrasping behaviors based on tactile data.
We trained a grasp stability predictor that uses spatio-temporal tactile features collected from the
early object-lifting phase to predict the grasp outcome with high accuracy. The trained predictor
was used to supervise and provide feedback to a reinforcement learning algorithm that also uses
spatio-temporal features extracted from a biomimetic tactile sensor to estimate the required grasp
adjustments. In addition, we proposed a method that is able to learn complex high-dimensional
policies by using examples from simple policies learned through reinforcement learning. In this
manner, we were able to avoid requiring large amounts of data to learn complex policies. Instead,
we employed supervised learning techniques to mimic various behaviors of simple policies.
We were able to achieve a high grasp prediction accuracy and significantly improve grasp
success rate of single objects. Our experiments indicate that the combined policy learned using
our method is able to achieve better performance on each of the single objects than the respective
linear policies learned using reinforcement learning specifically for these objects after one regrasp.
Moreover, the general policy achieves a high success rate after one regrasp on a novel object that
was not present during training. These results show that our supervised policy learning method
applied to regrasping can generalize to more complex objects.
Chapter 5
Time-Contrastive Networks: Self-Supervised Learning
from Video
We propose a self-supervised approach for learning representations and robotic behaviors entirely
from unlabeled videos recorded from multiple viewpoints, and study how this representation can
be used in two robotic imitation settings: imitating object interactions from videos of humans, and
imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation
that captures the relationships between end-effectors (hands or robot grippers) and the environment,
object attributes, and body pose. We train our representations using a triplet loss, where multiple
simultaneous viewpoints of the same observation are attracted in the embedding space, while
being repelled from temporal neighbors which are often visually similar but functionally different.
This signal causes our model to discover attributes that do not change across viewpoint, but do
change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting
and background. We demonstrate that this representation can be used by a robot to directly mimic
human poses without an explicit correspondence, and that it can be used as a reward function
within a reinforcement learning algorithm. While representations are learned from an unlabeled
collection of task-related videos, robot behaviors such as pouring are learned by watching a
single 3rd-person demonstration by a human. Reward functions obtained by following the human
demonstrations under the learned representation enable efficient reinforcement learning that is
practical for real-world robotic systems.
5.1 Introduction
While supervised learning has been successful on a range of tasks where labels can be easily specified
by humans, such as object classification, many problems that arise in interactive applications
like robotics are exceptionally difficult to supervise. For example, it would be impractical to label
every aspect of a pouring task in enough detail to allow a robot to understand all the task-relevant
properties. Pouring demonstrations could vary in terms of background, containers, and viewpoint,
and there could be many salient attributes in each frame, e.g. whether or not a hand is contacting
a container, the tilt of the container, or the amount of liquid currently in the target vessel or its
viscosity. Ideally, robots in the real world would be capable of two things: learning the relevant
[Figure 5.1 diagram: anchor and positive frames from two simultaneous views (View 1, View 2) and a random or hard negative from temporal neighbors are passed through the TCN embedding (deep network) into a triplet loss; the axes are views (and modalities) and time; the embedding is used for self-supervised imitation.]
Figure 5.1: Time-Contrastive Networks (TCN): Anchor and positive images taken from simultaneous
viewpoints are encouraged to be close in the embedding space, while distant from negative
images taken from a different time in the same sequence. The model trains itself by trying to
answer the following questions simultaneously: What is common between the different-looking
blue frames? What is different between the similar-looking red and blue frames? The resulting
embedding can be used for self-supervised robotics in general, but can also naturally handle
3rd-person imitation.
attributes of an object interaction task purely from observation, and understanding how human
poses and object interactions can be mapped onto the robot, in order to imitate directly from
third-person video observations.
In this work, we take a step toward addressing these challenges simultaneously through the use
of self-supervision and multi-viewpoint representation learning. We obtain the learning signal from
unlabeled multi-viewpoint videos of interaction scenarios, as illustrated in Figure 5.1. By learning
from multi-view videos, the learned representations effectively disentangle functional attributes
such as pose while being viewpoint and agent invariant. We then show how the robot can learn
to link this visual representation to a corresponding motor command using either reinforcement
learning or direct regression, effectively learning new tasks by observing humans.
The main contribution of our work is a representation learning algorithm that builds on top
of existing semantically relevant features (in our case, features from a network trained on the
ImageNet dataset (Deng et al., 2009; Szegedy et al., 2015)) to produce a metric embedding that is
sensitive to object interactions and pose, and insensitive to nuisance variables such as viewpoint
and appearance. We demonstrate that this representation can be used to create a reward function
for reinforcement learning of robotic skills, using only raw video demonstrations for supervision,
and for direct imitation of human poses, without any explicit joint-level correspondence and again
directly from raw video. Our experiments demonstrate effective learning of a pouring task with
a real robot, moving plates in and out of a dish rack in simulation, and real-time imitation of
human poses. Although we train a different TCN embedding for each task in our experiments, we
construct the embeddings from a variety of demonstrations in different contexts, and discuss how
larger multi-task embeddings might be constructed in future work.
5.2 Related Work
Imitation learning: Imitation learning (Argall et al., 2009) has been widely used for learning
robotic skills from expert demonstrations (Ijspeert et al., 2002; Ratliff et al., 2007; Mülling et
al., 2013; Duan et al., 2017) and can be split into two areas: behavioral cloning and inverse
reinforcement learning (IRL). Behavioral cloning considers a supervised learning problem, where
examples of behaviors are provided as state-action pairs (Pomerleau, 1991; Ross et al., 2011).
IRL on the other hand uses expert demonstrations to learn a reward function that can be used to
optimize an imitation policy with reinforcement learning (Abbeel & Ng, 2004). Both types of
imitation learning typically require the expert to provide demonstrations in the same context as the
learner. In robotics, this might be accomplished by means of kinesthetic demonstrations (Calinon
et al., 2007) or teleoperation (Pastor et al., 2009), but both methods require considerable operator
expertise. If we aim to endow robots with wide repertoires of behavioral skills, being able to acquire
those skills directly from third-person videos of humans would be dramatically more scalable.
Recently, a range of works have studied the problem of imitating a demonstration observed in a
different context, e.g. from a different viewpoint or an agent with a different embodiment, such
as a human (Stadie et al., 2017; Dragan & Srinivasa, 2012; Sermanet et al., 2016). Liu et al.
(2017) proposed to translate demonstrations between the expert and the learner contexts to
learn an imitation policy by minimizing the distance to the translated demonstrations. However,
Liu et al. explicitly exclude from consideration any demonstrations with domain shift, where the
demonstration is performed by a human and imitated by the robot with clear visual differences (e.g.,
human hands vs. robot grippers). In contrast, our TCN models are trained on a diverse range of
demonstrations with different embodiments, objects, and backgrounds. This allows our TCN-based
method to directly mimic human demonstrations, including demonstrations where a human pours
liquid into a cup, and to mimic human poses without any explicit joint-level alignment. To our
knowledge, our work is the first method for imitation of raw video demonstrations that can both
mimic raw videos and handle the domain shift between human and robot embodiment.
Label-free training signals: Label-free learning of visual representations promises to enable
visual understanding from unsupervised data, and therefore has been explored extensively in recent
years. Prior work in this area has studied unsupervised learning as a way of enabling supervised
learning from small labeled datasets (Dumoulin et al., 2017), image retrieval (Paulin et al., 2015),
and a variety of other tasks (Wang & Gupta, 2015; Zhang et al., 2016; Vincent et al., 2008; Kumar
et al., 2016). In this work, we focus specifically on representation learning for the purpose of modeling
interactions between objects, humans, and their environment, which requires implicit modeling of a
broad range of factors, such as functional relationships, while being invariant to nuisance variables
such as viewpoint and appearance. Our method makes use of simultaneously recorded signals from
multiple viewpoints to construct an image embedding. A number of prior works have used multiple
modalities and temporal or spatial coherence to extract embeddings and features. For example,
(Owens et al., 2015; Aytar et al., 2016) used co-occurrence of sounds and visual cues in videos to
learn meaningful visual features. (Zhang et al., 2016) also propose a multi-modal approach for
self-supervision by training a network for cross-channel input reconstruction. (Doersch et al., 2015;
Zagoruyko & Komodakis, 2015) use the spatial coherence in images as a self-supervision signal
and (Pathak et al., 2016) use motion cues to self-supervise a segmentation task. These methods are
more focused on spatial relationships, and the unsupervised signal they provide is complementary
to the one explored in this work.
A number of prior works use temporal coherence (Wiskott & Sejnowski, 2002; Goroshin et
al., 2015; Fernando et al., 2016; Misra et al., 2016). Others also train for viewpoint invariance
using metric learning (Kumar et al., 2016; Yi et al., 2016; Simo-Serra et al., 2015). The novelty of
our work is to combine both aspects in opposition, as explained in Sec. 5.3.1. (Wang & Gupta,
2015) uses a triplet loss that encourages first and last frames of a tracked sequence to be closer
together in the embedding, while random negative frames from other videos are far apart. Our
method differs in that we use temporal neighbors as negatives to push against a positive that is
anchored by a simultaneous viewpoint. This causes our method to discover meaningful dimensions
such as attributes or pose, while (Wang & Gupta, 2015) focuses on learning intra-class invariance.
Simultaneous multi-view capture also provides exact correspondence while tracking does not, and
can provide a rich set of correspondences such as occlusions, blur, lighting and viewpoint.
Other works have proposed to use prediction as a learning signal (Whitney et al., 2016; Mathieu
et al., 2015). The resulting representations are typically evaluated primarily on the realism of the
predicted images, which remains a challenging open problem. A number of other prior methods
have used a variety of labels and priors to learn embeddings. (Mori et al., 2015) use a labeled
dataset to train a pose embedding, then find the nearest neighbors for new images from the training
data for a pose retrieval task. Our method is initialized via ImageNet training, but can discover
dimensions such as pose and task progress (e.g., for a pouring task) without any task-specific
labels. (Stewart & Ermon, 2016) explore various types of physical priors, such as the trajectories
of objects falling under gravity, to learn object tracking without explicit supervision. Our method
is similar in spirit, in that it uses temporal co-occurrence, which is a universal physical property,
but the principle we use is general and broadly applicable and does not require task-specific input
of physical rules.
Mirror Neurons: Humans and animals have been shown, experimentally, to possess viewpoint-invariant
representations of objects and other agents in their environment (Caggiano et al., 2011),
and the well-known work on “mirror neurons” has demonstrated that these viewpoint-invariant
representations are crucial for imitation (Rizzolatti & Craighero, 2004). Our multi-view capture
setup is similar to the experimental setup used by (Caggiano et al., 2011), and our robot imitation
setup, where a robot imitates human motion without ever receiving ground truth pose labels,
examines how self-supervised pose recognition might arise in a learned system.
5.3 Imitation with Time-Contrastive Networks
Our approach to imitation learning is to only rely on sensory inputs from the world. We achieve this
in two steps. First, we learn abstract representations purely from passive observation. Second, we
use these representations to guide robotic imitations of human behaviors and learn to perform new
tasks. We use the term imitation rather than demonstrations because our models also learn from
passive observation of non-demonstration behaviors. A robot needs to have a general understanding
about everything it sees in order to better recognize an active demonstration. We purposely insist
on only using self-supervision to keep the approach scalable in the real world. In this work, we
explore a few ways to use time as a signal for unsupervised representation learning. We also
explore different approaches to self-supervised robotic control below.
5.3.1 Training Time-Contrastive Networks
We illustrate our time-contrastive (TC) approach in Fig. 5.1. The method uses multi-view metric
learning via a triplet loss (Schroff et al., 2015). The embedding of an image $x$ is represented by
$f(x) \in \mathbb{R}^d$. The loss ensures that a pair of co-occurring frames $x_i^a$ (anchor) and $x_i^p$ (positive) are
closer to each other in embedding space than any image $x_i^n$ (negative). Thus, we aim to learn an
embedding $f$ such that
$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad \forall \left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in \mathcal{T},$$
where $\alpha$ is a margin that is enforced between positive and negative pairs, and $\mathcal{T}$ is the set of all
possible triplets in the training set. The core idea is that two frames (anchor and positive) coming
from the same time but different viewpoints (or modalities) are pulled together, while a visually
similar frame from a temporal neighbor is pushed apart. This signal serves two purposes: learn
disentangled representations without labels and simultaneously learn viewpoint invariance for
imitation. The cross-view correspondence encourages learning invariance to viewpoint, scale,
occlusion, motion blur, lighting and background, since the positive and anchor frames show
the same subject with variations along these factors. For example, Fig. 5.1 exhibits all these
transformations between the top and bottom sequences, except for occlusion. In addition to
learning a rich set of visual invariances, we are also interested in viewpoint invariance for
3rd-person to 1st-person correspondence for imitation. How does the time-contrastive signal lead to
disentangled representations? It does so by introducing competition between temporal neighbors
to explain away visual changes over time. For example, in Fig. 5.1, since neighbors are visually
similar, the only way to tell them apart is to model the amount of liquid present in the cup, or to
model the pose of hands or objects and their interactions. Another way to understand the strong
training signal that TCNs provide is to recognize the two constraints being simultaneously imposed
on the model: along the view axis in Fig. 5.1 the model learns to explain what is common between
images that look different, while along the temporal axis it learns to explain what is different
between similar-looking images. Note that while the natural ability for imitation of this approach
is important, its capability for learning rich representations without supervision is an even more
[Figure 5.2 diagram: a single view (View 1) over time, with anchor, positive, and negative frames passed through the TC embedding (deep network) into a triplet loss; a positive range surrounds the anchor, negative ranges lie outside the margin range, and margin range = m x positive range.]
Figure 5.2: Single-view TCN: positives are selected within a small window around anchors,
while negatives are selected from distant timesteps in the same sequence.
significant contribution. The key ingredient in our approach is that multiple views ground and
disambiguate the possible explanations for changes in the physical world. We show in Sec. 5.4
that the TCN can indeed discover correspondences between different objects or bodies, as well
as attributes such as liquid levels in cups and pouring stages, all without supervision. This is
a somewhat surprising finding as no explicit correspondence between objects or bodies is ever
provided. We hypothesize that manifolds of different but functionally similar objects naturally
align in the embedding space, because they share some functionality and appearance.
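As a minimal illustration of the triplet constraint above, the following Python sketch evaluates a common hinge form of the triplet loss, which is zero exactly when the negative is farther from the anchor than the positive by at least the margin. The margin of 0.2 matches the gap value reported in Sec. 5.4.1; the function names and the toy 2-D embeddings are hypothetical and are not taken from the original implementation.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge form of the triplet constraint.

    Returns 0 once the negative is farther from the anchor than the
    positive by at least `margin`; otherwise returns the violation.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings, made up for illustration only.
anchor = [0.0, 0.0]          # view 1 at time t
positive = [0.1, 0.0]        # view 2 at the same time t
near_negative = [0.2, 0.0]   # temporal neighbor that is still too close
far_negative = [1.0, 0.0]    # temporal neighbor already pushed away

loss_violated = triplet_loss(anchor, positive, near_negative)
loss_satisfied = triplet_loss(anchor, positive, far_negative)
```

In a multi-view setting the anchor and positive would come from two cameras at the same timestep, and the negative from a nearby timestep of the same sequence.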
Multi-view data collection is simple and can be captured with just two operators equipped with
smartphones. One operator keeps a fixed point of view of the region of interest while performing
the task, while the other moves the camera freely to introduce the variations discussed above.
While more cumbersome than single-view capture, we argue that multi-view capture is cheap,
simple, and practical, when compared to alternatives such as human labeling.
We can also consider time-contrastive models trained on single-view video as shown in Fig. 5.2.
In this case, the positive frame is randomly selected within a certain range of the anchor. A margin
range is then computed given the positive range. Negatives are randomly chosen outside of the
margin range and the model is trained as before. We empirically chose the margin range to be 2
times the positive range, which is itself set to 0.2 s. While we show in Sec. 5.4 that multi-view TCN
performs best, the single-view version can still be useful when no multi-view data is available.
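The single-view sampling scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame rate (30 fps), the conversion of the 0.2 s positive range into a whole number of frames, and the function name `sample_single_view_triplet` are hypothetical; only the 0.2 s positive range and the margin multiplier of 2 come from the text.

```python
import random


def sample_single_view_triplet(num_frames, fps, rng,
                               positive_range_s=0.2, margin_multiplier=2):
    """Return (anchor, positive, negative) frame indices from one video."""
    # Positive range in frames (at least one frame on each side).
    pos_range = max(1, round(positive_range_s * fps))
    # Negatives must lie outside the margin range around the anchor.
    margin_range = margin_multiplier * pos_range

    anchor = rng.randrange(num_frames)
    positives = [i for i in range(num_frames)
                 if 0 < abs(i - anchor) <= pos_range]
    negatives = [i for i in range(num_frames)
                 if abs(i - anchor) > margin_range]
    if not positives or not negatives:
        raise ValueError("video too short for the requested ranges")
    return anchor, rng.choice(positives), rng.choice(negatives)


# A 5-second clip at an assumed 30 fps has 150 frames.
rng = random.Random(0)
anchor, positive, negative = sample_single_view_triplet(150, 30, rng)
```

With these settings the positive lies within 6 frames of the anchor and the negative more than 12 frames away, always within the same sequence.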
5.3.2 Learning Robotic Behaviors with Reinforcement Learning
In this work, we consider an imitation learning scenario where the demonstrations come from
a 3rd-person video observation of an agent with an embodiment that differs from the learning
agent, e.g. robotic imitation of a human. Due to differences in the contexts, direct tracking of the
demonstrated pixel values does not provide a sensible way of learning the imitation behavior. As
described in the previous section, the TCN embedding provides a way to extract image features that
are invariant to the camera angle and the manipulated objects, and can explain physical interactions
in the world. We use this insight to construct a reward function that is based on the distance
between the TCN embedding of a human video demonstration and camera images recorded with a
robot camera. As shown in Sec. 5.4.2, by optimizing this reward function through trial and error
we are able to mimic demonstrated behaviors with a robot, utilizing only its visual input and the
video demonstrations for learning. Although we use multiple multi-view videos to train the TCN,
the video demonstration consists only of a single video of a human performing the task from a
random viewpoint.
Let $V = (v_1, \ldots, v_T)$ be the TCN embeddings of each frame in a video demonstration sequence.
For each camera image observed during a robot task execution, we compute TCN embeddings
$W = (w_1, \ldots, w_T)$. We define a reward function $R(v_t, w_t)$ based on the squared Euclidean
distance and a Huber-style loss:
$$R(v_t, w_t) = -\alpha \|w_t - v_t\|_2^2 - \beta \sqrt{\gamma + \|w_t - v_t\|_2^2},$$
where $\alpha$ and $\beta$ are weighting parameters (empirically chosen), and $\gamma$ is a small constant. The
squared Euclidean distance (weighted by $\alpha$) gives us stronger gradients when the embeddings are
further apart, which leads to larger policy updates at the beginning of learning. The Huber-style
loss (weighted by $\beta$) starts prevailing when the embedding vectors are getting very close, ensuring
high precision of the task execution and fine-tuning of the motion towards the end of the training.
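The per-timestep reward can be written as a short Python sketch. It assumes the Huber-style term is the square root of a small constant plus the squared distance, and it calls the two weights `alpha` and `beta` and the small constant `gamma`; the default values below are placeholders chosen for illustration, not the empirically chosen values used in the experiments.

```python
import math


def tcn_reward(v_t, w_t, alpha=1.0, beta=1.0, gamma=1e-6):
    """Reward for matching robot embedding w_t to demonstration embedding v_t.

    alpha weights the squared Euclidean term (dominant far from the target);
    beta weights the Huber-style square-root term (dominant close to it);
    gamma keeps the square root smooth at zero distance.
    """
    sq_dist = sum((w - v) ** 2 for v, w in zip(v_t, w_t))
    return -alpha * sq_dist - beta * math.sqrt(gamma + sq_dist)


# Toy 32-dimensional embeddings (values are made up).
demo = [0.0] * 32
far = [0.5] * 32    # squared distance 8.0 from the demonstration frame
near = [0.01] * 32  # squared distance 0.0032

reward_far = tcn_reward(demo, far)
reward_near = tcn_reward(demo, near)
reward_exact = tcn_reward(demo, demo)
```

The reward is always negative and approaches `-beta * sqrt(gamma)` as the robot's embedding approaches the demonstration's embedding at the same timestep.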
In order to learn robotic imitation policies, we optimize the reward function described above
using reinforcement learning. In particular, for optimizing robot trajectories, we employ the
PILQR algorithm (Chebotar, Hausman, et al., 2017) described in Section 3. This algorithm
combines approximate model-based updates via LQR with fitted time-varying linear dynamics, and
model-free corrections. We notice that in our tasks, the TCN embedding provides a well-behaved
low-dimensional (32-dimensional in our experiments) representation of the state of the visual world
in front of the robot. By including the TCN features in the system state (i.e. state = joint angles +
joint velocities + TCN features), we can leverage the linear approximation of the dynamics during
the model-based LQR update and significantly speed up the training.
5.3.3 Direct Human Pose Imitation
In the previous section, we discussed how reinforcement learning can be used with TCNs to enable
learning of object interaction skills directly from video demonstrations of humans. In this section,
we describe another approach for using TCNs: direct imitation of human pose. While object
interaction skills primarily require matching the functional aspects of the demonstration, direct
pose imitation requires learning an implicit mapping between human and robot poses, and therefore
involves a much more fine-grained association between frames. Once learned, a human-robot
mapping could be used to speed up the exploration phase of RL by initializing a policy close to the
solution.
[Figure 5.3 diagram: three training signals. Time-contrastive: triplet losses on human and agent images. Self-regression: regression from agent images to the agent's internal joints (cheap collection, perfect correspondence). Human supervision: a human imitates the agent, with regression from human images to the agent's internal joints (expensive collection, noisy correspondence).]
Figure 5.3: Training signals for pose imitation: time-contrastive, self-regression and human
supervision. The time-contrastive signal lets the model learn rich representations of humans or
robots individually. Self-regression allows the robot to predict its own joints given an image of
itself. The human supervision signal is collected from humans attempting to imitate robot poses.
We learn a direct pose imitation through self-regression. It is illustrated in Fig. 5.3 and Fig. 5.7
in the context of self-supervised human pose imitation. The idea is to directly predict the internal
state of the robot given an image of itself. Akin to looking at itself in the mirror, the robot can
regress its prediction of its own image to its internal states. We first train a shared TCN embedding
by observing humans and robots performing random motions. Then the robot trains itself with
self-regression. Because it uses a TCN embedding that is invariant between humans and robots, the
robot can then naturally imitate humans after training on itself. Hence we obtain a system that can
perform end-to-end imitation of human motion, even though it was never given any human pose
labels nor human-to-robot correspondences. We demonstrate a way to collect human supervision
for end-to-end imitation in Fig. 5.3. However, contrary to time-contrastive and self-regression
signals, the human supervision is very noisy and expensive to collect. We use it to benchmark our
approach in Sec. 5.4.3 and show that large quantities of cheap supervision can effectively be mixed
with small amounts of expensive supervision.
5.4 Experiments
Our experiments aim to study three questions. First, we examine whether the TCN can learn visual
representations that are more indicative of object interaction attributes, such as the stages in a
pouring task. This allows us to comparatively evaluate the TCN against other self-supervised
representations. Second, we study how the TCN can be used in conjunction with reinforcement
learning to acquire complex object manipulation skills in simulation and on a real-world robotic
platform. Lastly, we demonstrate that the TCN can enable a robot to perform continuous, real-time
imitation of human poses without explicitly specifying any joint-level correspondences
between robots and humans. Together, these experiments illustrate the applicability of the TCN
representation for modeling poses, object interactions, and the implicit correspondences between
robot imitators and human demonstrators.
5.4.1 Discovering Attributes from General Representations
Liquid Pouring
In this experiment, we study what the TCN captures simply by observing a human subject pouring
liquids from different containers into different cups. The videos were captured using two standard
smartphones, one from a subjective point of view by the human performing the pouring, and the
other from a freely moving third-person viewpoint. Capture is synchronized across the two phones
using an off-the-shelf app and each sequence is approximately 5 seconds long. We divide the
collected multi-view sequences into 3 sets: 133 sequences for training (about 11 minutes total), 17
for validation and 30 for testing. The training videos contain clear and opaque cups, but we restrict
the testing videos to clear cups only in order to evaluate if the model has an understanding of how
full the cups are.
Models
In all subsequent experiments, we use a custom architecture derived from the Inception architecture
(Szegedy et al., 2015) that is similar to (Finn et al., 2015b). It consists of the Inception
model up until the layer "Mixed_5d" (initialized with ImageNet pretrained weights), followed
by 2 convolutional layers, a spatial softmax layer (Finn et al., 2015b) and a fully-connected layer.
The embedding is a fully connected layer with 32 units added on top of our custom model. This
embedding is trained either with the multi-view TC loss, the single-view TC loss, or the shuffle &
learn loss (Misra et al., 2016). For the TCN models, we use the triplet loss from (Schroff et al.,
2015) without modification and with a gap value of 0.2. Note that, in all experiments, negatives
always come from the same sequence as positives. We also experiment with other metric learning
losses, namely n-pairs (Sohn, 2016) and lifted structured (Song et al., 2015), and show that results
are comparable. We use the output of the last layer before the classifier of an ImageNet-pretrained
Inception model (Deng et al., 2009; Szegedy et al., 2015) (a 2048-dimensional vector) as a baseline
in the following experiments, and call it "Inception-ImageNet". Since the custom model
is initialized from ImageNet pretraining, it is a natural point of comparison which allows us
to control for any invariances that are introduced through ImageNet training rather than other
approaches. We compare TCN models to a shuffle & learn baseline trained on our data, using the
same hyperparameters taken from the paper (t_max of 60, t_min of 15, and negative class ratio of
0.75). Note that in our implementation, neither the shuffle & learn baseline nor the TCN benefits from
biased sampling of high-motion frames. To investigate the differences between multi-view and
Method                                 Alignment error   Classif. error   Training iteration
Random                                 28.1%             54.2%            -
Inception-ImageNet                     29.8%             51.9%            -
shuffle & learn (Misra et al., 2016)   22.8%             27.0%            575k
single-view TCN (triplet)              25.8%             24.3%            266k
multi-view TCN (n-pairs)               18.1%             22.2%            938k
multi-view TCN (triplet)               18.8%             21.4%            397k
multi-view TCN (lifted)                18.0%             19.6%            119k
Table 5.1: Pouring alignment and classification errors: all models are selected at their lowest
validation loss. The classification error considers 5 classes related to pouring detailed in Table 5.3.
single-view, we compare to a single-view TCN, with a positive range of 0.2 seconds and a negative
multiplier of 2.
Model selection
The question of model selection arises in unsupervised training. Should you select the best model
based on the validation loss? Or hand label a small validation set for a given task? We report numbers
for both approaches. In Table 5.1 we select each model based on its lowest validation loss,
while in Table 5.2 we select based on a classification score from a small validation set labeled with
the 5 attributes described earlier. As expected, models selected by validation classification score
perform better on the classification task. However, models selected by loss perform only slightly
worse, except for shuffle & learn, which suffers a bigger loss of accuracy. We conclude that it is
reasonable for TCN models to be selected based on validation loss, not using any labels.
Training time
We observe in Table 5.2 that the multi-view TCN (using triplet loss) outperforms single-view
models while requiring 15x less training time and while being trained on the exact same dataset.
We conclude that taking advantage of temporal correspondences greatly improves training time
and accuracy.
Quantitative Evaluation
We present two metrics in Table 5.1 to evaluate what the models are able to capture. The alignment
metric measures how well a model can semantically align two videos. The classification metric
measures how well a model can disentangle pouring-related attributes that can be useful in a real
robotic pouring task. All results in this section are evaluated using nearest neighbors in embedding
space. Given each frame of a video, each model has to pick the most semantically similar frame in
another video. The "Random" baseline simply returns a random frame from the second video.
The sequence alignment metric is particularly relevant and important when learning to imitate,
especially from a third-party perspective. For each pouring test video, a human operator labels
Method                                 Alignment error   Classif. error   Training iteration
Random                                 28.1%             54.2%            -
Inception-ImageNet                     29.8%             51.9%            -
single-view TCN (triplet)              26.6%             23.6%            738k
shuffle & learn (Misra et al., 2016)   20.9%             23.2%            743k
multi-view TCN (lifted)                19.1%             21.4%            927k
multi-view TCN (triplet)               17.5%             20.3%            47k
multi-view TCN (n-pairs)               17.5%             19.3%            224k
Table 5.2: Pouring alignment and classification errors: these models are selected using the
classification score on a small labeled validation set, then run on the full test set. We observe that
multi-view TCN outperforms other models with 15x shorter training time. The classification error
considers 5 classes related to pouring: "hand contact with recipient", "within pouring distance",
"container angle", "liquid is flowing" and "recipient fullness".
the key frames corresponding to the following events: the first frame with hand contact with the
pouring container, the first frame where liquid is flowing, the last frame where liquid is flowing, and
the last frame with hand contact with the container. These keyframes establish a coarse semantic
alignment which should provide a relatively accurate piecewise-linear correspondence between
all videos. For any pair of videos $(v_1, v_2)$ in the test set, we embed each frame given the model
to evaluate. For each frame of the source video $v_1$, we associate it with its nearest neighbor in
embedding space taken from all frames of $v_2$. We evaluate how well the nearest neighbor in $v_2$
semantically aligns with the reference frame in $v_1$. Thanks to the labeled alignments, we find
the proportional position of the reference frame with the target video $v_2$, and compute the frame
distance to that position, normalized by the target segment length.
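One possible reading of this alignment metric is sketched below in Python. Several details are assumptions made for illustration and are not specified in the text: keyframe lists are taken to be strictly increasing and to include the first and last frame of each video (so that every frame falls inside a segment), per-frame errors are averaged over the source video, and all function names are hypothetical.

```python
import bisect


def proportional_position(frame, src_keys, tgt_keys):
    """Map a source frame index into the target video.

    Uses linear interpolation inside the keyframe segment containing
    `frame`; returns the expected target position and that target
    segment's length.
    """
    k = bisect.bisect_right(src_keys, frame) - 1
    k = min(max(k, 0), len(src_keys) - 2)  # clamp to a valid segment
    frac = (frame - src_keys[k]) / (src_keys[k + 1] - src_keys[k])
    seg_len = tgt_keys[k + 1] - tgt_keys[k]
    return tgt_keys[k] + frac * seg_len, seg_len


def nearest_neighbor(query, candidates):
    """Index of the candidate embedding closest to `query` (squared L2)."""
    return min(range(len(candidates)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(query, candidates[j])))


def alignment_error(src_emb, tgt_emb, src_keys, tgt_keys):
    """Mean normalized frame distance between retrieved and expected frames."""
    errors = []
    for i, emb in enumerate(src_emb):
        j = nearest_neighbor(emb, tgt_emb)
        expected, seg_len = proportional_position(i, src_keys, tgt_keys)
        errors.append(abs(j - expected) / seg_len)
    return sum(errors) / len(errors)


# Toy 1-D embeddings that encode task progress (made up for illustration).
src = [[i / 4] for i in range(5)]   # 5-frame source video
tgt = [[j / 8] for j in range(9)]   # 9-frame target video
perfect = alignment_error(src, tgt, [0, 4], [0, 8])
# A degenerate embedding that maps every target frame to the same point.
collapsed = alignment_error(src, [[0.0]] * 9, [0, 4], [0, 8])
```

An embedding that tracks task progress retrieves the proportionally matching frame and scores zero error, while a collapsed embedding always retrieves the first target frame and is penalized in proportion to how far each source frame has progressed.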
We label the following attributes in the test and validation sets to evaluate the classification
task as reported in Table 5.3: is the hand in contact with the container? (yes or no); is the container
within pouring distance of the recipient? (yes or no); what is the tilt angle of the pouring container?
(values 90, 45, 0 and -45 degrees); is the liquid flowing? (yes or no); does the recipient contain
liquid? (yes or no). These particular attributes are evaluated because they matter for imitating and
performing a pouring task. Classification results are normalized by class distribution. Note that
while this could be compared to a supervised classifier, as mentioned in the introduction, it is not
realistic to expect labels for every possible task in a real application, e.g. in robotics. Instead, in
this work we aim to compare to realistic general off-the-shelf models that one might use without
requiring new labels.
In Table 5.1, we find that the multi-view TCN model outperforms all baselines. We observe
that single-view TCN and shuffle & learn are on par for the classification metric but not for the
alignment metric. We find that general off-the-shelf Inception features significantly underperform
compared to other baselines. Qualitative examples and t-SNE visualizations of the embedding
are available in Appendix A.2.1. We encourage readers to refer to supplementary videos to better
grasp these results.
Method                      Hand contact     Within pouring   Container   Liquid is   Recipient
                            with container   distance         angle       flowing     has liquid
Random                      49.9%            48.9%            74.5%       49.2%       48.4%
Inception-ImageNet          47.4%            45.2%            71.8%       48.8%       49.2%
shuffle & learn             17.2%            17.8%            46.3%       25.7%       28.0%
single-view TCN (triplet)   12.6%            14.4%            41.2%       21.6%       31.9%
multi-view TCN (n-pairs)    8.0%             9.0%             35.9%       24.7%       35.5%
multi-view TCN (triplet)    7.8%             10.0%            34.8%       22.7%       31.5%
multi-view TCN (lifted)     7.8%             9.0%             35.4%       17.9%       27.7%
Table 5.3: Detailed attribute classification errors, for models selected by validation loss.
5.4.2 Learning Object Interaction Skills
In this section, we use the TCN-based reward function described in Sec. 5.3.2 to learn robotic
imitation behaviors from third-person demonstrations through reinforcement learning. We evaluate
our approach on two tasks, plate transfer in a simulated dish rack environment (Fig. 5.4, using
the Bullet physics engine (Coumans & Bai, 2016–2017)) and real robot pouring from human
demonstrations (Fig. 5.5).
Task Setup
The simulated dish rack environment consists of two dish racks placed on a table and filled with
plates. The goal of the task is to move plates from one dish rack to another without dropping
them. This requires a complex motion with multiple stages, such as reaching, grasping, picking
up, carrying, and placing of the plate. We record the human demonstrations using a virtual reality
(VR) system to manipulate a free-floating gripper and move the plates (Fig. 5.4 left). We record the
videos of the VR demonstrations by placing first-view and third-person cameras in the simulated
world. In addition to demonstrations, we also record a range of randomized motions to increase the
generalization ability of our TCN model. After recording the demonstrations, we place a simulated
7-DoF KUKA robotic arm inside the dish rack environment (Fig. 5.4 right) and attach a first-view
camera to it. The robot camera images (Fig. 5.4 middle) are then used to compute the TCN reward
function. The robot policy is initialized with random Gaussian noise.
For the real robot pouring task, we first collect the multiview data from multiple cameras
to train the TCN model. The training set includes videos of humans pouring liquids, recorded
on smartphone cameras, and videos of a robot pouring granular beads, recorded on two robot
cameras. We not only collect positive demonstrations of the task at hand, we
also collect various interactions that do not actually involve pouring, such as moving cups around,
tipping them over, spilling beads, etc., to cover the range of possible events the robot might need
to understand. The pouring experiment analyzes how TCNs can implicitly build correspondences
between human and robot manipulation of objects. The dataset that we used to train the TCN
consisted of ∼20 minutes of humans performing pouring tasks, as well as ∼20 additional minutes
of humans manipulating cups and bottles in ways other than pouring, such as moving the cups,
tipping them over, etc. In order for the TCN to be able to represent both human and robot arms,
and implicitly put them into correspondence, it must also be provided with data that allows it to
understand the appearance of robot arms. To that end, we added data consisting of ∼20 minutes
of robot arms manipulating cups in pouring-like settings. Note that this data does not necessarily
need to itself illustrate successful pouring tasks: the final demonstration that is tracked during
reinforcement learning consists of a human successfully pouring a cup of fluid, while the robot
performs the pouring task with orange beads. However, we found that providing some clips
featuring robot arms was important for the TCN to acquire a representation that could correctly
register the similarities between human and robot pouring. Using additional robot data is justified
here because it would not be realistic to expect the robot to do well while having never seen its own
arm. Over time, however, the more tasks are learned, the less needed this should become. While
the TCN is trained with approximately 1 hour of pouring-related multiview sequences, the robot
policy is only learned from a single liquid pouring video provided by a human (Fig. 5.5 left). With
this video, we train a 7-DoF KUKA robot to perform the pouring of granular beads as depicted in
Fig. 5.5 (right). We compute TCN embeddings from the robot camera images (Fig. 5.5 middle)
and initialize the robot policy using random Gaussian noise. We set the initial exploration higher
on the wrist joint as it contributes the most to the pouring motion (for all compared algorithms).
Figure 5.4: Simulated dish rack task. Left: Third-person VR demonstration of the dish rack task.
Middle: View from the robot camera during training. Right: Robot executing the dish rack task.
Quantitative Evaluation
Fig. 5.6 shows the pouring task performance of using TCN models for reward computation
compared to the same baselines evaluated in the previous section. After each rollout, we measure
the weight of the beads in the receiving container. We perform runs of 10 rollouts per iteration.
Results in Fig. 5.6 are averaged over 4 runs per model (2 runs for 2 fixed random seeds). Already
after the first several iterations of using the multiview TCN model (mvTCN), the robot is able to
successfully pour a significant amount of the beads. At the end of the training, the policy converges
to a consistently successful pouring behavior. In contrast, the robot fails to accomplish the task with
other models. Interestingly, we observe a low performance for single-view models (single-view
TCN and shuffle & learn) despite being trained on the exact same multiview data as mvTCN. This
suggests that taking advantage of multiview correspondences is necessary in this task for correctly
modeling object interaction from a third-person perspective. The results show that mvTCN does
provide the robot with suitable guidance to understand the pouring task. In fact, since the PILQR
method uses both model-based and model-free updates, the experiment shows that mvTCN not
only provides good indicators when the pouring is successful, but also useful gradients when it isn't,
while the other tested representations are insufficient to learn this task. This experiment illustrates
how self-supervised representation learning and continuous rewards from visual demonstrations
can alleviate the sample efficiency problem of reinforcement learning.
Figure 5.5: Real robot pouring task. Left: Third-person human demonstration of the pouring
task. Middle: View from the robot camera during training. Right: Robot executing the pouring
task.
Qualitative Evaluation
As shown in our supplementary video, both dish rack and pouring policies converge to robust
imitated behaviors. In the dish rack task, the robot is able to gradually learn all of the task
components, including the arm motion and the opening and closing of the gripper. It first learns to
reach for the plate, then grasp and pick it up and finally carry it over to another dish rack and place
it there. The learning of this task requires only 10 iterations, with 10 rollouts per iteration. This
shows that the TCN reward function is dense and smooth enough to efficiently guide the robot to a
complex imitation policy.
In the pouring task, the robot starts with Gaussian exploration by randomly rotating and moving
the cup filled with beads. The robot first learns to move and rotate the cup towards the receiving
container, missing the target cup and spilling a large amount of the beads in the early iterations. After
several more iterations, the robot learns to be more precise and eventually it is able to consistently
pour most of the beads in the last iteration. This demonstrates that our method can efficiently learn
tasks with nonlinear dynamic object transitions, such as movement of the granular media and
liquids, an otherwise difficult task to perform using conventional state estimation techniques.
[Figure 5.6 plots, two panels. x-axis: PILQR Iterations; y-axis: Transferred bead weight. Curves: Multiview TCN, Single-view TCN, Shuffle and Learn, ImageNet Inception, Maximum Weight.]
Figure 5.6: Learning progress of the pouring task for two different demonstration videos (left
and right). For each training, only a single third-person human demonstration is used, as shown
in Fig. 5.5. This graph reports the weight in grams measured from the target recipient after each
pouring action (maximum weight is 189 g) along with the standard deviation of all 10 rollouts per
iteration.
5.4.3 Self-Regression for Human Pose Imitation
In the previous section, we showed that we can use the TCN to construct a reward function for
learning object manipulation with reinforcement learning. In this section, we also study how the
TCN can be used to directly map from humans to robots in real time, as depicted in Fig. 5.7:
in addition to understanding object interaction, we can use the TCN to build a pose-sensitive
embedding either unsupervised, or with minimal supervision. The multiview TCN is particularly
well suited for this task because, in addition to requiring viewpoint and robot/human invariance, the
correspondence problem is ill-defined and difficult to supervise. Apart from adding a joints decoder
on top of the TCN embedding and training it with a self-regression signal, there is no fundamental
difference in the method. Throughout this section, we use the robot joint vectors corresponding to
the human-to-robot imitation described in Fig. 5.3 as ground truth. Human images are fed into
the imitation system, and the resulting joint vectors are compared against the ground-truth joint
vectors.
By comparing different combinations of supervision signals, we show in Table 5.4 that training
with all signals performs best. We observe that adding the time-contrastive signal always
significantly improves performance. In general, we conclude that relatively large amounts of cheap
weakly-supervised data and small amounts of expensive human-supervised data are an effective
balance for our problem. Interestingly, we find that the self-supervised model (TC+self)
outperforms the human-supervised one. It should however be noted that the quantitative evaluation is not
as informative here: since the task is highly subjective and different human subjects imitate the
robot differently, matching the joint angles on held-out data is exceedingly difficult. We invite the
reader to watch the accompanying video for examples of imitation, and observe that there is a close
connection between the human and robot motion, including for subtle elements of the pose such as
crouching: when the human crouches down, the robot lowers the torso via the prismatic joint in
the spine. In the video, we observe that a complex human-robot mapping is discovered entirely without
human supervision. This invites reflection on the need for intermediate human pose detectors when
correspondence is ill-defined, as in this case. In Fig. 5.8, we visualize the TCN embedding for pose
imitation and show that pose across humans and robots is consistent within clusters, while being
invariant to viewpoint and backgrounds. More analysis is available in Appendix A.2.2.
Figure 5.7: TCN for self-supervised human pose imitation: architecture, training and imitation.
The embedding is trained unsupervised with the time-contrastive loss, while the joints decoder can
be trained with self-supervision, human supervision or both. Output joints can be used directly by
the robot planner to perform the imitation. Human pose is never explicitly represented.
5.5 Conclusions
In this work, we introduced a self-supervised representation learning method (TCN) based on
multiview video. The representation is learned by anchoring a temporally contrastive signal
against co-occurring frames from other viewpoints, resulting in a representation that disambiguates
temporal changes (e.g., salient events) while providing invariance to viewpoint and other nuisance
variables. We show that this representation can be used to provide a reward function within a
reinforcement learning system for robotic object manipulation, and to provide mappings between
human and robot poses to enable pose imitation directly from raw video. In both cases, the TCN
enables robotic imitation from raw videos of humans performing various tasks, accounting for the
domain shift between human and robot bodies. Although the training process requires a dataset of
multi-viewpoint videos, once the TCN is trained, only a single raw video demonstration is used for
imitation.
Supervision               | Robot joints distance error %
--------------------------+------------------------------
Random (possible) joints  | 42.4 ± 0.1
Self                      | 38.8 ± 0.1
Human                     | 33.4 ± 0.4
Human + Self              | 33.0 ± 0.5
TC + Self                 | 32.1 ± 0.3
TC + Human                | 29.7 ± 0.1
TC + Human + Self         | 29.5 ± 0.2
Table 5.4: Imitation error for different combinations of supervision signals. The error reported
is the joints distance between prediction and ground truth. Note that perfect imitation is not possible.
Figure 5.8: t-SNE embedding colored by agent for model "TC+Self". We show that images
are locally coherent with respect to pose while being invariant to agent or viewpoint.
Part III
Imitation and Transfer Learning
Chapter 6
Multi-Modal Imitation Learning from Unstructured
Demonstrations using Generative Adversarial Nets
Imitation learning has traditionally been applied to learn a single task from demonstrations thereof.
The requirement of structured and isolated demonstrations limits the scalability of imitation learning
approaches as they are difficult to apply to real-world scenarios, where robots have to be able to
execute a multitude of tasks. In this paper, we propose a multimodal imitation learning framework
that is able to segment and imitate skills from unlabelled and unstructured demonstrations by
learning skill segmentation and imitation learning jointly. The extensive simulation results indicate
that our method can efficiently separate the demonstrations into individual skills and learn to
imitate them using a single multimodal policy. The video of our experiments is available at
http://sites.google.com/view/nips17intentiongan.
6.1 Introduction
One of the key factors to enable deployment of robots in unstructured real-world environments is
their ability to learn from data. In recent years, there have been multiple examples of robot learning
frameworks that present promising results. These include reinforcement learning (R. S. Sutton
& Barto, 1998), where a robot learns a skill based on its interaction with the environment, and
imitation learning (Argall et al., 2009; Billard et al., 2008), where a robot is presented with a
demonstration of a skill that it should imitate. In this work, we focus on the latter learning setup.
Traditionally, imitation learning has focused on using isolated demonstrations of a particular
skill (Schaal, 1999). The demonstration is usually provided in the form of kinesthetic teaching,
which requires the user to spend sufficient time to provide the right training data. This constrained
setup for imitation learning is difficult to scale to real world scenarios, where robots have to be
able to execute a combination of different skills. To learn these skills, the robots would require a
large number of robot-tailored demonstrations, since at least one isolated demonstration has to be
provided for every individual skill.
In order to improve the scalability of imitation learning, we propose a framework that can learn
to imitate skills from a set of unstructured and unlabeled demonstrations of various tasks.
As a motivating example, consider a highly unstructured data source, e.g. a video of a person
cooking a meal. A complex activity, such as cooking, involves a set of simpler skills such as
grasping, reaching, cutting, pouring, etc. In order to learn from such data, three components are
required: i) the ability to map the image stream to state-action pairs that can be executed by a
robot, ii) the ability to segment the data into simple skills, and iii) the ability to imitate each of
the segmented skills. In this work, we tackle the latter two components, leaving the first one for
future work. We believe that the capability proposed here of learning from unstructured, unlabeled
demonstrations is an important step towards scalable robot learning systems.
In this work, we present a novel imitation learning method that learns a multimodal stochastic
policy, which is able to imitate a number of automatically segmented tasks using a set of
unstructured and unlabeled demonstrations. Our results indicate that the presented technique
can separate the demonstrations into sensible individual skills and imitate these skills using a
learned multimodal policy. We show applications of the presented method to the tasks of skill
segmentation, hierarchical reinforcement learning and multimodal policy learning.
6.2 Related Work
Imitation learning is concerned with learning skills from demonstrations. Approaches that are
suitable for this setting can be split into two categories: i) behavioral cloning (Pomerleau, 1991),
and ii) inverse reinforcement learning (IRL) (Ng & Russell, 2000). While behavioral cloning aims
at replicating the demonstrations exactly, it suffers from covariate shift (Ross & Bagnell,
2010). IRL alleviates this problem by learning a reward function that explains the behavior shown
in the demonstrations. The majority of IRL works (Ho & Ermon, 2016; Ziebart et al., 2008; Abbeel
& Ng, 2004; Finn, Levine, & Abbeel, 2016; Levine et al., 2011) introduce algorithms that can
imitate a single skill from demonstrations thereof but they do not readily generalize to learning a
multitask policy from a set of unstructured demonstrations of various tasks.
More recently, there has been work that tackles a problem similar to the one presented in
this work, where the authors consider a setting where there is a large set of tasks with many
instantiations (Duan et al., 2017). In their work, the authors assume a way of communicating a
new task through a single demonstration. We follow the idea of segmenting and learning different
skills jointly so that learning of one skill can accelerate learning to imitate the next skill. In our
case, however, the goal is to separate the mix of expert demonstrations into single skills and learn
a policy that can imitate all of them, which eliminates the need of new demonstrations at test time.
The method presented here belongs to the field of multitask inverse reinforcement learning.
Examples from this field include (Dimitrakakis & Rothkopf, 2011) and (Babes et al., 2011).
In (Dimitrakakis & Rothkopf, 2011), the authors present a Bayesian approach to the problem,
while the method in (Babes et al., 2011) is based on an EM approach that clusters observed demonstrations.
Both of these methods show promising results on relatively low-dimensional problems,
whereas our approach scales well to higher dimensional domains due to the representational power
of neural networks.
There has also been a separate line of work on learning from demonstration, which is then
iteratively improved through reinforcement learning (Kalakrishnan et al., 2011; Chebotar, Kalakrishnan,
et al., 2017; Mülling et al., 2013). In contrast, we do not assume access to the expert
reward function, which is required to perform reinforcement learning in the later stages of the
above algorithms.
There has been much work on the problem of skill segmentation and option discovery for
hierarchical tasks. Examples include (Niekum et al., 2013; Kroemer et al., 2015; Fox et al., 2017;
Vezhnevets et al., 2017; Florensa et al., 2017). In this work, we consider a possibility to discover
different skills that can all start from the same initial state, as opposed to hierarchical reinforcement
learning where the goal is to segment a task into a set of consecutive subtasks. We demonstrate,
however, that our method may be used to discover the hierarchical structure of a task similarly to
the hierarchical reinforcement learning approaches. In (Florensa et al., 2017), the authors explore
similar ideas to discover useful skills. In this work, we apply some of these ideas to the imitation
learning setup as opposed to the reinforcement learning scenario.
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have enjoyed success in
various domains including image generation (Denton et al., 2015), image-to-image translation (J.-Y.
Zhu et al., 2017; Kim et al., 2017) and video prediction (Mathieu et al., 2015). More recently,
there have been works connecting GANs and other reinforcement learning and IRL methods (Pfau
& Vinyals, 2016; Finn, Christiano, et al., 2016; Ho & Ermon, 2016). In this work, we expand
on some of the ideas presented in these works and provide a novel framework that exploits this
connection.
The works that are most closely related to this work are (Ho & Ermon, 2016), (Chen et al.,
2016) and (Y. Li et al., 2017). In (Chen et al., 2016), the authors show a method that is able to learn
disentangled representations and apply it to the problem of image generation. In this work, we
provide an alternative derivation of our method that extends their work and applies it to multimodal
policies. In (Ho & Ermon, 2016), the authors present an imitation learning GAN approach that
serves as a basis for the development of our method. We provide an extensive evaluation of the
hereby presented approach compared to the work in (Ho & Ermon, 2016), which shows that our
method, as opposed to (Ho & Ermon, 2016), can handle unstructured demonstrations of different
skills. A concurrent work (Y. Li et al., 2017) introduces a method similar to ours and applies it to
detecting driving styles from unlabelled human data.
6.3 Preliminaries
Let $M = (S, A, P, R, p_0, \gamma, T)$ be a finite-horizon Markov Decision Process (MDP), where $S$ and
$A$ are state and action spaces, $P : S \times A \times S \to \mathbb{R}_+$ is a state-transition probability function or
system dynamics, $R : S \times A \to \mathbb{R}$ a reward function, $p_0 : S \to \mathbb{R}_+$ an initial state distribution, $\gamma$ a
reward discount factor, and $T$ a horizon. Let $\tau = (s_0, a_0, \ldots, s_T, a_T)$ be a trajectory of states and
actions and $R(\tau) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t)$ the trajectory reward. The goal of reinforcement learning
methods is to find parameters $\theta$ of a policy $\pi_\theta(a|s)$ that maximizes the expected discounted reward
over trajectories induced by the policy: $\mathbb{E}_{\pi_\theta}[R(\tau)]$ where $s_0 \sim p_0$, $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$ and
$a_t \sim \pi_\theta(a_t|s_t)$.
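As a small illustration of the trajectory reward $R(\tau)$ defined above, the following sketch accumulates per-step rewards with the discount factor. The function name and the example rewards are hypothetical and chosen only for illustration; they are not taken from the experiments in this thesis.

```python
from typing import Sequence


def trajectory_reward(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_{t=0}^{T} gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    discount = 1.0  # gamma**0
    for r in rewards:
        total += discount * r
        discount *= gamma  # advance to gamma**(t+1)
    return total


# Hypothetical three-step trajectory with unit rewards and gamma = 0.5:
# 1 + 0.5 + 0.25 = 1.75
example_return = trajectory_reward([1.0, 1.0, 1.0], gamma=0.5)
```

The expected discounted reward would then be estimated by averaging this quantity over trajectories sampled from the policy.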
In an imitation learning scenario, the reward function is unknown. However, we are given
a set of demonstrated trajectories, which presumably originate from some optimal expert policy
distribution $\pi_{E_1}$ that optimizes an unknown reward function $R_{E_1}$. Thus, by trying to estimate the
reward function $R_{E_1}$ and optimizing the policy $\pi_\theta$ with respect to it, we can recover the expert
policy. This approach is known as inverse reinforcement learning (IRL) (Abbeel & Ng, 2004).
In order to model a variety of behaviors, it is beneficial to find a policy with the highest possible
entropy that optimizes $R_{E_1}$. We will refer to this approach as the maximum-entropy IRL (Ziebart
et al., 2008) with the optimization objective
$$\min_R \max_{\pi_\theta} H(\pi_\theta) + \mathbb{E}_{\pi_\theta} R(s,a) - \mathbb{E}_{\pi_{E_1}} R(s,a), \tag{6.1}$$
where $H(\pi_\theta)$ is the entropy of the policy $\pi_\theta$.
Ho and Ermon (Ho & Ermon, 2016) showed that it is possible to redefine the maximum-entropy
IRL problem with multiple demonstrations sampled from a single expert policy
$\pi_{E_1}$ as an optimization of GANs (Goodfellow et al., 2014). In this framework, the policy $\pi_\theta(a|s)$
plays the role of a generator, whose goal is to make it difficult for a discriminator network
$D_w(s,a)$ (parameterized by $w$) to differentiate between imitated samples from $\pi_\theta$ (labeled 0)
and demonstrated samples from $\pi_{E_1}$ (labeled 1). Accordingly, the joint optimization goal can be
defined as
$$\max_\theta \min_w \; \mathbb{E}_{(s,a)\sim\pi_\theta}[\log(D_w(s,a))] + \mathbb{E}_{(s,a)\sim\pi_{E_1}}[\log(1 - D_w(s,a))] + \lambda_H H(\pi_\theta). \tag{6.2}$$
The discriminator and the generator policy are both represented as neural networks and optimized
by repeatedly performing alternating gradient updates. The discriminator is trained on the mixed
set of expert and generator samples and outputs probabilities that a particular sample has originated
from the generator or the expert policies. This serves as a reward signal for the generator policy
that tries to maximize the probability of the discriminator confusing it with an expert policy. The
generator can be trained using the trust region policy optimization (TRPO) algorithm (Schulman
et al., 2015) with the cost function $\log(D_w(s,a))$. At each iteration, TRPO takes the following
gradient step:
$$\mathbb{E}_{(s,a)\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \log(D_w(s,a))] + \lambda_H \nabla_\theta H(\pi_\theta), \tag{6.3}$$
which corresponds to minimizing the objective in Eq. (6.2) with respect to the policy $\pi_\theta$.
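To make the roles of the two expectation terms in Eq. (6.2) more concrete, the following is a minimal NumPy sketch of a Monte Carlo estimate of the discriminator part of the objective and of the per-sample signal $\log(D_w(s,a))$ passed to the policy update. It follows the sign convention of Eq. (6.2) as written, under which $D_w$ is driven towards 1 on expert samples. The function names are hypothetical, the discriminator is represented only by its raw output logits, and the entropy term and the TRPO update itself are omitted.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def discriminator_objective(logits_policy: np.ndarray,
                            logits_expert: np.ndarray,
                            eps: float = 1e-8) -> float:
    """Sample estimate of E_pi[log D] + E_expert[log(1 - D)] from Eq. (6.2).

    The discriminator parameters w are updated to decrease this value.
    """
    d_policy = sigmoid(logits_policy)  # D_w(s, a) on policy samples
    d_expert = sigmoid(logits_expert)  # D_w(s, a) on expert samples
    return float(np.mean(np.log(d_policy + eps))
                 + np.mean(np.log(1.0 - d_expert + eps)))


def policy_signal(logits_policy: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-sample log D_w(s, a) for state-action pairs visited by the policy."""
    return np.log(sigmoid(logits_policy) + eps)
```

Under these assumptions, a discriminator that separates the two sample sets well attains a lower objective value than one that outputs 0.5 everywhere, and policy samples that look more expert-like receive a larger signal.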
6.4 Multi-Modal Imitation Learning
The traditional imitation learning scenario described in Sec. 6.3 considers a problem of learning
to imitate one skill from demonstrations. The demonstrations represent samples from a single
expert policy $\pi_{E_1}$. In this work, we focus on an imitation learning setup where we learn from
unstructured and unlabelled demonstrations of various tasks. In this case, the demonstrations come
from a set of expert policies $\pi_{E_1}, \pi_{E_2}, \ldots, \pi_{E_k}$, where $k$ can be unknown, that optimize different
reward functions/tasks. We will refer to this set of unstructured expert policies as a mixture of
policies $\pi_E$. We aim to segment the demonstrations of these policies into separate tasks and learn
a multimodal policy that will be able to imitate all of the segmented tasks.
In order to be able to learn multimodal policy distributions, we augment the policy input
with a latent intention $i$ distributed by a categorical or uniform distribution $p(i)$, similar to (Chen
et al., 2016). The goal of the intention variable is to select a specific mode of the policy, which
corresponds to one of the skills presented in the demonstrations. The resulting policy can be
expressed as:
$$\pi(a|s,i) = p(i|s,a)\,\frac{\pi(a|s)}{p(i)}. \tag{6.4}$$
We augment the trajectory to include the latent intention as $\tau_i = (s_0, a_0, i_0, \ldots, s_T, a_T, i_T)$. The
resulting reward of the trajectory with the latent intention is $R(\tau_i) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t, i_t)$. $R(a,s,i)$
is a reward function that depends on the latent intention $i$ as we have multiple demonstrations that
optimize different reward functions for different tasks. The expected discounted reward is equal to:
$\mathbb{E}_{\pi_\theta}[R(\tau_i)] = \int R(\tau_i)\,\pi_\theta(\tau_i)\,d\tau_i$ where $\pi_\theta(\tau_i) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t, i_t)\,p(i_t)$.
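Eq. (6.4) is Bayes' rule applied to the intention-conditioned policy. As a sanity check, the sketch below verifies it numerically for a single fixed state with discrete actions and intentions. The probability tables are hypothetical values chosen for illustration, and the marginal policy is computed as $\pi(a|s) = \sum_i \pi(a|s,i)\,p(i)$.

```python
import numpy as np

# Hypothetical example for one fixed state s: 2 intentions, 3 discrete actions.
p_i = np.array([0.5, 0.5])                   # prior p(i)
pi_a_given_i = np.array([[0.8, 0.1, 0.1],    # pi(a | s, i=0)
                         [0.1, 0.1, 0.8]])   # pi(a | s, i=1)

# Marginal policy pi(a|s) = sum_i pi(a|s,i) p(i)
pi_a = p_i @ pi_a_given_i

# Posterior over intentions p(i|s,a) via Bayes' rule
posterior = (pi_a_given_i * p_i[:, None]) / pi_a[None, :]

# Right-hand side of Eq. (6.4): p(i|s,a) * pi(a|s) / p(i)
reconstructed = posterior * pi_a[None, :] / p_i[:, None]
```

In this toy example, `reconstructed` matches `pi_a_given_i`, and the posterior is sharply peaked for the two actions that distinguish the intentions, which is the situation the latent intention cost later encourages.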
Here, we show an extension of the derivation presented in (Ho & Ermon, 2016) (Eqs. (6.1, 6.2))
for a policy $\pi(a|s,i)$ augmented with the latent intention variable $i$, which uses demonstrations
from a set of expert policies $\pi_E$, rather than a single expert policy $\pi_{E_1}$. We are aiming at
maximum-entropy policies that can be determined from the latent intention variable $i$. Accordingly, we
transform the original IRL problem to reflect this goal:
$$\min_R \max_\pi H(\pi(a|s)) - H(\pi(a|s,i)) + \mathbb{E}_\pi R(s,a,i) - \mathbb{E}_{\pi_E} R(s,a,i), \tag{6.5}$$
where $\pi(a|s) = \sum_i \pi(a|s,i)\,p(i)$, which results in the policy averaged over intentions (since
$p(i)$ is constant). This goal reflects our objective: we aim to obtain a multimodal policy that has a
high entropy without any given intention, but collapses to a particular task when the intention
is specified. Analogously to the solution for a single expert policy, this optimization objective
results in the optimization goal of the generative adversarial imitation learning network, with the
exception that the state-action pairs $(s,a)$ are sampled from a set of expert policies $\pi_E$:
$$\begin{aligned}
\max_\theta \min_w \; & \mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta}[\log(D_w(s,a))] + \mathbb{E}_{(s,a)\sim\pi_E}[\log(1 - D_w(s,a))] \\
& + \lambda_H H(\pi_\theta(a|s)) - \lambda_I H(\pi_\theta(a|s,i)),
\end{aligned} \tag{6.6}$$
where $\lambda_I$, $\lambda_H$ correspond to the weighting parameters on the respective objectives.
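The resulting entropy term $H(\pi_\theta(a|s,i))$ can be expressed as:
$$\begin{aligned}
H(\pi_\theta(a|s,i)) &= \mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta}\left(-\log(\pi_\theta(a|s,i))\right) \\
&= -\mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta} \log\left(p(i|s,a)\,\frac{\pi_\theta(a|s)}{p(i)}\right) \\
&= -\mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta} \log(p(i|s,a)) - \mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta} \log \pi_\theta(a|s) + \mathbb{E}_{i\sim p(i)} \log p(i) \\
&= -\mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta} \log(p(i|s,a)) + H(\pi_\theta(a|s)) - H(i),
\end{aligned} \tag{6.7}$$
which results in the final objective:
$$\begin{aligned}
\max_\theta \min_w \; & \mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta}[\log(D_w(s,a))] + \mathbb{E}_{(s,a)\sim\pi_E}[\log(1 - D_w(s,a))] \\
& + (\lambda_H - \lambda_I)\, H(\pi_\theta(a|s)) + \lambda_I\, \mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta} \log(p(i|s,a)) + \lambda_I H(i),
\end{aligned} \tag{6.8}$$
where $H(i)$ is a constant that does not influence the optimization. This results in the same
optimization objective as for the single expert policy (see Eq. (6.2)) with an additional term
$\lambda_I\, \mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta} \log(p(i|s,a))$ responsible for rewarding state-action pairs that make the latent
intention inference easier. We refer to this cost as the latent intention cost and represent $p(i|s,a)$
with a neural network. The final reward function for the generator is:
$$\mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta}[\log(D_w(s,a))] + \lambda_I\, \mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta} \log(p(i|s,a)) + \lambda_{H'} H(\pi_\theta(a|s)), \tag{6.9}$$
where, comparing with Eq. (6.8), $\lambda_{H'}$ stands for the combined entropy weight $\lambda_H - \lambda_I$.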
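The entropy decomposition in Eq. (6.7) can be checked numerically on a small discrete example. The sketch below does so for a single fixed state, so the dependence on $s$ is dropped; the probability tables are hypothetical and serve only to confirm that both sides of the identity agree.

```python
import numpy as np

# Hypothetical discrete example: 2 intentions, 3 actions, one fixed state.
p_i = np.array([0.3, 0.7])                   # p(i)
pi_a_given_i = np.array([[0.7, 0.2, 0.1],    # pi(a | s, i=0)
                         [0.1, 0.3, 0.6]])   # pi(a | s, i=1)

joint = p_i[:, None] * pi_a_given_i          # joint distribution over (i, a)
pi_a = joint.sum(axis=0)                     # marginal policy pi(a|s)
posterior = joint / pi_a[None, :]            # p(i | s, a)

# Left-hand side of Eq. (6.7): H(pi(a|s,i)) = E[-log pi(a|s,i)]
lhs = -np.sum(joint * np.log(pi_a_given_i))

# Right-hand side: -E[log p(i|s,a)] + H(pi(a|s)) - H(i)
neg_log_posterior = -np.sum(joint * np.log(posterior))
h_marginal = -np.sum(pi_a * np.log(pi_a))
h_intention = -np.sum(p_i * np.log(p_i))
rhs = neg_log_posterior + h_marginal - h_intention
```

Both `lhs` and `rhs` evaluate to the same number up to floating-point error, which reflects the standard identity $H(a|i) = H(i|a) + H(a) - H(i)$.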
6.4.1 Relation to InfoGAN
In this section, we provide an alternative derivation of the optimization goal in Eq. (6.8) by
extending the InfoGAN approach presented in (Chen et al., 2016). Following (Chen et al., 2016),
we introduce the latent variablec as a means to capture the semantic features of the data distribution.
In this case, however, the latent variables are used in the imitation learning scenario, rather than the
traditional GAN setup, which prevents us from using additional noise variables (z in the InfoGAN
approach) that are used as noise samples to generate the data from.
Similarly to (Chen et al., 2016), to prevent collapsing to a single mode, the policy optimization
objective is augmented with mutual information $I(c; G(\pi_\theta^c, c))$ between the latent variable and
the state-action pairs generator $G$ dependent on the policy distribution $\pi_\theta^c$. This encourages the
policy to produce behaviors that are interpretable from the latent code, and given a larger number
of possible latent code values leads to an increase in the diversity of policy behaviors. The
corresponding generator goal can be expressed as:
$$\mathbb{E}_{c\sim p(c),(s,a)\sim\pi_\theta^c}[\log(D_w(s,a))] + \lambda_I\, I(c; G(\pi_\theta^c, c)) + \lambda_H H(\pi_\theta^c). \tag{6.10}$$
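In order to compute $I(c; G(\pi_\theta^c, c))$, we follow the derivation from (Chen et al., 2016) that
introduces a lower bound:
$$\begin{aligned}
I(c; G(\pi_\theta^c, c)) &= H(c) - H(c \mid G(\pi_\theta^c, c)) \\
&= \mathbb{E}_{(s,a)\sim G(\pi_\theta^c, c)}\left[\mathbb{E}_{c'\sim P(c|s,a)}[\log P(c'|s,a)]\right] + H(c) \\
&= \mathbb{E}_{(s,a)\sim G(\pi_\theta^c, c)}\left[D_{KL}(P(\cdot|s,a) \,\|\, Q(\cdot|s,a)) + \mathbb{E}_{c'\sim P(c|s,a)}[\log Q(c'|s,a)]\right] + H(c) \\
&\geq \mathbb{E}_{(s,a)\sim G(\pi_\theta^c, c)}\left[\mathbb{E}_{c'\sim P(c|s,a)}[\log Q(c'|s,a)]\right] + H(c) \\
&= \mathbb{E}_{c\sim P(c),(s,a)\sim G(\pi_\theta^c, c)}[\log Q(c|s,a)] + H(c).
\end{aligned} \tag{6.11}$$
By maximizing this lower bound we maximize $I(c; G(\pi_\theta^c, c))$. The auxiliary distribution $Q(c|s,a)$
can be parametrized by a neural network.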
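The resulting optimization goal is
$$\begin{aligned}
\max_\theta \min_w \; & \mathbb{E}_{c\sim p(c),(s,a)\sim\pi_\theta^c}[\log(D_w(s,a))] + \mathbb{E}_{(s,a)\sim\pi_E}[\log(1 - D_w(s,a))] \\
& + \lambda_I\, \mathbb{E}_{c\sim P(c),(s,a)\sim G(\pi_\theta^c, c)}[\log Q(c|s,a)] + \lambda_H H(\pi_\theta^c),
\end{aligned} \tag{6.12}$$
which results in the generator reward function:
$$\mathbb{E}_{c\sim p(c),(s,a)\sim\pi_\theta^c}[\log(D_w(s,a))] + \lambda_I\, \mathbb{E}_{c\sim P(c),(s,a)\sim G(\pi_\theta^c, c)}[\log Q(c|s,a)] + \lambda_H H(\pi_\theta^c). \tag{6.13}$$
This corresponds to the same objective that was derived in Sec. 6.4 (Eq. (6.9)). The auxiliary distribution
over the latent variables $Q(c|s,a)$ is analogous to the intention distribution $p(i|s,a)$.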
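The lower bound in Eq. (6.11) can be illustrated on a small discrete example, where the state-action pair is replaced by a single discrete variable $x$. In the sketch below the joint table and the function names are hypothetical. The bound is tight when the auxiliary distribution $Q$ equals the true posterior and loose when $Q$ ignores $x$.

```python
import numpy as np


def mutual_information(joint: np.ndarray) -> float:
    """Exact I(c; x) for a joint table with rows c and columns x."""
    p_c = joint.sum(axis=1, keepdims=True)
    p_x = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log(joint / (p_c * p_x))))


def variational_lower_bound(joint: np.ndarray, q_c_given_x: np.ndarray) -> float:
    """E_{c,x}[log Q(c|x)] + H(c), the last line of Eq. (6.11)."""
    p_c = joint.sum(axis=1)
    h_c = -np.sum(p_c * np.log(p_c))
    return float(np.sum(joint * np.log(q_c_given_x)) + h_c)


# Hypothetical joint distribution: 2 latent codes, 3 discretized outcomes x.
joint = np.array([[0.30, 0.15, 0.05],
                  [0.05, 0.15, 0.30]])

true_posterior = joint / joint.sum(axis=0, keepdims=True)  # P(c|x)
uninformative_q = np.full_like(joint, 0.5)                  # Q ignores x

tight = variational_lower_bound(joint, true_posterior)
loose = variational_lower_bound(joint, uninformative_q)
exact = mutual_information(joint)
```

Here `tight` coincides with `exact`, while `loose` is zero, so improving $Q$ towards the true posterior closes the gap given by the KL term in Eq. (6.11).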
6.5 Implementation
In this section, we discuss implementation details that can alleviate instability of the training
procedure of our model. The first indicator that the training has become unstable is a high
classification accuracy of the discriminator. In this case, it is difficult for the generator to produce
a meaningful policy as the reward signal from the discriminator is flat and the TRPO gradient of
the generator vanishes. In an extreme case, the discriminator assigns all the generator samples to
the same class and it is impossible for TRPO to provide a useful gradient as all generator samples
receive the same reward. Previous work suggests several ways to avoid this behavior. These include
leveraging the Wasserstein distance metric to improve the convergence behavior (Arjovsky et al.,
2017) and adding instance noise to the inputs of the discriminator to avoid degenerate generative
distributions (Sønderby et al., 2016). We find that adding the Gaussian noise helped us the most
to control the performance of the discriminator and to produce a smooth reward signal for the
generator policy. During our experiments, we anneal the noise similar to (Sønderby et al., 2016),
as the generator policy improves towards the end of the training.
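A minimal sketch of annealed Gaussian instance noise on the discriminator inputs is given below. The linear decay schedule, the initial standard deviation and the function names are assumptions made for illustration; the schedule actually used in our experiments is not specified here.

```python
import numpy as np


def noise_scale(iteration: int, total_iterations: int,
                initial_std: float = 0.5) -> float:
    """Linearly decay the noise standard deviation to zero (assumed schedule)."""
    frac = min(max(iteration / total_iterations, 0.0), 1.0)
    return initial_std * (1.0 - frac)


def add_instance_noise(batch: np.ndarray, std: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise to a batch of discriminator inputs."""
    return batch + std * rng.standard_normal(batch.shape)


rng = np.random.default_rng(0)
state_action_batch = np.zeros((4, 6))  # hypothetical batch of (s, a) vectors
noisy_batch = add_instance_noise(state_action_batch,
                                 noise_scale(10, 100), rng)
```

The same noise would be applied to both expert and generator samples before they are passed to the discriminator, so that neither class can be identified by the noise itself.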
An important indicator that the generator policy distribution has collapsed to a unimodal
policy is a high or increasing loss of the intention-prediction network $p(i|s,a)$. This means that
the prediction of the latent variable $i$ is difficult and, consequently, the policy behavior cannot be
categorized into separate skills. Hence, the policy executes the same skill for different values of
the latent variable. To prevent this, one can increase the weight of the latent intention cost $\lambda_I$ in
the generator loss or add more instance noise to the discriminator, which makes its reward signal
relatively weaker.
Figure 6.1: Left: Walker2D running forwards, running backwards, jumping. Right: Humanoid
running forwards, running backwards, balancing.
Figure 6.2: Left: Reacher with 2 targets: random initial state, reaching one target, reaching another
target. Right: Gripper-pusher: random initial state, grasping policy, pushing (when grasped) policy.
In this work, we employ both categorical and continuous latent variables to represent the latent
intention. The advantage of using a continuous variable is that we do not have to specify the
number of possible values in advance as with the categorical variable and it leaves more room
for interpolation between different skills. We use a softmax layer to represent categorical latent
variables, and use a uniform distribution for continuous latent variables as proposed in (Chen et al.,
2016).
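The two kinds of latent intention can be sampled and appended to the policy input as in the sketch below. The uniform prior over categories, the one-hot encoding, the interval [-1, 1] and the helper names are assumptions for illustration only.

```python
import numpy as np

def sample_categorical_intention(num_categories, batch_size, rng):
    """One-hot intentions drawn from a uniform categorical prior (assumed prior)."""
    idx = rng.integers(0, num_categories, size=batch_size)
    one_hot = np.zeros((batch_size, num_categories))
    one_hot[np.arange(batch_size), idx] = 1.0
    return one_hot

def sample_continuous_intention(batch_size, rng, low=-1.0, high=1.0):
    """Scalar intentions drawn from a uniform distribution on [low, high]."""
    return rng.uniform(low, high, size=(batch_size, 1))

def policy_input(state, intention):
    """Concatenate the state with the latent intention along the feature axis."""
    return np.concatenate([state, intention], axis=-1)
```

With the continuous variant, the number of skills does not have to be fixed in advance; the policy simply receives one extra scalar input.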
6.6 Experiments
Our experiments aim to answer the following questions: (1) Can we segment unstructured and
unlabelled demonstrations into skills and learn a multimodal policy that imitates them? (2) What
is the influence of the introduced intention-prediction cost on the resulting policies? (3) Can we
autonomously discover the number of skills presented in the demonstrations, and even accomplish
them in different ways? (4) Does the presented method scale to high-dimensional policies? (5)
Can we use the proposed method for learning hierarchical policies? We evaluate our method on
a series of challenging simulated robotics tasks described below. We would like to emphasize
that the demonstrations consist of shuffled state-action pairs such that no temporal information or
segmentation is used during learning.
6.6.1 Task Setup
Reacher The Reacher environment is depicted in Fig. 6.2 (left). The actuator is a 2-DoF arm
attached at the center of the scene. There are several targets placed at random positions throughout
the environment. The goal of the task is, given a data set of reaching motions to random targets, to
discover the dependency of the target selection on the intention and learn a policy that is capable
of reaching different targets based on the specified intention input. We evaluate the performance of
our framework on environments with 1, 2 and 4 targets.
Walker2D The Walker2D (Fig. 6.1 left) is a 6-DoF bipedal robot consisting of two legs
and feet attached to a common base. The goal of this task is to learn a policy that can switch
between three different behaviors dependent on the discovered intentions: running forward, running
backward and jumping. We use TRPO to train single expert policies and create a combined data
set of all three behaviors that is used to train a multimodal policy using our imitation framework.
Humanoid Humanoid (Fig. 6.1 right) is a high-dimensional robot with 17 degrees of freedom.
Similar to Walker2D, the goal of the task is to be able to discover three different policies: running
forward, running backward and balancing, from the combined expert demonstrations of all of
them.
Gripper-pusher This task involves controlling a 4-DoF arm with an actuated gripper to push
a sliding block to a specified goal area (Fig. 6.2 right). We provide separate expert demonstrations
of grasping the object, and pushing it towards the goal starting from the object already being
inside the hand. The initial positions of the arm, block and the goal area are randomly sampled at
the beginning of each episode. The goal of our framework is to discover both intentions and the
hierarchical structure of the task from a combined set of demonstrations.
6.6.2 Multi-Target Imitation Learning
Our goal here is to analyze the ability of our method to segment and imitate policies that perform
the same task for different targets. To this end, we first evaluate the influence of the latent intention
cost on the Reacher task with 2 and 4 targets. For both experiments, we use either a categorical
intention distribution with the number of categories equal to the number of targets or a continuous,
uniformly-distributed intention variable, which means that the network has to discover the number
of intentions autonomously. Fig. 6.3 top shows the results of the reaching tasks using the latent
intention cost for 2 and 4 targets with different latent intention distributions. For the continuous
latent variable, we show a span of different intentions between −1 and 1 in 0.2 intervals. The
colors indicate the intention “value”. In the categorical distribution case, we are able to learn a
multimodal policy that can reach all the targets dependent on the given latent intention (Fig. 6.3-1
and Fig. 6.3-3 top). The continuous latent intention is able to discover two modes in the case of two
targets (Fig. 6.3-2 top) but it collapses to only two modes in the four-target case (Fig. 6.3-4 top) as
this is a significantly more difficult task.
As a baseline, we present the results of the Reacher task achieved by the standard GAN
imitation learning presented in (Ho & Ermon, 2016) without the latent intention cost. The obtained
results are presented in Fig. 6.3 bottom. Since the network is not encouraged to discover different
skills through the intention learning cost, it collapses to a single target for 2 targets in both the
continuous and discrete latent intention variables. In the case of 4 targets, the network collapses to
2 modes, which can be explained by the fact that even without the latent intention cost the imitation
network tries to imitate most of the presented demonstrations. Since the demonstration set is very
diverse in this case, the network learned two modes without the explicit instruction (latent intention
cost) to do so.
Figure 6.3: Results of the imitation GAN with (top row) and without (bottom row) the latent
intention cost. Left: Reacher with 2 targets (crosses): final positions of the reacher (circles) for
categorical (1) and continuous (2) latent intention variable. Right: Reacher with 4 targets (crosses):
final positions of the reacher (circles) for categorical (3) and continuous (4) latent intention variable.
Figure 6.4: Left: Rewards of different Reacher policies for 2 targets for different intention values
over the training iterations with (1) and without (2) the latent intention cost. Right: Two examples
of a heatmap for 1 target Reacher using two latent intentions each.
To demonstrate the development of different intentions, in Fig. 6.4 (left) we present the Reacher
rewards over training iterations for different intention variables. When the latent intention cost
is included (Fig. 6.4-1), the separation of different skills for different intentions starts to emerge
around the 1000th iteration and leads to a multimodal policy that, given the intention value,
consistently reaches the target associated with that intention. In the case of the standard imitation
learning GAN setup (Fig. 6.4-2), the network learns how to imitate reaching only one of the targets
for both intention values.
In order to analyze the ability to discover different ways to accomplish the same task, we use
our framework with the categorical latent intention in the Reacher environment with a single target.
Since we only have a single set of expert trajectories that reach the goal in one, consistent manner,
we subsample the expert state-action pairs to ease the intention learning process for the generator.
Fig. 6.4 (right) shows two examples of a heatmap of the visited end-effector states accumulated for
two different values of the intention variable. For both cases, the task is executed correctly, the
robot reaches the target, but it achieves it using different trajectories. These trajectories naturally
emerged through the latent intention cost as it encourages different behaviors for different latent
intentions. It is worth noting that the presented behavior can be also replicated for multiple targets
if the number of categories in the categorical distribution of the latent intention exceeds the number
of targets.
Figure 6.5: Top: Rewards of Walker2D policies for different intention values over the training
iterations with (left) and without (right) the latent intention cost. Bottom: Rewards of Humanoid
policies for different intention values over the training iterations with (left) and without (right) the
latent intention cost.
6.6.3 Multi-Task Imitation Learning
We also seek to further understand whether our model extends to segmenting and imitating policies
that perform different tasks. In particular, we evaluate whether our framework is able to learn a
multimodal policy on the Walker2D task. We mix three different policies – running backwards,
running forwards, and jumping – into one expert policy $\pi_E$ and try to recover all of them through
our method. The results are depicted in Fig. 6.5 (top). The additional latent intention cost results in
a policy that is able to autonomously segment and mimic all three behaviors and achieve a similar
performance to the expert policies (Fig. 6.5 top-left). Different intention variable values correspond
to different expert policies: 0 - running forwards, 1 - jumping, and 2 - running backwards. The
imitation learning GAN method is shown as a baseline in Fig. 6.5 (top-right). The results show
that the policy collapses to a single mode, where all different intention variable values correspond
to the jumping behavior, ignoring the demonstrations of the other two skills.
Figure 6.6: Time-lapse of the learned Gripper-pusher policy. The intention variable is changed
manually in the fifth screenshot, once the grasping policy has grasped the block.
To test if our multimodal imitation learning framework scales to high-dimensional tasks, we
evaluate it in the Humanoid environment. The expert policy is constructed using three expert
policies: running backwards, running forwards, and balancing while standing upright. Fig. 6.5
(bottom) shows the rewards obtained for different values of the intention variable. Similarly to
Walker2D, the latent intention cost enables the neural network to segment the tasks and learn a
multimodal imitation policy. In this case, however, due to the high dimensionality of the task,
the resulting policy is able to mimic running forwards and balancing policies almost as well as
the experts, but it achieves a suboptimal performance on the running backwards task (Fig. 6.5
bottom-left). The imitation learning GAN baseline collapses to a unimodal policy that maps all
the intention values to a balancing behavior (Fig. 6.5 bottom-right).
Finally, we evaluate the ability of our method to discover options in hierarchical IRL tasks.
In order to test this, we collect expert policies in the Gripper-pusher environment that consist of
demonstrations of grasping and of pushing when the object is grasped. The goal of this task is to check
whether our method is able to segment the mix of expert policies into separate grasping
and pushing-when-grasped skills. Since the two subtasks start from different initial conditions,
we cannot present the results in the same form as for the previous tasks. Instead, we present a
time-lapse of the learned multimodal policy (see Fig. 6.6) that shows the ability to change
the intention during the execution. The categorical intention variable is manually changed after
the block is grasped. The intention change results in switching to a pushing policy that brings the
block into the goal region. We present this setup as an example of extracting different options from
the expert policies that can be further used in a hierarchical reinforcement learning task to learn
the best switching strategy.
6.7 Conclusions
We present a novel imitation learning method that learns a multimodal stochastic policy, which is
able to imitate a number of automatically segmented tasks using a set of unstructured and unlabeled
demonstrations. The presented approach learns the notion of intention and is able to perform
different tasks based on the policy intention input. We evaluated our method on a set of simulation
scenarios where we show that it is able to segment the demonstrations into different tasks and to
learn a multimodal policy that imitates all of the segmented skills. We also compared our method
to a baseline approach that performs imitation learning without explicitly separating the tasks.
Chapter 7
Closing the Sim-to-Real Loop: Adapting Simulation
Randomization with Real World Experience
We consider the problem of transferring policies to the real world by training on a distribution
of simulated scenarios. Rather than manually tuning the randomization of simulations, we adapt
the simulation parameter distribution using a few real world rollouts interleaved with policy
training. In doing so, we are able to change the distribution of simulations to improve the policy
transfer by matching the policy behavior in simulation and the real world. We show that policies
trained with our method are able to reliably transfer to different robots in two real world tasks:
swing-peg-in-hole and opening a cabinet drawer. The video of our experiments can be found at
https://sites.google.com/view/simopt.
7.1 Introduction
Learning continuous control in complex real world environments has attracted wide interest in the
past few years, in particular learning policies in simulators and transferring them to the real
world, as we still struggle with finding ways to acquire the necessary amount of experience and
data in the real world directly. While there have been recent attempts at learning by collecting
large scale data directly on real robots (Levine et al., 2018; Pinto & Gupta, 2016; Yahya et al.,
2017; Kalashnikov et al., 2018), such an approach remains challenging as collecting real world
data is prohibitively laborious and expensive. Simulators offer several advantages, e.g. they can
run faster than real-time and allow for acquiring a large diversity of training data. However, due to
imprecise simulation models and the lack of high fidelity replication of real world scenes, policies
learned in simulation often cannot be directly applied on real-world systems, a phenomenon also
known as the reality gap (Jakobi et al., 1995). In this work, we focus on closing the reality gap
by learning policies on distributions of simulated scenarios that are optimized for a better policy
transfer.
Training policies on a large diversity of simulated scenarios by randomizing relevant parameters,
also known as domain randomization, has shown considerable promise for real world transfer
in a range of recent works (Tobin et al., 2017; Sadeghi & Levine, 2017; James et al., 2017;
Andrychowicz et al., 2018). However, the design of appropriate simulation parameter distributions
remains a tedious task and often requires substantial expert knowledge. Moreover, there are no
guarantees that the applied randomization would actually lead to a sensible real world policy as the
design choices made in randomizing the parameters tend to be somewhat biased by the expertise
of the practitioner. In this work, we apply a data-driven approach and use real world data to adapt
simulation randomization such that the behavior of the policies trained in simulation better matches
their behavior in the real world. Therefore, starting with some initial distribution of the simulation
parameters, we can perform learning in simulation and use real world rollouts of learned policies
to gradually change the simulation randomization such that the learned policies transfer better to
the real world without requiring the exact replication of the real world scene in simulation. This
approach falls into the domain of model-based reinforcement learning. However, we leverage
recent developments in physics simulations to provide a strong prior of the world model in order to
accelerate the learning process. Our system uses partial observations of the real world and only
needs to compute rewards in simulation, therefore lifting the requirement for full state knowledge
or reward instrumentation in the real world.
Figure 7.1: Policies for opening a cabinet drawer and swing-peg-in-hole tasks trained by
alternatively performing reinforcement learning with multiple agents in simulation and updating
simulation parameter distribution using a few real world policy executions.
7.2 Related Work
The problem of finding accurate models of the robot and the environment that can facilitate the
design of robotic controllers in real world dates back to the original works on system
identification (Ljung, 1999; Giri & Bai, 2010). In the context of reinforcement learning (RL),
model-based RL explored optimizing policies using learned models (M. Deisenroth et al., 2013).
In (M. P. Deisenroth & Rasmussen, 2011; M. Deisenroth et al., 2011), the data from real-world
policy executions is used to fit a probabilistic dynamics model, which is then used for learning an
optimal policy. Although our work follows the general principle of model-based reinforcement
learning, we aim at using a simulation engine as a form of parameterized model that can help us to
embed prior knowledge about the world.
Overcoming the discrepancy between simulated models and the real world has been addressed
through identifying simulation parameters (Kolev & Todorov, 2015), finding common feature
representations of real and synthetic data (Tzeng, Devin, et al., 2015), using generative models to
make synthetic images more realistic (Bousmalis et al., 2017), fine-tuning the policies trained in
simulation in the real world (Rusu et al., 2017), learning inverse dynamics models (Christiano et al.,
2016), multi-objective optimization of task fitness and transferability to the real world (Koos et al.,
2010), training on ensembles of dynamics models (Mordatch et al., 2015) and training on a large
variety of simulated scenarios (Tobin et al., 2017). Domain randomization of textures was used
in (Sadeghi & Levine, 2017) to learn to fly a real quadcopter by training an image based policy
entirely in simulation. Peng et al. (Peng et al., 2018) use randomization of physical parameters
of the scene to learn a policy in simulation and transfer it to a real robot for pushing a puck to a
target position. In (Andrychowicz et al., 2018), randomization of physical properties and object
appearance is used to train a dexterous robotic hand to perform in-hand manipulation. Yu et al. (Yu
et al., 2017) propose to not only train a policy on a distribution of simulated parameters, but also
learn a component that predicts the system parameters from the current states and actions, and use
the prediction as an additional input to the policy. In (Muratore et al., 2018), an upper confidence
bound on the estimated simulation optimization bias is used as a stopping criterion for a robust
training with domain randomization. In (Wulfmeier et al., 2017), an auxiliary reward is used to
encourage policies trained in source and target environments to visit the same states.
Combination of system identification and dynamics randomization has been used in the past
to learn locomotion for a real quadruped (Tan et al., 2018), non-prehensile object
manipulation (Lowrey et al., 2018) and in-hand object pivoting (Antonova et al., 2017). In our work, we
recognize domain randomization and system identification as powerful tools for training general
policies in simulation. However, we address the problem of automatically learning simulation
parameter distributions that improve policy transfer, as it remains challenging to do it manually.
Furthermore, as also noticed in (Pinto et al., 2017), simulators have an advantage of providing a
full state of the system compared to partial observations of the real world, which is also used in
our work for designing better reward functions.
The closest to our approach are the methods from (Tan et al., 2016; S. Zhu et al., 2018; Farchy
et al., 2013; Hanna & Stone, 2017; Rajeswaran et al., 2016) that propose to iteratively learn
simulation parameters and train policies. In (Tan et al., 2016), an iterative system identification
framework is used to optimize trajectories of a bipedal robot in simulation and calibrate the
simulation parameters by minimizing the discrepancy between the real world and simulated
execution of the trajectories. Although we also use the real world data to compute the discrepancy
of the simulated executions, we are able to use partial observations of the real world instead of
the full states and we concentrate on learning general policies by finding a simulation parameter
distribution that leads to a better transfer without the need for exact replication of the real world
environment. (S. Zhu et al., 2018) suggest optimizing the simulation parameters such that the
value function is well approximated in simulation without replicating the real world dynamics. We
also recognize that exact replication of the real world dynamics might not be feasible; however,
a suitable randomization of the simulated scenarios can still lead to a successful policy transfer.
In addition, our approach does not require estimating the reward in the real world, which might
be challenging if some of the reward components cannot be observed. (Farchy et al., 2013) and
(Hanna & Stone, 2017) consider grounding the simulator using real world data. However, (Farchy
et al., 2013) requires a human in the loop to select the best simulation parameters, and (Hanna
& Stone, 2017) needs to fit additional models for the real robot forward dynamics and simulator
inverse dynamics. Finally, our work is closest to the adaptive EPOpt framework of Rajeswaran et
al. (Rajeswaran et al., 2016), which optimizes a policy over an ensemble of models and adapts the
model distribution using data from the target domain. EPOpt optimizes a risk-sensitive objective
to obtain robust policies, whereas we optimize the average performance, which is a risk-neutral
objective. Additionally, EPOpt updates the model distribution by employing Bayesian inference
with a particle filter, whereas we update the model distribution using an iterative KL-divergence
constrained procedure. More importantly, they focus on simulated environments while in our work,
we develop an approach that is shown to work in the real world and apply it to two real robot tasks.
7.3 Closing the Sim-to-Real Loop
7.3.1 Simulation Randomization
Let $M = (\mathcal{S}, \mathcal{A}, P, R, p_0, \gamma, T)$ be a finite-horizon Markov Decision Process (MDP), where $\mathcal{S}$ and
$\mathcal{A}$ are state and action spaces, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ is a state-transition probability function or
probabilistic system dynamics, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ a reward function, $p_0: \mathcal{S} \to \mathbb{R}_+$ an initial state
distribution, $\gamma$ a reward discount factor, and $T$ a fixed horizon. Let $\tau = (s_0, a_0, \ldots, s_T, a_T)$ be
a trajectory of states and actions and $R(\tau) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t)$ the trajectory reward. The goal
of reinforcement learning methods is to find parameters $\theta$ of a policy $\pi_\theta(a|s)$ that maximize
the expected discounted reward over trajectories induced by the policy: $\mathbb{E}_{\pi_\theta}[R(\tau)]$ where
$s_0 \sim p_0$, $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$ and $a_t \sim \pi_\theta(a_t|s_t)$.
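For concreteness, the trajectory reward defined above can be computed as in the following short sketch, where `rewards` is assumed to hold the per-step values $R(s_t, a_t)$ for $t = 0, \ldots, T$ (the function name is hypothetical).

```python
import numpy as np

def trajectory_reward(rewards, gamma):
    """Discounted trajectory reward: sum over t of gamma**t * R(s_t, a_t)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))
```

For example, three unit rewards with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75.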
In our work, the system dynamics are either induced by a simulation engine or the real world.
As the simulation engine itself is deterministic, a reparameterization trick (Kingma & Welling,
2013) can be applied to introduce probabilistic dynamics. In particular, we define a distribution of
simulation parameters $\xi \sim p_\phi(\xi)$ parameterized by $\phi$. The resulting probabilistic system dynamics
of the simulation engine are $P_{\xi \sim p_\phi} = P(s_{t+1}|s_t, a_t, \xi)$.
As it was shown in (Tobin et al., 2017; Sadeghi & Levine, 2017; Andrychowicz et al., 2018), it
is possible to design a distribution of simulation parameters $p_\phi(\xi)$, such that a policy trained on
$P_{\xi \sim p_\phi}$ would perform well on a real-world dynamics distribution. This approach is also known as
domain randomization and the policy training maximizes the expected reward under the dynamics
induced by the distribution of simulation parameters $p_\phi(\xi)$:
$$\max_\theta \mathbb{E}_{P_{\xi \sim p_\phi}} \left[ \mathbb{E}_{\pi_\theta} [R(\tau)] \right] \quad (7.1)$$
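The outer expectation in the domain randomization objective can be approximated by sampling simulation parameters. The sketch below assumes a Gaussian parameter distribution and a hypothetical callable `policy_return(xi)` that rolls out a fixed policy in a simulator configured with parameters `xi` and returns $R(\tau)$; neither name refers to an existing API.

```python
import numpy as np

def randomized_return(policy_return, mean, cov, num_samples, rng):
    """Monte Carlo estimate of E_{xi ~ N(mean, cov)}[R(tau)] for a fixed policy."""
    xis = rng.multivariate_normal(mean, cov, size=num_samples)
    return float(np.mean([policy_return(xi) for xi in xis]))
```

In policy training, each sampled `xi` would correspond to one randomized simulation instance, and the policy parameters are updated to increase this average.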
Domain randomization requires significant expertise and tedious manual fine-tuning to design
the simulation parameter distribution $p_\phi(\xi)$. Furthermore, as we show in our experiments, it is
often disadvantageous to use overly wide distributions of simulation parameters as they can include
scenarios with infeasible solutions that hinder successful policy learning, or lead to exceedingly
conservative policies. Instead, in the next section, we present a way to automate the learning of
$p_\phi(\xi)$ that makes it possible to shape a suitable randomization without the need to train on very
wide distributions.
7.3.2 Learning Simulation Randomization
The goal of our framework is to find a distribution of simulation parameters that brings observations
or partial observations induced by the policy trained under this distribution closer to the observations
of the real world. Let $\pi_{\theta, p_\phi}$ be a policy trained under the simulated dynamics distribution $P_{\xi \sim p_\phi}$
as in the objective (7.1), and let $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$ be a measure of discrepancy between real world
observation trajectories $\tau^{ob}_{real} = (o_{0,real}, \ldots, o_{T,real})$ and simulated observation trajectories
$\tau^{ob}_{\xi} = (o_{0,\xi}, \ldots, o_{T,\xi})$ sampled using policy $\pi_{\theta, p_\phi}$ and the dynamics distribution $P_{\xi \sim p_\phi}$. It should be noted
that the inputs of the policy $\pi_{\theta, p_\phi}$ and observations used to compute $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$ are not required
to be the same. The goal of optimizing the simulation parameter distribution is to minimize the
following objective:
$$\min_\phi \mathbb{E}_{P_{\xi \sim p_\phi}} \left[ \mathbb{E}_{\pi_{\theta, p_\phi}} \left[ D(\tau^{ob}_{\xi}, \tau^{ob}_{real}) \right] \right] \quad (7.2)$$
Figure 7.2: The pipeline for optimizing the simulation parameter distribution. After training a
policy on current distribution, we sample the policy both in the real world and for a range of
parameters in simulation. The discrepancy between the simulated and real observations is used to
update the simulation parameter distribution in SimOpt.
This optimization would entail training and real robot evaluation of the policy $\pi_{\theta, p_\phi}$ for each $\phi$.
This would require a large number of RL iterations and, more critically, real robot trials. Hence, we
develop an iterative approach to approximate the optimization by training a policy $\pi_{\theta, p_{\phi_i}}$ on the
simulation parameter distribution from the previous iteration and using it both for sampling the
real world observations and for optimizing the new simulation parameter distribution $p_{\phi_{i+1}}$:
$$\min_{\phi_{i+1}} \mathbb{E}_{P_{\xi_{i+1} \sim p_{\phi_{i+1}}}} \left[ \mathbb{E}_{\pi_{\theta, p_{\phi_i}}} \left[ D(\tau^{ob}_{\xi_{i+1}}, \tau^{ob}_{real}) \right] \right] \quad (7.3)$$
$$\text{s.t. } D_{KL}\left( p_{\phi_{i+1}} \,\|\, p_{\phi_i} \right) \le \epsilon,$$
where we introduce a KL-divergence step $\epsilon$ between the old simulation parameter distribution $p_{\phi_i}$
and the updated distribution $p_{\phi_{i+1}}$ to avoid going out of the trust region of the policy $\pi_{\theta, p_{\phi_i}}$ trained
on the old simulation parameter distribution. Fig. 7.2 shows the general structure of our algorithm
that we call SimOpt.
Algorithm 3: SimOpt framework
$p_{\phi_0}$ ← Initial simulation parameter distribution
$\epsilon$ ← KL-divergence step for updating $p_\phi$
for iteration $i \in \{0, \ldots, N\}$ do
    env ← Simulation($p_{\phi_i}$)
    $\pi_{\theta, p_{\phi_i}}$ ← RL(env)
    $\tau^{ob}_{real}$ ∼ RealRollout($\pi_{\theta, p_{\phi_i}}$)
    $\xi$ ∼ Sample($p_{\phi_i}$)
    $\tau^{ob}_{\xi}$ ∼ SimRollout($\pi_{\theta, p_{\phi_i}}$, $\xi$)
    $c(\xi)$ ← $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$
    $p_{\phi_{i+1}}$ ← UpdateDistribution($p_{\phi_i}$, $\xi$, $c(\xi)$, $\epsilon$)
end for
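The iterative procedure can be summarized as a short control loop. The following is a schematic sketch only: the Gaussian parameterization by `mean` and `cov` and all callables (`train_policy`, `real_rollout`, `sim_rollout`, `discrepancy`, `update_distribution`) are hypothetical placeholders for the corresponding components, which would be supplied by the user.

```python
import numpy as np

def simopt(mean, cov, train_policy, real_rollout, sim_rollout, discrepancy,
           update_distribution, num_iterations, num_samples, epsilon, rng):
    """Alternate policy training and simulation-distribution updates."""
    for _ in range(num_iterations):
        # Train a policy on the current simulation parameter distribution.
        policy = train_policy(mean, cov)
        # Collect real world observations with the trained policy.
        tau_real = real_rollout(policy)
        # Sample simulation parameters and roll out the same policy in simulation.
        xis = rng.multivariate_normal(mean, cov, size=num_samples)
        costs = np.array([discrepancy(sim_rollout(policy, xi), tau_real) for xi in xis])
        # Update the distribution under a KL-divergence step epsilon.
        mean, cov = update_distribution(mean, cov, xis, costs, epsilon)
    return mean, cov
```

Because the loop only exchanges samples and scalar costs with the simulator, any of the components can be replaced without touching the others.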
7.3.3 Implementation
Here we describe particular implementation choices for the components of our framework used in
this work. However, it should be noted that each of the components is replaceable. Algorithm 3
describes the order of running all the components in our implementation. The RL training is
performed on a GPU based simulator using a parallelized version of proximal policy optimization
(PPO) (Schulman et al., 2017) on a multi-GPU cluster (Liang et al., 2018). We parameterize
our simulation parameter distribution as a Gaussian, i.e. $p_\phi(\xi) \sim \mathcal{N}(\mu, \Sigma)$ with $\phi = (\mu, \Sigma)$.
We choose weighted $\ell_1$ and $\ell_2$ norms between simulation and real world observations for our
observation discrepancy function $D$:
$$D(\tau^{ob}_{\xi}, \tau^{ob}_{real}) = w_{\ell_1} \sum_{i=0}^{T} \left| W (o_{i,\xi} - o_{i,real}) \right| + w_{\ell_2} \sum_{i=0}^{T} \left\| W (o_{i,\xi} - o_{i,real}) \right\|_2^2, \quad (7.4)$$
where $w_{\ell_1}$ and $w_{\ell_2}$ are the weights of the $\ell_1$ and $\ell_2$ norms, and $W$ are the importance weights for
each observation dimension. We additionally apply a Gaussian filter to the distance computation to
account for misalignments of the trajectories.
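A possible NumPy realization of the discrepancy (7.4) is sketched below. How exactly the Gaussian filter enters the distance computation is not specified above; smoothing the weighted difference along the time axis is one interpretation and should be read as an assumption, as are the function names and default weights.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def discrepancy(obs_sim, obs_real, dim_weights, w_l1=1.0, w_l2=1.0, sigma=None):
    """Weighted l1 + squared l2 distance between two (T+1, d) observation trajectories."""
    obs_sim = np.asarray(obs_sim, dtype=float)
    obs_real = np.asarray(obs_real, dtype=float)
    # Apply the per-dimension importance weights W.
    diff = (obs_sim - obs_real) * np.asarray(dim_weights, dtype=float)
    if sigma is not None:
        # Assumed use of the Gaussian filter: smooth the difference over time.
        radius = min(max(1, int(3 * sigma)), (diff.shape[0] - 1) // 2)
        kernel = gaussian_kernel(sigma, radius)
        diff = np.stack(
            [np.convolve(diff[:, j], kernel, mode="same") for j in range(diff.shape[1])],
            axis=1,
        )
    return float(w_l1 * np.abs(diff).sum() + w_l2 * (diff ** 2).sum())
```

Setting an entry of `dim_weights` to zero removes that observation dimension from the cost, which is how unobserved or unreliable dimensions could be ignored.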
As we use a non-differentiable simulator, we employ a sampling-based gradient-free algorithm
based on relative entropy policy search (Peters et al., 2010) for optimizing the objective (7.3),
which is able to perform updates of $p_\phi$ with an upper bound on the KL-divergence step. By doing
so, the simulator can be treated as a black-box, as in this case $p_\phi$ can be optimized directly by only
using samples $\xi \sim p_\phi$ and the corresponding costs $c(\xi)$ coming from $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$. Sampling of
simulation parameters and the corresponding policy rollouts is highly parallelizable, which we
use in our experiments to evaluate large amounts of simulation parameter samples.
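To illustrate the kind of sample-based update described above, the following sketch performs an episodic update in the style of relative entropy policy search for a Gaussian distribution. It is a simplification under stated assumptions: the temperature is found by a coarse grid search over the dual function rather than a proper optimizer, the KL bound applies to the sample weights rather than exactly to the fitted Gaussians, and the function names are hypothetical.

```python
import numpy as np

def reps_dual(eta, costs, epsilon):
    """REPS dual g(eta) for rewards r = -cost, computed with a max-shift for stability."""
    r = -costs
    r_max = r.max()
    return eta * epsilon + r_max + eta * np.log(np.mean(np.exp((r - r_max) / eta)))

def reps_gaussian_update(samples, costs, epsilon, reg=1e-6):
    """Reweight samples by exp(-cost / eta) and refit a Gaussian by weighted maximum likelihood."""
    samples = np.asarray(samples, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Coarse grid search for the temperature eta (assumption: replaces a real optimizer).
    scale = costs.max() - costs.min() + 1e-12
    etas = np.logspace(-3, 3, 200) * scale
    eta = etas[np.argmin([reps_dual(e, costs, epsilon) for e in etas])]
    weights = np.exp(-(costs - costs.min()) / eta)
    weights /= weights.sum()
    new_mean = weights @ samples
    centered = samples - new_mean
    # Small diagonal regularization keeps the covariance positive definite.
    new_cov = (centered * weights[:, None]).T @ centered + reg * np.eye(samples.shape[1])
    return new_mean, new_cov
```

A smaller `epsilon` yields a larger temperature and therefore flatter weights, so the fitted distribution stays closer to the one the samples were drawn from.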
As noted above, single components of our framework can be exchanged. In case of availability
of a differentiable simulator, the objective (7.3) can be defined as a loss function for optimizing
with gradient descent. Furthermore, for cases where $\ell_1$ and $\ell_2$ norms are not applicable, we can
employ other forms of discrepancy functions, e.g. to account for potential domain shifts between
observations (Tzeng, Devin, et al., 2015; Tzeng, Hoffman, et al., 2015; Sermanet et al., 2018).
Alternatively, real world and simulation data can be additionally used to train $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$ to
discriminate between the observations by minimizing the prediction loss of classifying
observations as simulated or real, similar to the discriminator training in the generative adversarial
framework (Goodfellow et al., 2014; Ho & Ermon, 2016; Hausman et al., 2017). Finally, a higher
dimensional generative model $p_\phi(\xi)$ can be employed to provide a multimodal randomization of
the simulated environments.
7.4 Experiments
In our experiments we aim at answering the following questions: (1) How does our method
compare to standard domain randomization? (2) How does learning a simulation parameter distribution
compare to training on a very wide parameter distribution? (3) How many SimOpt iterations and
real world trials are required for a successful transfer of robotic manipulation policies? (4) Does
our method work for different real world tasks and robots?
We start by performing an ablation study in simulation by transferring policies between scenes
with different initial state distributions, such as different poses of the cabinet in the drawer opening
task. We demonstrate that updating the distribution of the simulation parameters leads to a
successful policy transfer in contrast to just using an initial distribution of the parameters without
any updates as done in standard domain randomization. As we observe, training on very wide
parameter distributions is significantly more difficult and prone to fail compared to initializing
with a conservative parameter distribution and updating it using SimOpt afterwards.
Next, we show that we can successfully transfer policies to real robots, such as ABB Yumi
and Franka Panda, for complex articulated tasks such as cabinet drawer opening, and tasks with
non-rigid bodies and complex dynamics, such as the swing-peg-in-hole task with the peg swinging
on a soft rope. The policies can be transferred with a very small number of real robot trials by
leveraging large-scale training on a multi-GPU cluster.
7.4.1 Tasks
We evaluate our approach on two robot manipulation tasks: cabinet drawer opening and
swing-peg-in-hole.
Swing-peg-in-hole
The goal of this task is to put a peg attached to a robot hand on a rope into a hole placed at a 45
degrees angle. Manipulating a soft rope leads to a swinging motion of the peg, which makes the
dynamics of the task more challenging. The task setup in simulation and the real world using a
7-DoF Yumi robot from ABB is depicted in Fig. 7.1 on the right. Our observation space consists
of 7-DoF arm joint configurations and the 3D position of the peg. The reward function for the RL
training in simulation includes the distance of the peg from the hole, angle alignment with the hole
and a binary reward for solving the task.
Drawer opening
In the drawer opening task, the robot has to open a drawer of a cabinet by grasping and pulling it
with its fingers. This task requires handling contact dynamics when grasping the drawer
handle. For this task, we use a 7-DoF Panda arm from Franka Emika. Simulated and real world
settings are shown in Fig. 7.1 on the left. This task uses a 10D observation space: 7D
robot joint angles and the 3D position of the cabinet drawer handle. The reward function consists of
the distance penalty between the handle and end-effector positions, the angle alignment of the
end-effector and the drawer handle, the opening distance of the drawer and an indicator function
ensuring that both robot fingers are on the handle.
We would like to emphasize that our method does not require the full state information of the
real world, e.g. we do not need to estimate the rope diameter, rope compliance etc. to update the
simulation parameter distribution in the swing-peg-in-hole task. The output of our policies consists
of 7 joint velocity commands and an additional gripper command for the drawer opening task.
7.4.2 Simulation Engine
We use NVIDIA Flex as a high-fidelity GPU-based physics simulator that uses a maximal coordinate
representation to simulate rigid body dynamics. Flex allows a highly parallel implementation
and can simulate multiple instances of the scene on a single GPU. We use the multi-GPU-based
RL infrastructure developed in (Liang et al., 2018) to leverage the highly parallel nature of the
simulator.
7.4.3 Comparison to Standard Domain Randomization
We aim at understanding what effect a wide simulation parameter distribution can have on learning
robust policies, and how we can improve the learning performance and the transferability of the
policies using our method to adjust simulation randomization. Fig. 7.3 shows an example of
training a policy on a very wide distribution of simulation parameters for the swing-peg-in-hole
task. In this case, peg size, rope properties and size of the peg box were randomized. As
we can observe, a large part of the randomized instances does not have a feasible solution, e.g.
when the peg is too large for the hole or the rope is too short. Finding a suitably wide parameter
distribution would require manual fine-tuning of the randomization parameters.
Moreover, learning performance of standard domain randomization depends strongly on the
variance of the parameter distribution. We investigate this in a simulated cabinet drawer opening
task with a Franka arm which is placed in front of a cabinet. We randomize the position of the
cabinet along the lateral direction (X-coordinate) while keeping all other simulation parameters
Figure 7.3: An example of a wide distribution of simulation parameters in the swing-peg-in-hole
task where it is not possible to find a solution for many of the task instances.
constant. Our policies are 2-layer neural networks with fully connected layers of 64 units
each, trained with PPO for 200 iterations. As we increase the variance of the cabinet position, we
observe that the learned policies tend to be conservative, i.e. they do end up reaching the handle
of the drawer but fail to open it. This is shown in Fig. 7.4 (left), where we plot the reward as a
function of the number of iterations used to train the RL policy. We start with a standard deviation
of 2cm ($\sigma^2$ = 7e−4) and increase it to 10cm ($\sigma^2$ = 0.01). As shown in the plot, the
policy is sensitive to the choice of this parameter and only manages to open the drawer when the
standard deviation is 2cm. The reward difference may not seem large, but the total reward is
dominated by the reaching reward. Increasing the variance further, in an attempt to cover a wider
operating range, can often lead to simulating unrealistic scenarios and to a catastrophic breakdown
of the physics simulation with various joints of the robot reaching their limits. We also observed
that the policy is extremely sensitive to variance in all three axes of the cabinet position, i.e. the
policy only converges when the standard deviation is 2cm and fails to learn even reaching the
handle otherwise.
In our next set of experiments, we show that our method is able to perform policy transfer from
the source to the target drawer opening scene, where the position of the cabinet in the target scene
is offset by a distance of 15cm or 22cm. Such large distances would have required the standard
deviation of the cabinet position to be at least 10cm for any naïve domain randomization based
training, which fails to produce a policy that opens the drawer as shown in Fig. 7.4 (left). The policy is first
trained with RL on a conservative initial simulation parameter distribution. Afterwards, it is run
on the target scene to collect rollouts. These rollouts are then used to perform several SimOpt
iterations to optimize simulation parameters that best explain the current rollouts. We noticed that
the RL training can be sped up by initializing the policy with the weights from the previous SimOpt
iteration, effectively reducing the number of needed PPO iterations from 200 to 10 after the first
SimOpt iteration. The whole process is repeated until the learned policy starts to successfully open
Figure 7.4: Left: Performance of the policy training with domain randomization for different
variances of the distribution of the cabinet position along the X-axis in the drawer opening task.
Right: Initial distribution of the cabinet position in the source environment, located at extreme left,
slowly starts to change to the target environment distribution as a function of running 5 iterations
of SimOpt.
the drawer in the target scene. Overall, it took 3 iterations of RL and SimOpt to
learn to open the drawer when the cabinet was offset by 15cm. The number of
iterations increases to 5 as we increase the target cabinet distance to 22cm, highlighting that our
method is able to operate on a wider range of mismatch between the current scene and the target
scene. Fig. 7.4 (right) shows how the source distribution variance adapts to the target distribution
variance for this experiment, and Fig. 7.5 shows that our method starts with a conservative guess
of the initial distribution of the parameters and changes it using target scene rollouts until the
policy behavior in the target and source scenes starts to match.
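The alternation described above (sample simulation parameters, compare simulated and target rollouts, shift the parameter distribution) can be illustrated with a deliberately small toy. The sketch below is not the thesis implementation: the 1D "cabinet position" world, the functions `rollout` and `simopt_toy`, and the elite-sample refit used to update the Gaussian are all assumptions chosen for brevity, and the RL retraining step is omitted.

```python
import numpy as np

def rollout(cabinet_x, steps=20, gain=0.3):
    """Toy closed-loop rollout: a 1D end effector moves toward the cabinet."""
    pos, traj = 0.0, []
    for _ in range(steps):
        pos += gain * (cabinet_x - pos)   # feedback on the observed cabinet position
        traj.append(pos)
    return np.array(traj)

def simopt_toy(real_x=0.15, mean=0.0, std=0.02, iterations=20,
               n_samples=200, n_elite=20, min_std=0.01, seed=0):
    """Shift a 1D Gaussian over the cabinet position toward the target scene."""
    rng = np.random.default_rng(seed)
    real_traj = rollout(real_x)            # stands in for target-scene rollouts
    means = [mean]
    for _ in range(iterations):
        # RL training on N(mean, std^2) would happen here; omitted in this toy.
        params = rng.normal(mean, std, size=n_samples)
        costs = np.array([np.sum((rollout(p) - real_traj) ** 2) for p in params])
        elite = params[np.argsort(costs)[:n_elite]]   # best-matching samples
        mean = float(elite.mean())
        std = max(float(elite.std()), min_std)        # keep some exploration
        means.append(mean)
    return means

means = simopt_toy()
```

Starting from a narrow distribution centered at 0, the mean moves toward the 15cm offset over several updates rather than in a single jump, which mirrors the gradual shift shown in Fig. 7.4 (right) only qualitatively.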
7.4.4 Real Robot Experiments
In our real robot experiments, SimOpt is used to learn the simulation parameter distribution of the
manipulated objects and the robot. We run our experiments on 7-DoF Franka Panda and ABB
Yumi robots. The RL training and SimOpt simulation parameter sampling are performed using a
cluster of 64 GPUs for running the simulator with 150 simulated agents per GPU. In the real world,
we use object tracking with DART (Schmidt et al., 2014) to continuously track the 3D positions of
the peg in the swing-peg-in-hole task and the handle of the cabinet drawer in the drawer opening
task, as well as to initialize the positions of the peg box and the cabinet in simulation. DART
operates on depth images and requires 3D articulated models of the objects. We learn multivariate
Gaussian distributions of the simulation parameters parameterized by a mean and a full covariance
matrix, and perform several updates of the simulation parameter distribution per SimOpt iteration
using the same real world rollouts to minimize the number of real world trials.
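As a small illustration of sampling from a Gaussian with a full covariance matrix, the sketch below draws one parameter vector per simulated agent. The mean and covariance values and the function name `sample_sim_params` are made up for the example. Note that 64 GPUs with 150 agents each gives 64 × 150 = 9600 agents, which matches the 9600 simulation samples per update reported for the real robot tasks, although the text does not state that the two numbers are tied.

```python
import numpy as np

def sample_sim_params(mean, cov, n_samples, rng):
    """Draw parameter vectors from N(mean, cov) using a Cholesky factor."""
    chol = np.linalg.cholesky(cov)                    # cov = chol @ chol.T
    noise = rng.standard_normal((n_samples, mean.size))
    return mean + noise @ chol.T                      # correlated samples

rng = np.random.default_rng(0)
mean = np.array([0.5, 1.0, 0.1])                      # hypothetical parameter means
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.00],
                [0.00, 0.00, 0.01]])                  # hypothetical full covariance
n_agents = 64 * 150                                   # one sample per simulated agent
samples = sample_sim_params(mean, cov, n_agents, rng)
```

Off-diagonal covariance entries let the distribution express couplings between parameters, such as the compliance and damping correlation discussed for Fig. 7.7.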
Figure 7.5: Policy performance in the target drawer opening environment trained on randomized
simulation parameters at different iterations of SimOpt. As the source environment distribution
gets adjusted, the policy transfer improves until the robot can successfully solve the task in the
fourth SimOpt iteration.
Figure 7.6: Running policies trained in simulation at different iterations of SimOpt for real world
swing-peg-in-hole and drawer opening tasks. Left: SimOpt adjusts physical parameter distribution
of the soft rope, peg and the robot, which results in a successful execution of the task on a real
robot after two SimOpt iterations. Right: SimOpt adjusts physical parameter distribution of the
robot and the drawer. Before updating the parameters, the robot pushes too much on the drawer
handle with one of its fingers, which leads to opening the gripper. After one SimOpt iteration, the
robot can better control its gripper orientation, which leads to an accurate task execution.
Swing-peg-in-hole
Fig. 7.6 (left) demonstrates the behavior of the real robot execution of the policy trained in
simulation over 3 iterations of SimOpt. At each iteration, we perform 100 iterations of RL in
approximately 7 minutes and 3 rollouts on the real robot using the currently trained policy to collect
real-world observations. Then, we run 3 update steps of the simulation parameter distribution
with 9600 simulation samples per update. In the beginning, the robot misses the hole due to the
discrepancy between the simulation parameters and the real world. After a single SimOpt iteration,
the robot is able to get much closer to the hole, but it cannot yet insert the peg, as insertion requires
a slight angle to go into the hole, which is non-trivial to achieve using a soft rope. Finally, after
two SimOpt iterations, the policy trained on the resulting simulation parameter distribution is able
to swing the peg into the hole in 90% of 20 evaluation trials.
We observe that the most significant changes of the simulation parameter distribution occur in
the physical parameters of the rope that influence its dynamical behavior and the robot parameters
that influence the policy behavior, such as scaling of the policy actions. More details on the initial
and updated Gaussian distribution parameters can be found in Appendix A.3.1. Fig. 7.7 shows the
development of the covariance matrix over the iterations. We can observe some correlation in the
top left block of the matrix, which corresponds to the robot joint compliance and damping values.
This reflects the fact that these values have a somewhat opposite effect on the robot behavior, i.e. if
we overshoot in the compliance, we can compensate with increased damping.
Figure 7.7: Covariance matrix heat maps over 3 SimOpt updates of the swing-peg-in-hole task
beginning with the initial covariance matrix.
Drawer opening
For drawer opening, we learn a Gaussian distribution of the robot and cabinet simulation parameters.
More details on the learned distribution and its initialization are provided in Appendix A.3.1.
Fig. 7.6 (right) shows the drawer opening behavior before and after performing a SimOpt update.
During each SimOpt iteration, we run 200 iterations of RL for approximately 22 minutes, perform
3 real robot rollouts and 20 update steps of the simulation distribution using 9600 samples per
update step. Before updating the parameter distribution, the robot is able to reach the handle and
start opening the drawer. However, it cannot exactly replicate the learned behavior from simulation
and does not keep the gripper orthogonal to the drawer, which results in pushing too much on
the handle from the bottom with one of the robot fingers. As the finger gripping force is limited,
the fingers begin to open due to a larger pushing force. After adjusting the simulation parameter
distribution, which includes robot and drawer properties, the robot is able to better control its
gripper orientation and, when evaluated on 20 trials, opens the drawer every time while keeping the
gripper orthogonal to the handle.
7.4.5 Comparison to Trajectory-Based Parameter Learning
In our work, we run a closed-loop policy in simulation to obtain simulated rollouts for SimOpt
optimization. Alternatively, we could directly set the simulator to states and execute actions from
the real world trajectories as proposed in (Tan et al., 2016; S. Zhu et al., 2018). However, such a
setting is not always possible, as we might not be able to observe all required variables for setting
the internal state of the simulator at each time point, e.g. the current bending configuration of the
rope in the swing-peg-in-hole task, which we are able to initialize but cannot continually track
with our real world setup.
Without being able to set the simulator to the real world states continuously, we can still try
to copy the real world actions and execute them in an open-loop manner in simulation. However,
in our simulated experiments we notice that, especially when making particular state dimensions
unobservable for the SimOpt cost computation, such as the X-position of the cabinet in the drawer
opening task, executing a closed-loop policy still leads to meaningful simulation parameter updates
compared to the open-loop execution. We believe that in this case the robot behavior is still
dependent on the particular simulated scenario due to the closed-loop nature of the policy, which is
also reflected in the joint trajectories of the robot that are still included in the SimOpt cost function.
This means that by using a closed-loop policy we can still update the simulation parameter
distribution even without explicitly including some of the relevant observations in the SimOpt cost
computation.
7.5 Conclusions
Closing the simulation to reality transfer loop is an important component for a robust transfer of
robotic policies. In this work, we demonstrated that adapting simulation randomization using real
world data can help in learning simulation parameter distributions that are particularly suited for a
successful policy transfer without the need for exact replication of the real world environment. In
contrast to trying to learn policies using very wide distributions of simulation parameters, which
can simulate infeasible scenarios, we are able to start with distributions that can be efficiently
learned with reinforcement learning, and modify them for a better transfer to the real scenario.
Our framework does not require the full state of the real environment, and reward functions are only
needed in simulation. We showed that updating simulation distributions is possible using partial
observations of the real world while the full state still can be used for the reward computation in
simulation. We evaluated our approach on two real world robotic tasks and showed that policies
can be transferred with only a few iterations of simulation updates using a small number of real
robot trials.
In this work, we applied our method to learning unimodal simulation parameter distributions.
We plan to extend our framework to multimodal distributions and more complex generative
simulation models in future work. Furthermore, we plan to incorporate higher-dimensional sensor
modalities, such as vision and touch, for both policy observations and factors of simulation
randomization.
Chapter 8
Conclusions and Future Work
In this thesis, we have presented a number of approaches for efficient and robust data-driven
learning of closed-loop sensory-based robotic skills. In the first part, we developed several
reinforcement learning methods that can be employed on real robots and aim at reducing the
sample complexity while still maintaining the generality of the learned behaviors by decomposing
the training into smaller components. First, we showed how we can learn complex integrated
policies that can map from visual data to motor commands for contact-rich tasks with discontinuous
dynamics. Next, we presented an algorithm for combining fast but biased model-based optimization
with unbiased corrective model-free learning to significantly speed up the training of tasks with
mixed linear and nonlinear dynamics.
In the second part of the thesis, we explored the paradigm of self-supervision and external
sources of information such as pre-trained reward estimators and reward signals from different
domains. First, we addressed the problem of tactile-based grasping and proposed a method
for learning general grasp correction behaviors using a pre-trained grasp stability predictor and
reinforcement learning. After that, we presented a method that allows us to perform imitation
learning from third-person video demonstrations of a human by learning view-angle and
appearance-invariant representations of the human video data and robot camera images, and using
them to self-supervise the skill learning process.
The last part of this thesis explored imitation and transfer learning as the means to reduce
the amount of required real-world robot trials. First, we showed how we can learn multimodal
stochastic policies from unstructured and unlabelled demonstrations using the generative
adversarial framework. Finally, we presented a method for improving simulation to reality transfer
of robotic skills by adjusting the distribution of simulated scenarios such that the behavior of a
robot in simulation better matches its behavior in the real world.
8.1 Future Work
There are several avenues for future work that could extend the methods presented in this thesis.
Some of the most promising directions are based on discovering additional sources of training data
that can further reduce the robot interaction time and help to acquire a larger repertoire of skills. A
particular area of future interest is transfer learning. It is crucial to make use of data from external
domains, such as agents with different embodiment, robot simulations, videos etc. By transferring
knowledge from domains that have an easier access to data, we can build a strong prior of the robot
behaviors and models of the world, and consequently achieve faster learning times.
Safe exploration remains an issue for still often fragile robotic systems. In this work, we mostly
concentrated on robotic manipulation. In future work, we plan to also explore the applications
of learning methods to real-robot locomotion tasks. For safety-critical locomotion systems, it
is important to reuse the prior knowledge about the control and the system dynamics as much
as possible. Therefore, we plan to combine the classical locomotion pipelines with learning
components to improve their performance and robustness.
Finally, it is important for robotic skill learning to move from training single-task policies
to a continuous multi-task learning scenario. This imposes a range of challenges that should be
addressed in future work, such as reuse of previous knowledge, e.g. through meta-learning, skill
consolidation, and avoiding forgetting old experiences. Enabling robots to learn continuously is
crucial for scaling robot deployment in the real world and making robots a part of our daily lives.
Appendix A
Appendix
A.1 Appendix: Combining Model-Based and Model-Free Updates
for Trajectory-Centric Reinforcement Learning
A.1.1 Derivation of LQR-FLM
Given a TVLG dynamics model and quadratic cost approximation, we can approximate our Q and
value functions to second order with the following dynamic programming updates, which proceed
from the last time step $t = T$ to the first step $t = 1$:
$$
\begin{aligned}
Q_{x,t} &= c_{x,t} + f_{x,t}^\top V_{x,t+1}, & Q_{xx,t} &= c_{xx,t} + f_{x,t}^\top V_{xx,t+1} f_{x,t}, \\
Q_{u,t} &= c_{u,t} + f_{u,t}^\top V_{x,t+1}, & Q_{uu,t} &= c_{uu,t} + f_{u,t}^\top V_{xx,t+1} f_{u,t}, \\
Q_{xu,t} &= c_{xu,t} + f_{x,t}^\top V_{xx,t+1} f_{u,t}, \\
V_{x,t} &= Q_{x,t} - Q_{xu,t} Q_{uu,t}^{-1} Q_{u,t}, \\
V_{xx,t} &= Q_{xx,t} - Q_{xu,t} Q_{uu,t}^{-1} Q_{ux,t}.
\end{aligned}
$$
Here, similar to prior work, we use subscripts to denote derivatives. It can be shown (e.g., in (Tassa
et al., 2012)) that the action $u_t$ that minimizes the second-order approximation of the Q-function
at every time step $t$ is given by
$$
u_t = -Q_{uu,t}^{-1} Q_{ux,t} x_t - Q_{uu,t}^{-1} Q_{u,t}.
$$
This action is a linear function of the state $x_t$, thus we can construct an optimal linear policy by
setting $K_t = -Q_{uu,t}^{-1} Q_{ux,t}$ and $k_t = -Q_{uu,t}^{-1} Q_{u,t}$. We can also show that the
maximum-entropy policy that minimizes the approximate Q-function is given by
$$
p(u_t \mid x_t) = \mathcal{N}(K_t x_t + k_t,\; Q_{uu,t}).
$$
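The backward recursion and the resulting linear feedback gains can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the thesis code: the function name `lqr_backward`, the array layout (time as the leading axis) and the zero terminal value function are choices made here.

```python
import numpy as np

def lqr_backward(fx, fu, cx, cu, cxx, cuu, cxu):
    """Backward pass for linear dynamics (fx, fu) and a quadratic cost.

    Shapes: fx (T, n, n), fu (T, n, m), cx (T, n), cu (T, m),
            cxx (T, n, n), cuu (T, m, m), cxu (T, n, m).
    Returns gains K (T, m, n) and offsets k (T, m) with u_t = K_t x_t + k_t.
    """
    T, n, m = fu.shape
    Vx, Vxx = np.zeros(n), np.zeros((n, n))      # assumed zero value after step T
    K, k = np.zeros((T, m, n)), np.zeros((T, m))
    for t in reversed(range(T)):
        Qx = cx[t] + fx[t].T @ Vx
        Qu = cu[t] + fu[t].T @ Vx
        Qxx = cxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = cuu[t] + fu[t].T @ Vxx @ fu[t]
        Qxu = cxu[t] + fx[t].T @ Vxx @ fu[t]
        Qux = Qxu.T
        K[t] = -np.linalg.solve(Quu, Qux)        # K_t = -Quu^{-1} Qux
        k[t] = -np.linalg.solve(Quu, Qu)         # k_t = -Quu^{-1} Qu
        Vx = Qx - Qxu @ np.linalg.solve(Quu, Qu)
        Vxx = Qxx - Qxu @ np.linalg.solve(Quu, Qux)
    return K, k
```

For a scalar system with unit dynamics and unit quadratic costs, the first-step gain approaches the stationary Riccati solution as the horizon grows, which is a convenient sanity check for the recursion.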
This form is useful for LQR-FLM, as we use intermediate policies to generate samples to fit TVLG
dynamics. (Levine & Abbeel, 2014) impose a constraint on the total KL-divergence between the
old and new trajectory distributions induced by the policies through an augmented cost function
$\tilde{c}(x_t, u_t) = \frac{1}{\eta} c(x_t, u_t) - \log p^{(i-1)}(u_t \mid x_t)$, where solving for
$\eta$ via dual gradient descent can yield an exact solution to a KL-constrained LQR problem, where
there is a single constraint that operates at the level of trajectory distributions $p(\tau)$. We can
instead impose a separate KL-divergence constraint at each time step with the constrained optimization
$$
\begin{aligned}
\min_{u_t, \Sigma_t} \;\; & \mathbb{E}_{x \sim p(x_t),\, u \sim \mathcal{N}(u_t, \Sigma_t)} \left[ Q(x, u) \right] \\
\text{s.t.} \;\; & \mathbb{E}_{x \sim p(x_t)} \left[ D_{KL}\!\left( \mathcal{N}(u_t, \Sigma_t) \,\|\, p^{(i-1)} \right) \right] \le \epsilon_t .
\end{aligned}
$$
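The new policy will be centered around $u_t$ with covariance term $\Sigma_t$. Let the old policy be
parameterized by $\hat{K}_t$, $\hat{k}_t$, and $\hat{C}_t$. We form the Lagrangian (dividing by
$\eta_t$), approximate $Q$, and expand the KL-divergence term to get
$$
\begin{aligned}
\mathcal{L}(u_t, \Sigma_t, \eta_t) ={}& \frac{1}{\eta_t} \Big[ Q_{x,t}^\top x_t + Q_{u,t}^\top u_t
 + \tfrac{1}{2} x_t^\top Q_{xx,t} x_t + \tfrac{1}{2} \operatorname{tr}(Q_{xx,t} \Sigma_{x,t}) \\
& \quad + \tfrac{1}{2} u_t^\top Q_{uu,t} u_t + \tfrac{1}{2} \operatorname{tr}(Q_{uu,t} \Sigma_t)
 + x_t^\top Q_{xu,t} u_t \Big] \\
& + \tfrac{1}{2} \Big[ \log|\hat{C}_t| - \log|\Sigma_t| - d + \operatorname{tr}(\hat{C}_t^{-1} \Sigma_t) \\
& \quad + (\hat{K}_t x_t + \hat{k}_t - u_t)^\top \hat{C}_t^{-1} (\hat{K}_t x_t + \hat{k}_t - u_t)
 + \operatorname{tr}(\hat{K}_t^\top \hat{C}_t^{-1} \hat{K}_t \Sigma_{x,t}) \Big] - \epsilon_t .
\end{aligned}
$$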
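Now we set the derivative of $\mathcal{L}$ with respect to $\Sigma_t$ equal to 0 and get
$$
\Sigma_t = \left( \frac{1}{\eta_t} Q_{uu,t} + \hat{C}_t^{-1} \right)^{-1}.
$$
Setting the derivative with respect to $u_t$ equal to 0, we get
$$
u_t = -\Sigma_t \left( \frac{1}{\eta_t} Q_{u,t} + \frac{1}{\eta_t} Q_{ux,t} x_t
 - \hat{C}_t^{-1} (\hat{K}_t x_t + \hat{k}_t) \right).
$$
Thus our updated mean has the parameters
$$
\begin{aligned}
k_t &= -\Sigma_t \left( \frac{1}{\eta_t} Q_{u,t} - \hat{C}_t^{-1} \hat{k}_t \right), \\
K_t &= -\Sigma_t \left( \frac{1}{\eta_t} Q_{ux,t} - \hat{C}_t^{-1} \hat{K}_t \right).
\end{aligned}
$$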
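The per-time-step update derived above is a closed-form computation once the dual variable is fixed. The sketch below assumes a given value `eta` for that dual variable (the dual optimization itself is not shown) and uses the hypothetical function name `kl_step_update`; `K_old`, `k_old` and `C_old` denote the parameters of the old policy.

```python
import numpy as np

def kl_step_update(Quu, Qux, Qu, K_old, k_old, C_old, eta):
    """Closed-form policy update at one time step for a fixed dual variable eta."""
    C_inv = np.linalg.inv(C_old)
    Sigma = np.linalg.inv(Quu / eta + C_inv)       # new covariance
    K = -Sigma @ (Qux / eta - C_inv @ K_old)       # new feedback gain
    k = -Sigma @ (Qu / eta - C_inv @ k_old)        # new offset
    return K, k, Sigma
```

Two limits give a quick check of the formulas: a very large `eta` (a tight KL constraint) leaves the old policy essentially unchanged, while a very small `eta` recovers the unconstrained solution with gain $-Q_{uu,t}^{-1} Q_{ux,t}$.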
As discussed by (Tassa et al., 2012), when the updated $K_t$ and $k_t$ are not actually the optimal
solution for the current quadratic Q-function, the update to the value function is a bit more complex,
and is given by
$$
\begin{aligned}
V_{x,t} &= Q_{x,t}^\top + Q_{u,t}^\top K_t + k_t^\top Q_{uu,t} K_t + k_t^\top Q_{ux,t}, \\
V_{xx,t} &= Q_{xx,t} + K_t^\top Q_{uu,t} K_t + 2 Q_{xu,t} K_t .
\end{aligned}
$$
A.1.2 PI$^2$ Update through Constrained Optimization
The structure of the proof for the PI$^2$ update follows (Peters et al., 2010), applied to the
cost-to-go $S(x_t, u_t)$. Let us first consider the cost-to-go $S(x_t, u_t)$ of a single trajectory or
path $(x_t, u_t, x_{t+1}, u_{t+1}, \ldots, x_T, u_T)$ where $T$ is the maximum number of time steps.
We can rewrite the Lagrangian in a sample-based form as
$$
\mathcal{L}(p^{(i)}, \eta_t) = \sum \left( p^{(i)} S(x_t, u_t) \right)
 + \eta_t \left( \sum p^{(i)} \log \frac{p^{(i)}}{p^{(i-1)}} - \epsilon \right).
$$
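Taking the derivative of $\mathcal{L}(p^{(i)}, \eta_t)$ with respect to a single optimal policy
$p^{(i)}$ and setting it to zero results in
$$
\frac{\partial \mathcal{L}}{\partial p^{(i)}}
 = S(x_t, u_t) + \eta_t \left( \log \frac{p^{(i)}}{p^{(i-1)}}
 + p^{(i)} \frac{p^{(i-1)}}{p^{(i)}} \frac{1}{p^{(i-1)}} \right)
 = S(x_t, u_t) + \eta_t \log \frac{p^{(i)}}{p^{(i-1)}} = 0.
$$
Solve the derivative for $p^{(i)}$ by exponentiating both sides
$$
\begin{aligned}
\log \frac{p^{(i)}}{p^{(i-1)}} &= -\frac{1}{\eta_t} S(x_t, u_t), \\
p^{(i)} &= p^{(i-1)} \exp\left( -\frac{1}{\eta_t} S(x_t, u_t) \right).
\end{aligned}
$$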
This gives us a probability update rule for a single sample that only considers the cost-to-go of one
path. However, when sampling from a stochastic policy $p^{(i-1)}(u_t \mid x_t)$, there are multiple
paths that start in state $x_t$ with action $u_t$ and continue with a noisy policy afterwards. Hence,
the updated policy $p^{(i)}(u_t \mid x_t)$ will incorporate all of these paths as
$$
p^{(i)}(u_t \mid x_t) \propto p^{(i-1)}(u_t \mid x_t)\,
 \mathbb{E}_{p^{(i-1)}} \left[ \exp\left( -\frac{1}{\eta_t} S(x_t, u_t) \right) \right].
$$
The updated policy is additionally subject to normalization, which corresponds to computing the
normalized probabilities in Eq. (3.2).
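In a sample-based setting, the exponentiated cost-to-go followed by normalization amounts to a softmax over negative costs. The sketch below assumes that the samples were drawn from the previous policy, so that the prior factor is represented by the sampling itself; the function name `pi2_probabilities` and the temperature argument `eta` are illustrative, and Eq. (3.2) is not reproduced here.

```python
import numpy as np

def pi2_probabilities(costs_to_go, eta):
    """Normalized weights proportional to exp(-S / eta) over sampled paths."""
    s = np.asarray(costs_to_go, dtype=float)
    # Subtracting the minimum cost avoids overflow and cancels in the ratio.
    weights = np.exp(-(s - s.min()) / eta)
    return weights / weights.sum()

probs = pi2_probabilities([1.0, 2.0, 3.0], eta=1.0)
```

Lower-cost paths receive larger weights, and a smaller `eta` concentrates the weight on the best samples.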
A.1.3 Detailed Experimental Setup
Simulation Experiments
All of our cost functions use the following generic loss term on a vector $z$
$$
\ell(z) = \frac{1}{2} \alpha \|z\|_2^2 + \beta \sqrt{\gamma + \|z\|_2^2}. \qquad \text{(A.1)}
$$
$\alpha$ and $\beta$ are hyperparameters that weight the squared $\ell_2$ loss and Huber-style loss,
respectively, and we set $\gamma = 10^{-5}$.
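The generic loss term of Eq. A.1 is straightforward to compute. In the sketch below the two weights are written as `alpha` and `beta` and the smoothing constant as `gamma`; these argument names and the function name `generic_loss` are chosen here for illustration.

```python
import numpy as np

def generic_loss(z, alpha, beta, gamma=1e-5):
    """0.5 * alpha * ||z||^2 + beta * sqrt(gamma + ||z||^2)."""
    z = np.asarray(z, dtype=float)
    sq_norm = float(np.dot(z, z))
    return 0.5 * alpha * sq_norm + beta * np.sqrt(gamma + sq_norm)
```

With `beta = 0` the term is a pure squared penalty, while with `alpha = 0` it behaves like a smoothed norm that stays differentiable at `z = 0` because of `gamma`.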
On the gripper pusher task, we have three such terms. The first sets $z$ as the vector difference
between the block and goal positions with $\alpha = 10$ and $\beta = 0.1$. $z$ for the second
measures the vector difference between the gripper and block positions, again with $\alpha = 10$
and $\beta = 0.1$, and the last loss term penalizes the magnitude of the fourth robot joint angle with
$\alpha = 10$ and $\beta = 0$. We include this last term because, while the gripper moves in 3D, the
block is constrained to a 2D plane and we thus want to encourage the gripper to also stay in this
plane. These loss terms are weighted by 4, 1, and 1 respectively.
On the reacher task, our only loss term uses as $z$ the vector difference between the arm end
effector and the target, with $\alpha = 0$ and $\beta = 1$. For both the reacher and door opening
tasks, we also include a small torque penalty term that penalizes unnecessary actuation and is
typically several orders of magnitude smaller than the other loss terms.
On the door opening task, we use two loss terms. For the first, $z$ measures the difference
between the angle in radians of the door hinge and the desired angle of $-1.0$, with $\alpha = 1$
and $\beta = 0$. The second term is time-varying: for the first 25 time steps, $z$ is the vector
difference between the bottom of the robot end effector and a position above the door handle, and
for the remaining time steps, $z$ is the vector difference from the end effector to inside the handle.
This encourages the policy to first navigate to a position above the handle, and then reach down
with the hook to grasp the handle. Because we want to emphasize the second loss during the
beginning of the trajectory and gradually switch to the first loss, we do a time-varying weighting
between the loss terms. The weight of the second loss term is fixed to 1, but the weight of the first
loss term at time step $t$ is $5 \left( \frac{t}{T} \right)^2$.
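The time-varying weighting can be written out explicitly. The sketch below reads the weight of the first (hinge angle) term as 5 (t/T)^2 and keeps the weight of the second (reaching) term at 1; the function names and the scalar loss inputs are hypothetical.

```python
def first_term_weight(t, T):
    """Weight of the hinge-angle loss at time step t, read as 5 * (t / T)**2."""
    return 5.0 * (t / T) ** 2

def door_cost(t, T, hinge_loss, reach_loss):
    """Weighted sum of the two door opening loss terms at time step t."""
    return first_term_weight(t, T) * hinge_loss + 1.0 * reach_loss
```

Early in the trajectory the reaching term dominates, and the hinge-angle term grows quadratically until it carries five times the weight at the final time step.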
For the neural network policy architectures, we use two fully connected hidden layers of
rectified linear units (ReLUs) with no output nonlinearity. On the reacher task, the hidden layer
size is 32 units per layer, and on the door opening task, the hidden layer size is 100 units per layer.
All of the tasks involve varying conditions for which we train one TVLG policy per condition
and, for reacher and door opening, train a neural network policy to generalize across all conditions.
For gripper pusher, the conditions vary the starting positions of the block and the goal, which can
have a drastic effect on the difficulty of the task. Figure A.1 illustrates the four initial conditions
of the gripper pusher task for which we train TVLG policies. For reacher, analogous to OpenAI
Gym, we vary the initial arm configuration and position of the target and train TVLG policies
from 16 randomly chosen conditions. Note that, while OpenAI Gym randomizes this initialization
per episode, we always reset to the same condition when training TVLG policies as this is an
additional assumption we impose. However, when we test the performance of the neural network
Figure A.1: The initial conditions for the gripper pusher task that we train TVLG policies on. The
top left and bottom right conditions are more difficult due to the distance from the block to the
goal and the configuration of the arm. The top left condition results are reported in Section 3.6.1.
policy, we collect 300 test episodes with random initial conditions. For the door opening task, we
initialize the robot position within a small square in the ground plane. We train TVLG policies
from the four corners of this square and test our neural network policies with 100 test episodes
from random positions within the square.
For the gripper pusher and door opening tasks, we train TVLG policies with PILQR, LQR-FLM
and PI$^2$ with 20 episodes per iteration per condition for 20 iterations. In Appendix A.1.4, we
also test PI$^2$ with 200 episodes per iteration. For the reacher task, we use 3 episodes per iteration
per condition for 10 iterations. Note that we do not collect any additional samples for training
neural network policies. For the prior methods, we train DDPG with 150 episodes per epoch for
80 epochs on the reacher task, and TRPO uses 600 episodes per iteration for 120 iterations. On
door opening, TRPO uses 400 episodes per iteration for 80 iterations and DDPG uses 160 episodes
per epoch for 100 epochs, though note that DDPG is ultimately not successful.
Real Robot Experiments
For the real robot tasks we use a hybrid cost function that includes two loss terms of the form of
Eq. A.1. The first loss term $\ell_{arm}(z)$ computes the difference between the current position of
the robot’s end-effector and the position of the end-effector when the hockey stick is located just in
front of the puck. We set $\alpha = 0.1$ and $\beta = 0.0001$ for this cost function. The second loss
term $\ell_{goal}(z)$ is based on the distance between the puck and the goal that we estimate using a
motion capture system. We set $\alpha = 0.0$ and $\beta = 1.0$. Both $\ell_{arm}$ and $\ell_{goal}$
have a linear ramp, which makes the cost increase towards the end of the trajectory. In addition, we
include a small torque cost term $\ell_{torque}$ to penalize unnecessarily high torques. The combined
function sums over all the cost
[Figure A.2 plots: average final distance versus number of samples (0 to 1600) for PI$^2$, LQR-FLM and PILQR.]
Figure A.2: Top-left, top-right, bottom-left: single condition comparisons of the gripper pusher
task in three additional conditions, which correspond to the top-right, bottom-right, and bottom-left
conditions depicted in Figure A.1, respectively. The PILQR method outperforms other baselines
in two out of the three conditions. The conditions presented in the top and middle figure are
significantly easier than the other conditions presented in the work. Bottom right: Additional
results on the door opening task.
terms: $\ell_{total} = 100.0\,\ell_{goal} + \ell_{arm} + \ell_{torque}$. We give a substantially higher
weight to the cost on the distance to the goal to achieve a higher precision of the task execution.
Our neural network policy includes two fully connected hidden layers of rectified linear units
(ReLUs). Each of the hidden layers consists of 42 neurons. The inputs of the policy include: puck
and goal positions measured with a motion capture system, robot joint angles, joint velocities, the
end-effector pose and the end-effector velocity. During PILQR-MDGPS training, we use data
augmentation to regularize the neural network. In particular, the observations were augmented
with Gaussian noise to mitigate overfitting to the noisy sensor data.
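Augmenting observations with Gaussian noise can be sketched as follows. The noise scale is not given in the text, so `noise_std` is a placeholder, and the function name `augment_observations` is chosen here for illustration.

```python
import numpy as np

def augment_observations(obs, noise_std, rng):
    """Return a copy of the observation batch with additive Gaussian noise."""
    obs = np.asarray(obs, dtype=float)
    return obs + rng.normal(0.0, noise_std, size=obs.shape)

rng = np.random.default_rng(0)
batch = np.zeros((4, 6))                 # hypothetical batch of 6D observations
noisy = augment_observations(batch, noise_std=0.01, rng=rng)
```

Each training pass can draw fresh noise, so the network sees slightly different inputs for the same target action.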
A.1.4 Additional Simulation Results
Figure A.2 shows additional simulation results obtained for the gripper pusher task for the three
additional initial conditions. The instances presented here are not as challenging as the one reported
in Section 3.6.1. Our method (PILQR) is able to outperform other baselines except for the first two
conditions presented in the first rows of Figure A.2, where LQR-FLM performs equally well due
to the simplicity of the task. PI$^2$ is not able to make progress with the same number of samples;
however, its performance on each condition is comparable to LQR-FLM when provided with 10
times more samples.
We also test PI$^2$ with 10 times more samples on the reacher and door opening tasks. On the
reacher task, PI$^2$ improves substantially with more samples, though it is still worse than the four
other methods. However, as Figure A.2 (bottom-right) shows, PI$^2$ is unable to succeed on the
door opening task even with 10 times more samples. The performance of PI$^2$ is likely to continue
increasing as we provide even more samples.
A.2 Appendix: Time-Contrastive Networks: Self-Supervised Learning from Video
A.2.1 Objects Interaction Analysis
Here, we visualize the embeddings from the ImageNet-Inception and multi-view TCN models with
t-SNE using a coloring by ground-truth labels. Each color is a unique combination of the 5 attribute
values defined earlier, i.e. if each color is well separated, the model can uniquely identify all
possible combinations of our 5 attributes. Indeed, we observe in Fig. A.3 some amount of color
separation for the TCN but not for the baseline.
Figure A.3: tSNE colored by attribute combinations: TCN (left) does a better job than
ImageNetInception (right) at separating combinations of attributes.
[Figure A.4 panels: reference; nearest neighbors for ImageNet-Inception, Shuffle & Learn and multi-view TCN; 1st-person and 3rd-person views]
Figure A.4: Label-free pouring imitation: nearest neighbors (right) for each reference image (left)
for different models (multi-view TCN, Shuffle & Learn and ImageNet-Inception). These pouring
test images show that the TCN model can distinguish different hand poses and amounts
of poured liquid simply from unsupervised observation while being invariant to viewpoint,
background, objects and subjects, motion blur and scale.
A.2.2 Pose Imitation Analysis
Pose Imitation Data
The human training data consists of sequences distinguished by human subject and clothing
pair. Each sequence is approximately 4 minutes long. For the label-free TC supervision we collected
approximately 30 human pairs (about 2 hours) where humans imitate a robot but the joint labels
are not recorded, along with 50 robot sequences with random motion (about 3 hours, trivial to
collect). For human supervision, we collected 10 human/clothing pairs (about 40 minutes, very
expensive collection) while also recording the joint labels. Each recorded sequence is captured
by 3 smartphone cameras fixed on tripods at specific angles (0°, 60° and 120°) and distance. The
validation and testing sets each consist of 6 human/clothing pairs not seen during training (about
24 minutes, very expensive collection).
The distance error is normalized by the full value range of each joint, resulting in a percentage
error. Note that the Human Supervision signal is quite noisy, since the imitation task is subjective
and different human subjects interpret the mapping differently. In fact, a perfect imitation is not
possible due to physiological differences between human bodies and the Fetch robot. Therefore,
the best comparison metric available to us is to see whether the joint angles predicted from a
held-out human observation match the actual joint angles that the human was attempting to imitate.
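The normalization described above can be sketched as follows. The function name, the use of the per-joint absolute distance, the averaging over joints and the joint ranges in the example are assumptions for illustration only.

```python
def normalized_joint_error(predicted, target, joint_ranges):
    """Per-joint distance divided by the joint's full value range, in percent,
    averaged over all joints (the averaging is an assumption)."""
    errors = []
    for p, t, (low, high) in zip(predicted, target, joint_ranges):
        errors.append(100.0 * abs(p - t) / (high - low))
    return sum(errors) / len(errors)


# Hypothetical ranges: a prismatic torso joint (meters) and a revolute joint (radians).
example_ranges = [(0.0, 0.4), (-1.6, 1.6)]
example_error = normalized_joint_error([0.1, 0.0], [0.2, 0.8], example_ranges)
```

Dividing by the range makes errors comparable between joints with very different units and travel, such as a prismatic torso joint and a revolute shoulder joint.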
[Figure A.5 panels: reference frame; TCN nearest neighbors]
Figure A.5: Label-free pose imitation: nearest neighbors (right) for each reference frame (left)
for each row. Although only trained with self-supervision (no human labels), the multi-view TCN
can understand correspondences between humans and robots for poses such as crouching, reaching
up and others while being invariant to viewpoint, background, subjects and scale.
Models
We train our model using the 3 different signals as described in Fig. 5.3. The model consists of a
TCN as described in Sec. 5.4, to which we add a joint decoder network (2 fully-connected layers
above the TC embedding: → 128 → 8, producing 8 joint values). We train the joint decoder with
L2 regression using the self-supervision or human supervision signals. The model can be trained
with different combinations of signals; we study the effects of each combination in Fig. 5.6 (right).
The datasets used are approximately 2 hours of random human motions, 3 hours of random robot
motions and 40 minutes of human supervision, as detailed in Appendix A.2.2. At test time, the
resulting joints vector can then directly be fed to the robot stack to update its joints (using the
native Fetch planner) as depicted in Fig. 5.3 (right). This results in an end-to-end imitation from
pixels to joints without any explicit representation of human pose.
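A minimal sketch of the joint decoder and its L2 regression loss is given below. The embedding size, the ReLU activation in the 128-unit layer and the weight initialization are assumptions; only the 128 → 8 layer sizes and the L2 objective come from the description above.

```python
import random

EMBED_DIM = 32    # hypothetical size of the TC embedding
HIDDEN_DIM = 128  # first decoder layer, as stated in the text
NUM_JOINTS = 8    # second decoder layer, one output per robot joint


def make_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def apply_layer(x, layer, relu=False):
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def decode_joints(embedding, hidden_layer, output_layer):
    """Map an embedding to 8 joint values through a 128-unit hidden layer."""
    hidden = apply_layer(embedding, hidden_layer, relu=True)  # ReLU is an assumption
    return apply_layer(hidden, output_layer)


def l2_loss(predicted, target):
    """Squared Euclidean distance between predicted and target joint vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


rng = random.Random(0)
hidden_layer = make_layer(EMBED_DIM, HIDDEN_DIM, rng)
output_layer = make_layer(HIDDEN_DIM, NUM_JOINTS, rng)
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
target_joints = [0.0] * NUM_JOINTS
predicted_joints = decode_joints(embedding, hidden_layer, output_layer)
loss = l2_loss(predicted_joints, target_joints)
```

With self-supervision the target joints would come from the robot's own recorded joint states, and with human supervision from the labeled human/robot pairs.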
Supervision Analysis
As shown in Fig. 5.3 (right), our imitation system can be trained with different combinations of
signals. Here we study how our self-supervised imitation system compares to the other possible
combinations of training signals. The performance of each combination is reported in Table 5.4
using the maximum amounts of data available (10 sequences for Human supervision and 30
sequences for TC supervision), while Fig. 5.6 (right) varies the amount of human supervision.
Models such as "TC+Self" or "Self" do not make use of any human supervision and hence
appear only as single points on the vertical axis. Models that do not include TC supervision
are simply trained as end-to-end regression problems. For example, the "Self" model is trained
end-to-end to predict internal joints from third-person observations of the robot, and then that model
is applied directly to the human imitation task. For reference, we compute a random baseline which
samples joint values within physically possible ranges. In general, we observe that more human
supervision decreases the L2 robot joints error. It is interesting to note that while not given any
labels, the self-supervised model ("TC+Self") still significantly outperforms the fully-supervised
model ("Human"). The combination of all supervision signals performs the best. Overall, we
observe that adding the TC supervision to any other signal significantly decreases the imitation
error. In Fig. A.6, we vary the amount of TC supervision provided and find that the imitation error
keeps decreasing as we increase the amount of data. Based on these results, we can make the
argument that combining relatively large amounts of cheap weakly-supervised data with small
amounts of expensive human-supervised data is an effective balance for our problem. A non-extensive
analysis of viewpoint and scale invariance in Sec. A.2.4 seems to indicate that the model remains relatively
competitive when presented with viewpoints and scales not seen during training.
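The random baseline mentioned above can be sketched in a few lines. The joint limits used here are placeholders and not the actual limits of the Fetch robot.

```python
import random


def sample_random_joints(joint_limits, rng):
    """Draw each joint value uniformly from its physically possible range."""
    return [rng.uniform(low, high) for low, high in joint_limits]


# Hypothetical limits for 8 joints: one prismatic torso joint and 7 arm joints.
JOINT_LIMITS = [(0.0, 0.4)] + [(-1.5, 1.5)] * 7
baseline_rng = random.Random(0)
baseline_prediction = sample_random_joints(JOINT_LIMITS, baseline_rng)
```

Such a prediction ignores the input image entirely, so its error gives a reference level that any learned model should beat.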
Analysis by Joint
In Fig. A.7, we examine the error of each joint individually for 4 models. Interestingly, we find
that for all joints except "shoulder pan", the unsupervised "TC+Self" model performs almost
as well as the human-supervised "TC+Human+Self". The unsupervised model does not seem to
correctly model the shoulder pan and performs worse than the Random baseline on this joint. Hence most of the benefits of
human supervision found in Fig. 5.6 (right) come from correcting the shoulder pan prediction.
Qualitative Results
We offer multiple qualitative evaluations: k-Nearest Neighbors (kNN) in Fig. A.5, imitation strips
in Fig. A.12 and a t-SNE visualization in Fig. 5.8. Video strips do not fully convey the quality
of imitation; we strongly encourage readers to watch the videos accompanying this work. kNN:
In Fig. A.5, we show the nearest neighbors of the reference frames for the self-supervised model
"TC+Self" (no human supervision). Although never trained across humans, it learned to associate
poses such as crouching or reaching up between humans never seen during training and with
entirely new backgrounds, while exhibiting viewpoint, scale and translation invariance. Imitation
strips: In Fig. A.12, we present an example of how the self-supervised model has learned to
imitate the height level of humans by itself (easier to see in the supplementary videos) using the "torso"
joint (see Fig. A.7). This stark example of the complex mapping between human and robot joints
illustrates the need for learned mappings: here we learned a nonlinear mapping from many human
joints to a single "torso" robot joint without any human supervision. t-SNE: We qualitatively
evaluate the arrangement of our learned embedding using t-SNE representations with a perplexity of
30 and a learning rate of 10. In Fig. 5.8, we show that the agent-colored embedding exhibits local
coherence with respect to pose while being invariant to agent and viewpoint. More kNN examples,
imitation strips and t-SNE visualizations from different models are available in Sec. A.2.3.
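The nearest-neighbor retrieval used for these qualitative evaluations can be sketched as a lookup in embedding space. The Euclidean metric and the toy 2-D embeddings below are assumptions for illustration; real TC embeddings have more dimensions.

```python
import math


def nearest_neighbors(query, embeddings, k=1):
    """Return the indices of the k embeddings closest to the query (Euclidean distance)."""
    distances = [(math.dist(query, e), i) for i, e in enumerate(embeddings)]
    return [i for _, i in sorted(distances)[:k]]


# Toy example: three candidate frame embeddings and one reference frame embedding.
candidate_embeddings = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
reference_embedding = [0.9, 1.2]
closest = nearest_neighbors(reference_embedding, candidate_embeddings, k=2)
```

If the embedding is invariant to agent and viewpoint, the retrieved frames should show the same pose as the reference frame even when they come from a different subject or camera.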
Figure A.6: Varying the amount of unsupervised data: increasing the number of unsupervised
sequences decreases the imitation error for both models.
Figure A.7: L2 robot error breakdown by robot joint. From left to right, we report errors
for the 8 joints of the Fetch robot, followed by the joints average, followed by the joints average
excluding the "shoulder pan" joint.
A.2.3 Imitation Examples
We qualitatively evaluate the arrangement of our learned embedding using t-SNE representations
with a perplexity of 30 and a learning rate of 10. In this section we show the embedding before and
after training, and colorize points by agent in Fig. A.9 and by view in Fig. A.8. The representations
show that the embedding initially clusters views and agents together, while after training points
from the same agent or view spread over the entire manifold, indicating view and agent invariance.
Figure A.8: t-SNE embedding before (left) and after (right) training, colored by view. Before
training, we observe concentrated clusters of the same color, indicating that the manifold is
organized in a highly view-specific way, while after training each color is spread over the entire
manifold.
Figure A.9: t-SNE embedding before (left) and after (right) training, colored by agent. Before
training, we observe concentrated clusters of the same color, indicating that the manifold is
organized in a highly agent-specific way, while after training each color is spread over the entire
manifold.
A.2.4 Imitation Invariance Analysis
Figure A.10: Testing TC+Human+Self model for orientation invariance: while the error
increases for viewpoints not seen during training (30°, 90° and 150°), it remains competitive.
Figure A.11: Testing for scale invariance: while the error increases when decreasing the distance
of the camera to the subject (about half way compared to training), it remains competitive and
lower than the human-supervised baseline.
In this section, we explore how much invariance is captured by the model. In Fig. A.10, we test
the L2 imitation error from new viewpoints (30°, 90° and 150°) different from the training viewpoints
(0°, 60° and 120°). We find that the error increases, but the model does not break down and keeps
a lower error than the Human model in Fig. 5.6 (right). We also evaluate in Fig. A.11 the
error while bringing the camera closer than during training (about half way) and similarly
find that while the error increases, it remains competitive and lower than the human supervision
baseline. From these experiments, we conclude that the model is somewhat robust to viewpoint
changes (distance and orientation) even though it was trained with only 3 fixed viewpoints.
Figure A.12: Self-supervised imitation examples. Although not trained using any human
supervision (model "TC+Self"), the TCN is able to approximately imitate human subjects unseen
during training. Note from rows (1,2) that the TCN discovered the mapping between the robot's
torso joint (up/down) and the complex set of human joints commanding crouching. In rows (3,4),
we change the capture conditions compared to training (see rows 1 and 2) by using a free-form
camera motion, a close-up scale and introducing some motion blur, and observe that imitation is
still reasonable.
A.3 Appendix: Closing the Sim-to-Real Loop: Adapting Simulation
Randomization with Real World Experience
A.3.1 Simulation Parameters
Tables A.1 and A.2 show the initial mean, diagonal values of the initial covariance matrix and
the final mean of the Gaussian simulation parameter distributions that have been optimized with
SimOpt in the drawer opening (Table A.1) and swing-peg-in-hole (Table A.2) tasks.
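To make the role of these distributions concrete, the sketch below samples simulation parameters from a Gaussian with diagonal covariance, using three scalar entries from the initial-mean and initial-covariance columns of Table A.1. The parameter names and the sampling helper are illustrative assumptions, and no clipping to physically valid values is applied here.

```python
import random

# (initial mean, initial covariance diagonal) taken from Table A.1
INIT_DISTRIBUTION = {
    "gripper_damping": (0.0, 0.5),
    "drawer_joint_damping": (2.0, 0.5),
    "drawer_handle_friction": (0.001, 0.5),
}


def sample_sim_params(distribution, rng):
    """Draw one parameter set; the standard deviation is the square root of the variance."""
    return {
        name: rng.gauss(mean, variance ** 0.5)
        for name, (mean, variance) in distribution.items()
    }


sim_rng = random.Random(0)
samples = [sample_sim_params(INIT_DISTRIBUTION, sim_rng) for _ in range(2000)]
mean_damping = sum(s["drawer_joint_damping"] for s in samples) / len(samples)
```

Each sampled parameter set would configure one simulated environment; updating the distribution then shifts the means from the initial column towards the final column of the tables.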
A.3.2 SimOpt Parameters
Tables A.3 and A.4 show the SimOpt distribution update parameters for the drawer opening and
swing-peg-in-hole tasks, including REPS (Peters et al., 2010) parameters, settings of the discrepancy
function $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$, weights of each observation dimension in the discrepancy function, and
reinforcement learning settings such as parallelized PPO (Schulman et al., 2017; Liang et al., 2018)
training parameters and task reward weights.
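One possible form of such a discrepancy function is sketched below: a weighted sum of L1 and squared L2 distances between Gaussian-smoothed, dimension-weighted observation trajectories. The default values follow the discrepancy function parameters listed in Tables A.3 and A.4, but the exact way the terms are combined, the interpretation of the truncation as a kernel half-width in timesteps and the edge handling are assumptions.

```python
import math


def gaussian_kernel(std, radius):
    """Normalized Gaussian kernel truncated at +/- radius timesteps."""
    weights = [math.exp(-0.5 * (k / std) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(series, std, radius):
    """Smooth a 1-D series with the truncated kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(std, radius)
    last = len(series) - 1
    offsets = range(-radius, radius + 1)
    return [
        sum(kw * series[min(max(t + k, 0), last)] for kw, k in zip(kernel, offsets))
        for t in range(len(series))
    ]


def discrepancy(sim_traj, real_traj, dim_weights, w_l1=0.5, w_l2=1.0, std=5.0, radius=4):
    """Weighted L1 plus squared L2 distance between two observation trajectories.

    Trajectories are lists of timesteps, each a list of observation dimensions.
    """
    total_l1, total_l2 = 0.0, 0.0
    for d, weight in enumerate(dim_weights):
        diff = [weight * (s[d] - r[d]) for s, r in zip(sim_traj, real_traj)]
        for value in smooth(diff, std, radius):
            total_l1 += abs(value)
            total_l2 += value * value
    return w_l1 * total_l1 + w_l2 * total_l2


# Toy check: a constant unit offset over 10 timesteps in a single dimension.
offset_cost = discrepancy([[1.0]] * 10, [[0.0]] * 10, dim_weights=[1.0])
```

The per-dimension weights let position observations dominate joint-angle observations, as in the observation dimension cost weights of the tables.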
Parameter                    μ_init            diag(Σ_init)   μ_final
Robot properties
  Joint compliance (7D)      [6.0 ... 6.0]     0.5            [6.5 ... 6.1]
  Joint damping (7D)         [3.0 ... 3.0]     0.5            [2.4 ... 2.7]
  Gripper compliance         11.0              0.5            10.9
  Gripper damping            0.0               0.5            0.34
  Joint action scaling (7D)  [0.26 ... 0.26]   0.01           [0.19 ... 0.35]
Cabinet properties
  Drawer joint compliance    7.0               1.0            8.3
  Drawer joint damping       2.0               0.5            0.81
  Drawer handle friction     0.001             0.5            2.13
Table A.1: Drawer opening: simulation parameter distribution.
Parameter                    μ_init            diag(Σ_init)   μ_final
Robot properties
  Joint compliance (7D)      [8.0 ... 8.0]     1.0            [8.2 ... 7.8]
  Joint damping (7D)         [3.0 ... 3.0]     1.0            [3.0 ... 2.6]
  Joint action scaling (7D)  [0.5 ... 0.5]     0.02           [0.25 ... 0.44]
Rope properties
  Rope torsion compliance    2.0               0.07           1.89
  Rope torsion damping       0.1               0.07           0.48
  Rope bending compliance    10.0              0.5            9.97
  Rope bending damping       0.01              0.05           0.49
  Rope segment width         0.004             2e-4           0.007
  Rope segment length        0.016             0.004          0.017
  Rope segment friction      0.25              0.03           0.29
  Rope density               2500.0            8.0            2500.12
Peg properties
  Peg scale                  0.33              0.01           0.30
  Peg friction               1.0               0.06           1.0
  Peg mass coefficient       1.0               0.06           1.06
  Peg density                400.0             10.0           400.07
Peg box properties
  Peg box scale              0.029             0.01           0.034
  Peg box friction           1.0               0.2            1.01
Table A.2: Swing-peg-in-hole: simulation parameter distribution.
Simulation distribution update parameters
  Number of REPS updates per SimOpt iteration              20
  Number of simulation parameter samples per update        9600
  Timesteps per simulation parameter sample                453
  KL-threshold                                             1.0
  Minimum temperature of sample weights                    0.001
Discrepancy function parameters
  L1-cost weight                                           0.5
  L2-cost weight                                           1.0
  Gaussian smoothing standard deviation (timesteps)        5
  Gaussian smoothing truncation (timesteps)                4
Observation dimensions cost weights
  Joint angles (7D)                                        0.5
  Drawer position (3D)                                     1.0
PPO parameters
  Number of agents                                         400
  Episode length                                           150
  Timesteps per batch                                      151
  Clip parameter                                           0.2
  γ                                                        0.99
  λ                                                        0.95
  Entropy coefficient                                      0.0
  Optimization epochs                                      5
  Optimization batch size per agent                        8
  Optimization step size                                   5e-4
  Desired KL-step                                          0.01
RL reward weights
  L2-distance between end-effector and drawer handle       0.5
  Angular alignment of end-effector with drawer handle     0.07
  Opening distance of the drawer                           0.4
  Keeping fingers around the drawer handle bonus           0.005
  Action penalty                                           0.005
Table A.3: Drawer opening: SimOpt parameters.
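To show how the clip parameter listed among the PPO parameters enters training, here is a minimal sketch of the standard PPO clipped surrogate objective (Schulman et al., 2017). It is a generic illustration with made-up ratios and advantages, not the parallelized implementation used for these experiments.

```python
def ppo_clip_objective(ratios, advantages, clip=0.2):
    """Average clipped surrogate objective over a batch (to be maximized).

    ratios are new-policy / old-policy action probabilities,
    advantages are the corresponding advantage estimates.
    """
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# Toy batch: one action made more likely with positive advantage,
# one made less likely with negative advantage.
objective = ppo_clip_objective([1.5, 0.5], [1.0, -1.0])
```

With a clip value of 0.2, probability ratios outside the interval [0.8, 1.2] stop contributing additional gain, which limits how far a single update can move the policy.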
Simulation distribution update parameters
  Number of REPS updates per SimOpt iteration              3
  Number of simulation parameter samples per update        9600
  Timesteps per simulation parameter sample                453
  KL-threshold                                             1.0
  Minimum temperature of sample weights                    0.001
Discrepancy function parameters
  L1-cost weight                                           0.5
  L2-cost weight                                           1.0
  Gaussian smoothing standard deviation (timesteps)        5
  Gaussian smoothing truncation (timesteps)                4
Observation dimensions cost weights
  Joint angles (7D)                                        0.05
  Peg position (3D)                                        1.0
  Peg position in the previous timestep (3D)               1.0
PPO parameters
  Number of agents                                         100
  Episode length                                           150
  Timesteps per batch                                      64
  Clip parameter                                           0.2
  γ                                                        0.99
  λ                                                        0.95
  Entropy coefficient                                      0.0
  Optimization epochs                                      10
  Optimization batch size per agent                        8
  Optimization step size                                   5e-4
  Desired KL-step                                          0.01
RL reward weights
  L1-distance between the peg and the hole                 10.0
  L2-distance between the peg and the hole                 4.0
  Task solved (peg completely in the hole) bonus           0.1
  Action penalty                                           0.7
Table A.4: Swing-peg-in-hole: SimOpt parameters.
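The KL threshold and the minimum temperature of the sample weights can be related through a REPS-style weighting of the sampled simulation parameters. The sketch below is a simplification and an assumption: it replaces the usual REPS dual optimization with a bisection on the temperature, so that the KL divergence between the weighted and the uniform sample distribution stays below the threshold. The function names and the toy costs are hypothetical.

```python
import math


def softmin_weights(costs, temperature):
    """Normalized weights proportional to exp(-cost / temperature)."""
    best = min(costs)  # subtract the minimum for numerical stability
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def kl_from_uniform(weights):
    """KL divergence between the weighted samples and uniform weights."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0.0)


def reps_style_weights(costs, kl_threshold=1.0, min_temperature=0.001, iterations=60):
    """Find the lowest temperature (not below the minimum) that respects the KL threshold."""
    low, high = min_temperature, 1e6
    if kl_from_uniform(softmin_weights(costs, low)) <= kl_threshold:
        return softmin_weights(costs, low), low
    for _ in range(iterations):
        mid = math.sqrt(low * high)  # bisection in log space
        if kl_from_uniform(softmin_weights(costs, mid)) > kl_threshold:
            low = mid
        else:
            high = mid
    return softmin_weights(costs, high), high


# Toy discrepancy costs for 10 sampled simulation parameter sets.
costs = [float(i) for i in range(10)]
weights, temperature = reps_style_weights(costs)
```

Samples with a lower discrepancy receive larger weights, and the weighted samples would then be used to refit the mean and covariance of the simulation parameter distribution.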
References
Abbeel, P., & Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. In
International conference on machine learning.
Aharon, M., Elad, M., & Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete
dictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11),
4311–4322.
Akrour, R., Abdolmaleki, A., Abdulsamad, H., & Neumann, G. (2016). Model-free trajectory
optimization for reinforcement learning. In ICML.
Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., . . .
Zaremba, W. (2018). Learning dexterous in-hand manipulation. CoRR, abs/1808.00177.
Antonova, R., Cruciani, S., Smith, C., & Kragic, D. (2017). Reinforcement learning for pivoting
task. CoRR, abs/1703.00472.
Argall, B., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from
demonstration. Robotics and Autonomous Systems, 57(5), 469–483.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. CoRR, abs/1701.07875.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from
unlabeled video. CoRR, abs/1610.09001.
Babes, M., Marivate, V., Subramanian, K., & Littman, M. L. (2011). Apprenticeship learning
about multiple intentions. In Proceedings of the 28th international conference on machine
learning (ICML-11) (pp. 897–904).
Bekiroglu, Y., Kragic, D., & Kyrki, V. (2010). Learning grasp stability based on tactile data and
HMMs. In RO-MAN, 2010 IEEE (pp. 132–137).
Bekiroglu, Y., Laaksonen, J., Jørgensen, J., Kyrki, V., & Kragic, D. (2011). Assessing grasp
stability based on learning and haptic data. Robotics, IEEE Transactions on, 27(3), 616–629.
Billard, A., Calinon, S., Dillmann, R., & Schaal, S. (2008). Robot programming by demonstration.
In Springer handbook of robotics (pp. 1371–1394). Springer.
Bo, L., Ren, X., & Fox, D. (2011). Hierarchical matching pursuit for image classification:
Architecture and fast algorithms. In NIPS (pp. 2115–2123).
Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M., . . . Vanhoucke, V.
(2017). Using simulation and domain adaptation to improve efficiency of deep robotic grasping.
CoRR, abs/1709.07857.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W.
(2016). OpenAI gym. arXiv preprint arXiv:1606.01540.
Caggiano, V., Fogassi, L., Rizzolatti, G., Pomper, J., Thier, P., Giese, M., & Casile, A. (2011).
View-based encoding of actions in mirror neurons of area F5 in macaque premotor cortex. Current
Biology, 21(2), 144–148.
Calinon, S., Guenter, F., & Billard, A. (2007). On learning, representing and generalizing a task
in a humanoid robot. IEEE Trans. on Systems, Man and Cybernetics, Part B, 37(2).
Chebotar, Y., Hausman, K., Su, Z., Molchanov, A., Kroemer, O., Sukhatme, G., & Schaal, S.
(2016). BiGS: BioTac grasp stability dataset. In Grasping and manipulation datasets, ICRA 2016
workshop on.
Chebotar, Y., Hausman, K., Su, Z., Sukhatme, G., & Schaal, S. (2016). Self-supervised regrasping
using spatio-temporal tactile features and reinforcement learning. In IROS.
Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., & Levine, S. (2017).
Combining model-based and model-free updates for trajectory-centric reinforcement learning. In
International conference on machine learning.
Chebotar, Y., Kalakrishnan, M., Yahya, A., Li, A., Schaal, S., & Levine, S. (2017). Path integral
guided policy search. In ICRA.
Chebotar, Y., Kroemer, O., & Peters, J. (2014). Learning robot tactile sensing for object
manipulation. In Intelligent robots and systems (IROS 2014), 2014 IEEE/RSJ international conference
on (pp. 3368–3375).
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN:
Interpretable representation learning by information maximizing generative adversarial nets.
arXiv:1606.03657.
Christiano, P. F., Shah, Z., Mordatch, I., Schneider, J., Blackwell, T., Tobin, J., . . . Zaremba, W.
(2016). Transfer from simulation to real world through learning deep inverse dynamics model.
CoRR, abs/1610.03518.
Coumans, E., & Bai, Y. (2016–2017). pybullet, a python module for physics simulation in
robotics, games and machine learning. http://pybullet.org/.
Dang, H., & Allen, P. (2014). Stable grasping under pose uncertainty using tactile feedback.
Autonomous Robots, 36(4), 309–330.
Daniel, C., Neumann, G., Kroemer, O., & Peters, J. (2013). Learning sequential motor tasks. In
ICRA.
Deisenroth, M., Fox, D., & Rasmussen, C. (2014). Gaussian processes for data-efficient learning
in robotics and control. PAMI.
Deisenroth, M., Neumann, G., & Peters, J. (2013). A survey on policy search for robotics.
Foundations and Trends in Robotics, 2(1–2), 1–142.
Deisenroth, M., Rasmussen, C., & Fox, D. (2011). Learning to control a low-cost manipulator
using data-efficient reinforcement learning. In RSS.
Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach
to policy search. In ICML.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR.
Denton, E. L., Chintala, S., & Fergus, R. (2015). Deep generative image models using a Laplacian
pyramid of adversarial networks. In Advances in neural information processing systems (pp.
1486–1494).
Dimitrakakis, C., & Rothkopf, C. A. (2011). Bayesian multitask inverse reinforcement learning.
In European workshop on reinforcement learning (pp. 273–284).
Doersch, C., Gupta, A., & Efros, A. (2015). Unsupervised visual representation learning by
context prediction. CoRR, abs/1505.05192.
Dragan, A., & Srinivasa, S. (2012). Online customization of teleoperation interfaces. In RO-MAN
(pp. 919–924).
Duan, Y., Andrychowicz, M., Stadie, B., Ho, J., Schneider, J., Sutskever, I., . . . Zaremba, W.
(2017). One-shot imitation learning. arXiv preprint arXiv:1703.07326.
Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., & Courville, A.
(2017). Adversarially learned inference. In ICLR.
Endo, G., Morimoto, J., Matsubara, T., Nakanishi, J., & Cheng, G. (2008). Learning CPG-based
biped locomotion with a policy gradient method: Application to a humanoid robot. IJRR, 27(2),
213–228.
Englert, P., & Toussaint, M. (2016). Combined optimization and reinforcement learning for
manipulation skills. In RSS.
Farchy, A., Barrett, S., MacAlpine, P., & Stone, P. (2013). Humanoid robots learning to walk
faster: From the real world to simulation and back. In AAMAS.
Farshidian, F., Neunert, M., & Buchli, J. (2014). Learning of closed-loop motion control. In IROS.
Fernando, B., Bilen, H., Gavves, E., & Gould, S. (2016). Self-supervised video representation
learning with odd-one-out networks. CoRR, abs/1611.06646.
Finn, C., Christiano, P., Abbeel, P., & Levine, S. (2016). A connection between generative
adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint
arXiv:1611.03852.
Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via
policy optimization. In Proceedings of the 33rd international conference on machine learning
(Vol. 48).
Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S., & Abbeel, P. (2015a). Deep spatial
autoencoders for visuomotor learning. CoRR, 117(117), 240.
Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., & Abbeel, P. (2015b). Learning visual
feature spaces for robotic manipulation with deep spatial autoencoders. CoRR, abs/1509.06113.
Florensa, C., Duan, Y., & Abbeel, P. (2017). Stochastic neural networks for hierarchical
reinforcement learning. arXiv preprint arXiv:1704.03012.
Fox, R., Krishnan, S., Stoica, I., & Goldberg, K. (2017). Multi-level discovery of deep options.
arXiv preprint arXiv:1703.08294.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shifts in position. Biological Cybernetics, 36, 193–202.
Giri, F., & Bai, E.-W. (2010). Block-oriented nonlinear system identification. London:
Springer-Verlag London.
Gómez, V., Kappen, H., Peters, J., & Neumann, G. (2014). Policy search for path integral control.
In ECML/PKDD.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., . . . Bengio,
Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence,
& K. Q. Weinberger (Eds.), NIPS (pp. 2672–2680).
Goroshin, R., Bruna, J., Tompson, J., Eigen, D., & LeCun, Y. (2015). Unsupervised learning of
spatiotemporally coherent metrics. In ICCV.
Gu, S., Lillicrap, T., Sutskever, I., & Levine, S. (2016). Continuous deep Q-learning with
model-based acceleration. CoRR, abs/1603.00748.
Hagan, M., & Menhaj, M. (1994, Nov). Training feedforward networks with the Marquardt
algorithm. Neural Networks, IEEE Transactions on, 5(6), 989–993.
Hanna, J., & Stone, P. (2017). Grounded action transformation for robot learning in simulation.
In AAAI.
Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G. S., & Lim, J. J. (2017). Multi-modal
imitation learning from unstructured demonstrations using generative adversarial nets. In NIPS.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., & Erez, T. (2015). Learning continuous
control policies by stochastic value gradients. In NIPS.
Hinterstoisser, S., Lepetit, V., Rajkumar, N., & Konolige, K. (2016). Going further with point
pair features. In ECCV.
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. CoRR, abs/1606.03476.
Retrieved from http://arxiv.org/abs/1606.03476
Ijspeert, J., Nakanishi, J., & Schaal, S. (2002). Movement imitation with nonlinear dynamical
systems in humanoid robots. In ICRA.
Jakobi, N., Husbands, P., & Harvey, I. (1995). Noise and the reality gap: The use of simulation in
evolutionary robotics. In European conference on artificial life.
James, S., Davison, A. J., & Johns, E. (2017). Transferring end-to-end visuomotor control from
simulation to real world for a multi-stage task. CoRR, abs/1707.02267.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer.
Kalakrishnan, M., Righetti, L., Pastor, P., & Schaal, S. (2011). Learning force control policies
for compliant manipulation. In Intelligent robots and systems (IROS), 2011 IEEE/RSJ international
conference on (pp. 4639–4644).
Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., . . . Levine, S. (2018).
QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. CoRR,
abs/1806.10293.
Kim, T., Cha, M., Kim, H., Lee, J. K., & Kim, J. (2017, 06–11 Aug). Learning to
discover cross-domain relations with generative adversarial networks. In D. Precup & Y. W. Teh
(Eds.), Proceedings of the 34th international conference on machine learning (Vol. 70, pp.
1857–1865). International Convention Centre, Sydney, Australia: PMLR. Retrieved from
http://proceedings.mlr.press/v70/kim17a.html
Kingma, D., & Welling, M. (2013). Auto-encoding variational Bayes. CoRR, abs/1312.6114.
Kober, J., Bagnell, J., & Peters, J. (2013). Reinforcement learning in robotics: a survey.
International Journal of Robotic Research, 32(11), 1238–1274.
Kober, J., Mohler, B., & Peters, J. (2008). Learning perceptual coupling for motor primitives. In
IROS.
Kober, J., Oztop, E., & Peters, J. (2011). Reinforcement learning to adjust robot movements
to new situations. In IJCAI proceedings-international joint conference on artificial intelligence
(Vol. 22, p. 2650).
Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal
locomotion. In ICRA.
Kolev, S., & Todorov, E. (2015). Physically consistent state estimation and system identification
for contacts. In Humanoids.
Koos, S., Mouret, J.-B., & Doncieux, S. (2010). Crossing the reality gap in evolutionary robotics
by promoting transferable controllers. In GECCO. ACM.
Koutník, J., Schmidhuber, J., & Gomez, F. (2014). Evolving deep unsupervised convolutional
networks for vision-based reinforcement learning. In GECCO.
Kroemer, O., Daniel, C., Neumann, G., Van Hoof, H., & Peters, J. (2015). Towards learning
hierarchical skills for multi-phase manipulation tasks. In Robotics and automation (ICRA), 2015
IEEE international conference on (pp. 1503–1510).
Kroemer, O., Detry, R., Piater, J., & Peters, J. (2010). Combining active learning and reactive
control for robot grasping. Robotics and Autonomous Systems (RAS), 58(9), 1105–1116.
Kumar, V., Carneiro, G., & Reid, I. D. (2016). Learning local image descriptors with deep
siamese and triplet convolutional networks by minimizing global loss functions. In CVPR.
Levine, S., & Abbeel, P. (2014). Learning neural network policies with guided policy search
under unknown dynamics. In NIPS.
Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor
policies. JMLR, 17(1).
Levine, S., & Koltun, V. (2013). Guided policy search. In Proceedings of the 30th international
conference on machine learning (pp. 1–9).
Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., & Quillen, D. (2018). Learning hand-eye
coordination for robotic grasping with deep learning and large-scale data collection. I. J. Robotics
Res., 37(4–5), 421–436.
Levine, S., Popovic, Z., & Koltun, V. (2011). Nonlinear inverse reinforcement learning with
Gaussian processes. In Advances in neural information processing systems (pp. 19–27).
Levine, S., Wagener, N., & Abbeel, P. (2015). Learning contact-rich manipulation skills with
guided policy search. In Robotics and automation (ICRA), 2015 IEEE international conference on
(pp. 156–163).
Li, M., Bekiroglu, Y., Kragic, D., & Billard, A. (2014). Learning of grasp adaptation through
experience and tactile sensing. In Intelligent robots and systems (IROS 2014), 2014 IEEE/RSJ
international conference on (pp. 3339–3346).
Li, Y., Song, J., & Ermon, S. (2017). Inferring the latent structure of human decision-making
from raw visual inputs. CoRR, abs/1703.08840. Retrieved from
http://arxiv.org/abs/1703.08840
Liang, J., Makoviychuk, V., Handa, A., Chentanez, N., Macklin, M., & Fox, D. (2018).
GPU-accelerated robotic simulation for distributed reinforcement learning. CoRL.
Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., . . . Wierstra, D. (2016).
Continuous control with deep reinforcement learning. In ICLR.
Lioutikov, R., Paraschos, A., Peters, J., & Neumann, G. (2014). Sample-based
information-theoretic stochastic optimal control. In ICRA.
Liu, Y., Gupta, A., Abbeel, P., & Levine, S. (2017). Imitation from observation: Learning to
imitate behaviors from raw video via context translation. CoRR, abs/1707.03374.
Ljung, L. (1999). System identification – theory for the user. Prentice Hall.
Lowrey, K., Kolev, S., Dao, J., Rajeswaran, A., & Todorov, E. (2018). Reinforcement learning
for nonprehensile manipulation: Transfer from simulation to physical system. In SIMPAR.
Madry, M., Bo, L., Kragic, D., & Fox, D. (2014, May). ST-HMP: Unsupervised spatio-temporal
feature learning for tactile data. In Robotics and automation (ICRA), 2014 IEEE international
conference on (pp. 2262–2269).
Mathieu, M., Couprie, C., & LeCun, Y. (2015). Deep multi-scale video prediction beyond mean
square error. arXiv preprint arXiv:1511.05440.
Misra, I., Zitnick, C., & Hebert, M. (2016). Unsupervised learning using sequential verification
for action recognition. CoRR, abs/1603.08561.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M.
(2013). Playing Atari with deep reinforcement learning. In NIPS workshop on deep learning.
Montesano, L., & Lopes, M. (2012). Active learning of visual descriptors for grasping using
nonparametric smoothed beta distributions. Robotics and Autonomous Systems, 60(3), 452–462.
Montgomery, W., Ajay, A., Finn, C., Abbeel, P., & Levine, S. (2017). Reset-free guided policy
search: efficient deep reinforcement learning with stochastic initial states. In ICRA.
Montgomery, W., & Levine, S. (2016). Guided policy search via approximate mirror descent. In
NIPS.
Mordatch, I., Lowrey, K., & Todorov, E. (2015). Ensemble-CIO: Full-body dynamic motion
planning that transfers to physical humanoids. In IROS.
Mori, G., Pantofaru, C., Kothari, N., Leung, T., Toderici, G., Toshev, A., & Yang, W. (2015). Pose
embeddings: A deep architecture for learning to match human poses. CoRR, abs/1507.00302.
M¨ ulling, K., Kober, J., Kroemer, O., & Peters, J. (2013). Learning to select and generalize
striking movements in robot table tennis. The International Journal of Robotics Research, 32(3),
263–279.
Muratore, F., Treede, F., Gienger, M., & Peters, J. (2018). Domain randomization for simulation
based policy optimization with transferability assessment. In Corl.
Ng, A. Y ., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In Icml (pp.
663–670).
Niekum, S., Chitta, S., Barto, A. G., Marthi, B., & Osentoski, S. (2013). Incremental semantically
grounded learning from demonstration. In Robotics: Science and systems (V ol. 9).
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E., & Freeman, W. (2015). Visually
indicated sounds. CoRR, abs/1512.08512.
Pan, Y., & Theodorou, E. (2014). Probabilistic differential dynamic programming. In NIPS.
Pastor, P., Hoffmann, H., Asfour, T., & Schaal, S. (2009). Learning and generalization of motor
skills by learning from demonstration. In ICRA (pp. 763–768).
Pastor, P., Kalakrishnan, M., Chitta, S., Theodorou, E., & Schaal, S. (2011). Skill learning
and task outcome prediction for manipulation. IEEE International Conference on Robotics and
Automation, 3828–3834.
Pathak, D., Girshick, R., Dollár, P., Darrell, T., & Hariharan, B. (2016). Learning features by
watching objects move. CoRR, abs/1612.06370.
Pati, Y., Rezaiifar, R., & Krishnaprasad, P. (1993). Orthogonal matching pursuit: recursive
function approximation with applications to wavelet decomposition. In Signals, Systems and
Computers, 1993 Conference Record of the Twenty-Seventh Asilomar Conference on (pp. 40–44, vol. 1).
Paulin, M., Douze, M., Harchaoui, Z., Mairal, J., Perronnin, F., & Schmid, C. (2015, Dec). Local
convolutional features with unsupervised training for image retrieval. In ICCV (pp. 91–99).
Peng, X. B., Andrychowicz, M., Zaremba, W., & Abbeel, P. (2018). Sim-to-real transfer of
robotic control with dynamics randomization. In ICRA.
Peters, J., Mülling, K., & Altun, Y. (2010). Relative entropy policy search. In AAAI. AAAI Press.
Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients.
Neural Networks, 21(4).
Pfau, D., & Vinyals, O. (2016). Connecting generative adversarial networks and actor-critic
methods. arXiv preprint arXiv:1610.01945.
Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W., & Abbeel, P. (2017). Asymmetric actor
critic for image-based robot learning. CoRR, abs/1710.06542.
Pinto, L., & Gupta, A. (2016). Supersizing self-supervision: Learning to grasp from 50k tries
and 700 robot hours. In ICRA.
Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation.
Neural Computation, 3(1), 88–97.
Rajeswaran, A., Ghotra, S., Levine, S., & Ravindran, B. (2016). EPOpt: Learning robust neural
network policies using model ensembles. CoRR, abs/1610.01283.
Ratliff, N., Bagnell, J., & Srinivasa, S. (2007). Imitation learning for locomotion and manipulation.
In Humanoids.
Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience,
27, 169–192.
Ross, S., & Bagnell, D. (2010). Efficient reductions for imitation learning. In AISTATS (Vol. 3, pp.
3–5).
Ross, S., Gordon, G., & Bagnell, D. (2011). A reduction of imitation learning and structured
prediction to no-regret online learning. In AISTATS (Vol. 15, pp. 627–635).
Rusu, A. A., Vecerik, M., Rothörl, T., Heess, N., Pascanu, R., & Hadsell, R. (2017). Sim-to-real
robot learning from pixels with progressive nets. In CoRL.
Sadeghi, F., & Levine, S. (2017). CAD2RL: Real single-image flight without a single real image.
RSS.
Schaal, S. (1999). Is imitation learning the route to humanoid robots? Trends in cognitive
sciences, 3(6), 233–242.
Schaal, S., Peters, J., Nakanishi, J., & Ijspeert, A. (2003). Control, planning, learning, and
imitation with dynamic movement primitives. In IROS workshop on bilateral paradigms on humans
and humanoids.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61,
85–117.
Schmidt, T., Newcombe, R. A., & Fox, D. (2014). DART: Dense articulated real-time tracking. In
RSS.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face
recognition and clustering. CoRR, abs/1503.03832.
Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust region policy
optimization. In ICML.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-dimensional
continuous control using generalized advantage estimation. In ICLR.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy
optimization algorithms. CoRR, abs/1707.06347.
Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., & Levine, S. (2018).
Time-contrastive networks: Self-supervised learning from video. In ICRA.
Sermanet, P., Xu, K., & Levine, S. (2016). Unsupervised perceptual rewards for imitation
learning. CoRR, abs/1612.06699.
Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., & Moreno-Noguer, F. (2015).
Discriminative learning of deep convolutional feature point descriptors. In ICCV.
Sohn, K. (2016). Improved deep metric learning with multi-class N-pair loss objective. In
Advances in neural information processing systems (pp. 1857–1865).
Song, H. O., Xiang, Y., Jegelka, S., & Savarese, S. (2015). Deep metric learning via lifted
structured feature embedding. CoRR, abs/1511.06452.
Stadie, B., Abbeel, P., & Sutskever, I. (2017). Third-person imitation learning. CoRR,
abs/1703.01703.
Stewart, R., & Ermon, S. (2016). Label-free supervision of neural networks with physics and
domain knowledge. CoRR, abs/1609.05566.
Stulp, F., Buchli, J., Ellmer, A., Mistry, M., Theodorou, E., & Schaal, S. (2012). Model-free
reinforcement learning of impedance control in stochastic environments. IEEE Trans. Autonomous
Mental Development, 4(4), 330–341.
Stulp, F., & Sigaud, O. (2012). Path integral policy improvement with covariance matrix
adaptation. In ICML.
Stulp, F., Theodorou, E., Buchli, J., & Schaal, S. (2011). Learning to grasp under uncertainty. In
ICRA.
Stulp, F., Theodorou, E., & Schaal, S. (2012). Reinforcement learning with sequences of motion
primitives for robust manipulation. IEEE Trans. Robotics, 28(6), 1360–1370.
Su, Z., Hausman, K., Chebotar, Y., Molchanov, A., Loeb, G., Sukhatme, G., & Schaal, S. (2015).
Force estimation and slip detection/classification for grip control using a biomimetic tactile sensor.
In Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on (pp. 297–303).
Sutton, R. (1990). Integrated architectures for learning, planning, and reacting based on
approximating dynamic programming. In ICML.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the inception
architecture for computer vision. CoRR, abs/1512.00567.
Sønderby, C. K., Caballero, J., Theis, L., Shi, W., & Huszár, F. (2016). Amortised MAP inference
for image super-resolution. CoRR, abs/1610.04490. Retrieved from
http://dblp.uni-trier.de/db/journals/corr/corr1610.html#SonderbyCTSH16
Tan, J., Xie, Z., Boots, B., & Liu, C. K. (2016). Simulation-based design of dynamic controllers
for humanoid balancing. In IROS.
Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., . . . Vanhoucke, V. (2018).
Sim-to-real: Learning agile locomotion for quadruped robots. In RSS.
Tassa, Y., Erez, T., & Todorov, E. (2012). Synthesis and stabilization of complex behaviors. In
IROS.
Tedrake, R., Zhang, T., & Seung, H. (2004). Stochastic policy gradient reinforcement learning on
a simple 3D biped. In IROS.
Theodorou, E., Buchli, J., & Schaal, S. (2010). A generalized path integral control approach to
reinforcement learning. JMLR, 11.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain
randomization for transferring deep neural networks from simulation to the real world. In IROS.
Tompson, J., Jain, A., LeCun, Y., & Bregler, C. (2014). Joint training of a convolutional network
and a graphical model for human pose estimation. In NIPS.
Tzeng, E., Devin, C., Hoffman, J., Finn, C., Peng, X., Levine, S., . . . Darrell, T. (2015).
Towards adapting deep visuomotor representations from simulated to real environments. CoRR,
abs/1511.07111.
Tzeng, E., Hoffman, J., Darrell, T., & Saenko, K. (2015). Simultaneous deep transfer across
domains and tasks. In ICCV.
van Hoof, H., Peters, J., & Neumann, G. (2015). Learning of non-parametric control policies
with high-dimensional state features. In AISTATS.
Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu,
K. (2017). Feudal networks for hierarchical reinforcement learning. arXiv preprint
arXiv:1703.01161.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. (2008). Extracting and composing robust
features with denoising autoencoders. In International conference on machine learning.
Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos.
CoRR, abs/1505.00687.
Wettels, N., Parnandi, A., Moon, J., Loeb, G., & Sukhatme, G. (2009). Grip control using
biomimetic tactile sensing systems. Mechatronics, IEEE/ASME Transactions On, 14(6), 718–723.
Wettels, N., Santos, V., Johansson, R., & Loeb, G. (2008). Biomimetic tactile sensor array.
Advanced Robotics, 22(8), 829–849.
Whitney, W., Chang, M., Kulkarni, T., & Tenenbaum, J. (2016). Understanding visual concepts
with continuation learning. CoRR, abs/1602.06822.
Wiskott, L., & Sejnowski, T. (2002, April). Slow feature analysis: Unsupervised learning of
invariances. Neural Comput., 14(4), 715–770.
Wulfmeier, M., Posner, I., & Abbeel, P. (2017). Mutual alignment transfer learning. CoRR,
abs/1707.07907.
Yahya, A., Li, A., Kalakrishnan, M., Chebotar, Y., & Levine, S. (2017). Collective robot
reinforcement learning with distributed asynchronous guided policy search. In IROS.
Yao, Y., Rosasco, L., & Caponnetto, A. (2007). On early stopping in gradient descent learning.
Constructive Approximation, 26(2), 289–315.
Yi, K. M., Trulls, E., Lepetit, V., & Fua, P. (2016). LIFT: learned invariant feature transform.
CoRR, abs/1603.09114.
Yu, W., Tan, J., Liu, C. K., & Turk, G. (2017). Preparing for the unknown: Learning a universal
policy with online system identification. In RSS.
Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional
neural networks. In CVPR (pp. 4353–4361).
Zhang, R., Isola, P., & Efros, A. (2016). Split-brain autoencoders: Unsupervised learning by
cross-channel prediction. CoRR, abs/1611.09842.
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using
cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.
Zhu, S., Kimmel, A., Bekris, K. E., & Boularias, A. (2018). Fast model identification via physics
engines for data-efficient policy search. In IJCAI. ijcai.org.
Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse
reinforcement learning. In D. Fox & C. P. Gomes (Eds.), AAAI (pp. 1433–1438). AAAI Press.
Asset Metadata
Creator: Chebotar, Yevgen (author)
Core Title: Data-driven acquisition of closed-loop robotic skills
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 03/13/2019
Defense Date: 03/04/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: imitation learning, OAI-PMH Harvest, reinforcement learning, robotics, transfer learning
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Sukhatme, Gaurav S. (committee chair), Culbertson, Heather (committee member), Gupta, Satyandra K. (committee member)
Creator Email: ychebota@usc.edu, yevgen.chebotar@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/uscthesesc89131734
Unique identifier: UC11676863
Identifier: etdChebotarYe7154.pdf (filename), uscthesesc89131734 (legacy record id)
Legacy Identifier: etdChebotarYe7154.pdf
Dmrecord: 131734
Document Type: Dissertation
Rights: Chebotar, Yevgen
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA