Rethinking Perception-Action Loops via Interactive Perception and Learned Representations

by Karol Hausman

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science)

Committee:
Prof. Gaurav S. Sukhatme (Chair), Computer Science
Prof. Stefan Schaal, Computer Science
Prof. Joseph Lim, Computer Science
Prof. Satyandra K. Gupta, Electrical Engineering
Prof. Pieter Abbeel, Electrical Engineering and Computer Sciences, UC Berkeley

May 2018

Copyright 2018 Karol Hausman

Acknowledgements

There are so many people who contributed to this thesis and to whom I would like to express my appreciation. First and foremost, I would like to thank my Ph.D. advisor, Gaurav Sukhatme, who not only provided professional guidance, a great working atmosphere and academic freedom, but also became a true mentor and a friend who has helped me tremendously with both academic and non-academic endeavours. Thank you for giving me a chance and believing in me; I could not imagine having a better advisor.

I want to point out other USC professors who I worked with over the course of my Ph.D.: Stefan Schaal, Nora Ayanian, Joseph Lim and Fei Sha. Thank you for creating a very open and collaborative atmosphere and letting me use your resources and guidance. It showed me a great example of what an open academic workplace should look like, which I will strive to recreate as I move forward.

I wish to express my gratitude to the many industrial collaborators who I worked with during my internships: Google DeepMind (Tobi Springenberg, Nicolas Heess, Martin Riedmiller, Ziyu Wang, Gabe Martin, Yuval Tassa, Tom Erez and Tom Rothoerl), Qualcomm (Cas Wierzynski, Ali Agha, Sarah Gibson, Slawek Grzechnik), Bosch (Sarah Osentoski, Dejan Pangercic, Ross Kidson, Scott Niekum), NASA JPL (Stephan Weiss, Roland Brockers, Larry Matthies) and all my fellow interns. It was an extreme privilege to work with you all and get to know you better. Not only was I able to contribute to the extraordinary work you were doing, but I also had an amazing time meeting you and living in different places around the world.

I am also very thankful to my external academic collaborators: Pieter Abbeel, Ken Goldberg, Oliver Brock, Danica Kragic, Sergey Levine, Sachin Patil, Marvin Zhang, Greg Kahn and many more, for giving me a chance to work with you and for being extremely helpful and supportive throughout our work. This thesis would not have been possible without you.

There are also a number of amazing fellow Ph.D. students that I collaborated with over the course of my Ph.D.: Yevgen Chebotar, James Preiss, Oliver Kroemer, Eric Heiden, Harry Zhe Su, Artem Molchanov, Joerg Mueller, Abishek Hariharan, Chet Corcos, Jeannette Bohg, Bharath Sankaran - I had great fun working with all of you and I really appreciate all the time and effort you put into making our work the best it could be. It felt amazing to be part of a bigger team, and there is no way I would have created this thesis without your help.

I would like to thank all of my labmates from the RESL, CLMC and MPI labs that I had the pleasure of overlapping with. Thank you for creating a great working atmosphere and making me feel welcome and an integral part of the team. It means the world to me. A special thanks goes to Sean, Nick, Yevgen and Artem - my close friends from the CLMC and RESL labs who had to deal with my constant interruptions, jokes, numerous favors and high-fives.
The weirdness and dorkiness of our interactions made me feel very comfortable and allowed me to be myself. You guys made my time at USC extremely fun and enjoyable, even in the most stressful times. Thank you!

A group that I am particularly thankful to, and which holds a special place in my heart, is my Los Angeles family. During my stay at USC, I had a chance to develop incredible relationships and make friends for life. Carrie, Mark, Sean, Nathalie, Becky, Shruti, Harsh, Nick, Lena, Jimmy, Rahne, Devang, Aninya - you kept me going the entire time and made sure that my life was filled with love, joy and fun. I will always cherish the incredible memories from our trips, hikes, climbs, parties, game nights and birthdays. Thank you so much for creating a true home for me in a place that is so far away from where I come from - I really appreciate it, and let's continue our adventures moving forward!

I would also like to thank my family back in Poland: Mom, Dad, Marcin, Grandma and many more. Your support and encouragement from so far away from home mean a lot to me.

My last word goes to my fiancée and my best friend, Ola. I cannot express how grateful I am for the countless hours of support and encouragement that you gave me, especially during the hardest times. Your constant, often illogical, belief in me and your unconditional love make everything I do worth it every single day. You are my closest friend, partner and collaborator. Thank you for not only making this degree possible but also for making me the person I am today.

Table of Contents

Acknowledgements
1 Introduction
  1.1 Interactive and Active Perception
  1.2 Learned Representations via Deep Robot Learning
  1.3 Thesis Outline
I Active Perception
2 Observability-Aware Trajectory Optimization for Self-Calibration
  2.1 Introduction
  2.2 Related Work
  2.3 Problem Formulation and Fundamentals
    2.3.1 Motion and Sensor Models
    2.3.2 Nonlinear Observability Analysis
  2.4 Quality of Observability
    2.4.1 Taylor Expansion of the Sensor Model
    2.4.2 Observability Gramian
    2.4.3 Measure of Observability
  2.5 Trajectory Representation and Optimization
    2.5.1 Differentially Flat Trajectories
    2.5.2 Constrained Trajectory Representation using Piecewise Polynomials
    2.5.3 Numerical Stability for Constrained Trajectory
    2.5.4 The Optimization Objective
  2.6 Example Application to UAVs with IMU-GPS State Estimator
    2.6.1 EKF for IMU-GPS Sensor Suite
    2.6.2 Differentially Flat Outputs and Physical Constraints of the System
  2.7 Experimental Results
    2.7.1 Experimental Setup
    2.7.2 Evaluation of Various Self-Calibration Routines
    2.7.3 Evaluation of an Example Waypoint Trajectory
    2.7.4 Discussion
  2.8 Conclusions
3 Occlusion-Aware Multi-Robot Target Tracking in 3D
  3.1 Introduction
  3.2 Related Work
  3.3 Problem Definition
    3.3.1 Assumptions
    3.3.2 State Estimation with EKF
    3.3.3 Control Generation using Optimization
    3.3.4 Reasoning about Occlusions
  3.4 Evaluation and Discussion
    3.4.1 Experiments
  3.5 Conclusions
II Interactive Perception
4 Active Articulation Model Estimation
  4.1 Introduction
  4.2 Related Work
  4.3 Approach
    4.3.1 Articulation Model Representation
    4.3.2 Visual Sensor Model
    4.3.3 Action/Manipulation Sensor Model
    4.3.4 Best Action Selection
  4.4 Implementation
    4.4.1 Particle Filter
    4.4.2 Visual Sensor Model
    4.4.3 Manipulation Sensor Model
    4.4.4 Best Action Selection
  4.5 Evaluation and Discussion
    4.5.1 Experimental Setup
    4.5.2 Action Selection
    4.5.3 Articulation Model Estimation
  4.6 Conclusions
5 Self-Supervised Regrasping using Reinforcement Learning
  5.1 Introduction
  5.2 Related Work
  5.3 Grasp stability estimation
    5.3.1 Biomimetic tactile sensor
    5.3.2 Hierarchical Matching Pursuit
    5.3.3 Spatio-temporal HMP
  5.4 Reinforcement learning for regrasping
    5.4.1 Mapping from tactile features to grasp adjustments
    5.4.2 Policy search for learning mapping parameters
    5.4.3 Learning a general regrasping policy
  5.5 Experimental Results
    5.5.1 Evaluation of grasp stability prediction
    5.5.2 Learning individual linear regrasping policies
    5.5.3 Evaluation of general regrasping policy
  5.6 Conclusions
III Learned Representations via Deep Robot Learning
6 Combining Model-Based and Model-Free Updates for Deep Reinforcement Learning
  6.1 Introduction
  6.2 Related Work
  6.3 Preliminaries
    6.3.1 Model-Based Optimization of TVLG Policies
    6.3.2 Policy Improvement with Path Integrals
  6.4 Integrating Model-Based Updates into PI2
    6.4.1 Two-Stage PI2 update
    6.4.2 Model-Based Substitution with LQR-FLM
    6.4.3 Optimizing Cost Residuals with PI2
    6.4.4 Summary of PILQR algorithm
  6.5 Training Parametric Policies with GPS
  6.6 Experimental Evaluation
    6.6.1 Simulation Experiments
    6.6.2 Real Robot Experiments
  6.7 Discussion
7 Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets
  7.1 Introduction
  7.2 Related Work
  7.3 Preliminaries
  7.4 Multi-modal Imitation Learning
    7.4.1 Relation to InfoGAN
  7.5 Implementation
  7.6 Experiments
    7.6.1 Task setup
    7.6.2 Multi-Target Imitation Learning
    7.6.3 Multi-Task Imitation Learning
  7.7 Conclusions
8 Learning an Embedding Space for Transferable Robot Skills
  8.1 Introduction
  8.2 Related Work
  8.3 Preliminaries
  8.4 Learning Versatile Skills
    8.4.1 Policy Learning via a Variational Bound on Entropy Regularized RL
  8.5 Learning an Embedding for Versatile Skills in an Off-Policy Setting
  8.6 Learning to Control the Previously-Learned Embedding
  8.7 Experimental Results
    8.7.1 Didactic Example: Multi-Goal Point Mass Task with Sparse Rewards
    8.7.2 Control of the Skill Embedding for Manipulation Tasks
  8.8 Conclusions
IV Conclusions
9 Conclusions
  9.1 Future Work
A Appendix
  A.1 Appendix: Combining Model-Based and Model-Free Updates for Deep Reinforcement Learning
    A.1.1 Derivation of LQR-FLM
    A.1.2 PI2 update through constrained optimization
    A.1.3 Detailed Experimental Setup
    A.1.4 Additional Simulation Results
  A.2 Appendix: Learning an Embedding Space for Transferable Robot Skills
    A.2.1 Variational Bound Derivation
    A.2.2 Derivation for Multiple Timesteps
    A.2.3 Algorithm details
    A.2.4 Implementation Details
References

List of Figures

1.1 Example of a traditional robotics pipeline. The information flow is unidirectional where perceptual components provide information to the action generation components.

2.1 Root Mean Squared Error (RMSE) convergence of EKF self-calibration (accelerometer and gyroscope biases b_a, b_ω and position of the GPS sensor p_i^p) state estimates for a) figure-eight trajectory, b) star trajectory, c) optimal trajectory from our method. We introduced additional yaw motion for a) and b) trajectories in order to improve state estimation of these heuristics.

2.2 Self-calibration task: results obtained for optimizing for different objectives. Top row: random and optimized results when optimizing for p_i^p. Bottom row: random and optimized results when optimizing for b_a. The left and right column show the final RMSE for b_a and p_i^p respectively.
24 2.3 Self-calibration task: final RMSE values for the accelerometer bias b a and the GPS position in the IMU frame p p i obtained using optimization (green) and 3 different heuristics: star and figure eight trajectories from Fig. 2.1 and randomly sampled trajectories that are close of the physical limits of the system. . . . . . . 25 2.4 Self-calibration task: statistics collected over 50 runs of the quadrotor EKF using 6 different trajectories: ours - optimized trajectory using the hereby defined observability cost; trace - optimized trajectory using the covariance-trace cost function; PL-random - randomly sampled trajectory that is very close to the phys- ical limits of the system; star, figure 8 - heuristics-based trajectories presented in Fig. 2.1; random - randomly sampled trajectory that satisfies the constraints. Top left: GPS position integrated RMSE, top right: GPS position final RMSE, bottom left: accelerometer bias integrated RMSE, bottom right: accelerometer bias final RMSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.5 Waypoint navigation task: minimum snap trajectory (red) vs. optimized trajectory using our method (green). Position waypoints are shown in black. . . . . . . . . 28 viii 3.1 Three quadrotors cooperatively tracking a target following a figure-eight tra- jectory (blue line). The green ellipsoids show the 3-dimensional covariances of quadrotors’ positions. The green ellipse represents the 2-dimensional co- variance of the position of the target. Magenta lines depict the measurements between quadrotors and the global camera (blue cube at the top). How should the quadrotors move in order to minimize the uncertainty of the target? . . . . . 31 3.2 Signed distance if the quadrotor/target is inside the view frustum (left) and outside the view frustum (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Signed distance function in the presence of occlusions. First, we determine the shadows of all the occlusions such that the resulting field-of-view (the shaded area) is calculated. The signed distance is computed as the distance to the field-of-view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4 Comparison of different time horizons in the optimization step. The time horizons are 2, 5, and 10 time steps, respectively. Evaluation measures are shown with 95% confidence intervals over 10 runs with 3 quadrotors. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. There is no significant difference in the tracking error for these time horizons. . . . . . . . . . . . . . . 39 3.5 Comparison between different sampling approaches and the optimization ap- proach presented in this work. From left to right: random sampling, lattice sampling and optimization. Evaluation measures are shown with 95% confidence intervals over 10 runs with 3 quadrotors. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. The quantitative results from this experiment for all the trajectories are summarized in Table 3.1. The optimization approach yields better results than the other presented methods. . . . . . . . . . 40 3.6 Comparison of our optimization-based approach applied to different number of quadrotors used for the tracking task. From left to right the number of quadrotors are: 1, 3 and 5. 
Evaluation measures are shown with 95% confidence intervals over 10 runs. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. The experiments confirms the intuition that deploying more quadrotors improves the tracking performance up to a point of diminishing returns. . . . . . 41 3.7 Comparison between not taking occlusions into account in the optimization step (top) and accounting for occlusions using the signed distance function presented in this work (bottom). Results collected for 3 different trajectories: figure eight (left), spiral (middle) and random walk (right) Statistics and 95% confidence intervals over 10 runs with 3 quadrotors. Taking occlusions into account is beneficial especially at the spots right below the global camera where the quadrotors have to stay close to each other. . . . . . . . . . . . . . . . . . . . 43 ix 3.8 Target tracking results for our previous level-based approach (top) and the hereby presented method without explicitly reasoning about sensing topologies (bottom). Statistics and 95% confidence intervals over 10 runs with 3 quadrotors. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. The quantitative results from this experiment for all the trajectories are summarized in Table 3.1. The approach presented in this work is more beneficial for the reasons explained in Fig. 3.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.9 An example of an advantage of our approach over the level-based approach. Left: most beneficial sensing topology for the level-based approach. The quadrotors form a chain topology as each level of quadrotors is allowed to sense only one level below it. Right: A superior sensing topology achieved by our novel method. The lower quadrotor is seen by the global sensor and the other quadrotor. The upper quadrotor can localize itself based on the observation of the lower quadrotor. Both quadrotors can observe the tracking target which provides more information than the level-based approach. There are no constraints for the topology levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.1 Top: The PR2 performing an action that reduces the uncertainty over articulation models. Bottom: Three of the most probable articulation model hypotheses: free- body(green), rotational(blue) and prismatic(pink). Selected action is depicted in yellow and lies on the plane of the cabinet . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Particles representing different hypothesis of the underlying articulation model. Blue is a rotational model, pink is a prismatic model and green is a free-body model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.3 Probabilistic graphical model for articulation recognition. . . . . . . . . . . . . . 55 4.4 The angle between the action vector W V a t and the tangent vector to the current model W V m t (rotational in this case). . . . . . . . . . . . . . . . . . . . . . . . . 57 4.5 Automatically generated candidate actions. . . . . . . . . . . . . . . . . . . . . . 61 4.6 Four experimental scenarios. 
Top-left: rotational cabinet door with an incomplete demonstration; top-right: two free-body erasers moved apart in a straight line; bottom-left: locked drawer as a rigid joint; bottom-right: stapler as a rotational joint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.7 Top: Entropy over articulation model types after each action for the cabinet door experiment. Bottom: Probability of the true articulation model (rotational) after each action. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 x 4.8 Outcomes of performing actions on the cabinet door. Top-left: the current state of the particle filter with most probable hypotheses of each articulation model. The yellow arrow shows the optimal action in terms of maximizing the information gain. Top-right: the state of the particle filter after performing the first action (z a 1 = 0, door did not move), together with the currently chosen most informative action. Second row-left: Percentage of each of the models’ particles in the particle filter before any action, after the first action, and after the second action (z a 2 = 1, door did move). The two bottom rows depict the same idea, but the action is selected based on minimization the expected entropy. The outcomes of the actions in this case were bothz a 2 = 1. . . . . . . . . . . . . . . . . . . . . . . . 67 4.9 Top: Probability distribution over models before and after the action in the eraser experiment. Bottom: Analogous statistics for the rigid-joint experiment. Results of 3 actions are presented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.10 Rotational particles in the stapler experiment before (left) and after (right) a manipulation action. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.1 Regrasping scenario: the robot partially misses the object with one of the fingers during the initial grasp (left), predicts that the current grasp will be unstable, places the object down, and adjusts the hand configuration to form a firm grasp of the object using all of its fingers (right). . . . . . . . . . . . . . . . . . . . . . . 72 5.2 The schematic of the electrode arrangements on the BioTac sensor (left). Tactile image used for the ST-HMP features (right). The X values are the reference electrodes. The 19 BioTac electrodes are measured relative to these 4 reference electrodes. V1 and V2 are created by taking an average response of the neighbor- ing electrodes: V1 = avg(E17, E18, E12, E2, E13, E3) and V2 = avg(E17, E18, E15, E5, E13, E3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.3 Objects and experimental setup used for learning the grasp stability predictor and the regrasping behavior. If an object falls out of the hand it returns to its initial position due to the shape of the bowl. Top-left: the cylinder. Top-right: the box. Bottom-left: the ball. Bottom-right: the novel object. . . . . . . . . . . . 80 5.4 Top left: schematic of the electrode arrangements on the BioTac sensor and the corresponding tactile image used for the ST-HMP features. V1, V2 and V3 are computed by averaging the neighboring electrode values. Top right, bottom left, bottom right: reinforcement learning curves for regrasping individual objects using REPS. Policy updates are performed every 100 regrasps. . . . . . . . . . . 82 6.1 Real robot tasks used to evaluate our method. Left: The hockey task which involves discontinuous dynamics. 
Right: The power plug task which requires high level of precision. Both of these tasks are learned from scratch without demonstrations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.2 We evaluate on a set of simulated robotic manipulation tasks with varying difficulty. Left to right, the tasks involve pushing a block, reaching for a target, and opening a door in 3D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 xi 6.3 Average final distance from the block to the goal on one condition of the gripper pusher task. This condition is difficult due to the block being initialized far away from the gripper and the goal area, and only PILQR is able to succeed in reaching the block and pushing it toward the goal. Results for additional conditions are available in Appendix A.1.4, and the supplementary video demonstrates the final behavior of each learned policy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.4 Final distance from the reacher end effector to the target averaged across 300 random test conditions per iteration. MDGPS with LQR-FLM, MDGPS with PILQR, TRPO, and DDPG all perform competitively. However, as the log scale for the x axis shows, TRPO and DDPG require orders of magnitude more samples. MDGPS with PI 2 performs noticeably worse. . . . . . . . . . . . . . . 99 6.5 Minimum angle in radians of the door hinge (lower is better) averaged across 100 random test conditions per iteration. MDGPS with PILQR outperforms all other methods we compare against, with orders of magnitude fewer samples than DDPG and TRPO, which is the only other successful algorithm. . . . . . . . . . 100 6.6 Single condition comparison of the hockey task performed on the real robot. Costs lower than the dotted line correspond to the puck entering the goal. . . . . 101 6.7 Experimental setup of the hockey task and the success rate of the final PILQR- MDGPS policy. Red and Blue: goal positions used for training, Green: new goal position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.8 Single condition comparison of the power plug task performed on the real robot. Note that costs above the dotted line correspond to executions that did not actually insert the plug into the socket. Only our method (PILQR) was able to consistently insert the plug all the way into the socket by the final iteration. . . . . . . . . . . 103 7.1 Left: Walker-2D running forwards, running backwards, jumping. Right: Hu- manoid running forwards, running backwards, balancing. . . . . . . . . . . . . . 112 7.2 Left: Reacher with 2 targets: random initial state, reaching one target, reaching another target. Right: Gripper-pusher: random initial state, grasping policy, pushing (when grasped) policy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.3 Results of the imitation GAN with (top row) and without (bottom row) the latent intention cost. Left: Reacher with 2 targets(crosses): final positions of the reacher (circles) for categorical (1) and continuous (2) latent intention variable. Right: Reacher with 4 targets(crosses): final positions of the reacher (circles) for categorical (3) and continuous (4) latent intention variable. . . . . . . . . . . . . 114 7.4 Left: Rewards of different Reacher policies for 2 targets for different intention values over the training iterations with (1) and without (2) the latent intention cost. Right: Two examples of a heatmap for 1 target Reacher using two latent intentions each. . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 xii 7.5 Top: Rewards of Walker-2D policies for different intention values over the training iterations with (left) and without (right) the latent intention cost. Bottom: Rewards of Humanoid policies for different intention values over the training iterations with (left) and without (right) the latent intention cost. . . . . . . . . . 116 7.6 Time-lapse of the learned Gripper-pusher policy. The intention variable is changed manually in the fifth screenshot, once the grasping policy has grasped the block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 8.1 Schematics of our approach. We train the agent in a multi-task setup, where the task id is given as a one-hot input to the embedding network (bottom-left). The embedding network generates an embedding distribution that is sampled and concatenated with the current observation to serve as an input to the policy. After interaction with the environment, a segment of states is collected and fed into the inference network (bottom-right). The inference network is trained to classify what embedding vector the segment of states was generated from. . . . . . . . . 121 8.2 Bottom: resulting trajectories for different 3D embedding values with (right) and without (left) variational-inference-based regularization. The contours depict the reward gained by the agent. Top: Absolute error between the mean embedding value predicted by the inference network and the actual mean of the embed- ding used to generate these trajectories. Note that every error curve at the top corresponds to a single trajectory at the bottom. . . . . . . . . . . . . . . . . . . . 127 8.3 Left, middle: resulting trajectories that were generated by different distributions used for the skill-embedding space: multivariate Bernoulli (left), Gaussian (middle). The contours depict the reward gained by the agent. Note that there is no reward outside the goal region. Right: KL-divergence between the embedding distributions produced by task 1 and the other three tasks. Task 1 and 3 have different task ids but they are exactly the same tasks. Our method is able to discover that task 1 and 3 can be covered by the same embedding, which corresponds to the minimal KL-divergence between their embeddings. . . . . . 128 8.4 Left: visualization of the sequence of manipulation tasks we consider. Top row: spring-wall, middle row: L-wall, bottom row: rail-push. The left two columns depict the two initial skills that are learned jointly, the rightmost column (in the left part of the figure) depicts the transfer task that should be solved using the previously acquired skills. Right: trajectories of the block in the plane as manipulated by the robot. The trajectories are produced by sampling a random embedding vector trained with (red) and without (black) the inference network from the marginal distribution over the L-wall pre-training tasks every 50 steps and following the policy. Dots denote points at which the block was lifted. . . . 129 8.5 Comparison of our method against different training strategies for our manipula- tion tasks: spring-wall, L-wall, and rail-push. . . . . . . . . . . . . . . . . . . . . 130 8.6 Final policy for all three tasks: spring-wall (top), L-wall (middle), rail-push (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 xiii A.1 The initial conditions for the gripper pusher task that we train TVLG policies on. 
The top left and bottom right conditions are more difficult due to the distance from the block to the goal and the configuration of the arm. The top left condition results are reported in Section 6.6.1.

A.2 Single condition comparisons of the gripper-pusher task in three additional conditions. The top, middle, and bottom plots correspond to the top right, bottom right, and bottom left conditions depicted in Figure A.1, respectively. The PILQR method outperforms other baselines in two out of the three conditions. The conditions presented in the top and middle figure are significantly easier than the other conditions presented in the work.

A.3 Additional results on the door opening task.

Abstract

For robots to become fully autonomous in real-world environments, they must be able to cope with uncertainties resulting from imperfect control and perception. Although robotic perception and control systems typically account for uncertainty independently, they do not consider uncertainty in the coupling between perception and control, leading to degraded performance. To address this shortcoming, recent approaches in robotics follow the insight that interaction with the environment can facilitate perception and inform manipulation, which results in reducing the uncertainty between perception and control components.

Another challenge that lies at the intersection of these components involves the intermediate representations used for the transition between the two. Traditionally, the interfaces between perception and control systems have been well-defined and hand-specified by robotics engineers, which might cause the intermediate representations to be constraining. This is in strong contrast to modern, flexible and learned representations that are optimized for the task at hand using deep learning.

In this thesis, we present several methods that demonstrate how interaction with the environment, closed perception-action loops and learned representations are beneficial for manipulation and perceptual tasks. Specifically, we look at the problems of Interactive Perception, where the robot forcefully interacts with the environment, Active Perception, where the robot changes its sensor states to acquire additional data, and Robot Learning, where the robot learns the intermediate representations in an end-to-end fashion. We show applications of these paradigms to the tasks of articulation model estimation, grasping, multi-robot target tracking, self-calibration, locomotion and acquiring manipulation skills.

Chapter 1
Introduction

Technology has reached the point where we are beginning to see robots introduced into our everyday lives. At the moment, robotic systems that have achieved commercial success are still mostly hand-engineered to be deployed in a limited number of constrained scenarios. Even though we have seen significant improvements in many of the fields that robotics draws from, the problem of creating an intelligent machine that is able to understand, manipulate and interact with its surroundings remains unsolved. When confronted with unstructured, real-world, complex scenarios, the challenge remains to develop robotic algorithms that are able to learn from mistakes and generalize to unseen situations. Due to the complexity of these challenges, roboticists have traditionally tended to split robotics problems into smaller subcomponents and solve them separately.
An example of a traditional robotics pipeline that consists of such subcomponents is depicted in Fig. 1.1. As the first step, the robot needs to be able to sense the world around it. The sensors are designed in a way that enables a robotic system to collect all the necessary information about the world with the highest precision and bandwidth possible within design limitations such as cost, weight, and space. The output of the sensors is then sent to a state estimation algorithm, such as a Kalman Filter or a Particle Filter (Thrun, Burgard, & Fox, 2005a), whose purpose is to process the sensory data and estimate from it the quantities necessary for further use in decision making and control. Next, the state estimates are fed into a high-level planning component that is responsible for producing a feasible plan to accomplish a goal specified by the user. The output of the planning component is then sent to a set of low-level controllers whose purpose is to command the robot to follow the plan as accurately as possible. At the very end of the process, the motor commands are sent to a robot that executes the actions in the environment.

[Figure 1.1: Example of a traditional robotics pipeline. The information flow is unidirectional where perceptual components provide information to the action generation components.]

This traditional robotics pipeline has many appealing properties: its well-defined partitioning allows researchers to work on smaller subproblems within a limited scope, it provides modularity and generality of all of the subcomponents, and it allows for relatively simple diagnostics of the whole system. Nevertheless, this seemingly beneficial split into subcomponents leads to a number of assumptions that may be harmful to the end performance of a robotic system. These include the following (a minimal sketch of the unidirectional flow is given after the list):

• The information flow in the pipeline in Fig. 1.1 is unidirectional, which assumes that the output of each component is perfect (or sufficient) for the later component to perform optimally. In reality, this leads to compounding errors that are propagated throughout the pipeline.

• The split into different subproblems causes the robotics communities to focus on a specific subproblem in isolation and optimize it based on a different optimality criterion than that which the full robotics problem requires (such as the accuracy of a map for a robot rather than the usefulness of the map for navigating to a goal).

• The interfaces between different subcomponents are well-defined and hand-specified by robotics engineers, which might cause these intermediate representations to be constraining. This is in strong contrast to modern, flexible and learned representations that are optimized for the task at hand (LeCun, Bengio, & Hinton, 2015).
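As an illustration of the unidirectional flow criticized above, the following sketch chains hypothetical sense, estimate, plan, and control stages. The function names, gains, and toy dynamics are placeholders invented for this sketch, not components of any system described in this thesis; the point is only that each stage consumes the output of the previous one and nothing flows back.

```python
import numpy as np

# A minimal caricature of the traditional sense -> estimate -> plan -> act pipeline.
# Each stage consumes only the output of the previous stage; no information flows back,
# so errors made early on are silently propagated downstream.

def sense(world):
    """Return a raw (noisy) sensor reading of the world."""
    return world["true_state"] + np.random.normal(scale=0.05, size=2)

def estimate_state(measurement, prior):
    """Toy state estimator: blend the prior with the new measurement."""
    gain = 0.5  # a fixed gain stands in for a Kalman-style update
    return prior + gain * (measurement - prior)

def plan(state_estimate, goal):
    """High-level planner: a unit step in the direction of the goal."""
    direction = goal - state_estimate
    return direction / (np.linalg.norm(direction) + 1e-9)

def low_level_control(plan_step):
    """Low-level controller: scale the planned step into a motor command."""
    return 0.1 * plan_step

def execute(world, command):
    """The robot acts; the pipeline never checks how well perception served the task."""
    world["true_state"] = world["true_state"] + command

world = {"true_state": np.zeros(2)}
belief, goal = np.zeros(2), np.array([1.0, 1.0])
for _ in range(50):
    z = sense(world)
    belief = estimate_state(z, belief)
    u = low_level_control(plan(belief, goal))
    execute(world, u)  # information flows strictly left to right
```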
In this thesis, we present different approaches to address these challenges. The first part of the thesis focuses on the paradigm of Active and Interactive Perception, which tackles the problem of sequential perception-action information flow by making the perceptual process active and exploratory. It introduces a tight interaction between the control and perception systems that enables the closing of action-perception loops and improves overall performance. In the second part of the thesis, we introduce a set of Deep Robot Learning approaches that address the problem of hand-designed and non-goal-tailored representations by using deep learning techniques to optimize the representation in an end-to-end fashion.

We strongly believe that both of these ideas underlie a significant shift from the traditional robotics pipeline and have the potential to greatly improve the current state of robotics research. In the following, we describe these two parts of the thesis in more detail.

1.1 Interactive and Active Perception

When operating in human environments, robots must frequently cope with many types of uncertainty. Sources of uncertainty include systematic and random perceptual error, imperfect actuators, ambiguity due to a lack of data, and changing environmental conditions. The robotics community has widely adopted perception and estimation methods that account for uncertainty, such as Bayesian estimators (Thrun et al., 2005a). Similarly, control techniques such as robust optimal control maintain high performance in the presence of uncertainty. However, these methods typically account for uncertainty independently, and do not consider uncertainty in the interaction between perception and control, leading to degraded performance.

A traditional decoupled system includes a perception component responsible for processing sensory data and estimating the state of the robot and its environment. A planning algorithm receives the estimated state, which incorporates all the environmental constraints, to generate a plan for performing the required task. Finally, this plan is executed by low-level motor controllers that are responsible for tracking the trajectory. The main disadvantage of such a sequential approach is that the error generated in the early stages propagates to the final outcome of the task, which often leads to failures.

In contrast, there is compelling evidence that human and animal visual perception is a more active and exploratory process that involves tight interaction between our control and perception systems (J. J. Gibson, 1979; O'Regan & Noë, 2001). For instance, in 1966 the psychologist Gibson (J. Gibson, 1966) showed that augmenting the perceptual process with interaction significantly increases performance. Human subjects found a reference object among a set of irregularly-shaped three-dimensional objects with an accuracy of 49% when shown images in print, but this accuracy increased to 72% when subjects were shown videos of the objects rotating. Nearly perfect performance (99%) was achieved when the subjects touched and rotated the objects in their hands.

Recent approaches in robotics have diverged from the sequential view of robotic systems and try to close the loop between perception and action to improve overall performance. This paradigm of closing the loop between perception and action is already visible in manipulation skill learning. For example, Associative Skill Memories (Kappler, Pastor, Kalakrishnan, Wuthrich, & Schaal, 2015) demonstrate a method for adapting a manipulation skill based on incrementally-acquired reference signals from experience. This approach is able to decide online when and how to switch behaviors in order to ensure successful task execution. Another example comes from work that takes advantage of the robot's manipulation capabilities to push a set of tabletop objects to facilitate visual segmentation (Hausman et al., 2013). While this work shows the promise of improving perception through manipulation, it does not fully close the perception-action loop because the actions are selected randomly, without any information from the perception system.
Thus, actions are executed to improve perception, but perception is not used to inform the action selection process.

In this thesis, we propose several methods that demonstrate how perception-action loops reveal new sensory information that is beneficial for perceptual tasks. We believe that the shift from the sequential sense-plan-act approach to an interactive paradigm will enable us to further advance robotic systems. To provide evidence for this claim, we apply the proposed paradigm to multiple robotic systems, such as humanoids and Unmanned Aerial Vehicles (UAVs), in scenarios where the robots either interact with the environment or change their position to improve their perception. We show the advantage of perception-action feedback loops on the problems of Articulation Model Estimation and Grasping on a manipulation platform, and on the problems of Multi-Robot Target Tracking and Self-Calibration on a UAV.

We propose a novel information-theoretic approach to the problem of articulation model estimation. Many objects in human environments move in structured ways relative to one another, e.g. due to mechanical constraints such as hinges. Articulation models describe these movements, providing useful information for both prediction and planning (Sturm, Stachniss, & Burgard, 2011; Martín & Brock, 2014). Previous works on detecting articulated motion models have generally used a maximum-likelihood strategy to fit models, ignoring uncertainty (Pillai, Walter, & Teller, 2014; Sturm et al., 2011; Martín & Brock, 2014). However, such passive techniques, which discard all but the most likely model, may lead to problems: the robot can be deceived in many situations by purely fitting a model to the demonstration data. In this thesis, we address these problems by tracking a non-parametric distribution over different articulation models and their parameters. In addition, the proposed approach selects the most informative actions to converge quickly to an accurate articulation model, and uses the information from interactions to further update the model distribution.
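As a sketch of how such information-theoretic action selection can work, one can score each candidate action by the expected reduction in entropy of the belief over articulation models. The particle filter and the sensor/manipulation models of Chapter 4 are not reproduced here; `simulate_outcome` and `update_belief` below are hypothetical placeholders for them, assumed to be supplied by the caller.

```python
import numpy as np

def entropy(weights):
    """Shannon entropy of a normalized belief (e.g. probability per model hypothesis)."""
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]
    return float(-(w * np.log(w)).sum())

def expected_information_gain(belief, action, simulate_outcome, update_belief):
    """Expected entropy reduction of the belief after executing `action`.

    `belief` maps model hypotheses (e.g. 'rotational', 'prismatic', 'free-body',
    'rigid') to probabilities; `simulate_outcome(belief, action)` returns a
    distribution over discrete observations (e.g. 'moved' / 'did not move').
    Both stand in for the visual and manipulation sensor models of Chapter 4.
    """
    h_prior = entropy(list(belief.values()))
    gain = 0.0
    for obs, p_obs in simulate_outcome(belief, action).items():
        posterior = update_belief(belief, action, obs)   # Bayes update of the belief
        gain += p_obs * (h_prior - entropy(list(posterior.values())))
    return gain

def select_best_action(belief, candidate_actions, simulate_outcome, update_belief):
    """Pick the action expected to be most informative about the articulation model."""
    return max(candidate_actions,
               key=lambda a: expected_information_gain(
                   belief, a, simulate_outcome, update_belief))
```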
In addition to the perceptual goals that can be achieved using an (Inter)active Perception paradigm, the goal for a humanoid robot is usually to complete certain manipulation tasks. We believe that closing the loop between high-level perception and actions enables us to reformulate some classic robotics problems, such as grasping. Instead of using the perception system to estimate the pose that is input to a grasp controller, the robot can leverage continuous interaction with the environment to improve grasping performance. In this work, we demonstrate a regrasping approach that is able to predict the stability of a grasp based on tactile sensory information, and that provides a means for the robot to process and use information about why the grasp has failed. Our grasp adjustment method uses this valuable tactile information to infer a local change of the gripper configuration that will improve grasp stability. Thus, the robot is able to estimate the grasp stability and learn to react in case the grasp would otherwise result in failure. This method shows that traditional indirect perception, which focuses on relative pose estimation of the hand and the object, can be successfully replaced by an Interactive Perception system that directs its perception to the states that are directly related to the successful execution of the task.

Both methods described above tackle manipulation problems by exploiting a forceful interaction with the environment to enhance perception, which in turn eases manipulation. The resulting new type of sensory signal is considered an opportunity to gain more information about the quantity of interest or about the progress of the manipulation task. Furthermore, any prior knowledge of the relation between the interaction and the sensory signal may also be employed for an improved interpretation of the signal. These approaches are subsumed by the term Interactive Perception (IP).

The next problems described in this thesis use a similar paradigm but do not change the environment. In contrast, the set of actions considered by the field of Active Perception (AP) usually results in changes of the sensor pose relative to the environment. Similarly to IP, the actions are exploratory, but they lack forceful interactions. In this thesis, we apply Active Perception approaches to the problems of Multi-Robot Target Tracking and Self-Calibration.

Tracking a moving target has many potential applications in various fields. For example, consider a search-and-rescue scenario where autonomous ground robots are deployed to assist disaster victims, but they are not able to localize themselves in an unknown environment. In this case, one could use flying robots that, due to their higher altitude, can take advantage of, for example, GPS signals to help localize the ground robots. Although it comes at the cost of motion-planning complexity, deploying multiple cooperatively-controlled robots in this setting provides advantages such as increased coverage and robustness to failure. In this thesis, we present two methods that consider the cooperative control of a team of robots to estimate the position of a target using onboard sensing. The approaches explicitly reason about the uncertainty of the target and the team of robots to generate the optimal trajectory for tracking the target.

Another example of an AP method applied to a perception and estimation task is that of self-calibration. In addition to the states directly used for system control, such as position, velocity, and attitude, more recent work also estimates internal states that calibrate the sensor suite of the system (Weiss et al., 2013). These so-called self-calibration states include all information needed to calibrate the sensors against each other – such as the position and attitude of one sensor with respect to another – as well as their intrinsic parameters, such as measurement bias. The advantages of including self-calibration states in the online state estimator are threefold: i) the same implementation of the self-calibrating estimator can be used for different vehicles, ii) the state estimator can compensate for errors in the initialization or after collisions, and iii) the platform does not require any offline calibration routine because it can calibrate itself while operating. In this work, we present a framework that reasons about state observability and optimizes trajectories for self-calibration using the Active Perception paradigm.
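As a toy illustration of what "including self-calibration states in the estimator" means in practice, the sketch below simply augments a filter state with an accelerometer bias and a GPS antenna offset, which are then corrected by the same prediction-update machinery as the navigation states. This is a deliberately simplified, assumption-laden stand-in (no attitude, no gravity compensation, no rotation of the antenna offset, made-up noise values), not the estimator developed in Chapter 2.

```python
import numpy as np

# Augmented state: navigation states plus self-calibration states.
#   [px, py, pz, vx, vy, vz,  b_ax, b_ay, b_az,  p_ix, p_iy, p_iz]
#    position      velocity   accelerometer bias  GPS antenna offset in the IMU frame
NAV, BIAS, OFFSET = slice(0, 6), slice(6, 9), slice(9, 12)

x = np.zeros(12)      # state estimate
P = np.eye(12)        # covariance; calibration states start out uncertain

def predict(x, P, acc_meas, dt, Q):
    """Propagate the augmented state; calibration states follow a random-walk model."""
    x = x.copy()
    acc = acc_meas - x[BIAS]          # bias-corrected acceleration (toy model)
    x[0:3] += x[3:6] * dt             # position integrates velocity
    x[3:6] += acc * dt                # velocity integrates corrected acceleration
    F = np.eye(12)                    # Jacobian of the toy motion model
    F[0:3, 3:6] = np.eye(3) * dt
    F[3:6, 6:9] = -np.eye(3) * dt     # coupling that lets measurements correct the bias
    return x, F @ P @ F.T + Q

def gps_update(x, P, z, R):
    """GPS measures the antenna position: body position plus the unknown offset."""
    H = np.zeros((3, 12))
    H[:, 0:3] = np.eye(3)             # depends on body position ...
    H[:, 9:12] = np.eye(3)            # ... and on the GPS-IMU offset (rotation ignored here)
    y = z - H @ x
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    return x + K @ y, (np.eye(12) - K @ H) @ P

x, P = predict(x, P, acc_meas=np.zeros(3), dt=0.01, Q=1e-4 * np.eye(12))
x, P = gps_update(x, P, z=np.zeros(3), R=0.25 * np.eye(3))
```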
1.2 Learned Representations via Deep Robot Learning

In the second part of the thesis, we focus on learning goal-oriented representations. To this end, we draw on ideas from deep learning, which has been very successful in learning appropriate representations in a number of different fields; these include computer vision (Krizhevsky, Sutskever, & Hinton, 2012), natural language processing (Collobert et al., 2011), speech recognition (Jaitly, Nguyen, Senior, & Vanhoucke, 2012) and speech generation (Oord et al., 2016). Even though deep neural networks have been widely adopted and yield state-of-the-art results in many fields, their application in robotics remains limited. There are, however, many promising ways of applying deep learning in robotics, including reinforcement, imitation and meta learning, which we refer to as deep robot learning. The second part of this document focuses on three approaches that improve different aspects of the current state of the art in robot learning.

We begin by presenting an approach that addresses one of the main limitations of current deep reinforcement learning (DRL) approaches: their sample complexity. The current state-of-the-art DRL methods are very data-inefficient, requiring millions of interactions with the environment (Schulman, Moritz, Levine, Jordan, & Abbeel, 2016), which makes them very difficult to apply to a real robotic system. To that end, we introduce two modifications that significantly improve the data-efficiency of DRL algorithms. First, we employ a simplified policy class - time-varying linear Gaussian controllers - which allows us to use sample-efficient optimal control algorithms. In addition, we can further combine our method with guided policy search (Levine, Finn, Darrell, & Abbeel, 2016a) to train arbitrary parameterized policies such as deep neural networks, hence preventing the loss of policy expressiveness. Second, we combine the advantages of model-based and model-free learning, allowing our algorithm to handle complex, unknown dynamical systems while maintaining data-efficient learning. In particular, we use a model-based algorithm based on the linear-quadratic regulator that can be integrated into the model-free framework of path-integral policy improvement. Our simulation and real-world experiments demonstrate that this method can solve challenging manipulation tasks with comparable (or better) performance than model-free DRL methods while improving the sample efficiency by up to two orders of magnitude.

Next, we present an approach that tackles the problem of limited scalability of current imitation learning methods that use deep representations. Imitation learning has traditionally been applied to learn a single task from demonstrations thereof. The requirement of structured and isolated demonstrations limits the scalability of imitation learning approaches, as they are difficult to apply to real-world scenarios in which robots must be able to execute a multitude of tasks. In this thesis, we present a multi-modal imitation learning framework that is able to segment and imitate skills from unlabelled and unstructured demonstrations by learning skill segmentation and imitation jointly. Our results indicate that the presented technique can separate the demonstrations into sensible individual skills and imitate these skills using a learned multi-modal policy. We show applications of the presented method to the tasks of skill segmentation, hierarchical reinforcement learning and multi-modal policy learning.
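To make the simplified policy class mentioned earlier in this section concrete, the sketch below samples actions from a time-varying linear-Gaussian (TVLG) controller. The state and action dimensions are arbitrary placeholders, and the gains are left at zero; in Chapter 6 they would come from the LQR-FLM and PI2 updates, which are not reproduced here.

```python
import numpy as np

class TimeVaryingLinearGaussianPolicy:
    """u_t ~ N(K_t x_t + k_t, Sigma_t): one linear feedback law per time step."""

    def __init__(self, K, k, Sigma):
        self.K, self.k, self.Sigma = K, k, Sigma   # shapes: (T,du,dx), (T,du), (T,du,du)

    def act(self, t, x, rng):
        mean = self.K[t] @ x + self.k[t]
        return rng.multivariate_normal(mean, self.Sigma[t])

# Hypothetical dimensions: a 14-dimensional arm state and a 7-dimensional torque command.
T, dx, du = 100, 14, 7
rng = np.random.default_rng(0)
policy = TimeVaryingLinearGaussianPolicy(
    K=np.zeros((T, du, dx)),                   # feedback gains (set by the RL update)
    k=np.zeros((T, du)),                       # open-loop terms
    Sigma=np.stack([0.01 * np.eye(du)] * T),   # per-step exploration noise
)

x = np.zeros(dx)
for t in range(T):
    u = policy.act(t, x, rng)   # sample an exploratory action for this time step
    # x = env.step(u)           # the environment step is omitted in this sketch
```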
In the last part of the thesis, we present a novel approach to deep reinforcement learning in robotics that addresses the challenge of representing and reusing previously-learned skills to solve more complex tasks. In order to represent the skills parameterized by deep-neural-network policies, we introduce a skill embedding space. The learned policies, when conditioned on a sample from the skill embedding space, execute a particular version of a skill that was used to solve a previously-seen task. We can take advantage of these skills to rapidly solve new tasks, effectively by solving the control problem in the embedding space rather than in the raw action space. Our formulation draws on a connection between reinforcement learning and variational inference and is a principled and general scheme for learning hierarchical stochastic policies. We find that our method allows for the discovery of multiple solutions and is capable of learning the minimum number of distinct skills that are necessary to solve a given set of tasks. In addition, our results indicate that the proposed technique can interpolate and/or sequence previously learned skills in order to accomplish more complex tasks, even in the presence of sparse rewards.
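The core interface of such a skill embedding can be sketched as follows. The tiny randomly-initialized networks, dimensions, and sampling scheme are illustrative assumptions only; the actual architectures and the variational training procedure of Chapter 8 are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Randomly initialized MLP weights; in Chapter 8 these would be trained networks."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

obs_dim, act_dim, emb_dim, num_tasks = 20, 7, 3, 4

# Embedding network: one-hot task id -> distribution over latent skill vectors z.
embed_net = mlp([num_tasks, 32, 2 * emb_dim])          # outputs mean and log-std
# Skill policy: (observation, z) -> action. Once trained it is kept fixed, and a
# higher-level policy only has to output z, i.e. control happens in embedding space.
skill_policy = mlp([obs_dim + emb_dim, 64, act_dim])

def sample_embedding(task_id):
    one_hot = np.eye(num_tasks)[task_id]
    out = forward(embed_net, one_hot)
    mean, log_std = out[:emb_dim], out[emb_dim:]
    return mean + np.exp(log_std) * rng.normal(size=emb_dim)

def act(obs, z):
    return forward(skill_policy, np.concatenate([obs, z]))

z = sample_embedding(task_id=1)                 # pick a skill variant for task 1
action = act(rng.normal(size=obs_dim), z)       # execute it for the current observation
```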
This paradigm requires the agent to reason about so called Expectation-Action tuples to select the next best action. The expected result of the action can be confirmed by its execution. In the following chapters, we present different examples of how AP can be applied to the real-robot challenges that use trajectory optimization to actively improve state estimation. The hereby presented approaches explicitly reason about the uncertainty or the observability of a robotic system, and they generate the optimal trajectory to help solve the perceptual challenges and to minimize the system’s unpredictability. These two chapters are based on our previously-published work (Hausman, Preiss, Sukhatme, & Weiss, 2016) and (Hausman, Kahn, et al., 2016). 9 Chapter 2 Observability-Aware Trajectory Optimization for Self-Calibration We study the nonlinear observability of a system’s states in view of how well they are observable and what control inputs would improve the convergence of their estimates. We use these insights to develop an observability-aware trajectory-optimization framework for nonlinear systems that produces trajectories well suited for self-calibration. Common trajectory-planning algorithms tend to generate motions that lead to an unobservable subspace of the system state, causing suboptimal state estimation. We address this problem with a method that reasons about the quality of observability while respecting system dynamics and motion constraints to yield the optimal trajectory for rapid convergence of the self-calibration states (or other user-chosen states). Experiments performed on a simulated quadrotor system with a GPS-IMU sensor suite demonstrate the benefits of the optimized observability-aware trajectories when compared to a covariance-based approach and multiple heuristic approaches. Our method is∼80x faster than the covariance-based approach and achieves better results than any other approach in the self-calibration task. We applied our method to a waypoint navigation task and achieved a∼2x improvement in the integrated RMSE of the global position estimates and∼4x improvement in the integrated RMSE of the GPS-IMU transformation estimates compared to a minimal-energy trajectory planner. 2.1 Introduction State estimation is a core capability for autonomous robots. For any system, it is desirable to estimate the state at any point in time as accurately as possible. Accurate state estimation is crucial for robust control strategies and serves as the foundation for higher-level planning and perception. In addition to the states directly used for system control, such as position, velocity, and attitude, more recent work also estimates internal states that calibrate the sensor suite of the system (Weiss et al., 2013). These so-called self-calibration states include all information needed to calibrate the sensors against each other – such as the position and attitude of one sensor with respect to another – as well as their intrinsic parameters such as measurement bias. In general, these states could be estimated a priori using offline calibration techniques. The advantages of including self-calibration states in the online state estimator are threefold: i) the 10 same implementation of the self-calibrating estimator can be used for different vehicles ii) the state estimator can compensate for errors in the initialization or after collisions and iii) the platform does not require any offline calibration routine because it can self-calibrate itself while operating. 
Including self-calibration states in the estimator has many advantages, but it comes at an important cost: the dimensionality of the state vector increases while the number of measurements remains unchanged. This cost often leads to the requirement of engineered system inputs to render all states observable, i.e. the system needs a tailored trajectory that might require extra time or energy compared to an observability-unaware trajectory. This requirement holds true not only for the initial self-calibration but also throughout the mission. Usually, the self-calibration motion is executed by an expert operator who controls the vehicle and continuously checks if the states have converged to reasonable values. During the mission, the expert operator must take care to excite the system sufficiently to keep all states observable. More importantly, in autonomous missions, trajectory-planning algorithms that minimize energy use may generate trajectories that lead to an unobservable subspace of the system state. In this work, we present a framework that optimizes trajectories for self-calibration. The resulting trajectories avoid unobservable subspaces of the system state during the mission. We develop a cost function that explicitly addresses the quality of observability of system states. Our method takes into account motion constraints and yields an optimal trajectory for fast convergence of the self-calibration states or any other user-chosen states. The presented theory applies to any (non)linear system and it is not specific to a particular realization of the state estimator such as KF, EKF, or UKF. While past approaches have focused on analyzing the environment to compute where to move to obtain informative measurements for state estimation (Bry & Roy, 2011), we assume the presence of accurate measurements * and focus on how to move to generate motions that render the full state space observable. In order to evaluate our method, we conduct several experiments using a simulated Unmanned Aerial Vehicle (UA V) with GPS-IMU sensor suite, as this is a common scenario that illuminates the problem of self-calibration. An example of the performance of the self-calibration framework is presented in Fig. 2.1, where an optimized trajectory outperforms common calibration heuristics usually chosen by experts in terms of speed and accuracy of the state convergence. The key contributions of our approach are: • we present a method that is able to predict the quality of state estimation based on the vehicle’s ego-motion rather than on the perceived environment; the method takes into account system dynamics, measurements and the nonlinear observability analysis. • Our method is carried out on the nonlinear continuous system without making any state- estimator-specific assumptions. • We demonstrate a full self-calibration-based trajectory optimization framework that is readily adjustable for any dynamical system and any set of states of the system. * Advanced sensors and state estimators are nowadays able to obtain accurate measurements in a large variety of different environments (M. Li, Kim, & Mourikis, 2013) 11 -20 0 -50 -50 z[m] 20 40 y[m] x[m] 0 0 50 50 -20 0 -50 -50 z[m] 20 40 x[m] y[m] 0 0 50 50 -20 0 -50 -50 z[m] 20 40 y[m] x[m] 0 0 50 50 time [s] RMSE b ω [rad/s] b a [m/s 2 ] p ip [m] time [s] RMSE time [s] RMSE Figure 2.1: Root Mean Squared Error (RMSE) convergence of EKF self-calibration (accelerometer and gyroscope biases b a , b ! 
and position of the GPS sensor p p i ) state estimates for a) figure-eight trajectory, b) star trajectory, c) optimal trajectory from our method. We introduced additional yaw motion for a) and b) trajectories in order to improve state estimation of these heuristics. 12 • We show that the observability-aware trajectory optimization can be also used for the waypoint navigation task which results in more accurate state estimation. 2.2 Related Work Previous work on improving state estimation of a system has mainly focused on analyzing the environment to choose informative measurements (Bryson & Sukkarieh, 2004; Julian, Angermann, Schwager, & Rus, 2012). With the arrival of robust visual-inertial navigation solutions (e.g. Google Tango † ), sufficiently accurate measurements can be obtained most of the time during a mission without special path reasoning. These results give us the opportunity to shift focus towards other aspects of the trajectory, in particular, its suitability for self-calibration. In (Maye, Sommer, Agamennoni, Siegwart, & Furgale, 2015), the authors find the best set of measurements from a given trajectory to calibrate the system. Unobservable parameters are locked until the trajectory has sufficient information to make them observable. The analysis is performed on the linearized system and analyzes a given trajectory rather than generating an optimal convergence trajectory. (Gao, Zhang, & Wang, 2015) analyzed a specific marine system and developed a trajectory to calibrate it based on heuristics. The approach is not generally applicable to other systems. Other approaches analyze the final covariance of the system when simulating it on a test trajectory: (Martinelli & Siegwart, 2006) maximize the inverse of the covariance at the final time step and use this cost in an optimization procedure. Similarly, in (Achtelik, Weiss, Chli, & Siegwart, 2013) the authors sample a subset of the state space with a Rapidly Exploring Random Tree approach (Bry & Roy, 2011) and optimize for the final covariance of the system. These approaches are sample-based techniques discretizing their environment and state space. The discretization and linearization steps induce additional errors and may lead to wrong results similar to the well-known rank issue when analyzing a system in its linearized instead of the nonlinear form (Hesch, Kottas, Bowman, & Roumeliotis, 2013). From the nonlinear observability analysis described in (Hermann & Krener, 1977), (Krener & Ide, 2009) develop a measure of observability rather than only extracting binary information on the observability of a state. (Hinson & Morgansen, 2013) make use of this measure to generate trajectories that optimize the convergence of states that are directly visible in the sensor model. We make use of this definition and extend the approach to analyze the quality of observability of states that are not directly visible in the sensor model. This way, we can also generate motions leading to trajectories that optimize the convergence of, e.g., IMU biases. Trajectory optimization has been successfully used in many different applications including robust perching for fixed-wing vehicles (Moore, Cory, & Tedrake, 2014), locomotion for hu- manoids (Kuindersma et al., 2015) and manipulation tasks for robotic arms (Levine, Wagener, & Abbeel, 2015a). Since we evaluate our system using a model of an Unmanned Aerial Vehicle (UA V), we present the related work on the trajectory optimization in this area. 
† https://www.google.com/atap/project-tango/

(Mellinger & Kumar, 2011) use trajectory optimization of differentially flat variables of a quadrotor to obtain minimum-snap trajectories. (Richter, Bry, & Roy, 2013) presented an extension of this approach to generate fast quadrotor paths in cluttered environments using an unconstrained QP. In (Müller & Sukhatme, 2014), the authors generate risk-aware trajectories using the formulation developed by (Van Den Berg, Abbeel, & Goldberg, 2011) with the goal of safe quadrotor landing. (Hausman, Müller, Hariharan, Ayanian, & Sukhatme, 2015) use the framework of trajectory optimization to generate controls for multiple quadrotors to track a mobile target. Finally, (Moore et al., 2014) use LQR-trees to optimize for a trajectory that leads to robust perching for a fixed-wing vehicle. In this work, we use trajectory optimization to generate paths for self-calibration, which are evaluated on a simulated quadrotor system. Our work augments other trajectory-optimization-based approaches by providing an observability-aware cost function that can be used in combination with other optimization objectives.

2.3 Problem Formulation and Fundamentals

2.3.1 Motion and Sensor Models

We assume the following nonlinear system dynamics, i.e. the motion model: $\dot{x} = f(x, u, \eta)$, where $x$ is the state, $u$ are the control inputs and $\eta$ is noise caused by non-perfect actuators and modelling errors. For sensory output we use the nonlinear sensor model: $z = h(x, \nu)$, where $z$ is the sensor reading and $\nu$ is the sensor noise. It is often the case that there are certain elements in the state vector $x$, such as sensor biases, that stay constant according to the motion model, i.e. their values are independent of the controls $u$ and other states. In this work, we will call these entities self-calibration states $x_{sc}$.

2.3.2 Nonlinear Observability Analysis

The observability of a system is defined as the possibility to compute the initial state of the system given the sequence of its inputs $u(t)$ and measurements $z(t)$. A system is said to be globally observable if there exist no two points $x_0(0)$, $x_1(0)$ in the state space with the same input-output $u(t)$-$z(t)$ maps for any control inputs. A system is weakly locally observable if there is no point $x_1(0)$ with the same input-output map in a neighborhood of $x_0(0)$ for a specific control input. Observability of linear as well as nonlinear systems can be determined by performing a rank test, where the system is observable if the rank of the observability matrix is equal to the number of states. In the case of a nonlinear system, the nonlinear observability matrix is constructed using the Lie derivatives of the sensor model $h$. Lie derivatives are defined recursively under a zero-noise assumption. The 0-th Lie derivative is the sensor model itself, i.e. $\mathcal{L}^0_h = h(x)$; the next Lie derivative is constructed as:
$$\mathcal{L}^{i+1}_h = \frac{\partial}{\partial t}\mathcal{L}^i_h = \frac{\partial \mathcal{L}^i_h}{\partial x}\frac{\partial x}{\partial t} = \frac{\partial \mathcal{L}^i_h}{\partial x} f(x, u).$$
One can observe that Lie derivatives with respect to the sensor model are equivalent to the respective time derivatives of the sensory output $z$:
$$\dot{z} = \frac{\partial}{\partial t} z(t) = \frac{\partial}{\partial t} h(x(t)) = \frac{\partial h}{\partial x}\frac{\partial x}{\partial t} = \frac{\partial h}{\partial x} f(x, u) = \mathcal{L}^1_h.$$
By continuing to compute the respective Lie derivatives one can form the matrix:
$$\mathcal{O}(x, u) = \begin{bmatrix} \nabla\mathcal{L}^0_h & \nabla\mathcal{L}^1_h & \nabla\mathcal{L}^2_h & \dots \end{bmatrix}^T,$$
where $\nabla\mathcal{L}^0_h = \frac{\partial \mathcal{L}^0_h}{\partial x}$ and $\dot{z} = \frac{\partial z}{\partial t}$. The matrix $\mathcal{O}(x, u)$ formed from the sensor model and its Lie derivatives is the nonlinear observability matrix.
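To make the recursion concrete, the sketch below (not from the thesis) uses sympy to build the Lie derivatives of a toy double-integrator with a position-only sensor, stacks their gradients into the nonlinear observability matrix, and applies the rank test; all helper names are illustrative only.

```python
# Toy sketch (not from the thesis): recursive Lie derivatives of a sensor model h
# along the dynamics f, stacked into the nonlinear observability matrix O(x, u).
import sympy as sp

p, v, u = sp.symbols('p v u')
x = sp.Matrix([p, v])            # state: position and velocity
f = sp.Matrix([v, u])            # dynamics x_dot = f(x, u): double integrator
h = sp.Matrix([p])               # sensor model z = h(x): position-only measurement

def lie_derivatives(h, f, x, order):
    """[L^0_h, ..., L^order_h] with L^0_h = h and L^{i+1}_h = (dL^i_h/dx) f."""
    L = [h]
    for _ in range(order):
        L.append(L[-1].jacobian(x) * f)
    return L

def observability_matrix(h, f, x, order):
    """Stack the gradients: O = [grad L^0_h; grad L^1_h; ...]."""
    return sp.Matrix.vstack(*[Li.jacobian(x) for Li in lie_derivatives(h, f, x, order)])

O = observability_matrix(h, f, x, order=2)
print(O, O.rank())   # rank 2 == number of states -> weakly locally observable here
```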
Following (Hermann & Krener, 1977), if the observability matrix has full column rank, then the state of the nonlinear system is weakly locally observable. Unlike linear systems, nonlinear observability is a local property that is input- and state-dependent. It is worth noting that the observability of the system is a binary property and does not quantify how well observable the system is. This limits its utility for gradient-based methods. We address this issue in the next section.

2.4 Quality of Observability

Following (Krener & Ide, 2009) and according to the definition presented in Sec. 2.3.2, we introduce the notion of quality of observability. A state is said to be well observable if the system output changes significantly when the state is marginally perturbed (Weiss, 2012). A state with this property is robust to measurement noise and it is highly distinguishable within some proximity where this property holds. Conversely, a state that leads to a small change in the output, even though the state value was extensively perturbed, is defined as poorly observable. In the limit, the measurement does not change even if we move the state value through its full range. In this case, the state is unobservable (Hermann & Krener, 1977).

2.4.1 Taylor Expansion of the Sensor Model

In order to model the variation of the output in relation to a perturbation of the state, we approximate the sensor model using the $n$-th order Taylor expansion about a point $t_0$:
$$h_{t_0}(x(t), u(t)) = \sum_{i=0}^{n} \frac{(t - t_0)^i}{i!}\, h_i(x(t_0), u(t_0)),$$
where $h_{t_0}$ represents the Taylor expansion of $h$ about $t_0$ with the following Taylor coefficients $h_0, h_1, \dots, h_n$:
$$h_0(x(t_0), u(t_0)) = h(x(t_0), u(t_0)) = \mathcal{L}^0_h(x(t_0), u(t_0))$$
$$h_1(x(t_0), u(t_0)) = \frac{\partial}{\partial t}\big(h(x(t_0), u(t_0))\big) = \mathcal{L}^1_h(x(t_0), u(t_0))$$
$$\vdots$$
$$h_n(x(t_0), u(t_0)) = \mathcal{L}^n_h(x(t_0), u(t_0)).$$
Using this result, one can also approximate the state derivative of the sensor model $\frac{\partial}{\partial x} h(x(t), u(t))$. For brevity, we introduce the notation $\Delta t = t - t_0$, $h_{t_0}(\Delta t) = h_{t_0}(x(t), u(t))$, and we omit the arguments of the Lie derivatives:
$$\frac{\partial}{\partial x} h_{t_0}(\Delta t) = \sum_{i=0}^{n} \frac{\Delta t^i}{i!}\, \nabla\mathcal{L}^i_h.$$
This result in matrix form is:
$$\frac{\partial}{\partial x} h_{t_0}(\Delta t) = \begin{bmatrix} I & \Delta t\, I & \frac{\Delta t^2}{2} I & \dots & \frac{\Delta t^n}{n!} I \end{bmatrix} \mathcal{O}(x(t), u(t)), \qquad (2.1)$$
where $\mathcal{O}(x(t), u(t))$ is the nonlinear observability matrix. Eq. 2.1 describes the Jacobian of the sensor model $h$ with respect to the state $x$ around the time $t_0$. Using this Jacobian, we are able to predict the change of the measurement with respect to a small perturbation of the state. This prediction not only incorporates the sensor model but it also implicitly models the dynamics of the system via high order Lie derivatives. Hence, in addition to showing the effect of the states that directly influence the measurement, Eq. 2.1 also reveals the effects of the varying control inputs and the states that are not included in the sensor model. This will prove useful in Sec. 2.4.3.

2.4.2 Observability Gramian

In addition to the change in the output with respect to the state perturbation, one needs to take into account the fact that different states can have different influence on the output. Thus, a large effect on the output caused by a small change in one state can swamp a similar effect on the output caused by a different state and therefore weaken its observability.
In order to model these interactions, following (Krener & Ide, 2009), we employ the local observability Gramian:
$$W_o(0, T) = \int_0^T \Phi(0, t)^T H(t)^T H(t)\, \Phi(0, t)\, dt, \qquad (2.2)$$
where $\Phi(0, t)$ is the state transition matrix (see (Krener & Ide, 2009) for details), $H(t)$ is the Jacobian of the sensor model, $H(t) = \frac{\partial}{\partial x} h(t)$, and the trajectory spans the time interval $t \in [0, T]$. Since a nonlinear system can be approximated by a linear time-varying system by linearizing its dynamics about the current trajectory, one can also use the local observability Gramian for nonlinear observability analysis. If the rank of the local observability Gramian is equal to the number of states, the original nonlinear system is locally weakly observable (Hermann & Krener, 1977). (Krener & Ide, 2009) introduced measures of observability that are based on the condition number or the smallest singular value of the local observability Gramian.

Unfortunately, the local observability Gramian is difficult to compute for any nonlinear system. In fact, it can only be computed in closed form for certain simple nonlinear systems. In order to solve this problem, the local observability Gramian can be approximated numerically by simulating the sensor model for small state perturbations, resulting in the empirical local observability Gramian (Krener & Ide, 2009):
$$W_o \approx \frac{1}{4\epsilon^2} \int_0^T \begin{bmatrix} \Delta z_1^T(t) \\ \vdots \\ \Delta z_n^T(t) \end{bmatrix} \begin{bmatrix} \Delta z_1(t) & \dots & \Delta z_n(t) \end{bmatrix} dt, \qquad (2.3)$$
where $\Delta z_i = z_{+i} - z_{-i}$ and $z_{\pm i}$ is the simulated measurement when the state $x_i$ is perturbed by a small value $\pm\epsilon$. The empirical local observability Gramian in Eq. 2.3 converges to the local observability Gramian in Eq. 2.2 for $\epsilon \to 0$. The main disadvantage of this numerical approximation is that it cannot approximate the local observability Gramian for the states that do not appear in the sensor model. As $\epsilon \to 0$, this approximation replaces the state transition matrix $\Phi(0, t)$ with the identity matrix. This relieves the burden of finding an analytical solution for $\Phi(0, t)$, however, it also eliminates any effects on the local observability Gramian caused by the states that are not in the sensor model. Thus, it becomes difficult to reason about the observability of these states using this approximation. We address this problem in the following section.

2.4.3 Measure of Observability

In order to present the hereby proposed measure of observability concisely, we introduce the following notation:
$$K_{t_0}(\Delta t) = \frac{\partial}{\partial x} h_{t_0}(\Delta t) = \frac{\partial}{\partial x} h_{t_0}(x(t), u(t)).$$
Following the definition of the local observability Gramian, we use the Taylor expansion of the sensor model to approximate the local observability Gramian:
$$W_o(0, T, \Delta t) \approx \int_0^T K_t(t + \Delta t)^T K_t(t + \Delta t)\, dt, \qquad (2.4)$$
where $\Delta t$ is a fixed horizon that enables us to see the effects of the system dynamics. In order to measure the quality of observability we use the smallest singular value of the approximated local observability Gramian $W_o(0, T, \Delta t)$. In contrast to the empirical local observability Gramian, our formulation is able to capture input-output dependencies that are not visible in the sensor model. We achieve this property by incorporating higher order Lie derivatives that are included in the observability matrix. Intuitively, at each time step, we use the Taylor expansion of the sensor model about the current time step $t$ to approximate the Jacobian of the measurement in a fixed time horizon $\Delta t$. We use this approximation to estimate the local observability Gramian, which is integrated over the entire trajectory.
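As a rough numerical sketch of how Eqs. 2.1 and 2.4 could be evaluated along a discretized trajectory: the callable grad_lie_at, which returns the gradients of the Lie derivatives at a given trajectory time, is an assumed interface and not part of the thesis, and the submatrix option anticipates the state-subset scoring described next.

```python
# Sketch (assumed interface `grad_lie_at`): evaluate the Taylor-based Jacobian of
# Eq. 2.1 and accumulate the approximated local observability Gramian of Eq. 2.4,
# scored by its smallest singular value.
import numpy as np
from math import factorial

def taylor_jacobian(grad_lie, dt):
    """d h_t(dt)/dx = sum_i dt^i / i! * grad L^i_h  (Eq. 2.1)."""
    return sum(dt ** i / factorial(i) * G for i, G in enumerate(grad_lie))

def observability_score(grad_lie_at, times, dt_horizon, state_idx=None):
    """grad_lie_at(t) -> [grad L^0_h, ..., grad L^n_h] (arrays, m x n_states).
    Returns the smallest singular value of the integrated Gramian; `state_idx`
    optionally selects a submatrix, e.g. the self-calibration states."""
    n_states = grad_lie_at(times[0])[0].shape[1]
    W = np.zeros((n_states, n_states))
    for t0, t1 in zip(times[:-1], times[1:]):
        K = taylor_jacobian(grad_lie_at(t0), dt_horizon)
        W += K.T @ K * (t1 - t0)              # rectangle-rule integration
    if state_idx is not None:
        W = W[np.ix_(state_idx, state_idx)]
    return np.linalg.svd(W, compute_uv=False)[-1]   # smallest singular value
```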
In order to measure the observability of a subset of the states, one can use the smallest singular value of the submatrix of the local observability Gramian that includes only the states of interest. We use this property to focus on different self-calibration states of the system. 2.5 Trajectory Representation and Optimization 2.5.1 Differentially Flat Trajectories In order to efficiently represent trajectories, we consider differentially flat systems (Van Nieuwstadt & Murray, 1997). A system is differentially flat if all of its inputs x;u can be represented as a function of flat outputs y and their finite derivatives ~ y, i.e.: x=(y; _ y; y;:::; (n) y )=(~ y) u= (y; _ y; y;:::; (m) y )= (~ y): For the rest of this work, we express robot trajectories as flat outputs because this is the minimal representation that enables us to deduce the state and controls of the system over time. 2.5.2 Constrained Trajectory Representation using Piecewise Polynomials Similar to (M¨ uller & Sukhatme, 2014), we represent a trajectory by ak-dimensional,d-degree piecewise polynomial, composed ofq pieces: y(t)= ⎧ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎩ P 1 t(t) ift 0 ≤t<t 1 ⋮ P q t(t) ift q−1 ≤t≤t q ; 18 whereP i is thek×(d+ 1) matrix of polynomial coefficients for theith polynomial piece, and t is the time vector, i.e.: t(t)= t 0 t 1 ::: t d T : We formulate constraints on the initial and final positions and derivatives of a trajectory as a system of linear equations (M¨ uller & Sukhatme, 2014). For example: c 1 = y(0)=P 1 t(0) c 2 = _ y(0)=P 1 _ t(0) ⋮ c f = (n) y (T)=P q (n) t (T); where c 1 ;c 2 ;:::;c f are the trajectory constraints and _ t is the trivial derivative _ y(t)=P i _ t(t). In matrix form, the initial constraints appear as: c 1 c 2 :::=P 1 t(0) _ t(0) ::: (2.5) and the final constraints are defined simlarly. In addition to start and end constraints, a physically plausible trajectory must be continuous up to the-th derivative: P 1 t(t 1 )=P 2 t(t 1 )⋯P 1 () t (t 1 )=P 2 () t (t 1 ) ⋮ P q−1 t(t q−1 )=P q t(t q−1 )⋯P q−1 () t (t q−1 )=P q () t (t q−1 ): To compactly express the evaluation of a polynomial and its first derivatives at a point in time, we define the time matrix: T i = t(t i ) _ t(t i ) ::: () t (t i ) : We may thus express the smoothness constraints as a banded linear system: P 1 ::: P q ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ T 1 −T 1 T 2 ⋱ ⋱ −T q−2 T q−1 −T q−1 ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ = 0: (2.6) If we add equations in the form of Eq. 2.5 for the initial and final constraints, the resulting linear system completely expresses the problem constraints. 19 With an appropriately high polynomial degreed, Eq. 2.6 forms an underdetermined system. Therefore, we can use the left null space of the constraint matrix as the optimization space. Any linear combination of the left null space may be added to the particular solution to form a different polynomial that still satisfies the trajectory constraints. For the particular solution, we use the minimum-norm solution from the Moore-Penrose pseudoinverse, which gives smooth trajectories. The described change of variables significantly reduces the dimensionality of the optimization problem and eliminates all equality constraints. The only remaining constraints are nonlinear inequalities related to the physical limits of the motor torques, etc., required to execute the path. 2.5.3 Numerical Stability for Constrained Trajectory The linear system presented in the previous section tends to be ill-conditioned. Its condition number grows exponentially with the polynomial degree (V. Y . 
Pan, 2015) and it is exacerbated by longer time intervals in the polynomial pieces. Using the formulation from (Mellinger & Kumar, 2011), we scale the time duration of the problem such that each polynomial piece lies in a time interval of ≤ 1 second. This produces a better-conditioned matrix whose solution can easily be converted back into a solution to the original problem. 2.5.4 The Optimization Objective The goal of this work is to find a trajectory that will provide an optimal convergence of the self- calibration parameters of a nonlinear system. In order to achieve this goal, we aim at minimizing the cost function of the following form: arg min ~ y(0);:::;~ y(T) o(~ y(t)); whereo(~ y(t)) is the observability-dependent cost that is directly related to the convergence of the self-calibration states in the estimator. In case of the hereby presented measure of quality of observability,o(~ y(t)) is: o(~ y(t))= min (W o (0;T; t)); where min (W o (0;T; t)) is the minimum singular value of the approximated local observability GramianW o described in Eq. 2.4. To the best of our knowledge, the only other cost function that reflects the convergence of the states of the system is based on the EKF covariance. Minimizing the trace of the covariance results in minimizing the uncertainty about the state for all of its individual dimensions (Beinhofer, Muller, Krause, & Burgard, 2013) and yields better results than optimizing its determinant (i.e. mutual information) (Hausman, M¨ uller, et al., 2015). Therefore, we employ the covariance-trace cost function that integrates the traces of the covariance submatrices that are responsible for the self-calibration states. We use this method as one of the baselines for our approach. 20 As described in Sec. 2.5.2, introducing the new constrained trajectory representation enables us to pose trajectory optimization as an unconstrained optimization problem and reduce its dimen- sionality. However, in order to ensure the physical plausibility of the trajectory, we still need to optimize it subject to physical limits of the system. We represent the physical inequality constraints as nonlinear functions of the differentially flat variables. For optimization we use the implementation of the Sequential Quadratic Programming (SQP) method with nonlinear inequality constraints from Matlab Optimization Toolbox. 2.6 Example Application to UA Vs with IMU-GPS State Estimator We demonstrate the presented theory on a simulated quadrotor with a 3-DoF position sensor (e.g. GPS) and a 6-DoF inertial measurement unit (IMU). This is a simple, widely popular sensor suite, but it presents a challenging self-calibration task, as there is limited intuition for what kind of trajectory would make the states well observable. Although we present experiments for the quadrotor, we emphasize that the presented theory can be applied to a variety of nonlinear systems. 2.6.1 EKF for IMU-GPS Sensor Suite As a realization of the state estimator of the quadrotor, we employ the popular Extended Kalman Filter (EKF). The EKF continuously estimates state values by linearizing the motion and sensor model around the current mean of the filter. It recursively fuses all controls u 1∶k and sensor readings z 1∶k up to timek and maintains the state posterior probability: p(x k S z 1∶k ;u 1∶k )=N (^ x k ; k ) as a Gaussian with mean ^ x k and covariance k . 
In particular, we use the indirect formulation of an iterated EKF (Lynen, Achtelik, Weiss, Chli, & Siegwart, 2013) where the state prediction is driven by IMU measurements. We choose this state estimator due to its ability to work with various sensor suites and its proven robustness in the quadrotor scenario. The state consists of the following:
$$x^T = [\,p_w^{i\,T},\; v_w^{i\,T},\; q_w^{i\,T},\; b_\omega^T,\; b_a^T,\; p_i^{p\,T}\,], \qquad (2.7)$$
where $p_w^i$, $v_w^i$ and $q_w^i$ are the position, velocity and orientation (represented as a quaternion) of the IMU in the world frame, $b_\omega$ and $b_a$ are the gyroscope and accelerometer biases, and $p_i^p$ is the relative position between the GPS module and the IMU. The state is governed by the following differential equations:
$$\dot{p}_w^i = v_w^i$$
$$\dot{v}_w^i = C_{(q_w^i)}^T (a_m - b_a - n_a) - g$$
$$\dot{q}_w^i = \tfrac{1}{2}\,\Omega(\omega_m - b_\omega - n_\omega)\, q_w^i$$
$$\dot{b}_\omega = n_{b\omega}, \qquad \dot{b}_a = n_{ba}, \qquad (2.8)$$
where $C_{(q)}$ is the rotation matrix obtained from the quaternion $q$, $\Omega(\omega)$ is the quaternion multiplication matrix of $\omega$, $a_m$ is the measured acceleration, and $\omega_m$ is the measured angular velocity, with white Gaussian noise $n_a$ and $n_\omega$. Since the IMU biases can change over time, they are modeled as random processes where $n_{b\omega}$ and $n_{ba}$ are assumed to be zero-mean Gaussian random variables.

Starting from the initial state defined in Eq. 2.7, we define the error state as:
$$\tilde{x}^T = [\,\Delta p_w^{i\,T},\; \Delta v_w^{i\,T},\; \delta\theta_w^{i\,T},\; \Delta b_\omega^T,\; \Delta b_a^T,\; \Delta p_i^{p\,T}\,],$$
where $\tilde{x}$ is the error between the real state value $x$ and the state estimate $\hat{x}$. For quaternions the error state is defined as $\delta q = q \otimes \hat{q}^{-1} \approx [1 \;\; \tfrac{1}{2}\delta\theta^T]^T$. In this setup, the self-calibration error states $\tilde{x}_{sc}$ are the gyroscope and accelerometer biases $b_\omega$, $b_a$ and the position of the GPS sensor in the IMU frame $p_i^p$:
$$\tilde{x}_{sc}^T = [\,\Delta b_\omega^T,\; \Delta b_a^T,\; \Delta p_i^{p\,T}\,].$$
Using the IMU-GPS state vector in Eq. 2.7, the system dynamics in Eq. 2.8, and assuming the connection between the IMU and the GPS sensor is rigid, we define the GPS sensor model as:
$$z_{gps} = h(x, n_{z_{gps}}) = p_w^i + C_{(q_w^i)}^T\, p_i^p + n_{z_{gps}},$$
where $n_{z_{gps}}$ is white Gaussian measurement noise. The nonlinear observability analysis in (Kelly & Sukhatme, 2011) and (Weiss, 2012) shows that the system is fully observable with appropriate inputs.

2.6.2 Differentially Flat Outputs and Physical Constraints of the System

As shown by (Mellinger & Kumar, 2011), the quadrotor dynamics are differentially flat. This means that a quadrotor can execute any smooth trajectory in the space of flat outputs as long as the trajectory respects the physical limitations of the system. The flat outputs are the $x, y, z$ position and yaw $\psi$:
$$y = [x, y, z, \psi]^T.$$
The remaining extrinsic states, i.e. the roll and pitch angles, are functions of the flat outputs and their derivatives. In order to ensure that trajectories are physically plausible we place inequality constraints on 3 entities: the thrust-to-weight ratio ($\leq 1.5$), angular velocity ($\leq \pi$ rad/s), and angular acceleration ($\leq 5$ rad/s$^2$). These values are rough estimates for a small-size quadrotor.

2.7 Experimental Results

2.7.1 Experimental Setup

We evaluate the proposed method in simulation using the quadrotor described in Sec. 2.6. The simulation environment enables extensive testing with ground-truth self-calibration states that would not be possible for a real robot. We represent trajectories as degree-6 piecewise polynomials with continuity up to the 4th derivative. In all experiments, we require trajectories with zero velocity and acceleration at the beginning and end points.
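For illustration, the GPS sensor model of Sec. 2.6.1 can be written out as below; the quaternion ordering [w, x, y, z] and the rotation-matrix convention are assumptions of this sketch, not a statement about the implementation used in the thesis.

```python
# Sketch of the GPS measurement model: the predicted GPS reading is the IMU
# position plus the body-frame GPS offset rotated into the world frame.
import numpy as np

def quat_to_rot(q):
    """Rotation matrix C(q) for a unit quaternion q = [w, x, y, z] (assumed order)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def h_gps(p_w_i, q_w_i, p_i_p, noise=None):
    """z_gps = p_w_i + C(q_w_i)^T p_i_p (+ noise), mirroring the formula above."""
    z = p_w_i + quat_to_rot(q_w_i).T @ p_i_p
    return z if noise is None else z + noise
```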
The quadrotor has a GPS sensor that is positioned p p i = [0:1 0:1 0:1] T m away from the IMU and produces measurements with standard deviation of 0.2m. The accelerometer and the gyroscope have initial biases of b a = [0:05 0:05 0:05] T m/s 2 and b ! = [0:01 0:01 0:01] T rad/s respectively. These are common values for real quadrotor systems that we have used. The initial belief is that all the self-calibration states are zero. Thus, a bad self-calibration trajectory will fail to converge the state estimate of the system. 2.7.2 Evaluation of Various Self-Calibration Routines To evaluate the influence of choosing different states to construct the local observability Gramian, we compared two optimization objectives: i) the local observability Gramian constructed using the position states with the p p i states, and ii) the position states with the b a states. Initial tests showed that b ! converges quickly for almost any trajectory, so we did not include it in the evaluation. For the self-calibration task we require trajectories to start and end at the same position. We generated random trajectories by sampling a zero-mean Gaussian distribution for each optimization variable, i.e. each component of the left null space of the piecewise polynomial constraint matrix (Eq. 2.6). We then used each random trajectory as an initial condition for nonlinear optimization to produce an optimized trajectory. Fig. 2.2 shows the optimization results using both objectives. The optimized trajectories significantly outperformed the randomly generated ones. The two self-calibration states p p i and b a are co-related in our system, i.e. optimizing for one state also leads to improved performance on the other state. However, one can observe that the trajectories optimized for b a yield improved b a final RMSE values compared to those of trajectories optimized for p p i , and analogously the trajectories optimized for p p i yield better p p i results than those optimized for b a . Due to the small differences between results in the case of the accelerometer bias b a and the larger difference for the position of the GPS sensor p p i , we chose to conduct further experiments using the p p i objective. Fig. 2.3 shows results from the same experiment for a number of differently constructed trajec- tories. PL-random is a more competitive set of random trajectories generated by choosing larger random null space polynomial weights and discarding trajectories that violated the quadcopter’s 23 −150 −100 −50 0 0 0.01 0.02 0.03 0.04 optimized for p ip cost final RMSE b a [m/s 2 ] −150 −100 −50 0 0 0.02 0.04 0.06 0.08 0.1 0.12 cost final RMSE p ip [m] −8 −6 −4 −2 0 x 10 −3 0 0.01 0.02 0.03 0.04 optimized for b a cost final RMSE b a [m/s 2 ] −8 −6 −4 −2 0 x 10 −3 0 0.02 0.04 0.06 0.08 0.1 0.12 cost final RMSE p ip [m] random our method Figure 2.2: Self-calibration task: results obtained for optimizing for different objectives. Top row: random and optimized results when optimizing for p p i . Bottom row: random and optimized results when optimizing for b a . The left and right column show the final RMSE for b a and p p i respectively. 
24 −150 −100 −50 0 0 0.005 0.01 0.015 cost final RMSE b a [m/s 2 ] −150 −100 −50 0 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 cost final RMSE p ip [m] PL−random our method figure 8 star Figure 2.3: Self-calibration task: final RMSE values for the accelerometer bias b a and the GPS position in the IMU frame p p i obtained using optimization (green) and 3 different heuristics: star and figure eight trajectories from Fig. 2.1 and randomly sampled trajectories that are close of the physical limits of the system. physical limits. The remaining trajectories are therefore likely to contain velocities and accelera- tions that are near the physical limits, which should lead to better observability. Figure 8 and star are the heuristic trajectories presented in Fig. 2.1, and our method are trajectories generated from our optimization framework using the PL-random trajectories as initial conditions. While the star trajectory and some of the PL-random trajectories perform well on b a , our approach outperforms all other methods on p p i . In order to more extensively test the different self-calibration strategies, we collected statistics over 50 EKF simulations for a single representative trajectory from each strategy. Fig. 2.4 summarizes our results in terms of the RMSE integrated over the entire trajectory and the final RMSE for accelerometer bias b a and GPS position p p i . Results show that our approach outperforms all baseline approaches in terms of the final and integrated RMSE of the GPS position p p i . The only method that is able to achieve a similar integrated RMSE value for the GPS position is the covariance-trace-based optimization described in Sec. 2.5.4. However, it takes approximately 13 hours for the optimizer to find that solution, versus approximately 10 minutes with our method. The main reason for this is that in order to estimate the trace of the covariance of the EKF, one needs to perform matrix inversion at every time step, which is more computationally expensive than the integration of the local observability Gramian and the one singular value decomposition used by our approach. The integrated RMSE of the accelerometer bias b a also suggests that our approach is able to make this state converge faster than in other methods. Nevertheless, a few other trajectories such as covariance-trace-based and PL-random were able to perform well in this test. This is also visible in the case of the final RMSE of the accelerometer bias b a where the first four methods yield similar results. While our method is slightly worse than the covariance-trace-based and the two heuristics-based approaches, given the standard deviation of the measurement (0.2m) 25 ∫ RMSE p ip [m] final RMSE p ip [m] ∫ RMSE b a [m/s 2 ] final RMSE b a [m/s 2 ] Figure 2.4: Self-calibration task: statistics collected over 50 runs of the quadrotor EKF using 6 different trajectories: ours - optimized trajectory using the hereby defined observability cost; trace - optimized trajectory using the covariance-trace cost function; PL-random - randomly sampled trajectory that is very close to the physical limits of the system; star, figure 8 - heuristics- based trajectories presented in Fig. 2.1; random - randomly sampled trajectory that satisfies the constraints. Top left: GPS position integrated RMSE, top right: GPS position final RMSE, bottom left: accelerometer bias integrated RMSE, bottom right: accelerometer bias final RMSE. 
and the final RMSE values of the bias being below 0.005 m/s$^2$ (which is less than 10% of the initial bias RMSE), we consider the trajectories from all 4 methods to have converged this estimate.

                    min-snap         our method
$p_w^i$ ∫ RMSE      6.06 ± 1 m       3.69 ± 0.31 m
$p_i^p$ ∫ RMSE      3.40 ± 0.83 m    0.85 ± 0.22 m

Table 2.1: Waypoint navigation task: statistics of the integrated RMSE values for position $p_w^i$ and GPS sensor position in the IMU frame $p_i^p$ collected over 50 runs. Both trajectories take 50 seconds.

2.7.3 Evaluation of an Example Waypoint Trajectory

In addition to the self-calibration task, we applied our method to a waypoint navigation task. With minor extensions, the piecewise polynomial constraint matrix formulation in Eq. 2.6 can satisfy position and derivative constraints along the path in addition to start and endpoint constraints. We compare a trajectory optimized using our method to a minimum-snap trajectory computed using the method from (Mellinger & Kumar, 2011). Fig. 2.5 shows both of the optimized trajectories. The trajectory optimized using our method is much more complex than a simple min-snap trajectory because it aims to yield well-observable states. The results in Tab. 2.1 show that our trajectory yields 4x better GPS sensor position estimates and 2x better position estimates than the min-snap trajectory. We note that even though the observability-aware trajectory is longer and more complex, which makes the state estimation harder, the resulting estimates are still significantly better than those of the min-snap trajectory. This result supports the intuition that sensor calibration can have a significant influence on the estimation of other system states.

2.7.4 Discussion

The presented results indicate that our approach is able to outperform other baselines at the task of estimating the position of the GPS sensor in the IMU frame. However, it yields comparable results with other methods regarding the accelerometer bias. Even though for this simple system one can think of heuristics that perform reasonably well, these may not generalize to more complex systems. The main advantage of the presented framework lies in its generality (it is applicable to any nonlinear system) and in the fact that it can be combined with other objectives as long as they can be represented in the cost function. The waypoint navigation task presents another useful application of our technique, and shows that the influence of the GPS-IMU position estimate on the quality of position estimation can be significant (see Tab. 2.1).

Figure 2.5: Waypoint navigation task: minimum snap trajectory (red) vs. optimized trajectory using our method (green). Position waypoints are shown in black.

2.8 Conclusions

We introduced an observability-aware trajectory optimization framework that is applicable to any nonlinear system and produces trajectories that are well suited for self-calibration. In contrast to existing approaches, our method moves the focus from where to go during a mission to how to achieve the goal while staying well-observable. The presented results, obtained for a simulated quadrotor system with a GPS-IMU sensor suite, demonstrate the benefits of the optimized observability-aware trajectories compared to other heuristics and a covariance-based approach.
For the self-calibration task we were able to achieve almost 2x better final RMSE values for the GPS-IMU position state than all the baseline approaches and comparable converged values for the accelerometer and gyroscope biases. Our method runs∼80x faster than the only other generic baseline approach that is applicable to other systems, and it achieved better results. The presented method was also applied to a waypoint navigation task and achieved almost 2x better integrated RMSE of the position estimate and more than 4x better integrated RMSE of the GPS-IMU position estimate than the minimum snap trajectory. 29 Chapter 3 Occlusion-Aware Multi-Robot Target Tracking in 3D We introduce an optimization-based control approach that enables a team of robots to cooperatively track a target using onboard sensing. In this setting, the robots are required to estimate their own positions as well as concurrently track the target. Our probabilistic method generates controls that minimize the expected uncertainty of the target. Additionally, our method efficiently reasons about occlusions between robots and takes them into account for the control generation. We evaluate our approach in a number of experiments in which we simulate a team of quadrotor robots flying in three-dimensional space to track a moving target on the ground. We compare our method to other state-of-the-art approaches represented by the random sampling technique, lattice planning method, and our previous method. Our experimental results indicate that our method achieves up to 8 times smaller maximum tracking error and up to 2 times smaller average tracking error than the next best approach in the presented scenarios. 3.1 Introduction Tracking a moving target has many potential applications in various fields. For example, consider a search and rescue scenario where autonomous ground robots are deployed to assist disaster victims but they are not able to localize themselves in an unknown environment. In this case, one could use flying robots that due to their higher altitude can take advantage of, e.g., GPS signals on the one hand and can help to localize the ground robots by observing them on the other hand. Although it comes at the cost of higher complexity in motion planning, using multiple cooperatively controlled robots in this setting provides advantages, such as increased coverage and robustness to failure. Consider a scenario depicted in Fig. 3.1 where a team of aerial robots equipped with onboard cameras is tasked with tracking and estimating the position of a mobile robot with respect to a global frame of reference. Some of the quadrotors may be in the field of view of the external global sensor while others are not. The ideal cooperative control algorithm for this team would take into account all visibility constraints and uncertainties in order to establish a configuration of quadrotors that propagates position information from the global sensors through the quadrotors to the target. We introduce a centralized planning approach that generates controls for multiple robots and minimizes the uncertainty of the tracked target. At any particular instance an individual quadrotor 30 Figure 3.1: Three quadrotors cooperatively tracking a target following a figure-eight trajectory (blue line). The green ellipsoids show the 3-dimensional covariances of quadrotors’ positions. The green ellipse represents the 2-dimensional covariance of the position of the target. 
Magenta lines depict the measurements between quadrotors and the global camera (blue cube at the top). How should the quadrotors move in order to minimize the uncertainty of the target? 31 may not be able to see the target due to occlusions or the sensor’s limited field-of-view. To formulate a cost function that considers these discontinuities in the sensor domain, we extend the approach from (Patil, Duan, Schulman, Goldberg, & Abbeel, 2014) to multiple robots and penalize the trace of the target’s covariance. Using this cost function, we employ an optimization framework to find locally optimal controls. We execute these controls and estimate the state of the quadrotor team and target using an Extended Kalman Filter (EKF) (Thrun et al., 2005a). The novel contributions of this work are as follows: a) we take into account sensing discon- tinuities caused, for example, by occlusions in different multi-robot configurations in a manner that is amenable to continuous optimization, and b) we generate 3D positional control inputs for all the quadrotors. We evaluated our approach in a number of simulations and compared it to our previous method in which we introduced cooperative multi-robot control with switching of sensing topologies (Hausman, M¨ uller, Hariharan, Ayanian, & Sukhatme, 2014, 2015). The video of the simulation experiments that present the results of our method is available at: http://tinyurl.com/iros16cooperative. 3.2 Related Work The task of cooperative target tracking has been addressed in various ways. Many researchers considered centralized (Charrow, Kumar, & Michael, 2014; Fink, Ribeiro, Kumar, & Sadler, 2010), decentralized (Ong et al., 2006; Adamey & Ozguner, 2012; Mottaghi & Vaughan, 2006), and distributed (Wang & Gu, 2012; Jung & Sukhatme, 2002, 2006) approaches to control multiple aerial or ground robots. Nevertheless, all of the above approaches do not take into account the position uncertainty of the robots that are deployed to perform the tracking task. The position of the robots is assumed to either be known or independently obtained with high accuracy. In this work, we consider both the position uncertainty of the tracked target as well as the uncertainty in the robots’ poses. The problem of target localization is very similar to the task of target tracking. There have been many authors that worked on target localization (Grocholsky, Makarenko, & Durrant-Whyte, 2003; Grocholsky, Bayraktar, Kumar, Taylor, & Pappas, 2006; Stump, Kumar, Grocholsky, & Shiroma, 2009; Hoffmann & Tomlin, 2010; Grocholsky, 2002) in a multi-robot scenario with onboard sensing. It is worth noting that these approaches are implemented in a distributed fashion which makes them well-suited for multi-robot scenarios with limited communication. One of the simplifications introduced in these approaches, however, is to limit the robots to planar movements and disable the possibility that the robots can be perceived by each other. In our work, we relax these assumptions and show how to cope with occlusions between different robots. Furthermore, the above-mentioned approaches use the mutual information measures as an uncertainty measure. In contrast, we use the trace of the covariance. We chose this utility measure because it copes better with degenerate covariances (and equally well with circular covariances) than the mutual information approach, as shown by Beinhofer et al. (Beinhofer, M¨ uller, Krause, & Burgard, 2013) and our previous work (Hausman, M¨ uller, et al., 2015). 
For more details please see (Hausman, M¨ uller, et al., 2015). 32 Cooperative localization (CL) is the use of multiple robots for accurate state estimation (Mourikis & Roumeliotis, 2006; K. X. Zhou & Roumeliotis, 2012). These works mainly address state estima- tion and localization, while we focus on tracking and control generation. Another tracking approach was proposed by (A. Ahmad & Lima, 2013) where the authors weight the observations of individual robots based on their localization uncertainty. (K. Zhou & Roumeliotis, 2011) proposed a similar approach but their focus is on the non-convex optimization of the cost function and evaluation of different sensor models. In both of these approaches, the authors decouple the target tracking from the robots’ localization, which does not take into account the high correlation of the target’s and the robots’ position estimates. (Lima et al., 2014) propose a distributed formation control approach for target tracking. Although the target is tracked using a modified particle filter, the non-linear model-predictive con- troller models the uncertainty of the target using a covariance-based method that does not account for occlusions. Our method uses a similar non-linear model-predictive controller formulation that does account for occlusions. Another approach by (A. Ahmad, Tipaldi, Lima, & Burgard, 2013a) addresses the problem of cooperative target tracking and robot localization using least square minimization. However, their graph optimization approach does not provide the uncertainty of the associated estimates, thus the optimization is unable to minimize the uncertainty. There have been several optimization-based approaches (Howard, Matari´ c, & Sukhatme, 2002; A. Ahmad, Tipaldi, Lima, & Burgard, 2013b; G. Huang, Truax, Kaess, & Leonard, 2013) that tackle the problem of cooperative multi-robot localization using only onboard sensors. These approaches, however, provide the maximum-likelihood state estimates which does not directly minimize the uncertainty of the target pose. In our previous work (Hausman, M¨ uller, et al., 2014) we introduced the concept of level-based sensing topologies, which simplifies the reasoning about occlusions and reduces the considered control space of each robot to a two-dimensional plane. In this work, we tackle the problem of generating locally optimal controls for each quadrotor in the full 3-dimensional space and we explicitly reason about occlusions in the trajectory optimization algorithm. In addition, we do not, as previously, apply the same controls over the look-ahead horizon in the optimization but consider arbitrary trajectories with different controls for different time steps. Moreover, by removing the level-based sensing topologies in this work, we enable more realistic simulations as the quadrotors are not assumed to instantly jump from one topology level to another. 3.3 Problem Definition We propose a centralized approach where we jointly estimate the positions of the quadrotors and the target using an EKF. First, we describe our state estimation technique to then focus on the control generation and reasoning about occlusions. 33 3.3.1 Assumptions In the presented approach, we assume certain properties of the system that enable us to simplify the computation. First of all, we apply an EKF, hence the Markov assumption (Thrun et al., 2005a) that the measurements are conditionally independent given the joint state. 
We assume that there exists one global fixed sensor with limited range, therefore necessitating the use of a quadrotor team for tracking the target. We fix the orientation of the moving quadrotors so that one can only control their 3D position. For the moving target, a standard uncontrolled motion model is applied. We assume the target tracking never experiences complete failure, otherwise the scenario would evolve into a simultaneous exploration and tracking problem. For the part of the system responsible for the control generation, we follow Platt et al. (Platt Jr, Tedrake, Kaelbling, & Lozano-Perez, 2010) and assume that the maximum likelihood observation is obtained at each time step. This eliminates the stochasticity of the cost functions.

3.3.2 State Estimation with EKF

System Parametrization

The state at time $t$ consists of $n$ individual quadrotor poses $x_t^{(i)}, i \in [1, n]$ and the target pose $x_t^{(targ)}$. Let $u_t^{(i)}$ be the control input applied to the $i$th robot at time $t$. The joint state and control input are defined as:
$$x_t = [x_t^{(1)}, \dots, x_t^{(n)}, x_t^{(targ)}] \qquad (3.1)$$
$$u_t = [u_t^{(1)}, \dots, u_t^{(n)}] \qquad (3.2)$$
Let $\Sigma_t$ be the uncertainty covariance of the joint state. The dynamics and measurement models for the joint state are given by the stochastic, differentiable functions f and h:
$$x_{t+1} = f(x_t, u_t, q_t), \quad q_t \sim \mathcal{N}(0, Q_t) \qquad (3.3)$$
$$z_t = h(x_t, r_t), \quad r_t \sim \mathcal{N}(0, R_t) \qquad (3.4)$$
where $q_t$ is the dynamics noise, $r_t$ is the measurement noise, and they are assumed to be zero-mean Gaussian distributed with state-dependent covariances $Q_t$ and $R_t$, respectively.

We consider two types of sensors and corresponding measurements: absolute (global) measurements, e.g., GPS, and relative measurements between two quadrotors or a quadrotor and the target, e.g., distance or relative pose measurements. The stochastic measurement function of the absolute sensors is given by:
$$z_t^{(i)} = h^{(i)}(x_t^{(i)}, r_t^{(i)})$$
while the relative sensor model is:
$$z_t^{(i,j)} = h^{(i,j)}(x_t^{(i)}, x_t^{(j)}, r_t^{(i,j)}).$$
All measurement functions can be naturally extended for the joint state (Martinelli, Pont, & Siegwart, 2005).

Uncertainty Model

Given the current belief $(x_t, \Sigma_t)$, control input $u_t$ and measurement $z_{t+1}$, the belief dynamics is described by the EKF equations. In order to model the discontinuity in the sensing domain, which can be caused either by a limited field of view, sensor failure or occlusions, we follow the method from (Patil, Duan, et al., 2014) and introduce a binary vector $\delta_t \in \mathbb{R}^{\dim[z]}$. The $k$th entry in the vector $\delta_t$ takes the value 1 if the $k$th dimension of $z_t$ is available and a value of 0 if no measurement is obtained. We detail the method for computing $\delta_t$ in Sec. 3.3.4. The EKF update equations are as follows:
$$x_{t+1} = f(x_t, u_t, 0) + K_t\,\big(z_{t+1} - h(f(x_t, u_t, 0), 0)\big) \qquad (3.5a)$$
$$\Sigma_{t+1} = (I - K_t H_t)\,\Sigma_{t+1}^- \qquad (3.5b)$$
$$K_t = \Sigma_{t+1}^- H_t^\intercal \Delta_t\,[\Delta_t H_t \Sigma_{t+1}^- H_t^\intercal \Delta_t + W_t R_t W_t^\intercal]^{-1}\,\Delta_t \qquad (3.5c)$$
$$\Sigma_{t+1}^- = A_t \Sigma_t A_t^\intercal + V_t Q_t V_t^\intercal \qquad (3.5d)$$
$$A_t = \frac{\partial f}{\partial x}(x_t, u_t, 0), \quad V_t = \frac{\partial f}{\partial q}(x_t, 0) \qquad (3.5e)$$
$$H_t = \frac{\partial h}{\partial x}(x_{t+1}, 0), \quad W_t = \frac{\partial h}{\partial r}(x_{t+1}, 0), \qquad (3.5f)$$
where $\Delta_t = \mathrm{diag}[\delta_t]$ and $z_t$ is a measurement obtained at time step $t$. It is worth noting that the Kalman gain update in Eq. 3.5c includes the binary matrix $\Delta_t$ to account for discontinuities in the sensor domains. Due to the Markov assumption, the individual measurements can be separately fused into the belief using the EKF update equations.
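A minimal sketch of the masked measurement update implied by Eq. 3.5, assuming additive measurement noise (so that $W_t$ reduces to the identity); all interfaces are illustrative rather than the implementation used in the thesis.

```python
# Sketch of the masked EKF update of Eq. 3.5: a binary vector delta zeros out the
# gain for measurement dimensions that are unavailable (occluded or out of view).
import numpy as np

def ekf_update(x_pred, P_pred, z, h, H, R, delta):
    """x_pred, P_pred: predicted mean/covariance; z: stacked measurement;
    h, H: measurement model and its Jacobian at x_pred; delta: 0/1 availability."""
    D = np.diag(delta)                                  # Delta_t = diag(delta_t)
    S = D @ H @ P_pred @ H.T @ D + R
    K = P_pred @ H.T @ D @ np.linalg.inv(S) @ D         # masked dims get zero gain
    x = x_pred + K @ (z - h(x_pred))
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P
```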
Dynamics Model The dynamics of an individual quadrotor is given by: f (i) (x (i) ;u (i) ;q (i) )= x (i) + u (i) t+ q (i) where x (i) ;u (i) ∈R 3 are the 3D position and velocity and t is the length of a time step. In this work, we use a constant velocity motion model for the target, however our approach generalizes to any uncontrolled motion model. 35 Observation Model The cameras are assumed to be at the center of each quadrotor facing down. The absolute sensor provides the 3D position of the observed quadrotor/target as a measurement: h (i) (x (i) t ;r (i) t )= x (i) t + r (i) t The relative sensor model provides the position of the (observed)jth quadrotor/target relative to the (observing)ith quadrotor: h (i;j) (x (i) t ;x (j) t ;r (i;j) t )= (x (j) t − x (i) t )+ r (i;j) t : 3.3.3 Control Generation using Optimization At each time stept we seek a set of control inputs u t∶T=t+h that for a time horizonh minimizes the uncertainty of the target while penalizing collisions and the control effort. In order to minimize the uncertainty of the target, we, similarly to (Hausman, M¨ uller, et al., 2015), penalize the trace of the target covariance. In consequence, it is possible for the quadrotor covariance to get large if the quadrotor is not near the other quadrotors or the global sensor. However, it is possible to bring the quadrotor back near the target if it is feasible within the time horizon of the controller. The other components included in the cost function are: the distance of every quadrotor pair to avoid collisions and the cost of the control effort. The final cost function is composed as follows: c t (x t ; t ;u t )= tr( (targ) t )+c collision (x t )+ SSu t SS 2 2 c T (x T ; T )= tr( (targ) T )+c collision (x T ) c collision (x)= n Q i=1 n Q j=i+1 max(SSx (j) − x (i) SS 2 2 −d max ; 0) where;; and are user-defined scalar weighting parameters andd max is the maximum distance for which the collision cost takes effect. The final objective function is: min x t∶T ;u t∶T−1 E[c T (x T ; T )+ T−1 Q t c t (x t ; t ;u t )] (3.6) s: t: x t+1 = f(x t ;u t ;0) x t ∈X feasible ; u t ∈U feasible x 0 = x init ; 0 = init For trajectory optimization, one can solve a nonlinear optimization problem to find a locally op- timal set of controls (Patil, Kahn, et al., 2014). In this work, we used sequential quadratic program- ming (SQP) to locally optimize the non-convex, constrained optimization problem. The innermost 36 Figure 3.2: Signed distance if the quadrotor/target is inside the view frustum (left) and outside the view frustum (right). QP solver was generated by a numerical optimization code framework called FORCES (Domahidi, 2012). FORCES generates code for solving QPs that is specialized for convex multistage problems such as trajectory optimization. Given the output controls u t∶T−1 computed using trajectory optimization, we follow the model predictive control paradigm (Camacho & Alba, 2013) by executing a subset of the optimized controls at each time step and then replanning. We use the previously computed controls as an initialization to warm start the optimization. 3.3.4 Reasoning about Occlusions The absolute/relative position of a quadrotor/target may not be observable due to occlusions from other quadrotors and the limited field-of-view of the sensors. As previously mentioned, we model this discontinuity with the binary variable (according to (Patil, Duan, et al., 2014)). In order to make the objective function differentiable, we approximate with a sigmoid function. 
It is worth noting that this is only required in the optimization step, the state estimation EKF step remains as defined previously. Letsd (i;j) (x) be the signed distance of x (j) to the field-of-view of thejth quadrotor (see Fig. 3.2). The signed distance is negative if thejth quadrotor is visible and positive otherwise. We introduce the parameter which determines the slope of the sigmoidal approximation. The sigmoidal approximation of the measurement availability is given by: (i;j) = 1 1+ exp[−⋅sd (i;j) (x)] (3.7) 37 Figure 3.3: Signed distance function in the presence of occlusions. First, we determine the shadows of all the occlusions such that the resulting field-of-view (the shaded area) is calculated. The signed distance is computed as the distance to the field-of-view. For more details on the sigmoidal approximation of the availability of the measurement we refer the reader to (Patil, Duan, et al., 2014). To calculate the signed distance of x (j) to the field-of-view of x (i) , we first represent the field-of-view as a truncated view frustum with a minimum and maximum distance given by the sensor model as depicted in Fig. 3.3. If x (j) is outside of the view frustum, sd (i;j) (x) is the distance of x (j) to the view frustum as shown in the right part of Fig. 3.2. If x (j) is inside the view frustum and there are no occlusions, the signed distance is computed as shown in the left part of Fig. 3.2. In the presence of occlusions, we first determine the shadows of all the occlusions in the plane of x (j) . In the next step, we use an open source 2D polygon clipping library - GPC * to generate the 2D polygon field-of-view, and then calculate the signed distance. Fig. 3.3 shows an example of a signed distance function in the presence of an occlusion. 3.4 Evaluation and Discussion We evaluated our approach in a number of simulation experiments. The simulation environment consists of a fixed global down-looking camera attached 4 meters above the origin of the coordinate system, a ground robot target and a varying number of quadrotors equipped with down-looking * http://www.cs.man.ac.uk/ ˜ toby/gpc/ 38 0 50 100 150 200 time step [1/10 s] 0.000 0.001 0.002 0.003 0.004 0.005 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.001 0.002 0.003 0.004 0.005 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.001 0.002 0.003 0.004 0.005 trace(target covariance) [m^2] position error of target [m] Figure 3.4: Comparison of different time horizons in the optimization step. The time horizons are 2, 5, and 10 time steps, respectively. Evaluation measures are shown with 95% confidence intervals over 10 runs with 3 quadrotors. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. There is no significant difference in the tracking error for these time horizons. 39 0 50 100 150 200 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] Figure 3.5: Comparison between different sampling approaches and the optimization approach presented in this work. 
From left to right: random sampling, lattice sampling and optimization. Evaluation measures are shown with 95% confidence intervals over 10 runs with 3 quadrotors. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. The quantitative results from this experiment for all the trajectories are summarized in Table 3.1. The optimization approach yields better results than the other presented methods. 40 0 50 100 150 200 time step [1/10 s] 0.000 0.001 0.002 0.003 0.004 0.005 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.001 0.002 0.003 0.004 0.005 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.001 0.002 0.003 0.004 0.005 trace(target covariance) [m^2] position error of target [m] Figure 3.6: Comparison of our optimization-based approach applied to different number of quadrotors used for the tracking task. From left to right the number of quadrotors are: 1, 3 and 5. Evaluation measures are shown with 95% confidence intervals over 10 runs. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. The experiments confirms the intuition that deploying more quadrotors improves the tracking performance up to a point of diminishing returns. 41 cameras. These scenarios are similar to the real robot experiments presented in (Hausman, M¨ uller, et al., 2015). Each quadrotor is controlled through the velocity commands u (i) = [v x ;v y ;v z ]. For each quadrotor,X feasible is unrestricted whileU feasible is define as −1 ≤ u (i) ≤ 1. The state of the target consists of its position and velocity x (targ) = [x;y;v x ;v y ] and the target moves on the XY-plane. The length of the simulation time step is equal to 0:1s. All camera sensors in this setup have the same properties as described in Sec. 3.3.2, and can detect objects in a 3-meter high truncated pyramid. Consequently, the global camera is not able to see the target directly. The camera measurement standard deviation is set to 0:02m, based on the average standard deviation of a commodity depth sensor (Khoshelham & Elberink, 2012). In addition, the measurement covariance scales quarticly with the distance to the measured quadrotor. In order to make simulations even more realistic, we also introduce motion noise with standard deviation equal to 0:1m~s. We evaluated our approach for 3 different target trajectories, including a figure eight trajectory (see Fig. 3.1), a spiral trajectory of a approximate size of 1.6mx1.6m, and a random walk trajectory with a step size of 0.03m. The figure eight and spiral trajectories simulate an adversarial target that is attempting to evade being tracked, while the random walk trajectory simulates a non-adversarial (but still non-cooperative) target. An example of a system setup with the field of view of each camera is depicted in Fig. 3.1, where the target trajectory is a figure eight. Our system aims to estimate the position of the target as accurately as possible by actively controlling the quadrotors. Some of the simulation experiments are presented in the video attached to this submission † . 3.4.1 Experiments We present different sets of experiments illustrating various evaluation criteria of our system. 
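For reference, the three target trajectories used in these experiments can be generated along the following lines. The 0.1 s time step, the roughly 1.6 m x 1.6 m spiral extent, and the 0.03 m random-walk step come from the setup described above; the exact figure-eight and spiral parameterizations are not specified in the text, so the functional forms and the remaining constants are illustrative.

```python
import numpy as np

DT = 0.1  # simulation time step [s]

def figure_eight(t, scale=0.8, period=20.0):
    """Figure-eight (lemniscate-style) target position at time t; illustrative form."""
    w = 2.0 * np.pi * t / period
    return np.array([scale * np.sin(w), scale * np.sin(w) * np.cos(w)])

def spiral(t, growth=0.004, period=20.0, max_radius=0.8):
    """Outward spiral bounded to roughly 1.6 m x 1.6 m."""
    w = 2.0 * np.pi * t / period
    r = min(growth * t / DT, max_radius)
    return np.array([r * np.cos(w), r * np.sin(w)])

def random_walk_step(pos, step=0.03, rng=np.random.default_rng()):
    """Random-walk target update with the 0.03 m step size used in the experiments."""
    direction = rng.normal(size=2)
    return pos + step * direction / np.linalg.norm(direction)
```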
Time Horizon Experiment We evaluate the influence of different time horizons on the tracking accuracy. Fig. 3.4 shows the statistics for three different time horizons for the figure eight trajectory. We obtained similar results for the spiral and random trajectories. Based on these results, the time horizon has no significant effect on the performance in this setting, most likely because we are using model predictive control (MPC) in which we plan over the given time horizon, execute the first control, then repeat. We therefore set the time horizon to be 2 time steps for the remaining experiments. Sampling vs. Optimization Experiment We compare our optimization-based control generation with random sampling (Charrow et al., 2014) and lattice planners (Likhachev & Ferguson, 2009). Random sampling methods randomly sample controls to generate trajectories while lattice planners draw samples from a predefined manifold, which in our case was a sphere with radius equal to the maximum allowed control effort. For both methods, the cost of each resulting trajectory is evaluated and the controls corresponding to the minimum cost trajectory are executed. Fig. 3.5 shows the statistical results of the comparison † http://tinyurl.com/iros16cooperative 42 0 50 100 150 200 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 250 300 350 400 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 250 300 350 400 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 250 300 350 400 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 250 300 350 400 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] Figure 3.7: Comparison between not taking occlusions into account in the optimization step (top) and accounting for occlusions using the signed distance function presented in this work (bottom). Results collected for 3 different trajectories: figure eight (left), spiral (middle) and random walk (right) Statistics and 95% confidence intervals over 10 runs with 3 quadrotors. Taking occlusions into account is beneficial especially at the spots right below the global camera where the quadrotors have to stay close to each other. between the sampling method, lattice sampling method and our approach for 3 quadrotors for the figure eight trajectory. In order to make a fair comparison for the sampling methods, we chose the number of samples such that the execution times per one optimization step of all the methods were similar and we averaged the results over 10 runs. One can notice a significantly larger tracking error and the trace of the target covariance in performance of both of the sampling methods compared to the trajectory optimization approach. It is worth noting that the lattice approach yields worse results than the random sampling approach. It is most likely due to the fact that generating controls of the same magnitude throughout the trajectory causes the quadrotors to constantly overshoot the target. 
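For concreteness, the two sampling baselines and the minimum-cost selection step can be sketched as follows. The sample counts, the control bound, and the rollout_cost helper (standing in for forward-simulating the belief and evaluating the objective of Eq. 3.6) are illustrative assumptions rather than the exact implementation used in the experiments; the lattice controls are drawn here at random from the sphere of maximum control effort, whereas a fixed discretization of that sphere would serve equally well.

```python
import numpy as np

def random_sampling_controls(n_quads, n_samples, u_max=1.0,
                             rng=np.random.default_rng()):
    """Random-sampling baseline: uniform controls in the feasible box [-u_max, u_max]^3."""
    return rng.uniform(-u_max, u_max, size=(n_samples, n_quads, 3))

def lattice_controls(n_quads, n_samples, u_max=1.0, rng=np.random.default_rng()):
    """Lattice baseline: controls on the sphere of maximum allowed control effort."""
    v = rng.normal(size=(n_samples, n_quads, 3))
    return u_max * v / np.linalg.norm(v, axis=-1, keepdims=True)

def pick_best(candidates, rollout_cost):
    """Evaluate each candidate joint control and execute the cheapest one."""
    costs = [rollout_cost(u) for u in candidates]
    return candidates[int(np.argmin(costs))]
```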
Number of Quadrotors Experiment We compared the tracking performance and the execution time for different numbers of quadrotors. The statistics presented in Fig. 3.6 demonstrates a visible improvement by using a larger number of quadrotors which confirms the intuition that larger teams facilitate more reliable and accurate target tracking. It is important to note that the quadrotors are able to leverage their large quantity despite the very narrow field of view of the cameras. This behavior is achieved because of the reasoning about occlusions which prevents the quadrotors to block each other’s views. The average time for one optimization step for 3 quadrotors was equal to 0:156± 0:002s. 43 Optimization with Occlusions Experiment To evaluate the importance of considering occlusions, we evaluated our method with and without reasoning about occlusions for a team of 3 quadrotors (Fig. 3.7) and measured the target covariance and target tracking error for all the trajectories. Of note is the much larger error and higher uncertainty in Fig. 3.7(top) than Fig. 3.7(bottom) for all the trajectories, indicating that reasoning about occlusions is particularly important for target tracking with multi-robot teams. 44 0 50 100 150 200 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] 0 50 100 150 200 time step [1/10 s] 0.000 0.002 0.004 0.006 0.008 0.010 trace(target covariance) [m^2] position error of target [m] Figure 3.8: Target tracking results for our previous level-based approach (top) and the hereby presented method without explicitly reasoning about sensing topologies (bottom). Statistics and 95% confidence intervals over 10 runs with 3 quadrotors. Results are presented only for the figure eight trajectory for brevity, however, we obtained similar results for other trajectories, i.e. spiral and random walk. The quantitative results from this experiment for all the trajectories are summarized in Table 3.1. The approach presented in this work is more beneficial for the reasons explained in Fig. 3.9. 45 Figure 3.9: An example of an advantage of our approach over the level-based approach. Left: most beneficial sensing topology for the level-based approach. The quadrotors form a chain topology as each level of quadrotors is allowed to sense only one level below it. Right: A superior sensing topology achieved by our novel method. The lower quadrotor is seen by the global sensor and the other quadrotor. The upper quadrotor can localize itself based on the observation of the lower quadrotor. Both quadrotors can observe the tracking target which provides more information than the level-based approach. There are no constraints for the topology levels. 46 level-based 2D opt. lattice sampling 3D opt. no occ. ours (3D opt.) avg. max tracking error (8) [m] 0:016± 0:024 0:41± 1:00 0:027± 0:028 0:0061± 0:0072 0:0037± 0:0050 avg. max tracking error rel. (8) 26x 1x 15x 67x 111x avg. tracking error (8) [m] 0:0030± 0:0036 0:037± 0:073 0:0035± 0:0044 0:0021± 0:0010 0:0012± 0:00071 avg. tracking error rel. (8) 12x 1x 11x 18x 31x avg. max tracking error (spiral) [m] 0:017± 0:018 0:070± 0:14 0:028± 0:061 0:011± 0:027 0:0015± 0:00088 avg. max tracking error rel. (spiral) 4x 1x 2.5x 6x 47x avg. tracking error (spiral) [m] 0:0010± 0:0027 0:0055± 0:014 0:0020± 0:0043 0:0018± 0:0013 0:00052± 0:00023 avg. tracking error rel. (spiral) 5.5x 1x 3x 3x 11x avg. max tracking error (rw) [m] 0:0065± 0:013 0:0071± 0:018 0:15± 0:44 0:018± 0:033 0:0042± 0:011 avg. 
max tracking error rel. (rw) 23x 21x 1x 8x 36x avg. tracking error (rw) [m] 0:00091± 0:0011 0:0012± 0:00090 0:0030± 0:014 0:0021± 0:0024 0:00063± 0:00034 avg. tracking error rel. (rw) 3x 2.5x 1x 1.5x 5x Table 3.1: Quantitative results for the experiments with 3 quadrotors for 3 different trajectories: 8 - figure eight, spiral and rw - random walk. Max target tracking error was averaged over 10 runs, target tracking error was averaged over all time steps and 10 runs. Our approach yields significantly smaller error than the other baseline methods for all the tested trajectories. 47 Topology Switching Experiment We compare the method proposed in this work to our previous target tracking method that intro- duced level-based sensing topologies and an efficient topology switching algorithm (Hausman, M¨ uller, et al., 2014). In this approach, the quadrotors are organized on different levels with an assumption that each level can only sense the adjacent level below it. At each time step the algorithm determines the planar controls for each of the quadrotors as well as determines whether to switch to one of the neighboring topologies by moving one of the quadrotors by one level up or down. This approach was introduced in order to avoid the reasoning about occlusions between quadrotors at different altitudes. However, a topology switch is assumed to be instantaneously completed, which is not realistic. In order to make our approach and the level-based approach comparable for standard quadrotors, we introduce a delay of 3 time steps for the topology switch in order to realistically simulate a real quadrotor adjusting altitude. The performance of both algorithms for the figure eight trajectory is depicted in Fig. 3.8. The quadrotors perform better when explicitly reasoning about occlusions with our approach, especially when the target is far away from the global camera. This phenomena can be explained by Fig. 3.9. In the level-based approach, if the target is far away from the global camera the optimal topology is a chain topology. Our approach, however, is able to form a different sensing topology that is superior to the previous one due to its higher number of available measurements (see Fig. 3.9). This superior topology is not valid for the level-based approach as it violates the assumption of sensing only the adjacent level below the current level. This topology enables the algorithm to localize the quadrotor on the left side due to the measurement of the other quadrotor and have two localized quadrotors observe the target. It is worth noting that our novel approach can create a greater variety of sensing topologies compared to our previous approach without explicitly reasoning about sensing topologies. Average Error Comparisons Table 3.1 summarizes the statistics of the average tracking error and maximum average tracking error for different approaches and trajectories tested in our experiments. It is worth noting that our approach achieved up to 8 times smaller average maximum tracking error and up to 2 times smaller average tracking error compared to the the next best method. It is also apparent that our method yields the largest improvement for the spiral trajectory. The reason for that is that in the initial phases of this trajectory the target stays underneath the camera so the quadrotors have to stay close together to be in the global camera view frustum. Hence, it is likely that the quadrotors may occlude each other’s views. 
Since our method explicitly reasons about these occlusions, it is able to outperform other methods. 3.5 Conclusions We presented an optimization-based probabilistic multi-robot target tracking approach that effi- ciently reasons about occlusions. We evaluated our approach in a number of simulation experiments. We have compared our method to other baseline approaches such as random sampling, lattice 48 sampling, and our previous work on sensing topologies (Hausman, M¨ uller, et al., 2014). Our experimental results indicated that our method achieves up to 8 times smaller average maximum tracking error and up to 2 times smaller average tracking error than the next best approach in the presented scenarios. 49 Part II Interactive Perception 50 In this part of the thesis, we present two Interactive Perception approaches that focus on the problems of articulation model estimation and grasping. A thorough survey on the topic of Interactive Perception is presented in our previous work (Bohg et al., 2016). Interactive Perception approaches exploit any kind of forceful interaction with the environment to simplify and enhance perception. Thereby, they enable robust, perceptually-guided manipulation behaviors. IP has two main benefits. First, physical interaction creates a novel sensory signal that would otherwise not be present. Second, by exploiting knowledge of the regularity in the combined space of sensory data and action parameters, the prediction and interpretation of this novel signal becomes simpler and more robust. In the following chapters, we present different examples of how these benefits are implemented on a robotic system. In addition, we elaborate on specific applications where IP either simplifies the perceptual challenges of a manipulation task such as opening a drawer or it allows the robot to leverage continuous interaction with the environment to improve its performance at grasping. These two chapters are based on our previously-published work (Karol Hausman & Sukhatme, 2015) and (Chebotar, Hausman, Kroemer, Sukhatme, & Schaal, 2016). 51 Chapter 4 Active Articulation Model Estimation We introduce a particle filter-based approach to representing and actively reducing uncertainty over articulated motion models. The presented method provides a probabilistic model that integrates visual observations with feedback from manipulation actions to best characterize a distribution of possible articulation models. We evaluate several action selection methods to efficiently reduce the uncertainty about the articulation model. The full system is experimentally evaluated using a PR2 mobile manipulator. Our experiments demonstrate that the proposed system allows for intelligent reasoning about sparse, noisy data in a number of common manipulation scenarios. 4.1 Introduction When operating in human environments, robots must frequently cope with multiple types of uncertainty such as systematic and random perceptual error, ambiguity due to a lack of data, and changing environmental conditions. While methods like Bayesian reasoning allow robots to make intelligent decisions under uncertainty, in many situations it is still advantageous, or even necessary, to actively gather additional information about the environment. The paradigm of interactive perception aims to use a robot’s manipulation capabilities to gain the most useful perceptual information to model the world and inform intelligent decision making. 
In this work, we leverage interactive perception to directly model and reduce the robot’s uncertainty over articulated motion models. Many objects in human environments move in structured ways relative to one another; articulation models describe these movements, providing useful information for both prediction and planning (Sturm et al., 2011; Mart´ ın & Brock, 2014). For example, many common household items like staplers, drawers, and cabinets have parts that move rigidly, prismatically, and rotationally with respect to each other. Previous works on detection of articulated motion models have generally used a passive maximum likelihood strategy to fit models, ignoring uncertainty (Pillai et al., 2014; Sturm et al., 2011; Mart´ ın & Brock, 2014). However, passive, fit-based approaches can often lead to several problems. First, these methods give no indication if there are competing high-likelihood hypotheses, or if the fitted model is the only reasonable explanation of the data. Second, purely observational models only look at the fit of observed data points, but do not reason about rigidity 52 Figure 4.1: Top: The PR2 performing an action that reduces the uncertainty over articulation models. Bottom: Three of the most probable articulation model hypotheses: free-body(green), rotational(blue) and prismatic(pink). Selected action is depicted in yellow and lies on the plane of the cabinet 53 constraints or certainty, leading them to be tricked in many situations. Third, without a confidence metric, the robot cannot reason about how to gather additional data intelligently when needed. In this work, we address all of the above challenges by tracking a distribution over articulation models, updating the model distribution based upon the outcome of manipulation actions, and selecting informative actions to converge quickly to accurate articulation models. An example of our method in action is shown in Fig. 4.1. The key contributions of our approach are that we: a) develop a particle filter approach to keep track over uncertainty over different articulation models and their parameters, b) design a manipulation sensor model that updates model likelihoods based on the feedback from the robot’s manipulator, and c) introduce a probabilistic action selection algorithm that reduces uncertainty efficiently. The effectiveness of this approach is demonstrated through a series of experiments using a PR2 mobile manipulator. 4.2 Related Work Interactive perception has enjoyed success in many different applications, including object seg- mentation (van Hoof, Kroemer, & Peters, 2014; Hausman et al., 2013; Kenney, Buckley, & Brock, 2009), object recognition (Schiebener, Morimoto, Asfour, & Ude, 2013; Hausman, Corcos, Mueller, Sha, & Sukhatme, 2014), object sorting (Chang, Smith, & Fox, 2012; Gupta & Sukhatme, 2012), and object search (Gupta, Ruhr, Beetz, & Sukhatme, 2013). Some classes of problems have state information that cannot be observed passively, requiring the use of interactive perception. One instance of this is the estimation of kinematic models of articulated objects. The task of estimating articulation models has been addressed in various ways. Sturm et al. (Sturm et al., 2011) use human demonstrations to obtain visual data from tracked markers, which is then used to characterize single or multiple joints in a probabilistic fashion. Katz et al. (Katz, Kazemi, Andrew Bagnell, & Stentz, 2013; Katz, Orthey, & Brock, 2014) and Pillai et al. 
(Pillai et al., 2014) were able to move away from a marker-based system by using visual features to detect and track rigid parts. The observed motion of these rigid bodies is then used to characterize various joint types. Huang et al. (X. Huang, Walker, & Birchfield, 2012) use different camera view-points and make use of structure-from-motion techniques to recover and classify the joint properties. In their recent work, Martin and Brock (Mart´ ın & Brock, 2014) present an algorithm that is able to perform articulation model estimation on the fly, in contrast to previous offline algorithms. Jain and Kemp (Jain & Kemp, 2010) present an alternate approach in which an equilibrium point control algorithm is used to open doors and drawers without extensive use of visual data. Otte et al. (Otte, Kulick, Toussaint, & Brock, 2014) present related work that addresses the problem of physical exploration to estimate the articulation model. However, the problem is formulated differently in this case—the robot is searching for a high-level strategy (in what order should the robot interact with objects?) in order to estimate the structure of the whole environment. In our work, we probabilistically combine visual observations with the outcomes of the robot’s manipulation actions in a recursive state estimation framework. We use fiducial markers to simplify the perceptual challenge, as the main focus is on informative action selection and the 54 Figure 4.2: Particles representing different hypothesis of the underlying articulation model. Blue is a rotational model, pink is a prismatic model and green is a free-body model. representation of multiple hypotheses. By using this approach, it is possible to improve any of the above-mentioned visual articulation model frameworks by taking advantage of the robot’s manipulation capabilities to estimate the underlying model in a principled way. 4.3 Approach 4.3.1 Articulation Model Representation z m 1 z m 2 M; A1 A2 z a 2 z m 3 A3 z a 3 N1 N2 N3 Figure 4.3: Probabilistic graphical model for articulation recognition. 55 Our approach takes advantage of robot manipulation capabilities to actively reduce the un- certainty over articulation models and their parameters. This requires fusing different sensor modalities—vision-based object tracking and outcomes of manipulation actions performed by the robot. Both sensor measurements are fused using a recursive Bayesian update formulation which is implemented as a particle filter (Thrun, Burgard, & Fox, 2005b) (Fig. 4.2). Articulation Models and their Parameters We consider four types of articulation models: rigid, prismatic, rotational and free-body, parametrized similarly to Sturm et al.(Sturm et al., 2011). The rigid model has 6 parameters specifying a fixed relative transformation between two objects or parts; the prismatic model has 2 more parameters than the rigid model that indicate the direction of the prismatic axis; the rotational model has 3 more parameters than the rigid model that indicate the axis of rotation and the radius; the free-body model does not contain any parameters, as the objects can move freely with respect to each other. Probabilistic Graphical Model for State Estimation Our estimate of the articulation model can be represented as the probabilistic graphical model shown in Fig. 4.3, where we assume that the model and its parameters do not change over time. 
The model consists of the type of articulation model $M$, its parameters $\theta$, sets of 6-DOF object pose measurements $\mathbf{z}_t^m$ which are represented in plate notation (there exist $N_t$ measurements for each time step $t$), action sensor measurements $z_t^a$ and actions $A_t$. Given this probabilistic graphical model, the posterior probability of the articulation model and its parameters can be factorized:

$$P = p(M, \theta \mid \mathbf{z}_{1:t}^a, \mathbf{z}_{1:t}^m, A_{1:t-1}) \propto \underbrace{p(\mathbf{z}_t^m \mid M, \theta)}_{\text{visual sensor model}} \; \underbrace{p(z_t^a \mid M, \theta, A_{t-1})}_{\text{action sensor model}} \; \underbrace{p(M, \theta \mid \mathbf{z}_{1:t-1}^a, \mathbf{z}_{1:t-1}^m, A_{1:t-2})}_{\text{previous belief}} \quad (4.1)$$

This formulation enables us to fuse all the sensor measurements coming from vision as well as manipulation data. In the following subsections, we describe the respective parts of the above factorization.

4.3.2 Visual Sensor Model

In order to evaluate how the observed poses of the objects can be explained by the model and its parameters, a visual sensor model is needed. Having obtained the current noisy measurement $z_t^m$, similarly to Sturm et al. (Sturm et al., 2011), it is projected onto the model that is being evaluated. The projection is achieved by using the model's inverse kinematics function:

$$\hat{q} = f_{M,\theta}^{-1}(z_t^m) \quad (4.2)$$

The resulting configuration estimate $\hat{q}$ is the configuration (i.e. a scalar joint position or rotational angle in the prismatic and rotational models, and a constant for the rigid model) closest to the original observation $z_t^m$ that lies on the surface of the model. This configuration is then used with the model's forward kinematics function to obtain the 6-DOF pose corresponding to the projected configuration:

$$^M\hat{z}_t^m = f_{M,\theta}(\hat{q}) \quad (4.3)$$

Figure 4.4: The angle between the action vector $^W V_t^a$ and the tangent vector to the current model $^W V_t^m$ (rotational in this case).

Finally, the visual sensor model is approximated with a 2-dimensional multivariate normal distribution (Eq. 4.5) that is centered around a 2-dimensional vector of zeros and depends on the translational and rotational distance $d$ (Eq. 4.4) between the projected $^M\hat{z}_t^m$ and the observed $z_t^m$ measurement. Since the visual measurements are conditionally independent given the model and its parameters, we are able to write the final sensor model:

$$d(z_t^m, {}^M\hat{z}_t^m) := [\mathrm{trans}_d(z_t^m, {}^M\hat{z}_t^m), \mathrm{rot}_d(z_t^m, {}^M\hat{z}_t^m)] \quad (4.4)$$
$$p(\mathbf{z}_t^m \mid M, \theta) = \prod_{z_t^m \in \mathbf{z}_t^m} \mathcal{N}\!\left(d(z_t^m, {}^M\hat{z}_t^m); 0, \Sigma\right) \quad (4.5)$$

where $\Sigma$ is the estimated covariance of the observations. In the case of a free-body model, the above formulas do not hold, as it does not have forward and inverse kinematics functions defined. The likelihood update for the free-body model corresponds to a constant probability density value that is defined by a uniform distribution over all possible poses in the robot's workspace. This follows from the intuition that a free body is not constrained by any joint and is roughly equally likely to be anywhere in the workspace.

4.3.3 Action/Manipulation Sensor Model

The second sensor modality that is incorporated in the model is the outcome of manipulation actions. The actions are defined as short translational movements performed on a scale of 10 cm, where a stable grasp of the articulated object is assumed.
Each action is represented as a vector $^W V_t^a$ in the world frame $W$. The observed quantity in the manipulation sensor model is defined as a binary variable that is equal to 0 if the manipulation was unsuccessful (the manipulator cannot complete the motion because the joint friction is too high, it attempted motion in a rigid direction, etc.) and 1 otherwise (the expected motion is observed). For each model, its parameters, and the selected action, we can compute the likelihood of the action being successful as:

$$p(z_t^a \mid M, \theta, A_{t-1}) = e^{-\lambda\alpha} \quad (4.6)$$

where $\lambda$ is the weighting factor dependent on the stiffness of the arm controller used in the robot, and $\alpha$ is the angle between the vector $^W V_t^m$ describing the motion prediction at the current grasping point $^W z_t^m$ for a given model $M$ in the world frame $W$ and the action vector $^W V_t^a$ that is the direction of the movement of the manipulator. It is worth noting that one can use other functions to estimate the likelihood of the action being successful as long as the likelihood is inversely proportional to the angle.

To estimate $^W V_t^m$, the most recent noisy observation $^W z_t^m$ is used as a point estimate of the pose of the articulated part at the grasping location. We introduce a function $f$ that, given the model $M$ and the current estimate of the pose $^W z_t^m$, returns a vector that is a tangent to the predicted motion path of the object at the current point (the blue vector depicted in Fig. 4.4):

$$^W V_t^m = f(M, {}^W z_t^m) \quad (4.7)$$
$$\alpha = \arccos \frac{^W V_t^m \cdot {}^W V_t^a}{\lvert ^W V_t^m \rvert\,\lvert ^W V_t^a \rvert} \quad (4.8)$$

This formulation results in a decrease of the likelihood of successful motion as the angle between the predicted motion vector and the action vector increases. In the case of the rigid and the free-body models, we assume that the only reason for $z_t^a = 0$ is the configuration of the articulated object (there are no singularities and the robot's arm movement is in its workspace), which leads to the definition of the manipulation sensor model:

$$p(z_t^a = 0 \mid M_{rigid}, \theta, A_{t-1}) = 1 \quad (4.9)$$
$$p(z_t^a = 1 \mid M_{rigid}, \theta, A_{t-1}) = 0 \quad (4.10)$$
$$p(z_t^a = 0 \mid M_{free}, \theta, A_{t-1}) = 0 \quad (4.11)$$
$$p(z_t^a = 1 \mid M_{free}, \theta, A_{t-1}) = 1 \quad (4.12)$$

4.3.4 Best Action Selection

In order to select the best action (i.e. the action that reduces the uncertainty over the model the most), we compare two different strategies.

Entropy-based Action Selection

In this case, we define the best action as the action that will maximally reduce the entropy over the model distribution in expectation. Because the outcome $z_{t+1}^a$ of the future action $A_t$ has not been observed, we compute the expected entropy, $H$, of the distribution over articulation models after the action is executed:

$$A_t^* = \arg\min_{A_t} \; \mathbb{E}_{z_{t+1}^a \sim p(z_{t+1}^a \mid M, \theta)} \, H[M, \theta \mid z_{1:t+1}^a, \mathbf{z}_{1:t}^m, A_t] \quad (4.13)$$

Intuitively, the action that maximally reduces the entropy should disambiguate the current hypotheses of articulation models in the most efficient way. However, as we will show later, this intuition is incorrect in many situations.

Information Gain-based Action Selection

Our second approach is based on an information gain metric, rather than on entropy itself. Similar to entropy-based action selection, first, the expected posterior is calculated:

$$Q = \mathbb{E}_{z_{t+1}^a \sim p(z_{t+1}^a \mid M, \theta)} \; p(M, \theta \mid z_{1:t+1}^a, \mathbf{z}_{1:t}^m, A_t) \quad (4.14)$$

Next, we look for the action that will maximize the information gain between the current distribution and the expected distribution, given the action that is being evaluated.
Information gain is defined as the Kullback-Leibler divergence (Kullback & Leibler, 1951) between these two distributions: A ∗ t = argmax At D KL (PSSQ): (4.15) 4.4 Implementation 4.4.1 Particle Filter Since the joint distribution of a model and its parameters has no simple parametric form and may often be multimodal, we approximate it using a particle filter. Each particle represents an articulation model type (e.g. rigid, free-body, rotational, or prismatic) and the associated model parameters. An example showing the state of the particle filter is depicted in Fig. 4.2. Since the particle filter has to cover a high dimensional space, we propose a particle initialization algorithm that ensures that the highest likelihood areas of the joint distribution are well-covered. The action selection algorithm is then applied after updating the particle weights according to the visual sensor model. For particle resampling, stratified resampling (Kitagawa, 1996) is used because it better ensures that the full support of the underlying distribution is covered adequately. Initialization of Particles We take advantage of human demonstrations or prior observation data in order to initialize the particles, though this step is not strictly required. An equal number of particles are used for each articulation model. In order to fit model parameters to the data, the algorithm from Sturm et. al (Sturm et al., 2011) is employed, in which the authors use MLESAC (maximum likelihood sample consensus) (Torr & Zisserman, 2000) to fit and refine parameters. For more details we 59 refer the reader to (Sturm et al., 2011). It is important to emphasize that a small number of MLESAC iterations are used to estimate the parameters of the model. This ensures that more of the high-dimensional parameter space is covered, rather than just the highest likelihood settings, given the noisy visual data. 4.4.2 Visual Sensor Model We record all available object pose data during demonstrations, as well as during robot manipulation actions. Every time a correction step of the particle filter step occurs, we make use of all the new data that was not previously used to update the weights of the particles according to Eqn. 4.5. 4.4.3 Manipulation Sensor Model In our robot experiments, we use a PR2 mobile manipulator that does not contain a force/torque sensor to directly measure the forces in the robot’s wrist. In order to obtain manipulation feedback from the robot, we implemented a compliant controller that stops whenever the controller effort exceeds a certain threshold. In this way, we are able to detect the sensor outputz a t that specifies if the interaction with the articulation model was successful (i.e. if the part moved as expected or resisted motion). 4.4.4 Best Action Selection To select the optimal action, a set of candidate actions is first generated. The actions are sampled such that they cover the range of directions that the robot can act in order to move the articulated object from a known grasp point. One set of candidate actions is depicted in Fig. 4.5. Next, either the expected entropy (post-action) or the relative entropy (compared to pre-action) of the particle filter is calculated for each generated action. Finally, based on the chosen optimality criteria, the action optimal with respect to the selected criteria is chosen. Entropy-based Action Selection In order to estimate the entropy of the distribution represented in the particle filter, taking only particles’ weights into account is not sufficient. 
For example, if the particles are concentrated around the true estimate of the articulation model, their weights will be similar, making the entropy large. To solve this problem, the position of the particles must be taken into account by estimating the underlying differential entropy of the distribution. For each model we perform kernel density estimation (Rosenblatt, 1956) over its parameters with Gaussian kernels:

$$\hat{f}_H(\theta) = \frac{1}{n}\sum_{i=1}^{n} K_H(\theta - \theta_i) \quad (4.16)$$
$$K_H(\theta) = \lvert H \rvert^{-1/2}\, K(H^{-1/2}\theta) \quad (4.17)$$
$$K(\theta) = (2\pi)^{-d/2}\, e^{-\frac{1}{2}\theta^{\top}\theta} \quad (4.18)$$

Figure 4.5: Automatically generated candidate actions.

The bandwidth matrix $H$ is estimated according to Scott's rule (Scott, 1979):

$$H_{ij} = 0, \quad i \neq j \quad (4.19)$$
$$\sqrt{H_{ii}} = n^{-\frac{1}{d+4}}\,\sigma_i \quad (4.20)$$

where $\sigma_i$ is the standard deviation of the $i$-th variable, $d$ is the dimensionality of the parameters, and $n$ is the number of samples. By sampling the kernel density estimate at the (weighted) locations of the original particles, we are able to estimate the differential entropy of the joint distribution represented in the particle filter with the resubstitution estimate proposed in (I. Ahmad & Lin, 1976):

$$H = -\frac{1}{n}\sum_{i=1}^{n} \ln \hat{f}_H(\theta_i) \quad (4.21)$$

At the end of this process, the expected entropy of the joint distribution of models and their parameters is estimated as the sum of the entropies (according to Eq. 4.21) of all the model types for each action. The action with the smallest expected entropy is considered optimal with respect to the entropy-based criteria.

Information Gain Action Selection

In this case, we look for the action that will maximize the information gain (i.e. relative entropy) of the joint distribution approximated with the particle filter. Information gain is defined as the Kullback-Leibler divergence between the current distribution and the expected distribution post-action:

$$IG(P, Q) = D_{KL}(P \,\|\, Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)} \quad (4.22)$$

It is worth noting that in this case, kernel density estimation is not needed since the algorithm can operate on the particle weights directly, making it more computationally efficient than the previous method.

4.5 Evaluation and Discussion

4.5.1 Experimental Setup

In the following experiments, we use a PR2 mobile manipulator with a Microsoft Kinect sensor attached to its head. Since object recognition is not the focus of this work, we use RGB-D data to track the pose of objects in the world with AR tags, a type of visual fiducial. All code used in the following experiments is open source and available online as a ROS package*.

* http://wiki.ros.org/active_articulation

Figure 4.6: Four experimental scenarios. Top-left: rotational cabinet door with an incomplete demonstration; top-right: two free-body erasers moved apart in a straight line; bottom-left: locked drawer as a rigid joint; bottom-right: stapler as a rotational joint.

We designed four experimental scenarios in which the visual demonstration data leaves room for ambiguity about the articulation model type and parameters. In the first scenario, the underlying model is rotational and involves a cabinet door, in which the demonstration consists of a single opening and closing of the door to an angle of approximately 30 degrees. The second scenario is a free-body model and consists of a whiteboard eraser that the demonstrator moves along a straight line with respect to another eraser. The third scenario consists of a rigid model and involves the robot initializing the filter without a demonstration by attempting to open a locked drawer, tracking only the fiducial markers on this static object.
The final scenario is a rotational model and consists of a stapler that is dynamically moved around the scene during the demonstration. Fig. 4.6 shows all four scenarios. For the visual sensor model, we set to a diagonal matrix with 0:04m for the translational component and 30 degrees for the rotational component based on empirical data. Throughout the experiments, the number of MLESAC iterations was set to 3 when initializing the particle filter. 63 Figure 4.7: Top: Entropy over articulation model types after each action for the cabinet door experiment. Bottom: Probability of the true articulation model (rotational) after each action. 64 4.5.2 Action Selection In order to evaluate our system, we first compare different action selection strategies on the cabinet door scenario. After choosing the most efficient action selection strategy, we perform the remainder of our experiments in all four scenarios. In all experiments, 500 particles were used to cover the space of the probability distribution. Evaluation We evaluate three different action selection strategies: minimum expected entropy and maximum KL-divergence as described in Sec. 4.3.4, and random action selection. In the cabinet door scenario, each action selection method was tested by having the robot perform 7 actions with that method. The results are averaged over 5 runs of the experiment. We measure the success of each action selection strategy by its reduction in uncertainty (i.e. minimizing the entropy over models) and by the number of particles representing the correct model—a rotational joint in this case. In order to eliminate the influence of unsuccessful action execution (e.g. the gripper slips off the door handle) on the action strategy comparison, we simulate the output of the manipulation sensor model by analyzing data from executions where manipulation performed well; empirically, we found that the manipulation action was generally successful when the angle between the action vector and the tangent to the model was less than 30 degrees. Using this, we simulated a binary success criteria based on a threshold of 30 degrees. Results Fig. 4.7 presents the obtained entropy over articulation model types and probability of the correct model after each action. It can be seen that the KL-divergence action selection method reduces entropy much faster than the other evaluated methods. Somewhat surprisingly, the expected entropy strategy is almost as slow as selecting an action randomly. To gain some insight into this, recall that the goal of maximizing the KL-divergence is to change the distribution as much as possible, whereas minimizing the expected entropy aims to obtain as ‘peaky‘ of distribution as possible. In the presented scenarios, it is the case that in order to minimize the entropy of the articulation models, the algorithm needs to increase it initially to find the right model, which eventually, leads to a significant reduction of the entropy. In other words, the minimum entropy strategy fails because it reduces entropy greedily at each step and gets stuck in local minima; by contrast, the information gain approach aims to change the distribution as much as possible in any direction, allowing escape from minima. One can observe this effect in the bottom graph of Fig. 4.7. After one action, the entropy-based method yields better results (in terms of the probability of the correct model as well as the entropy) than the KLD-based algorithm, but in all following steps, the KLD-based algorithm outperforms the other methods. 
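To make this evaluation protocol concrete, the following sketch simulates the binary manipulation feedback with the 30-degree criterion and folds it back into the particle weights using the action likelihood of Eqs. 4.6 and 4.9-4.12 (with the failure case taken as the complement of the success likelihood). Particles are represented here simply as dicts with a 'model' field, the tangent_fn helper (returning the predicted motion tangent of Eq. 4.7 for a particle) and the weighting constant are illustrative stand-ins, and stratified resampling is shown in its textbook form.

```python
import numpy as np

def action_angle(tangent, action_vec):
    """Angle alpha between the predicted motion tangent and the action direction (Eq. 4.8)."""
    cos_a = np.dot(tangent, action_vec) / (
        np.linalg.norm(tangent) * np.linalg.norm(action_vec))
    return np.arccos(np.clip(cos_a, -1.0, 1.0))

def simulated_outcome(true_tangent, action_vec, threshold_deg=30.0):
    """Simulated binary manipulation feedback: success iff the angle is below 30 degrees."""
    return int(action_angle(true_tangent, action_vec) < np.radians(threshold_deg))

def action_likelihood(model, tangent, action_vec, outcome, weighting=2.0):
    """Likelihood of the observed outcome under one particle (Eqs. 4.6, 4.9-4.12)."""
    if model == 'rigid':
        return 1.0 if outcome == 0 else 0.0
    if model == 'free':
        return 1.0 if outcome == 1 else 0.0
    p_success = np.exp(-weighting * action_angle(tangent, action_vec))
    return p_success if outcome == 1 else 1.0 - p_success

def update_and_resample(particles, weights, action_vec, outcome, tangent_fn,
                        rng=np.random.default_rng()):
    """Reweight particles with the manipulation feedback and resample (stratified)."""
    w = np.array([weights[i] * action_likelihood(p['model'], tangent_fn(p),
                                                 action_vec, outcome)
                  for i, p in enumerate(particles)])
    w /= w.sum()
    n = len(particles)
    positions = (rng.random(n) + np.arange(n)) / n   # one draw per weight stratum
    indices = np.searchsorted(np.cumsum(w), positions)
    return [particles[i] for i in indices], np.full(n, 1.0 / n)
```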
Fig. 4.8 emphasizes the difference between the two presented action selection methods, showing the probability distribution and chosen actions for both methods. Although both methods chose a similarly good first action, the KLD algorithm selects a much better second action. The entropy- based method chose an action that is perpendicular to the motion predictions of two competing 65 articulation models, resulting in a minimal entropy reduction. The KLD approach selects an action that shifts the probability density over articulation models towards rotational—a model type with a larger entropy in its parameters. This causes the overall entropy of the joint distribution to increase temporarily. In the long run, it is desirable to move belief towards the correct model type, so that in the next step, the entropy over its parameters can be reduced. However, an entropy reduction method is unlikely to select this action because of the 1-step entropy increase. 4.5.3 Articulation Model Estimation Evaluation We now use the superior KL-divergence action selection method to test the overall performance of our system. For each of the previously described experimental scenarios, the algorithm is run 5 times and the resulting statistics are averaged over all the runs. A different number of actions were taken in each scenario based on the convergence rate of the algorithm. Results The second row of Fig. 4.8 shows the distribution of the articulation model types in the cabinet door experiment before any action was taken (based only on the visual data from the demonstration), after the first action, and after the second action, respectively. One can see that the algorithm has shifted most of the probability mass to the correct model (rotational) after only two actions. The first action (in the door plane, Fig. 4.8, top-left) was selected in order to eliminate the free-body model hypothesis while the second action (towards the normal of the door plane, Fig. 4.8, top-right) was chosen to distinguish between the prismatic and rotational hypotheses, as they act on different angles. It is worth noting that once the rotational articulation model type has converged, the standard deviation of each of the parameters of that model was smaller than 0:04 which demonstrates that both the model type and its parameters have converged. Fig. 4.9 presents a similar comparison for the other two experiments. As before, the first column of the graph shows the averaged distribution before any action was taken, and the consecutive columns show the results after each respective action. In case of the eraser experiment, it is apparent that one action was enough to converge on the correct free-body model (Fig. 4.9, top row). This is because the KL-divergence action selection algorithm chose the action that is perpendicular to the previously demonstrated linear motion. Since the eraser can move freely in this direction, the prismatic hypothesis is pruned out and the free-body model is the only hypothesis that can explain both the visual and the manipulation data. The rigid model is not taken into consideration since the demonstration movements were too large to support this hypothesis. Note that a passive, fit-based method would be tricked in this scenario by almost certainly fitting a prismatic model to the visual data. The rigid-body experiment required three actions to shift the majority of the belief to the correct model (Fig. 4.9, bottom row). 
The explanation of this behavior lies in the measurement noise of the fiducial marker tracking. Since the marker appears to slightly oscillate in the demonstration, the 66 free revolute prismatic rigid action Figure 4.8: Outcomes of performing actions on the cabinet door. Top-left: the current state of the particle filter with most probable hypotheses of each articulation model. The yellow arrow shows the optimal action in terms of maximizing the information gain. Top-right: the state of the particle filter after performing the first action (z a 1 = 0, door did not move), together with the currently chosen most informative action. Second row-left: Percentage of each of the models’ particles in the particle filter before any action, after the first action, and after the second action (z a 2 = 1, door did move). The two bottom rows depict the same idea, but the action is selected based on minimization the expected entropy. The outcomes of the actions in this case were bothz a 2 = 1. 67 free revolute prismatic rigid Figure 4.9: Top: Probability distribution over models before and after the action in the eraser experiment. Bottom: Analogous statistics for the rigid-joint experiment. Results of 3 actions are presented. Figure 4.10: Rotational particles in the stapler experiment before (left) and after (right) a manipu- lation action. 68 particle initialization algorithm is able to fit many different models to the data, including prismatic joints with axes in many different directions. Manipulation actions help to change this initial belief by moving along the axis of the most likely prismatic hypotheses. Since the manipulation output confirms that there is no movement in this direction, the prismatic hypotheses are pruned out. However, due to a large number of different prismatic hypotheses, many actions are needed to be certain that it is actually a rigid model. The final experiment illustrates uncertainty over model parameters caused by systematic observation noise, rather than over model type. Fig. 4.10 shows the state of the particle filter in the stapler scenario before and after an action was taken. In this case, the articulation model type is clear from the demonstration (rotational joint), but there is significant ambiguity in the parameter space of this model. Since there is systematic noise in the marker detection (resulting from the changing viewing angle of the markers), it is crucial to use the particle filter in order to model all potential articulation model hypotheses and use manipulation to prune out the incorrect ones. In this experiment, one action was sufficient in order for the algorithm to converge. 4.6 Conclusions We introduced a probabilistic interactive perception system to represent and reduce uncertainty over various types of articulated motion models. The system includes a visual sensor model to deal with perceptual noise, a manipulation model to probabilistically reason about the outcomes of actions, and an action selection metric that allows for efficient reduction of uncertainty over models and parameters. Experiments demonstrated, somewhat counter-intuitively, that the most efficient way of re- ducing the entropy of the distribution over candidate models and parameters is not achieved by greedily minimizing the expected entropy after each action. Instead, an information gain approach based on KL-divergence was shown to be significantly more effective than minimizing entropy or performing random actions. 
We then showed that the full system can actively cope with ambiguity and reduce uncertainty over models and their parameters in a variety of situations. The PR2 was able to actively manipulate objects to disambiguate a limited cabinet door demonstration; to prevent being fooled by prismatic motion between two free-body erasers; to eliminate alternate hypotheses of a rigid locked drawer caused by random visual noise; and to narrow the distribution of rotational parameters of a stapler that were corrupted by systematic perceptual noise. 69 Chapter 5 Self-Supervised Regrasping using Reinforcement Learning We introduce a framework for learning regrasping behaviors based on tactile data. First, we present a grasp stability predictor that uses spatio-temporal tactile features collected from the early-object-lifting phase to predict the grasp outcome with a high accuracy. Next, the trained predictor is used to supervise and provide feedback to a reinforcement learning algorithm that learns the required linear grasp-adjustment policies based on tactile feedback, with a small number of parameters, for single objects. Finally, a general high-dimensional regrasping policy is learned in a supervised manner by using the outputs of the individual policies. Our results gathered over more than 50 hours of real robot experiments indicate that the robot is able to predict the grasp outcome with 93% accuracy. The robot is able to improve the grasp success rate from 42% when randomly grasping an object to up to 97% when allowed to regrasp the object in case of a predicted failure. In addition, our experiments indicate that the general high-dimensional policy learned using our method is able to outperform the respective linear policies on each of the single objects that they were trained on. Moreover, the general policy is able to generalize to a novel object that was not present during training. 5.1 Introduction Autonomous grasping of unknown objects is a fundamental requirement for service robots perform- ing manipulation tasks in real world environments. Even though there has been a lot of progress in the area of grasping, it is still considered an open challenge and even the state-of-the-art grasping methods may result in failures. Two questions arise immediately: i) how to detect these failures early, and ii) how to adjust a grasp to avoid failure and improve stability. Early grasp stability assessment is particularly important in the regrasping scenario, where, in case of predicted failure, the robot must be able to place the object down in the same position in order to regrasp it later. In many cases, early detection of grasping failures cannot be performed using a vision system as they occur at the contact points and involve various tactile events such as incipient slip. Recent developments of tactile sensors (Wettels, Santos, Johansson, & Loeb, 2008) 70 and spatio-temporal classification algorithms (Madry, Bo, Kragic, & Fox, 2014) enable us to use informative tactile feedback to advance grasping-failure-detection methods. In order to correct the grasp, the robot has to be able to process and use the information on why the grasp has failed. Tactile sensors are one source of this information. Our grasp adjustment approach aims to use this valuable tactile information in order to infer a local change of the gripper configuration that will improve the grasp stability. An example of a regrasping behavior is depicted in Fig. 5.1. 
Hereby, we show that simple regrasping strategies can be learned using linear policies if enough data is provided. However, these strategies do not generalize well to other classes of objects than those they were trained on. The main reason for this shortcoming is that the policies are not representative enough to capture the richness of different shapes and physical properties of the objects. A potential solution to learn a more complex and generalizable regrasping strategy is to employ a more complex policy class and gather a lot of real-robot data with a variety of objects to learn the policy parameters. The main weakness of such a solution is that, in addition to requiring large amounts of data, these complex policies often result in the learner becoming stuck in poor local optima (Levine & Koltun, 2013a; Deisenroth, Neumann, & Peters, 2013). We propose learning a complex high-dimensional regrasping policy in a supervised fashion. Our method uses simple linear policies to guide the general policy to avoid poor local minima and to learn the general policy from smaller amounts of data. In this work, we jointly address the problems of grasp-failure detection and grasp adjustment, i.e. regrasping using tactile feedback. In particular, we use a failure detection method to guide and self-supervise the regrasping behavior using reinforcement learning. In addition to the regrasping approach, we present an extensive evaluation of a spatio-temporal grasp stability predictor that is used on a biomimetic tactile sensor. This prediction is then used as a reward signal that supervises the regrasping reinforcement learning process. Moreover, we present a method to learn a more complex high-dimensional regrasping policies by reusing the previously learned linear policies in a supervised setting. 5.2 Related Work The problem of grasp stability assessment has been addressed previously in various ways. (Dang & Allen, 2014) utilize tactile sensory data to estimate grasp stability. However, their approach does not take into account the temporal properties of the data and uses the tactile features only from one time step at the end of the grasping process. There have also been other approaches that model the entire time series of tactile data. In (Bekiroglu, Laaksonen, Jørgensen, Kyrki, & Kragic, 2011) and (Bekiroglu, Kragic, & Kyrki, 2010), the authors train a Hidden Markov Model to represent the sequence of tactile data and then use it for grasp stability assessment. The newest results in the analysis of the tactile time series data for grasp stability assessment were presented in (Madry et al., 2014). The authors show that the unsupervised feature learning approach presented in their work achieves better results than any of the previously mentioned methods, including (Bekiroglu et al., 2011) and (Bekiroglu et al., 2010). In this work, we show how the spatio-temporal tactile 71 Figure 5.1: Regrasping scenario: the robot partially misses the object with one of the fingers during the initial grasp (left), predicts that the current grasp will be unstable, places the object down, and adjusts the hand configuration to form a firm grasp of the object using all of its fingers (right). 72 features developed in (Madry et al., 2014) can be applied to the state-of-the-art biomimetic tactile sensor, which enables us to better exploit the capabilities of this advanced tactile sensor. Biomimetic tactile sensors and human-inspired algorithms have been used previously for grasping tasks. 
In (Wettels, Parnandi, Moon, Loeb, & Sukhatme, 2009), the authors show an approach to control grip force while grasping an object using the same biomimetic tactile sensor that is used in this work. (Su et al., 2015) present a grasping controller that uses estimation of various tactile properties to gently pick up objects of various weights and textures using a biomimetic tactile sensor. All of these works tackle the problem of human-inspired grasping. In this work, however, we focus on the regrasping behavior and grasp stability prediction. We also compare the grasp stability prediction results using the finger forces estimated by the method from (Su et al., 2015) to the methods described in this work. Reinforcement learning has enjoyed success in many different applications including manip- ulation tasks such as playing table tennis (Kober, Oztop, & Peters, 2011) or executing a pool stroke (Pastor, Kalakrishnan, Chitta, Theodorou, & Schaal, 2011). Reinforcement learning methods have also been applied to grasping tasks. (Kroemer, Detry, Piater, & Peters, 2010) propose a hierarchical controller that determines where and how to grasp an unknown object. The authors use joint positions of the hand in order to determine the reward used for optimizing the policy. Another approach was presented by (Montesano & Lopes, 2012), where the authors address the problem of actively learning good grasping points from visual features using reinforcement learning. In this work, we present a different approach to the problem of grasping using reinforcement learning, i.e. regrasping. In addition, we use tactile-feature-based reward function to improve the regrasping performance. Tactile features have been rarely used in reinforcement learning manipulation scenarios. (Pastor et al., 2011) use pressure sensor arrays mounted on the robot’s fingers to learn a manipulation skill of flipping a box using chopsticks. Similarly to (Pastor et al., 2011), (Chebotar, Kroemer, & Peters, 2014) use dynamic movement primitives and reinforcement learning to learn the task of gentle surface scraping with a spatula. In both of these cases, the robot was equipped with the tactile matrix arrays. In this work, we apply a reinforcement learning approach to the task of regrasping using an advanced biomimetic tactile sensor together with state-of-the-art spatio-temporal feature descriptors. The works most related to this work have focused on the problem of grasp adjustment based on sensory data. (Dang & Allen, 2014) tackles the regrasping problem by searching for the closest stable grasp in the database of all the previous grasps performed by the robot. A similar approach is presented by (M. Li, Bekiroglu, Kragic, & Billard, 2014), where the authors propose a grasp adaptation strategy that searches a database for a similar tactile sensing experience in order to correct the grasp. The authors introduce an object-level impedance controller whose parameters are adapted based on the current grasp stability estimates. In that case, the grasp adaptation is focused on in-hand adjustments rather than placing the object down and regrasping it. 
The main differences between these approaches and ours are twofold: i) we employ spatio-temporal features and use tactile data from the beginning of the lifting phase, which enables us to achieve high accuracy in a short time for grasp stability prediction, and ii) we use reinforcement learning with spatio-temporal features, which is supervised by the previously learned grasp stability predictor. This allows us to learn the regrasping behavior in an autonomous and efficient way.

5.3 Grasp stability estimation

The first component of our regrasping framework is the grasp stability predictor that provides early detection of grasp failures. We use short sequences of tactile data to extract necessary information about the grasp quality. This requires processing tactile data both in the spatial and the temporal domains. In this section, we first present the biomimetic tactile sensor used in this work and how its readings are adapted for the feature extraction method. To the best of our knowledge, this is the first time that the spatio-temporal tactile features are used for this advanced biomimetic tactile sensor. Next, we describe the spatio-temporal method for learning a sparse representation of the tactile sequence data and how it is used for the grasp stability prediction.

5.3.1 Biomimetic tactile sensor

In this work, we use a haptically-enabled robot equipped with 3 biomimetic tactile sensors - BioTacs (Wettels et al., 2008). Each BioTac consists of a rigid core housing an array of 19 electrodes surrounded by an elastic skin. The skin is inflated with an incompressible and conductive liquid. When the skin is in contact with an object, the liquid is displaced, resulting in distributed impedance changes in the electrode array on the surface of the rigid core. The impedance of each electrode tends to be dominated by the thickness of the liquid between the electrode and the immediately overlying skin. In order to apply feature extraction methods from computer vision, as described in the next section, the BioTac electrode readings have to be represented as tactile images. To form such a 2D tactile image, the BioTac electrode values are laid out according to their spatial arrangement on the sensor as depicted in Fig. 5.2.

Figure 5.2: The schematic of the electrode arrangements on the BioTac sensor (left). Tactile image used for the ST-HMP features (right). The X values are the reference electrodes. The 19 BioTac electrodes are measured relative to these 4 reference electrodes. V1 and V2 are created by taking an average response of the neighboring electrodes: V1 = avg(E17, E18, E12, E2, E13, E3) and V2 = avg(E17, E18, E15, E5, E13, E3).
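As a concrete illustration of this tactile-image construction, the sketch below lays the 19 electrode readings out on a small 2D grid and appends the two averaged virtual electrodes V1 and V2 from the caption of Fig. 5.2. The grid shape and the electrode-to-pixel assignment used here are illustrative placeholders rather than the exact BioTac layout.

```python
import numpy as np

# Virtual electrodes from Fig. 5.2 (1-based electrode indices).
V1_NEIGHBORS = [17, 18, 12, 2, 13, 3]
V2_NEIGHBORS = [17, 18, 15, 5, 13, 3]

# Illustrative electrode-to-pixel assignment on a 7x3 grid; the real layout
# follows the physical electrode arrangement shown in Fig. 5.2.
GRID_SHAPE = (7, 3)
PIXEL_OF_ELECTRODE = {e: divmod(e - 1, GRID_SHAPE[1]) for e in range(1, 20)}

def tactile_image(electrodes):
    """Map 19 BioTac electrode impedances to a 2D tactile image."""
    electrodes = np.asarray(electrodes, dtype=float)
    assert electrodes.shape == (19,)
    img = np.zeros(GRID_SHAPE)
    for e, (row, col) in PIXEL_OF_ELECTRODE.items():
        img[row, col] = electrodes[e - 1]
    # The two virtual electrodes fill the remaining pixels of the last row.
    img[6, 1] = electrodes[[i - 1 for i in V1_NEIGHBORS]].mean()
    img[6, 2] = electrodes[[i - 1 for i in V2_NEIGHBORS]].mean()
    return img
```

A sequence of such images, one per time step of the grasp, is the input that the ST-HMP descriptor described next operates on.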
5.3.2 Hierarchical Matching Pursuit

To describe a time series of tactile data, we employ a spatio-temporal feature descriptor - Spatio-Temporal Hierarchical Matching Pursuit (ST-HMP). We choose ST-HMP features as they have been shown to have high performance in temporal tactile data classification tasks (Madry et al., 2014). ST-HMP is based on Hierarchical Matching Pursuit (HMP), which is an unsupervised feature-extraction method used for images (Bo, Ren, & Fox, 2011). HMP creates a hierarchy of several layers of simple descriptors using a matching pursuit encoder and spatial pooling. To encode data as sparse codes, HMP learns a dictionary of codewords. The dictionary is learned using the common codebook-learning method K-SVD (Aharon, Elad, & Bruckstein, 2006). Given a set of N H-dimensional observations (e.g. image patches) Y = [y_1, ..., y_N] ∈ R^{H×N}, HMP learns an M-word dictionary D = [d_1, ..., d_M] ∈ R^{H×M} and the corresponding sparse codes X = [x_1, ..., x_N] ∈ R^{M×N} that minimize the reconstruction error between the original and the encoded data:

\[ \min_{D,X} \; \|Y - DX\|_F^2 \quad \text{s.t.} \quad \forall m \; \|d_m\|_2 = 1 \;\; \text{and} \;\; \forall i \; \|x_i\|_0 \le K, \]

where \(\|\cdot\|_F\) is the Frobenius norm, x_i are the sparse vectors, \(\|\cdot\|_0\) is the zero-norm that counts the number of non-zero elements in a vector, and K is the sparsity level that limits the number of non-zero elements in the sparse codes. The optimization problem of minimizing the reconstruction error can be solved in an alternating manner. First, the dictionary D is fixed and each of the sparse codes is optimized:

\[ \min_{x_i} \; \|y_i - D x_i\|^2 \quad \text{s.t.} \quad \|x_i\|_0 \le K. \]

Next, given the sparse code matrix X, the dictionary is recomputed by solving the following optimization problem for each codeword in the dictionary:

\[ \min_{d_m} \; \|Y - DX\|_F^2 \quad \text{s.t.} \quad \|d_m\|_2 = 1. \]

The first step is combinatorial and NP-hard, but there exist algorithms to solve it approximately. HMP uses orthogonal matching pursuit (OMP) (Pati, Rezaiifar, & Krishnaprasad, 1993) to approximate the sparse codes. OMP proceeds iteratively by selecting the codeword that is best correlated with the current residual of the reconstruction error. Thus, at each iteration, it selects the codewords that maximally reduce the error. After selecting a new codeword, observations are orthogonally projected into the space spanned by all the previously selected codewords. The residual is then recomputed and a new codeword is selected again. This process is repeated until the desired sparsity level K is reached. The dictionary optimization step can be performed using a standard gradient descent algorithm.

5.3.3 Spatio-temporal HMP

In ST-HMP, the tactile information is aggregated both in the spatial and the temporal domains. This is achieved by constructing a pyramid of spatio-temporal features at different coarseness levels, which provides invariance to spatial and temporal deviations of the tactile signal. In the spatial domain, the dictionary is learned and the sparse codes are extracted from small tactile image patches. As the next step, the sparse codes are aggregated using spatial max-pooling. In particular, the image is divided into spatial cells and each cell's features are computed by max-pooling all the sparse codes inside the spatial cell:

\[ F(C_s) = \left[ \max_{j \in C_s} |x_{j1}|, \; \ldots, \; \max_{j \in C_s} |x_{jm}| \right], \]

where j ∈ C_s indicates that the image patch is inside the spatial cell C_s and m is the dimensionality of the sparse codes. The HMP features of each tactile image in the time sequence are computed by performing max-pooling on various levels of the spatial pyramid, i.e. for different sizes of the spatial cells. After computing the HMP features for all tactile images in the time series, the pooling is performed on the temporal level by constructing a temporal pyramid. The tactile sequence is divided into sub-sequences of different lengths. For all sub-sequences, the algorithm performs max-pooling of the HMP features, resulting in a single feature descriptor for each sub-sequence. Combined with spatial pooling, this results in a spatio-temporal pooling of the sparse codes:

\[ F(C_{st}) = \left[ \max_{j \in C_{st}} |\hat{x}_{j1}|, \; \ldots, \; \max_{j \in C_{st}} |\hat{x}_{jm}| \right], \]

where \(\hat{x}_j\) is an HMP feature vector that is inside the spatio-temporal cell C_st.
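A much-simplified sketch of these encoding and pooling steps is shown below, with scikit-learn's OMP solver standing in for the matching pursuit encoder; the learned K-SVD dictionary, the patch extraction, and the multiple pyramid levels are assumed to be given and are not shown.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def encode_patches(patches, D, K):
    """Sparse-code image patches (H x N) against a dictionary D (H x M) with OMP.

    The columns of D are assumed to have unit norm, matching the K-SVD constraint.
    """
    X = orthogonal_mp(D, patches, n_nonzero_coefs=K)
    return X if X.ndim == 2 else X[:, None]        # (M, N) sparse codes

def spatial_pool(codes, cell_ids, n_cells):
    """F(C_s): max of |sparse codes| over the patches falling into each spatial cell."""
    feats = np.zeros((n_cells, codes.shape[0]))
    for c in range(n_cells):
        in_cell = codes[:, cell_ids == c]
        if in_cell.size:
            feats[c] = np.abs(in_cell).max(axis=1)
    return feats                                    # one HMP feature row per cell

def temporal_pool(frame_features, sub_sequences):
    """F(C_st): max-pool the per-frame HMP features over temporal sub-sequences
    and concatenate the results into a single descriptor."""
    pooled = [np.abs(frame_features[t0:t1]).max(axis=0) for (t0, t1) in sub_sequences]
    return np.concatenate([p.ravel() for p in pooled])
```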
Finally, the features of all the spatio-temporal cells are concatenated to create a single feature vector F_P for the complete tactile sequence:

\[ F_P = [C_{11}, \ldots, C_{ST}], \]

where S is the number of the spatial cells and T is the number of the temporal cells. Hence, the total dimensionality of the ST-HMP descriptor is S × T × M. After extracting the ST-HMP feature descriptor from the tactile sequence, we use a Support Vector Machine (SVM) with a linear kernel to learn a classifier for the grasp stability prediction, as described in (Madry et al., 2014). The tactile sequences used in this work consist of tactile data shortly before and after starting to pick up a grasped object. This process is described in more detail in Sec. 5.5. In this work, we also compare features extracted only from tactile sensors with combinations of tactile and non-tactile features, such as force-torque sensors, strain gages, hand orientation, finger angles, etc. To achieve temporal invariance, we apply temporal max-pooling to these features with the same temporal pyramid as for the tactile features. By doing so, we can combine tactile ST-HMP and non-tactile temporally pooled features by concatenating their feature vectors.

5.4 Reinforcement learning for regrasping

Once a grasp is predicted as a failure by the grasp stability predictor, the robot has to place the object down and regrasp it using the information acquired during the initial grasp. In order to achieve this goal, we learn a mapping from the tactile features of the initial grasp to the grasp adjustment. The parameters of this mapping function are learned using a reinforcement learning approach. In the following, we explain how this mapping is computed. In addition, we describe the policy search method and our approach to reducing the dimensionality of the problem.

5.4.1 Mapping from tactile features to grasp adjustments

Similarly to the grasp stability prediction, we use the ST-HMP features as the tactile data representation to compute the adjustment of the unsuccessful grasp. In particular, we use a linear combination of the ST-HMP features to compute the change of the grasp pose. The weights of this combination are learned using a policy search algorithm described in the next section. Using multiple levels in the spatial and temporal pyramids of ST-HMP increases the dimensionality of the tactile features substantially. This leads to a large number of parameters to learn for the mapping function, which is usually a hard task for policy search algorithms (Deisenroth et al., 2013). Therefore, we perform principal component analysis (PCA) (Jolliffe, 1986) on the ST-HMP features and use only the largest principal components to compute the mapping. The grasp pose adjustment is represented by the 3-dimensional change of the gripper's position and 3 Euler angles describing the change of the gripper's orientation. Each adjustment dimension is computed separately using the largest principal components of the ST-HMP features:

\[ \begin{pmatrix} x \\ y \\ z \\ \alpha \\ \beta \\ \gamma \end{pmatrix} = \begin{pmatrix} w_{x,1} & \cdots & w_{x,n} \\ w_{y,1} & \cdots & w_{y,n} \\ \vdots & \ddots & \vdots \\ w_{\gamma,1} & \cdots & w_{\gamma,n} \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_n \end{pmatrix}, \]

where w_{i,j} are the weights of the tactile features φ_i and n is the number of principal components that are used for the mapping.
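The mapping itself is a single matrix-vector product; a minimal sketch, assuming the PCA projection and the weight matrix W (whose entries are the w_{i,j} learned in the next section) are already available:

```python
import numpy as np

def grasp_adjustment(st_hmp_feature, pca_components, pca_mean, W):
    """Map an ST-HMP descriptor to a 6-D grasp pose correction.

    pca_components: (n, d) matrix holding the n largest principal directions
    pca_mean:       (d,)   mean ST-HMP descriptor used for centering
    W:              (6, n) weight matrix learned with REPS (Sec. 5.4.2)
    Returns [dx, dy, dz, d_alpha, d_beta, d_gamma].
    """
    phi = pca_components @ (st_hmp_feature - pca_mean)   # n principal components
    return W @ phi                                        # 6-D pose adjustment
```

With n = 5 principal components, as used in the experiments of Sec. 5.5.2, W contains only 30 parameters.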
5.4.2 Policy search for learning mapping parameters

The linear combination weights of the mapping from the tactile features to the grasp adjustments, w_{i,j}, are learned using a reinforcement learning technique described in the following. All the combination weights are first concatenated to form a single parameter vector:

\[ \theta = (w_{x,1}, \ldots, w_{x,n}, \ldots, w_{\gamma,n}). \]

We define the policy π(θ) as a Gaussian distribution over θ with a mean μ and a covariance matrix Σ. In order to find good regrasping candidates, the parameter vector θ is sampled from this distribution. In the next step, we compute the reward R(θ) by estimating the success of the adjusted grasp using the grasp stability predictor described in Sec. 5.3. After a number of trials is collected, the current policy is optimized and the process repeats. For policy optimization we use the relative entropy policy search (REPS) algorithm (Peters, Mülling, & Altun, 2010). The main advantage of this method is that, in the process of reward maximization, the loss of information during a policy update is bounded, which leads to better convergence behavior.

The goal of REPS is to maximize the expected reward J(π) of the policy subject to bounded information loss between the previous and updated policy. Information loss is defined as the Kullback-Leibler (KL) divergence between the two policies. Bounding the information loss limits the change of the policy and hence avoids sampling too far away from previously explored policy regions. Let q(θ) be the old policy and π(θ) be the new policy after the policy update. Using this notation, we can formulate a constrained optimization problem:

\[ \max_{\pi} \; J(\pi) = \int \pi(\theta) R(\theta)\, d\theta \quad \text{s.t.} \quad \int \pi(\theta) \log \frac{\pi(\theta)}{q(\theta)}\, d\theta \le \epsilon, \qquad \int \pi(\theta)\, d\theta = 1, \]

where, as mentioned before, J(π) is the total expected reward of using the policy π. The first constraint bounds the KL-divergence between the policies, with the maximum information loss set to ε. The second constraint ensures that π(θ) forms a proper probability distribution. Solving the optimization problem with Lagrange multipliers results in the following dual function:

\[ g(\eta) = \eta\epsilon + \eta \log \int q(\theta) \exp\!\left( \frac{R(\theta)}{\eta} \right) d\theta, \]

where the integral term can be approximated from the samples using the maximum-likelihood estimate of the expected value. Furthermore, from the Lagrangian it can be derived that

\[ \pi(\theta) \propto q(\theta) \exp\!\left( \frac{R(\theta)}{\eta} \right). \]

Therefore, we are able to compute the new policy parameters with a weighted maximum-likelihood solution. The weights are equal to exp(R(θ)/η), i.e. the rewards are scaled by the parameter η. Decreasing η gives larger weights to the high-reward samples, whereas increasing η results in more uniform weights. The parameter η is computed according to the optimization constraints by solving the dual problem. Given a set of policy parameters {θ_1, ..., θ_N} and corresponding episode rewards, the policy update rules for μ and Σ can be formulated as follows (Deisenroth et al., 2013):

\[ \mu = \frac{\sum_{i=1}^{N} d_i\, \theta_i}{\sum_{i=1}^{N} d_i}, \qquad \Sigma = \frac{\sum_{i=1}^{N} d_i\, (\theta_i - \mu)(\theta_i - \mu)^{\top}}{\sum_{i=1}^{N} d_i}, \]

with weights d_i = exp(R(θ_i)/η).

5.4.3 Learning a general regrasping policy

After the individual linear policies have been learned, we train a larger high-dimensional policy in a supervised manner using the outputs of the individual policies. This is similar to the guided policy search approach proposed in (Levine & Koltun, 2013a). In our case, the guidance of the general policy comes from the individual policies that can be efficiently learned for separate objects. As the general policy class we choose a neural network with a large number of parameters. Such a policy has enough representational richness to incorporate regrasping behavior for many different objects. However, learning its parameters directly requires a very large number of experiments, whereas supervised learning with already learned individual policies speeds up the process significantly.
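The REPS update of Sec. 5.4.2 that produces these individual policies reduces to a few lines of code. A minimal sketch, assuming the episode rewards for a batch of sampled parameter vectors have already been collected; the bound on log η and the KL step ε = 1.0 are illustrative choices rather than values from the thesis:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reps_update(thetas, rewards, epsilon=1.0):
    """One REPS policy update for a Gaussian policy over parameters theta.

    thetas:  (N, d) sampled parameter vectors
    rewards: (N,)   episode rewards R(theta_i)
    epsilon: KL bound between consecutive policies
    Returns the new mean and covariance of the Gaussian policy.
    """
    R = rewards - rewards.max()          # shifting R only shifts the dual by a constant

    def dual(log_eta):                   # sample-based dual g(eta)
        eta = np.exp(log_eta)
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    eta = np.exp(minimize_scalar(dual, bounds=(-5.0, 5.0), method="bounded").x)

    d = np.exp(R / eta)                  # weights d_i = exp(R(theta_i)/eta)
    mu = (d[:, None] * thetas).sum(0) / d.sum()
    diff = thetas - mu
    sigma = (d[:, None, None] * np.einsum("ni,nj->nij", diff, diff)).sum(0) / d.sum()
    return mu, sigma
```

The returned mean and covariance define the Gaussian from which the next batch of grasp adjustments is sampled.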
To generate training data for learning the general policy, we sample grasp corrections from the already learned individual policies using previously collected data. Input features and resulting grasp corrections are combined in a “transfer” dataset, which is used to transfer the behaviors to the general policy. In order to increase the amount of information provided to the general policy, we increase the number of its input features by extracting a larger number of PCA components from the ST-HMP features. Using different features in the general policy than in the original individual policies is one of the advantages of our setting. The individual policies provide outputs of the desired behavior, while the general policy can have a different set of input features. To train the neural network, we employ the mean-squared error loss function and the Levenberg- Marquardt optimization algorithm (Hagan & Menhaj, 1994). In the hidden layers, we use neurons with the hyperbolic tangent sigmoid transfer function: a(x)= 2 1+ exp(−2x) − 1: For the activation of the output layer we use a linear transfer function, i.e. the output is a linear combination of the inputs from the previous layer. In order to avoid overfitting of the training data we employ the early stopping technique during training (Yao, Rosasco, & Caponnetto, 2007). The data set is divided into mutually exclusive training, validation and test sets. While the network parameters are optimized on the training set, the training stops once the performance on the validation set starts decreasing. 79 Figure 5.3: Objects and experimental setup used for learning the grasp stability predictor and the regrasping behavior. If an object falls out of the hand it returns to its initial position due to the shape of the bowl. Top-left: the cylinder. Top-right: the box. Bottom-left: the ball. Bottom-right: the novel object. 5.5 Experimental Results 5.5.1 Evaluation of grasp stability prediction In our experiments, we use a Barrett arm and hand that is equipped with three biomimetic tactile sensors (BioTacs) (Wettels et al., 2008). Each BioTac includes an array of 19 electrodes, whose impedances change depending on the local deformation of the robot’s flexible skin. For extracting ST-HMP features, the BioTac electrode values are laid out in a 2D tactile image according to their spatial arrangement on the sensor as depicted in Fig. 5.4 (top left). We use bowls (see Fig. 5.3) to bring the objects up right if they fall out of the gripper during the extensive shaking motions that are performed later in the experiment. This experimental setup enables us to fully automate the learning process and let the robot run for many hours to autonomously learn the grasp stability predictor. The experiment proceeds as follows. The robot reaches for the object to perform a randomly generated top grasp. The randomness stems from white noise added to the top grasp. Standard deviation of the noise is±10deg in roll and pitch of the gripper,±60deg in yaw, and±1cm in all translational directions. These parameters are tuned such that there is always at least one finger touching the object. After approaching and grasping the object using the force grip controller (Su et al., 2015), the robot lifts the object and performs extensive shaking motions in all directions to ensure that the grasp is stable. The shaking motions are performed by rapidly changing the 80 end-effector’s orientation by±15deg and position by±3cm in all directions multiple times. 
If the object is still in the hand after the shaking motions, we consider it to be a successful grasp. The wrist-mounted force-torque sensor is used to determine if the object is still in the hand at the end of the experiment. The ST-HMP features use a temporal window of 650 ms before and 650 ms after the robot starts to pick up the object. Our goal is to determine early in the lifting phase if the grasp is going to fail. In this manner, the robot can stop the motion early enough to avoid displacing the object, and hence, it can regrasp it later. We evaluate our approach on three objects: a cylindrical object, a box and a ball. We perform a 5-fold cross-validation on 500 grasp samples for each object. The robot achieves a grasp classification accuracy of 90.7% on the cylinder, 82.4% on the box and 86.4% on the ball.

5.5.2 Learning individual linear regrasping policies

After learning the grasp stability predictor, we evaluate the regrasping algorithm for individual policies. The experimental setup for this scenario is similar to the one for the grasp stability predictor. The robot uses the stability prediction to self-supervise the learning process. In this manner, we are able to let the robot run for many hours for each object to autonomously learn the regrasping behavior. As described in Sec. 5.4.1, we apply PCA and extract five principal components from the ST-HMP features for learning individual policies. As a result, the linear policies contain only 30 parameters (5 for each of the 6 grasp adjustment dimensions). This makes the policy search feasible using relatively small amounts of data.

We evaluate the individual policies learned for the cylinder, box and ball objects. We perform multiple policy updates for each object until the policy converges. For each update, we collect 100 regrasping samples. First, we perform a randomly generated top grasp. If the grasp is predicted to fail, the algorithm samples the current regrasping policy and the robot performs up to three regrasps. If one of the regrasps is successful, the robot stops regrasping and performs the next random grasp. The rewards for the reinforcement learning are specified as follows:
- 0.0: The grasp is predicted unsuccessful by the grasp stability predictor. We do not perform any additional actions.
- 0.5: The grasp is predicted successful by the stability predictor. However, the object falls out of the hand after additional extensive shaking motions.
- 1.0: The grasp is predicted successful and the object is still in the hand after the shaking motions.

Fig. 5.4 shows the average reward values after each policy update for all the objects. The robot is able to improve its regrasping behavior significantly. To evaluate the results of the policy search, we perform 100 random grasps using the final policies on each of the objects that they were learned on. The robot has three attempts to regrasp each object using the learned policy. Table 5.1 shows the percentage of successful grasps on each object after each regrasp. Already after one regrasp, the robot is able to correct the majority of the failed grasps by increasing the success rate of the grasps from 41.8% to 83.5% on the cylinder, from 40.7% to 85.4% on the box and from 52.9% to 84.8% on the ball. Moreover, allowing additional regrasps increases this value to 90.3% for two and 97.1% for three regrasps on the cylinder, to 93.7% and 96.8% on the box, and to 91.2% and 95.1% on the ball. These results indicate that the robot is able to learn a tactile-based regrasping strategy for individual objects.
Figure 5.4: Top left: schematic of the electrode arrangements on the BioTac sensor and the corresponding tactile image used for the ST-HMP features. V1, V2 and V3 are computed by averaging the neighboring electrode values. Top right, bottom left, bottom right: reinforcement learning curves (average reward vs. number of policy updates) for regrasping the individual objects (cylinder, box, ball) using REPS. Policy updates are performed every 100 regrasps.

5.5.3 Evaluation of general regrasping policy

After training individual policies, we create a "transfer" dataset with grasp corrections obtained from the individual linear regrasping policies for all objects. For each set of tactile features, we query the respective previously-learned linear policy for the corresponding grasp correction. We take the input features for the individual policies from the failed grasps in the open-source BiGS dataset (Chebotar, Hausman, Su, et al., 2016), available at http://bigs.robotics.usc.edu/. The grasps in BiGS were collected in an analogous experimental setup and can directly be used for creating the "transfer" dataset. In total, the training set contains 3351 examples: 1380 for the cylinder, 1035 for the box and 936 for the ball. We use supervised learning with the obtained dataset to learn a combined policy that mimics the behavior of the individual policies. In this work, we employ a neural network to achieve this task.

Table 5.1: Performance of the individual and combined regrasping policies (grasp success rate in %).

Object     | Individual policies                                  | Combined policy
           | No regrasps | 1 regrasp | 2 regrasps | 3 regrasps   |
Cylinder   | 41.8        | 83.5      | 90.3       | 97.1         | 92.3
Box        | 40.7        | 85.4      | 93.7       | 96.8         | 87.6
Ball       | 52.9        | 84.8      | 91.2       | 95.1         | 91.4
New object | 40.1        | -         | -          | -            | 80.7

To find the optimal architecture of the neural network, we evaluated different networks with various depths and numbers of neurons to learn the nonlinear policy. The best performance is achieved by using 20 ST-HMP PCA features as inputs. We have not observed any improvement of the approximation accuracy when using more than one hidden layer. This indicates that the ST-HMP algorithm already extracts the most distinctive features from the tactile data and we do not require an additional deep network architecture for our task. The final neural network consists of one hidden layer of 20 hidden units with tangent sigmoid activation functions, 20 input features and 6 outputs for the grasp position and orientation adjustments. The resulting number of parameters in the generalized policy is 546. Such a high-dimensional policy would be hard to learn by directly employing reinforcement learning. Our formulation as supervised learning, however, simplifies this problem and makes the learning process with relatively small amounts of data more feasible. Table 5.1 shows the performance of the generalized policy on the single objects. Since the combined policy is deterministic, we only evaluate a single regrasp for each failed grasp.
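For reference, the training of the generalized policy just described can be sketched with scikit-learn's MLPRegressor standing in for the Levenberg-Marquardt optimizer used in the thesis; the architecture (20 PCA inputs, one hidden layer of 20 tanh units, 6 linear outputs) and the early-stopping validation split follow the text, while the solver and validation fraction are illustrative substitutions.

```python
from sklearn.neural_network import MLPRegressor

def fit_general_policy(X_transfer, Y_transfer):
    """Fit the generalized regrasping policy on the 'transfer' dataset.

    X_transfer: (N, 20) ST-HMP PCA features of failed grasps
    Y_transfer: (N, 6)  grasp corrections produced by the individual policies
    """
    policy = MLPRegressor(
        hidden_layer_sizes=(20,),   # one hidden layer of 20 units
        activation="tanh",          # tangent sigmoid transfer function
        solver="adam",              # gradient-based stand-in for Levenberg-Marquardt
        early_stopping=True,        # hold out a validation set, stop when it degrades
        validation_fraction=0.2,
        max_iter=5000,
    )
    policy.fit(X_transfer, Y_transfer)
    return policy

# Usage: correction = fit_general_policy(X, Y).predict(features.reshape(1, -1))[0]
```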
Interestingly, the combined policy is able to achieve better performance on each of the single objects than the respective linear policies learned specifically for these object after one regrasp. Furthermore, in cases of the cylinder and the ball, the performance of the generalized policy is better than the linear policies evaluated after two regrasps. This shows that the general policy generalizes well between the single policies. In addition, by utilizing the knowledge obtained from single policies, the generalized policy is able to perform better on the objects that the single policies were trained on. The performance of the generalized policy on the box object is slightly worse than on the two other objects. A notable difference in this case is the increased importance of the gripper yaw angle with respect to the grasp performance. The individual policy learned on the box learns to correct the grasp such that the robot aligns its fingers with the box sides while regrasping. However, this is not important for the cylinder and the ball objects due to their symmetric shapes. Therefore, the regrasping policy for the box could not benefit from the two other policies when adjusting grasp in the yaw direction. We test performance of the generalized policy on a novel, more complex object (see the bottom-right corner in Fig. 5.3), which was not present during learning. It is worth noting that the novel object combines features of the three objects that the policies were trained on. The generalized policy is able to improve the grasping performance significantly, which shows its ability to generalize to more complex objects. Nevertheless, there are some difficulties when the robot performs regrasp on a part of the object that is different from the initial grasp, such as 83 switching from the round lid to the bottom part of the object, which is of a box form. In this case, the regrasp is incorrect for the new part of the object, i.e. the yaw adjustment is suboptimal for the box part of the object due to the round grasping surface (the lid) in the initial grasp. The reason is the lack of this data point in the previously encountered situations in the training dataset. During the experiments, we were able to observe many intuitive corrections made by the robot using the learned regrasping policy. The robot was able to identify if one of the fingers was only barely touching the object’s surface, causing the object to rotate in the hand. In this case, the regrasp resulted in either rotating or translating the gripper such that all of its fingers were firmly touching the object. Another noticeable trend learned through reinforcement learning was that the robot would regrasp the middle part of the object which was closer to the center of mass, hence, more stable for grasping. On the box object, the robot learned to change its grasp such that its fingers were aligned with the box’s sides. These results indicate that not only can the robot learn a set of linear regrasping policies for individual objects, but also that it can use them as the basis for guiding the generalized regrasping behavior. 5.6 Conclusions In this work, we presented a framework for learning regrasping behaviors based on tactile data. We trained a grasp stability predictor that uses spatio-temporal tactile features collected from the early-object-lifting phase to predict the grasp outcome with high accuracy. 
The trained predictor was used to supervise and provide feedback to a reinforcement learning algorithm that also uses spatio-temporal features extracted from a biomimetic tactile sensor to estimate the required grasp adjustments. In addition, we proposed a method that is able to learn complex high-dimensional policies by using examples from simple policies learned through reinforcement learning. In this manner, we were able to avoid requiring large amounts of data to learn complex policies. Instead, we employed supervised learning techniques to mimic various behaviors of simple policies. In order to test our approach, we collected over 50 hours of evaluation data on a real robot. We were able to achieve a 93% grasp prediction accuracy and improve grasp success rate of a single object from 41.8% to 97.1% by allowing up to 3 regrasps of the object. Our experiments indicate that the combined policy learned using our method is able to achieve better performance on each of the single objects than the respective linear policies learned using reinforcement learning specifically for these objects after one regrasp. Moreover, the general policy achieves approximately 80% success rate after one regrasp on a novel object that was not present during training. These results show that our supervised policy learning method applied to regrasping can generalize to more complex objects. 84 Part III Learned Representations via Deep Robot Learning 85 In this part of the thesis, we introduce three deep robot learning approaches that address different challenges of learning representations in robotics. The first approach presented in this part of the thesis addresses one of the main limitations of the current deep reinforcement learning approaches, i.e. their sample complexity. Next, we present an approach that tackles the problem of limited scalability of the current imitation learning methods that use deep representations. In the last chapter of this part of the thesis, we present a novel approach to deep reinforcement learning in robotics that addresses the challenge of representing and reusing previously learned skills to solve more complex tasks. The first two chapters in this part are based on our already published work (Chebotar, Hausman, et al., 2017) and (Hausman, Chebotar, Schaal, Sukhatme, & Lim, 2017), whereas the last chapter describes our work that is currently under review (Hausman, Springenberg, Wang, Heess, & Riedmiller, 2018). 86 Chapter 6 Combining Model-Based and Model-Free Updates for Deep Reinforcement Learning Reinforcement learning algorithms for real-world robotic applications must be able to handle com- plex, unknown dynamical systems while maintaining data-efficient learning. These requirements are handled well by model-free and model-based RL approaches, respectively. In this work, we aim to combine the advantages of these approaches. By focusing on time-varying linear-Gaussian policies, we enable a model-based algorithm based on the linear-quadratic regulator that can be integrated into the model-free framework of path integral policy improvement. We can further combine our method with guided policy search to train arbitrary parameterized policies such as deep neural networks. Our simulation and real-world experiments demonstrate that this method can solve challenging manipulation tasks with comparable or better performance than model-free methods while maintaining the sample efficiency of model-based methods. 
6.1 Introduction

Reinforcement learning (RL) aims to enable automatic acquisition of behavioral skills, which can be crucial for robots and other autonomous systems to behave intelligently in unstructured real-world environments. However, real-world applications of RL have to contend with two often opposing requirements: data-efficient learning and the ability to handle complex, unknown dynamical systems that might be difficult to model. Real-world physical systems, such as robots, are typically costly and time-consuming to run, making it highly desirable to learn using the lowest possible number of real-world trials. Model-based methods tend to excel at this (Deisenroth et al., 2013), but suffer from significant bias, since complex unknown dynamics cannot always be modeled accurately enough to produce effective policies. Model-free methods have the advantage of handling arbitrary dynamical systems with minimal bias, but tend to be substantially less sample-efficient (Kober, Bagnell, & Peters, 2013; Schulman, Levine, Moritz, Jordan, & Abbeel, 2015). Can we combine the efficiency of model-based algorithms with the final performance of model-free algorithms in a method that we can practically use on real-world physical systems?

As we will discuss in Section 6.2, many prior methods that combine model-free and model-based techniques achieve only modest gains in efficiency or performance (Heess, Wayne, Silver, Lillicrap, Tassa, & Erez, 2015; Gu, Lillicrap, Sutskever, & Levine, 2016). In this work, we aim to develop a method in the context of a specific policy representation: time-varying linear-Gaussian controllers. The structure of these policies provides us with an effective option for model-based updates via iterative linear-Gaussian dynamics fitting (Levine & Abbeel, 2014), as well as a simple option for model-free updates via the path integral policy improvement (PI²) algorithm (Theodorou, Buchli, & Schaal, 2010).

Figure 6.1: Real robot tasks used to evaluate our method. Left: the hockey task, which involves discontinuous dynamics. Right: the power plug task, which requires a high level of precision. Both of these tasks are learned from scratch without demonstrations.

Although time-varying linear-Gaussian (TVLG) policies are not as powerful as representations such as deep neural networks (Mnih et al., 2013; Lillicrap et al., 2016) or RBF networks (Deisenroth, Rasmussen, & Fox, 2011), they can represent arbitrary trajectories in continuous state-action spaces. Furthermore, prior work on guided policy search (GPS) has shown that TVLG policies can be used to train general-purpose parameterized policies, including deep neural network policies, for tasks involving complex sensory inputs such as vision (Levine & Abbeel, 2014; Levine, Finn, Darrell, & Abbeel, 2016b). This yields a general-purpose RL procedure with favorable stability and sample complexity compared to fully model-free deep RL methods (Montgomery, Ajay, Finn, Abbeel, & Levine, 2017). The main contribution of this work is a procedure for optimizing TVLG policies that integrates both fast model-based updates via iterative linear-Gaussian model fitting and corrective model-free updates via the PI² framework.
The resulting algorithm, which we call PILQR, combines the efficiency of model-based learning with the generality of model-free updates and can solve complex continuous control tasks that are infeasible for either linear-Gaussian models or PI 2 by itself, while remaining orders of magnitude more efficient than standard model-free RL. We integrate this approach into GPS to train deep neural network policies and present results both in simulation and on a real robotic platform. Our real-world results demonstrate that our method can learn complex tasks, such as hockey and power plug plugging (see Figure 6.1), each with less than an hour of experience and no user-provided demonstrations. 6.2 Related Work The choice of policy representation has often been a crucial component in the success of a RL procedure (Deisenroth et al., 2013; Kober et al., 2013). Trajectory-centric representations, such as splines (Peters & Schaal, 2008), dynamic movement primitives (Schaal, Peters, Nakanishi, & Ijspeert, 2003), and TVLG controllers (Lioutikov, Paraschos, Neumann, & Peters, 2014; Levine & Abbeel, 2014) have proven particularly popular in robotics, where they can be used to represent cyclical and episodic motions and are amenable to a range of efficient optimization algorithms. In this work, we build on prior work in trajectory-centric RL to devise an algorithm that is both sample- efficient and able to handle a wide class of tasks, all while not requiring human demonstration initialization. More general representations for policies, such as deep neural networks, have grown in popularity recently due to their ability to process complex sensory input (Mnih et al., 2013; Lillicrap et al., 2016; Levine et al., 2016b) and represent more complex strategies that can succeed from a variety of initial conditions (Schulman, Levine, Moritz, et al., 2015; Schulman et al., 2016). While trajectory-centric representations are more limited in their representational power, they can 89 be used as an intermediate step toward efficient training of general parameterized policies using the GPS framework (Levine et al., 2016b). Our proposed trajectory-centric RL method can also be combined with GPS to supervise the training of complex neural network policies. Our experiments demonstrate that this approach is several orders of magnitude more sample-efficient than direct model-free deep RL algorithms. Prior algorithms for optimizing trajectory-centric policies can be categorized as model-free methods (Theodorou et al., 2010; Peters et al., 2010), methods that use global models (Deisenroth, Fox, & Rasmussen, 2014; Y . Pan & Theodorou, 2014), and methods that use local models (Levine & Abbeel, 2014; Lioutikov et al., 2014; Akrour, Abdolmaleki, Abdulsamad, & Neumann, 2016). Model-based methods typically have the advantage of being fast and sample-efficient, at the cost of making simplifying assumptions about the problem structure such as smooth, locally linearizable dynamics or continuous cost functions. Model-free algorithms avoid these issues by not modeling the environment explicitly and instead improving the policy directly based on the returns, but this often comes at a cost in sample efficiency. 
Furthermore, many of the most popular model-free algorithms for trajectory-centric policies use example demonstrations to initialize the policies, since model-free methods require a large number of samples to make large, global changes to the behavior (Theodorou et al., 2010; Peters et al., 2010; Pastor, Hoffmann, Asfour, & Schaal, 2009).

Prior work has sought to combine model-based and model-free learning in several ways. (Farshidian, Neunert, & Buchli, 2014) also use LQR and PI², but do not combine these methods directly into one algorithm, instead using LQR to produce a good initialization for PI². Their work assumes the existence of a known model, while our method uses estimated local models. A number of prior methods have also looked at incorporating models to generate additional synthetic samples for model-free learning (R. Sutton, 1990; Gu et al., 2016), as well as using models to improve the accuracy of model-free value function backups (Heess, Wayne, Silver, Lillicrap, Tassa, & Erez, 2015). Our work directly combines model-based and model-free updates into a single trajectory-centric RL method without using synthetic samples that degrade with modeling errors.

6.3 Preliminaries

The goal of policy search methods is to optimize the parameters θ of a policy p(u_t | x_t), which defines a probability distribution over actions u_t conditioned on the system state x_t at each time step t of a task execution. Let τ = (x_1, u_1, ..., x_T, u_T) be a trajectory of states and actions. Given a cost function c(x_t, u_t), we define the trajectory cost as c(τ) = Σ_{t=1}^{T} c(x_t, u_t). The policy is optimized with respect to the expected cost of the policy

\[ J(\theta) = \mathbb{E}_{p}[c(\tau)] = \int c(\tau)\, p(\tau)\, d\tau, \]

where p(τ) is the policy trajectory distribution given the system dynamics p(x_{t+1} | x_t, u_t):

\[ p(\tau) = p(x_1) \prod_{t=1}^{T} p(x_{t+1} \,|\, x_t, u_t)\, p(u_t \,|\, x_t). \]

One policy class that allows us to employ an efficient model-based update is the TVLG controller p(u_t | x_t) = N(K_t x_t + k_t, Σ_t). In this section, we present the model-based and model-free algorithms that form the constituent parts of our hybrid method. The model-based method is an extension of a KL-constrained LQR algorithm (Levine & Abbeel, 2014), which we shall refer to as LQR with fitted linear models (LQR-FLM). The model-free method is a PI² algorithm with per-time-step KL-divergence constraints that is derived in previous work (Chebotar, Kalakrishnan, et al., 2017).

6.3.1 Model-Based Optimization of TVLG Policies

The model-based method we use is based on the iterative linear-quadratic regulator (iLQR) and builds on prior work (Levine & Abbeel, 2014; Tassa, Erez, & Todorov, 2012). We provide a full description and derivation in Appendix A.1.1. We use samples to fit a TVLG dynamics model p(x_{t+1} | x_t, u_t) = N(f_{x,t} x_t + f_{u,t} u_t, F_t) and assume a twice-differentiable cost function. As in (Tassa et al., 2012), we can compute a second-order Taylor approximation of our Q-function and optimize this with respect to u_t to find the optimal action at each time step t. To deal with unknown dynamics, (Levine & Abbeel, 2014) impose a KL-divergence constraint between the updated policy p^(i) and the previous policy p^(i−1) to stay within the space of trajectories where the dynamics model is approximately correct.
We similarly set up our optimization as

\[ \min_{p^{(i)}} \; \mathbb{E}_{p^{(i)}}[Q(x_t, u_t)] \quad \text{s.t.} \quad \mathbb{E}_{p^{(i)}}\!\left[ D_{\mathrm{KL}}\!\left( p^{(i)} \,\|\, p^{(i-1)} \right) \right] \le \epsilon_t. \quad (6.1) \]

The main difference from (Levine & Abbeel, 2014) is that we enforce separate KL constraints for each linear-Gaussian policy rather than a single constraint on the induced trajectory distribution (i.e., compare Eq. (6.1) to the first equation in Section 3.1 of (Levine & Abbeel, 2014)). LQR-FLM has substantial efficiency benefits over model-free algorithms. However, as our experimental results in Section 6.6 show, the performance of LQR-FLM is highly dependent on being able to model the system dynamics accurately, causing it to fail for more challenging tasks.

6.3.2 Policy Improvement with Path Integrals

PI² is a model-free RL algorithm based on stochastic optimal control. A detailed derivation of this method can be found in (Theodorou et al., 2010). Each iteration of PI² involves generating N trajectories by running the current policy. Let S(x_{i,t}, u_{i,t}) = c(x_{i,t}, u_{i,t}) + Σ_{j=t+1}^{T} c(x_{i,j}, u_{i,j}) be the cost-to-go of trajectory i ∈ {1, ..., N} starting in state x_{i,t} by performing action u_{i,t} and following the policy p(u_t | x_t) afterwards. Then, we can compute probabilities P(x_{i,t}, u_{i,t}) for each trajectory starting at time step t:

\[ P(x_{i,t}, u_{i,t}) = \frac{\exp\!\left( -\frac{1}{\eta_t} S(x_{i,t}, u_{i,t}) \right)}{\int \exp\!\left( -\frac{1}{\eta_t} S(x_{i,t}, u_{i,t}) \right) du_{i,t}}. \quad (6.2) \]

The probabilities follow from the Feynman-Kac theorem applied to stochastic optimal control (Theodorou et al., 2010). The intuition is that the trajectories with lower costs receive higher probabilities, and the policy distribution shifts towards a lower-cost trajectory region. The costs are scaled by η_t, which can be interpreted as the temperature of a soft-max distribution. This is similar to the dual variables η_t in LQR-FLM in that they control the KL step size; however, they are derived and computed differently. After computing the new probabilities P, we update the policy distribution by reweighting each sampled control u_{i,t} by P(x_{i,t}, u_{i,t}) and updating the policy parameters by a maximum-likelihood estimate (Chebotar, Kalakrishnan, et al., 2017).

To relate PI² updates to LQR-FLM optimization of a constrained objective, which is necessary for combining these methods, we can formulate the following theorem.

Theorem 1. The PI² update corresponds to a KL-constrained minimization of the expected cost-to-go S(x_t, u_t) = Σ_{j=t}^{T} c(x_j, u_j) at each time step t:

\[ \min_{p^{(i)}} \; \mathbb{E}_{p^{(i)}}[S(x_t, u_t)] \quad \text{s.t.} \quad \mathbb{E}_{p^{(i-1)}}\!\left[ D_{\mathrm{KL}}\!\left( p^{(i)} \,\|\, p^{(i-1)} \right) \right] \le \epsilon, \]

where ε is the maximum KL-divergence between the new policy p^(i)(u_t | x_t) and the old policy p^(i−1)(u_t | x_t).

Proof. The Lagrangian of this problem is given by

\[ \mathcal{L}\!\left( p^{(i)}, \eta_t \right) = \mathbb{E}_{p^{(i)}}[S(x_t, u_t)] + \eta_t \left( \mathbb{E}_{p^{(i-1)}}\!\left[ D_{\mathrm{KL}}\!\left( p^{(i)} \,\|\, p^{(i-1)} \right) \right] - \epsilon \right). \]

By minimizing the Lagrangian with respect to p^(i) we can find its relationship to p^(i−1) (see Appendix A.1.2), given by

\[ p^{(i)}(u_t \,|\, x_t) \propto p^{(i-1)}(u_t \,|\, x_t)\, \mathbb{E}_{p^{(i-1)}}\!\left[ \exp\!\left( -\frac{1}{\eta_t} S(x_t, u_t) \right) \right]. \quad (6.3) \]

This gives us an update rule for p^(i) that corresponds exactly to reweighting the controls from the previous policy p^(i−1) based on their probabilities P(x_t, u_t) described earlier. The temperature η_t now corresponds to the dual variable of the KL-divergence constraint. The temperature η_t can be estimated at each time step separately by optimizing the dual function

\[ g(\eta_t) = \eta_t \epsilon + \eta_t \log \mathbb{E}_{p^{(i-1)}}\!\left[ \exp\!\left( -\frac{1}{\eta_t} S(x_t, u_t) \right) \right], \quad (6.4) \]

with the derivation following from (Peters et al., 2010).
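A minimal sketch of this per-time-step PI² update: costs-to-go are turned into probabilities via a soft-max with temperature η_t, η_t is obtained by minimizing the dual g(η_t) in Eq. (6.4), and the new mean control is the probability-weighted average of the sampled controls. The covariance update and the refit of the time-varying feedback gains are omitted, and the bound on log η_t and the KL step ε = 1.0 are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pi2_timestep_update(controls, cost_to_go, epsilon=1.0):
    """One PI2 update at a single time step t.

    controls:    (N, du) sampled controls u_{i,t}
    cost_to_go:  (N,)    costs-to-go S(x_{i,t}, u_{i,t})
    epsilon:     KL bound used in the dual g(eta_t)
    Returns (new mean control, weights P(x_{i,t}, u_{i,t})).
    """
    S = cost_to_go - cost_to_go.min()        # shifting S only shifts the dual by a constant

    def dual(log_eta):                       # g(eta) = eta*eps + eta*log E[exp(-S/eta)]
        eta = np.exp(log_eta)
        return eta * epsilon + eta * np.log(np.mean(np.exp(-S / eta)))

    eta = np.exp(minimize_scalar(dual, bounds=(-5.0, 5.0), method="bounded").x)

    P = np.exp(-S / eta)
    P /= P.sum()                             # soft-max probabilities, cf. Eq. (6.2)
    u_new = P @ controls                     # reweighted (maximum-likelihood) mean control
    return u_new, P
```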
PI² was used by (Chebotar, Kalakrishnan, et al., 2017) to solve several challenging robotic tasks such as door opening and pick-and-place, where they achieved better final performance than LQR-FLM. However, due to its greater sample complexity, PI² required initialization from demonstrations.

6.4 Integrating Model-Based Updates into PI²

Both PI² and LQR-FLM can be used to learn TVLG policies and both have their strengths and weaknesses. In this section, we first show how the PI² update can be broken up into two parts, with one part using a model-based cost approximation and another part using the residual cost error after this approximation. Next, we describe our method for integrating model-based updates into PI² by using our extension of LQR-FLM to optimize the linear-quadratic cost approximation and performing a subsequent update with PI² on the residual cost. We demonstrate in Section 6.6 that our method combines the strengths of PI² and LQR-FLM while compensating for their weaknesses.

6.4.1 Two-Stage PI² update

To integrate a model-based optimization into PI², we can divide it into two steps. Given an approximation ĉ(x_t, u_t) of the real cost c(x_t, u_t) and the residual cost c̃(x_t, u_t) = c(x_t, u_t) − ĉ(x_t, u_t), let Ŝ_t = Ŝ(x_t, u_t) be the approximated cost-to-go of a trajectory starting with state x_t and action u_t, and S̃_t = S̃(x_t, u_t) be the residual of the real cost-to-go S(x_t, u_t) after approximation. We can rewrite the PI² policy update rule from Eq. (6.3) as

\[ p^{(i)}(u_t \,|\, x_t) \propto p^{(i-1)}(u_t \,|\, x_t)\, \mathbb{E}_{p^{(i-1)}}\!\left[ \exp\!\left( -\frac{1}{\eta_t} \left( \hat{S}_t + \tilde{S}_t \right) \right) \right] \propto \hat{p}(u_t \,|\, x_t)\, \mathbb{E}_{p^{(i-1)}}\!\left[ \exp\!\left( -\frac{1}{\eta_t} \tilde{S}_t \right) \right], \quad (6.5) \]

where p̂(u_t | x_t) is given by

\[ \hat{p}(u_t \,|\, x_t) \propto p^{(i-1)}(u_t \,|\, x_t)\, \mathbb{E}_{p^{(i-1)}}\!\left[ \exp\!\left( -\frac{1}{\eta_t} \hat{S}_t \right) \right]. \quad (6.6) \]

Hence, by decomposing the cost into its approximation and the residual approximation error, the PI² update can be split into two steps: (1) update using the approximated costs ĉ(x_t, u_t) and samples from the old policy p^(i−1)(u_t | x_t) to get p̂(u_t | x_t); (2) update p^(i)(u_t | x_t) using the residual costs c̃(x_t, u_t) and samples from p̂(u_t | x_t).

6.4.2 Model-Based Substitution with LQR-FLM

We can use Theorem (1) to rewrite Eq. (6.6) as a constrained optimization problem:

\[ \min_{\hat{p}} \; \mathbb{E}_{\hat{p}}\!\left[ \hat{S}(x_t, u_t) \right] \quad \text{s.t.} \quad \mathbb{E}_{p^{(i-1)}}\!\left[ D_{\mathrm{KL}}\!\left( \hat{p} \,\|\, p^{(i-1)} \right) \right] \le \epsilon. \]

Thus, the policy p̂(u_t | x_t) can be updated using any algorithm that can solve this optimization problem. By choosing a model-based approach for this, we can speed up the learning process significantly. Model-based methods are typically constrained to some particular cost approximation; however, PI² can accommodate any form of c̃(x_t, u_t) and thus will handle arbitrary cost residuals. LQR-FLM solves the type of constrained optimization problem in Eq. (6.1), which matches the optimization problem needed to obtain p̂, where the cost-to-go Ŝ is approximated with a quadratic cost and a linear-Gaussian dynamics model.* We can thus use LQR-FLM to perform our first update, which enables greater efficiency but is susceptible to modeling errors when the fitted local dynamics are not accurate, such as in discontinuous systems. We can use a PI² optimization on the residuals to correct for this bias.

6.4.3 Optimizing Cost Residuals with PI²

In order to perform a PI² update on the residual costs-to-go S̃, we need to know what Ŝ is for each sampled trajectory. That is, what is the cost-to-go that is actually used by LQR-FLM to make its update?
The structure of the algorithm implies a specific cost-to-go formulation for a given trajectory - namely, the sum of quadratic costs obtained by running the same policy under the TVLG dynamics used by LQR-FLM. A given trajectory can be viewed as being generated by a deterministic policy conditioned on a particular noise realization ε_{i,1}, ..., ε_{i,T}, with actions given by

\[ u_{i,t} = K_t x_{i,t} + k_t + \sqrt{\Sigma_t}\, \epsilon_{i,t}, \quad (6.7) \]

where K_t, k_t, and Σ_t are the parameters of p^(i−1). We can therefore evaluate Ŝ(x_t, u_t) by simulating this deterministic controller from (x_t, u_t) under the fitted TVLG dynamics and evaluating its time-varying quadratic cost, and then plugging these values into the residual cost.

In addition to the residual costs S̃ for each trajectory, the PI² update also requires control samples from the updated LQR-FLM policy p̂(u_t | x_t). Although we have the updated LQR-FLM policy, we only have samples from the old policy p^(i−1)(u_t | x_t). However, we can apply a form of the re-parametrization trick (D. Kingma & Welling, 2013) and again use the stored noise realization ε_{i,t} of each trajectory to evaluate what the control would have been for that sample under the LQR-FLM policy p̂. The expectation of the residual cost-to-go in Eq. (6.5) is taken with respect to the old policy distribution p^(i−1). Hence, we can reuse the states x_{i,t} and their corresponding noise ε_{i,t} that was sampled while rolling out the previous policy p^(i−1) and evaluate the new controls according to û_{i,t} = K̂_t x_{i,t} + k̂_t + √Σ̂_t ε_{i,t}. This linear transformation of the sampled controls provides unbiased samples from p̂(u_t | x_t). After transforming the control samples, they are reweighted according to their residual costs and plugged into the PI² update in Eq. (6.2).

6.4.4 Summary of PILQR algorithm

Algorithm 1 summarizes our method for combining LQR-FLM and PI² to create a hybrid model-based and model-free algorithm. After generating a set of trajectories by running the current policy (line 2), we fit TVLG dynamics and compute the quadratic cost approximation ĉ(x_t, u_t) and the approximation error residuals c̃(x_t, u_t) (lines 3, 4). In order to improve the convergence behavior of our algorithm, we adjust the KL-step ε_t of the LQR-FLM optimization in Eq. (6.1) based inversely on the proportion of the residual costs-to-go to the sampled costs-to-go (line 5). In particular, if the ratio between the residual and the overall cost is sufficiently small or large, we increase or decrease, respectively, the KL-step ε_t. We then continue with optimizing for the temperature η_t using the dual function from Eq. (6.4) (line 6). Finally, we perform an LQR-FLM update on the cost approximation (line 7) and a subsequent PI² update using the cost residuals (line 8). As PILQR combines LQR-FLM and PI² updates in sequence in each iteration, its computational complexity can be determined as the sum of both methods.

* In practice, we make a small modification to the problem in Eq. (6.1) so that the expectation in the constraint is evaluated with respect to the new distribution p̂(x_t) rather than the previous one p^(i−1)(x_t). This modification is heuristic and no longer aligns with Theorem (1), but works better in practice.

Figure 6.2: We evaluate on a set of simulated robotic manipulation tasks with varying difficulty. Left to right, the tasks involve pushing a block, reaching for a target, and opening a door in 3D.
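The two quantities needed for the residual PI² step of Sec. 6.4.3 can be computed from stored rollouts. A minimal sketch, assuming the fitted TVLG dynamics, a time-varying quadratic cost function, and the policy parameters before and after the LQR-FLM update are given; the Cholesky factor is used here as one possible matrix square root √Σ.

```python
import numpy as np

def approx_cost_to_go(x0, u0, t0, f_x, f_u, quad_cost, K, k, Sig, eps):
    """S_hat(x_t, u_t): roll the deterministic policy u = Kx + k + sqrt(Sig)*eps
    through the fitted TVLG dynamics (mean only in this sketch) and sum the
    quadratic cost approximation, as in Sec. 6.4.3."""
    T = len(f_x)
    x, u, S_hat = np.asarray(x0, float), np.asarray(u0, float), 0.0
    for t in range(t0, T):
        S_hat += quad_cost(t, x, u)
        x = f_x[t] @ x + f_u[t] @ u                      # fitted linear dynamics (mean)
        if t + 1 < T:
            u = K[t + 1] @ x + k[t + 1] + np.linalg.cholesky(Sig[t + 1]) @ eps[t + 1]
    return S_hat

def reparametrized_controls(x, eps, K_new, k_new, Sig_new):
    """Eq. (6.7) with the updated LQR-FLM parameters: reuse the stored state and
    noise to obtain an unbiased control sample from p_hat without new rollouts."""
    return K_new @ x + k_new + np.linalg.cholesky(Sig_new) @ eps
```

The residual cost-to-go used in the PI² step is then simply the sampled cost-to-go minus this approximation.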
Due to the properties of PI², the covariance of the optimized TVLG controllers decreases each iteration and the method eventually converges to a single solution.

6.5 Training Parametric Policies with GPS

PILQR offers an approach to perform trajectory optimization of TVLG policies. In this work, we employ mirror descent guided policy search (MDGPS) (Montgomery & Levine, 2016) in order to use PILQR to train parametric policies, such as neural networks. Instead of directly learning the parameters of a high-dimensional parametric or "global policy" with RL, we first learn simple TVLG policies, which we refer to as "local policies" p(u_t | x_t), for various initial conditions of the task. After optimizing the local policies, the optimized controls from these policies are used to create a training set for learning the global policy in a supervised manner. Hence, the final global policy generalizes across multiple local policies. Using the TVLG representation of the local policies makes it straightforward to incorporate PILQR into the MDGPS framework. Instead of constraining against the old local TVLG policy as in Theorem (1), each instance of the local policy is now constrained against the old global policy:

\[ \min_{p^{(i)}} \; \mathbb{E}_{p^{(i)}}[S(x_t, u_t)] \quad \text{s.t.} \quad \mathbb{E}_{p^{(i-1)}}\!\left[ D_{\mathrm{KL}}\!\left( p^{(i)} \,\|\, \pi^{(i-1)} \right) \right] \le \epsilon. \]

The two-stage update proceeds as described in Section 6.4.1, with the change that the LQR-FLM policy is now constrained against the old global policy π^(i−1).

Algorithm 1: PILQR algorithm
1: for iteration k ∈ {1, ..., K} do
2:   Generate trajectories D = {τ_i} by running the current linear-Gaussian policy p^(k−1)(u_t | x_t)
3:   Fit the TVLG dynamics p̂(x_{t+1} | x_t, u_t)
4:   Estimate the cost approximation ĉ(x_t, u_t) using the fitted dynamics and compute the cost residuals c̃(x_t, u_t) = c(x_t, u_t) − ĉ(x_t, u_t)
5:   Adjust the LQR-FLM KL step ε_t based on the ratio of the residual costs-to-go S̃ to the sampled costs-to-go S
6:   Compute η_t using the dual function from Eq. (6.4)
7:   Perform the LQR-FLM update to compute p̂(u_t | x_t): min_{p^(i)} E_{p^(i)}[Q(x_t, u_t)] s.t. E_{p^(i)}[D_KL(p^(i) ‖ p^(i−1))] ≤ ε_t
8:   Perform the PI² update using the cost residuals and the LQR-FLM actions to compute the new policy: p^(k)(u_t | x_t) ∝ p̂(u_t | x_t) E_{p^(i−1)}[exp(−S̃_t/η_t)]
9: end for

6.6 Experimental Evaluation

Our experiments aim to answer the following questions: (1) How does our method compare to other trajectory-centric and deep RL algorithms in terms of final performance and sample efficiency? (2) Can we utilize linear-Gaussian policies trained using PILQR to obtain robust neural network policies using MDGPS? (3) Is our proposed algorithm capable of learning complex manipulation skills on a real robotic platform? We study these questions through a set of simulated comparisons against prior methods, as well as real-world tasks using a PR2 robot. The performance of each method can be seen in our supplementary video.† Our focus in this work is specifically on robotics tasks that involve manipulation of objects, since such tasks often exhibit elements of continuous and discontinuous dynamics and require sample-efficient methods, making them challenging for both model-based and model-free methods.

6.6.1 Simulation Experiments

We evaluate our method on three simulated robotic manipulation tasks, depicted in Figure 6.2 and discussed below:

Gripper pusher. This task involves controlling a 4 DoF arm with a gripper to push a white block to a red goal area. The cost function is a weighted combination of the distance from the gripper to the block and from the block to the goal.

Reacher.
The reacher task from OpenAI gym (Brockman et al., 2016) requires moving the end of a 2 DoF arm to a target position. This task is included to provide comparisons against prior methods. The cost function is the distance from the end effector to the target. We modify the cost function slightly: the original task uses an $\ell_2$ norm, while we use a differentiable Huber-style loss, which is more typical for LQR-based methods (Tassa et al., 2012).

† https://sites.google.com/site/icml17pilqr

Figure 6.3: Average final distance from the block to the goal on one condition of the gripper pusher task. This condition is difficult due to the block being initialized far away from the gripper and the goal area, and only PILQR is able to succeed in reaching the block and pushing it toward the goal. Results for additional conditions are available in Appendix A.1.4, and the supplementary video demonstrates the final behavior of each learned policy.

Door opening. This task requires opening a door with a 6 DoF 3D arm. The arm must grasp the handle and pull the door to a target angle, which can be particularly challenging for model-based methods due to the complex contacts between the hand and the handle, and the fact that a contact must be established before the door can be opened. The cost function is a weighted combination of the distance of the end effector to the door handle and the angle of the door. Additional experimental setup details, including the exact cost functions, are provided in Appendix A.1.3.

We first compare PILQR to LQR-FLM and PI$^2$ on the gripper pusher and door opening tasks. Figure 6.3 details the performance of each method on the most difficult condition of the gripper pusher task. Both LQR-FLM and PI$^2$ perform significantly worse on the two more difficult conditions of this task. While PI$^2$ improves in performance as we provide more samples, LQR-FLM is bounded by its ability to model the dynamics, and thus predict the costs, at the moment when the gripper makes contact with the block. Our method solves all four conditions with 400 total episodes per condition and, as shown in the supplementary video, is able to learn a diverse set of successful behaviors including flicking, guiding, and hitting the block. On the door opening task, PILQR trains TVLG policies that succeed at opening the door from each of the four initial robot positions. While the policies trained with LQR-FLM are able to reach the handle, they fail to open the door.

Next, we evaluate neural network policies on the reacher task. Figure 6.4 shows results for MDGPS with each local policy method, as well as two prior deep RL methods that directly learn neural network policies: trust region policy optimization (TRPO) (Schulman, Levine, Moritz, et al., 2015) and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2016). MDGPS with LQR-FLM and MDGPS with PILQR perform competitively in terms of the final distance from the end effector to the target, which is unsurprising given the simplicity of the task, whereas MDGPS with PI$^2$ is again not able to make much progress. On the reacher task, DDPG and TRPO use 25 and 150 times more samples, respectively, to achieve approximately the same performance as MDGPS with LQR-FLM and PILQR.
For comparison, amongst previous deep RL algorithms that combined model-based and model-free methods, SVG and NAF with imagination rollouts reported using approximately up to five times fewer samples than DDPG on a similar reacher task (Heess, Wayne, Silver, Lillicrap, Tassa, & Erez, 2015; Gu et al., 2016). Thus we can expect that MDGPS with our method is about one order of magnitude more sample-efficient than SVG and NAF. While this is a rough approximation, it demonstrates a significant improvement in efficiency.

Finally, we compare the same methods for training neural network policies on the door opening task, shown in Figure 6.5. TRPO requires 20 times more samples than MDGPS with PILQR to learn a successful neural network policy. The other three methods were unable to learn a policy that opens the door despite extensive hyperparameter tuning. We provide additional simulation results in Appendix A.1.4.

Figure 6.4: Final distance from the reacher end effector to the target averaged across 300 random test conditions per iteration. MDGPS with LQR-FLM, MDGPS with PILQR, TRPO, and DDPG all perform competitively. However, as the log scale for the x axis shows, TRPO and DDPG require orders of magnitude more samples. MDGPS with PI$^2$ performs noticeably worse.

Figure 6.5: Minimum angle in radians of the door hinge (lower is better) averaged across 100 random test conditions per iteration. MDGPS with PILQR outperforms all other methods we compare against, with orders of magnitude fewer samples than DDPG and TRPO, which is the only other successful algorithm.

6.6.2 Real Robot Experiments

To evaluate our method on a real robotic platform, we use a PR2 robot (see Figure 6.1) to learn the following tasks:

Hockey. The hockey task requires using a stick to hit a puck into a goal 1.4 m away. The cost function consists of two parts: the distance between the current position of the stick and a target pose that is close to the puck, and the distance between the position of the puck and the goal. The puck is tracked using a motion capture system. Although the cost provides some shaping, this task presents a significant challenge due to the difference in outcomes based on whether or not the robot actually strikes the puck, making it challenging for prior methods, as we show below.

Power plug plugging. In this task, the robot must plug a power plug into an outlet. The cost function is the distance between the plug and a target location inside the outlet. This task requires fine manipulation to fully insert the plug.

Our TVLG policies consist of 100 time steps and we control our robot at a frequency of 20 Hz. For further details of the experimental setup, including the cost functions, we refer the reader to Appendix A.1.3. Both of these tasks have difficult, discontinuous dynamics at the contacts between the objects, and both require a high degree of precision to succeed. In contrast to prior works (Daniel, Neumann, Kroemer, & Peters, 2013) that use kinesthetic teaching to initialize a policy that is then finetuned with model-free methods, our method does not require any human demonstrations.

Figure 6.6: Single condition comparison of the hockey task performed on the real robot. Costs lower than the dotted line correspond to the puck entering the goal.

The policies are randomly initialized using a Gaussian distribution with zero mean. Such initialization does
not provide any information about the task to be performed. In all of the real robot experiments, policies are updated every 10 rollouts and the final policy is obtained after 20–25 iterations, which corresponds to mastering the skill with less than one hour of experience.

In the first set of experiments, we aim to learn a policy that is able to hit the puck into the goal for a single position of the goal and the puck. The results of this experiment are shown in Figure 6.6. In the case of the prior PI$^2$ method (Theodorou et al., 2010), the robot was not able to hit the puck. Since the puck position has the largest influence on the cost, the resulting learning curve shows little change in the cost over the course of training. The policy to move the arm towards the recorded arm position that enables hitting the puck turned out to be too challenging for PI$^2$ in the limited number of trials used for this experiment. In the case of LQR-FLM, the robot was able to occasionally hit the puck in different directions. However, the resulting policy could not capture the complex dynamics of the sliding puck or the discrete transition, and was unable to hit the puck toward the goal. The PILQR method was able to learn a robust policy that consistently hits the puck into the goal. Using the step adjustment rule described in Section 6.4.4, the algorithm would shift towards model-free updates from the PI$^2$ method as the TVLG approximation of the dynamics became less accurate. Using our method, the robot was able to get to the final position of the arm using fast model-based updates from LQR-FLM and learn the puck-hitting policy, which is difficult to model, by automatically shifting towards model-free PI$^2$ updates.

Figure 6.7: Experimental setup of the hockey task and the success rate of the final PILQR-MDGPS policy. Red and Blue: goal positions used for training, Green: new goal position.

In our second set of hockey experiments, we evaluate whether we can learn a neural network policy using the MDGPS-PILQR algorithm that can hit the puck into different goal locations. The goals were spaced 0.5 m apart (see Figure 6.7). The strategies for hitting the puck into different goal positions differ substantially, since the robot must adjust the arm pose to approach the puck from the right direction and aim toward the target. This makes it quite challenging to learn a single policy for this task. We performed 30 rollouts for three different positions of the goal (10 rollouts each), two of which were used during training. The neural network policy was able to hit the puck into the goal in 90% of the cases (see Figure 6.7). This shows that our method can learn high-dimensional neural network policies that generalize across various conditions.

Figure 6.8: Single condition comparison of the power plug task performed on the real robot. Note that costs above the dotted line correspond to executions that did not actually insert the plug into the socket. Only our method (PILQR) was able to consistently insert the plug all the way into the socket by the final iteration.

The results of the plug experiment are shown in Figure 6.8. PI$^2$ alone was unable to reach the socket. The LQR-FLM algorithm succeeded only 60% of the time at convergence. In contrast to the peg insertion-style tasks evaluated in prior work that used LQR-FLM (Levine, Wagener, & Abbeel, 2015b), this task requires very fine manipulation due to the small size of the plug. Our method was able to converge to a policy that plugged in the power plug on every rollout at convergence. The supplementary video‡ illustrates the final behaviors of each method for both the hockey and power plug tasks.
‡ https://sites.google.com/site/icml17pilqr

6.7 Discussion

We presented an algorithm that combines elements of model-free and model-based RL, with the aim of combining the sample efficiency of model-based methods with the ability of model-free methods to improve the policy even in situations where the model’s structural assumptions are violated. We show that a particular choice of policy representation – TVLG controllers – is amenable to fast optimization with model-based LQR-FLM and model-free PI$^2$ algorithms using sample-based updates. We propose a hybrid algorithm based on these two components, where the PI$^2$ update is performed on the residuals between the true sample-based cost and the cost estimated under the local linear models. This algorithm has a number of appealing properties: it naturally trades off between model-based and model-free updates based on the amount of model error, can easily be extended with a KL-divergence constraint for stable learning, and can be effectively used for real-world robotic learning. We further demonstrate that, although this algorithm is specific to TVLG policies, it can be integrated into the GPS framework in order to train arbitrary parameterized policies, including deep neural networks.

We evaluated our approach on a range of challenging simulated and real-world tasks. The results show that our method combines the efficiency of model-based learning with the ability of model-free methods to succeed on tasks with discontinuous dynamics and costs. We further illustrate in direct comparisons against state-of-the-art model-free deep RL methods that, when combined with the GPS framework, our method achieves substantially better sample efficiency.

Chapter 7
Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets

Imitation learning has traditionally been applied to learn a single task from demonstrations thereof. The requirement of structured and isolated demonstrations limits the scalability of imitation learning approaches, as they are difficult to apply to real-world scenarios where robots have to be able to execute a multitude of tasks. In this work, we propose a multi-modal imitation learning framework that is able to segment and imitate skills from unlabelled and unstructured demonstrations by learning skill segmentation and imitation learning jointly. The extensive simulation results indicate that our method can efficiently separate the demonstrations into individual skills and learn to imitate them using a single multi-modal policy. The video of our experiments is available at http://sites.google.com/view/nips17intentiongan.

7.1 Introduction

One of the key factors to enable deployment of robots in unstructured real-world environments is their ability to learn from data. In recent years, there have been multiple examples of robot learning frameworks that present promising results. These include: reinforcement learning (R. S.
Sutton & Barto, 1998) - where a robot learns a skill based on its interaction with the environment and imitation learning (Argall, Chernova, Veloso, & Browning, 2009; Billard, Calinon, Dillmann, & Schaal, 2008) - where a robot is presented with a demonstration of a skill that it should imitate. In this work, we focus on the latter learning setup. Traditionally, imitation learning has focused on using isolated demonstrations of a particular skill (Schaal, 1999). The demonstration is usually provided in the form of kinesthetic teaching, which requires the user to spend sufficient time to provide the right training data. This constrained setup for imitation learning is difficult to scale to real world scenarios, where robots have to be able to execute a combination of different skills. To learn these skills, the robots would require a large number of robot-tailored demonstrations, since at least one isolated demonstration has to be provided for every individual skill. In order to improve the scalability of imitation learning, we propose a framework that can learn to imitate skills from a set of unstructured and unlabeled demonstrations of various tasks. 105 As a motivating example, consider a highly unstructured data source, e.g. a video of a person cooking a meal. A complex activity, such as cooking, involves a set of simpler skills such as grasping, reaching, cutting, pouring, etc. In order to learn from such data, three components are required: i) the ability to map the image stream to state-action pairs that can be executed by a robot, ii) the ability to segment the data into simple skills, and iii) the ability to imitate each of the segmented skills. In this work, we tackle the latter two components, leaving the first one for future work. We believe that the capability proposed here of learning from unstructured, unlabeled demonstrations is an important step towards scalable robot learning systems. In this work, we present a novel imitation learning method that learns a multi-modal stochas- tic policy, which is able to imitate a number of automatically segmented tasks using a set of unstructured and unlabeled demonstrations. Our results indicate that the presented technique can separate the demonstrations into sensible individual skills and imitate these skills using a learned multi-modal policy. We show applications of the presented method to the tasks of skill segmentation, hierarchical reinforcement learning and multi-modal policy learning. 7.2 Related Work Imitation learning is concerned with learning skills from demonstrations. Approaches that are suitable for this setting can be split into two categories: i) behavioral cloning (Pomerleau, 1991), and ii) inverse reinforcement learning (IRL) (Ng & Russell, 2000). While behavioral cloning aims at replicating the demonstrations exactly, it suffers from the covariance shift (Ross & Bagnell, 2010). IRL alleviates this problem by learning a reward function that explains the behavior shown in the demonstrations. The majority of IRL works (Ho & Ermon, 2016; Ziebart, Maas, Bagnell, & Dey, 2008; Abbeel & Ng, 2004; Finn, Levine, & Abbeel, 2016; Levine, Popovic, & Koltun, 2011) introduce algorithms that can imitate a single skill from demonstrations thereof but they do not readily generalize to learning a multi-task policy from a set of unstructured demonstrations of various tasks. 
More recently, there has been work that tackles a problem similar to the one presented in this work, where the authors consider a setting where there is a large set of tasks with many instantiations (Duan et al., 2017). In their work, the authors assume a way of communicating a new task through a single demonstration. We follow the idea of segmenting and learning different skills jointly so that learning of one skill can accelerate learning to imitate the next skill. In our case, however, the goal is to separate the mix of expert demonstrations into single skills and learn a policy that can imitate all of them, which eliminates the need of new demonstrations at test time. The method presented here belongs to the field of multi-task inverse reinforcement learning. Examples from this field include (Dimitrakakis & Rothkopf, 2011) and (Babes, Marivate, Subra- manian, & Littman, 2011). In (Dimitrakakis & Rothkopf, 2011), the authors present a Bayesian approach to the problem, while the method in (Babes et al., 2011) is based on an EM approach that clusters observed demonstrations. Both of these methods show promising results on relatively low-dimensional problems, whereas our approach scales well to higher dimensional domains due to the representational power of neural networks. 106 There has also been a separate line of work on learning from demonstration, which is then iteratively improved through reinforcement learning (Kalakrishnan, Righetti, Pastor, & Schaal, 2011; Chebotar, Kalakrishnan, et al., 2016; M¨ ulling, Kober, Kroemer, & Peters, 2013). In contrast, we do not assume access to the expert reward function, which is required to perform reinforcement learning in the later stages of the above algorithms. There has been much work on the problem of skill segmentation and option discovery for hierarchical tasks. Examples include (Niekum, Chitta, Barto, Marthi, & Osentoski, 2013; Kroemer, Daniel, Neumann, Van Hoof, & Peters, 2015; Fox, Krishnan, Stoica, & Goldberg, 2017; Vezhnevets et al., 2017; Florensa, Duan, & Abbeel, 2017). In this work, we consider a possibility to discover different skills that can all start from the same initial state, as opposed to hierarchical reinforcement learning where the goal is to segment a task into a set of consecutive subtasks. We demonstrate, however, that our method may be used to discover the hierarchical structure of a task similarly to the hierarchical reinforcement learning approaches. In (Florensa et al., 2017), the authors explore similar ideas to discover useful skills. In this work, we apply some of these ideas to the imitation learning setup as opposed to the reinforcement learning scenario. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have enjoyed success in various domains including image generation (Denton, Chintala, & Fergus, 2015), image-image translation (Zhu, Park, Isola, & Efros, 2017; Kim, Cha, Kim, Lee, & Kim, 2017) and video prediction (Mathieu, Couprie, & LeCun, 2015). More recently, there have been works connecting GANs and other reinforcement learning and IRL methods (Pfau & Vinyals, 2016; Finn, Christiano, Abbeel, & Levine, 2016; Ho & Ermon, 2016). In this work, we expand on some of the ideas presented in these works and provide a novel framework that exploits this connection. The works that are most closely related to this work are (Ho & Ermon, 2016), (Chen et al., 2016) and (Y . Li, Song, & Ermon, 2017a). 
In (Chen et al., 2016), the authors show a method that is able to learn disentangled representations and apply it to the problem of image generation. In this work, we provide an alternative derivation of our method that extends their work and applies it to multi-modal policies. In (Ho & Ermon, 2016), the authors present an imitation learning GAN approach that serves as a basis for the development of our method. We provide an extensive evaluation of the hereby presented approach compared to the work in (Ho & Ermon, 2016), which shows that our method, as opposed to (Ho & Ermon, 2016), can handle unstructured demonstrations of different skills. A concurrent work (Y. Li et al., 2017a) introduces a method similar to ours and applies it to detecting driving styles from unlabelled human data.

7.3 Preliminaries

Let $M = (S, A, P, R, p_0, \gamma, T)$ be a finite-horizon Markov Decision Process (MDP), where $S$ and $A$ are state and action spaces, $P : S \times A \times S \to \mathbb{R}_+$ is a state-transition probability function or system dynamics, $R : S \times A \to \mathbb{R}$ a reward function, $p_0 : S \to \mathbb{R}_+$ an initial state distribution, $\gamma$ a reward discount factor, and $T$ a horizon. Let $\tau = (s_0, a_0, \dots, s_T, a_T)$ be a trajectory of states and actions and $R(\tau) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t)$ the trajectory reward. The goal of reinforcement learning methods is to find the parameters $\theta$ of a policy $\pi_\theta(a|s)$ that maximizes the expected discounted reward over trajectories induced by the policy: $E_{\pi_\theta}[R(\tau)]$, where $s_0 \sim p_0$, $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$ and $a_t \sim \pi_\theta(a_t|s_t)$.

In an imitation learning scenario, the reward function is unknown. However, we are given a set of demonstrated trajectories, which presumably originate from some optimal expert policy distribution $\pi_{E_1}$ that optimizes an unknown reward function $R_{E_1}$. Thus, by trying to estimate the reward function $R_{E_1}$ and optimizing the policy with respect to it, we can recover the expert policy. This approach is known as inverse reinforcement learning (IRL) (Abbeel & Ng, 2004). In order to model a variety of behaviors, it is beneficial to find a policy with the highest possible entropy that optimizes $R_{E_1}$. We will refer to this approach as maximum-entropy IRL (Ziebart et al., 2008) with the optimization objective

$\min_R \max_\pi \; H(\pi) + E_\pi[R(s,a)] - E_{\pi_{E_1}}[R(s,a)], \qquad (7.1)$

where $H(\pi)$ is the entropy of the policy $\pi$.

Ho and Ermon (Ho & Ermon, 2016) showed that it is possible to redefine the maximum-entropy IRL problem with multiple demonstrations sampled from a single expert policy $\pi_{E_1}$ as an optimization of GANs (Goodfellow et al., 2014). In this framework, the policy $\pi_\theta(a|s)$ plays the role of a generator, whose goal is to make it difficult for a discriminator network $D_w(s,a)$ (parameterized by $w$) to differentiate between imitated samples from $\pi_\theta$ (labeled 0) and demonstrated samples from $\pi_{E_1}$ (labeled 1). Accordingly, the joint optimization goal can be defined as

$\max_\theta \min_w \; E_{(s,a)\sim \pi_\theta}[\log(D_w(s,a))] + E_{(s,a)\sim \pi_{E_1}}[\log(1 - D_w(s,a))] + \lambda_H H(\pi_\theta). \qquad (7.2)$

The discriminator and the generator policy are both represented as neural networks and optimized by repeatedly performing alternating gradient updates. The discriminator is trained on the mixed set of expert and generator samples and outputs probabilities that a particular sample has originated from the generator or the expert policies. This serves as a reward signal for the generator policy that tries to maximize the probability of the discriminator confusing it with an expert policy.
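To make the alternating optimization concrete, the following PyTorch sketch shows a minimal discriminator $D_w(s,a)$ and one update on a mixed batch, together with the per-sample signal $\log D_w(s,a)$ that is handed to the policy optimizer as a reward. The network sizes, batch format, and numerical stabilization are illustrative assumptions, and the policy (TRPO) update itself is omitted.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D_w(s, a): probability that a state-action pair comes from the expert."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return torch.sigmoid(self.net(torch.cat([s, a], dim=-1)))

def discriminator_step(D, opt, s_gen, a_gen, s_exp, a_exp):
    """One alternating update: generator samples labeled 0, expert samples labeled 1."""
    d_gen, d_exp = D(s_gen, a_gen), D(s_exp, a_exp)
    # Inner minimization of Eq. (7.2) over w; in practice the equivalent binary
    # cross-entropy form is often preferred for numerical stability.
    loss = torch.log(d_gen + 1e-8).mean() + torch.log(1.0 - d_exp + 1e-8).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Reward signal for the generator policy: log D_w(s, a) on its own samples.
    with torch.no_grad():
        reward = torch.log(D(s_gen, a_gen) + 1e-8)
    return reward
```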
The generator can be trained using the trust region policy optimization (TRPO) algorithm (Schulman, Levine, Abbeel, Jordan, & Moritz, 2015) with the cost function $\log(D_w(s,a))$. At each iteration, TRPO takes the following gradient step:

$E_{(s,a)\sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, \log(D_w(s,a))\big] + \lambda_H \nabla_\theta H(\pi_\theta), \qquad (7.3)$

which corresponds to optimizing the objective in Eq. (7.2) with respect to the policy parameters $\theta$.

7.4 Multi-modal Imitation Learning

The traditional imitation learning scenario described in Sec. 7.3 considers the problem of learning to imitate one skill from demonstrations. The demonstrations represent samples from a single expert policy $\pi_{E_1}$. In this work, we focus on an imitation learning setup where we learn from unstructured and unlabelled demonstrations of various tasks. In this case, the demonstrations come from a set of expert policies $\pi_{E_1}, \pi_{E_2}, \dots, \pi_{E_k}$, where $k$ can be unknown, that optimize different reward functions/tasks. We will refer to this set of unstructured expert policies as a mixture of policies $\pi_E$. We aim to segment the demonstrations of these policies into separate tasks and learn a multi-modal policy that will be able to imitate all of the segmented tasks.

In order to be able to learn multi-modal policy distributions, we augment the policy input with a latent intention $i$ distributed according to a categorical or uniform distribution $p(i)$, similar to (Chen et al., 2016). The goal of the intention variable is to select a specific mode of the policy, which corresponds to one of the skills presented in the demonstrations. The resulting policy can be expressed as:

$\pi(a|s,i) = \frac{p(i|s,a)\,\pi(a|s)}{p(i)}. \qquad (7.4)$

We augment the trajectory to include the latent intention as $\tau_i = (s_0, a_0, i_0, \dots, s_T, a_T, i_T)$. The resulting reward of the trajectory with the latent intention is $R(\tau_i) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t, i_t)$, where $R(s,a,i)$ is a reward function that depends on the latent intention $i$, as we have multiple demonstrations that optimize different reward functions for different tasks. The expected discounted reward is equal to $E_\pi[R(\tau_i)] = \int R(\tau_i)\, p(\tau_i)\, d\tau_i$, where $p(\tau_i) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t, i_t)\, p(i_t)$.

Here, we show an extension of the derivation presented in (Ho & Ermon, 2016) (Eqs. (7.1, 7.2)) for a policy $\pi(a|s,i)$ augmented with the latent intention variable $i$, which uses demonstrations from a set of expert policies $\pi_E$, rather than a single expert policy $\pi_{E_1}$. We are aiming at maximum entropy policies whose behavior can be determined from the latent intention variable $i$. Accordingly, we transform the original IRL problem to reflect this goal:

$\min_R \max_\pi \; H(\pi(a|s)) - H(\pi(a|s,i)) + E_\pi[R(s,a,i)] - E_{\pi_E}[R(s,a,i)], \qquad (7.5)$

where $\pi(a|s) = \sum_i \pi(a|s,i)\, p(i)$, which results in the policy averaged over intentions (since $p(i)$ is constant). This goal reflects our objective: we aim to obtain a multi-modal policy that has a high entropy when no intention is given, but collapses to a particular task when the intention is specified. Analogously to the solution for a single expert policy, this optimization objective results in the optimization goal of the generative adversarial imitation learning network, with the exception that the state-action pairs $(s,a)$ are sampled from a set of expert policies $\pi_E$:

$\max_\theta \min_w \; E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(D_w(s,a))] + E_{(s,a)\sim \pi_E}[\log(1 - D_w(s,a))] + \lambda_H H(\pi_\theta(a|s)) - \lambda_I H(\pi_\theta(a|s,i)), \qquad (7.6)$

where $\lambda_I$, $\lambda_H$ correspond to the weighting parameters on the respective objectives.
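Before deriving the final objective, the sketch below (PyTorch, with illustrative layer sizes and names) shows how the latent intention enters the generator in practice: an intention sampled from $p(i)$ is one-hot encoded and concatenated with the state before being fed to the policy, and a separate intention-prediction network $p(i|s,a)$, whose role is made explicit in the derivation that follows, scores how recoverable the intention is from the resulting state-action pair. This is a schematic sketch, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class IntentionPolicy(nn.Module):
    """Gaussian policy pi(a | s, i) conditioned on a categorical latent intention."""
    def __init__(self, state_dim, action_dim, n_intentions, hidden=64):
        super().__init__()
        self.n_intentions = n_intentions
        self.body = nn.Sequential(
            nn.Linear(state_dim + n_intentions, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s, i_onehot):
        h = self.body(torch.cat([s, i_onehot], dim=-1))
        return Normal(self.mean(h), self.log_std.exp())

class IntentionPredictor(nn.Module):
    """p(i | s, a): classifies which intention produced a state-action pair."""
    def __init__(self, state_dim, action_dim, n_intentions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_intentions))

    def log_prob(self, s, a, i):
        logits = self.net(torch.cat([s, a], dim=-1))
        return Categorical(logits=logits).log_prob(i)

# Rollout-time usage: sample an intention per episode, keep it fixed, and feed its
# one-hot encoding to the policy at every step; the predictor's log-probability
# later enters the generator reward as the latent intention cost.
```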
The resulting entropy term $H(\pi_\theta(a|s,i))$ can be expressed as:

$H(\pi_\theta(a|s,i)) = E_{i\sim p(i),(s,a)\sim \pi_\theta}\big[-\log(\pi_\theta(a|s,i))\big] \qquad (7.7)$
$= -E_{i\sim p(i),(s,a)\sim \pi_\theta}\Big[\log \frac{p(i|s,a)\,\pi_\theta(a|s)}{p(i)}\Big]$
$= -E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i|s,a))] - E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log \pi_\theta(a|s)] + E_{i\sim p(i)}[\log p(i)]$
$= -E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i|s,a))] + H(\pi_\theta(a|s)) - H(i),$

which results in the final objective:

$\max_\theta \min_w \; E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(D_w(s,a))] + E_{(s,a)\sim \pi_E}[\log(1 - D_w(s,a))] \qquad (7.8)$
$+ (\lambda_H - \lambda_I) H(\pi_\theta(a|s)) + \lambda_I E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i|s,a))] + \lambda_I H(i),$

where $H(i)$ is a constant that does not influence the optimization. This results in the same optimization objective as for the single expert policy (see Eq. (7.2)) with an additional term $\lambda_I E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i|s,a))]$ responsible for rewarding state-action pairs that make the latent intention inference easier. We refer to this cost as the latent intention cost and represent $p(i|s,a)$ with a neural network. The final reward function for the generator is:

$E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(D_w(s,a))] + \lambda_I E_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i|s,a))] + \lambda_{H'} H(\pi_\theta(a|s)). \qquad (7.9)$

7.4.1 Relation to InfoGAN

In this section, we provide an alternative derivation of the optimization goal in Eq. (7.8) by extending the InfoGAN approach presented in (Chen et al., 2016). Following (Chen et al., 2016), we introduce the latent variable $c$ as a means to capture the semantic features of the data distribution. In this case, however, the latent variables are used in the imitation learning scenario, rather than the traditional GAN setup, which prevents us from using additional noise variables ($z$ in the InfoGAN approach) that are used as noise samples to generate the data from. Similarly to (Chen et al., 2016), to prevent collapsing to a single mode, the policy optimization objective is augmented with the mutual information $I(c; G(\pi_{\theta_c}, c))$ between the latent variable and the state-action pairs generator $G$ dependent on the policy distribution $\pi_{\theta_c}$. This encourages the policy to produce behaviors that are interpretable from the latent code, and given a larger number of possible latent code values leads to an increase in the diversity of policy behaviors. The corresponding generator goal can be expressed as:

$E_{c\sim p(c),(s,a)\sim \pi_{\theta_c}}[\log(D_w(s,a))] + \lambda_I I(c; G(\pi_{\theta_c}, c)) + \lambda_H H(\pi_{\theta_c}). \qquad (7.10)$

In order to compute $I(c; G(\pi_{\theta_c}, c))$, we follow the derivation from (Chen et al., 2016) that introduces a lower bound:

$I(c; G(\pi_{\theta_c}, c)) = H(c) - H(c\,|\,G(\pi_{\theta_c}, c)) \qquad (7.11)$
$= E_{(s,a)\sim G(\pi_{\theta_c}, c)}\big[E_{c'\sim P(c|s,a)}[\log P(c'|s,a)]\big] + H(c)$
$= E_{(s,a)\sim G(\pi_{\theta_c}, c)}\big[D_{KL}\big(P(\cdot|s,a)\,\|\,Q(\cdot|s,a)\big) + E_{c'\sim P(c|s,a)}[\log Q(c'|s,a)]\big] + H(c)$
$\ge E_{(s,a)\sim G(\pi_{\theta_c}, c)}\big[E_{c'\sim P(c|s,a)}[\log Q(c'|s,a)]\big] + H(c)$
$= E_{c\sim P(c),(s,a)\sim G(\pi_{\theta_c}, c)}[\log Q(c|s,a)] + H(c).$

By maximizing this lower bound we maximize $I(c; G(\pi_{\theta_c}, c))$. The auxiliary distribution $Q(c|s,a)$ can be parametrized by a neural network. The resulting optimization goal is

$\max_\theta \min_w \; E_{c\sim p(c),(s,a)\sim \pi_{\theta_c}}[\log(D_w(s,a))] + E_{(s,a)\sim \pi_E}[\log(1 - D_w(s,a))] \qquad (7.12)$
$+ \lambda_I E_{c\sim P(c),(s,a)\sim G(\pi_{\theta_c}, c)}[\log Q(c|s,a)] + \lambda_H H(\pi_{\theta_c}),$

which results in the generator reward function:

$E_{c\sim p(c),(s,a)\sim \pi_{\theta_c}}[\log(D_w(s,a))] + \lambda_I E_{c\sim P(c),(s,a)\sim G(\pi_{\theta_c}, c)}[\log Q(c|s,a)] + \lambda_H H(\pi_{\theta_c}). \qquad (7.13)$

This corresponds to the same objective that was derived in Section 7.4. The auxiliary distribution over the latent variables $Q(c|s,a)$ is analogous to the intention distribution $p(i|s,a)$.

7.5 Implementation

In this section, we discuss implementation details that can alleviate instability of the training procedure of our model. The first indicator that the training has become unstable is a high classification accuracy of the discriminator.
In this case, it is difficult for the generator to produce a meaningful policy as the reward signal from the discriminator is flat and the TRPO gradient of the generator vanishes. In an extreme case, the discriminator assigns all the generator samples to the same class and it is impossible for TRPO to provide a useful gradient as all generator samples receive the same reward. Previous work suggests several ways to avoid this behavior. These include leveraging the Wasserstein distance metric to improve the convergence behavior (Arjovsky, Chintala, & Bottou, 2017) and adding instance noise to the inputs of the discriminator to avoid degenerate generative distributions (Sønderby, Caballero, Theis, Shi, & Huszár, 2016). We find that adding the Gaussian noise helped us the most to control the performance of the discriminator and to produce a smooth reward signal for the generator policy. During our experiments, we anneal the noise similar to (Sønderby et al., 2016), as the generator policy improves towards the end of the training.

Figure 7.1: Left: Walker-2D running forwards, running backwards, jumping. Right: Humanoid running forwards, running backwards, balancing.

An important indicator that the generator policy distribution has collapsed to a uni-modal policy is a high or increasing loss of the intention-prediction network $p(i|s,a)$. This means that the prediction of the latent variable $i$ is difficult and, consequently, the policy behavior cannot be categorized into separate skills. Hence, the policy executes the same skill for different values of the latent variable. To prevent this, one can increase the weight of the latent intention cost $\lambda_I$ in the generator loss or add more instance noise to the discriminator, which makes its reward signal relatively weaker.

In this work, we employ both categorical and continuous latent variables to represent the latent intention. The advantage of using a continuous variable is that we do not have to specify the number of possible values in advance, as with the categorical variable, and it leaves more room for interpolation between different skills. We use a softmax layer to represent categorical latent variables, and use a uniform distribution for continuous latent variables as proposed in (Chen et al., 2016).

7.6 Experiments

Our experiments aim to answer the following questions: (1) Can we segment unstructured and unlabelled demonstrations into skills and learn a multi-modal policy that imitates them? (2) What is the influence of the introduced intention-prediction cost on the resulting policies? (3) Can we autonomously discover the number of skills presented in the demonstrations, and even accomplish them in different ways? (4) Does the presented method scale to high-dimensional policies? (5) Can we use the proposed method for learning hierarchical policies?

We evaluate our method on a series of challenging simulated robotics tasks described below. We would like to emphasize that the demonstrations consist of shuffled state-action pairs such that no temporal information or segmentation is used during learning. The performance of our method can be seen in our supplementary video.*

7.6.1 Task setup

Reacher The Reacher environment is depicted in Fig. 7.2 (left). The actuator is a 2-DoF arm attached at the center of the scene. There are several targets placed at random positions throughout the environment.
The goal of the task is, given a data set of reaching motions to random targets, to discover the dependency of the target selection on the intention and learn a policy that is capable of reaching different targets based on the specified intention input. We evaluate the performance of our framework on environments with 1, 2 and 4 targets. * http://sites.google.com/view/nips17intentiongan 112 Figure 7.2: Left: Reacher with 2 targets: random initial state, reaching one target, reaching another target. Right: Gripper-pusher: random initial state, grasping policy, pushing (when grasped) policy. Walker-2D The Walker-2D (Fig. 7.1 left) is a 6-DoF bipedal robot consisting of two legs and feet attached to a common base. The goal of this task is to learn a policy that can switch between three different behaviors dependent on the discovered intentions: running forward, running backward and jumping. We use TRPO to train single expert policies and create a combined data set of all three behaviors that is used to train a multi-modal policy using our imitation framework. Humanoid Humanoid (Fig. 7.1 right) is a high-dimensional robot with 17 degrees of freedom. Similar to Walker-2D the goal of the task is to be able to discover three different policies: running forward, running backward and balancing, from the combined expert demonstrations of all of them. Gripper-pusher This task involves controlling a 4-DoF arm with an actuated gripper to push a sliding block to a specified goal area (Fig. 7.2 right). We provide separate expert demonstrations of grasping the object, and pushing it towards the goal starting from the object already being inside the hand. The initial positions of the arm, block and the goal area are randomly sampled at the beginning of each episode. The goal of our framework is to discover both intentions and the hierarchical structure of the task from a combined set of demonstrations. 7.6.2 Multi-Target Imitation Learning Our goal here is to analyze the ability of our method to segment and imitate policies that perform the same task for different targets. To this end, we first evaluate the influence of the latent intention cost on the Reacher task with 2 and 4 targets. For both experiments, we use either a categorical intention distribution with the number of categories equal to the number of targets or a continuous, uniformly-distributed intention variable, which means that the network has to discover the number of intentions autonomously. Fig. 7.3 top shows the results of the reaching tasks using the latent intention cost for 2 and 4 targets with different latent intention distributions. For the continuous latent variable, we show a span of different intentions between -1 and 1 in the 0.2 intervals. The colors indicate the intention “value”. In the categorical distribution case, we are able to learn a multi-modal policy that can reach all the targets dependent on the given latent intention (Fig. 7.3-1 and Fig. 7.3-3 top). The continuous latent intention is able to discover two modes in case of two targets (Fig. 7.3-2 top) but it collapses to only two modes in the four targets case (Fig. 7.3-4 top) as this is a significantly more difficult task. As a baseline, we present the results of the Reacher task achieved by the standard GAN imitation learning presented in (Ho & Ermon, 2016) without the latent intention cost. The obtained results are presented in Fig. 7.3 bottom. 
Since the network is not encouraged to discover different skills through the intention learning cost, it collapses to a single target for 2 targets in both the continuous and discrete latent intention variables. In the case of 4 targets, the network collapses to 2 modes, which can be explained by the fact that even without the latent intention cost the imitation network tries to imitate most of the presented demonstrations. Since the demonstration set is very diverse in this case, the network learned two modes without the explicit instruction (latent intention cost) to do so.

Figure 7.3: Results of the imitation GAN with (top row) and without (bottom row) the latent intention cost. Left: Reacher with 2 targets (crosses): final positions of the reacher (circles) for categorical (1) and continuous (2) latent intention variable. Right: Reacher with 4 targets (crosses): final positions of the reacher (circles) for categorical (3) and continuous (4) latent intention variable.

Figure 7.4: Left: Rewards of different Reacher policies for 2 targets for different intention values over the training iterations with (1) and without (2) the latent intention cost. Right: Two examples of a heatmap for 1 target Reacher using two latent intentions each.

To demonstrate the development of different intentions, in Fig. 7.4 (left) we present the Reacher rewards over training iterations for different intention variables. When the latent intention cost is included (Fig. 7.4-1), the separation of different skills for different intentions starts to emerge around the 1000-th iteration and leads to a multi-modal policy that, given the intention value, consistently reaches the target associated with that intention. In the case of the standard imitation learning GAN setup (Fig. 7.4-2), the network learns how to imitate reaching only one of the targets for both intention values.

In order to analyze the ability to discover different ways to accomplish the same task, we use our framework with the categorical latent intention in the Reacher environment with a single target. Since we only have a single set of expert trajectories that reach the goal in one, consistent manner, we subsample the expert state-action pairs to ease the intention learning process for the generator. Fig. 7.4 (right) shows two examples of a heatmap of the visited end-effector states accumulated for two different values of the intention variable. For both cases, the task is executed correctly, the robot reaches the target, but it achieves it using different trajectories. These trajectories naturally emerged through the latent intention cost, as it encourages different behaviors for different latent intentions. It is worth noting that the presented behavior can also be replicated for multiple targets if the number of categories in the categorical distribution of the latent intention exceeds the number of targets.

7.6.3 Multi-Task Imitation Learning

We also seek to further understand whether our model extends to segmenting and imitating policies that perform different tasks. In particular, we evaluate whether our framework is able to learn a multi-modal policy on the Walker-2D task. We mix three different policies – running backwards, running forwards, and jumping – into one expert policy $\pi_E$ and try to recover all of them through our method. The results are depicted in Fig. 7.5 (top).
The additional latent intention cost results in a policy that is able to autonomously segment and mimic all three behaviors and achieve a similar performance to the expert policies (Fig. 7.5 top-left). Different intention variable values correspond to different expert policies: 0 - running forwards, 1 - jumping, and 2 - running backwards. The imitation learning GAN method is shown as a baseline in Fig. 7.5 (top-right). The results show that the policy collapses to a single mode, where all different intention variable values correspond to the jumping behavior, ignoring the demonstrations of the other two skills. To test if our multi-modal imitation learning framework scales to high-dimensional tasks, we evaluate it in the Humanoid environment. The expert policy is constructed using three expert policies: running backwards, running forwards, and balancing while standing upright. Fig. 7.5 (bottom) shows the rewards obtained for different values of the intention variable. Similarly to Walker-2D, the latent intention cost enables the neural network to segment the tasks and learn a multi-modal imitation policy. In this case, however, due to the high dimensionality of the task, the resulting policy is able to mimic running forwards and balancing policies almost as well as the experts, but it achieves a suboptimal performance on the running backwards task (Fig. 7.5 bottom-left). The imitation learning GAN baseline collapses to a uni-modal policy that maps all the intention values to a balancing behavior (Fig. 7.5 bottom-right). Finally, we evaluate the ability of our method to discover options in hierarchical IRL tasks. In order to test this, we collect expert policies in the Gripper-pusher environment that consist of grasping and pushing when the object is grasped demonstrations. The goal of this task is to check whether our method will be able to segment the mix of expert policies into separate grasping and pushing-when-grasped skills. Since the two sub-tasks start from different initial conditions, we cannot present the results in the same form as for the previous tasks. Instead, we present a time-lapse of the learned multi-modal policy (see Fig. 7.6) that presents the ability to change in the intention during the execution. The categorical intention variable is manually changed after the block is grasped. The intention change results in switching to a pushing policy that brings the block into the goal region. We present this setup as an example of extracting different options from the expert policies that can be further used in an hierarchical reinforcement learning task to learn the best switching strategy. 115 Figure 7.5: Top: Rewards of Walker-2D policies for different intention values over the training iterations with (left) and without (right) the latent intention cost. Bottom: Rewards of Humanoid policies for different intention values over the training iterations with (left) and without (right) the latent intention cost. Figure 7.6: Time-lapse of the learned Gripper-pusher policy. The intention variable is changed manually in the fifth screenshot, once the grasping policy has grasped the block. 116 7.7 Conclusions We present a novel imitation learning method that learns a multi-modal stochastic policy, which is able to imitate a number of automatically segmented tasks using a set of unstructured and unlabeled demonstrations. The presented approach learns the notion of intention and is able to perform different tasks based on the policy intention input. 
We evaluated our method on a set of simulation scenarios where we show that it is able to segment the demonstrations into different tasks and to learn a multi-modal policy that imitates all of the segmented skills. We also compared our method to a baseline approach that performs imitation learning without explicitly separating the tasks. 117 Chapter 8 Learning an Embedding Space for Transferable Robot Skills We present a method for reinforcement learning of closely related skills that are parameterized via a skill embedding space. We learn such skills by taking advantage of latent variables and exploiting a connection between reinforcement learning and variational inference. The main contribution of our work is an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data-efficient and robust off-policy gradient algorithm based on stochastic value gradients. We demonstrate the effectiveness of our method on several simulated robotic manipulation tasks. We find that our method allows for discovery of multiple solutions and is capable of learning the minimum number of distinct skills that are necessary to solve a given set of tasks. In addition, our results indicate that the hereby proposed technique can interpolate and/or sequence previously learned skills in order to accomplish more complex tasks, even in the presence of sparse rewards. 8.1 Introduction Recent years have seen great progress in methods for reinforcement learning with rich function approximators, aka “deep reinforcement learning” (DRL). In the field of robotics, DRL holds the promise of automatically learning flexible behaviors end-to-end while dealing with high- dimensional, multi-modal sensor streams (Arulkumaran, Deisenroth, Brundage, & Bharath, 2017). Among these successes, there has been substantial progress in algorithms for continuous action spaces, in terms of the complexity of systems that can be controlled as well as the data-efficiency and stability of the algorithms. Despite this recent progress, the predominant paradigm remains, however, to learn solutions from scratch for every task. Not only this is inefficient and constrains the difficulty of the tasks that can be solved, but also it limits the versatility and adaptivity of the systems that can be built. This is by no means a novel insight and there have been many attempts to address this issue (e.g. (Devin, Gupta, Darrell, Abbeel, & Levine, 2016; Rusu et al., 2016; Finn, Abbeel, & Levine, 2017; Teh et al., 2017)). Nevertheless, the effective discovery, representation, and reuse of skills remains an open research question. 118 We aim to take a step towards this goal. Our method learns manipulation skills that are continuously parameterized in an embedding space. We show how we can take advantage of these skills for rapidly solving new tasks, effectively by solving the control problem in the embedding space rather than the raw action space. To learn skills, we take advantage of latent variables - an important tool in the probabilistic modeling literature for discovering structure in data. The main contribution of our work is an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data- efficient and robust off-policy gradient algorithm based on stochastic value gradients. Our formulation draws on a connection between reinforcement learning and variational in- ference and is a principled and general scheme for learning hierarchical stochastic policies. 
We show how stochastic latent variables can be meaningfully incorporated into policies by treating them in the same way as auxiliary variables in parametric variational approximations in inference ((Salimans, Kingma, & Welling, 2014; Maaløe, Sønderby, Sønderby, & Winther, 2016; Ranganath, Tran, & Blei, 2016)). The resulting policies can model complex correlation structure and multi- modality in action space. We represent the skill embedding via such latent variables and find that this view naturally leads to an information-theoretic regularization which ensures that the learned skills are versatile and the embedding space is well formed. We demonstrate the effectiveness of our method on several simulated robotic manipulation tasks. We find that our method allows for the discovery of multiple solutions and is capable of learning the minimum number of distinct skills that are necessary to solve a given set of tasks. Our results indicate that the hereby proposed technique can interpolate and/or sequence previously learned skills in order to accomplish more complex tasks, even in the presence of sparse rewards. The video of our experiments is available at: https://goo.gl/FbvPGB. 8.2 Related Work The idea of concisely representing and re-using previously learned skills has been explored by a number of researchers, e.g. in the form of the Associative Skill Memories by (Pastor, Kalakrishnan, Righetti, & Schaal, 2012) or the meta-level priors for generalizing the relevance of features between different manipulation skills (Kroemer & Sukhatme, 2016). (Rueckert, Mundo, Paraschos, Peters, & Neumann, 2015) use the framework of probabilistic movement primitives to extract a lower-dimensional set of primitive control variables that allow effective reuse of primitives. Closely connected to this line of work, the framework of hierarchical relative entropy policy search (HiREPS) (Daniel, Neumann, Kroemer, & Peters, 2016) has been used to find hierarchical policies whose sub-policies represent individual skills (see e.g. (End, Akrour, Peters, & Neumann, 2017; Gabriel, Akrour, Peters, & Neumann, 2017)); while HiREPS also makes use of a latent variable - entropy regularized - RL formulation it uses hand-designed contextual policies (ignoring temporal structure within an episode) in an on-policy setting. Finally (Konidaris & Barto, 2007) use the options framework (R. S. Sutton, Precup, & Singh, 1999) to learn transferable options using the so-called agent-space. Inspired by these ideas, we introduce a skill embedding learning method that, by using modern DRL techniques, is able to concisely represent and reuse skills. 119 In the space of multi-task reinforcement learning with neural networks, (Teh et al., 2017) propose a framework that allows sharing of knowledge across tasks via a task agnostic prior. Similarly, (Cabi et al., 2017) make use of off-policy learning to learn about a large number of different tasks while following a main task. (Denil, Colmenarejo, Cabi, Saxton, & de Freitas, 2017) and (Devin et al., 2016) propose architectures that can be reconfigured to solve a variety of tasks, and (Finn et al., 2017) use meta-learning to acquire skills that can be fine-tuned effectively. Sequential learning and the need to retain previously learned skills has also been the focus of a number of researchers (e.g. (Kirkpatrick et al., 2017) and (Rusu et al., 2016)). In this work, we present a method that learns an explicit skill embedding space in a multi-task setting and is complementary to these works. 
Our formulation draws on a connection between entropy-regularized reinforcement learning and variational inference (VI) (e.g. (Todorov, 2008; Toussaint, 2009; Ziebart, 2010; RSS, 2012; Neumann, 2011; Levine & Koltun, 2013b; Fox, Pakman, & Tishby, 2016)). In particular, it considers formulations with auxiliary latent variables, a topic studied in the VI literature (e.g. (Barber & Agakov, 2003; Salimans et al., 2014; Ranganath et al., 2016; Maaløe et al., 2016)) but not fully explored in the context of RL. The notion of latent variables in policies has been explored e.g. by controllers (Heess et al., 2016) or options (Bacon, Harb, & Precup, 2017). Their main limitation is the lack of a principled approach to avoid a collapse of the latent distribution to a single mode. The auxiliary variable perspective introduces an information-theoretic regularizer that helps the inference model by producing more versatile behaviors. Learning versatile skills has been explored by (Haarnoja, Tang, Abbeel, & Levine, 2017) and (Schulman, Abbeel, & Chen, 2017). In particular, (Haarnoja et al., 2017) learns energy-based, maximum entropy policies via the soft Q-learning algorithm. Our approach similarly uses entropy-regularized reinforcement learning and latent variables but differs in the algorithmic framework. Similar hierarchical approaches have also been studied in the work combining RL with imitation learning (Wang et al., 2017; Merel et al., 2017).

The works that are most closely related to this work are (Florensa et al., 2017; Mohamed & Rezende, 2015; Gregor, Rezende, & Wierstra, 2016) and (Hausman et al., 2017; Y. Li, Song, & Ermon, 2017b). They use the same bound that arises in our treatment of the latent variables. (Hausman et al., 2017) uses it to learn structure from demonstrations, while (Mohamed & Rezende, 2015; Gregor et al., 2016) use mutual information as an intrinsic reward for option discovery. (Florensa et al., 2017) follows a similar paradigm of pre-training stochastic neural network policies, which are then used to learn a new task in an on-policy setup. This approach can be viewed as a special case of the method introduced in this work, where the skill embedding distribution is a fixed uniform distribution and an on-policy method is used to optimize the regularized objective. In contrast, our method is able to learn the skill embedding distributions, which enables interpolation between different skills as well as discovering the number of distinct skills necessary to accomplish a set of tasks. In addition, we extend our method to a more sample-efficient off-policy setup, which is important for potential applications of this method to real-world environments.

8.3 Preliminaries

We perform reinforcement learning in Markov decision processes (MDP). We denote with $s \in \mathbb{R}^S$ the continuous state of the agent; $a \in \mathbb{R}^A$ denotes the action vector and $p(s_{t+1} | s_t, a_t)$ the probability of transitioning to state $s_{t+1}$ when executing action $a_t$ in $s_t$. Actions are drawn from a policy distribution $\pi_\theta(a|s)$, with parameters $\theta$; in our case a Gaussian distribution whose mean and diagonal covariance are parameterized via a neural network. At every step the agent receives a scalar reward $r(s_t, a_t)$ and we consider the problem of maximizing the sum of discounted rewards $E_\pi\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$.

8.4 Learning Versatile Skills

Figure 8.1: Schematics of our approach.
We train the agent in a multi-task setup, where the task id is given as a one-hot input to the embedding network (bottom-left). The embedding network generates an embedding distribution that is sampled and concatenated with the current observation to serve as an input to the policy. After interaction with the environment, a segment of states is collected and fed into the inference network (bottom-right). The inference network is trained to classify what embedding vector the segment of states was generated from.

Before we introduce our method for learning a latent skill embedding space, it is instructive to identify the exact desiderata that we impose on the acquired skills (and thus the embedding space parameterizing them). A detailed explanation of how these goals align with recent trends in the literature is given in Section 8.2. As stated in the introduction, the general goal of our method is to re-use skills learned for an initial set of tasks to speed up – or in some cases even enable – learning difficult target tasks in a transfer learning setting. We are thus interested in the following properties for the initially learned skills:

i) generality: We desire an embedding space in which solutions to different, potentially orthogonal, tasks can be represented; i.e. tasks such as lifting a block or pushing it through an obstacle course should both be jointly learnable by our approach.

ii) versatility: We aim to learn a skill embedding space in which different embedding vectors that are “close” to each other in the embedding space correspond to distinct solutions to the same task.

iii) identifiability: Given the state and action trace of an executed skill, it should be possible to identify the embedding vector that gave rise to the solution. This property would allow us to re-purpose the embedding space for solving new tasks by picking a sequence of embedding vectors.

Intuitively, the properties i)-ii) of generality and versatility can be understood as: “we hope to cover as much of the skill embedding space as possible with different clusters of task solutions, within each of which multiple solutions to the same task are represented”. Property iii) intuitively helps us to: “derive a new skill by re-combining a diversified library of existing skills”.

8.4.1 Policy Learning via a Variational Bound on Entropy Regularized RL

To learn the skill embedding, we assume access to a set of initial tasks $\mathcal{T} = [1, \dots, T]$ with accompanying, per-task, reward functions $r_t(s,a)$, which could be comprised of different environments, variable robot dynamics, reward functions, etc. During training time, we provide access to the task id $t \in \mathcal{T}$ (indicating which task the agent is operating in) to our RL agent. In practice – to obtain data from all training tasks for learning – we draw a task and its id randomly from the set of tasks $\mathcal{T}$ at the beginning of each episode and execute the agent's current policy $\pi_\theta(a|s,t)$ in it. A conceptual diagram presenting our approach is depicted in Fig. 8.1.

For our policy to learn a diverse set of skills instead of just $T$ separate solutions (one per task), we endow it with a task-conditional latent variable $z$. With this latent variable, which we also refer to as the “skill embedding”, the policy is able to represent a distribution over skills for each task and to share these across tasks. In the simplest case, this latent variable could be resampled at every timestep and the state-task conditional policy would be defined as $\pi(a|s,t) = \int \pi(a|z,s,t)\, p(z|t)\, dz$. One simple choice would be to let $z \in \{1, \dots, K\}$, in which case the policy would correspond to a mixture of $K$ sub-policies. Introducing a latent variable facilitates the representation of several alternative solutions, but it does not mean that several alternative solutions will be learned. It is easy to see that the expected reward objective does not directly encourage such behavior. To achieve this, we formulate our objective as an entropy regularized RL problem, i.e. we maximize:

$\max_\theta \; E_{\pi, p_0, t\in\mathcal{T}}\Big[ \sum_{i=0}^{\infty} \gamma^i \big( r_t(s_i, a_i) + \alpha H[\pi_\theta(a_i | s_i, t)] \big) \,\Big|\, a_i \sim \pi_\theta(\cdot|s_i,t),\; s_{i+1} \sim p(s_{i+1}|a_i, s_i) \Big], \qquad (8.1)$

where $p_0(s_0)$ is the initial state distribution, $\alpha$ is a weighting term – trading the arbitrarily scaled reward against the entropy – and we can define $R(a,s,t) = E_\pi\big[\sum_{i=0}^{\infty} \gamma^i r_t(s_i, a_i) \,\big|\, s_0 = s,\; a_i \sim \pi(\cdot|s_i,t)\big]$ to denote the expected return for task $t$ (under policy $\pi$) when starting from state $s$ and taking action $a$. The entropy regularization term is defined as $H[\pi_\theta(a|s,t)] = E_\pi[-\log \pi_\theta(a|s,t)]$.
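As a concrete counterpart to Fig. 8.1 and the task-conditional policy $\pi(a|s,t) = \int \pi(a|z,s,t)\, p(z|t)\, dz$, the following PyTorch sketch shows an embedding network that maps a one-hot task id to a Gaussian embedding distribution and a policy that consumes the sampled embedding concatenated with the observation. Layer widths, the Gaussian parameterization, and all names are illustrative assumptions rather than the exact networks used in our experiments.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class EmbeddingNetwork(nn.Module):
    """p(z | t): maps a one-hot task id to a Gaussian over skill embeddings."""
    def __init__(self, num_tasks, z_dim, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(num_tasks, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def forward(self, task_onehot):
        h = self.body(task_onehot)
        return Normal(self.mean(h), self.log_std(h).exp())

class EmbeddingConditionedPolicy(nn.Module):
    """pi(a | s, z): Gaussian policy conditioned on observation and embedding."""
    def __init__(self, obs_dim, z_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, obs, z):
        h = self.body(torch.cat([obs, z], dim=-1))
        return Normal(self.mean(h), self.log_std.exp())

# Per-episode usage: sample z once from p(z | t) for the drawn task, then act with
# pi(a | s, z) for the whole episode; a separate inference network q(z | a, s^H)
# is trained to recover z from short state segments, as introduced below.
```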
One simple choice would be to let $z \in \{1, \dots, K\}$, in which case the policy would correspond to a mixture of $K$ sub-policies. Introducing a latent variable facilitates the representation of several alternative solutions, but it does not by itself mean that several alternative solutions will be learned. It is easy to see that the expected reward objective does not directly encourage such behavior. To achieve this, we formulate our objective as an entropy-regularized RL problem, i.e. we maximize

$$\max_{\pi}\; \mathbb{E}_{\pi, p_0, t \in \mathcal{T}} \Big[ \sum_{i=0}^{\infty} \gamma^i \big( r_t(s_i, a_i) + \alpha\, \mathcal{H}[\pi(a_i \mid s_i, t)] \big) \;\Big|\; a_i \sim \pi(\cdot \mid s_i, t),\; s_{i+1} \sim p(s_{i+1} \mid a_i, s_i) \Big], \quad (8.1)$$

where $p_0(s_0)$ is the initial state distribution, $\alpha$ is a weighting term – trading the arbitrarily scaled reward against the entropy – and we can define $R(a, s, t) = \mathbb{E}_\pi \big[ \sum_{i=0}^{\infty} \gamma^i r_t(s_i, a_i) \mid s_0 = s,\; a_i \sim \pi(\cdot \mid s, t) \big]$ to denote the expected return for task $t$ (under policy $\pi$) when starting from state $s$ and taking action $a$. The entropy regularization term is defined as $\mathcal{H}[\pi(a \mid s, t)] = \mathbb{E}_\pi[-\log \pi(a \mid s, t)]$.

It is worth noting that this is very similar to the "entropy regularization" conventionally applied in many policy gradient schemes (Williams, 1992; Mnih et al., 2016), but with the critical difference that it takes into account not just the entropy of the current but also of future actions. To apply this entropy regularization to our setting, i.e. in the presence of latent variables, extra machinery is necessary since the entropy term becomes intractable for most distributions of interest. Borrowing from the toolkit of variational inference and applying the bound from (Barber & Agakov, 2003), we can construct a lower bound on the entropy term from Equation (8.1) as (see Appendix A.2.1 for details):

$$\mathbb{E}_\pi[-\log \pi(a \mid s, t)] \;\geq\; \mathbb{E}_{\pi(a, z \mid s, t)} \Big[ \log \frac{q(z \mid a, s, t)}{\pi(a, z \mid s, t)} \Big] = -\mathbb{E}_{\pi(a \mid s, t)} \big[ \mathrm{CE}[\, p(z \mid a, s, t) \,\|\, q(z \mid a, s)\,] \big] + \mathcal{H}[p(z \mid t)] + \mathbb{E}_{p(z \mid t)} \mathcal{H}[\pi(a \mid s, z)], \quad (8.2)$$

where $q(z \mid a, s, t)$ is a variational inference distribution that we are free to choose, and $\mathrm{CE}$ denotes the cross entropy. Note that although $p(z \mid a, s, t)$ is intractable, a sample-based evaluation of the CE term is possible: $\mathbb{E}_{\pi(a \mid s, t)} \big[ \mathrm{CE}[\, p(z \mid a, s, t) \,\|\, q(z \mid a, s)\,] \big] = \mathbb{E}_{\pi(a, z \mid s, t)} \big[ -\log q(z \mid a, s) \big]$.

This bound holds for any $q$. We choose $q$ such that it complies with our desired property of identifiability (cf. Section 8.4): we avoid conditioning $q$ on the task id $t$ to ensure that a given trajectory alone will allow us to identify its embedding. The above variational bound is not only valid for a single state, but can also easily be extended to a short trajectory segment of states $s^H_i = [s_{i-H}, \dots, s_i]$, where $H$ is the segment length. We thus use the variational distribution $q_\psi(z \mid a, s^H_i)$, parameterized via a neural network with parameters $\psi$. We also represent the policy $\pi_\theta(a \mid s, z)$ and the embedding distribution $p_\phi(z \mid t)$ using neural networks – with parameters $\theta$ and $\phi$ – and refer to them as the policy and embedding networks respectively. The above formulation is for a single time-step; we describe a more general formulation in the Appendix (Section A.2.2).

The resulting bound meets the desiderata from Section 8.4: it maximizes the entropy of the embedding given the task, $\mathcal{H}(p(z \mid t))$, and the entropy of the policy conditioned on the embedding, $\mathbb{E}_{p(z \mid t)} \mathcal{H}(\pi(a \mid s, z))$ (thus aiming to cover the embedding space with different skill clusters). The negative CE encourages different embedding vectors $z$ to have different effects in terms of executed actions and visited states: intuitively, it will be high when we can predict $z$ from the resulting $a, s^H$.
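The following small NumPy sketch evaluates the three terms of the bound in Equation (8.2) by sampling, using toy diagonal-Gaussian stand-ins for the embedding, policy, and inference distributions; all distribution parameters and the stand-in inference model below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def diag_gauss_logpdf(x, mu, std):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(((x - mu) / std) ** 2 + 2 * np.log(std) + np.log(2 * np.pi), axis=-1)

def diag_gauss_entropy(std):
    """Closed-form entropy of a diagonal Gaussian."""
    return 0.5 * std.size * (1 + np.log(2 * np.pi)) + np.sum(np.log(std))

# Toy distributions standing in for the learned networks (illustrative values only).
mu_z, std_z = np.zeros(3), np.ones(3) * 0.5          # p(z | t)
policy_std = np.ones(7) * 0.2                         # std of pi(a | s, z)
q_std = np.ones(3) * 0.3                              # std of q(z | a, s)

N = 5000
z = mu_z + std_z * rng.normal(size=(N, 3))            # z ~ p(z | t)
a_mu = z @ rng.normal(size=(3, 7))                    # policy mean as a function of z
a = a_mu + policy_std * rng.normal(size=(N, 7))       # a ~ pi(a | s, z)

# Stand-in inference model: predict z linearly from a (a trained network in practice).
W = np.linalg.lstsq(a, z, rcond=None)[0]
q_mu = a @ W

# Sample-based evaluation of the three terms in Equation (8.2).
neg_cross_entropy = np.mean(diag_gauss_logpdf(z, q_mu, q_std))   # E[log q(z | a, s)]
embedding_entropy = diag_gauss_entropy(std_z)                     # H[p(z | t)]
policy_entropy = diag_gauss_entropy(policy_std)                   # H[pi(a | s, z)], constant here
lower_bound = neg_cross_entropy + embedding_entropy + policy_entropy
print("lower bound on H[pi(a | s, t)]:", lower_bound)
```

The first term is large when the (stand-in) inference model can recover $z$ from the behavior, which is exactly the pressure towards versatile, distinguishable skills described above.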
The first two terms in our bound also arise from the bound on the mutual information presented in (Florensa et al., 2017). We refer to the related work section for an in-depth discussion. We highlight that the above derivation also holds for the case where the task id is constant (or simply omitted), resulting in a bound for learning a latent embedding space that encourages the development of diverse solutions to a single task.

Inserting Equation (8.2) into our objective from Equation (8.1) yields the variational bound

$$\mathcal{L}(\theta, \phi, \psi) = \mathbb{E}_{\pi(a, z \mid s, t),\, t \in \mathcal{T}} \Big[ \sum_{i=0}^{\infty} \gamma^i\, \hat{r}(s_i, a_i, z, t) \;\Big|\; s_{i+1} \sim p(s_{i+1} \mid a_i, s_i) \Big] + \alpha_1\, \mathbb{E}_{t \in \mathcal{T}}\, \mathcal{H}[p_\phi(z \mid t)],$$

where

$$\hat{r}(s_i, a_i, z, t) = r_t(s_i, a_i) + \alpha_2 \log q_\psi(z \mid a_i, s^H_i) + \alpha_3\, \mathcal{H}[\pi_\theta(a \mid s_i, z)], \quad (8.3)$$

with split entropy weighting terms $\alpha = \alpha_1 + \alpha_2 + \alpha_3$. Note that $\mathbb{E}_{t \in \mathcal{T}}\, \mathcal{H}[p_\phi(z \mid t)]$ does not depend on the trajectory.

8.5 Learning an Embedding for Versatile Skills in an Off-Policy Setting

While the objective presented in Equation (8.3) could be optimized directly in an on-policy setting (similar to (Florensa et al., 2017)), our focus in this work is on obtaining a data-efficient, off-policy algorithm that could, conceivably, be applied to a real robotic system in the future. The bound presented in the previous section requires environment interaction to estimate the discounted sums presented in the first three terms of Equation (8.3). As we will show in the following, these terms can, however, also be estimated efficiently from previously gathered data by learning a Q-value function, yielding an off-policy algorithm. (From the perspective of variational inference, from which we are drawing inspiration in this work, such a Q-function can be interpreted as an amortized inference network estimating a log-likelihood term.)

We assume the availability of a replay buffer $\mathcal{B}$ (containing full trajectory execution traces including states, actions, task id and reward) that is incrementally filled during training (see the appendix for further details). In conjunction with the trajectory traces, we also store the probabilities of each selected action and denote them with the behavior policy probability $b(a \mid z, s, t)$, as well as the behavior probabilities of the embedding $b(z \mid t)$. Given this replay data, we formulate the off-policy perspective of our algorithm. We start with the notion of a lower-bound Q-function that depends on state $s$ and action $a$ and is conditioned on both the embedding $z$ and the task id $t$. It encapsulates all time-dependent terms from Equation (8.3) and can be recursively defined as:

$$Q^{\pi}(s_i, a_i; z, t) = \hat{r}(s_i, a_i, z, t) + \gamma\, \mathbb{E}_{p(s_{i+1} \mid a_i, s_i)} \big[ Q^{\pi}(s_{i+1}, a_{i+1}; z, t) \big]. \quad (8.4)$$

To learn a parametric representation of $Q^\pi_\varphi$, we turn to the standard tools for policy evaluation from the RL literature. Specifically, we make use of the recent Retrace algorithm from (Munos, Stepleton, Harutyunyan, & Bellemare, 2016), which allows us to quickly propagate entropy-augmented rewards across multiple time steps while – at the same time – minimizing the bias that any algorithm relying on a parametric Q-function is prone to. Formally, we fit $Q^\pi_\varphi$ by minimizing the squared loss

$$\min_{\varphi}\; \mathbb{E}_{\mathcal{B}} \Big[ \big( Q^\pi_\varphi(s_i, a_i; z, t) - Q^{\mathrm{ret}} \big)^2 \Big], \quad \text{with}$$
$$Q^{\mathrm{ret}} = \sum_{j=i}^{\infty} \gamma^{j-i} \Big( \prod_{k=i}^{j} c_k \Big) \Big[ \hat{r}(s_j, a_j, z, t) + \mathbb{E}_{\pi(a \mid z, s, t)} \big[ Q^{b}_{\varphi'}(s_j, \cdot; z, t) \big] - Q^{b}_{\varphi'}(s_j, a_j; z, t) \Big],$$
$$c_k = \min\Big( 1,\; \frac{\pi(a_k \mid z, s_k, t)\, p_\phi(z \mid t)}{b(a_k \mid z, s_k, t)\, b(z \mid t)} \Big), \quad (8.5)$$

where we compute the terms contained in $\hat{r}$ by using $r_t$ and $z$ from the replay buffer and re-compute the (cross-)entropy terms. Here, $\varphi'$ denotes the parameters of a target Q-network (Mnih et al., 2015) that we occasionally copy from the current estimate $\varphi$, and the $c_k$ are the per-step importance weights. (Note that $Q^b_{\varphi'}$ thus evaluates a different policy than the current policy, here denoted by $b$; nonetheless, by using importance weighting via $c_k$ we are guaranteed to obtain an unbiased estimator in the limit.) Further, we bootstrap the infinite sum after $N$ steps with $\mathbb{E}\big[Q_{\varphi'}(s_N, \cdot; z_N, t)\big]$ instead of introducing a $\lambda$ parameter as in the original paper (Munos et al., 2016).
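To make the target in Equation (8.5) concrete, below is a rough single-trajectory NumPy sketch of the truncated Retrace computation with $N$-step bootstrapping. The function signature, the way the log-probabilities of the action and the embedding are combined, and the exact bootstrapping convention are simplifying assumptions rather than a transcription of the implementation.

```python
import numpy as np

def retrace_target(r_hat, q_behavior, v_behavior, log_pi, log_b, gamma=0.99, n_bootstrap=None):
    """
    Simplified single-trajectory Retrace-style target for Q(s_i, a_i; z, t).

    r_hat       : (T,) entropy-augmented rewards r_hat(s_j, a_j, z, t)
    q_behavior  : (T,) target-network values Q_{phi'}(s_j, a_j; z, t)
    v_behavior  : (T,) expected target-network values E_pi[Q_{phi'}(s_j, .; z, t)]
    log_pi      : (T,) log pi(a_j | z, s_j, t) + log p_phi(z | t) under the current networks
    log_b       : (T,) log b(a_j | z, s_j, t) + log b(z | t) stored in the replay buffer
    """
    T = len(r_hat)
    N = T if n_bootstrap is None else min(n_bootstrap, T)
    c = np.minimum(1.0, np.exp(log_pi - log_b))      # truncated importance weights c_k
    target, weight = 0.0, 1.0
    for j in range(N):
        weight *= c[j]                               # running product of c_k for k = i..j
        target += (gamma ** j) * weight * (r_hat[j] + v_behavior[j] - q_behavior[j])
    if N < T:
        # Bootstrap the truncated sum with the expected target-network value at step N.
        target += (gamma ** N) * v_behavior[N]
    return target

# Toy usage with random replay data for a 10-step trajectory snippet.
rng = np.random.default_rng(2)
T = 10
print(retrace_target(rng.normal(size=T), rng.normal(size=T), rng.normal(size=T),
                     0.1 * rng.normal(size=T), 0.1 * rng.normal(size=T), n_bootstrap=5))
```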
Equipped with this Q-function, we can update the policy and embedding network parameters without requiring additional environment interaction (using only data from the replay buffer) by optimizing the following off-policy objective:

$$\hat{\mathcal{L}}(\theta, \phi) = \mathbb{E}_{\pi_\theta(a \mid z, s)\, p_\phi(z \mid t),\; s, t \in \mathcal{B}} \big[ Q^\pi_\varphi(s, a; z) \big] + \mathbb{E}_{t \in \mathcal{T}}\, \mathcal{H}[p_\phi(z \mid t)], \quad (8.6)$$

which can be readily obtained by inserting $Q^\pi_\varphi$ into Equation (8.3). To minimize this objective via gradient descent, we draw further inspiration from recent successes in variational inference and directly use the pathwise derivative of $Q^\pi_\varphi$ with respect to the network parameters by using the reparametrization trick (D. P. Kingma & Welling, 2013; Rezende, Mohamed, & Wierstra, 2014). This method has previously been adapted for off-policy RL in the framework of stochastic value gradient algorithms (Heess, Wayne, Silver, Lillicrap, Erez, & Tassa, 2015) and was found to yield low-variance estimates. For the inference network $q_\psi(z \mid a, s^H)$, minimizing Equation (8.3) amounts to supervised learning, maximizing

$$\hat{\mathcal{L}}(\psi) = \mathbb{E}_{\pi(a, z \mid s, t),\, t \in \mathcal{T}} \Big[ \sum_{i=0}^{\infty} \gamma^i \log q_\psi(z \mid a, s^H) \;\Big|\; s_{i+1} \sim p(s_{i+1} \mid a_i, s_i) \Big], \quad (8.7)$$

which requires sampling new trajectories to acquire target embeddings consistent with the current policy and embedding network. We found that simply re-using sampled trajectory snippets from the replay buffer works well empirically, allowing us to update all network parameters at the same time. Together with our choice for learning a Q-function, this results in a sample-efficient algorithm. We refer to Section A.2.3 in the appendix for the derivation of the stochastic value gradient of Equation (8.6).

8.6 Learning to Control the Previously-Learned Embedding

Once the skill embedding is learned using the described multi-task setup, we utilize it to learn a new skill. There are multiple possibilities for employing the skill embedding in such a scenario, including fine-tuning the entire policy or learning only a new mapping to the embedding space (modulating the lower-level policies). In this work, we decide to focus on the latter: to adapt to a new task we freeze the policy network and only learn a new state-embedding mapping $z = f(x)$ via a neural network $f$ with its own parameters. In other words, we only allow the network to learn how to modulate and interpolate between the already-learned skills, but we do not allow it to change the underlying policies.

8.7 Experimental Results

Our experiments aim to answer the following questions: (1) Can our method learn versatile skills? (2) Can it determine how many distinct skills are necessary to accomplish a set of tasks? (3) Can we use the learned skill embedding for control in an unseen scenario? (4) Is it important for the skills to be versatile to use their embedding for control? (5) Is it more efficient to use the learned embedding rather than to learn to solve a task from scratch?
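As a concrete illustration of the transfer setup described in Section 8.6, the sketch below wraps a frozen pre-trained skill policy with a small trainable network that maps observations to embedding vectors. The class and function names are illustrative, and the random-search update is only a stand-in for the off-policy optimization actually used; the key point it demonstrates is that only the state-to-embedding mapping is updated while the skill policy stays fixed.

```python
import numpy as np

rng = np.random.default_rng(4)

class EmbeddingController:
    """Transfer-time policy: a frozen skill policy pi(a | s, z) modulated by a new mapping z = f(s)."""

    def __init__(self, frozen_policy, obs_dim, z_dim):
        self.frozen_policy = frozen_policy            # pre-trained, never updated here
        self.W = np.zeros((obs_dim, z_dim))           # trainable state-to-embedding mapping
        self.b = np.zeros(z_dim)

    def embedding(self, obs, params=None):
        W, b = params if params is not None else (self.W, self.b)
        return np.tanh(obs @ W + b)

    def act(self, obs):
        return self.frozen_policy(obs, self.embedding(obs))

    def improve(self, episode_return_fn, sigma=0.05, lr=0.1, n_samples=20):
        """Stand-in black-box update: random-search ascent on episode return w.r.t. (W, b) only."""
        grads_W, grads_b = np.zeros_like(self.W), np.zeros_like(self.b)
        for _ in range(n_samples):
            dW = sigma * rng.normal(size=self.W.shape)
            db = sigma * rng.normal(size=self.b.shape)
            perturbed = lambda o: self.frozen_policy(o, self.embedding(o, (self.W + dW, self.b + db)))
            ret = episode_return_fn(perturbed)
            grads_W += ret * dW
            grads_b += ret * db
        self.W += lr * grads_W / (n_samples * sigma ** 2)
        self.b += lr * grads_b / (n_samples * sigma ** 2)

# Toy usage: a linear "frozen" skill policy and a dummy return function.
frozen = lambda obs, z: z[:2] - 0.1 * obs[:2]
ctrl = EmbeddingController(frozen, obs_dim=4, z_dim=3)
dummy_return = lambda policy: -np.linalg.norm(policy(np.ones(4)) - np.array([0.5, -0.5]))
for _ in range(10):
    ctrl.improve(dummy_return)
print(ctrl.act(np.ones(4)))
```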
We evaluate our approach in two domains in simulation: a point mass task to easily visualize different properties of our method, and a set of challenging robot manipulation tasks. Our implementation uses 16 asynchronous workers interacting with the environment, and synchronous updates utilizing the replay buffer data.

8.7.1 Didactic Example: Multi-Goal Point Mass Task with Sparse Rewards

Similarly to (Haarnoja et al., 2017), we present a set of didactic examples of multi-goal point mass tasks that demonstrate the variability of solutions that our method can discover. The first didactic example consists of a force-controlled point mass that is rewarded for being in a goal region. In order to learn the skill embedding, we use two tasks ($T = 2$), with the goals located either to the left or to the right of the initial location.

Fig. 8.2-bottom compares a set of trajectories produced by our method when conditioned on different Gaussian skill embedding samples, with and without the variational-inference-based regularization. That is, in the latter case, we remove the cross-entropy term from the reward and train the inference network to predict embedding vectors from observed trajectories in isolation. The cross-entropy term between the inference and embedding distributions introduces more variety into the obtained trajectories, which can be explained by the agent's incentive to help the inference network. Fig. 8.2-top presents the absolute error between the actual and the inferred skill embedding for both tasks. It is apparent that the trajectories generated with regularization display more variability and are therefore easily distinguishable. This means that the inference network is able to infer the skill embedding more accurately compared to the setup without the regularization term. The constant residual error shown in the top-left part of the figure corresponds to the fact that the inference network without regularization can only predict the mean of the embedding used for generating the trajectories.

The second didactic example also consists of a point mass that is rewarded for being in a goal region. However, in this experiment, we consider a case where there are four goals located around the initial location (see Fig. 8.3, left and middle) and each of them is equally important (the agent obtains the same reward at each location) for a single task ($T = 1$). This leads to a situation where there exist multiple optimal policies for a single task. In addition, this task is challenging due to the sparsity of the rewards – as soon as one solution is discovered, it is difficult to keep exploring other goals. Due to these challenges, most existing DRL approaches would be content with finding a single solution. Furthermore, even if a standard policy gradient approach discovered multiple goals, it would have no incentive to represent multiple solutions. For this experiment, we consider both a Gaussian embedding space and a multivariate Bernoulli distribution (which we expect to be more likely to capture the multi-modality of the solutions). The left and middle parts of Fig. 8.3 present the versatility of the solutions when using the multivariate Bernoulli (left) and Gaussian (middle) embedding. The multivariate Bernoulli distribution is able to discover all four solutions, whereas the Gaussian embedding focuses on discovering different trajectories that lead to only two of the goals.
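A minimal version of the sparse-reward multi-goal point mass described above could look as follows; the goal radius, force limits, and integration constants are illustrative guesses rather than the values used in our experiments.

```python
import numpy as np

class MultiGoalPointMass:
    """Force-controlled 2D point mass with a sparse reward inside circular goal regions."""

    def __init__(self, goals, goal_radius=0.1, dt=0.05, mass=1.0):
        self.goals = np.asarray(goals, dtype=float)   # (G, 2) goal locations
        self.goal_radius = goal_radius
        self.dt, self.mass = dt, mass
        self.reset()

    def reset(self):
        self.pos = np.zeros(2)
        self.vel = np.zeros(2)
        return np.concatenate([self.pos, self.vel])

    def step(self, force):
        force = np.clip(force, -1.0, 1.0)
        self.vel += (force / self.mass) * self.dt
        self.pos += self.vel * self.dt
        # Sparse reward: 1 inside any goal region, 0 everywhere else.
        dists = np.linalg.norm(self.goals - self.pos, axis=1)
        reward = float(np.any(dists < self.goal_radius))
        return np.concatenate([self.pos, self.vel]), reward

# Four equally rewarded goals around the initial location (single task, T = 1).
env = MultiGoalPointMass(goals=[[1, 0], [-1, 0], [0, 1], [0, -1]])
obs = env.reset()
obs, r = env.step(np.array([1.0, 0.0]))
print(obs, r)
```

Because every goal yields exactly the same reward, nothing in the return distinguishes the four solutions; only the versatility terms in the objective give the agent a reason to represent all of them.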
Figure 8.2: Bottom: resulting trajectories for different 3D embedding values with (right) and without (left) variational-inference-based regularization. The contours depict the reward gained by the agent. Top: absolute error between the mean embedding value predicted by the inference network and the actual mean of the embedding used to generate these trajectories. Note that every error curve at the top corresponds to a single trajectory at the bottom.

In order to evaluate whether our method can determine the number of distinct skills that are necessary to accomplish a set of tasks, we conduct the following experiment. We set the number of tasks to four ($T = 4$), but we set two of the tasks to be exactly the same ($t = 1$ and $t = 3$). Next, we use our method to learn skill embeddings and evaluate how many distinct embeddings it learns. The results in Fig. 8.3-right show the KL divergence between learned embedding distributions over training iterations. One can observe that the embedding network is able to discover that task 1 and task 3 can be represented by the same skill embedding, resulting in the KL divergence between these embedding distributions being close to zero ($\mathrm{KL}(p(z \mid t_1) \,\|\, p(z \mid t_3)) \approx 0$). This indicates that the embedding network is able to discover the number of distinct skills necessary to accomplish a set of tasks.

Figure 8.3: Left, middle: resulting trajectories that were generated by different distributions used for the skill-embedding space: multivariate Bernoulli (left), Gaussian (middle). The contours depict the reward gained by the agent. Note that there is no reward outside the goal region. Right: KL divergence between the embedding distributions produced by task 1 and the other three tasks. Tasks 1 and 3 have different task ids but are exactly the same task. Our method is able to discover that tasks 1 and 3 can be covered by the same embedding, which corresponds to the minimal KL divergence between their embeddings.
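Since the per-task embedding distributions are diagonal Gaussians, the KL divergence used in this analysis has a simple closed form. The sketch below computes it for made-up embedding parameters in which tasks 1 and 3 coincide; the numbers are purely illustrative.

```python
import numpy as np

def kl_diag_gaussians(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2)) ) in closed form."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Toy embedding distributions p(z | t) for four tasks (illustrative values only):
# tasks 1 and 3 are intended to be the same task, so their embeddings should coincide.
embeddings = {
    1: (np.array([ 1.0,  0.2, -0.3]), np.array([0.3, 0.3, 0.3])),
    2: (np.array([-0.8,  0.9,  0.1]), np.array([0.2, 0.4, 0.3])),
    3: (np.array([ 1.0,  0.2, -0.3]), np.array([0.3, 0.3, 0.3])),
    4: (np.array([ 0.1, -1.1,  0.7]), np.array([0.5, 0.2, 0.3])),
}

mu1, std1 = embeddings[1]
for t in (2, 3, 4):
    mu_t, std_t = embeddings[t]
    print(f"KL(p(z|t=1) || p(z|t={t})) = {kl_diag_gaussians(mu1, std1, mu_t, std_t):.3f}")
```

A near-zero value for the pair (1, 3) and clearly positive values for the other pairs is exactly the signature reported in Fig. 8.3-right.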
8.7.2 Control of the Skill Embedding for Manipulation Tasks

Next, we evaluate whether it is possible to use the learned skill embedding for control in an unseen scenario. The video of our experiments is available at: https://goo.gl/FbvPGB. We do so by using three simulated robotic manipulation tasks depicted in Fig. 8.4 and described below.

Spring-wall. A robotic arm is tasked to bring a block to a goal. The block is attached to a spring that is anchored to the ground at the initial block location. In addition, there is a short wall between the target and the initial location of the block, requiring the optimal behavior to pull the block around the wall and hold it at the goal location. The skill embedding space used for learning this skill was learned on two tasks related to this target task: bringing a block attached to a spring to a goal location (without a wall in between) and bringing a block to a goal location with a wall in between (without the spring). In order to successfully learn the new spring-wall skill, the skill embedding space has to be able to interpolate between the skills it was originally trained on.

L-wall. The task is to bring a block to a goal that is surrounded by an L-shaped wall (see Fig. 8.4). The robot needs to learn how to push the block around the L-shaped wall to get to the target location. The skill embedding space used for learning this skill was learned on two tasks: push a block to a goal location (without the L-shaped wall) and lift a block to a certain height. The block was randomly spawned on a ring around the goal location, which is in the center of the workspace. The purpose of this task is to demonstrate that intuitively unrelated skills used for learning the skill embedding can indeed be beneficial for learning a new task.

Rail-push. The robot is tasked to first lift the block along the side of a white table that is firmly attached to the ground and then to push it towards the center of the table. The initial lifting motion of the block is constrained as if the block was attached to a pole (or an upwards-facing rail). This attachment is removed once the block reaches the height of the table. The skill embedding space was learned using two tasks: lift up the block attached on a rail (without the table in the scene) and push a block initialized on top of the table to its center. This task aims to demonstrate the ability to sequence two different skills together to accomplish a new task.

Figure 8.4: Left: visualization of the sequence of manipulation tasks we consider. Top row: spring-wall, middle row: L-wall, bottom row: rail-push. The left two columns depict the two initial skills that are learned jointly; the rightmost column (in the left part of the figure) depicts the transfer task that should be solved using the previously acquired skills. Right: trajectories of the block in the plane as manipulated by the robot. The trajectories are produced by sampling a random embedding vector trained with (red) and without (black) the inference network from the marginal distribution over the L-wall pre-training tasks every 50 steps and following the policy. Dots denote points at which the block was lifted.

The spring-wall and L-wall tasks are performed in a setting with sparse rewards (where the only reward the robot can obtain is tied to the box being inside a small region near a target location), making them very challenging exploration problems. In contrast, the rail-push task (due to its sequential nature as well as the fact that the table acts as an obstacle) uses minor reward shaping (where we additionally reward the robot based on the distance of the box to the center of the table).

Fig. 8.5 shows the comparison between our method and various baselines: i) learning the transfer task from scratch, ii) learning the mapping between states and the task id ($t$) directly, without a stochastic skill-embedding space, iii) learning the task by controlling the skill embedding that was trained without variational-inference-based regularization (no inference net).
In the spring-wall task, our approach has an advantage especially in the initial stages of training, but the baseline without the inference network (no-KL in the plot) is able to achieve similar asymptotic performance. This indicates that this task does not require versatile skills: it is sufficient to find an embedding in between the two skills that is able to successfully interpolate between them. It is worth noting that the remaining baselines are not able to solve this task.

For the more challenging L-wall task, our method is considerably more successful than all the baselines. This task is particularly challenging because of the set of skills that it was pre-trained on (lift the block and push the block towards the center). The agent has to discover an embedding that allows the robot to push the block along the edge of the white container – a behavior that is not directly required in any of the pre-training tasks. However, as it turns out, many successful policies for solving the lift task push the block against the wall of the container in order to perform a scooping motion. The agent is able to discover such a skill embedding and utilize it to push the block around the L-shaped wall.

In order to investigate why the baselines are not able to find a solution to the L-wall task, we explore the embedding space produced by our method as well as by the no-inference-network baseline. In particular, we sample a random embedding vector from the marginal embedding distribution over tasks and keep it constant to generate a behavior. The resulting trajectories of the block are visualized in Fig. 8.4-right. One can observe that the additional regularization causes the block trajectories to be much more versatile, which makes it easier to discover a working embedding for the L-wall task.

The last task consists of a rail that the agent uses to lift the block along the wall of the table. It is worth noting that the rail task used for initial training of the rail-lift skill does not include the table. For the transfer task, however, we require the agent to find a skill embedding that is able to lift the block in such a way that the arm is not in collision with the table, even though it has only encountered the table in the on-table manipulation task. As shown in the rightmost plot of Fig. 8.5, such an embedding is only discovered using our method, which uses variational-inference-based regularization to diversify the skills during pre-training. This indicates that, due to the versatility of the learned skills, the agent is able to discover an embedding that avoids the collision with the previously unseen table and accomplishes the task successfully.

In summary, our algorithm is able to solve all the tasks due to having access to better exploration policies that were encoded in the skill embedding space. The consecutive images of the final policies for all three tasks are presented in Fig. 8.6.

Figure 8.5: Comparison of our method against different training strategies (ours, no inference net, from scratch, task selection) for our manipulation tasks: spring-wall, L-wall, and rail-push. Curves show the average reward over 10 episodes as a function of training episodes (x 16 workers).
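The qualitative analysis above (sampling a random embedding from the marginal distribution over the pre-training tasks every 50 steps and holding it fixed while following the policy) can be sketched as follows; the per-task embedding parameters and the dummy policy and dynamics are placeholders for the learned components.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative per-task embedding parameters for the two L-wall pre-training tasks.
task_means = [np.array([0.9, -0.1, 0.4]), np.array([-0.7, 0.6, 0.0])]
task_stds = [np.array([0.3, 0.3, 0.3]), np.array([0.2, 0.4, 0.3])]

def sample_marginal_embedding():
    """Draw z from the marginal over tasks: pick a task uniformly, then z ~ p(z | t)."""
    t = rng.integers(len(task_means))
    return task_means[t] + task_stds[t] * rng.normal(size=task_means[t].shape)

def rollout(policy, env_step, obs, steps=200, resample_every=50):
    """Hold a sampled embedding fixed for `resample_every` steps while following the policy."""
    trajectory = []
    for i in range(steps):
        if i % resample_every == 0:
            z = sample_marginal_embedding()
        action = policy(obs, z)
        obs = env_step(action)
        trajectory.append(np.asarray(obs).copy())
    return np.array(trajectory)

# Dummy linear policy and trivial dynamics so the sketch runs end to end.
dummy_policy = lambda obs, z: 0.1 * z[:2] - 0.05 * obs
dummy_step = lambda action: action           # stand-in dynamics: next obs equals the action
traj = rollout(dummy_policy, dummy_step, obs=np.zeros(2))
print(traj.shape)
```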
Figure 8.6: Final policy for all three tasks: spring-wall (top), L-wall (middle), rail-push (bottom).

8.8 Conclusions

We presented a method that learns manipulation skills that are continuously parameterized in a skill embedding space, and takes advantage of these skills by solving a new control problem in the embedding space rather than the raw action space. The skills are learned by taking advantage of latent variables and exploiting a connection between reinforcement learning and variational inference. We derived an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data-efficient off-policy algorithm based on stochastic value gradients. Our experiments indicate that our method allows for discovery of multiple solutions and is capable of learning the minimum number of distinct skills that are necessary to solve a given set of tasks. In addition, we showed that our technique can interpolate and/or sequence previously learned skills in order to accomplish more complex tasks, even in the presence of sparse rewards.

Part IV Conclusions

Chapter 9 Conclusions

In this thesis, we have shown multiple examples of how closing perception-action loops and using learned representations can improve performance in various perceptual or manipulation tasks. In the first part of the thesis, we applied the (Inter)active perception paradigm to multiple robotic systems, such as humanoids and UAVs, in scenarios where the robots either interact with the environment (IP) or change their position relative to the environment (AP). This paradigm introduced a tight interaction between the control and perception systems that enables closing action-perception loops in order to improve overall performance. We showed the benefits of our approach in various applications such as Grasping and Articulation Model Estimation on a manipulation platform as well as Multi-Robot Target Tracking and Self-Calibration on a UAV. In the second part of the thesis, we introduced a set of Deep Robot Learning approaches that address the problem of hand-designed and non-goal-tailored representations by using deep learning techniques to optimize the representation in an end-to-end fashion. This new line of work addresses various challenges associated with learning representations for robotics. In particular, we presented solutions to the problems of sample complexity for deep reinforcement learning, scalability of current imitation learning methods, and reusing and representing previously learned skills.

9.1 Future Work

The work presented in this thesis describes different approaches for applying the interactive perception paradigm and using learned representations to overcome various challenges in robotics. There are multiple possibilities for how the ideas presented in this work can be further extended. A promising direction to explore in future work is a combination of the paradigms presented in this thesis: learning goal-oriented representations and interactive perception. An obvious implementation could involve applying active learning techniques to the framework of deep reinforcement learning. The main difficulty in directly applying this idea stems from the fact that correctly representing the uncertainty remains challenging for deep neural networks; indeed, there have been multiple attempts to address this problem (Gal, 2016). However, there exist methods that apply ideas from the active learning community to the problems of guiding exploration in deep reinforcement learning.
These mainly focus on curiosity-based exploration (Fu, Co-Reyes, & Levine, 2017; Tang et al., 2017) and information-theoretic-based reward bonuses (Houthooft et al., 133 2016) that are added to the agent’s objective. We believe that this is a promising research direction that should be further explored. In our opinion, however, the approach that has the biggest potential for significantly improving the state of the art in robotics research is based on unsupervised or predictive learning. In this case, the interactive perception paradigm can be used to discover interesting properties of the world for predictive learning. As opposed to reinforcement learning, predictive learning has access to enormous amounts of data and learning signals that are constantly coming from the environment. These are based on predicting the next state given the current one, or predicting one sensory modality given another one. We strongly believe that interactive-perception-enhanced predictive learning should be able to learn generalizable representations that may be used for pretraining reinforcement learning models and policies, which have the potential to greatly advance the field. 134 Appendix A Appendix A.1 Appendix: Combining Model-Based and Model-Free Updates for Deep Reinforcement Learning A.1.1 Derivation of LQR-FLM Given a TVLG dynamics model and quadratic cost approximation, we can approximate our Q and value functions to second order with the following dynamic programming updates, which proceed from the last time stept=T to the first stept= 1: Q x;t =c x;t + f ⊺ x;t V x;t+1 ; Q xx;t =c xx;t + f ⊺ x;t V xx;t+1 f x;t ; Q u;t =c u;t + f ⊺ u;t V x;t+1 ; Q uu;t =c uu;t + f ⊺ u;t V xx;t+1 f u;t ; Q xu;t =c xu;t + f ⊺ x;t V xx;t+1 f u;t ; V x;t =Q x;t −Q xu;t Q −1 uu;t Q u;t ; V xx;t =Q xx;t −Q xu;t Q −1 uu;t Q ux;t : Here, similar to prior work, we use subscripts to denote derivatives. It can be shown (e.g., in (Tassa et al., 2012)) that the action u t that minimizes the second-order approximation of the Q-function at every time stept is given by u t =−Q −1 uu;t Q ux;t x t −Q −1 uu;t Q u;t : This action is a linear function of the state x t , thus we can construct an optimal linear policy by setting K t = −Q −1 uu;t Q ux;t and k t = −Q −1 uu;t Q u;t . We can also show that the maximum-entropy policy that minimizes the approximate Q-function is given by p(u t Sx t )=N (K t x t + k t ;Q uu;t ): This form is useful for LQR-FLM, as we use intermediate policies to generate samples to fit TVLG dynamics. (Levine & Abbeel, 2014) impose a constraint on the total KL-divergence between the old and new trajectory distributions induced by the policies through an augmented cost function 135 c(x t ;u t )= 1 c(x t ;u t )− logp (i−1) (u t Sx t ), where solving for via dual gradient descent can yield an exact solution to a KL-constrained LQR problem, where there is a single constraint that operates at the level of trajectory distributionsp(). We can instead impose a separate KL-divergence constraint at each time step with the constrained optimization min ut;t E x∼p(xt);u∼N (ut;t) [Q(x;u)] s:t: E x∼p(xt) [D KL (N (u t ; t )Yp (i−1) )]≤ t : The new policy will be centered around u t with covariance term t . Let the old policy be parameterized by K t , k t , and C t . 
We form the Lagrangian (dividing by t ), approximateQ, and expand the KL-divergence term to get L(u t ; t ; t ) = 1 t Q ⊺ x;t x t +Q ⊺ u;t u t + 1 2 x ⊺ t Q xx;t x t + 1 2 tr(Q xx;t x;t ) + 1 2 u ⊺ t Q uu;t u t + 1 2 tr(Q uu;t t )+ x ⊺ t Q xu;t u t + 1 2 logS t S− logS t S−d+tr( −1 t t ) +( K t x t + k t − u t ) ⊺ −1 t ( K t x t + k t − u t ) +tr( K ⊺ t −1 t K t x;t )− t : Now we set the derivative ofL with respect to t equal to 0 and get t = 1 t Q uu;t + −1 t −1 : Setting the derivative with respect to u t equal to 0, we get u t =− t 1 t Q u;t + 1 t Q ux;t x t − ^ C −1 t ( ^ K t x t + ^ k t ); Thus our updated mean has the parameters k t =− t 1 t Q u;t − ^ C −1 t ^ k t ; K t =− t 1 t Q ux;t − ^ C −1 t ^ K t : 136 As discussed by (Tassa et al., 2012), when the updated K t and k t are not actually the optimal solution for the current quadratic Q-function, the update to the value function is a bit more complex, and is given by V x;t =Q ⊺ x;t +Q ⊺ u;t K t + k ⊺ t Q uu;t K t + k ⊺ t Q ux;t ; V xx;t =Q xx;t + K ⊺ t Q uu;t K t + 2Q xu;t K t : A.1.2 PI2 update through constrained optimization The structure of the proof for the PI 2 update follows (Peters et al., 2010), applied to the cost- to-go S(x t ;u t ). Let us first consider the cost-to-go S(x t ;u t ) of a single trajectory or path (x t ;u t ;x t+1 ;u t+1 ;:::;x T ;u T ) whereT is the maximum number of time steps. We can rewrite the Lagrangian in a sample-based form as L(p (i) ; t )= Qp (i) S(x t ;u t )+ t Qp (i) log p (i) p (i−1) −: Taking the derivative ofL(p (i) ; t ) with respect to a single optimal policyp (i) and setting it to zero results in @L @p (i) =S(x t ;u t )+ t log p (i) p (i−1) +p (i) p (i−1) p (i) 1 p (i−1) =S(x t ;u t )+ t log p (i) p (i−1) = 0: Solve the derivative forp (i) by exponentiating both sides log p (i) p (i−1) =− 1 t S(x t ;u t ); p (i) =p (i−1) exp− 1 t S(x t ;u t ): This gives us a probability update rule for a single sample that only considers cost-to-go of one path. However, when sampling from a stochastic policyp (i−1) (u t Sx t ), there are multiple paths that start in state x t with action u t and continue with a noisy policy afterwards. Hence, the updated policyp (i) (u t Sx t ) will incorporate all of these paths as p (i) (u t Sx t )∝p (i−1) (u t Sx t )E p (i−1)exp− 1 t S(x t ;u t ): The updated policy is additionally subject to normalization, which corresponds to computing the normalized probabilities in Eq. (6.2). 137 A.1.3 Detailed Experimental Setup Simulation Experiments All of our cost functions use the following generic loss term on a vector z `(z)= 1 2 YzY 2 2 + ¼ +YzY 2 2 : (A.1) and are hyperparameters that weight the squared` 2 loss and Huber-style loss, respectively, and we set = 10 −5 . On the gripper pusher task, we have three such terms. The first sets z as the vector difference between the block and goal positions with = 10 and = 0:1. z for the second measures the vector difference between the gripper and block positions, again with= 10 and = 0:1, and the last loss term penalizes the magnitude of the fourth robot joint angle with= 10 and = 0. We include this last term because, while the gripper moves in 3D, the block is constrained to a 2D plane and we thus want to encourage the gripper to also stay in this plane. These loss terms are weighted by 4, 1, and 1 respectively. On the reacher task, our only loss term uses as z the vector difference between the arm end effector and the target, with= 0 and = 1. 
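To make the two updates in Appendices A.1.1 and A.1.2 concrete, the following is a minimal NumPy sketch of (a) one step of the LQR backward pass that yields the time-varying linear-Gaussian controller, and (b) the PI2-style exponentiated cost-to-go weighting. The dimensions, the regularization, and the choice of $Q_{uu}^{-1}$ as the policy covariance follow common practice for LQR-FLM-style methods and are assumptions of this sketch rather than a transcription of the exact implementation.

```python
import numpy as np

def lqr_backward_step(f_x, f_u, c_x, c_u, c_xx, c_uu, c_xu, V_x, V_xx):
    """One step of the LQR backward pass from Appendix A.1.1 (dynamic programming at time t)."""
    Q_x = c_x + f_x.T @ V_x
    Q_u = c_u + f_u.T @ V_x
    Q_xx = c_xx + f_x.T @ V_xx @ f_x
    Q_uu = c_uu + f_u.T @ V_xx @ f_u
    Q_xu = c_xu + f_x.T @ V_xx @ f_u
    Q_uu_inv = np.linalg.inv(Q_uu)
    K = -Q_uu_inv @ Q_xu.T             # feedback gain K_t
    k = -Q_uu_inv @ Q_u                # feedforward term k_t
    V_x_new = Q_x - Q_xu @ Q_uu_inv @ Q_u
    V_xx_new = Q_xx - Q_xu @ Q_uu_inv @ Q_xu.T
    # Linear-Gaussian policy N(K x + k, Sigma); Sigma = Q_uu^{-1} is an assumption of this sketch.
    return K, k, Q_uu_inv, V_x_new, V_xx_new

def pi2_weights(costs_to_go, eta):
    """PI2-style probability weights: exponentiate the negative cost-to-go and normalize (A.1.2)."""
    s = np.asarray(costs_to_go, dtype=float)
    w = np.exp(-(s - s.min()) / eta)   # subtracting the minimum is only for numerical stability
    return w / w.sum()

# Toy usage with random linear dynamics and a quadratic cost (state dim 3, action dim 2).
rng = np.random.default_rng(5)
dx, du = 3, 2
A, B = rng.normal(size=(dx, dx)), rng.normal(size=(dx, du))
c_xx, c_uu, c_xu = np.eye(dx), 0.1 * np.eye(du), np.zeros((dx, du))
K, k, Sigma, V_x, V_xx = lqr_backward_step(A, B, np.zeros(dx), np.zeros(du),
                                           c_xx, c_uu, c_xu, np.zeros(dx), np.eye(dx))
print(K.shape, k.shape, pi2_weights(rng.normal(size=10), eta=1.0))
```

In a PILQR-style method, the controller returned by the backward pass and the sample weights returned by the PI2 step are the two ingredients that get interleaved at each iteration.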
For both the reacher and door opening tasks, we also include a small torque penalty term that penalizes unnecessary actuation and is typically several orders of magnitude smaller than the other loss terms. On the door opening task, we use two loss terms. For the first, z measures the difference between the angle in radians of the door hinge and the desired angle of−1:0, with= 1 and = 0. The second term is time-varying: for the first 25 time steps, z is the vector difference between the bottom of the robot end effector and a position above the door handle, and for the remaining time steps, z is the vector difference from the end effector to inside the handle. This encourages the policy to first navigate to a position above the handle, and then reach down with the hook to grasp the handle. Because we want to emphasize the second loss during the beginning of the trajectory and gradually switch to the first loss, we do a time-varying weighting between the loss terms. The weight of the second loss term is fixed to 1, but the weight of the first loss term at time stept is 5 t T 2 . For the neural network policy architectures, we use two fully-connected hidden layers of rectified linear units (ReLUs) with no output non-linearity. On the reacher task, the hidden layer size is 32 units per layer, and on the door opening task, the hidden layer size is 100 units per layer. All of the tasks involve varying conditions for which we train one TVLG policy per condition and, for reacher and door opening, train a neural network policy to generalize across all conditions. For gripper pusher, the conditions vary the starting positions of the block and the goal, which can have a drastic effect on the difficulty of the task. Figure A.1 illustrates the four initial conditions of the gripper pusher task for which we train TVLG policies. For reacher, analogous to OpenAI Gym, we vary the initial arm configuration and position of the target and train TVLG policies from 16 randomly chosen conditions. Note that, while OpenAI Gym randomizes this initialization per episode, we always reset to the same condition when training TVLG policies as this is an additional assumption we impose. However, when we test the performance of the neural network 138 Figure A.1: The initial conditions for the gripper pusher task that we train TVLG policies on. The top left and bottom right conditions are more difficult due to the distance from the block to the goal and the configuration of the arm. The top left condition results are reported in Section 6.6.1. 139 policy, we collect 300 test episodes with random initial conditions. For the door opening task, we initialize the robot position within a small square in the ground plane. We train TVLG policies from the four corners of this square and test our neural network policies with 100 test episodes from random positions within the square. For the gripper pusher and door opening tasks, we train TVLG policies with PILQR, LQR-FLM and PI 2 with 20 episodes per iteration per condition for 20 iterations. In Appendix A.1.4, we also test PI 2 with 200 episodes per iteration. For the reacher task, we use 3 episodes per iteration per condition for 10 iterations. Note that we do not collect any additional samples for training neural network policies. For the prior methods, we train DDPG with 150 episodes per epoch for 80 epochs on the reacher task, and TRPO uses 600 episodes per iteration for 120 iterations. 
On door opening, TRPO uses 400 episodes per iteration for 80 iterations and DDPG uses 160 episodes per epoch for 100 epochs, though note that DDPG is ultimately not successful. Real Robot Experiments For the real robot tasks we use a hybrid cost function that includes two loss terms of the form of Eq. A.1. The first loss term` arm (z) computes the difference between the current position of the robot’s end-effector and the position of the end-effector when the hockey stick is located just in front of the puck. We set = 0:1 and = 0:0001 for this cost function. The second loss term ` goal (z) is based on the distance between the puck and the goal that we estimate using a motion capture system. We set= 0:0 and = 1:0. Both` arm and` goal have a linear ramp, which makes the cost increase towards the end of the trajectory. In addition, We include a small torque cost term` torque to penalize unnecessary high torques. The combined function sums over all the cost terms: ` total = 100:0` goal +` arm +` torque . We give a substantially higher weight to the cost on the distance to the goal to achieve a higher precision of the task execution. Our neural network policy includes two fully-connected hidden layers of rectified linear units (ReLUs). Each of the hidden layers consists of 42 neurons. The inputs of the policy include: puck and goal positions measured with a motion capture system, robot joint angles, joint velocities, the end-effector pose and the end-effector velocity. During PILQR-MDGPS training, we use data augmentation to regularize the neural network. In particular, the observations were augmented with Gaussian noise to mitigate overfitting to the noisy sensor data. A.1.4 Additional Simulation Results Figure A.2 shows additional simulation results obtained for the gripper pusher task for the three additional initial conditions. The instances presented here are not as challenging as the one reported in the work. Our method (PILQR) is able to outperform other baselines except for the first two conditions presented in the first rows of Figure A.2, where LQR-FLM performs equally well due to the simplicity of the task. PI 2 is not able to make progress with the same number of samples, however, its performance on each condition is comparable to LQR-FLM when provided with 10 times more samples. We also test PI 2 with 10 times more samples on the reacher and door opening tasks. On the reacher task, PI 2 improves substantially with more samples, though it is still worse than the four 140 0 200 400 600 800 1000 1200 1400 1600 # samples 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Avg final distance PI2 LQR-FLM PILQR 0 200 400 600 800 1000 1200 1400 1600 # samples 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Avg final distance PI2 LQR-FLM PILQR 0 200 400 600 800 1000 1200 1400 1600 # samples 0.0 0.5 1.0 1.5 2.0 Avg final distance PI2 LQR-FLM PILQR Figure A.2: Single condition comparisons of the gripper-pusher task in three additional conditions. The top, middle, and bottom plots correspond to the top right, bottom right, and bottom left conditions depicted in Figure A.1, respectively. The PILQR method outperforms other baselines in two out of the three conditions. The conditions presented in the top and middle figure are significantly easier than the other conditions presented in the work. 141 Figure A.3: Additional results on the door opening task. other methods. However, as Figure A.3 shows, PI 2 is unable to succeed on the door opening task even with 10 times more samples. 
The performance of PI 2 is likely to continue increasing as we provide even more samples. A.2 Appendix: Learning and Embedding Space for Transferable Robot Skills A.2.1 Variational Bound Derivation In order to introduce an information-theoretic regularization that encourages versatile skills, we borrow ideas from the variational inference literature. In particular, in the following, we present a lower bound of the marginal entropyH (p(x)), which will prove useful when applied to the reinforcement learning objective in Sec. 8.4.1. 142 Theorem 2. The lower bound on the marginal entropyH (p(x)) corresponds to: H (p(x))≥ S S p(x;z) log( q(zSx) p(x;z) dz)dx; (A.2) whereq(zSx) is the variational posterior. Proof. H (p(x))= S p(x) log( 1 p(x) )dx= S p(x) log( S q(zSx) 1 p(x) dz)dx = S p(x) log( S q(zSx) p(zSx) p(x;z) dz)dx≥ S p(x) S p(zSx) log( q(zSx) p(x;z) dz)dx = S S p(x;z) log( q(zSx) p(x;z) dz)dx: (A.3) A.2.2 Derivation for Multiple Timesteps We represent the trajectory as = (s 0 ;a 0 ;s 1 ;a 1 ;:::;s T ) and the learned parametrized posterior (policy) as () =p(s 0 )∏ T−1 i=0 (a i Ss i )p(s i+1 Ss i ;a i ). The learned inference network is repre- sented by q (zS) and we introduce the pseudo likelihood that is equal to cumulative reward: logp(R= 1S)=∑ t r(s t ;a t ). In this derivation we also assume the existence of a prior over trajectories of the form:()= p(s 0 )∏ T−1 i=0 (a i Ss i )p(s i+1 Ss i ;a i ). where represents our ”prior policy”. We thus consider the relative entropy between and. Note that we can choose prior policy to be non-informative (e.g. a uniform prior over action for bounded action spaces). With these definitions, we can cast RL as a variational inference problem: L= log S p(R= 1S)()d ≥ S () log p(R= 1S)() () d = S () logp(R= 1S)d + S () log () () d =E [Q t r(s t ;a t )]+E Q t log (a t Ss t ) (a t Ss t ) =E [Q t r(s t ;a t )]+E Q t KL[ t SS t ] =L; (A.4) 143 We can now introduce the latent variablez that forms a Markov chain: ()= S (Sz)p(z)dz = S p(s 0 )p(z 0 ) T−1 M i=0 (a i Ss i ;z i )p(s i+1 Ss i ;a i )p(z i+1 Sz i )dz 1∶T : (A.5) Applying it to the loss, we obtain: L =E [Q t r(s t ;a t )]+E [KL[()SS()]] =E [Q t r(s t ;a t )]+E log () ∫ (Sz 1∶T )p(z 1∶T )dz 1∶T d ≥E [Q t r(s t ;a t )]+ E () E p(z 1∶T S) Q t S (a ′ t Ss t ;z t ) log (a ′ t Ss t ) (a ′ t Ss t ;z t ) da ′ t + log q(z 1∶T S) p(z 1∶T ) : (A.6) Equation (A.6) arrives at essentially the same bound as that in Equation (8.2) but for sequences. The exact form of (A.6) in the previous equation depends on the form that is chosen forq. For instance, forq(zS)=q(z T S)q(z T−1 Sz T ;)q(z T−2 Sz T−1 ;)::: we obtain: E Q t log (a t Ss t ) (a t Ss t ;z t ) + log q(z 1∶T S) p(z 1∶T ) =E Q t log (a t Ss t ) (a t Ss t ;z t ) + T Q t=1 log q(z t−1 Sz t ;) p(z t+1 Sz t ) + logq(z T S)− logp(z 0 ) : (A.7) Other forms forq are also feasible, but the above form gives a nice temporal decomposition of the (augmented) reward. A.2.3 Algorithm details Stochastic Value Gradient for the policy We here give a derivation of the stochastic value gradient for the objective from Equation (8.6) that we use for gradient based optimization. We start by reparameterizing the sampling stepz ∼p (zSt) for the embedding asg (t; z ), where z is a random variable drawn from an appropriately chosen base distribution. That is, for a Gaussian embedding we can use a normal distribution (D. P. Kingma & Welling, 2013; Rezende et al., 2014) z ∼N (0;I), whereI denotes the identity. 
For a Bernoulli embedding we can use the Concrete distribution reparametrization (Maddison, Mnih, & Teh, 2017) (also named the Gumbel-softmax trick (Jang, Gu, & Poole, 2017)). For the policy distribution 144 we always assume a Gaussian and can hence reparameterize usingg (t; a ) with a ∼N (0;I). Using a Gaussian embedding we then get the following gradient for the the policy parameters ∇ ^ L(;)=∇ ⎡ ⎢ ⎢ ⎢ ⎢ ⎣ E (aSz;s) p (zSt) s;t∈B Q ' (s;a;z)+E t∈T H [p (zSt)] ⎤ ⎥ ⎥ ⎥ ⎥ ⎦ ; =E a∼N (0;I) z∼N (0;I) s;t∈B ∇ Q ' (s;g (t; a );g (t; z ))∇ g (t; a ) ; (A.8) and, for the embedding network parameters, ∇ ^ L(;)=∇ ⎡ ⎢ ⎢ ⎢ ⎢ ⎣ E (aSz;s) p (zSt) s;t∈B Q ' (s;a;z)+E t∈T H [p (zSt)] ⎤ ⎥ ⎥ ⎥ ⎥ ⎦ ; =E a∼N (0;I) z∼N (0;I) s;t∈B ∇ Q ' (s;g (t; a );g (t; z ))∇ g (t; z ) +E t∈T ∇ H [p (zSt)] : (A.9) A.2.4 Implementation Details Task Structure All the tasks presented in Sec. 8.7.2 share a similar structure, in that the observation space used for the pre-trained skills and the observation space used for the final task are the same. For all three tasks, the observations include: joint angles (7) and velocities (7) of the robot joints, the position (3), orientation (4) and linear velocity (3) of the block as well as the position of the goal (3). The action space is also the same across all tasks and consists of joint torques for all the robot joints (7). We choose such a structure (making sure that the action space matches and providing only proprioceptive information to the policy) to make sure we i) can transfer the policy between tasks directly; 2) to ensure that the only way the agent is informed about changing environment dynamics (e.g., the attachment of the block to a string, the existence of a wall, etc.) is through the task id. The rationale behind having the same observation space between the pre-trained skills and the final task comes from the fact that currently, our architecture expects the same observations for the final policy over embeddings and the skill subpolicies. We plan to address this limitation in future work. Network Architecture and Hyperparameters The hereby presented values were used to generate results for the final three manipulation tasks presented in Sec. 8.7.2. For both policy and inference network we used two-layer fully connected neural networks with exponentiaded linear activations (Clevert, Unterthiner, & Hochreiter, 2015) 145 (for layer sizes see table) to parameterize the distribution parameters. As distributions we always relied on a gaussian distributionN ( (x); diag((x))) whose mean and diagonal covariance are parameterized by the policy network via [mu(x); log(x)] = f (x). For the embedding network the mapping from one-hot task vectors to distribution parameters is given via a linear transformation. For the inference network we map to the parameters of the same distribution class via another neural network. Hyperparameter Spring-wall L-wall Rail-push State dims 27 27 27 Action dims 7 7 7 Policy net 200-100 200-100 200-100 Q function net 200-200 200-200 200-100 Inference net 200-200 200-200 200-100 Embedding distribution 3D Gaussian 3D Gaussian 3D Gaussian Minibatch size (per-worker) 32 32 32 Replay buffer size 1e 5 1e 5 1e 5 1 1e 3 1e 3 1e 3 2 1e 3 1e 3 1e 3 3 1e 3 1e 3 1e 3 Discount factor ( ) 0.99 0.99 0.99 Adam learning rate 1e −3 1e −3 1e −3 146 References Abbeel, P., & Ng, A. Y . (2004). Apprenticeship learning via inverse reinforcement learning. In Proc. icml. Retrieved fromciteseer.ist.psu.edu/662229.html Achtelik, M., Weiss, S., Chli, M., & Siegwart, R. (2013). 
Path planning for motion dependent state estimation on micro aerial vehicles. In Proc. of the ieee int. conf. on robotics & automation (icra). Adamey, E., & Ozguner, U. (2012). A decentralized approach for multi-UA V multitarget tracking and surveillance. In Spie defense, security, and sensing (pp. 838915–838915). Aharon, M., Elad, M., & Bruckstein, A. (2006). k -svd: An algorithm for designing overcomplete dictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11), 4311- 4322. Ahmad, A., & Lima, P. (2013). Multi-robot cooperative spherical-object tracking in 3D space based on particle filters. Robotics and Autonomous Systems, 61(10), 1084–1093. Ahmad, A., Tipaldi, G., Lima, P., & Burgard, W. (2013b). Cooperative robot localization and target tracking based on least squares minimization. In Proc. of the ieee int. conf. on robotics & automation (icra) (pp. 5696–5701). Ahmad, A., Tipaldi, G. D., Lima, P., & Burgard, W. (2013a). Cooperative robot localization and target tracking based on least squares minimization. In Robotics and automation (icra), 2013 ieee international conference on (pp. 5696–5701). Ahmad, I., & Lin, P.-E. (1976). A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.). Information Theory, IEEE Transactions on, 22(3), 372–375. Akrour, R., Abdolmaleki, A., Abdulsamad, H., & Neumann, G. (2016). Model-free trajectory optimization for reinforcement learning. In Icml. Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5), 469–483. 147 Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein gan. CoRR, abs/1701.07875. Retrieved fromhttp://dblp.uni-trier.de/db/journals/corr/ corr1701.html#ArjovskyCB17 Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866. Babes, M., Marivate, V ., Subramanian, K., & Littman, M. L. (2011). Apprenticeship learning about multiple intentions. In Proceedings of the 28th international conference on machine learning (icml-11) (pp. 897–904). Bacon, P.-L., Harb, J., & Precup, D. (2017). The option-critic architecture. In Aaai (pp. 1726–1734). Bajcsy, R. (1988). Active perception. Proceedings of the IEEE, 76(8), 966–1005. Bajcsy, R., Aloimonos, Y ., & Tsotsos, J. K. (2016). Revisiting active perception. Autonomous Robots, 1–20. Barber, D., & Agakov, F. V . (2003). The IM algorithm: A variational approach to information maximization. In Advances in neural information processing systems 16 [neural information processing systems, NIPS 2003, december 8-13, 2003, vancouver and whistler, british columbia, canada] (pp. 201–208). Beinhofer, M., Muller, J., Krause, A., & Burgard, W. (2013). Robust landmark selection for mobile robot navigation. In Intelligent robots and systems (iros), 2013 ieee/rsj international conference on (pp. 3637–2643). Beinhofer, M., M¨ uller, J., Krause, A., & Burgard, W. (2013). Robust landmark selection for mobile robot navigation. In Proc. of the ieee/rsj int. conf. on intelligent robots and systems (iros) (p. 2638-2643). Bekiroglu, Y ., Kragic, D., & Kyrki, V . (2010). Learning grasp stability based on tactile data and hmms. In Ro-man, 2010 ieee (pp. 132–137). Bekiroglu, Y ., Laaksonen, J., Jørgensen, J., Kyrki, V ., & Kragic, D. (2011). Assessing grasp stability based on learning and haptic data. Robotics, IEEE Transactions on, 27(3), 616–629. 
Abstract
For robots to become fully autonomous in real-world environments, they must be able to cope with uncertainties resulting from imperfect control and perception. Although robotic perception and control systems typically account for uncertainty independently, they do not consider uncertainty in the coupling between perception and control, which degrades performance. To address this shortcoming, recent approaches in robotics follow the insight that interaction with the environment can facilitate perception and inform manipulation, thereby reducing the uncertainty at the interface between the perception and control components.

Another challenge at the intersection of these components concerns the intermediate representations used to pass information between the two. Traditionally, the interfaces between perception and control systems have been well defined and hand-specified by robotics engineers, which can make the resulting intermediate representations overly constraining. This stands in strong contrast to modern, flexible, learned representations that are optimized for the task at hand using deep learning.

In this thesis, we present several methods that demonstrate how interaction with the environment, closed perception-action loops, and learned representations benefit manipulation and perception tasks. Specifically, we study Interactive Perception, where the robot forcefully interacts with the environment; Active Perception, where the robot changes its sensor states to acquire additional data; and Robot Learning, where the robot learns the intermediate representations in an end-to-end fashion. We show applications of these paradigms to articulation model estimation, grasping, multi-robot target tracking, self-calibration, locomotion, and the acquisition of manipulation skills.
Conceptually similar
Learning affordances through interactive perception and manipulation
Algorithms and systems for continual robot learning
Leveraging prior experience for scalable transfer in robot learning
Intelligent robotic manipulation of cluttered environments
Information theoretical action selection
Data-driven acquisition of closed-loop robotic skills
Learning from planners to enable new robot capabilities
Machine learning of motor skills for robotics
Scaling robot learning with skills
Program-guided framework for interpreting and acquiring complex skills with learning robots
Learning objective functions for autonomous motion generation
From active to interactive 3D object recognition
Data scarcity in robotics: leveraging structural priors and representation learning
Closing the reality gap via simulation-based inference and control
Efficiently learning human preferences for proactive robot assistance in assembly tasks
High-throughput methods for simulation and deep reinforcement learning
The representation, learning, and control of dexterous motor skills in humans and humanoid robots
Managing multi-party social dynamics for socially assistive robotics
Situated proxemics and multimodal communication: space, speech, and gesture in human-robot interaction
Nonverbal communication for non-humanoid robots
Asset Metadata
Creator
Hausman, Karol Maria (author)
Core Title
Rethinking perception-action loops via interactive perception and learned representations
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
03/27/2018
Defense Date
01/23/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
deep learning, deep reinforcement learning, deep robot learning, interactive perception, OAI-PMH Harvest, robot learning, robotics
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sukhatme, Gaurav (committee chair), Abbeel, Pieter (committee member), Gupta, Satyandra (committee member), Lim, Joseph (committee member), Schaal, Stefan (committee member)
Creator Email
hausman@usc.edu, hausmankarol@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-488913
Unique identifier
UC11266791
Identifier
etd-HausmanKar-6127.pdf (filename), usctheses-c40-488913 (legacy record id)
Legacy Identifier
etd-HausmanKar-6127.pdf
Dmrecord
488913
Document Type
Dissertation
Rights
Hausman, Karol Maria
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA