LEVERAGING STRUCTURE FOR LEARNING ROBOT CONTROL AND REACTIVE PLANNING

by

Giovanni Sutanto

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

August 2020

Copyright 2020 Giovanni Sutanto

Dedication

For my parents, with thanks for their never-ending support, teachings, and prayers that have brought me this far and beyond.

Acknowledgements

First and foremost, I would like to thank God Almighty for giving me the guidance, perseverance, persistence, knowledge, skills, and opportunity to pursue this research, as well as to complete it satisfactorily. He has brought me far from my home country, Indonesia, to Japan, California, Seattle, and Europe, and this wonderful lifetime experience has filled my heart with peace and love, as well as appreciation of the beauty of nature, living creations, mathematics, and life in general. I am so blessed and I am forever grateful for all these blessings.

Next, I would like to thank Stefan Schaal, who gave me the opportunity to venture into the amazing world of robotics and machine learning by working in the Computational Learning and Motor Control (CLMC) laboratory, University of Southern California (USC), and the Autonomous Motion Department (AMD) laboratory, Max Planck Institute for Intelligent Systems (MPI-IS). I would also like to thank Gaurav Sukhatme for allowing me to complete my Ph.D. with his guidance in the Robotic Embedded Systems Laboratory (RESL). I also spent an amazing research internship in Summer 2018 at NVIDIA Robotics, Seattle, thanks to Dieter Fox, Nathan Ratliff, and Ankur Handa. Thanks to my research mentor, Franziska Meier, for her guidance, patience, and assistance in navigating the research world, during my research at USC and during my research internship at Facebook Artificial Intelligence Research (FAIR) in Fall 2019.
I also thank Heather Culbertson, James Finley, Laurent Itti, and Joseph J. Lim for their time and effort serving on the committees for my Ph.D. Qualifying Examination, Thesis Proposal, and/or Dissertation Defense.

My research has been made possible by generous funding from the Max Planck Society, NVIDIA Research, Facebook Artificial Intelligence Research, the National Science Foundation, the Office of Naval Research, the Okawa Foundation, and USC Research and Teaching Assistantships. Any opinions, findings, and conclusions or recommendations expressed in this dissertation are those of the author and do not necessarily reflect the views of the funding organizations.

Finally, I would like to acknowledge the many colleagues, mentors, and friends whom I met along the way in this journey: Ludovic Righetti, Nathan Ratliff, Ching-An Cheng, Mustafa Mukadam, Isabel Rayas, Peter Englert, Ragesh Kumar Ramachandran, and Nick Rotella for the exciting discussions on quaternions, special orthogonal groups, and the beautiful and magical world of differential geometry; Yevgen Chebotar for discussions on reinforcement learning algorithms, the food exploration, and the other fun times in Seattle; Harry (Zhe) Su for consultations on tactile sensors; Sean Mason for his expertise in locomotion and LQRs; Sarah Bechtle for the lunches and other fun food explorations around USC and Los Angeles in general, and especially for sharing the bittersweet nature of Ph.D. life; Vince Enachescu, Bharath Sankaran, John Rebula, and Kevin Hitzler for their friendship in the CLMC Lab; Jeannette Bohg for discussions on KUKA robots; Daniel Kappler, Manuel Wuethrich, and Alexander Herzog for their friendship in the AMD Lab; Vincent Berenz for his help in organizing a better codebase at the AMD Lab; Felix Grimminger for his assistance in designing custom 3D mechanical parts for the robots at the AMD/CLMC Lab; Balakumar Sundaralingam for sharing some useful utility code for tactile sensors.
Akshara Rai, Austin S. Wang, Yixin Lin, Alonso Marco Valle, Kristen Morse, Neha Das, and Dinesh Jayaraman for their assistance during my research internship at FAIR, Menlo Park, in Fall 2019. Thanks also to all the remaining members of the CLMC Lab, AMD Lab, RESL, NVIDIA Robotics Lab, and FAIR Robotics Lab whom I have not mentioned one-by-one.

Table of Contents

Dedication
Acknowledgements
List of Figures
List of Tables
List of Algorithms
Abstract
I Introduction
  I.1 System Model Acquisition for Planning and Control in Robotics: Analytical Approach versus Empirical/Data-Driven Approach
  I.2 Thesis Outline
II Learning Latent Space Dynamics for Tactile Servoing
  II.1 Introduction
  II.2 Related Work
  II.3 Data-Driven Tactile Servoing Model
    II.3.1 Tactile Servoing Problem Formulation
    II.3.2 Latent Space Representation
    II.3.3 Embedding Function
    II.3.4 Latent Space Forward Dynamics (LFD)
      II.3.4.1 Locally Linear (LL) LFD Model
      II.3.4.2 Non-Linear (NL) LFD Model
    II.3.5 Inverse Dynamics (ID)
      II.3.5.1 Locally Linear (LL) ID
      II.3.5.2 Negative-Gradient (NG) ID
      II.3.5.3 Neural Network Jacobian (NJ) ID
  II.4 Experiments
    II.4.1 Experimental Setup
      II.4.1.1 Human Demonstration Collection
      II.4.1.2 Action Representation
      II.4.1.3 Machine Learning Framework and Training Process
    II.4.2 Auto-Encoder Reconstruction Performance
    II.4.3 Neural Network Multi-Dimensional Scaling (MDS)
    II.4.4 Latent Forward Dynamics Prediction Performance
    II.4.5 Inverse Dynamics Prediction Performance
    II.4.6 Real Robot Experiment
  II.5 Summary
III Encoding Physical Constraints in Differentiable Newton-Euler Algorithm
  III.1 Introduction
  III.2 Background and Related Work
    III.2.1 Learning models for model-based control
  III.3 Encoding Physical Constraints as Structures in Differentiable Newton-Euler Algorithm (DiffNEA)
    III.3.1 Unstructured Mass and Rotational Inertia Matrix (DiffNEA No Str)
    III.3.2 Symmetric Rotational Inertia Matrix (DiffNEA Symm)
    III.3.3 Symmetric Positive Definite Rotational Inertia Matrix (DiffNEA SPD)
    III.3.4 Triangular Parameterized Rotational Inertia Matrix (DiffNEA Tri)
    III.3.5 Covariance Parameterized Rotational Inertia Matrix (DiffNEA Cov)
  III.4 Experiments
    III.4.1 Simulation
      III.4.1.1 Training Speed, Generalization Performance, and Effectiveness of Inverse Dynamics Learning
      III.4.1.2 Online Learning Speed
    III.4.2 Real Robot Experiments
      III.4.2.1 Evaluation
  III.5 Summary and Future Work
IV Learning Feedback Models for Reactive Behaviors
  IV.1 Introduction
  IV.2 Related Work
    IV.2.1 Movement Representations
    IV.2.2 Automated Demonstrations Alignment and Segmentation
    IV.2.3 Hand-Designed Feedback Models
    IV.2.4 Learning of Feedback Models
    IV.2.5 Reinforcement Learning
    IV.2.6 Reinforcement Learning of Nominal Behaviors
    IV.2.7 Reinforcement Learning of Feedback Models
  IV.3 Review: Dynamical Movement Primitives
  IV.4 Overview: Learning Feedback Models for Reactive Behaviors
  IV.5 Automated Segmentation of Nominal Behavior Demonstrations into Movement Primitives
    IV.5.1 Point Correspondence Matching using Dynamic Time Warping
    IV.5.2 Least Square Problem Setup
    IV.5.3 Weighted Least Square Solution
      IV.5.3.1 Refinement of the Trajectory Segmentation
    IV.5.4 Learning DMPs from Multiple Segmented Demonstrations
  IV.6 Supervised Learning of Feedback Models for Reactive Behaviors
    IV.6.1 Spatial Generalization using Local Coordinate Frames
    IV.6.2 Feedback Model Learning Framework
      IV.6.2.1 Feedback Model Input Specification
      IV.6.2.2 Target Adaptation Level Extraction from Human Demonstrations Data
      IV.6.2.3 Feedback Model Representations
        IV.6.2.3.1 Neural Network with Output Post-Processing
        IV.6.2.3.2 Phase-Modulated Neural Network (PMNN)
      IV.6.2.4 Supervised Learning of Feedback Models
  IV.7 Reinforcement Learning of Feedback Models for Reactive Behaviors
    IV.7.1 Phase 1: Evaluation of the Current Adaptive Behavior and Conversion to a Low-Dimensional Policy
    IV.7.2 Phase 2: Optimization of the Low-Dimensional Policy
    IV.7.3 Phase 3: Rolling Out the Improved Low-Dimensional Policy
    IV.7.4 Phase 4: Supervised Learning of the Feedback Model
  IV.8 Learning Obstacle Avoidance Feedback Model Testbed
    IV.8.1 Neural Network Specifications and Input-Output Details
    IV.8.2 Neural Network Output Post-Processing for Obstacle Avoidance Testbed
    IV.8.3 Experimental Setup
    IV.8.4 Experimental Evaluations
    IV.8.5 Per setting experiments
    IV.8.6 Multiple setting experiments
    IV.8.7 Unseen setting experiments
    IV.8.8 Real robot experiment
  IV.9 Learning Tactile Feedback Model Testbed
    IV.9.1 System Overview and Experimental Setup
      IV.9.1.1 Hardware
      IV.9.1.2 Environmental Settings Definition and Demonstrations with Sensory Traces Association
      IV.9.1.3 Learning Pipeline Details and Lessons Learned
      IV.9.1.4 Learning Representations Implementation
      IV.9.1.5 Cost Definition for Reinforcement Learning
    IV.9.2 Extraction of Nominal Movement Primitives by Semi-Automated Segmentation of Demonstrations
      IV.9.2.1 Correspondence Matching Results and Issues
      IV.9.2.2 Segmentation Results via Weighted Least Square Method
    IV.9.3 Supervised Learning of Feedback Models
      IV.9.3.1 Regression and Generalization Evaluation of PMNNs
      IV.9.3.2 Performance Comparison between FFNN and PMNN
      IV.9.3.3 Comparison between Separated versus Embedded Feature Representation and Phase-Dependent Learning
      IV.9.3.4 Evaluation of Movement Phase Dependency
      IV.9.3.5 Unrolling the Learned Feedback Model on the Robot
    IV.9.4 Reinforcement Learning of Feedback Models
      IV.9.4.1 Quantitative Evaluation of Training with Reinforcement Learning
      IV.9.4.2 Feedback Models Performance Before versus After Reinforcement Learning and Across-Settings Generalization Performance
      IV.9.4.3 Qualitative Evaluation of the Real Robot Behavior
  IV.10 Summary
V Conclusion and Future Work
  V.1 Conclusion
  V.2 Future Work
Appendix
  A Solution of Constrained Optimal Control of 1-Horizon
  B Quaternion Algebra
  C Time Complexity Analysis and Quadratic Speed-Up of the Automated Nominal Demonstrations Alignment and Segmentation via Weighted Least Square Dynamic Time Warping
  D Publications and Presentations
Bibliography

List of Figures

II.1 Learning tactile servoing platform.
II.2 Neural network diagrams and their loss functions (drawn as dotted lines). The inverse dynamics loss function L_ID is not illustrated here.
II.3 x-y dimensions of the latent space embedding by MDS.
II.4 Normalized mean squared error (NMSE) vs. the length of chained forward dynamics prediction, averaged over latent space dimensions, on the test dataset.
II.5 Average cosine distance between rotational and translational inverse dynamics, weighted by the norm of the ground truth.
II.6 Sequence (Sq.) snapshots of our experiments executing tactile servoing with the learned model (non-linear LFD model and neural network Jacobian ID model) on a real robot. The red sticker indicates the target contact point. The first row, figures (a)-(d), is for a target contact point whose achievement requires a rotational change of pose of the BioTac finger. The second row, figures (e)-(h), is for a target contact point whose achievement requires a translational change of pose of the BioTac finger.
III.1 Triangular parameterization of the principal moments of inertia for physical consistency.
III.2 Online learning, with NMSE in log scale.
IV.1 Proposed framework for learning feedback models.
IV.2 Flow diagram of our framework. Phase 1 is detailed in Section IV.5 as outlined in Algorithm IV.1. Phase 2 is detailed in Section IV.6, with its flow diagram expanded in Figure IV.4. Phase 3 is detailed in Section IV.7 as outlined in Algorithm IV.3.
IV.3 Proposed framework for learning feedback models with raw sensor traces input (left) and with sensor trace deviations input (right).
IV.4 Flow diagram of the proposed framework for supervised learning of feedback models with raw sensor traces input (left) and with sensory trace deviations input (right).
IV.5 (Left) Example of a local coordinate frame definition for a set of obstacle avoidance demonstrations: a local coordinate frame is defined on trajectories collected from human demonstration. (Middle and right) The unrolled avoidance behavior is shown for two different locations of the obstacle and the goal: using the local coordinate system definition (right) and not using it (middle).
IV.6 System overview with local coordinate transform.
IV.7 Phase-modulated neural network (PMNN) with one-dimensional output coupling term c.
IV.8 Data collection setting and different obstacle geometries used in the experiment.
IV.9 Sample demonstrations. (b), (c), and (d) are a sample set of demonstrations for 1 out of 40 settings.
IV.10 Histograms describing the results of training and testing using a neural network (left plots) and the model from the previous work by Rai et al. [1] (right plots). (a) and (b) are the average NMSE across all dimensions generated over the complete dataset. (c) and (d) are the NMSE over the dominant axis of demonstrations with obstacle avoidance.
IV.11 Sample unrolled trajectories on trained and unseen settings.
IV.12 Snapshots from our experiment on our real system. Here the robot avoids a cylindrical obstacle using a neural network that was trained over cylindrical obstacle avoidance demonstrations. See https://youtu.be/hgQzQGcyu0Q for the complete video.
IV.13 Experimental setup of the scraping task.
IV.14 1 reference segment vs. 1 guess segment: the DTW-computed correspondence matching and refined segmentation results of primitive 1 based on the z-axis trajectory, compared between the un-weighted version and the weighted version.
IV.15 1 reference segment vs. 10 guess segments: the refined segmentation results (space vs. time) of primitive 1 (based on the z-axis trajectory) and primitive 3 (based on the y-axis trajectory), compared between the un-weighted version and the weighted version, as well as the (space-only) 3D Cartesian plot of the segmented primitives.
IV.16 (Top) Comparison of regression results on primitives 2 and 3 using different neural network structures; (middle) comparison of regression results on primitives 2 and 3 using separated feature learning (PCA or autoencoder and phase kernel modulation) versus embedded feature learning (PMNN); (bottom) the top 10 dominant regular hidden layer features for each phase RBF in primitive 2, roll-orientation coupling term, displayed in yellow. The less dominant ones are displayed in blue.
IV.17 Snapshots of our experiment on the robot while scraping on the tilt stage with a +10° roll angle. The first row is unrolling without the coupling term, i.e. unrolling the nominal behavior. The second row is unrolling with the learned coupling term model. The caption shows the reading of the digital angle gauge mounted on top of the middle finger of the hand. The first column is at the initial position. The second column is at the end of the first primitive (going down in the z-direction). The third column is at the end of the second primitive (orientation correction). The fourth column is at the end of the third primitive (scraping the board forward in the y-direction). We see that on the second row, an orientation correction is applied due to the coupling term being active.
IV.18 The roll-orientation coupling term (top) versus the sensor trace deviation of the right BioTac finger's electrode #6 (bottom) of primitive 2, during the scraping task on environmental settings with the roll angle of the tilt stage varying from 2.5° (left-most) to 10° (right-most), in +2.5° increments. The comparison is between human demonstrations (blue), unrolling on the robot while applying the coupling term computed online by the trained feedback model (red), unrolling the nominal behavior on the robot (green), the human demonstrations' mean trajectory (dashed black), and the range of the human demonstrations within 1 standard deviation of the mean trajectory (solid black). In the top plots, we see that the trained feedback model can differentiate between different tilt stage roll-orientations and apply approximately the correct amount of correction/coupling term.
IV.19 The learning curves of the RL refinement of the feedback model on the initially unseen setting 10, for primitive 2 (left) and primitive 3 (right). The learning curves show the mean and standard deviation of the cost over 8 runs on the real robot of the adaptive behavior after the feedback policy update. The cost at iteration #0 shows the cost before RL is performed.
IV.20 The performance comparison in terms of accumulated cost on primitive 2 (left) and primitive 3 (right) between the nominal behavior without the feedback model (red), the adaptive behavior (including the feedback model) before reinforcement learning (RL) of the feedback model (green), and the adaptive behavior after RL of the feedback model (blue), on all non-default settings. The mean and standard deviation are computed over 8 runs on the real robot.
IV.21 Snapshots of our experiment on the real robot, comparing the execution of the closed-loop behavior (the nominal behavior and the learned feedback model) before RL (soft shadow) versus after RL. After RL, the feedback model applied more correction than before RL, qualitatively showing the improvement achieved by the RL algorithm.
List of Tables

III.1 Comparison between models trained to optimize L_ID on the sine motion dataset from simulation, in terms of training speed, joint position (q) and velocity (q̇) tracking, and generalization performance: end-effector position (x) and velocity (ẋ) tracking on unseen end-effector reaching tasks.
III.2 Comparison between models trained to optimize L_ID on the sine motion dataset on the real robot, in terms of the number of training epochs required to reach convergence, sine motion joint position (q) tracking, sine motion joint velocity (q̇) tracking, and generalization performance: end-effector position (x) and velocity (ẋ) tracking NMSE on an end-effector tracking task (a task/situation unseen during training).
IV.1 Results of the per setting experiments. A negative distance to the obstacle implies a collision.
IV.2 Results of the multi setting experiments. A negative distance to the obstacle implies a collision.
IV.3 Results of the unseen setting experiments. A negative distance to the obstacle implies a collision.
IV.4 Force-torque control schedule for steps (ii)-(v).
IV.5 NMSE of the roll-orientation coupling term learning with the leave-one-demonstration-out test, for each primitive.

List of Algorithms

IV.1 Pseudo-Code of Nominal Movement Primitives Extraction from Demonstrations
IV.2 Dynamic Time Warping for Point Correspondence Matching
IV.3 Reinforcement Learning of Feedback Model for Reactive Motion Planning
IV.4 Path Integral Policy Improvement with Covariance Matrix Adaptation (PI²-CMA) Update Function

Abstract

Traditionally, models for control and motion planning were derived from the physical properties of the system. While such a classical approach provides mathematical performance guarantees, modeling complex systems is not always feasible. On the other hand, recent advances in machine learning allow for the acquisition of models automatically from data. However, naive empirical methods do not provide performance guarantees, may be slow to train, and often generalize poorly to new situations.

In this dissertation, we present a combination of both approaches, infusing prior knowledge by incorporating structure into learning methods. We show the benefits of this combined approach in three robotics settings. First, we show that incorporating prior knowledge, in combination with a manifold learning technique, applied to learning the latent space dynamics of a tactile servoing testbed, can simplify the control problem representation such that the solution can be derived analytically from the learned model. Second, we show that infusing physical properties into learning inverse dynamics with a differentiable Newton-Euler algorithm speeds up the learning process and improves the generalization capability of the model. Finally, we show that structure can be incorporated into the learning framework for feedback models of reactive behaviors, facilitating guarantees on desirable system properties such as goal convergence.

Chapter I
Introduction

I.1 System Model Acquisition for Planning and Control in Robotics: Analytical Approach versus Empirical/Data-Driven Approach

In robotics, system models play a critical role in determining the success of motion planning and control tasks.
For instance, erroneous models may drive physical systems unstable during control (causing costly hardware damage), while inaccurate models may lead to sub-optimal motion plans. Therefore, proper acquisition of the system model is essential to achieving the desired outcomes in robot motion planning and control.

Modern system modeling has its underpinnings in centuries of progress in physics and mathematics. The knowledge in these fields has led to the engineered, analytical acquisition of system models. Such engineering typically exploits desirable system properties in the design of precise engineered artifacts for the good of humankind. In control engineering, for example, by modeling the system properly, control engineers may design a controller that will drive the system from any initial state toward a desirable state with provable response and stability properties.

On the other hand, recent progress in machine learning and optimization has allowed for acquiring system models automatically from data, i.e. empirically. Moreover, recent technological advances in these fields, such as the emergence of automatic differentiation packages, have enabled the specification of complex learning representations, not limited to neural networks, that were very challenging to implement several years ago. For example, the PyTorch and TensorFlow frameworks [2, 3] have enabled the specification of differentiable positive definite matrices, which are important for representing physical entities such as inertia matrices [4, 5, 6].

With the option of employing an analytical modeling approach [7, 8, 9, 10] versus a data-driven/empirical modeling approach [11, 12, 13] at hand, one must consider the advantages and disadvantages of each. In terms of advantages, analytical models may have mathematical guarantees on performance, such as convergence and stability, by design.
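As a concrete illustration of such a structured representation, a rotational inertia matrix can be kept symmetric positive definite by construction via a Cholesky-style parameterization. The sketch below uses NumPy with arbitrary example parameters; it is not the dissertation's implementation, but the same operations are differentiable in PyTorch or TensorFlow, so gradients can flow back to the unconstrained parameters.

```python
import numpy as np

def spd_from_unconstrained(theta):
    """Map 6 unconstrained parameters to a 3x3 symmetric positive
    definite matrix: fill a lower triangle, exponentiate the diagonal
    so it is strictly positive, and return L @ L.T."""
    L = np.zeros((3, 3))
    L[np.tril_indices(3)] = theta               # row-wise lower triangle
    L[np.diag_indices(3)] = np.exp(np.diag(L))  # enforce positive diagonal
    return L @ L.T

# Hypothetical unconstrained parameters (e.g., outputs of an optimizer).
inertia = spd_from_unconstrained(np.array([0.1, 0.2, -0.3, 0.0, 0.5, -0.1]))
```

Because L is lower triangular with a strictly positive diagonal, it is invertible, so L Lᵀ is guaranteed symmetric positive definite for any parameter values the optimizer proposes.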
Moreover, due to the well-understood properties of analytical models, such performance can be guaranteed across the design specification region of the model, allowing the model to generalize via interpolation and extrapolation within this region. However, analytical models may have associated modeling errors, and in many cases it is challenging to derive them at all, e.g. due to the non-linearity of the problem when a robot makes contact with the environment. On the other hand, empirical models have more flexibility in terms of their expressiveness to capture complex representations from data. However, learned models often do not have mathematical performance guarantees, may require a massive amount of data, may be slow to train, and may generalize poorly to new situations.

In this dissertation, we work on a middle ground between the two approaches and study the following question: how do we infuse data-driven methods with structure based on our knowledge of the problem? We call the resulting model a structured empirical model. The structured empirical modeling approach provides several benefits. First, it allows for the acquisition of complex models empirically via an expressive learning representation. Second, some physical properties can be infused into the model via a structured learning representation that provides some mathematical performance guarantees. Third, in some cases, this approach accelerates the training process of the model and improves sample efficiency, due to the reduction of the model parameter search space. Finally, in some settings the learned model shows good generalization to novel situations. This dissertation demonstrates these benefits through three projects that adopt the structured empirical modeling approach, as outlined below.

I.2 Thesis Outline

This dissertation is organized as follows:

Chapter II presents a method for learning a latent space dynamics model for tactile servoing.
Based on the insight that tactile sensing can be structurally characterized as a manifold, we employ a manifold learning technique to impose a Euclidean structure on the latent state representation, such that the control on this latent state space, for tactile servoing, can be derived analytically.

Chapter III presents an encoding of physical constraints and physical consistency in a differentiable computation graph of a Newton-Euler algorithm, a well-known algorithm for performing the inverse dynamics computation in a robot control framework. We demonstrate that infusing structure into the chosen learning representation speeds up the training process of the model and leads to models that generalize well to novel situations.

Chapter IV presents our work on supervised learning and reinforcement learning of feedback model representations for reactive motion planning in robotics. In this problem, we employ special neural network structures as the function approximators for the feedback model representations. In comparison with a regular feed-forward neural network structure, the special structure was designed to provide guarantees on the convergence of the overall motion planner system to the goal, as well as to give the model the capability to capture the dependency of the motion plan adaptation on the movement phase.

Chapter V summarizes the contributions of this dissertation and outlines some ideas for future work.

Chapter II
Learning Latent Space Dynamics for Tactile Servoing

II.1 Introduction

The ability to adapt actions based on tactile sensing is key to robustly interacting with and manipulating objects in the environment. Previous experiments have shown that when tactile-driven control is impaired, humans have difficulties performing even basic manipulation tasks [14, 15]. Hence, we believe that equipping robots with tactile feedback capability is important to make progress in robotic manipulation.
In line with this direction, a variety of tactile sensors [16, 17, 18, 19] have recently been developed and used in the robotics research community, and researchers have designed several tactile-driven control algorithms, popularly termed tactile servoing. However, many tactile servoing algorithms were designed for specific kinds of tactile sensor geometry, such as a planar surface [20] or a spherical surface [21]; therefore they do not apply to the broad class of tactile sensors in general. For example, if we would like to equip a robot with a tactile skin of arbitrary geometry, or if there is a change in the sensor geometry due to wear or damage, we will need a more general tactile servoing algorithm. In this chapter, we present our work on a learning-based tactile servoing algorithm that does not assume a specific sensor geometry. Our method comprises three steps. At the core of our approach, we treat the tactile skin as a manifold. Hence, first we perform an offline neural-network-based manifold learning to learn a latent space representation which encodes the essence of the tactile sensing information. Second, we learn a latent space dynamics model from human demonstrations.

Figure II.1: Learning tactile servoing platform. (a) Simulated robot; (b) real robot.

Finally, we deploy our model to perform an online control, based on both the current and target tactile signals, for tactile servoing on a robot. Our contribution is twofold. First, we utilize manifold learning to impose a Euclidean structure on the latent space representation of tactile sensing, such that the control in this latent space becomes straightforward. Second, we train a single model that is able to do both forward dynamics and inverse dynamics prediction using the same demonstration dataset, which is more data-efficient than training separate models for the forward and inverse dynamics. This chapter is organized as follows. Section II.2 provides some related work.
Section II.3 presents the model that we use for learning tactile servoing from demonstration. We then present our experimental setup and evaluations in Section II.4. Finally, we summarize our work in Section II.5.

II.2 Related Work

Our work is mostly inspired by previous works on learning control and dynamics in the latent space [22, 23]. Both of these works learn a latent space representation of the state, and also learn a dynamics model in the latent space. Watter et al. [23] designed the latent space's state transition model to be locally linear, such that a stochastic optimal control algorithm can be directly applied to the learned model for control afterwards. Byravan et al. [22] designed the latent space to represent SE(3) poses of the tracked objects in the scene, and the transition model is simply the SE(3) transformations of these poses. Control in the work by Byravan et al. [22] is done by gradient-following of the Euclidean distance between the target and current latent space poses with respect to the action. In this work, we train a latent space dynamics model that takes the latent space representation of the current tactile sensing and the applied action, and predicts the latent space representation of the next tactile sensing; this is termed forward dynamics. Since we use the model for control, i.e. tactile servoing, it is also essential that we can compute actions given both the current and target tactile sensing, termed inverse dynamics. Previous work by Agrawal et al. [24] learns separate models, one for the forward dynamics and one for the inverse dynamics, for a poking task. In contrast, in our work we train a single model for both forward and inverse dynamics. In terms of latent space representation, our work is inspired by the work of Hadsell et al.
[25], who use a Siamese neural network and construct a loss function such that similar data points are close to each other in the latent space and dissimilar data points are further away from each other. In this work, we also use a Siamese neural network; however, we employ a loss function that performs Multi-Dimensional Scaling (MDS) [26, 27], such that the first two dimensions of the latent space represent the 2D map of the contact point on the tactile skin surface. Our third dimension in the latent space represents the degree of contact applied on the skin surface, i.e. how much pressure was applied at the point of contact. Regarding tactile servoing, besides the previous works [20, 21] mentioned in Section II.1, Su et al. [28] designed a heuristic for tactile servoing with a tactile finger [16]. Our work treats the tactile sensor as a general manifold; hence the method should apply to any tactile sensor. Previously, learning tactile feedback has been done through reinforcement learning [29] or a combination of imitation learning and reinforcement learning [30, 31]. Sutanto et al. [32] learn a tactile feedback model for a trajectory-centric reactive policy. In this work, we learn a tactile servoing policy indirectly by learning a latent space dynamics model from demonstrations. As we engineer the latent space to be Euclidean, by performing MDS and maintaining contact degree information, the inverse dynamics control action can be computed analytically, given both the current and target latent states. Hence, our method does not require reinforcement learning to learn the desired behavior.

II.3 Data-Driven Tactile Servoing Model

II.3.1 Tactile Servoing Problem Formulation

Figure II.2: Neural network diagrams and their loss functions (drawn as dotted lines): (a) $\mathcal{L}_{MDS}$; (b) $\mathcal{L}_{AER}$, $\mathcal{L}_{CDP}$, and $\mathcal{L}_{LFD}$. The inverse dynamics loss function $\mathcal{L}_{ID}$ is not illustrated here.
Given the current tactile sensing $s_t$ and the target tactile sensing $s_T$, the objective is to find the action $a_t$ which will bring the next tactile sensing $s_{t+1} = f(s_t, a_t)$ closer to $s_T$, which in the optimal case can be written as:

$a_t = \arg\min_{a_t} d(f(s_t, a_t), s_T)$   (II.1)

II.3.2 Latent Space Representation

If the distance metric $d$ is a squared $L_2$ distance between two states which lie in a Euclidean space, and if $f$ is smooth, then the inverse dynamics $a_t$ can be computed as proportional to $-\frac{\partial d}{\partial a_t}$. Moreover, for some $f$, the $a_t$ in Eq. II.1 can be computed analytically from the condition $\frac{\partial d}{\partial a_t} = 0$. Unfortunately, both $s_{t+1}$ and $s_T$ may not lie in a Euclidean space. On the other hand, there seems to be some natural characterization of tactile sensing, such as the contact point and the degree of contact pressure applied at that point. The contact point in particular is a 3D coordinate which lies on the skin surface. Obviously the skin surface is not Euclidean: we cannot go from the current contact point to the target contact point by simply following the vector between them, because the intermediate points may be off the skin surface. However, if we are able to flatten the skin surface in 3D space into a 2D surface, then traversing between the two contact points translates into following the vector from one 2D point to the other on the 2D surface, which ensures that the intermediate points being traversed all remain on the 2D surface. Fortunately, there is a method for mapping/embedding a 3D surface into a 2D surface, called Multi-Dimensional Scaling (MDS) [27]. In this chapter, we choose the latent space embedding to be three-dimensional: (i) The first two dimensions of the latent space, called the $x$ and $y$ dimensions of the latent space, correspond to the 2D embedding of the 3D contact point on the tactile skin surface. (ii)
The third dimension (the $z$ dimension) of the latent space represents the degree of contact pressure applied at the contact point. (Note that the correct way of traversing from one contact point to another is by following the geodesic between the two points on the skin surface. Here we also assume that there exists a mapping from a tactile sensing $s$ into the 3D contact point on the tactile skin surface, as well as a mapping from $s$ into the degree-of-contact-pressure information.) We understand that the above representation can only represent a contact as a single 3D coordinate in the latent space. Therefore, it will not be able to capture richer sets of features, such as an object's edges and orientations. Tactile servoing for edge tracking is left for future work.

II.3.3 Embedding Function

We call the latent state representation of a tactile sensing $s$ as $z$, and we define the distance metric $d$ as a squared $L_2$ distance in the latent space between the embeddings of $s_{t+1}$ and $s_T$ given by the embedding function $z = f_{enc}(s)$, as follows (subscripts correspond to time indices):

$d(s_{t+1}, s_T) = \| z_{t+1} - z_T \|^2$   (II.2)

We represent the embedding function $f_{enc}$ by the encoder part of an auto-encoder neural network. For achieving the latent space representation described in Section II.3.2, we impose the following structure: (i) We would like to map points on a surface in 3D space into 2D coordinates. Essentially this surface can be described as a 2D manifold embedded in 3D space. For such a manifold, the notion of distance between any pair of 3D points on the manifold is given by the geodesic, i.e. the curve of the shortest path on the surface. For this mapping, we would like to preserve these pairwise geodesic distances in the resulting 2D map.
That is, for pairs of data points $\{\{s_a^{(1)}, s_b^{(1)}\}, \{s_a^{(2)}, s_b^{(2)}\}, \dots, \{s_a^{(K)}, s_b^{(K)}\}\}$, we want to acquire a latent space embedding via the embedding function $z = f_{enc}(s)$ to get the latent space pairs $\{\{z_a^{(1)}, z_b^{(1)}\}, \{z_a^{(2)}, z_b^{(2)}\}, \dots, \{z_a^{(K)}, z_b^{(K)}\}\}$ whose distance in the $x$ and $y$ dimensions is as close as possible to the pairwise geodesic distance. Therefore we define the loss function [27]:

$\mathcal{L}_{MDS} = \sum_{k=1}^{K} \left( \left\| z_{a,(xy)}^{(k)} - z_{b,(xy)}^{(k)} \right\| - g_{a,b}^{(k)} \right)^2$   (II.3)

$K$ is the number of data point pairs, which is quadratic in the total number of data points $N$. $g_{a,b}^{(k)}$ is the geodesic distance between the two data points in the $k$-th pair. The pairwise geodesic distance between any two data points is approximated using the shortest-path algorithm on a sparse distance matrix of the $M$ nearest neighbors of each data point. We use $M$ nearest neighbors because the space is not 2D-Euclidean globally due to skin curvature, but it is locally 2D-Euclidean, i.e. flat, in a small neighborhood (a small patch) on the skin. The computation result is stored as a symmetric dense approximate geodesic distance matrix of size $N \times N$ before the training begins. The pairwise loss function in Eq. II.3 is applied by using a Siamese neural network, as depicted in Figure II.2(a). (ii) We encode the $z$ dimension of the latent space with the contact pressure information $p$, by imposing the following loss function:

$\mathcal{L}_{CDP} = \sum_{n=1}^{N} \left( p^{(n)} - z_{(z)}^{(n)} \right)^2$   (II.4)

While we have the ground truth for the $z$ dimension of the latent state, i.e. $p$, we do not have the ground truth for the $x$ and $y$ dimensions. We have the 3D coordinate of each data point on the skin (used to compute the sparse distance matrix of the $M$ nearest neighbors of each data point), but we do not know how it is mapped to the $x$ and $y$ dimensions of the latent space, and this is our reason for using an auto-encoder neural network representation.
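The approximate geodesic computation described above (a sparse graph over each point's M nearest neighbors, followed by shortest paths) can be sketched in a dependency-free way with NumPy; here Floyd-Warshall stands in for whichever shortest-path routine was actually used, and the function name is hypothetical:

```python
import numpy as np

def approx_geodesics(points, M=4):
    """Approximate pairwise geodesic distances between 3D points on a surface.

    Builds a graph keeping only each point's M nearest Euclidean neighbors
    (the surface is locally flat on small patches), then runs Floyd-Warshall
    shortest paths. Returns a dense symmetric N x N distance matrix.
    """
    points = np.asarray(points, dtype=float)
    N = len(points)
    # dense pairwise Euclidean distances
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # sparse M-nearest-neighbor graph (inf = no direct edge)
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(N):
        nn = np.argsort(D[i])[1:M + 1]
        G[i, nn] = D[i, nn]
        G[nn, i] = D[i, nn]  # keep the graph symmetric
    # Floyd-Warshall: shortest graph paths approximate the geodesics
    for k in range(N):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    return G
```

For points sampled along a curved surface, the resulting distance between two far-apart points follows the surface rather than the straight-line chord, which is exactly what Eq. II.3 needs as its target $g_{a,b}^{(k)}$.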
The auto-encoder reconstruction loss is:

$\mathcal{L}_{AER} = \sum_{n=1}^{N} \left\| f_{dec}(f_{enc}(s^{(n)})) - s^{(n)} \right\|^2$   (II.5)

where $f_{enc}$ is the encoder/embedding function and $f_{dec}$ is the decoder/inverse-embedding function. (For BioTacs, these 3D contact coordinates can be computed from electrode values by using the point-of-contact estimation model presented by Lin et al. [33].)

II.3.4 Latent Space Forward Dynamics (LFD)

We assume the latent space forward dynamics:

$\dot{z}_t = f_{fd}(z_t, a_t; \theta_d)$   (II.6)

where $\theta_d$ is the set of trainable (neural network) parameters of the dynamics model. Numerical integration gives us the discretized version:

$z_{t+1} = f_{dfd}(z_t, a_t, \Delta t; \theta_d) = z_t + \dot{z}_t \Delta t = z_t + f_{fd}(z_t, a_t; \theta_d) \Delta t$   (II.7)

There are two possibilities for $f_{fd}$, as follows:

II.3.4.1 Locally Linear (LL) LFD Model

$\dot{z}_t = f_{fd}(z_t, a_t; \theta_d) = A_t z_t + B_t a_t + c_t$   (II.8)

where $A_t$, $B_t$, and $c_t$ are predicted by a fully-connected neural network (fcnn) from the input $z_t$, as follows:

$\left[ \mathrm{vec}(A_t)^T, \mathrm{vec}(B_t)^T, c_t^T \right]^T = h_{fcnnLL}(z_t; \theta_d)$   (II.9)

where $\mathrm{vec}(A_t)$ and $\mathrm{vec}(B_t)$ are the vectorized representations of $A_t$ and $B_t$, respectively [23].

II.3.4.2 Non-Linear (NL) LFD Model

$\dot{z}_t = f_{fd}(z_t, a_t; \theta_d) = h_{fcnnNL}\left( \left[ z_t^T, a_t^T \right]^T; \theta_d \right)$   (II.10)

We would like to be able to predict the forward dynamics in the latent space, so we use the following loss function:

$\mathcal{L}_{LFD} = \sum_{t=1}^{H} \left\| f_{dfd}(z_t, a_t, \Delta t; \theta_d) - f_{enc}(s_{t+1}) \right\|^2$   (II.11)

with $z_{t+1}$ computed from Eq. II.6 and II.7. For additional robustness, we can also do chained predictions for $C$ time steps ahead and sum up the loss function in Eq. II.11 over these chains, similar to the work by Nagabandi et al. [34].

II.3.5 Inverse Dynamics (ID)

Besides forward dynamics, we also found the ability of the model to predict inverse dynamics to be essential for the purpose of action selection or control.
There are three possibilities for the inverse dynamics model:

II.3.5.1 Locally Linear (LL) ID

Based on the locally linear LFD model [23] in Section II.3.4.1 (Eq. II.6, II.7, II.8), we can set up a constrained optimal control problem:

$\min_{a_t, z_{t+1}} \frac{1}{2} \| z_T - z_{t+1} \|^2 + \frac{\lambda}{2} \| a_t \|^2 \quad \text{s.t.} \quad z_{t+1} = z_t + (A_t z_t + B_t a_t + c_t) \Delta t$   (II.12)

whose solution is:

$a_{t,ID} = B_t^T \left( B_t B_t^T + \frac{\lambda}{\Delta t^2} I \right)^{-1} \left( \frac{z_T - z_t}{\Delta t} - A_t z_t - c_t \right)$   (II.13)

II.3.5.2 Negative-Gradient (NG) ID

Based on the non-linear LFD model in Section II.3.4.2, we can compute a gradient-based controller which minimizes the distance function $d = \| f_{dfd}(z_t, a_t, \Delta t; \theta_d) - z_T \|^2$, that is [22]:

$a_{t,ID} = -\gamma \left. \frac{\partial d}{\partial a_t} \right|_{a_t = 0}$   (II.14)

where $\gamma$ is a positive constant that scales the gradient w.r.t. the maximum allowed magnitude of $a_t$.

II.3.5.3 Neural Network Jacobian (NJ) ID

Based on the non-linear LFD model in Section II.3.4.2, we can derive the following from Eq. II.10 (dropping the time index $t$ for a moment):

$\ddot{z} = \begin{bmatrix} J_z & J_a \end{bmatrix} \begin{bmatrix} \dot{z} \\ \dot{a} \end{bmatrix} = J_z \dot{z} + J_a \dot{a}$   (II.15)

where $J_z$ and $J_a$ are the Jacobians of $h_{fcnnNL}$ w.r.t. $z$ and $a$, respectively, which can be discretized into:

$\frac{(z_{t+1} - z_t) - (z_t - z_{t-1})}{\Delta t^2} = J_{z_{t-1}} \frac{z_t - z_{t-1}}{\Delta t} + J_{a_{t-1}} \frac{a_t - a_{t-1}}{\Delta t}$   (II.16)

or

$z_{t+1} = z_t + \left( \left[ \frac{1}{\Delta t} I + J_{z_{t-1}} \right] (z_t - z_{t-1}) + J_{a_{t-1}} (a_t - a_{t-1}) \right) \Delta t$   (II.17)

with $J_{z_{t-1}}$ and $J_{a_{t-1}}$ the Jacobians of $h_{fcnnNL}$ w.r.t. the previous latent state $z_{t-1}$ and the previous action $a_{t-1}$, respectively. If we define $\bar{A}_{t-1} = \frac{1}{\Delta t} I + J_{z_{t-1}}$, $\bar{B}_{t-1} = J_{a_{t-1}}$, and $\bar{c}_{t-1} = -\bar{A}_{t-1} z_{t-1} - \bar{B}_{t-1} a_{t-1}$, Eq. II.17 can be written as:

$z_{t+1} = z_t + \left( \bar{A}_{t-1} z_t + \bar{B}_{t-1} a_t + \bar{c}_{t-1} \right) \Delta t$   (II.18)

We can set up a constrained optimal control problem:

$\min_{a_t, z_{t+1}} \frac{1}{2} \| z_T - z_{t+1} \|^2 + \frac{\lambda}{2} \| a_t \|^2 \quad \text{s.t.} \quad z_{t+1} = z_t + \left( \bar{A}_{t-1} z_t + \bar{B}_{t-1} a_t + \bar{c}_{t-1} \right) \Delta t$   (II.19)

whose solution is:

$a_{t,ID} = \bar{B}_{t-1}^T \left( \bar{B}_{t-1} \bar{B}_{t-1}^T + \frac{\lambda}{\Delta t^2} I \right)^{-1} \left( \frac{z_T - z_t}{\Delta t} - \bar{A}_{t-1} z_t - \bar{c}_{t-1} \right)$   (II.20)

The optimal control formulations in Eq. II.12 and II.19 are similar to those of a Linear Quadratic Regulator (LQR) with a (finite) horizon equal to 1. Derivations of Eq.
II.13 and II.20 from Eq. II.12 and II.19, respectively, can be seen in Appendix A. (These Jacobians exist in our experiment because we choose smooth activation functions for $h_{fcnnNL}$, such as the hyperbolic tangent (tanh).) From Eq. II.13, II.14, and II.20, in general we can write:

$a_{t,ID} = f_{id}(z_T, z_t, z_{t-1}, a_{t-1}, \Delta t; \theta_d)$   (II.21)

Please note that $\theta_d$ is shared between $f_{dfd}$ (Eq. II.7) and $f_{id}$ (Eq. II.21). For our purpose, it is most important that the inferred inverse dynamics action points in the right direction. Therefore, we can leverage the demonstration dataset to also optimize the following inverse dynamics loss:

$\mathcal{L}_{ID} = \sum_{t=1}^{H} \left\| \frac{f_{id}(z_T = z_{t+1}, z_t, z_{t-1}, a_{t-1}, \Delta t; \theta_d)}{\| f_{id}(z_T = z_{t+1}, z_t, z_{t-1}, a_{t-1}, \Delta t; \theta_d) \|} - \frac{a_t}{\| a_t \|} \right\|^2$   (II.22)

We combine the loss functions as follows:

$\mathcal{L}_{totalAE} = w_{AER} \mathcal{L}_{AER} + w_{MDS} \mathcal{L}_{MDS} + w_{CDP} \mathcal{L}_{CDP}$   (II.23)

$\mathcal{L}_{totalDyn} = w_{LFD} \mathcal{L}_{LFD} + w_{ID} \mathcal{L}_{ID}$   (II.24)

where the weights $w_{MDS}, w_{CDP}, w_{AER}, w_{LFD}, w_{ID}$ are tuned so that the loss function components become comparable to each other. Some individual loss functions are depicted in Figure II.2. $\mathcal{L}_{totalAE}$ and $\mathcal{L}_{totalDyn}$ are minimized separately (with separate optimizers) in parallel with respect to the human demonstrations' trajectory dataset $\{(s_{t-1}, a_{t-1}, s_t, a_t, s_{t+1})\}_{t \in \{1, \dots, H\}}$. Minimizing $\mathcal{L}_{totalDyn}$ effectively means training the dynamics model parameters $\theta_d$ to minimize both the forward dynamics loss $\mathcal{L}_{LFD}$ and the inverse dynamics loss $\mathcal{L}_{ID}$.

II.4 Experiments

II.4.1 Experimental Setup

We use the right arm of a bi-manual anthropomorphic robot system, which is a 7-degrees-of-freedom Barrett WAM arm, plus its hand which has three fingers. We mount a biomimetic tactile sensor, the BioTac [16], on the tip of the middle finger of the right hand. The finger joint configuration is programmed to be fixed during demonstration and testing. The setup is pictured in Figure II.1.
We set up the end-effector frame to coincide with the BioTac finger frame as described in the previous work by Lin et al. [33], Figure 4. The BioTac has 19 electrodes distributed on the skin surface, capable of measuring deformation of the skin by measuring the change of impedance when the conductive fluid underneath the skin is compressed or deformed due to contact with an object. In our experiments, $s$ is a vector of 19 values corresponding to the digital readings of the 19 electrodes, subtracted by their offset values estimated when the finger is in the air and does not make contact with any object. The contact pressure information $p$ is a scalar quantity obtained by negating the mean of the vector $s$, i.e. $p = -\bar{s} = -\frac{1}{19} \sum_{i=1}^{19} s_i$, with $s_i$ being the digital reading of the $i$-th electrode minus its offset.

II.4.1.1 Human Demonstration Collection

For collecting human demonstrations, we set the robot to be in a gravity compensation mode, allowing a human demonstrator to guide the robot through a sequence of contact interactions between the BioTac finger and a drawer handle. The robot's sampling and control frequency is 300 Hz, while the tactile information $s$ is sampled at 100 Hz. Later, $p$ can be computed from $s$. The demonstrations are split into two parts: one part corresponds to the contact interaction dynamics due to the rotational change of the finger pose, and the other part due to the translational change of the finger pose. Each part comprises 7 sub-parts which correspond to contact point trajectories that traverse different areas of the skin. For each sub-part of the rotational motion, we provide 3 demonstrations, while for the translational motion, we provide 4 demonstrations. These numbers are chosen such that we have a 50%-50% composition of data points for rotational and translational motion, respectively. Each rotational demonstration involves the sequence of making contact, rotational motion clockwise w.r.t.
the x-axis of the finger frame, breaking contact, making contact again, rotational motion counter-clockwise, and finally breaking contact. Each translational demonstration involves the sequence of making contact, swiping forward along the x-axis of the finger frame, breaking contact, making contact again, swiping backward, breaking contact again, and then repeating the whole sequence one more time. The breaking and making of contacts is intentionally done to make data segmentation easier, by using a zero-crossing algorithm [35]. In total we collect $N = 55431$ data points of the tactile sensing vector $s$, and $H = 183825$ tuples of $(s_{t-1}, a_{t-1}, s_t, a_t, s_{t+1})$. Instead of constructing a single massive geodesic distance matrix of size $N \times N$, we split the data randomly into $P$ bins, each of size $N' = 2310$, so we end up with $P$ geodesic distance matrices, each of size $N' \times N'$. During training, for each Siamese pair being picked, both data points must be associated with the same geodesic distance matrix. For the geodesic distance computation, we use a nearest neighborhood of size $M = 18$. On the other hand, we obtain the number of state-action tuples $H$ after excluding the tuples that contain states whose contact pressure information $p$ is below a specific threshold. We exclude these tuples as we deem them to be off-contact tactile states that are not informative for performing tactile servoing. After collecting the demonstrations, we pre-process the data by performing low-pass filtering with a cut-off frequency of 1 Hz. We determined this cut-off frequency by visualizing the frequency-domain analysis of the data. This frequency range for tactile servoing is also supported by previous work by Johansson and Flanagan [36]. During training, we perform the forward dynamics prediction at 100/29, 100/30, 100/31, 100/32, and 100/33 Hz, while during testing, the model predicts at 100/31 Hz.
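The low-pass pre-processing above specifies only the 1 Hz cut-off; the filter type and order actually used are not stated, so the following is a minimal first-order IIR sketch (an assumption, not the dissertation's implementation):

```python
import numpy as np

def lowpass(x, cutoff_hz=1.0, fs=100.0):
    """First-order IIR low-pass filter applied along the first axis.

    A minimal stand-in for the 1 Hz low-pass pre-processing step; the
    actual filter order/type is an assumption here.
    """
    x = np.asarray(x, dtype=float)
    dt = 1.0 / fs
    # standard RC discretization: alpha = dt / (dt + RC), RC = 1/(2*pi*fc)
    alpha = dt / (dt + 1.0 / (2.0 * np.pi * cutoff_hz))
    y = np.empty_like(x)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = y[t - 1] + alpha * (x[t] - y[t - 1])
    return y
```

At fs = 100 Hz this strongly attenuates content far above 1 Hz while passing the slow contact dynamics (and DC) through essentially unchanged.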
In general it is hard to predict at higher frequencies, because the demonstrations are performed slowly. (In the extreme case, when the robot is not in contact with any object, there is no point in performing tactile servoing.)

II.4.1.2 Action Representation

We choose the end-effector velocity expressed with respect to the end-effector frame as the action/policy representation. By representing the end-effector velocity with respect to the end-effector frame, we effectively cancel out the dependency of the state representation on the end-effector pose information, making the learned policy easier to generalize to new situations. Moreover, this choice also naturally takes care of repeatable position tracking errors of the end-effector. We use the Simulation Lab (SL) robot control framework [37] in our experiments. (In a previous version of our experiment we used Riemannian Motion Policies (RMP) [38] for the robot control framework.) The framework provides us with the end-effector velocity with respect to the robot base frame, $\dot{x}_b$. To get the end-effector velocity with respect to the end-effector frame, $\dot{x}_e$, we compute the following [39]:

$\dot{x}_e = \begin{bmatrix} R_e^T & 0 \\ 0 & R_e^T \end{bmatrix} \dot{x}_b$   (II.25)

where $R_e$ is the end-effector orientation with respect to the base frame, expressed as a rotation matrix. Hence, we define the action $a = \dot{x}_e$ with dimensionality 6, where the first three dimensions are the linear velocity and the last three are the angular velocity. During the demonstration, the robot is sampled at 300 Hz, but the prediction is made at approximately 3 Hz; for this, we summarize by averaging all $\dot{x}_b$'s applied between $s_t$ and $s_{t+1}$, and then convert this average to $\dot{x}_e$.

II.4.1.3 Machine Learning Framework and Training Process

Our auto-encoder takes a 19-dimensional input vector $s$ and compresses it down to a 3-dimensional latent state embedding, $z$.
The intermediate hidden layers are fully connected layers of sizes 19, 12, and 6, all with tanh activation functions, forming the encoder function $f_{enc}$. The decoder is a mirrored structure of the encoder, forming $f_{dec}$. $h_{fcnnNL}$ is a feedforward neural network with a 9-dimensional input (the 3-dimensional latent state $z$ and the 6-dimensional action policy $a$), 1 hidden layer of size 15 with tanh activation functions, and a 3-dimensional output. $h_{fcnnLL}$ is a feedforward neural network with the 3-dimensional latent state $z$ as input, 3 hidden layers of sizes 8, 15, and 23, all with tanh activation functions, and a 30-dimensional output which corresponds to the parameters of $A_t$, $B_t$, and $c_t$ in Eq. II.9. We use a batch size of 128, and we use separate RMSProp optimizers [40] to minimize $\mathcal{L}_{totalAE}$ and $\mathcal{L}_{totalDyn}$ for 200k iterations. We set the values $w_{MDS} = 2 \times 10^7$, $w_{CDP} = 2 \times 10^7$, $w_{AER} = 100$, $w_{LFD} = 1 \times 10^8$, $w_{ID} = 1 \times 10^7$, and $\lambda = 0.1$ empirically. We implement all components of our model in TensorFlow [3]. We also noticed a significant improvement in learning speed and fitting quality after adding Batch Normalization [41] layers to our model.

II.4.2 Auto-Encoder Reconstruction Performance

Our first evaluation is of the reconstruction performance of the auto-encoder in terms of normalized mean squared error (NMSE), which is the mean squared prediction error divided by the variance of the ground truth. All NMSE values are below 0.25 for the training (85% split), validation (7.5% split), and test (7.5% split) sets.

Figure II.3: x-y dimensions of the latent space embedding by MDS.

II.4.3 Neural Network Multi-Dimensional Scaling (MDS)

In terms of MDS performance, we plot the x-y coordinates of the latent space embeddings of all tactile sensing data points $s$ in the demonstration data in Figure II.3. Each data point is colored and labeled based on the BioTac electrode index with maximum activation. This result agrees with Figure 2 of the previous work by Lin et al. [33].
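For reference, the NMSE metric used throughout these evaluations (mean squared prediction error divided by the variance of the ground truth) is a one-liner; the function name is ours:

```python
import numpy as np

def nmse(pred, truth):
    """Normalized mean squared error: MSE divided by ground-truth variance.

    0 means a perfect fit; 1 means no better than predicting the mean.
    """
    pred, truth = np.asarray(pred, dtype=float), np.asarray(truth, dtype=float)
    return np.mean((pred - truth) ** 2) / np.var(truth)
```

A value of 1 corresponds to a model that does no better than the constant mean predictor, which is why thresholds such as 0.25 or 0.02 indicate a substantially better fit.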
Moreover, we randomly sampled 10000 Siamese pairs from the training, validation, and test datasets, and compared their x-y Euclidean distance in the latent space vs. the ground truth geodesic distance. All of these comparisons have an NMSE of less than 0.02.

II.4.4 Latent Forward Dynamics Prediction Performance

We trained the latent space forward dynamics function $f_{fd}$ by chain-predicting the next $C$ latent states with a training chain length of $C_{train} = 2$, and tested it with a chain length of $C_{test} = 3$. We then evaluate the NMSEs.

Figure II.4: Normalized mean squared error (NMSE) vs. the length of the chained forward dynamics prediction, averaged over latent space dimensions, on the test dataset.

In Figure II.4, we compare the performance of 4 different combinations: using both the $\mathcal{L}_{MDS}$ and $\mathcal{L}_{CDP}$ loss functions during training (indicated by LatStruct) or using neither of them, i.e. without any structure imposed on the latent space representation (indicated by noLatStruct); and using the inverse dynamics loss $\mathcal{L}_{ID}$ during training (indicated by IDloss) or not using it (noIDloss). We see that in all cases where no latent space structure is imposed, the performance is generally worse than with an imposed latent space structure. We believe this happens because it is hard to train a forward dynamics predictor on an unstructured latent space. On the other hand, in general we see that all models with the imposed inverse dynamics loss $\mathcal{L}_{ID}$ perform worse than those without $\mathcal{L}_{ID}$. We think this is most likely because training a model without the $\mathcal{L}_{ID}$ loss is easier than training with it. However, as we will see in Section II.4.5, the model trained without the $\mathcal{L}_{ID}$ loss does not provide correct action policies for tactile servoing, as it was not trained to do so.

II.4.5 Inverse Dynamics Prediction Performance

Figure II.5: Average cosine distance between rotational and translational inverse dynamics, weighted by the norm of the ground truth.
In Figure II.5, we compare the inverse dynamics prediction performance of the 3 possible inverse dynamics models described in Section II.3.5, in terms of the average cosine distance between the rotational and translational inverse dynamics and the ground truth, weighted by the norm of the ground truth. (Translational and rotational components here correspond to the first three and the last three dimensions of $a$, respectively.) We use the terms fwddynpred and AEpred for inverse dynamics prediction via Eq. II.21 with $z_T$ set equal to the constant value of $f_{dfd}(z_t, a_t, \Delta t; \theta_d)$ and of $f_{enc}(s_{t+1})$, respectively. Obviously $f_{dfd}(z_t, a_t, \Delta t; \theta_d)$ is an easier target for inverse dynamics than $f_{enc}(s_{t+1})$, as is apparent from the better prediction performance of the left bar group compared to the middle bar group. If $f_{fd}$ is trained well, we can expect the performance of fwddynpred and AEpred to become more similar. In the right bar group, we also compare fwddynpred vs. AEpred: poor performance here indicates either that $f_{fd}$ could not predict well, or that $f_{id}$ is unstable (i.e. a big change in $a_{t,ID}$ for a small change in $z_T$). With respect to this analysis, we deem the Neural Network Jacobian (NJ) to be the best inverse dynamics model. We also evaluate NJ noID, which corresponds to not minimizing the $\mathcal{L}_{ID}$ loss. By comparing NJ noID and NJ ID, we can see that minimizing the $\mathcal{L}_{ID}$ loss is essential for acquiring a good inverse dynamics model.

II.4.6 Real Robot Experiment

Figure II.6: Sequence (Sq.) snapshots of our experiments executing tactile servoing with the learned model (non-linear LFD model and Neural Network Jacobian ID model) on a real robot: (a)-(d) Rotation Sq. 1-4; (e)-(h) Translation Sq. 1-4. The red sticker indicates the target contact point.
The first row, figures (a)-(d), is for a target contact point whose achievement requires a rotational change of pose of the BioTac finger. The second row, figures (e)-(h), is for a target contact point whose achievement requires a translational change of pose of the BioTac finger.

In Fig. II.6, we provide snapshots of robot executions on real hardware with real-time tactile sensing from the BioTac finger. We see that the system is able to produce the required rotational motions (Fig. II.6 (a)-(d)) and translational motions (Fig. II.6 (e)-(h)) needed to achieve the specified target contact point. (The model gives $\dot{x}_e$ as output, while the robot only knows how to track $\dot{x}_b$; thus we need to invert Eq. II.25 to perform tactile servoing.) The full pipeline of the experiment can be seen in the video https://youtu.be/0QK0-Vx7WkI.

II.5 Summary

In this work, we presented a learning-from-demonstration framework for achieving tactile servoing behavior. We showed that our manifold representation learning of tactile sensing information is critical to the success of our approach: by imposing a Euclidean structure on the latent state representation of tactile sensing, we turn the control problem for tactile servoing into a more straightforward one. We also showed that for learning a tactile servoing model, it is important not only to be able to predict the next state from the current state and action (forward dynamics), but also to be able to predict the action given a target state (inverse dynamics). Moreover, we leverage the same demonstration dataset to train both the forward and inverse dynamics models, for data-efficiency.

Chapter III
Encoding Physical Constraints in Differentiable Newton-Euler Algorithm

III.1 Introduction

An accurate dynamics model is key to compliant force control of robots, and there is a rich history of learning such models for robotics [42, 43, 44].
With an accurate dynamics model, inverse dynamics can be used as a policy to predict the torques required to achieve a desired joint acceleration, given the state of the robot [43]. Due to their widespread utility, robot dynamics models have been learned in many ways. One way is to use a purely data-driven approach with parametric models [45], non-parametric models [46], or learned error models [47], in a supervised or self-supervised fashion. However, these purely data-driven approaches typically suffer from a lack of generalization to previously unexplored parts of the state space. Alternatively, Atkeson et al. [44] recast the dynamics equations such that the inertial parameters are a linear function of state-dependent quantities, given the joint torques. While inferior to unstructured approaches in terms of flexibility to fit data, this approach typically provides superior generalization capabilities. Recently, Lutter et al. [4] and Gupta et al. [5] learn the parameters of Lagrangian dynamics, incorporating the benefits of flexible function approximation into structured models. However, these approaches ignore some of the physical relationships and constraints on the learned parameters, which can lead to physically implausible dynamics. In this work, we combine the modern approach of parameter learning with the structured approach to inverse dynamics learning, similar to the work by Ledezma and Haddadin [48]. We
In that context, the contribution of this chapter is threefold: first, we present several re-parameterizations of the inertial parameters, which allow us to learn physically plausible parameters using a differentiable recursive Newton-Euler algorithm. Second, we show that these re-parameterizations help in improving both the training speed and the generalization capability of the model to unseen situations. Third, we evaluate a spectrum of structured dynamics learning approaches on a simulated and a real 7 degree-of-freedom robot manipulator. Our results show that adding such structure to the learning can improve the learning speed as well as the generalization abilities of the dynamics model. Our models can generalize with much less data, and need far fewer training epochs to converge. With our learned dynamics, we see reduced contributions of feedback terms in control, resulting in more compliant motions.

III.2 Background and Related Work

The dynamics of a robot manipulator relate the joint torques τ to the joint positions q, velocities q̇, and accelerations q̈:

τ = f_ID(q, q̇, q̈) = H(q) q̈ + C(q, q̇) q̇ + g(q)     (III.1)

where H(q), C(q, q̇), and g(q) are the system inertia matrix, the Coriolis matrix, and the gravity force, respectively. f_ID(q, q̇, q̈) is the inverse dynamics model that returns the torques that achieve a desired joint acceleration, given the current joint positions and velocities. The Recursive Newton-Euler Algorithm (RNEA) [50, 51] is a computationally efficient method of computing the inverse dynamics, which scales linearly with the number of degrees of freedom of the robot.

III.2.1 Learning models for model-based control

Accurate inverse dynamics models are crucial for compliant force-controlled robots, and hence widely studied in robotics. Previously, researchers have used unstructured multi-layer perceptrons (MLP) to learn the complete inverse dynamics [45, 52] or a residual component of the inverse dynamics [47].
Nguyen-Tuong et al. [46] compare non-parametric methods such as locally weighted projection regression (LWPR), support vector regression (SVR), and Gaussian process regression (GPR) for learning inverse dynamics models. Recently, Lutter et al. [4] and, concurrently, Gupta et al. [5, 53] proposed a semi-structured learning method for the Lagrangian dynamics of a manipulator, called Deep Lagrangian Networks (DeLaN). In DeLaN, some of the physical constraints of Lagrangian dynamics are obeyed. For example, the inertia matrix is parametrized to be symmetric positive definite. Moreover, the relationship between the Coriolis and centrifugal terms and the inertia matrix and joint velocities [43] is satisfied via automatic differentiation. Similarly, the gravity term is derived from a neural network which takes generalized coordinates as input, representing the potential energy. However, other constraints in the dynamics, such as the triangle inequality on the principal moments [54] of the inertia matrix, are not considered. Moreover, neural networks can be sensitive to the chosen architecture and need variations in the input data to generalize to new situations. Similar to DeLaN, Hamiltonian Neural Networks (HNN) [55] predict the Hamiltonian (instead of the Lagrangian) of a dynamical system.

Many previous works in parameter identification boil down to setting up a least-squares problem with some (hard) constraints [49, 56, 57], followed by solving a convex optimization problem. In contrast, our method incorporates the hard constraints as structure in the learned representations of the parameters, and then performs back-propagation on the computational graph for optimization. As a result, it is not limited to learning linear parameters, and can generalize to a larger range of problems.
Moreover, our approach can be applied to an online learning setup: for example, when the robot carries an additional mass (an object) on one of its links, our approach can adapt the dynamics parameters online, as the robot continues to operate and the data is collected in batches. Traditionally, online learning approaches, including adaptive control [58], do not guarantee physical plausibility of the learned dynamics parameters. On the other hand, more modern system identification methods which incorporate hard constraints on physical parameters [49, 56, 57] require collecting data in the new setting before optimizing the new dynamics parameters.

Our work is closely related to the work by Ledezma and Haddadin [48], in the sense that our work is also derived from the Newton-Euler formulation of inverse dynamics. However, in our work, we emphasize how incorporating structure into the learned dynamics parameters improves the training speed and generalization capability of the model. Moreover, we also compare our method with the state-of-the-art semi-structured DeLaN and an unstructured MLP.

III.3 Encoding Physical Constraints as Structures in Differentiable Newton-Euler Algorithm (DiffNEA)

The Newton-Euler equations can be implemented as a differentiable computational graph, e.g. with PyTorch [2], which we call differentiable NEA (DiffNEA). The parameters of DiffNEA, e.g. the inertial parameters, can then be optimized via gradient descent, utilizing automatic differentiation to compute the gradients. Although both kinematics and dynamics parameters are involved in the Newton-Euler algorithm, in this chapter we study the optimization of only the dynamics parameters, assuming the kinematics specification of the robot is correct and fixed.
Specifically, we aim to optimize the parameters θ of the Newton-Euler equations, such that the inverse dynamics loss is minimized:

L_ID = Σ_{t=1}^{T} ‖τ_t − f_NE(q_t, q̇_t, q̈_t; θ)‖²₂     (III.2)

Typically, θ is a collection of inertial parameters θ_i = [m, h^T, I_xx, I_xy, I_xz, I_yy, I_yz, I_zz]^T ∈ R^10 per link, where m is the link mass, h = [h_x, h_y, h_z]^T = m c with c being the center of mass (CoM), and the last 6 parameters represent the rotational inertia matrix I_C [44]. When optimizing, physical consistency of the estimated parameters is not guaranteed. Enforcing physical constraints on the parameters can be done through explicit constraints [49, 54], which requires constrained optimization algorithms to find a solution. In the following, we discuss and propose several possible parameter representations, which encode increasingly more physical consistency implicitly and allow us to perform unconstrained gradient descent.

III.3.1 Unstructured Mass and Rotational Inertia Matrix (DiffNEA No Str)

We start out with the simplest representation, with an unconstrained mass value m and 9 unconstrained parameters for the rotational inertia matrix:

θ_NoStr = [m, h, I_C1, I_C2, I_C3, I_C4, I_C5, I_C6, I_C7, I_C8, I_C9]     (III.3)

This parametrization does not encode any physical constraints and only serves as a baseline.

III.3.2 Symmetric Rotational Inertia Matrix (DiffNEA Symm)

In this parametrization, we explicitly construct the rotational inertia matrix as a symmetric matrix, with only 6 learnable parameters. Furthermore, we represent the link mass as m = (√m)² + b, where √m is the learnable parameter and b > 0 is a (non-learnable) small positive constant to ensure m > 0. Thus the learnable parameters of this representation are

θ_Symm = [√m, h, I_C1, I_C2, I_C3, I_C4, I_C5, I_C6]     (III.4)

This parameter representation enforces positive mass estimates and symmetric, but not necessarily positive definite, rotational inertia matrices.
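A minimal sketch of this mapping from unconstrained numbers to a strictly positive mass and a symmetric inertia matrix. The assignment of the six parameters to particular matrix entries is an illustrative assumption:

```python
# Sketch of the theta_Symm parameterization: an unconstrained vector is
# mapped to a strictly positive mass and a symmetric (not necessarily
# positive definite) rotational inertia matrix, following Eq. III.4.

B = 1e-4  # small, non-learnable positive constant ensuring m > 0

def symm_params(theta):
    """theta = [sqrt_m, hx, hy, hz, i1..i6]: 10 unconstrained numbers."""
    sqrt_m, hx, hy, hz, i1, i2, i3, i4, i5, i6 = theta
    m = sqrt_m ** 2 + B               # positive mass by construction
    h = [hx, hy, hz]                  # h = m * c (CoM scaled by mass)
    I_C = [[i1, i4, i5],              # symmetric by construction:
           [i4, i2, i6],              # only 6 free entries
           [i5, i6, i3]]
    return m, h, I_C

m, h, I_C = symm_params([-0.5, 0.1, 0.0, 0.2, 1.0, 2.0, 3.0, 0.1, 0.2, 0.3])
assert m > 0
assert all(I_C[r][c] == I_C[c][r] for r in range(3) for c in range(3))
```

Note that even a negative value of the learnable √m still yields m > 0, which is exactly what makes unconstrained gradient descent safe here.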
III.3.3 Symmetric Positive Definite Rotational Inertia Matrix (DiffNEA SPD)

Next, we introduce a change of variables to enforce positive definiteness of the rotational inertia matrix. We construct the lower triangular matrix:

L = [ L_I1   0     0
      L_I4   L_I2  0
      L_I5   L_I6  L_I3 ]     (III.5)

and construct the rotational inertia matrix I_C from this Cholesky factor, plus a small positive bias on the diagonal: I_C = L L^T + b I_{3×3}, where I_{3×3} is the 3×3 identity matrix and b > 0 is a (non-learnable) small positive constant to ensure positive definiteness of I_C. The learnable parameters are:

θ_SPD = [√m, h, L_I1, L_I2, L_I3, L_I4, L_I5, L_I6]     (III.6)

While this representation enforces positive mass and positive definite inertia matrices, it can still lead to inertia estimates that are not physically plausible, as discussed in the work by Traversaro et al. [54]. To achieve full consistency, the estimated inertia matrix also needs to fulfill the triangle inequality on the principal moments of inertia of the 3D inertia matrix [54].

III.3.4 Triangular Parameterized Rotational Inertia Matrix (DiffNEA Tri)

To encode the triangle inequality constraints, we first decompose the rotational inertia matrix as:

I_C = R J R^T     (III.7)

where R ∈ SO(3) is a rotation matrix, and J is a diagonal matrix containing the principal moments of inertia J_1, J_2, J_3. The principal moments of inertia are all positive (J_1 > 0, J_2 > 0, J_3 > 0), such that I_C is positive definite. In addition to the positivity of the principal moments of inertia, a physically realizable rotational inertia matrix I_C needs to have a J that satisfies the triangle inequalities [49, 54]:

J_1 + J_2 ≥ J_3,   J_2 + J_3 ≥ J_1,   J_1 + J_3 ≥ J_2     (III.8)

In the previous works by Wensing et al. [49], Traversaro et al.
[54], the triangular inequality and R ∈ SO(3) constraints were encoded explicitly; here we propose a change of variables such that these constraints are encoded implicitly, allowing us to utilize the standard gradient-based optimizers of toolboxes such as PyTorch [2].

We start out by introducing a set of unconstrained parameters θ_RAA = [θ_RAA1, θ_RAA2, θ_RAA3]^T that represent an axis-angle orientation, from which the rotation matrix R can be recovered by applying the exponential map to the skew-symmetric matrix formed from θ_RAA:

R = exp( [    0       −θ_RAA3    θ_RAA2
            θ_RAA3       0      −θ_RAA1
           −θ_RAA2    θ_RAA1       0    ] )     (III.9)

where exp(·) is the exponential map that maps the skew-symmetric matrix of θ_RAA, a member of the Lie algebra so(3), to R, a member of the Lie group SO(3) [43].

Figure III.1: Triangular parameterization of the principal moments of inertia for physical consistency.

Second, to satisfy the triangle inequality constraints in Eq. III.8 above, we can parameterize J_1, J_2, and J_3 as the lengths of the sides of a triangle, as depicted in Fig. III.1. The lengths of the first two sides of the triangle are encoded by J_1 and J_2, and the length of the third side follows from the law of cosines:

J_3 = √(J_1² + J_2² − 2 J_1 J_2 cos φ)     (III.10)

with 0 < φ < π. To encode that J_1, J_2 > 0 and 0 < φ < π, we choose the following parametrization:

J_1 = (√J_1)² + b,   J_2 = (√J_2)² + b,   φ = π · sigmoid(φ_a)     (III.11)

Thus, the learnable parameters of this parametrization are:

θ_Tri = [√m, h, θ_RAA1, θ_RAA2, θ_RAA3, √J_1, √J_2, φ_a]     (III.12)

Note that even though the underlying learnable parameter vector θ_Tri in Eq. III.12 is unconstrained during the parameter optimization via gradient descent, the intermediate parameters J_1, J_2, J_3, R always satisfy the hard constraints for physical consistency, i.e. J_1, J_2, and J_3 are all positive and satisfy the triangle inequality constraints in Eq. III.8, and R ∈ SO(3). In other words, the model always lies within the constraint manifold during optimization.
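The triangular parameterization can be sketched as follows. The helper names are illustrative, and the rotation is recovered with Rodrigues' formula, a standard closed form of the so(3) exponential map:

```python
import math

# Sketch of the theta_Tri parameterization (Eqs. III.9-III.12):
# unconstrained numbers map to principal moments J1, J2, J3 that always
# satisfy the triangle inequalities, plus a rotation R from axis-angle.

B = 1e-4  # small, non-learnable positive constant

def principal_moments(sqrt_J1, sqrt_J2, phi_a):
    J1 = sqrt_J1 ** 2 + B
    J2 = sqrt_J2 ** 2 + B
    phi = math.pi / (1.0 + math.exp(-phi_a))  # phi = pi*sigmoid(phi_a), so 0 < phi < pi
    J3 = math.sqrt(J1 ** 2 + J2 ** 2 - 2.0 * J1 * J2 * math.cos(phi))  # law of cosines
    return J1, J2, J3

def rotation_from_axis_angle(w1, w2, w3):
    """Exponential map so(3) -> SO(3), via Rodrigues' formula."""
    th = math.sqrt(w1 * w1 + w2 * w2 + w3 * w3)
    if th < 1e-12:
        return [[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]
    kx, ky, kz = w1 / th, w2 / th, w3 / th        # unit rotation axis
    K = [[0, -kz, ky], [kz, 0, -kx], [-ky, kx, 0]]  # skew-symmetric matrix
    I = [[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]
    s, c = math.sin(th), math.cos(th)
    # R = I + sin(th)*K + (1 - cos(th))*K^2
    return [[I[r][cc] + s * K[r][cc]
             + (1 - c) * sum(K[r][k] * K[k][cc] for k in range(3))
             for cc in range(3)] for r in range(3)]

J1, J2, J3 = principal_moments(1.3, -0.7, 0.4)
assert J1 + J2 >= J3 and J2 + J3 >= J1 and J1 + J3 >= J2
R = rotation_from_axis_angle(0.3, -1.2, 0.5)
for r in range(3):                 # R must be orthogonal: R R^T = I
    for cc in range(3):
        dot = sum(R[r][k] * R[cc][k] for k in range(3))
        assert abs(dot - (1.0 if r == cc else 0.0)) < 1e-9
```

Because the triangle inequalities hold for any unconstrained input by construction, no projection or constrained solver is ever needed during training.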
III.3.5 Covariance Parameterized Rotational Inertia Matrix (DiffNEA Cov)

Alternatively, the triangle inequality constraint in Eq. III.8 can be rewritten as [49]:

Σ_C = ½ Tr(I_C) I_{3×3} − I_C ⪰ 0     (III.13)

which provides a somewhat easier and more intuitive representation. Again, in the work by Wensing et al. [49], this constraint was imposed explicitly, through linear matrix inequalities. Here, we encode the constraint Σ_C ⪰ 0 implicitly by enforcing a Cholesky decomposition plus a small positive bias b on the diagonal: Σ_C = L L^T + b I_{3×3}. We parametrize this lower triangular matrix L:

L = [ L_1  0    0
      L_4  L_2  0
      L_5  L_6  L_3 ]     (III.14)

and recover the rotational inertia matrix as:

I_C = Tr(Σ_C) I_{3×3} − Σ_C     (III.15)

where Tr(·) is the matrix trace operation and I_{3×3} is the 3×3 identity matrix. The learnable parameters of this parametrization are:

θ_Cov = [√m, h, L_1, L_2, L_3, L_4, L_5, L_6]     (III.16)

This parametrization also generates fully consistent inertial parameter estimates, like the previous parametrization, but is less complex to implement.

III.4 Experiments

In this section, we evaluate our PyTorch implementation of the Newton-Euler algorithm, with the parametrizations introduced in Section III.3. We study how the parametrizations affect convergence speed when training the parameters, and how well the dynamics generalize to unseen scenarios. Furthermore, we compare a spectrum of structured dynamics learning approaches, from an unstructured MLP, through semi-structured models like DeLaN, to our highly structured approach. We start out with simulation experiments and then provide real-system results on the KUKA IIWA 7 robot.

III.4.1 Simulation

In simulation, we collect training data in a simulated KUKA IIWA environment in PyBullet [59], by tracking sine waves in each joint with the ground-truth inverse dynamics model.
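The covariance parameterization of Section III.3.5 can be checked numerically. The sketch below (with illustrative names) builds Σ_C from a Cholesky factor, recovers I_C, and verifies that ½Tr(I_C)I − I_C reproduces Σ_C exactly, so the constraint of Eq. III.13 holds by construction:

```python
# Sketch of the theta_Cov parameterization (Eqs. III.13-III.16): a
# Cholesky factor guarantees Sigma_C = L L^T + b*I is positive definite,
# and the rotational inertia is I_C = Tr(Sigma_C)*I - Sigma_C.

B = 1e-4  # small, non-learnable positive diagonal bias

def cov_inertia(l1, l2, l3, l4, l5, l6):
    L = [[l1, 0.0, 0.0],
         [l4, l2, 0.0],
         [l5, l6, l3]]
    # Sigma = L L^T + B*I (positive definite by construction)
    Sigma = [[sum(L[r][k] * L[c][k] for k in range(3))
              + (B if r == c else 0.0) for c in range(3)] for r in range(3)]
    tr = Sigma[0][0] + Sigma[1][1] + Sigma[2][2]
    I_C = [[(tr if r == c else 0.0) - Sigma[r][c]
            for c in range(3)] for r in range(3)]
    return Sigma, I_C

Sigma, I_C = cov_inertia(0.9, 0.4, 0.7, 0.1, -0.2, 0.3)
# Sanity check: 0.5*Tr(I_C)*I - I_C recovers Sigma_C exactly,
# because Tr(I_C) = 2*Tr(Sigma_C) for 3x3 matrices.
tr_IC = I_C[0][0] + I_C[1][1] + I_C[2][2]
recovered = [[0.5 * (tr_IC if r == c else 0.0) - I_C[r][c]
              for c in range(3)] for r in range(3)]
for r in range(3):
    for c in range(3):
        assert abs(recovered[r][c] - Sigma[r][c]) < 1e-12
```

The identity Tr(I_C) = 2 Tr(Σ_C) is what makes the round-trip exact, and it is also why this parameterization is simpler to implement than the triangular one.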
The sine waves have time periods of [23.0, 19.0, 17.0, 13.0, 11.0, 7.0, 5.0] seconds in the respective joints, and amplitudes of [0.7, 0.5, 0.5, 0.5, 0.65, 0.65, 0.7] times the maximum absolute movement of each joint. All dynamics models are trained on this sine wave motion dataset. All feed-forward neural networks involved in the MLP and DeLaN models have [32, 64, 32] nodes in the hidden layers with tanh() activation functions. We perform each experiment with 5 different random seeds, which affect the random initialization as well as the random end-effector goal position to be tracked during the generalization tests, and then compute the performance statistics (mean and standard deviation) across these.

III.4.1.1 Training Speed, Generalization Performance, and Effectiveness of Inverse Dynamics Learning

We train each model until it achieves at most a normalized mean squared error (NMSE) of 0.1 for all joints. We record the total number of training epochs required to reach that level of accuracy and store the model once this accuracy has been achieved. Next, we evaluate each model on tracking a) the sine motion itself (which the parameters were fitted for), and b) a series of 5 operational space control tasks. For the second task, we use a velocity-based operational space controller [60], and use the learned inverse dynamics model within that controller. The results for convergence speed and tracking performance are averaged across the 5 random seeds and summarized in Table III.1. We measure the tracking performance through the NMSE, which is the mean squared tracking error normalized by the variance of the target trajectory. The better the tracking, the less the controller relies on the feedback component, and the more on the inverse dynamics model prediction. The behavior becomes more compliant as the contribution of linear feedback goes down.
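The NMSE metric as described can be written directly. This is a sketch matching the definition in the text, not the dissertation's evaluation code:

```python
# Normalized mean squared error (NMSE): mean squared tracking error
# divided by the variance of the target trajectory.  A value of 1.0
# means the error is as large as the signal's own variability.

def nmse(predicted, target):
    n = len(target)
    mean_t = sum(target) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / n
    return mse / var_t

target = [0.0, 1.0, 2.0, 3.0]
assert nmse(target, target) == 0.0                 # perfect tracking
# A constant offset of 0.5: mse = 0.25, var = 1.25, so NMSE = 0.2.
assert abs(nmse([t + 0.5 for t in target], target) - 0.2) < 1e-12
```

Normalizing by the target variance makes the metric comparable across joints and tasks with very different motion ranges, which is why it is used throughout the evaluation.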
Table III.1: Comparison between models trained to optimize L_ID on the sine motion dataset from simulation, in terms of training speed, joint position (q) and velocity (q̇) tracking, and generalization performance: end-effector position (x) and velocity (ẋ) tracking on unseen end-effector reaching tasks.

Model          | # Training Epochs | Sine q NMSE | Sine q̇ NMSE | End-Eff. x NMSE | End-Eff. ẋ NMSE
Ground Truth   | N/A               | 0.000       | 0.000        | 0.005±0.006     | 0.008±0.010
MLP            | 233               | 0.000       | 0.001        | 0.256±0.405     | 5.542±6.980
DeLaN          | 5819              | 0.000       | 0.001        | 0.016±0.008     | 0.254±0.278
DiffNEA NoStr  | 6153              | 0.000       | 0.005±0.006  | 0.005±0.006     | 0.037±0.057
DiffNEA Symm   | 84                | 0.000       | 0.000        | 0.005±0.006     | 0.011±0.011
DiffNEA SPD    | 21                | 0.000       | 0.000        | 0.005±0.006     | 0.008±0.010
DiffNEA Tri    | 21                | 0.000       | 0.000        | 0.005±0.006     | 0.008±0.010
DiffNEA Cov    | 21                | 0.000       | 0.000        | 0.005±0.006     | 0.008±0.010

Table III.1 shows that the θ_SPD, θ_Tri, and θ_Cov parametrizations outperform the less constrained variants, training faster while achieving low NMSE. Moreover, we see that DiffNEA performs close to the ground truth and generalizes better than both the unstructured MLP and the DeLaN model: the MLP oscillates significantly in end-effector velocity (ẋ) tracking, while DeLaN also oscillates mildly in ẋ tracking.

III.4.1.2 Online Learning Speed

Next, we measure how fast each model can learn in an online learning setup. We train each model sequentially, without shuffling, on the sine motion data, where each batch has a size of 256. As the model trains on the sequential data, we measure its prediction performance through the NMSE on the entire dataset.

Figure III.2: Online learning, with NMSE in log scale.

In Fig.
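The online-learning protocol, sequential unshuffled batches with the NMSE measured on the entire dataset after each update, can be sketched with a toy 1-DoF linear-in-parameters stand-in for the full DiffNEA graph (all names and constants illustrative):

```python
import math, random

# Online-learning sketch: batches are fed sequentially without shuffling,
# and after each batch the prediction NMSE is measured on the *entire*
# dataset, mirroring the protocol of Section III.4.1.2.

random.seed(1)
a_true, b_true = 0.8, 2.5
raw = [(random.uniform(-math.pi, math.pi), random.uniform(-2, 2))
       for _ in range(1024)]
data = [(q, qdd, a_true * qdd + b_true * math.sin(q)) for q, qdd in raw]

def full_nmse(a, b):
    taus = [t for _, _, t in data]
    mean_t = sum(taus) / len(taus)
    var_t = sum((t - mean_t) ** 2 for t in taus) / len(taus)
    mse = sum((a * qdd + b * math.sin(q) - t) ** 2
              for q, qdd, t in data) / len(data)
    return mse / var_t

a = b = 0.0
lr = 0.05
history = []
for start in range(0, len(data), 256):       # sequential batches of 256
    batch = data[start:start + 256]
    for _ in range(50):                      # a few gradient steps per batch
        ga = gb = 0.0
        for q, qdd, t in batch:
            err = a * qdd + b * math.sin(q) - t
            ga += 2 * err * qdd / len(batch)
            gb += 2 * err * math.sin(q) / len(batch)
        a -= lr * ga
        b -= lr * gb
    history.append(full_nmse(a, b))          # NMSE on the whole dataset

assert history[-1] < history[0]              # prediction error decreases
```

The interesting quantity is `history`: a structured model whose parameters have physical meaning keeps improving on the full dataset even though it only ever sees each batch once.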
III.2 we see that the DiffNEA models with the rotational inertia I_C under the symmetric positive definite (SPD), triangular, and covariance parameterizations not only learn the fastest, but also generalize best to the yet-unseen training data, outperforming the other models in the online learning setup in simulation.

III.4.2 Real Robot Experiments

For the real KUKA IIWA robot, we collect sine wave tracking data for about 240 seconds at 250 Hz, using the default URDF (Unified Robot Description Format) parameters and the Pinocchio C++ library [61] for the dynamics and kinematics of the robot. We noticed un-modeled friction dynamics not present in simulation, and therefore added a joint viscous friction/damping model to both the DeLaN and DiffNEA models, whose parameters are also learned from data. We use one positive damping constant per joint, with a parameterization similar to J_1, J_2 in Eq. III.11, but with b = 0 because each joint damping constant may be 0.

III.4.2.1 Evaluation

Table III.2: Comparison between models trained to optimize L_ID on the sine motion dataset on the real robot, in terms of the number of training epochs required to reach convergence, sine motion joint position (q) tracking, sine motion joint velocity (q̇) tracking, and generalization performance: end-effector position (x) and velocity (ẋ) tracking NMSE on an end-effector tracking task (unseen task/situation during training).
Model          | # Training Epochs | Sine q NMSE | Sine q̇ NMSE | End-Eff. x NMSE | End-Eff. ẋ NMSE
Default Model  | N/A               | 0.001       | 0.009        | 0.000           | 0.016
MLP            | 2                 | 0.000       | 0.011        | 0.003           | 0.513
DeLaN          | 4                 | 0.001       | 0.013        | Unstable        | Unstable
DiffNEA Symm   | 3                 | 0.001       | 0.012        | 0.000           | 0.013
DiffNEA SPD    | 3                 | 0.001       | 0.013        | 0.000           | 0.014
DiffNEA Tri    | 2                 | 0.001       | 0.013        | 0.000           | 0.012
DiffNEA Cov    | 2                 | 0.001       | 0.012        | 0.000           | 0.015

Both the MLP and DeLaN models converge to an average training NMSE below 0.1, while the DiffNEA models converge to an average training NMSE of 0.35, down from the default model's average NMSE of 0.74. However, as can be seen in Table III.2, the trained DeLaN model is unstable during the end-effector tracking task, while the trained MLP model has a large end-effector velocity (ẋ) tracking NMSE due to oscillations. The trained DiffNEA models, on the other hand, still perform reasonably, showing their better generalization capability. We attribute the imperfect training fit of the DiffNEA models to unmodeled dynamics of the real system, such as static friction.

III.5 Summary and Future Work

In this chapter, we incorporated physical constraints into learned dynamics by adding structure to the learned parameters. This enables us to learn the dynamics of a robot manipulator in a computational graph with automatic differentiation, while keeping the learned dynamics physically plausible. We evaluated our approach on both a simulated and a real 7 degree-of-freedom KUKA IIWA arm. Our results show that the resulting dynamics models train faster and generalize to new situations better than other state-of-the-art approaches. We also observe that moving to the real robot creates new sources of discrepancy between the rigid body dynamics and the true dynamics of the system. Factors such as static friction cannot be sufficiently captured by our model, and point to interesting future directions.
Chapter IV
Learning Feedback Models for Reactive Behaviors

IV.1 Introduction

The ability to deal with changes in the environment is crucial for every robot operating in dynamic environments. In order to reliably accomplish its tasks, the robot needs to adapt its plan whenever unexpected changes arise in the environment. For instance, a grasping manipulation task involves a sequence of motions such as reaching an object and grasping it. While executing the grasp plan, environment changes may occur, requiring an online modulation of the plan, such as avoiding collisions while reaching the object, or adapting a grasp skill to account for object shape variations.

Adaptation in motion planning can be achieved either by re-planning [62, 63, 64, 65, 66] or by reactive motion planning via feedback models [7, 8, 32, 67, 68, 69, 70]. Online adaptation via re-planning is done via trajectory optimization, which is computationally expensive due to its iterative process that can be slow to converge to a feasible and optimal solution. On the other hand, most of the computational burden of adaptation via a feedback model lies in the forward computation of a pre-defined or pre-trained feedback model based on the task context and sensory inputs. Hence, reactivity via a feedback model is computationally cheaper than reactivity via re-planning [71, 72].

Moreover, adaptation via re-planning requires the ability to plan ahead, which in turn requires a model that can predict the consequences of planned actions. For cases such as movement in free space and obstacle avoidance, such a predictive model is straightforward to derive, and hence re-planning approaches are feasible in these cases, although possibly still computationally expensive.
On the other hand, for tasks involving interaction with the environment, for instance when trying to control contacts through tactile sensing, planning is hard due to hard-to-model non-rigid contact dynamics as well as other non-linearities. In this case, learning methods may come to the rescue. Planning, as well as re-planning, becomes possible once the predictive model is learned [73, 74, 75]. Similarly, a reactive policy via a learned feedback model is also a possible solution for achieving adaptation in this case [32, 76]. However, in order to plan and re-plan effectively, a global predictive model is required, whereas feedback models only need to react to changes locally, so local feedback models are sufficient. Learning a local feedback model is more sample-efficient than learning a global predictive model for a re-planning approach, and hence we choose to work on the former.

Among the data-driven approaches for learning reactive policies, there are at least two alternatives. One can learn a global end-to-end sensing-to-action policy. Alternatively, one can provide a nominal motion plan encoded as a movement primitive, and then learn a local feedback adaptation model on top of the primitive to account for environmental changes, along the same line of thought as residual learning [77]. Although global end-to-end reactive policy learning has shown some impressive results, such as learning reactive policies mapping raw images to robot joint torques [78, 79], we chose to work on the latter, as the representation with movement primitives and local feedback models explicitly incorporates task parameterization such as start, goal, and movement duration, making this method more generalizable to variations of these parameters. Specifically, we utilize Dynamical Movement Primitives (DMP) [80] to represent the nominal motion. DMPs allow the possibility of adding reactive adaptations of the motion plan based on sensory input [69].
This modulation is realized through so-called coupling terms, which can be a function of the sensory input, essentially creating a feedback loop. This means the DMP becomes a reactive controller. Such online modulation is generated by a feedback model, which maps sensory information of the environment to an adaptation of the motion plan. In the past, robotics researchers have hand-designed feedback models for particular applications [7, 8]. However, such feedback models are very problem-specific and most likely not re-usable in another problem. Moreover, in cases where the sensory input is very rich and high-dimensional, mapping this input into the required adaptation is not straightforward, and hand-designing the feedback model can become a tedious process. For this reason, we propose data-driven approaches to learning feedback models for reactive behaviors [32, 70].

Previous research has been directed mostly towards learning feedback models through pure trial-and-error, i.e. through reinforcement learning [68, 76]. However, doing pure reinforcement learning without a good policy initialization is sometimes not feasible, because it may produce dangerous behaviors, due to exploration, that can damage the robot hardware, or it can become very sample-inefficient to perform policy search over a massive policy search space. In our work, we propose to bootstrap the learning of feedback models in a supervised manner, by learning from human demonstrations. We hope that by doing so, we can minimize the risk of robot damage from repetitive trials and also minimize the policy search space, should we later perform reinforcement learning to improve the policy. We propose the framework depicted in Figure IV.1.

Figure IV.1: Proposed framework for learning feedback models.

By using a sufficiently expressive learning representation, we hope to come up with a general, trainable model that can represent any form of feedback model, for any problem domain.
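A minimal sketch of a DMP transformation system with a coupling-term hook illustrates the mechanism: with the coupling term at zero, the system converges to the goal; a learned feedback model would inject nonzero values online. The gains, the Euler integration scheme, and the omission of the forcing term are simplifying assumptions for illustration:

```python
# Sketch of a 1-D DMP transformation system with a coupling-term hook.
# The coupling callable stands in for the learned feedback model; with
# coupling = 0 the dynamics reduce to a critically damped spring-damper
# converging to the goal g.  (Forcing term omitted for brevity.)

def run_dmp(x0, g, tau=1.0, dt=0.001, coupling=lambda t, x, v: 0.0):
    alpha, beta = 25.0, 25.0 / 4.0    # critically damped gain choice
    x, v = x0, 0.0
    t = 0.0
    while t < 1.5 * tau:
        c = coupling(t, x, v)         # feedback model output (sensor-driven)
        # transformation system: tau*dv = alpha*(beta*(g - x) - v) + c
        dv = (alpha * (beta * (g - x) - v) + c) / tau
        dx = v / tau
        v += dv * dt                  # explicit Euler integration
        x += dx * dt
        t += dt
    return x

x_end = run_dmp(x0=0.0, g=1.0)        # no coupling: converges to the goal
assert abs(x_end - 1.0) < 1e-3
```

Because the coupling term enters the differential equation additively, the nominal convergence guarantee is recovered whenever the feedback model's output vanishes, which is exactly what makes this representation safe to learn on top of.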
We propose to use a learning representation for feedback models from the family of neural networks [81], due to their rich representational power and design flexibility, which can be catered to the needs of our applications. However, simply using a regular neural network as the feedback model representation may not be a good idea, since we then cannot ensure the convergence of the adapted motion plans, nor the generalization of the overall system with respect to task parameters such as movement duration, start and goal positions, etc. Therefore, we address this challenge by adding structure to the neural networks that we use. Initially, we start with a regular feed-forward neural network combined with some physical insights to ensure robust behaviors across different environmental settings. Afterwards, we develop a special neural network architecture for feedback model representations, called Phase-Modulated Neural Networks (PMNN), which is essentially a feed-forward neural network, but with a modulation by movement phase information. We show PMNN's superiority over regular feed-forward neural networks in representing feedback models through experimental results.

Furthermore, to allow for generalization to previously unseen task settings, we extend our work with a sample-efficient reinforcement learning (RL) approach. Once the feedback model has been initialized by learning-from-demonstration on several known environmental settings, we tackle the problem of encountering novel environment settings during execution. In such new settings, we propose an RL extension of the framework, allowing the feedback model to be refined further by trial-and-error, such that its performance in the new setting improves over a few trials, while maintaining its performance on the settings it has been trained on before.
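For intuition only, one plausible reading of such a phase-modulated architecture is sketched below: sensory features pass through an ordinary hidden layer, and radial-basis activations of the movement phase gate the final combination. The sizes, the gating scheme, and the initialization here are illustrative assumptions, not the exact PMNN architecture:

```python
import math, random

# Illustrative sketch of a phase-modulated feed-forward network: the
# output depends not only on the sensory features but also on where the
# movement currently is in its phase, via normalized phase RBF kernels.

random.seed(0)
N_IN, N_HID, N_PHASE = 6, 10, 5
W1 = [[random.gauss(0, 0.3) for _ in range(N_IN)] for _ in range(N_HID)]
W2 = [[random.gauss(0, 0.3) for _ in range(N_HID)] for _ in range(N_PHASE)]
centers = [i / (N_PHASE - 1) for i in range(N_PHASE)]  # RBF centers in [0, 1]

def pmnn_forward(features, phase):
    # ordinary hidden layer on the sensory features
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features))) for row in W1]
    # one candidate output per phase kernel
    branch = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    # normalized phase kernels gate the candidates
    psi = [math.exp(-50.0 * (phase - c) ** 2) for c in centers]
    s = sum(psi)
    return sum(p * b for p, b in zip(psi, branch)) / s

y0 = pmnn_forward([0.1] * N_IN, phase=0.0)
y1 = pmnn_forward([0.1] * N_IN, phase=1.0)
assert y0 != y1   # same sensory input, different phase -> different output
```

The key property is visible in the last assertion: the same sensory input produces a different correction at different points of the movement, which a phase-blind feed-forward network cannot express.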
The new setting can be selected from an extrapolated environment setting, instead of an interpolated one, over the known settings. This gives the system the possibility of constantly expanding the range of environmental settings that it can handle, leading toward lifelong learning of adaptation in reactive behaviors.

Aside from the learning feedback model framework, which is the core contribution of this dissertation, we also present a very important part of the process pipeline of our data-driven acquisition of feedback models: the automated segmentation of demonstrations. This part plays an essential role as a pre-processing step for the automated extraction of training data from many demonstrations. Our segmentation method goes beyond the Zero Velocity Crossing (ZVC) method [35], whose performance depends on a selected threshold, and the Dynamic Time Warping (DTW) algorithm [82], which is sensitive to noise in the data, and achieves robustness by leveraging a weighted least-squares formulation. We present the algorithm, as well as its time complexity analysis and a possible speed-up.

In terms of experimental evaluations, we use two testbeds to test our framework: learning a feedback model for obstacle avoidance, and learning a feedback model from tactile sensing. In the obstacle avoidance setting, we map raw sensory inputs, such as the relative obstacle position and velocity with respect to the end-effector, to the motion plan adaptation. On the other hand, in the tactile feedback setting, we map the sensor trace deviation, i.e. the difference between the experienced sensor traces and the expected ones, following the concept of Associative Skill Memories (ASM) [69], to the adaptation of the plan.
The obstacle avoidance dynamics can be simulated easily, hence it expedites the development and benchmarking of a variety of learning feedback model frame- works, by testing them in simulations. On the other hand, the tactile-driven action dynamics is not easy to be simulated and the data collection on the real hardware is expensive, thus this helps us to test how feasible is our framework of learning feedback model, especially when applied to the real robot scenario. In summary, we present the following contributions: (i). We present a weighted least square method for semi-automatic segmentation of nominal behavior demonstrations which is robust to noise and requires minimal tuning efforts. (ii). We present a complete framework for supervised learning of feedback models for reactive behaviors. This includes our design of learning representation for feedback models which provides guarantee on the convergence of the adapted motion plans as well as the general- ization of the overall system with respect to task parameters, such as movement duration, start and goal positions. We evaluate the framework on two testbeds: learning feedback models for obstacle avoidance [70] and learning tactile feedback [32]. (iii). We present a sample-efficient reinforcement learning framework for improving the learned feedback models on previously unseen environmental settings. This chapter is organized as follows: Section IV .2 presents related work. Section IV .3 provides some fundamental concepts on the chosen movement primitive rep- resentation. Section IV .4 provides an overview of how the components of our contributions fit into our unified framework for learning feedback models for reactive behaviors. Section IV .5 provides an automated way of segmenting demonstrations into training data for learning feedback models, presented with time complexity analysis and a technique for a quadratic speed-up. 
Section IV.6 describes the framework for supervised learning of feedback models from human demonstrations on several known/seen environmental settings. Section IV.7 describes the framework of sample-efficient reinforcement learning for improving the learned feedback models on novel/unseen environmental settings. Section IV.8 and Section IV.9 present evaluations of the framework on the obstacle avoidance feedback model testbed and the tactile feedback model testbed, respectively. Section IV.10 summarizes the work we have done so far on learning feedback models for reactive robot motion planning.

IV.2 Related Work

Movement adaptation can be done via re-planning [62, 63, 64, 65, 66] or via reactive feedback models [7, 8, 32, 67, 68, 69, 70]. Adaptation via re-planning finds a global solution at the cost of an expensive computation. Adaptation via feedback models is computationally cheap for instantly reacting to environment changes, but may get stuck in a local minimum. Several pieces of related work [38, 72, 83] have tried to combine both, in order to get the best of both approaches: the ability to instantly adapt to local changes, while also re-planning whenever more significant changes occur. In this chapter, we focus our work on a general learning representation for adaptation via feedback models, which has not been explored thoroughly. We start out by reviewing motion representations.

IV.2.1 Movement Representations

Several movement representations for motion planning have been developed by robotics researchers in the past few decades. Spline-based trajectory planning [84] describes trajectories using polynomials constrained by user-specified initial and goal states. However, there is no clear way of adding online sensory-based modulation to adapt the spline trajectory plans given environment changes.
The Probabilistic Road Map (PRM) [85] provides a sampling-based collision-free path that can be adjusted by re-planning when the obstacle configuration in the environment changes. However, it is not clear how to use PRM in non-obstacle-avoidance settings, such as tactile-based motion plan adaptation. Recently, there has been some work on end-to-end sensory-motor learning [78, 86, 87, 88, 89]. In this line of work, researchers train a deep neural network mapping sensory information (e.g. camera images) directly to motor commands (e.g. joint torques). However, there are some drawbacks associated with these global sensory-motor policies. The first drawback is that in general it is harder to train a global sensory-motor policy than to train a local correction/feedback model on top of an existing nominal motion plan. This is simply because the nominal behavior provides a reliable baseline of the movement, and learning only the correction term is less complex than learning the entire closed-loop behavior. The second drawback is the absence of a nominal motion plan. For instance, for a vision-dependent global grasping policy, if the lighting of the environment is significantly changed –even if the object to be grasped is not moved–, then the behavior of the system is unknown and the system will most likely fail to grasp the object. On the other hand, in a similar situation, the behavior of a system with a nominal plan shall not deviate too far from the nominal behavior, and the system shall be able to grasp the object as guided by the nominal motion plan. The third drawback is that there is little or even no task parameterization of the policy, such as the start, goal, and movement duration. Calinon et al. [90] proposed a time-invariant probabilistic motion planning via imitation learning using Hidden Markov Models (HMM) and Gaussian Mixture Regression (GMR) as the learning representations. However, in this model there is no parameterization w.r.t.
movement duration, and a stabilizer based on a mass-spring-damper system has to be added to attract trajectories toward the regions where demonstrations have been provided. There is some work on encoding movement plans as primitives for modular motion planning. The Dynamical Movement Primitive (DMP) with local coordinate transformation [80] and Associative Skill Memories (ASMs) [69, 91] pose a framework for data-driven modular sensory-motor primitives. The framework is modular in the sense that a learned primitive can generalize with respect to task parameters such as start, goal, and movement duration. For cyclic and periodic motions, for example during locomotion tasks, researchers have come up with hand-designed sine-wave-based periodic movement primitives [92, 93]. However, in this approach there are a lot of parameters to be tuned for accomplishing the locomotion task properly, such as the sine wave amplitudes and periods. The rhythmic DMP [80] provides a more data-driven approach to acquire this parameterization by supervised learning or reinforcement learning. Paraschos et al. [94] and Ewerton et al. [95] proposed the Probabilistic Movement Primitive (ProMP), which has a parameterization of movement duration similar to the DMP, allowing learning and execution of trajectories at different movement speeds. However, unlike in the DMP, there is no enforcement of convergence to the goal, and some extra parameterization is required for the covariances of the probability distributions. Thanks to the representation of the DMP as a set of differential equations, the resulting motion plan can be adapted reactively with sensory information by the so-called coupling term. The sensory information here can be in the form of either raw sensor signals [70] or a memory of sensor traces associated with previous successful executions of the motor primitive via ASMs [32, 69, 91].
In this work, we focus on general representation learning of feedback models that map this sensory information to the coupling term.

IV.2.2 Automated Demonstrations Alignment and Segmentation

For learning from human demonstrations, after demonstrations are provided, a pre-processing step is required to segment each demonstration into the corresponding movement primitives. This segmentation is usually based on the function/role and semantics of each phase in the task [96, 97]. Once segmented, each primitive can also become a part of a movement primitive library which can be re-used for other complex tasks such as writing or assembly tasks [98, 99]. The simplest technique for demonstration segmentation is threshold-based cutting, for example based on a threshold on the norm of the velocity signal as in the Zero Velocity Crossing (ZVC) method [35]. While this method –in combination with manual inspection of the segmentation result– is sufficient for segmenting a small number of demonstrations, for a larger number of demonstrations it becomes tedious, since we need to find a threshold that works for all –potentially noisy– demonstrations.

Throughout this dissertation, we refer to this memory of sensor/sensory traces as the expected sensor traces.

A more sophisticated segmentation method is based on feature correspondence matching between demonstrations, which is mostly done by dynamic programming techniques such as Dynamic Time Warping (DTW) [82, 100, 101, 102]. However, due to the noise in the demonstrations, the extracted feature correspondences can be erroneous, which in turn will degrade the segmentation quality.
In this chapter, we introduce a more robust method for automated demonstration alignment and segmentation, by combining the feature correspondence extraction of DTW with a weighted least-squares formulation of the problem, in a similar spirit to the Iterative Closest Point (ICP) method [103] used in computer graphics for mesh alignment and registration.

IV.2.3 Hand-Designed Feedback Models

In previous work, researchers have hand-designed feedback models for specific purposes. One of the most common ones is obstacle avoidance. In reference to DMPs, several previous works have tried to develop coupling term models that can reactively adapt the plan to avoid obstacles. Park et al. [7] designed a dynamic potential field model to derive a coupling term for obstacle avoidance. Hoffmann et al. [8] and Pastor et al. [9] devised an obstacle avoidance coupling term model based on a human obstacle avoidance model in a navigation experiment [104]. Khansari-Zadeh and Billard [105] designed a multiplicative (instead of additive) coupling term for obstacle avoidance. Gams et al. [10] treated joint angle limits as "obstacles" and hand-designed a repulsive force around these joint limits to avoid them. The other popular feedback model testbed among researchers is based on force and tactile sensing [106, 107]. Khansari et al. [108] designed a human-inspired feedback model for performing robotic surface-to-surface contact alignment based on force-torque sensing. Pastor et al. [69] hand-designed a feedback gain matrix which maps deviations from the expected force-torque sensor traces to the grasp plan adaptation. Hogan et al. [109] presented a combined approach between trajectory planning and local feedback models for tactile-driven manipulation, by hand-designing manipulation primitives based on contact mechanics –such as contact switches and stick/slip interactions– on a vision-based tactile sensor, GelSlim [110].
IV.2.4 Learning of Feedback Models

Hand-designed features can extract useful information from the environment, but they can be hard to find and tune. Rai et al. [1] defined human-inspired obstacle avoidance features for coupling terms [8] and learned the weighting of these features from human demonstrations to calculate the feedback term. This model could generate human-like obstacle avoidance movements for one setting of demonstrations, but did not generalize across different obstacle avoidance environmental settings. More recently, Pairet et al. [67] devised a more sophisticated framework for learning an obstacle avoidance feedback model, which can handle a more complex geometric description of the obstacles as the input features. The above-mentioned methods for learning feedback models provide a guarantee of convergence to the goal while avoiding obstacles along the way, while being enriched with model flexibility due to learning. Rai et al. [70] increased the expressiveness of the learnable feedback model by using a neural network. However, some post-processing heuristics had to be added to ensure the convergence of the closed-loop behavior to the goal. Sutanto et al. [32] eliminated the need for these heuristics by modulating the feedback term with a phase variable, which ensures convergence to the goal. Moreover, the learning representation in our previous work [32] and in this work –called Phase-Modulated Neural Networks (PMNNs)– is a general-purpose feedback model learning representation, in the sense that it does not assume a specific class of problems such as obstacle avoidance. This generality is promising for a broader application in many other problem domains. Johannink et al. [77] presented a method to combine a traditional feedback model policy and a residual policy learned by RL. The residual policy is learned to capture a contacts-and-friction-based control policy which otherwise would be hard to hand-design.
Even though this approach shares the same purpose as the method that we are proposing in this chapter, our method also includes an embedded structure in our feedback model representation which ensures convergence of the entire system behavior to the goal. Sung et al. [30] represent a haptic feedback model by a partially observable Markov decision process (POMDP), which is parameterized by deep recurrent neural networks. In general, POMDP models are not explicitly provided with the information of the movement phase, which may be essential for making a prediction of the next corrective action. Our proposed approach [32] can learn phase-dependent corrective actions. Previous work on robotic tactile-driven manipulation with tools has tried to learn feedback models to correct the position plans for handling uncertainty between tools and the environment, via motor babbling [74] or RL [76]. In particular, Chebotar et al. [76] performed RL on phase-dependent feedback models with reduced-dimensional input features, or in other words performed separate feature learning and phase-dependent parameter optimization. In our previous work [32] and in this chapter, we propose to use PMNNs, which allow simultaneous feature learning and phase-dependent parameter optimization of the feedback model. In some tasks, where the error signal is linearly parameterizable by several basis functions spanned throughout the movement execution time –such as minimizing force/torque feedback control error in bimanual manipulation tasks [111, 112]–, the model can be adapted by employing the Iterative Learning Control (ILC) algorithm [113]. Gams et al. [114] directly modify the forcing term f of a DMP in an iterative manner and apply it to the task of wiping a surface. Abu-Dakka et al. [115] iteratively learned feedforward terms to improve a force-torque-guided task execution over trials, while fixing the feedback models as constant gain matrices.
However, many of these works address only learning the forcing term, which is a representation of the nominal behavior, and do not address learning feedback models. Moreover, in the case of learning feedback models, the mapping from sensory input to the action adaptation may not be clear or linear –such as in the tactile feedback testbed of this chapter–, in which case we need to resort to an RL algorithm. Gams et al. [116] employ Learning-from-Demonstration (LfD) to train separate feedback models for different environmental settings. Gaussian Process regression is used to interpolate between these learned models to predict the required feedback model in a new environmental setting. In contrast, our work directly uses a single model to handle multiple settings. Furthermore, our model is initialized via LfD on several known environmental settings, and then refined with RL on new settings.

IV.2.5 Reinforcement Learning

There is a rich body of work on RL in the literature, which comprises model-based and model-free methods. In model-based RL, the environment dynamics model is either provided [117] or learned [23, 118]. The optimal policy is then computed by a search algorithm or model predictive control. However, in some cases –such as tactile-based control–, obtaining the dynamics model of the environment can be difficult, for example due to non-linear dynamics or a noisy environment [119]. On the other hand, model-free methods do not utilize any model of the environment, and instead follow one of two alternatives. The first alternative is to learn a representation of value functions, which are either assigned to each state (called the V-value function) or to each state-action pair (called the Q-value function), from which the optimal policy can then be inferred. RL algorithms that follow this alternative are called value-based methods, such as Neural Fitted Q (NFQ) Iteration [120] and Deep Q-Networks (DQN) [121].
However, these methods are limited to discrete action spaces and are not suited to deal with continuous action spaces, as required in most robotics applications. The second alternative is to directly optimize the policy representation –called policy-based methods–, either with policy gradient methods or with gradient-free policy improvement methods. Policy gradient methods update the policy by following the gradient of the expected return with respect to the policy parameters. One of the first policy gradient methods is the REINFORCE algorithm [122]. Lillicrap et al. [86] present Deep Deterministic Policy Gradient (DDPG), which extends NFQ and DQN to continuous action spaces. Actor-critic methods use both a value function representation –as critic– and a policy representation –as actor–, both of which can be represented as neural networks [123]. Episodic Natural Actor Critic (eNAC) [124] adds a regularization term to penalize large steps away from the observed trajectories by using the Fisher information matrix, resulting in a policy update which is significantly more efficient than the regular policy gradient. Hwangbo et al. [125] also present another model-free RL algorithm, which is decomposed into two parts: Gaussian kernel regression of the cost function and the subsequent minimization of the regressed function using natural gradient descent. Recently there are also constraint-based policy optimization methods, either with a Kullback-Leibler (KL) divergence constraint –called Trust Region Policy Optimization (TRPO) [126]– or with a clipping objective –called Proximal Policy Optimization (PPO) [127]. Several researchers have also developed sampling-based RL or gradient-free policy improvement methods, which essentially approximate the policy gradient by some reward-weighted combination of a number of policy samples [128, 129, 130].
Policy learning by Weighting Exploration with the Returns (PoWER) [129] is a sampling-based RL method that seeks to alleviate the tedious learning rate tuning process by performing policy search via the Expectation-Maximization (EM) algorithm. Relative Entropy Policy Search (REPS) [131] optimizes the expected return while bounding the amount of information loss by using the Kullback-Leibler (KL) divergence. Policy Improvement with Path Integrals (PI2) [130] is a model-free RL method, which intuitively boils down to a probabilistic weighting of samples by their cost, such that the next policy update is dominated by samples with lower costs. PI2 is improved further in PI2-CMA [132], by the addition of automatic exploration noise covariance matrix adjustment.

IV.2.6 Reinforcement Learning of Nominal Behaviors

Despite the abundance of RL algorithms, only a handful of techniques are suitable for real hardware applications, which demand sample efficiency. For nominal behaviors represented as DMPs, one of the advantages of the DMP is its ability to represent a continuous motion trajectory with a small number of parameters, typically 25-50 parameters per task space dimension. For such a small parameterization, gradient-free policy-based RL methods such as PoWER [129] or PI2 [130] are sample-efficient, and have been successfully used in many applications [133, 134, 135, 136] to learn or adjust the nominal behaviors within a few RL iterations.

IV.2.7 Reinforcement Learning of Feedback Models

In terms of learning feedback models for reactive motion planning, there have been several related works. Kober et al. [68] used the PoWER algorithm to learn an adaptation model for a ball-in-a-cup task, while Chebotar et al. [76] used the REPS algorithm to learn a tactile feedback policy from a low-dimensional embedding of tactile features.
However, these previous works mostly perform RL to acquire feedback policies with a small number of parameters, or are limited to a linear weighted combination of phase-modulated features, which is less expressive than the approach that we present in this chapter. The previously mentioned Deep RL techniques [86, 121, 137] offer a solution for learning policies with a high number of parameters, such as those represented by neural networks. Although Deep RL techniques have shown promising results in simulated environments, the high sample complexity of these methods has hindered their use in real-world robot learning involving hardware. When a simulation is available, Simulation-to-Real methods [138, 139] can be leveraged to perform the high-sample-complexity training in simulation, while only updating the dynamics model in the simulation from real-world rollouts a few times. However, for our tactile feedback testbed in this chapter, the dynamics model is difficult to obtain or simulate, and hence we really need a sample-efficient RL method for our purpose. Guided Policy Search (GPS) [140] tackles the high sample complexity issue by decomposing the policy search into two parts: trajectory optimization and supervised learning of the high-dimensional policy. However, the use of smooth local policies in GPS –such as LQR with local time-varying linear models– to supervise the high-dimensional policy has difficulty in learning discontinuous dynamics, such as during a door opening task. Path Integral Guided Policy Search (PIGPS) [79] replaced LQR with PI2 –a model-free and gradient-free RL algorithm– to tackle the problem of learning discontinuous dynamics. We take inspiration from both GPS and PIGPS in this work, by decomposing the feedback model policy search into two steps: RL for open-loop behavior optimization and supervised learning of the closed-loop feedback model policy.
In the open-loop behavior optimization step, we assume the new environment setting is fixed; we then learn an open-loop representation of the behavior –similar to learning a new nominal behavior as the DMP's forcing term– and perform the open-loop behavior improvement via the PI2 algorithm, similar to the one described in section IV.2.6. The resulting optimized open-loop behavior is then used to generate additional training data for supervised learning of the closed-loop feedback model policy represented as a PMNN, which potentially has been pre-trained with a human-demonstrated behavior adaptation dataset from a few known environment settings. In summary, our method extends the previous works on RL methods for learning improved nominal behaviors into an RL method for learning feedback models. This extension has two benefits. First, in the previous work on RL of (open-loop) nominal behaviors, if a new environment situation is encountered, the nominal behavior has to be re-learned. Our method –in contrast– learns a closed-loop behavior policy which is more general and is able to tackle multiple different environmental settings via adaptation of the nominal behavior. Second, as we show in the experiment section of this chapter, due to our feedback model representation learning, our method maintains its performance on the previous settings, while expanding the range of the adaptive behavior to a new setting. On the other hand, previous methods which refine the nominal behavior will be able to perform well on a new setting, but may not perform well on the previous settings.

IV.3 Review: Dynamical Movement Primitives

Here we review background material on our chosen motion primitive representation for reactive behaviors. A Dynamical Movement Primitive (DMP) [80] is a goal-directed behavior described as a set of differential equations with well-defined attractor dynamics.
The DMP provides a movement primitive representation which allows modulation by sensory input as a feedback adaptation to the output motion plans, in a manner that is conceptually straightforward and simple to implement relative to other movement representations. In our framework, we use DMPs to represent position and orientation motion plans of the robot's end-effector, as well as to represent the expected sensor traces in the Associative Skill Memories (ASM) framework [91]. Throughout this dissertation, we call the DMP model that we use to represent position motion plans and expected sensor traces the regular DMP. On the other hand, we use quaternions to represent orientations, and hence we call the orientation DMP model the Quaternion DMP. Quaternion DMPs were first introduced in a previous work by Pastor et al. [69], and then improved in several works by Ude et al. [141] and Kramberger et al. [142] to fully take into account the geometry of the SO(3) group. In general, a DMP model consists of a transformation system, a canonical system, and a goal evolution system. The transformation system governs the evolution of the state variable being planned, which is defined as follows:

for a regular DMP:

    \tau^2 \ddot{x} = \alpha_v (\beta_v (g - x) - \tau \dot{x}) + f_r + c_r    (IV.1)

x is either the multi-dimensional position or the expected sensor value, g is the multi-dimensional goal position or the expected final sensor value, \dot{x} is the time derivative of x, and \ddot{x} is the time derivative of \dot{x}. f_r and c_r are the regular DMP's forcing term and feedback/coupling term, respectively.

for a Quaternion DMP:

    \tau^2 \dot{\omega} = \alpha_\omega (\beta_\omega \, 2 \log(Q_g \circ \bar{Q}) - \tau \omega) + f_\omega + c_\omega    (IV.2)

where Q is a unit quaternion representing the orientation, Q_g is the goal orientation, and \omega, \dot{\omega} are the 3D angular velocity and angular acceleration, respectively. f_\omega and c_\omega are the 3D orientation forcing term and coupling term, respectively.
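For concreteness, the quaternion operations involved in Equation IV.2 (composition, conjugation, and the log/exp maps, defined formally in Appendix B) can be sketched as below. This is an illustrative sketch of ours, not the dissertation's implementation; it assumes the convention q = [w, x, y, z] and a log map that returns an axis-times-half-angle vector, so that 2 log(Q_g ∘ Q̄) yields the rotation vector from Q to Q_g.

```python
import numpy as np

def quat_mul(a, b):
    """Quaternion composition (Hamilton product), q = [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_conj(q):
    """Conjugate of a unit quaternion (its inverse)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_log(q):
    """Log map: unit quaternion -> 3D vector (rotation axis times half-angle)."""
    v = q[1:]
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return np.zeros(3)
    return (v / nv) * np.arctan2(nv, q[0])

def quat_exp(r):
    """Exp map: 3D vector (axis times half-angle) -> unit quaternion."""
    nr = np.linalg.norm(r)
    if nr < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(nr)], np.sin(nr) * r / nr])
```

With these, quat_exp(quat_log(q)) recovers q, which is the property the goal evolution and unrolling equations below rely on.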
During unrolling, we integrate Q forward in time to generate the kinematic orientation trajectory as follows:

    Q_{t+1} = \exp\left(\frac{\omega \Delta t}{2}\right) \circ Q_t    (IV.3)

where \Delta t is the integration step size. The forcing term encodes the nominal behavior, while the coupling term encodes the behavior adaptation, which is commonly based on sensory feedback. Our work focuses on learning feedback models that generate the coupling terms. We set the constants \alpha_v = \alpha_\omega = 25 and \beta_v = \beta_\omega = \alpha_\omega / 4 to get a critically-damped system response when both the forcing term and the coupling term are zero. \tau is set proportional to the motion duration. The second-order canonical system governs the movement phase variable p and phase velocity u as follows:

    \tau \dot{u} = \alpha_u (\beta_u (0 - p) - u)    (IV.4)
    \tau \dot{p} = u    (IV.5)

Throughout this dissertation, we use the term feedback and the term coupling term interchangeably.
For defining Quaternion DMPs, the quaternion composition and conjugation operators, the logarithm mapping log(·), and the exponential mapping exp(·) are required. The definitions of these operators are stated in Equations B.1, B.2, B.3, and B.4 in Appendix B.

We set the constants \alpha_u = 25 and \beta_u = \alpha_u / 4 to get a critically-damped system response. The phase variable p is initialized with 1 and will converge to 0. On the other hand, the phase velocity u has initial value 0 and will converge to 0. For the purpose of our work, the position DMP, the Quaternion DMP, and the expected sensor traces share the same canonical system, such that the transformation systems of these DMPs are synchronized [80]. In general, a multi-dimensional forcing term f (e.g. f_r, f_\omega) governs the shape of the transient behavior of the primitive towards the goal attractor. The forcing term is represented as a weighted combination of N basis functions \psi_i with width parameters \chi_i and centers \mu_i, as follows:

    f(p, u; \theta_f) = \frac{\sum_{i=1}^{N} \psi_i(p) \, \theta_{f_i}}{\sum_{j=1}^{N} \psi_j(p)} \, u    (IV.6)

where

    \psi_i(p) = \exp\left(-\chi_i (p - \mu_i)^2\right)    (IV.7)

Note, because the forcing term f is modulated by the phase velocity u, it is initially 0 and will converge back to 0.
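To make the interplay of the transformation system (Equation IV.1), the canonical system (Equations IV.4-IV.5), and the forcing term (Equations IV.6-IV.7) concrete, here is a minimal one-dimensional Euler-integration sketch with the coupling term set to zero. The function and variable names are ours, not from the dissertation's code, and the sketch is illustrative rather than a definitive implementation.

```python
import numpy as np

def rollout_dmp_1d(x0, g, tau, theta_f, centers, widths, dt=0.001):
    """Integrate a 1-D regular DMP for a duration tau, with coupling term c_r = 0."""
    alpha_v, beta_v = 25.0, 25.0 / 4.0   # transformation-system gains (critically damped)
    alpha_u, beta_u = 25.0, 25.0 / 4.0   # canonical-system gains
    x, xd = x0, 0.0
    p, u = 1.0, 0.0                      # phase variable and phase velocity
    traj = [x]
    for _ in range(int(round(tau / dt))):
        psi = np.exp(-widths * (p - centers) ** 2)           # basis activations (Eq. IV.7)
        f = (psi @ theta_f) / (psi.sum() + 1e-10) * u        # phase-velocity-gated forcing term (Eq. IV.6)
        xdd = (alpha_v * (beta_v * (g - x) - tau * xd) + f) / tau ** 2   # Eq. IV.1
        ud = alpha_u * (beta_u * (0.0 - p) - u) / tau        # Eq. IV.4
        pd = u / tau                                         # Eq. IV.5
        xd += xdd * dt
        x += xd * dt
        u += ud * dt
        p += pd * dt
        traj.append(x)
    return np.array(traj)
```

With all weights zero, the forcing term vanishes and the rollout is a critically damped reach toward g; nonzero weights shape the transient, but since f is gated by u (which starts and ends at 0), the attractor toward g is preserved.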
The N basis function weights \theta_{f_i} in Equation IV.6 are learned from human demonstrations of baseline/nominal behaviors, by setting the target regression variable:

for a regular DMP:

    f_{r,target} = \tau_{bd}^2 \ddot{x}_{bd} - \alpha_v (\beta_v (g_{bd} - x_{bd}) - \tau_{bd} \dot{x}_{bd})    (IV.8)

where \{x_{bd}, \dot{x}_{bd}, \ddot{x}_{bd}\} is the set of baseline/nominal behavior demonstrations or the set of recorded sensor states throughout the nominal behavior demonstrations.

for a Quaternion DMP:

    f_{\omega,target} = \tau_{bd}^2 \dot{\omega}_{bd} - \alpha_\omega (\beta_\omega \, 2 \log(Q_{g,bd} \circ \bar{Q}_{bd}) - \tau_{bd} \omega_{bd})    (IV.9)

where \{Q_{bd}, \omega_{bd}, \dot{\omega}_{bd}\} is the set of baseline/nominal orientation behavior demonstrations.

\tau_{bd} is the movement duration of the baseline/nominal behavior demonstration. From this point, we can perform linear regression to identify the parameters \theta_f, as shown in the work by Ijspeert et al. [80]. Finally, we include a goal evolution system as follows:

for a regular DMP:

    \tau \dot{g} = \alpha_g (G - g)    (IV.10)

where g and G are the evolving and the steady-state goal position, respectively.

for a Quaternion DMP:

    \tau \omega_g = \alpha_{\omega g} \, 2 \log(Q_G \circ \bar{Q}_g)    (IV.11)

where Q_g and Q_G are the evolving and the steady-state goal orientation, respectively. We set the constants \alpha_g = \alpha_{\omega g} = \alpha_\omega / 2. The goal evolution system has two important roles related to safety during the algorithm deployment on robot hardware. The first role, as mentioned in the previous work by Ijspeert et al. [80], is to avoid discontinuous jumps in accelerations when the goal is suddenly moved. The second role, as mentioned in a previous work by Nemec and Ude [143], is to ensure continuity between the state at the end of one primitive and the state at the start of the next one when executing a sequence of primitives, by providing g and Q_g with an appropriate value initialization, as follows:

for a position DMP: Suppose the position, velocity, and acceleration at the end of the predecessor movement primitive are x_{pr,e}, \dot{x}_{pr,e}, \ddot{x}_{pr,e}.
To ensure continuity, we set the position, velocity, and acceleration at the start of the successor movement primitive as x_{su,0} = x_{pr,e}, \dot{x}_{su,0} = \dot{x}_{pr,e}, and \ddot{x}_{su,0} = \ddot{x}_{pr,e}, respectively. In accordance with Equation IV.1, this is achieved by setting the initial value of g at the successor primitive as:

    g_{su,0} = \frac{1}{\beta_v} \left( \frac{\tau_{su}^2 \ddot{x}_{pr,e} - f_{r,su}(p_0, u_0) - c_{r,su}(p_0, u_0)}{\alpha_v} + \tau_{su} \dot{x}_{pr,e} \right) + x_{pr,e}    (IV.12)

for a Quaternion DMP: Suppose the orientation, angular velocity, and angular acceleration at the end of the predecessor movement primitive are Q_{pr,e}, \omega_{pr,e}, \dot{\omega}_{pr,e}. To ensure continuity, we set the orientation, angular velocity, and angular acceleration at the start of the successor movement primitive as Q_{su,0} = Q_{pr,e}, \omega_{su,0} = \omega_{pr,e}, and \dot{\omega}_{su,0} = \dot{\omega}_{pr,e}, respectively. In accordance with Equation IV.2, this is achieved by setting the initial value of Q_g at the successor primitive as:

    Q_{g_{su,0}} = \exp(\eta) \circ Q_{pr,e}    (IV.13)

where

    \eta = \frac{1}{2 \beta_\omega} \left( \frac{\tau_{su}^2 \dot{\omega}_{pr,e} - f_{\omega,su}(p_0, u_0) - c_{\omega,su}(p_0, u_0)}{\alpha_\omega} + \tau_{su} \omega_{pr,e} \right)    (IV.14)

where p_0 and u_0 are the start/initial values of the movement phase variable and phase velocity, respectively.

IV.4 Overview: Learning Feedback Models for Reactive Behaviors

We aim to realize reactive behavior adaptations via feedback models given multi-modal sensory observations within one unified machine learning framework. We envision a general-purpose feedback model learning representation framework that can be used in a variety of scenarios, such as obstacle avoidance, tactile feedback, locomotion planning adjustment, etc. Such a learning representation ideally has the flexibility to incorporate a variety of sensory feedback signals and can be initialized from human demonstrations. Finally, our learning framework should be able to perform reinforcement learning to refine the model when encountering new environmental settings, while maintaining the performance on the previously seen settings.
Our overall learning framework contains three main phases. In PHASE 1, we start out by collecting human demonstrations of default/nominal behaviors for a task executed in a default environmental setting. These demonstrations represent the motion trajectories and sensory traces of successful executions. We semi-automatically segment the demonstrations into several movement primitives and encode both the default motion trajectory as well as the expected sensor traces of a given segment into DMPs. We describe this phase in detail in section IV.5. In PHASE 2, we then unroll the learned movement primitives in a variety of non-default environmental settings. Due to the change in the environment setting, it is necessary to adapt/correct the behaviors in order to accomplish the task. Thus we provide demonstrations of how to correct each primitive execution. Simultaneously, we also record the experienced sensor traces of these corrected rollouts, followed by a supervised learning of behavior feedback models from these correction demonstrations. We describe the details of this phase in section IV.6. In PHASE 3, we turn to learning how to correct without demonstrations in novel environmental settings. We propose a sample-efficient reinforcement learning algorithm that can update the feedback model such that behavior is improved on novel settings while maintaining its performance on previously known/seen settings. We present the details of our reinforcement learning phase in section IV.7. An overview of our framework, with its phases, is shown in Figure IV.2.

Figure IV.2: Flow diagram of our framework. Phase 1 is detailed in section IV.5 as outlined in Algorithm IV.1. Phase 2 is detailed in section IV.6 with its flow diagram expanded in Figure IV.4. Phase 3 is detailed in section IV.7 as outlined in Algorithm IV.3.
IV.5 Automated Segmentation of Nominal Behavior Demonstrations into Movement Primitives

In our work, we model complex manipulation tasks as a sequence of motion primitives. Moreover, for robustness we want to learn these primitives from multiple demonstrations of the baseline/nominal behavior. In a learning-from-demonstrations setup, it is often easier to obtain each demonstration for such tasks in one go, instead of one primitive at a time. Thus, our work starts by collecting such full demonstrations, and then segmenting the recordings into motion primitives. To avoid the high cost of manually segmenting the demonstrations, we propose a semi-automated segmentation approach. The main idea of our approach is to take the first demonstration, manually segment it, and then segment the rest by aligning them with the first. Finally, we learn P DMPs from L segmented demonstrations. Our approach starts by manually segmenting the first demonstration U_1 = \{z_i\}_{i=1}^{T_1} in our dataset into P trajectories \{z_i\}_{i=s_p^1}^{e_p^1}, one per primitive p. For this manual segmentation we use the Zero Velocity Crossing (ZVC) method [35], with a manually tuned threshold h. Moreover, we expand each segmented trajectory in both directions –the start and end points– by \delta, and call the result the reference trajectory T_p^{ref} = \{z_i\}_{i=s_p^1 - \delta}^{e_p^1 + \delta}. Given these P reference trajectories, the automatic segmentation of the unsegmented demonstrations is outlined in Algorithm IV.1. For a task that consists of P primitives, we incrementally segment one primitive at a time. The segmentation process takes the following as input:

- the remaining (L - 1) un-segmented baseline/nominal behavior demonstrations \{U_l\}_{l=2}^{L}
- the segmented reference trajectories \{T_p^{ref}\}_{p=1}^{P}
- the Zero Velocity Crossing (ZVC) threshold h
- the Dynamic Time Warping (DTW) search's integer time index extension \delta

For clarity purposes we say first, but in practice any demonstration can be used.
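As an illustration of the threshold-based cut used here, a simple reading of ZVC-style segmentation (our sketch, not necessarily the exact criterion of [35]) extracts the index intervals in which the speed rises above and then falls back below the threshold h, so that segment boundaries land near zero-velocity crossings:

```python
import numpy as np

def zvc_segments(speed, h):
    """Return (start, end) index pairs of motion segments: the intervals
    where the velocity norm exceeds threshold h."""
    speed = np.asarray(speed, dtype=float)
    moving = speed > h
    edges = np.diff(moving.astype(int))          # +1 at rising edges, -1 at falling edges
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if moving[0]:                                # trajectory starts while already moving
        starts = [0] + starts
    if moving[-1]:                               # trajectory ends while still moving
        ends = ends + [len(speed)]
    return list(zip(starts, ends))
```

In practice the threshold h has to be tuned and the result inspected manually, which is exactly the tedium that the automated alignment in Algorithm IV.1 is designed to avoid for the remaining demonstrations.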
† Here we use the same velocity threshold η as the one used for segmenting the reference trajectory T_ref^p, i.e. there is no re-tuning here.

Algorithm IV.1 Pseudo-Code of Nominal Movement Primitives Extraction from Demonstrations
1: function SegmentDemo({U_l}_{l=2}^L, {T_ref^p}_{p=1}^P, η, κ)
2:   # Initialize solution with empty lists for P primitives:
3:   S ← list of P empty lists
4:   for p = 1 to P do
5:     for l = 2 to L do
6:       # Compute the initial segmentation guess,
7:       # start point s_l^p and end point e_l^p, with ZVC:
8:       [s_l^p, e_l^p] = ZVC(U_l, η)
9:
10:      # If U_l = {z_j}_{j=1}^{T_l}, then:
11:      T_guess^p = {z_j}_{j=s_l^p−κ}^{e_l^p+κ}
12:
13:      # Compute correspondence pairs C:
14:      C = DTW(T_ref^p, T_guess^p)
15:
16:      # Construct least squares (LS) parameters:
17:      [A, b] = ConstructLS(C)
18:
19:      # Compute least squares (LS) weights:
20:      W = ComputeLSWeights(C, T_ref^p, T_guess^p)
21:
22:      # Solve the weighted least squares (WLS) problem
23:      # to estimate time scale α_l and time delay d_l:
24:      [α_l, d_l] = SolveWLS(A, b, W)
25:
26:      # Append the refined segmentation to the solution:
27:      i_s = s_l^p + d_l + (α_l − 1)κ
28:      i_e = s_l^p + d_l + (α_l − 1)κ + α_l (e_1^p − s_1^p)
29:      S[p] = S[p] ∪ {{z_j}_{j=i_s}^{i_e}}
30:   return S

Our algorithm starts by extracting initial segmentation guesses T_guess^p using the ZVC method from the remaining demonstrations. However, we expect these initial segments to be very rough, and thus expand each initial segmentation guess to κ time steps before and after the found start and end points.

For each primitive p, we then use Dynamic Time Warping (DTW) [82] to extract point correspondences C between the reference trajectory T_ref^p and the guessed trajectory segment T_guess^p, as shown in line 14 of Alg. IV.1 and explained in section IV.5.1. Next, using the correspondence pairs C, we want to estimate a time delay and a time scaling parameter that best align the correspondence points in T_guess^p with their correspondence pairs in T_ref^p, as we explain in section IV.5.2.
We propose to use a (weighted) least squares formulation to identify these parameters, which then allows us to extract the final segmentation, as will be explained in section IV.5.3. The overall segmentation algorithm is shown in Alg. IV.1. Next, we describe the details of the individual segmentation steps.

IV.5.1 Point Correspondence Matching using Dynamic Time Warping

For finding point correspondences between the reference and target trajectories, we use the Dynamic Time Warping (DTW) algorithm [82] for matching. The algorithm is outlined in Algorithm IV.2. Please note that it is important to standardize both the reference and target trajectories before performing the DTW matching, such that each signal is zero-mean with standard deviation equal to one; otherwise the computed correspondence pairs will be erroneous due to the incorrect/biased current cost.

In practice, when extending the trajectories – e.g. as done in line 11 of Alg. IV.1 – we need to take care of boundary cases, because the extended segmentation points may fall out-of-range. However, for the sake of clarity, we do not include this boundary-checking in Algorithm IV.1.

Algorithm IV.2 Dynamic Time Warping for Point Correspondence Matching
1: function DTW(T_ref, T_guess)
2:   N = length(T_ref)
3:   M = length(T_guess)
4:   T_ref = standardize(T_ref)
5:   T_guess = standardize(T_guess)
6:   Let memo[0...N, 0...M] be a new 2D array.
7:   Let ancestry[1...N, 1...M] be a new 2D array.
8:   Let C be an empty list.
9:
10:  for i = 1 to N do
11:    memo[i, 0] = ∞
12:  for j = 1 to M do
13:    memo[0, j] = ∞
14:  memo[0, 0] = 0
15:
16:  # Begin the dynamic programming:
17:  for i = 1 to N do
18:    for j = 1 to M do
19:      # Compute current cost:
20:      cost = norm(T_ref[i] − T_guess[j])
21:      # Compute the minimum value (mv) and
22:      # get the minimum index (mi) over the 3 ancestors:
23:      [mv, mi] = min(memo[i−1, j],     # insertion
                        memo[i−1, j−1],  # matching
                        memo[i, j−1])    # deletion
24:      memo[i, j] = cost + mv
25:      ancestry[i, j] = mi
26:
27:  # Traverse back the ancestry path to fill in the correspondence pairs list C:
28:  i = N
29:  j = M
30:  while (i > 1) and (j > 1) do
31:    if ancestry[i, j] == 1 then       ▷ insertion
32:      i = i − 1
33:    else if ancestry[i, j] == 2 then  ▷ matching
34:      Append (i, j) to the correspondence pairs list C.
35:      i = i − 1
36:      j = j − 1
37:    else                              ▷ deletion
38:      j = j − 1
39:  return C

With reference to line 11 of Algorithm IV.1, given:
(i) the one-dimensional (1D) reference trajectory T_ref^p = {z_i}_{i=s_1^p−κ}^{e_1^p+κ} of size N = e_1^p − s_1^p + 2κ + 1, and
(ii) a 1D initial segmentation guess trajectory T_guess^p = {z_j}_{j=s_l^p−κ}^{e_l^p+κ} of size M = e_l^p − s_l^p + 2κ + 1,
the DTW algorithm returns K ≤ min(M, N) correspondence pairs C = {(t_k^ref, t_k^guess)}_{k=1}^K, where 1 ≤ t_k^ref ≤ N is a (positive integer) time index in the reference trajectory and 1 ≤ t_k^guess ≤ M is a (positive integer) time index in the guessed segment.

IV.5.2 Least Squares Problem Setup

Similar to the Iterative Closest Point (ICP) method [103] in computer graphics for mesh alignment, we can set up a least squares problem for trajectory segment alignment.
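As a concrete illustration of the DTW matching in Algorithm IV.2 above, here is a minimal numpy sketch for one-dimensional trajectories (function and variable names are illustrative, not from the thesis implementation):

```python
import numpy as np

def dtw_correspondences(t_ref, t_guess):
    """Dynamic-programming DTW (Algorithm IV.2 sketch): returns the list
    of matched 1-based index pairs (i, j) between two 1-D trajectories."""
    # Standardize both signals so the pointwise cost is unbiased.
    t_ref = (t_ref - t_ref.mean()) / t_ref.std()
    t_guess = (t_guess - t_guess.mean()) / t_guess.std()
    n, m = len(t_ref), len(t_guess)
    memo = np.full((n + 1, m + 1), np.inf)
    memo[0, 0] = 0.0
    ancestry = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(t_ref[i - 1] - t_guess[j - 1])
            choices = (memo[i - 1, j],      # insertion
                       memo[i - 1, j - 1],  # matching
                       memo[i, j - 1])      # deletion
            mi = int(np.argmin(choices))
            memo[i, j] = cost + choices[mi]
            ancestry[i, j] = mi
    # Backtrace: record a correspondence pair only on "matching" steps.
    pairs, i, j = [], n, m
    while i > 1 and j > 1:
        if ancestry[i, j] == 0:      # insertion
            i -= 1
        elif ancestry[i, j] == 1:    # matching
            pairs.append((i, j))
            i -= 1
            j -= 1
        else:                        # deletion
            j -= 1
    return pairs[::-1]
```

The O(MN) table fill mirrors lines 16-25 of Algorithm IV.2, and only "matching" steps in the backtrace contribute correspondence pairs.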
To do so, we relate each correspondence pair in C = {(t_k^ref, t_k^guess)}_{k=1}^K by assuming that the time indices t_k^ref in the reference segment can be scaled and shifted such that they match the corresponding time indices t_k^guess in the guessed segment:

t_k^guess = α_l t_k^ref + d_l    (IV.15)

We would like to perform parameter identification for α_l and d_l, which are the time scale and the time delay, respectively, of the guessed segment T_guess relative to the reference segment T_ref. Since the reference trajectory T_ref has been properly cut using the ZVC method, knowing the values of the α_l and d_l parameters informs the best (data-driven) refined segmentation point of the guessed trajectory T_guess. To identify the parameters (α_l, d_l) we can set up a least squares estimate by using all K correspondence pairs as follows:

Ax = b    (IV.16)

with:

A = [ t_1^ref  1 ]     b = [ t_1^guess ]     x = [ α_l ]
    [ t_2^ref  1 ]         [ t_2^guess ]         [ d_l ]
    [   ...   ...]         [    ...    ]
    [ t_K^ref  1 ]         [ t_K^guess ]

With A and b known, the regular least squares solution applies:

x = (AᵀA)⁻¹ Aᵀ b    (IV.17)

IV.5.3 Weighted Least Squares Solution

The least squares solution in Equation IV.17 above applies if the assumption – that all correspondence pairs are accurate – is satisfied. In reality, due to noise, un-modeled disturbances and other factors, some correspondence pairs can be inaccurate, yielding an inaccurate estimation of the α_l and d_l parameters. These inaccurate correspondence pairs can be characterized as follows:
- correspondence pairs between points with near-zero velocities
- correspondence pairs between points of incompatible velocities

Correspondence pairs with near-zero velocities are most likely inaccurate, due to the way the DTW algorithm works. To give an intuitive example, let us imagine that T_ref has 100 consecutive zero data points at the end (i.e. trailing zeroes), while T_guess has only 10 zero data points at the end.
In this case, there are many ways the matching can be done between the 100 trailing zeroes in T_ref and the 10 trailing zeroes in T_guess. However, there is only one matching that is consistent with the true value of the (α_l, d_l) parameters. Therefore, we filter out the correspondence pairs with near-zero velocities, so that they do not affect the (α_l, d_l) parameter identification result at all.

For handling the case of correspondence pairs between points of incompatible velocities, instead of using the regular least squares solution, we use a weighted least squares formulation as follows:

x = (AᵀWA)⁻¹ AᵀWb    (IV.18)

The weight matrix W is chosen to be a diagonal positive definite matrix, with each diagonal component being the weight associated with a correspondence pair. This weight is determined by how compatible the velocities are between the two points. If the velocities of the pair are the same, the weight is 1 – the maximum weight possible; otherwise, the weight decays exponentially as a function of the velocity difference.

We also performed a time complexity analysis of our algorithm, which is O(MN) overall, where N is the length of the reference segment and M is the length of the guessed segment as defined in section IV.5.1. A quadratic speed-up can be obtained by down-sampling the reference and guessed segments. Details of this time complexity analysis and speed-up can be found in Appendix C.

IV.5.3.1 Refinement of the Trajectory Segmentation

After the time scale α_l and time delay d_l parameters are identified, we can refine the segmentation of the guessed trajectory. The duration of the extended reference trajectory is (e_1^p − s_1^p + 2κ). Hence, the segment in the guessed trajectory that corresponds to the extended reference trajectory has duration α_l (e_1^p − s_1^p + 2κ). Moreover, such a segment starts at an offset of d_l from the start of the extended guess segment, i.e. at s_l^p − κ + d_l.
Taking into account the scaling of the extension as well, the refined trajectory segment that is to be appended to the solution has the start index:

i_s = s_l^p − κ + d_l + α_l κ = s_l^p + d_l + (α_l − 1)κ    (IV.19)

and the end index:

i_e = s_l^p − κ + d_l + α_l (e_1^p − s_1^p + 2κ) − α_l κ = s_l^p + d_l + (α_l − 1)κ + α_l (e_1^p − s_1^p)    (IV.20)

Eventually, we append the refined segmentation result to the solution, as shown in line 29 of Alg. IV.1.

IV.5.4 Learning DMPs from Multiple Segmented Demonstrations

After we have segmented the demonstrations into primitives, we can learn the forcing term parameters θ_f for each primitive from these segmented demonstrations. For each data point in each segmented demonstration, we extract the target regression variable f_target according to Eq. IV.8 or IV.9 and the corresponding phase-dependent radial basis function (RBF) feature vector (u / Σ_{j=1}^N φ_j(p)) [φ_1(p) φ_2(p) ... φ_N(p)]ᵀ according to Eq. IV.7. Afterwards we stack these target regression variables f_target and the corresponding phase-dependent RBF feature vectors over all data points in all segmented demonstrations, and perform regression to estimate the forcing term parameters θ_f based on the relationship in Eq. IV.6.

IV.6 Supervised Learning of Feedback Models for Reactive Behaviors

We aim to realize reactive behavior adaptations given multi-modal sensor observations within one unified machine learning framework, and we would like to bootstrap the learning of the feedback/adaptation policy from demonstrations. This goal opens up several problems, such as: what form of feedback model representation will be able to take into account high-dimensional sensory input, and how to design such a compact representation of the feedback model while maintaining generalizability across varying task parameters.
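Before moving on, the alignment and refinement steps of sections IV.5.2-IV.5.3.1 above can be sketched in code as follows (the velocity threshold, decay rate, and helper names are illustrative assumptions, not the thesis's exact constants):

```python
import numpy as np

def align_and_refine(pairs, vel_ref, vel_guess, s_l, e_ref_len, kappa,
                     zero_vel_thresh=1e-3, decay=10.0):
    """Weighted least squares fit of t_guess = alpha * t_ref + d
    (Eqs. IV.15-IV.18), then the refined indices (Eqs. IV.19-IV.20).

    pairs:      DTW correspondence list [(t_ref_k, t_guess_k), ...] (1-based)
    vel_ref, vel_guess: velocity signals used to weight each pair
    s_l:        ZVC start point of the guessed segment in the demonstration
    e_ref_len:  un-extended reference duration, e_1^p - s_1^p
    kappa:      time-index extension used when cutting the segments
    """
    t_ref = np.array([p[0] for p in pairs], dtype=float)
    t_guess = np.array([p[1] for p in pairs], dtype=float)
    v_r = vel_ref[t_ref.astype(int) - 1]
    v_g = vel_guess[t_guess.astype(int) - 1]
    # Drop ambiguous near-zero-velocity pairs entirely.
    keep = (np.abs(v_r) > zero_vel_thresh) | (np.abs(v_g) > zero_vel_thresh)
    # Weight decays exponentially with the velocity difference (max weight 1).
    w = np.exp(-decay * np.abs(v_r[keep] - v_g[keep]))
    A = np.stack([t_ref[keep], np.ones(int(keep.sum()))], axis=1)
    W = np.diag(w)
    b = t_guess[keep]
    # Weighted normal equations: x = (A^T W A)^{-1} A^T W b
    alpha, d = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
    # Refined start/end indices in the demonstration (Eqs. IV.19-IV.20).
    i_s = s_l + d + (alpha - 1.0) * kappa
    i_e = i_s + alpha * e_ref_len
    return alpha, d, int(round(i_s)), int(round(i_e))
```

Setting all weights to 1 recovers the ordinary least squares solution of Eq. IV.17.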
Towards this goal, in this section we present the following contributions:
- Part of our work is concerned with the generalization of both the motion primitive and the feedback model with respect to task parameters, such as motion duration, start and goal positions, etc. Towards this, we discuss a principled method of defining a local coordinate system for both the primitive and the feedback model, as well as creating a duration-invariant formulation of the feedback model. As a result, demonstrations with different task parameters become comparable.
- We chose to represent the feedback model using a family of neural networks that has the inherent potential of incorporating multi-modal sensor feedback. Our specific contribution is a design of a neural network feedback model that maintains the convergence property of the motion plan when augmented with the feedback.
- To learn the mapping from sensory information to motion adaptation, we initialize the feedback model from human demonstrations, similar to learning the transient behavior of motion primitives.

In terms of the type of input of the feedback model, we explore two possibilities:
(i) Raw Sensor Traces. In this case we hand-pick all sensor features that potentially affect the degree of motion plan adaptation as the feedback model input. In general, the learning framework of feedback models with raw sensor traces as input is depicted in Figure IV.3 (left), while the process pipeline is visualized in Figure IV.4 (left).
(ii) Sensor Trace Deviations. As an alternative to raw sensory features as the feedback model input, there are cases where we can leverage the sensory information associated with past successful executions of the motion primitives. Such a framework is called Associative Skill Memories (ASMs) [69, 91].
In the ASM framework, when the execution of a motion primitive results in sensor traces which deviate from the expected traces, we can use the sensor trace deviation as input to the feedback model, to inform the required adaptation of the motion plan. In general, the learning framework of feedback models with sensor trace deviations as input is depicted in Figure IV.3 (right), while the process pipeline is visualized in Figure IV.4 (right).

Figure IV.3: Proposed framework for learning feedback models with raw sensor traces input (left) and with sensor trace deviations input (right).

Figure IV.4: Flow diagram of the proposed framework for supervised learning of feedback models with raw sensor traces input (left) and with sensor trace deviations input (right).

On the other hand, in terms of the feedback model representation, we investigate two options:
(i) Neural Network with Output Post-Processing. Artificial neural networks [81] have been shown to be potent learning representations capable of incorporating high-dimensional inputs. This is the reason why we choose a neural network as the feedback model representation that takes multi-modal sensor input into account. Moreover, to create robust and safe behaviors, we incorporate some of the physical intuition – typically used to design the coupling term representation – by post-processing the output of the neural network. This ensures that the adapted motion plan converges.
(ii) Phase-Modulated Neural Network (PMNN). We also design a special neural network structure which has an inherent capability of capturing the relationship between phase-dependent sensory features and the adaptation of the motion plan, called the phase-modulated neural network (PMNN).
One of the reasons for developing this new learning representation for feedback models is to avoid hand-designing the required form of the post-processing on the feedback model output for guaranteeing the convergence of the adapted motion plan, and to instead rely on a more data-driven approach with this representation.

We provide an evaluation of our approaches on two testbeds:
(i) Learning an Obstacle Avoidance Feedback Model. In this testbed, we use raw sensory features – describing the kinematic relationship between the end-effector and the obstacle – as inputs to the feedback model. In terms of the feedback model representation, we use a feed-forward neural network (FFNN) with output post-processing in this testbed. It is also possible to use a PMNN as the feedback model representation in this testbed, and we plan to explore this possibility in the future. Experimental evaluations of this testbed are presented in Chapter IV.8.
(ii) Learning a Tactile Feedback Model. In this testbed, we learn a feedback model which maps the tactile sensor trace deviations to the motion plan adaptation. We mainly use the PMNN in this testbed, and provide evaluations which show the superiority of the PMNN as compared to the FFNN in terms of data-fitting performance. Experimental evaluations of this testbed are presented in Chapter IV.9.

First, we define the local coordinate system that is very important for the spatial generalization – in terms of start and goal positions – of both the learned motion primitive and the learned feedback model.

Figure IV.5: (left) Example of local coordinate frame definition for a set of obstacle avoidance demonstrations: a local coordinate frame is defined on trajectories collected from human demonstration. (middle and right) Unrolled avoidance behavior is shown for two different locations of the obstacle and the goal: using the local coordinate system definition (right) and not using it (middle).
IV.6.1 Spatial Generalization using Local Coordinate Frames

Ijspeert et al. [80] pointed out the importance of a local coordinate system definition for the spatial generalization of two-dimensional (2D) DMPs. Based on this, we define a local coordinate system for three-dimensional (3D) task space DMPs as follows:
(i) The local x-axis is the unit vector pointing from the start position towards the goal position.
(ii) The local z-axis is the unit vector orthogonal to the local x-axis and closest to the opposite direction of the gravity vector.
(iii) The local y-axis is the unit vector orthogonal to both the local x-axis and the local z-axis, following the right-hand convention.

The left plot of Figure IV.5 gives an example of a local coordinate system defined for a set of human obstacle avoidance demonstrations. The importance of using a local coordinate system for obstacle avoidance is illustrated in the middle and right plots of Figure IV.5. In both plots, black dots represent points on the obstacles. Solid orange trajectories represent the unrolled trajectory of the DMP with the learned coupling term when the goal position is the same as in the demonstration (dark green). Dotted orange trajectories represent the unrolled trajectory when both the goal position and the obstacles are rotated by 180 degrees with respect to the start. DMPs without a local coordinate system (middle) are unable to generalize the learned coupling term to this new task setting, while DMPs with a local coordinate system (right) are able to generalize to the new context. When using the local coordinate system, all related variables are transformed into their representation in the local coordinate system before using them to compute the features as input to the feedback model, as shown in Figure IV.6.

Figure IV.6: System overview with local coordinate transform.
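The three axis definitions above can be sketched directly in numpy (a minimal illustration; it assumes the start-to-goal direction is not vertical, in which case the z-axis projection would degenerate):

```python
import numpy as np

def local_coordinate_frame(start, goal, gravity=np.array([0.0, 0.0, -9.81])):
    """Local frame of section IV.6.1: returns a 3x3 rotation matrix whose
    columns are the local x, y, z axes expressed in world coordinates."""
    # Local x-axis: unit vector from the start towards the goal position.
    x_axis = (goal - start) / np.linalg.norm(goal - start)
    # Local z-axis: orthogonal to x and closest to the "up" direction
    # (opposite of gravity): project up onto the plane orthogonal to x.
    up = -gravity / np.linalg.norm(gravity)
    z_axis = up - np.dot(up, x_axis) * x_axis
    z_axis = z_axis / np.linalg.norm(z_axis)
    # Local y-axis: right-hand convention, y = z x x (so that x x y = z).
    y_axis = np.cross(z_axis, x_axis)
    return np.column_stack([x_axis, y_axis, z_axis])
```

For a horizontal start-to-goal motion, the resulting frame keeps z pointing up, so demonstrations collected at different workspace orientations become comparable after the transform.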
IV.6.2 Feedback Model Learning Framework

We envision a general-purpose feedback model learning representation framework that can be used in a variety of scenarios, such as obstacle avoidance, tactile feedback, locomotion planning adjustment, etc. Such a learning representation ideally has the flexibility to incorporate various sensor feedback signals, can be initialized from human data, and can generalize to previously unseen scenarios.

One step towards generalizing to unseen settings is to use a local coordinate system, as introduced in Section IV.6.1. The second challenge of creating a flexible feedback model is addressed by choosing an appropriate function approximator that can be fit to predict the required adaptation given sensory information. Here, we choose to represent the feedback model through a family of neural networks trained from human demonstrations.

To learn a general adaptation/feedback model from human demonstrations with respect to a task, we require two kinds of demonstrations:
- Nominal/baseline demonstrations: demonstrations of the nominal execution of the task in the default environment setting.
- Adapted/corrected demonstrations: demonstrations of the adapted/corrected execution of the task when there are environmental changes with respect to the default environment setting.

In this section, we specify the general learning-from-demonstration framework of the feedback model, starting from the input specification, the target adaptation level (output) extraction, the neural network structures that we use as the feedback model representations, and finally the loss function for the supervised training of the feedback model.

IV.6.2.1 Feedback Model Input Specification

In order to perform adaptation, at each time step the feedback model requires some information – regarding the state between the environment and the robot – related to the task in consideration.
This task-related information is obtained via the robot's sensors in the form of observations of a set of variables that we call sensor features s, whether they are directly or indirectly related to the task. As mentioned before, there are two possible types of input for feedback models in reactive motion planning:

(i) Raw Sensor Traces, s = s_actual. In this case, at each time step we use the real-time-acquired sensor features, denoted as the actual sensor traces s_actual, to drive the adaptation to environmental changes*. This can be written as:

c = h(s) = h(s_actual)    (IV.21)

where h(·) is the feedback model, and c is the degree of adaptation/correction of the motion plan, i.e. the general coupling term in Equations IV.1 and IV.2.

(ii) Sensor Trace Deviations, Δs = s_actual − s_expected. The core idea of Associative Skill Memories (ASMs) [69, 91] rests on the insight that similar task executions should yield similar sensory events. Thus, an ASM of a task includes both a movement primitive as well as the expected sensor traces s_expected associated with this primitive's execution in its default environment setting. When a movement primitive is executed under environment variations and/or uncertainties, the online-perceived/actual sensor traces s_actual tend to deviate from s_expected. The disparity s_actual − s_expected = Δs can be used to drive corrections for adapting to the environmental changes causing the deviated sensor traces. This can be written as:

c = h(Δs) = h(s_actual − s_expected)    (IV.22)

For this reason, we need to acquire the expected sensor traces s_expected beforehand by learning from demonstrations of nominal behaviors.

* This type can also be viewed the same as type (ii), i.e. Δs = s_actual − s_expected with s_expected = 0.

Learning Expected Sensor Traces. To learn the s_expected model, we execute the nominal behavior and collect the experienced sensor measurements.
Since these measurements are trajectories by nature, we can encode them using DMPs to become s_expected. This has the advantage that s_expected is phase-aligned with the motion primitives' execution. As part of the preparation for the learning-from-demonstration framework of the feedback model, we collect the dataset of the actual sensor traces s_actual associated with the adapted/corrected behavior demonstrations, which we call s_actual,demo.

IV.6.2.2 Target Adaptation Level Extraction from Human Demonstration Data

Here we assume that the motion primitives have been learned beforehand from the nominal/baseline behavior demonstrations, following the description in Section IV.3, and that therefore we can already obtain the behavior forcing term f, following Equation IV.6. To perform the learning-from-demonstration of the feedback model, we need to extract the target adaptation level or target coupling term – which is the target output/regression variable – from the demonstration data, as follows (in reference to Equations IV.1 and IV.2):

for a regular DMP:

c_r,target = τ_cd² ẍ_cd − α_v (β_v (g_cd − x_cd) − τ_cd ẋ_cd) − f_r    (IV.23)

where {x_cd, ẋ_cd, ẍ_cd} is the set of adapted/corrected behavior demonstrations;

for a Quaternion DMP:

c_ω,target = τ_cd² ω̇_cd − α_ω (β_ω 2 log(Q_g,cd ∘ Q̄_cd) − τ_cd ω_cd) − f_ω    (IV.24)

where {Q_cd, ω_cd, ω̇_cd} is the set of adapted/corrected orientation behavior demonstrations.

Furthermore, τ_cd is the movement duration of each adapted/corrected demonstration, and α_v, β_v, α_ω, and β_ω are the same constants defined in Section IV.3. Due to the specification of τ_cd in Equations IV.23 and IV.24, the formulation can handle demonstrations with different movement durations/trajectory lengths. Next, we describe our proposed general learning representations for the feedback model.
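For the regular-DMP case, the coupling-term target of Eq. IV.23 is a pointwise computation along the corrected demonstration. A minimal numpy sketch (the gain values below are common DMP defaults used here for illustration, not the thesis's exact constants):

```python
import numpy as np

def extract_target_coupling(x, xd, xdd, goal, tau, f_r,
                            alpha_v=25.0, beta_v=25.0 / 4.0):
    """Target coupling term for a regular DMP (Eq. IV.23 sketch).

    x, xd, xdd: corrected-demonstration position/velocity/acceleration
    goal, tau:  goal position and movement duration of the demonstration
    f_r:        forcing term evaluated along the demonstration
    """
    # c_target = tau^2 * xdd - alpha_v * (beta_v * (g - x) - tau * xd) - f_r
    return tau**2 * xdd - alpha_v * (beta_v * (goal - x) - tau * xd) - f_r
```

Stacking these targets over all corrected demonstrations yields the regression outputs for the supervised feedback-model training described next.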
IV.6.2.3 Feedback Model Representations

Our goal is to acquire a feedback model which predicts the adaptation/correction/coupling term c based on sensory information about the environment. In other words, we would like to learn the function h(·), mapping sensory input features s to the coupling term c, as mentioned in Equations IV.21 and IV.22. We propose the following representations to deal with this regression problem.

IV.6.2.3.1 Neural Network with Output Post-Processing

We use neural network (NN) structures for representing feedback term models due to their ability to learn task-relevant feature representations of high-dimensional inputs from data. Generally speaking, however, any non-linear function approximator could be considered for this part of the framework. The coupling term is approximated as the output of our neural network, given sensory inputs perceived during execution of the task, as follows:

c = h_NN(s)    (IV.25)

While there is no question that neural networks have the necessary flexibility to represent a feedback model with various sensor inputs, there is concern regarding their unconstrained use in real-time control settings. It is likely that the system encounters scenarios that it has not been explicitly trained for, for which it is not always clear what a neural network will predict. However, we want to ensure that our network behaves safely in unseen settings. Thus, as part of our proposed approach, we introduce some physically-inspired post-processing measures that we apply to our network predictions, which ensure safe behaviors including convergence of the motion plan. The final coupling term c, given a set of sensory input features s, becomes:

c = P(h_NN(s))    (IV.26)

where P(·) are the post-processing steps applied to the neural network's output to ensure safe behavior.
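As an illustration of Eq. IV.26 only: the specific post-processing P(·) is task-dependent, and the clipping-plus-phase-velocity-gating below is a hypothetical choice for a sketch, not the thesis's exact design:

```python
import numpy as np

def postprocessed_coupling(nn_output, phase_velocity, max_magnitude=50.0):
    """Illustrative post-processing P(.) from Eq. IV.26: clip the raw
    network prediction to a bounded magnitude, and gate it by the phase
    velocity so the coupling term vanishes as the primitive converges."""
    clipped = np.clip(nn_output, -max_magnitude, max_magnitude)
    return phase_velocity * clipped
```

Because the phase velocity goes to zero at the end of a primitive, any such gated coupling term also goes to zero, which is the convergence property the post-processing is meant to preserve.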
IV.6.2.3.2 Phase-Modulated Neural Network (PMNN)

While it is possible to come up with some hand-designed output post-processing based on physical intuition of a particular task, it is not easy to hand-design a form of post-processing that applies to all kinds of tasks in general. Therefore, we propose a special neural network structure with embedded post-processing as a feedback model representation. This new neural network design is a variant of the radial basis function network (RBFN) [144], which we call the phase-modulated neural network (PMNN). The PMNN has an embedded structure that allows encoding the feedback model's dependency on the movement phase as a form of post-processing, whose form is determined automatically from data during the learning process. Moreover, the structure ensures convergence of the adapted motion plan, due to a modulation with the phase velocity. Diagrammatically, the PMNN is depicted in Figure IV.7.

Figure IV.7: Phase-modulated neural network (PMNN) with one-dimensional output coupling term c.

The PMNN consists of:

- input layer: The input is Δs = s_actual − s_expected.

- regular hidden layers: The regular hidden layers perform non-linear feature transformations on the high-dimensional inputs. If there are L layers, the output of the l-th layer is:

h_l = a_l (W_{h_l s} Δs + b_{h_l})                for l = 1
h_l = a_l (W_{h_l h_{l−1}} h_{l−1} + b_{h_l})     for l = 2, ..., L    (IV.27)

a_l is the activation function of the l-th hidden layer, which can be tanh, ReLU, or others. W_{h_1 s} is the weight matrix between the input layer and the first hidden layer. W_{h_l h_{l−1}} is the weight matrix between the (l−1)-th hidden layer and the l-th hidden layer. b_{h_l} is the bias vector of the l-th hidden layer.

- final hidden layer with phase kernel modulation: This special and final hidden layer takes care of the dependency of the model on the movement phase. The output of this layer is m, which is defined as:

m = ψ ⊙ (W_{m h_L} h_L + b_m)    (IV.28)

where ⊙ denotes the element-wise product of vectors.
ψ = [ψ_1 ψ_2 ... ψ_N]ᵀ is the phase kernel modulation vector, and each component ψ_i is defined as:

ψ_i(p, u) = (φ_i(p) / Σ_{j=1}^N φ_j(p)) u,    i = 1, ..., N    (IV.29)

with phase variable p and phase velocity u, which come from the second-order canonical system defined in Equations IV.4 and IV.5. φ_i(p) is the radial basis function (RBF) as defined in Equation IV.7. We use N = 25 phase RBF kernels both in the PMNNs as well as in the DMP representation. The phase kernel centers have equal spacing in time, and we place these centers in the same way in the DMPs as well as in the PMNNs.

- output layer: The output of this layer is the one-dimensional coupling term c, which is defined as:

c = w_{cm}ᵀ m    (IV.30)

w_{cm} is the weight vector. Please note that there is no bias introduced in the output layer; hence if m = 0 – which occurs when the phase velocity u is zero – then the coupling term c is also zero. This ensures that the coupling term is initially zero when a primitive is started. The coupling term will also converge to zero because the phase velocity u converges to zero. This ensures the convergence of the adapted motion plan.

For an M-dimensional coupling term, we use M separate PMNNs with the same input vector Δs, where the output of each PMNN corresponds to one dimension of the coupling term. This separation of neural networks for each coupling term dimension allows each network to be optimized independently of the others.

IV.6.2.4 Supervised Learning of Feedback Models

The feedback model parameters θ_PMNN can then be acquired by supervised learning that optimizes the following loss function:

L_SL = Σ_{n=1}^N ||c_n^target − h(Δs_n^demo; θ_PMNN)||₂²
     = Σ_{n=1}^N ||c_n^target − h(s_n^actual,demo − s_n^expected; θ_PMNN)||₂²    (IV.31)

where N is the number of data points available in the adapted/corrected behavior demonstration dataset.
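The PMNN layers above (Eqs. IV.27-IV.30) can be sketched as a single numpy forward pass; the parameter layout, Gaussian RBF form, and tanh activations here are illustrative assumptions:

```python
import numpy as np

def pmnn_forward(delta_s, p, u, params, centers, widths):
    """Forward pass of a one-output PMNN (Eqs. IV.27-IV.30), as a sketch.

    delta_s: sensor trace deviation input, shape (D,)
    p, u:    phase variable and phase velocity from the canonical system
    params:  dict with hidden-layer weight/bias lists 'W', 'b', the
             phase-modulated layer 'W_m', 'b_m', and output weights 'w_cm'
    """
    # Regular hidden layers with tanh activations (Eq. IV.27).
    h = delta_s
    for W, b in zip(params["W"], params["b"]):
        h = np.tanh(W @ h + b)
    # Phase kernel modulation vector psi (Eq. IV.29): normalized RBFs in
    # the phase variable p, scaled by the phase velocity u.
    phi = np.exp(-widths * (p - centers) ** 2)
    psi = phi / phi.sum() * u
    # Final hidden layer, element-wise modulated by psi (Eq. IV.28).
    m = psi * (params["W_m"] @ h + params["b_m"])
    # Output layer without bias (Eq. IV.30): c = 0 whenever u = 0.
    return params["w_cm"] @ m
```

Note how the bias-free output layer makes the convergence argument explicit: whenever the phase velocity u is zero, psi and hence m and c are exactly zero, regardless of the sensory input.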
We perform several experimental evaluations of our framework for learning feedback models on two testbeds: obstacle avoidance feedback, which is described in Chapter IV.8, and tactile feedback, which is presented in Chapter IV.9. However, before describing the experimental evaluations, in the next section we present a sample-efficient reinforcement learning extension of the feedback model learning framework, to tackle novel situations which were not seen during the initial supervised training of the feedback models.

IV.7 Reinforcement Learning of Feedback Models for Reactive Behaviors

In the previous section, we presented an expressive learning representation of feedback models capable of capturing dependency on the movement phase as well as ensuring convergence of the overall behaviors. Moreover, we presented a method to train the feedback models in a supervised manner, by learning from demonstrations of corrected behaviors. In this section, we show how to refine the feedback model via a sample-efficient reinforcement learning (RL) algorithm, after the initialization by the supervised learning process.

Our method extends the previous works on approaches for improving nominal behaviors via RL or Iterative Learning Control [111, 112, 113, 114, 115, 129, 130, 132, 133, 135, 145, 146] into an RL method for learning feedback models. This extension has two benefits. First, in the previous works on the improvement of nominal behaviors, if a new environment situation is encountered, the nominal behavior has to be re-learned. Our method – in contrast – learns a single adaptive behavior policy which is more general and is able to tackle multiple different environmental settings via adaptation of the nominal behavior. Second, due to our feedback model representation learning, our method maintains its performance on the previous settings, while expanding the range of the adaptive behavior to a new setting.
On the other hand, previous methods which refine the nominal behavior will be able to perform well on a new setting, but may not perform well on the previous settings they have seen before.

Our sample-efficient RL algorithm for feedback models performs the optimization in a lower-dimensional space, i.e. in the space of the weights of the radial basis functions (RBFs) centered on the phase variable p – which is equal in size to the parameters θ_f of the DMP forcing term – instead of the high-dimensional neural network parameter space θ_PMNN. This is similar in spirit to the PIGPS algorithm [79, 140]. The PIGPS algorithm breaks down the learning of end-to-end policies into three phases: 1) optimization of a low-dimensional policy through PI² [130, 132]; 2) rolling out the optimized policies and collecting data tuples of actions and observations; and finally 3) supervised training of the end-to-end vision-to-torque policy based on the data collected in phase 2). In our work, we follow a similar process. The main difference is that in our approach, we use the generated trajectory roll-outs to train a feedback model instead of an end-to-end policy.

A summary of our approach is presented in Algorithm IV.3, which takes the following as input:
- DMP forcing term parameters θ_f, which encode the nominal behavior
- initial feedback model parameters θ_PMNN
- corrected behavior dataset D_cdemo = {Q_cd, ω_cd, ω̇_cd, Δs} on some known environment settings, which was used to train θ_PMNN via Eqs. IV.24 and IV.31
- expected sensor traces s_expected
- initial policy exploration covariance Σ
- acceptable cost threshold J_thresh

We now describe the phases of this algorithm in more detail. We assume that the environment setting being improved upon is fixed throughout these phases.
Algorithm IV.3 Reinforcement Learning of Feedback Model for Reactive Motion Planning
1: function RLFB(θ_f, θ_PMNN, D_cdemo, s_expected, Σ, J_thresh)
2:   [T, ·, J] ← unroll(θ_f, θ_PMNN, s_expected)
3:   while ‖J‖₂ > J_thresh do
4:     θ_fc ← train_DMP({T})
5:     # Exploration:
6:     for k ← 1 to K do
7:       θ_fc,k ← sample(N(θ_fc, Σ))
8:       [·, ·, J′_k] ← unroll(θ_fc,k, 0, s_expected)
9:     # PI² update:
10:    [θ′_fc, Σ] ← PI2_CMA({(θ_fc,k, J′_k)} for k = 1..K, θ_fc)
11:    [T′_new, s_actual, ·] ← unroll(θ′_fc, 0, s_expected)
12:    Δs ← s_actual − s_expected
13:    D_cdemo,additional ← {T′_new, Δs}
14:    D_fb,train ← D_cdemo + D_cdemo,additional
15:    θ_PMNN ← train_DMP_FB(D_fb,train, θ_f)
16:    [T, ·, J] ← unroll(θ_f, θ_PMNN, s_expected)
17:  return θ_PMNN

Algorithm IV.4 Path Integral Policy Improvement with Covariance Matrix Adaptation (PI²-CMA) Update Function
1: function PI2_CMA({(θ_f,k, J_k)} for k = 1..K, θ_f)
2:   T ← length(J_k)
3:   for t ← 1 to T do
4:     for k ← 1 to K do
5:       S_{k,t} ← Σ_{t′=t}^{T} J_{k,t′}
6:       P_{k,t} ← exp(−(1/λ) S_{k,t}) / Σ_{k=1}^{K} exp(−(1/λ) S_{k,t})
7:     θ_f,new,t ← Σ_{k=1}^{K} P_{k,t} θ_f,k
8:     Σ_new,t ← Σ_{k=1}^{K} P_{k,t} (θ_f,k − θ_f)(θ_f,k − θ_f)^T
9:   θ_f,new ← [Σ_{t=1}^{T} (T−t) θ_f,new,t] / [Σ_{l=1}^{T} (T−l)]
10:  Σ_new ← [Σ_{t=1}^{T} (T−t) Σ_new,t] / [Σ_{l=1}^{T} (T−l)]
11:  return θ_f,new, Σ_new

IV.7.1 Phase 1: Evaluation of the Current Adaptive Behavior and Conversion to a Low-Dimensional Policy

Intuitively, the parameters θ_PMNN need to capture feedback terms for a variety of settings, and thus a high-capacity feedback model representation is required to represent the feedback variations of many settings. However, since we focus on improving the feedback term for the current setting only, we can utilize a lower-dimensional representation; optimization on the low-dimensional representation, instead of on the high-dimensional feedback model parameters θ_PMNN, helps us achieve sample-efficiency in our RL approach.
Therefore, our algorithm first converts the high-dimensional policy into a low-dimensional one, which happens in two steps. First, the algorithm evaluates the current adaptive behavior based on the high-dimensional policy θ_PMNN, as done in lines 2 and 16 of Algorithm IV.3, which results in the roll-out trajectory T and the cost per time step J. Second, the algorithm compresses the observed trajectory T into another DMP with low-dimensional forcing term parameters θ_fc, where θ_fc is a set of 25 weights (see line 4 of Alg. IV.3).

IV.7.2 Phase 2: Optimization of the Low-Dimensional Policy

Once the low-dimensional parametrization θ_fc is available, the optimization can be done via the PI² algorithm [130]. This consists of three steps: policy exploration, policy evaluation, and policy update. In order to do exploration, we model the noisy version of the transformation system from Eq. IV.2 (without the coupling term c) as:

τ² ω̇ = α_ω (β_ω 2 log(Q_g ∘ Q*) − τ ω) + (θ_fc + ε)^T ψ

with zero-mean multivariate Gaussian noise ε ∼ N(0, Σ), where ψ = [ψ_1 ψ_2 … ψ_N]^T is the phase kernel vector and each component ψ_i is as defined in Eq. IV.29. In the policy exploration step, we sample K policies from the multivariate Gaussian distribution N(θ_fc, Σ), as done in line 7 of Alg. IV.3. In the policy evaluation step, we roll out K trajectories, one trajectory for each sampled policy, and evaluate their costs, as done in line 8 of Alg. IV.3. In the policy update step, the algorithm performs a weighted combination of the policies based on their costs: the policies with lower costs are prioritized over those with higher costs, as done in Algorithm IV.4.

IV.7.3 Phase 3: Rolling Out the Improved Low-Dimensional Policy

Once the low-dimensional policy is improved, the algorithm needs to transfer the improvement to the high-dimensional feedback model policy θ_PMNN. This transfer is done by rolling out the improved low-dimensional policy θ′_fc on the real system (line 11 of Alg. IV.3).
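The probability-weighted update of Algorithm IV.4, together with the exploration sampling step, can be sketched as follows. This is a minimal illustration under our own assumptions (the temperature λ, the toy cost, and all names are illustrative, not the thesis implementation):

```python
import numpy as np

def pi2_cma_update(theta_samples, costs, theta_mean, lam=1.0):
    """One PI2-CMA update: exponentiate the cost-to-go, average the sampled
    parameters per time step, then weight the time steps by (T - t).
    theta_samples: (K, N) sampled policies; costs: (K, T) per-step costs."""
    K, T = costs.shape
    S = np.cumsum(costs[:, ::-1], axis=1)[:, ::-1]           # cost-to-go S_{k,t}
    P = np.exp(-S / lam)
    P /= P.sum(axis=0, keepdims=True)                        # probabilities P_{k,t}
    theta_t = P.T @ theta_samples                            # (T, N) per-step means
    diffs = theta_samples - theta_mean                       # (K, N)
    Sigma_t = np.einsum('kt,ki,kj->tij', P, diffs, diffs)    # per-step covariances
    w = (T - np.arange(T)).astype(float)
    w /= w.sum()                                             # temporal weights (T - t)
    return w @ theta_t, np.tensordot(w, Sigma_t, axes=1)

# exploration + update on a toy quadratic cost
rng = np.random.default_rng(0)
theta_mean, Sigma = np.zeros(5), np.eye(5)
samples = rng.multivariate_normal(theta_mean, Sigma, size=16)
costs = np.repeat(np.sum(samples ** 2, axis=1)[:, None], 10, axis=1)
theta_new, Sigma_new = pi2_cma_update(samples, costs, theta_mean)
```

On this toy cost, the update shifts the mean toward the low-cost (small-norm) samples and shrinks the exploration covariance around them, which is the behavior exploited in the policy update step above.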
This roll-out generates the trajectory T′_new (which consists of the trajectories of the orientation Q, angular velocity ω, and angular acceleration ω̇) and the corresponding sensor trace deviations Δs (line 12 of Alg. IV.3), which are both part of the additional feedback model training data D_cdemo,additional (line 13 of Alg. IV.3).

IV.7.4 Phase 4: Supervised Learning of the Feedback Model

Finally, the algorithm does supervised training of the feedback model θ_PMNN on the combined dataset D_cdemo + D_cdemo,additional, as done in line 15 of Alg. IV.3, following our approach in section IV.6. Please note that initially, before the RL algorithm is performed, θ_PMNN is trained in a supervised manner only on the dataset D_cdemo, which is the dataset of corrected behaviors on several known environment settings. The additional training data D_cdemo,additional improves the performance of the feedback model θ_PMNN on the new environment setting on which the RL algorithm is performed. We repeat these phases until the norm of the cost ‖J‖₂ converges to or below the threshold J_thresh.

IV.8 Learning Obstacle Avoidance Feedback Model Testbed

In this testbed, we use raw sensor traces s = s_actual as the feedback model input, and a feed-forward neural network (FFNN) with output post-processing as the learning representation of the feedback model. We follow the variant of the PI² algorithm with covariance matrix adaptation [132].

IV.8.1 Neural Network Specifications and Input-Output Details

Since we consider meaningful input features, which we believe to have an influence on obstacle avoidance behaviors, we do not require the neural network to learn this abstraction. In all our experiments detailed below we use the same neural network structure: the network has a depth of 3 layers, with 2 hidden layers of 20 and 10 rectified linear units (ReLU) [147] each, and a sigmoid output layer.
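This architecture can be restated as a plain forward pass. The sketch below (in numpy, with our own illustrative initialization) only reflects the structure; the actual network was trained with the Levenberg-Marquardt algorithm in MATLAB, and its 17-dimensional input is detailed next:

```python
import numpy as np

def init_ffnn(rng, sizes=(17, 20, 10, 3)):
    """Random weights for a 17-20-10-3 network (initialization illustrative)."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def ffnn_forward(params, x):
    """Two ReLU hidden layers followed by a sigmoid output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)          # ReLU hidden units
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))   # sigmoid output layer

rng = np.random.default_rng(1)
params = init_ffnn(rng)
c_nn = ffnn_forward(params, np.ones(17))        # 3-D raw coupling term output
```

The sigmoid output keeps the raw network prediction bounded; the post-processing described below then maps it to a safe coupling term.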
The total number of inputs is 17, which is the dimensionality of s = s_actual, and comprises:
(i). the vectors to the 3 closest points between the obstacle and the end-effector (9 inputs),
(ii). the vector between the obstacle center and the end-effector (3 inputs),
(iii). the movement-duration-multiplied velocity of the end-effector (τv, 3 inputs)*,
(iv). the distance between the obstacle center and the end-effector (1 input), and
(v). the angle between the end-effector velocity vector and the vector from the end-effector to the obstacle center (1 input).
The number of outputs is 3, for the three dimensions of the coupling term c_r. Weights and biases are randomly initialized and trained using the Levenberg-Marquardt algorithm. We use the MATLAB Neural Networks toolbox in our experiments [148].

* We select the feature τv instead of just v so that this feature is invariant with respect to the movement duration τ.

IV.8.2 Neural Network Output Post-Processing for Obstacle Avoidance Testbed

Particular care has to be taken when applying neural network predictions in a control loop on a real robot. Extrapolation behavior of neural networks can be difficult to predict and comes without any guarantees of reasonable bounds in unseen situations. In a problem like ours, it is nearly impossible to collect data for all possible situations that might be encountered by the robot. As a result, it is important to apply some extra constraints, based on intuition, to the predictions of the neural network. The coupling term c_r as a function of a set of raw sensor traces s = s_actual is defined as:

c_r = P(h_NN(s))     (IV.32)

where P(·) denotes the post-processing steps applied to the neural network's output to ensure safe behavior, and h_NN(·) is the neural network mapping serving as the feedback model. One common problem is that in some situations, we physically expect the coupling term to be 0 or near 0, but due to noise in the human data, c_r is not necessarily 0 in these cases.
For instance, after having avoided the obstacle, we should ensure goal convergence by preventing the coupling term from staying active. With such cases in mind, the external constraints applied to the output of the neural network while unrolling are as follows:
(i). Set the coupling term in the x-direction to 0: In the transformed local coordinate system, the movements of the obstacle avoidance and the baseline trajectory are identical in the x-direction. This means that the coupling term in this dimension can be set to 0. The post-processed coupling term becomes

P([c_NN,x, c_NN,y, c_NN,z]^T) = [0, c_NN,y, c_NN,z]^T     (IV.33)

(ii). Exponentially reduce the coupling term to 0 on passing the obstacle: We would like to stop the coupling term once the robot has passed the obstacle, to ensure convergence to the goal. In the local coordinate frame, this can easily be realized by comparing the x-coordinate of the end-effector with the obstacle location. To adjust to the size of the obstacle and to multiple obstacles, this post-processing can be modified to take into account the obstacle size and the location of the last obstacle. We exponentially reduce the coupling term output in all dimensions once we have passed the obstacle. The post-processing becomes:

P(c_NN) = c_NN exp(−(x_o − x_ee)²),  if x_o < x_ee
P(c_NN) = c_NN,                       otherwise

where x_o is the x-coordinate of the obstacle and x_ee is the x-coordinate of the end-effector.
(iii). Set the coupling term to 0 if the obstacle is beyond the goal: If the obstacle is beyond the goal, the coupling term should technically be 0 (as humans do not deviate from the original trajectory). This is easily taken care of by setting the coupling term to 0 in such situations:

P(c_NN) = 0,     if x_o > x_goal
P(c_NN) = c_NN,  otherwise

where x_o and x_goal are the x-coordinates of the obstacle and goal, respectively.
Note how all the post-processing steps leverage the local coordinate transformation.
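Taken together, constraints (i)-(iii) can be sketched as a single post-processing function, assuming the local frame in which the motion progresses along x (variable names are our own illustrative choices):

```python
import numpy as np

def postprocess_coupling(c_nn, x_ee, x_obs, x_goal):
    """Safety constraints on the raw network output, in the local frame."""
    c = np.array(c_nn, dtype=float)
    c[0] = 0.0                                   # (i) no coupling along x
    if x_obs > x_goal:                           # (iii) obstacle beyond goal
        return np.zeros(3)
    if x_obs < x_ee:                             # (ii) obstacle already passed:
        c *= np.exp(-(x_obs - x_ee) ** 2)        #      decay coupling toward zero
    return c
```

For example, with the obstacle 2 units behind the end-effector, the y- and z-coupling are scaled by exp(-4), so the trajectory quickly returns to the nominal plan and converges to the goal.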
This post-processing, while not necessarily helping the network generalize to unseen situations, makes it safe for deployment on a real robot. With this learning framework and the local coordinate transformation, we are now ready to tackle the problem of obstacle avoidance using coupling terms. Next, we describe the experimental setup of our framework.

Figure IV.8: Data collection setting and different obstacle geometries used in the experiment. (a) Data collection setting using Vicon objects to represent the end-effector, obstacle, start, and goal positions. (b) Different types of obstacles used in data collection, from left to right: cube, cylinder, and sphere.

IV.8.3 Experimental Setup

To record human demonstrations we used a Vicon motion capture system at a 25 Hz sampling rate, with markers at the start position, goal position, obstacle positions, and the end-effector. These can be seen in Figure IV.8. In total there are 40 different obstacle settings, each corresponding to one obstacle position in the setup. We collected 21 demonstrations of the baseline (no obstacle) behavior and 15 demonstrations of obstacle avoidance for each obstacle setting, with three different obstacles: sphere, cube, and cylinder. From all baseline demonstrations, we learned one baseline primitive, and all obstacle avoidance behaviors are assumed to be deviations from the baseline primitive, whose degree of deviation depends on the obstacle setting. Some examples of the obstacle avoidance demonstrations can be seen in Figure IV.9. Even though the Vicon setup only tracked about 4-6 Vicon markers for each obstacle geometry, we augmented the obstacle representation with more points to represent the volume of each obstacle object.

Figure IV.9: Sample demonstrations. (a) All nominal demonstrations (no obstacles). (b) Sphere obstacle avoidance demonstrations. (c) Cube obstacle avoidance demonstrations. (d) Cylindrical obstacle avoidance demonstrations.
(b), (c), and (d) are a sample set of demonstrations for 1 out of the 40 settings.

IV.8.4 Experimental Evaluations

We evaluate our approach in simulation and on a real robot. First, we use the obstacle avoidance demonstrations collected as detailed above to extensively evaluate our learning approach in simulation. In the simulated obstacle avoidance setting, we first learn a coupling term model and then unroll the primitive with the learned neural network. We perform three types of experiments: learning/unrolling per single obstacle setting, learning/unrolling across multiple settings, and unrolling on unseen settings after learning across multiple settings. We also compare our neural network against the features developed in the previous work by Rai et al. [1]. This involves defining a grid of hand-designed features and using Bayesian regression with automatic relevance determination (ARD) to remove the redundant features. We use four performance metrics to measure the performance of our learning algorithm:
(i). Training NMSE (normalized mean squared error), calculated as the mean squared error between the target and the fitted coupling term, normalized by the variance of the regression target.
(ii). Test NMSE on a set of examples held out from training.
(iii). Closest distance to the obstacle along the obstacle avoidance trajectory.
(iv). Convergence to the goal of the obstacle avoidance trajectory.
Finally, we train a neural network across multiple settings and deploy it on a real robot.

IV.8.5 Per setting experiments

The per setting experiments were conducted on each setting separately. We tried to incorporate demonstrations of near and far-away obstacles. In total we test on 120 scenarios, comprising 40 settings per obstacle type (sphere, cylinder, and cube). A neural network was trained and unrolled on the particular setting in question. For comparison, the model defined in the previous work by Rai et al.
[1] was also trained on the same coupling term target as the neural network. First, we evaluate and compare the ability of the models to fit the training data and generalize to the unseen test data (80/20 split). The consolidated results for these experiments can be found in Figure IV.10, where we show the training and testing normalized mean squared error (NMSE). The top row (plots (a) and (b)) shows results over all 120 scenarios, with the NMSE averaged across the 3 dimensions.

Figure IV.10: Histograms describing the results of training and testing using a neural network (left plots) and the model from the previous work by Rai et al. [1] (right plots). (a) and (b) are the average NMSE across all dimensions, generated over the complete dataset. (c) and (d) are the NMSE over the dominant axis of demonstrations with obstacle avoidance.

Table IV.1: Results of the per setting experiments. Negative distance to obstacle implies a collision.

                                      Distance to goal   Distance to obstacle   Number
                                      max      mean      min       mean         of hits
  Baseline Demonstration              0.017    0.017     -0.451    4.992        4
  Model from Rai et al. [1]           1.218    0.072     -0.520    5.038        2
  Neural Network                      0.113    0.016     0.083     5.241        0
  Obstacle Avoidance Demonstration    0.075    0.028     1.409     5.461        0

The histograms show for how many settings we achieved a particular training/testing NMSE. As can be seen, when using the neural network, we achieved an NMSE of 0.1 or lower (for both training and testing data) in all scenarios, indicating that the neural network is indeed flexible enough to fit the data. The same is not true for the model from the previous work by Rai et al.
[1] (plot (b)). However, a large portion of these settings have the obstacle so far away that there is no dominant axis of avoidance. The model from the previous work by Rai et al. [1] has a large training and testing NMSE in such cases. We separated out the demonstrations that have a dominant axis of obstacle avoidance (43 scenarios) and show the results for the dominant dimension of obstacle avoidance in plots (c) and (d) of Figure IV.10. As expected, the performance of the features from the previous work by Rai et al. [1] improves, but is still far behind the performance of the neural network. The features in the previous work by Rai et al. [1] are unable to fit the human data satisfactorily, as illustrated by the high training NMSE. On further study, we found that the issue with large regression weights using Bayesian regression with ARD, as mentioned by the authors, can be explained by a mismatch between the coupling term model used and the target set. This also explains why they were not able to fit coupling terms across settings. The low training NMSE in Figure IV.10 (a) and (c) shows the versatility of our neural network at fitting the data very well per setting. Low test errors show that we were able to fit the data well without over-fitting. Note that the performance during unrolling for the same obstacle setting can differ from the training demonstrations. When unrolling, the DMP can reach states that were never explored during training, and depending on the generalization of our model, we might end up hitting the obstacle or diverging from our initial trajectory. This brings up two points: one, we want to avoid the obstacle, and two, we want to converge to our goal in the prescribed time. We test both methods on these two metrics, and the results are summarized in Table IV.1. We compare the two learned coupling term models to the baseline trajectory, as well as to human demonstrations of obstacle avoidance.
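For reference, the NMSE metric used throughout these comparisons is simply the mean squared error divided by the variance of the regression target:

```python
import numpy as np

def nmse(target, prediction):
    """Mean squared error normalized by the variance of the target,
    as used for the training/test NMSE numbers reported here."""
    target = np.asarray(target, dtype=float)
    return np.mean((target - np.asarray(prediction)) ** 2) / np.var(target)

t = np.array([0.0, 1.0, 2.0, 3.0])
err_mean = nmse(t, np.full(4, t.mean()))   # predicting the mean of the target
```

By construction, a perfect fit gives NMSE 0, and a model that only predicts the mean of the target gives NMSE 1, which is why values well below 1 indicate a meaningful fit.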
While the neural network never hits the obstacle, the model from the previous work by Rai et al. [1] hit the obstacle twice. Likewise, the model from the previous work by Rai et al. [1] does not always converge to the goal, while the neural network always converges to the goal. The mean distance to goal and mean distance from obstacle for both methods are comparable to the human demonstrations.

IV.8.6 Multiple setting experiments

Table IV.2: Results of the multi setting experiments. Negative distance to obstacle implies a collision.

                      NMSE             Distance to goal   Distance to obstacle   Number
                      train    test    max      mean      mean      min          of hits
  Sphere   Baseline   -        -       0.017    0.017     4.739     -0.025       1
           Unrolled   0.155    0.152   0.152    0.018     5.063     0.433        0
  Cube     Baseline   -        -       0.017    0.017     5.722     -0.451       1
           Unrolled   0.164    0.159   0.145    0.015     5.964     1.045        0
  Cylinder Baseline   -        -       0.017    0.017     4.514     -0.280       2
           Unrolled   0.195    0.195   0.078    0.014     5.117     0.750        0

To test if our model generalizes across multiple settings of obstacle avoidance, we train three neural networks, each over the 40 obstacle avoidance demonstrations per object. The results are summarized in Table IV.2. The neural network has relatively low training and testing NMSE for the three obstacles. To test the unrolling, each of the networks was used to avoid the 40 settings it was trained on. As can be seen from the distance-to-obstacle and number-of-hits columns, the unrolled trajectories never hit an obstacle. They also converged to the goal in all the unrolled examples. One example of unrolling on a trained setting can be seen in Figure IV.11a.

Figure IV.11: Sample unrolled trajectories on (a) trained and (b) unseen settings.

This shows that our neural network was able to learn coupling terms across multiple settings and produce human-like, reliable obstacle avoidance behavior, unlike previous coupling term models in the literature. When we trained our network across all three obstacles, however, the performance deteriorated.
We think this is because our chosen inputs are very local in nature, and to avoid multiple obstacle types the network needs a global input.

IV.8.7 Unseen setting experiments

To test generalization across unseen settings, we tested our trained model on 63 unseen settings, initialized on a close 7×3×3 grid around the baseline trajectory. We purposely made our unseen settings much harder than the trained settings. Out of the 63 settings, the baseline hit the obstacle in 35 demonstrations, as can be seen in Table IV.3. While our models were trained on spheres, cubes, and cylinders, they were all tested on spherical obstacles for simplicity. Please note that while a model trained for cylinders can avoid spherical obstacles, behaviorally the unrolled trajectory looks more like that of cylindrical obstacle avoidance than spherical.

Table IV.3: Results of the unseen setting experiments. Negative distance to obstacle implies a collision.

             Distance to goal   Distance to obstacle   Number
             max      mean      mean      min          of hits
  Initial    0.017    0.017     0.095     -0.918       35
  Sphere     0.011    0.034     0.933     -0.918       2
  Cube       0.021    0.119     2.235     1.172        0
  Cylinder   0.033    0.120     1.704     -0.103       1

As can be seen from Table IV.3, our models were able to generalize to unseen settings quite well. When trained on sphere obstacle settings, our approach hit the obstacle in 2 out of 63 settings; when trained on cylinder settings, we hit it once; and when trained on cube settings, we never hit an obstacle. All the models converged to the goal in all the settings. An example unrolling can be seen in Figure IV.11b.

IV.8.8 Real robot experiment

Finally, we deploy the trained neural network on a 7-degree-of-freedom Barrett WAM arm with a 300 Hz real-time control rate, and test its performance in avoiding obstacles. We again use Vicon objects, tracked in real-time at a 25 Hz sampling rate, to represent the obstacle.
Some snapshots of the robot avoiding a cylindrical obstacle, using a neural network trained on multiple cylindrical obstacles, can be seen in Figure IV.12. The video can be seen at https://youtu.be/hgQzQGcyu0Q. These are very promising results which show that a neural network with intuitive features and physical constraints can generalize across several settings of obstacle avoidance. It can avoid obstacles in settings never seen before, and converge to the goal in a stable way. This is a starting point for learning general feedback terms from data that can generalize robustly to unseen situations.

Figure IV.12: Snapshots from our experiment on our real system. Here the robot avoids a cylindrical obstacle using a neural network that was trained on cylindrical obstacle avoidance demonstrations. See https://youtu.be/hgQzQGcyu0Q for the complete video.

IV.9 Learning Tactile Feedback Model Testbed

In this testbed, we use sensor trace deviations Δs = s_actual − s_expected as the feedback model input, and a phase-modulated neural network (PMNN) as the learning representation of the feedback model.

IV.9.1 System Overview and Experimental Setup

This work is focused on learning to correct tactile-driven manipulation with tools. As shown in Figure IV.13, our experimental scenario involves a demonstrator teaching our robot to perform a scraping task, utilizing a hand-held tool to scrape the surface of a dry-erase board that may be tilted by a tilt stage. The system is taught this skill at a default tilt angle, and needs to adapt its behavior when the board is tilted away from that default angle such that it can still scrape the board effectively with the tool.

Figure IV.13: Experimental setup of the scraping task.

A few important points to note in our experiment: The system is driven only by tactile sensing to inform the adaptation; neither vision nor a motion capture system plays a role in driving the adaptation.
A Vicon motion capture system is used as an external automatic scoring system to measure the performance of the scraping task. The performance in terms of cost for reinforcement learning is defined in section IV.9.1.5. One of the main challenges is that the tactile sensors interact indirectly with the board, i.e. through the tool adapter and the scraping tool via a non-rigid contact, and the robot does not explicitly encode the tool kinematics model. This makes hand-designing a feedback gain matrix for contact control difficult. Next, we explain the experimental setup and some lessons learned from the experiments.

IV.9.1.1 Hardware

The demonstrations were performed on the right arm and the right hand of our bi-manual robot. The arm is a 7-degrees-of-freedom (DoF) Barrett WAM arm which is also equipped with a 6D force-torque (FT) sensor at the wrist. The hand is a Barrett hand whose left and right fingers are equipped with biomimetic tactile sensors (BioTacs) [16]. The two BioTac-equipped fingers were set up to perform a pinch grasp on a tool adapter. The tool adapter is a 3D-printed object designed to hold a scraping tool with an 11 mm-wide tool-tip. The dry-erase board was mounted on a tilt stage whose orientation can be adjusted to create static tilts of ±10° in roll and/or pitch with respect to the robot global coordinates, as shown in Figure IV.13. Two digital protractors with 0.1° resolution (Wixey WR 300 Digital Angle Gauge) were used to measure the tilt angles during the experiment. A set of Vicon markers was placed on the surface of the scraping board, and another set of Vicon markers was placed on the tool adapter. The Vicon motion capture system tracks both sets of markers in order to compute the relative orientation of the scraping tool w.r.t. the scraping board, to evaluate the cost during the reinforcement learning part of the experiment. The cost is defined in section IV.9.1.5.
IV.9.1.2 Environmental Settings Definition and Demonstrations with Sensory Traces Association

For our robot experiment, we considered 7 different environmental settings, each associated with a specific roll angle of the tilt stage, namely 0°, 2.5°, 5°, 6.3°, 7.5°, 8.8°, and 10°. At each setting, we fixed the pitch angle at 0° and maintained the scraping path at roughly the same height. Hence, we assume that of the 6D pose action (x-y-z-pitch-roll-yaw), the necessary correction/adaptation is only in the roll-orientation. Here are some definitions of the environmental settings in our experiment:
- The default setting is the setting where we expect the system to experience the expected sensor traces s_expected when executing the nominal behavior without the feedback model. We define the setting with roll angle at 0° as the default setting, while the remaining settings become the non-default ones.
- The known/seen settings are a subset of the non-default settings on which we collected the demonstration dataset to initialize the feedback model via supervised training.
- The initially unknown/unseen setting is a non-default setting, disjoint from the known/seen settings, on which the feedback model will be refined with RL.
- The unknown/unseen setting is a non-default setting located between the known/seen settings and the initially unknown/unseen setting, on which we will evaluate the generalization capability of the feedback model after the RL process on the initially unknown/unseen setting has been completed. After the RL phase, the feedback model has been trained on both the known/seen settings and the initially unknown/unseen setting. Since the unknown/unseen setting is located in between these previous settings, the feedback model is expected to generalize its performance/behavior to some extent to this new setting.
For the demonstrated actions, we recorded the 6D pose trajectory of the right hand end-effector at a 300 Hz rate, and along with these demonstrations, we also recorded the multi-dimensional sensory traces associated with this action. The sensory traces are the 38-dimensional tactile signals from the electrodes of the left and right BioTacs, sampled at 100 Hz.

IV.9.1.3 Learning Pipeline Details and Lessons Learned

DMPs provide kinematic plans to be tracked with a position control scheme. However, for tactile-driven contact manipulation tasks such as the scraping task in this chapter, using position control alone is not sufficient. This is because tactile sensors require some degree of force control upon contact, in order to attain consistent tactile signals on repetitions of the same task, during the demonstrations as well as during the robot's execution. Moreover, for initializing the feedback model via supervised learning, we need to collect several demonstrations of corrected behaviors at a few of the known/seen non-default settings, as described in section IV.9.1.2. While it is possible to perform the corrected demonstrations solely by humans, the sensor traces obtained might be significantly different from the traces obtained during the robot's execution of the motion plan. This is problematic, because the inputs to the feedback model would then differ between the learning and prediction phases of the feedback terms. Hence, we instead let the robot execute the nominal plans, and only provide corrections by manually adjusting the robot's execution at different settings as necessary. Therefore, we use the force-torque (FT) sensor in the robot's right wrist for FT control, with two purposes: (1) to maintain tool-tip contact with the board, such that consistent tactile signals are obtained, and (2) to provide compliance, allowing the human demonstrator to perform corrective action demonstrations while the robot executes the nominal behavior.
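A proportional-integral (PI) force controller of the kind used for purpose (1) can be sketched as follows. The gains and the toy first-order contact model are our own illustrative assumptions, not the thesis implementation:

```python
def pi_force_control(f_measured, f_desired, integ, dt, kp=0.5, ki=2.0):
    """One step of a PI force controller: returns a command correction
    and the updated error integral. Gains are illustrative."""
    err = f_desired - f_measured
    integ += err * dt
    return kp * err + ki * integ, integ

# drive a hypothetical first-order contact model toward a 1 N setpoint
f, integ, dt = 0.0, 0.0, 0.01
for _ in range(2000):
    u, integ = pi_force_control(f, 1.0, integ, dt)
    f += dt * (u - 0.5 * f)      # toy contact/force dynamics, for illustration
```

The integral term is what removes the steady-state error so the measured contact force settles at the setpoint, which is the property needed to obtain consistent tactile signals across repetitions.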
For simplicity, we set the force control set points in our experiment to be constant. We need to set the force control set point carefully: if the downward force (in the z-axis direction) for contact maintenance is too big, friction will block the robot from executing the corrections as commanded by the feedback model. We found that 1 Newton is a reasonable value for the downward force control set point. Regarding the learning process pipeline as depicted in Fig. IV.2 and Fig. IV.4, here we provide the details of our experiment:
(i) Nominal movement primitives acquisition: While the robot is operating in gravity-compensation mode and the tilt stage is at 0° roll angle (the default setting), the human demonstrator guided the robot's hand to kinesthetically perform the scraping task, which can be divided into three stages, each of which corresponds to a movement primitive:
(a) primitive 1: starting from its home position above the board, go down (in the z-axis direction) until the scraping tool makes contact with the scraping board's surface (no orientation correction at this stage),
(b) primitive 2: correct the tool-tip orientation such that it makes a full flat tool-tip contact with the surface,
(c) primitive 3: go forward in the y-axis direction while scraping the surface, applying orientation correction as necessary to maintain full flat tool-tip contact with the surface.
For robustness, we learn the above primitives, which represent the nominal behavior, from multiple demonstrations. In particular, we collected L = 11 human demonstrations of nominal behaviors, and use the semi-automated trajectory alignment and segmentation algorithm mentioned in section IV.5.
We extract the reference trajectory segments containing primitives 1 and 3 from the first demonstration, by using the Zero Velocity Crossing (ZVC) method [35] and local minima search refinement on the filtered velocity signals in the z and y axes, to find the initial guesses of the segmentation points of primitives 1 and 3, respectively. The remaining part, between the end of primitive 1 and the beginning of primitive 3, becomes primitive 2. Afterwards, we perform the automated alignment and weighted least square segmentation on the remaining demonstrations as outlined in section IV.5. We encode each of these primitives with position and orientation DMPs.

Table IV.4: Force-torque control schedule for steps (ii)-(v).

              Force-Torque Control Activation Schedule
              Prim. 1   Prim. 2            Prim. 3
  Step (ii)   -         z 1 N              z 1 N
  Step (iii)  -         z 1 N, roll 0 Nm   z 1 N, roll 0 Nm
  Step (iv)   -         z 1 N              z 1 N
  Step (v)    -         z 1 N              z 1 N

For the following steps ((ii), (iii), (iv), and (v)), Table IV.4 indicates which force-torque control mode is active in each primitive of these steps. "z 1 N" refers to the 1 Newton downward z-axis proportional-integral (PI) force control, for making sure that consistent tactile signals are obtained on repetitions of the task; this is important for learning and for making correction predictions properly. "roll 0 Nm" refers to the roll-orientation PI torque control at 0 Newton-meter, for allowing corrective action demonstration.
(ii) Expected sensor traces acquisition: Still with the tilt stage at 0° roll angle (the default setting), we let the robot unroll the nominal primitives 15 times and record the tactile sensor traces. We encode each dimension of the 38-dimensional sensor traces as s_expected, using the standard DMP formulation.
(iii) Supervised feedback model learning on known/seen settings: Now we vary the tilt stage's roll angle to realize each of the known/seen environmental settings. At each setting, we let the robot unroll the nominal behavior. Besides the downward force control for contact maintenance, we now also activate the roll-orientation PI torque control at 0 Newton-meter throughout primitives 2 and 3. This allows the human demonstrator to perform the roll-
Besides the downward force control for contact maintenance, we now also activate the roll-orientation PI torque control at 0 Newton-meter throughout primitives 2 and 3. This allows the human demonstrator to perform the roll-orientation correction demonstration, to maintain a full flat tool-tip contact relative to the now-tilted scraping board. We recorded 15 demonstrations for each setting, from which we extracted the supervised dataset for the feedback model, i.e. the pairs of the sensory trace deviation Δs_demo and the target coupling term c_target as formulated in Eq. IV.24 and IV.31. Afterwards, we learn the feedback models from this dataset with PMNN.

(iv) Reinforcement learning of the feedback model on the initially unknown/unseen setting: We set the tilt stage's roll angle to encode the initially unknown/unseen setting. Using the reinforcement learning algorithm outlined in section IV.7, we refine the feedback model to improve its performance over trials in this setting.

(v) Adaptive behavior (nominal primitive and feedback model) unrolling/testing on all settings: We test the feedback models on the different settings on the robot:
- on the known/seen settings, whose corrected demonstrations are present in the initial supervised training dataset,
- on the initially unknown/unseen setting, on which reinforcement learning was performed, and
- on the unknown/unseen setting, which was seen neither during supervised learning nor during reinforcement learning, for testing the generalization capability of the feedback model.

For the purpose of our evaluations, we evaluate feedback models only on primitives 2 and 3, for roll-orientation correction. In primitive 1, we deem that there is no action correction, because the height of the dry-erase board surface is kept constant across all settings.

IV.9.1.4 Learning Representations Implementation

We implemented all of our models in TensorFlow [3] and use tanh as the activation function of the hidden layer nodes.
We use Root Mean Square Propagation (RMSProp) [40] as the gradient descent optimization algorithm and set the dropout [149] rate to 0.5 to avoid overfitting.

IV.9.1.5 Cost Definition for Reinforcement Learning

We define the performance/cost of the scraping task as the norm of the angular error between the relative orientation of the scraping tool w.r.t. the scraping board during the current task execution versus the relative orientation during the nominal task execution on the default environment setting. If the relative orientation of the scraping tool w.r.t. the scraping board during the current task execution at time t is denoted as quaternion Q_cr,t, and the relative orientation of the scraping tool w.r.t. the scraping board during the nominal task execution on the default environment setting at time t is denoted as quaternion Q_nr,t, then the cost at time t is:

J_t = || 2 log( Q_nr,t ∘ Q̄_cr,t ) ||_2        (IV.34)

where ∘ and Q̄ denote the quaternion composition and conjugation operations defined in Appendix B. The cost vector J in Algorithm IV.3 is defined as:

J = [ J_1  J_2  ...  J_T ]ᵀ        (IV.35)

These relative orientations are measured by a Vicon motion capture system throughout the execution of the tasks. We selected this form of cost because, in order to perform the scraping task successfully, the robot shall maintain a relative orientation of the scraping tool w.r.t. the scraping board similar to that of the nominal demonstrations.

After previously presenting the system overview and experimental setup, next we present the experimental evaluations of each component in our learning feedback models pipeline: the semi-automated extraction of nominal movement primitives from demonstrations, the supervised learning of feedback models, and the reinforcement learning of feedback models.

IV.9.2 Extraction of Nominal Movement Primitives by Semi-Automated Segmentation of Demonstrations

First, we show the effectiveness of our method for semi-automated segmentation of the nominal behavior demonstrations as outlined in section IV.5.
Our process begins with computing the correspondence matching between the reference segment and each guess segment using the Dynamic Time Warping (DTW) algorithm. For primitive 1, we perform the correspondence matching computation on the z-axis trajectory of the end-effector position, while for primitive 3, the matching is done on the y-axis trajectory. This selection is based on the definition of the primitives in step (i) of section IV.9.1.3. Once the segmentations of both primitives 1 and 3 are refined, the remaining segment between these two primitives becomes primitive 2.

IV.9.2.1 Correspondence Matching Results and Issues

Figure IV.14: 1 reference segment vs. 1 guess segment: the DTW-computed correspondence matching and refined segmentation results of primitive 1 based on the z-axis trajectory, compared between the un-weighted version and the weighted version. (a) Several (un-weighted) correspondence pairs between the reference segment (top) and a guess segment (bottom). (b) The eight (8) top-ranked weighted correspondence pairs between the reference segment (top) and a guess segment (bottom). (c) The result of trajectory segmentation refinement (after stretching for comparability) using the regular (un-weighted) least square method versus using the weighted least square method.

Some of the extracted correspondence matches during the segmentation of primitive 1 are shown in Fig. IV.14a and IV.14b, with numbered pairs as well as alphabet-indexed pairs indicating the matching points. Of the two possible causes of inaccurate correspondence pairs mentioned in section IV.5.3, correspondence pair B in Fig. IV.14a is an example of an incorrect match due to near-zero velocities. Including it in the regular least square fashion results in erroneous alignment and segmentation, as can be seen between the red (reference) and green (guess) curves in Figure IV.14c.
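As an illustration of the DTW step described above, here is a minimal sketch on a single axis; the weighting of correspondence pairs is handled separately and is not shown here.

```python
import numpy as np

# Minimal dynamic-time-warping sketch for matching a reference segment
# to a guess segment on a single axis (e.g. the z-axis trajectory of
# primitive 1). Returns (ref_idx, guess_idx) correspondence pairs.

def dtw_correspondences(ref, guess):
    n, m = len(ref), len(guess)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - guess[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # backtrack the optimal warping path into correspondence pairs
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Identical segments match index-for-index along the diagonal:
pairs = dtw_correspondences([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
assert pairs == [(0, 0), (1, 1), (2, 2)]
```

Near-zero-velocity regions produce many equally cheap matches, which is exactly the failure mode (pair B above) that motivates weighting the pairs afterwards.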
IV.9.2.2 Segmentation Results via Weighted Least Square Method

Figure IV.15: 1 reference segment vs. 10 guess segments: the refined segmentation results (space vs. time) of primitive 1 (based on the z-axis trajectory) and primitive 3 (based on the y-axis trajectory), compared between the un-weighted version and the weighted version, as well as the (space-only) 3D Cartesian plot of the segmented primitives. (a) The reference and guess segments before refinement for prim. 1 (top) and prim. 3 (bottom). (b) The result of trajectory segmentation refinement (after stretching for comparability) using the regular (un-weighted) least square method (left) versus using the weighted least square method (right) for prim. 1 (top) and prim. 3 (bottom). (c) The 3D Cartesian plot of the refined demonstrations' segmentation into primitives.

To mitigate the negative effect of the erroneous correspondence matches, in section IV.5 we proposed to associate each correspondence pair with a weight and to perform a weighted least square method to refine the segmentation. Fig. IV.14b shows eight (8) of the top-ranked correspondence pairs based on the weights, meaning that these features have the most significant influence on the result of the segmentation refinement using the weighted least square method. In Fig. IV.14c, we show the result of the segmentation refinement using the regular (un-weighted) least square method versus using the weighted least square method, as described in section IV.5.3. The red trajectory is the refined segment of the reference trajectory, while the green and blue trajectories are the refined segments of the initial segmentation guess trajectory using the least square method and the weighted least square method, respectively. Both the green and blue trajectories are displayed in their stretched versions, relative to their time durations, so that they become visually comparable to the red trajectory.
As can be seen, the alignment is much better in the weighted case than in the un-weighted one. Figure IV.15 shows the segmentation result of all trajectories. In Fig. IV.15a, we show the reference segment versus all (10) guess segments before the segmentation refinement for primitive 1 (top) and primitive 3 (bottom). Figure IV.15b shows the result of the segmentation refinement on all guess segments, comparing the regular (un-weighted) least square method (left) versus the weighted least square method (right) on primitive 1 (top) and primitive 3 (bottom); we see that the weighted least square method achieves the superior result, as the refined segmentations appear closer together (after the stretching of each refined segment for visual comparability). In Fig. IV.15c, we show the nominal behavior demonstrations as well as the primitive segments encoded with colors in 3D Cartesian space. Primitive 2 only contains orientation motion, thus it appears only as a point in 3D space. In this 3D Cartesian space plot, the segmentation results of the least square and the weighted least square methods are indistinguishable, and hence we only show the result of the weighted least square method in Fig. IV.15c. However, as seen in the space-time plots in Fig. IV.15b, the weighted least square method achieves superior performance, as the relative time delays between the corresponding demonstration segments (an un-modeled phenomenon when learning a DMP from multiple demonstrations) are minimized. During the supervised training of the nominal primitive parameters f, we use the un-stretched version of the segmented trajectories, as the DMP transformation system in Eq. IV.2 is able to take care of each trajectory's time-stretching via the motion duration parameter.
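The weighted least square refinement idea can be illustrated as follows, assuming for simplicity an affine time map between reference and guess segments fitted from weighted correspondence pairs; this is a sketch of the idea only, not the exact formulation of section IV.5.

```python
import numpy as np

# Weighted least-squares fit of an affine time map t_guess ~ a*t_ref + b
# from DTW correspondence pairs, where unreliable pairs (e.g. those found
# in near-zero-velocity regions) receive small weights. The affine-map
# assumption is an illustration, not the thesis's exact formulation.

def weighted_time_map(t_ref, t_guess, weights):
    W = np.diag(weights)
    X = np.stack([t_ref, np.ones_like(t_ref)], axis=1)   # columns [t_ref, 1]
    # normal equations of weighted least squares: (X^T W X) p = X^T W y
    a, b = np.linalg.solve(X.T @ W @ X, X.T @ W @ t_guess)
    return a, b

t_ref = np.array([0.0, 1.0, 2.0, 3.0])
t_guess = np.array([0.5, 1.5, 2.5, 10.0])   # last pair is an outlier match
weights = np.array([1.0, 1.0, 1.0, 1e-6])   # outlier strongly down-weighted
a, b = weighted_time_map(t_ref, t_guess, weights)
# The fit recovers the map of the reliable pairs (slope 1, offset 0.5):
assert abs(a - 1.0) < 1e-3 and abs(b - 0.5) < 1e-3
```

With uniform weights the outlier pair would drag the fit away, which is the erroneous alignment visible in the un-weighted results above.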
IV.9.3 Supervised Learning of Feedback Models

In accordance with the definition in section IV.9.1.2, for all experiments in section IV.9.3, the environmental settings are defined as follows:
- The default setting is the setting with the tilt stage's roll angle at 0°.
- The known/seen settings are the settings with the tilt stage's roll angle at 2.5°, 5°, 7.5°, and 10°.

To evaluate the performance of the feedback model after supervised training on the demonstration data, first we evaluate the regression and generalization ability of the PMNNs. Second, we show the superiority of the PMNNs as compared to regular feed-forward neural networks (FFNNs) as a learning representation for feedback models. Third, we investigate the importance of learning both the feature representation and the phase dependencies together within the framework of learning feedback models. Fourth, we show the significance of the phase modulation in feedback model learning. Finally, we evaluate the learned feedback model's performance in making predictions of action corrections online on a real robot.

We use the normalized mean squared error (NMSE), i.e. the mean squared prediction error divided by the target coupling term's variance, as our metric. To evaluate the learning performance of each model in our experiments, we perform a leave-one-demonstration-out test. In this test, we perform K iterations of training and testing, where K = 15 is the number of demonstrations per setting. At the k-th iteration:
- The data points of the k-th demonstration of all settings are left out as unseen data for generalization testing across demonstrations, while the remaining K - 1 demonstrations' data points are shuffled randomly and split 85%, 7.5%, and 7.5% for training, validation, and testing, respectively.
- We record the training-validation-testing-generalization NMSE pairs corresponding to the lowest generalization NMSE across learning steps.

Each demonstration (depending on the data collection sampling rate and demonstration duration) provides hundreds or thousands of data points.
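The NMSE metric defined above can be sketched as:

```python
import numpy as np

# Normalized mean squared error: mean squared prediction error divided
# by the target coupling term's variance, per output dimension.

def nmse(prediction, target):
    mse = np.mean((prediction - target) ** 2, axis=0)
    return mse / np.var(target, axis=0)

target = np.array([1.0, 2.0, 3.0, 4.0])
# Predicting the target mean everywhere yields NMSE == 1 by construction,
# so NMSE < 1 indicates a model better than the constant-mean predictor.
assert np.isclose(nmse(np.full(4, target.mean()), target), 1.0)
```

This normalization is what makes the NMSE values in Table IV.5 comparable across primitives with differently scaled coupling terms.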
Each demonstration – depending on the data collection sampling rate and demonstration duration – provides hundreds or thousands of data points. 101 We report the mean and standard deviation of training-validation-testing-generalization NMSEs acrossK iterations. IV .9.3.1 Regression and Generalization Evaluation of PMNNs The results for primitive 2 and 3, using PMNN structure with one regular hidden layer of 100 nodes, are shown in Table IV .5. The PMNNs achieve good training, validation, testing results, and a reasonable (across-demonstrations) generalization results for both primitives. Table IV .5: NMSE of the roll-orientation coupling term learning with leave-one-demonstration- out test, for each primitive. Roll-Orientation Coupling Term Learning NMSE Training Validation Testing Generalization Prim. 2 0.150.05 0.150.05 0.160.06 0.360.19 Prim. 3 0.220.05 0.220.05 0.220.05 0.320.13 IV .9.3.2 Performance Comparison between FFNN and PMNN We compare the performance between FFNN and PMNN. For PMNN, we test two structures: one with no regular hidden layer being used, and the other with one regular hidden layer com- prised of 100 nodes. For FFNN, we also test two structures, both uses two hidden layers with 100 and 25 nodes each –which is equivalent to PMNN with one regular hidden layer of 100 nodes but de-activating the phase modulation–, but with different inputs: one with only the sensor traces deviation s as 38-dimensional inputs, while the other one with s, phase variablep, and phase velocityu as 40-dimensional inputs. The second FFNN structure is chosen to see the effect of the inclusion of the movement phase information but not as a phase radial basis functions (RBFs) modulation, to compare it with PMNN. The results can be seen in Figure IV .16 (Top). It can be seen that PMNN with one regular hidden layer of 100 nodes demonstrated the best perfor- mance compared to the other structures. 
The FFNN with additional movement phase information performs significantly better than the FFNN without phase information, which shows that the movement phase information plays an important role in the coupling term prediction.

Figure IV.16: (Top) comparison of regression results on primitives 2 and 3 using different neural network structures; (Middle) comparison of regression results on primitives 2 and 3 using separated feature learning (PCA or autoencoder and phase kernel modulation) versus embedded feature learning (PMNN); (Bottom) the top 10 dominant regular hidden layer features for each phase RBF in primitive 2, roll-orientation coupling term, displayed in yellow. The less dominant ones are displayed in blue.

However, in general the PMNN with one regular hidden layer of 100 nodes outperforms the FFNN, even with the additional phase information. The PMNN with one regular hidden layer of 100 nodes is better than the one without a regular hidden layer, most likely because of the richer learned feature representation, without over-fitting to the data.

IV.9.3.3 Comparison between Separated versus Embedded Feature Representation and Phase-Dependent Learning

We also compare the effect of separating versus embedding the feature representation learning with the overall parameter optimization under phase modulation. Chebotar et al. [76] used PCA for feature representation learning, which was separated from the phase-dependent parameter optimization using reinforcement learning. On the other hand, PMNN embeds feature learning together with the overall parameter optimization under phase modulation, in an integrated training process. In this experiment, we used PCA retaining 99% of the overall data variance, reducing the data dimensionality to 7 and 6 (from originally 38) for primitives 2 and 3, respectively.
In addition, we also implemented an autoencoder, a non-linear dimensionality reduction method, as a substitute for PCA in representation learning. For the PMNNs, we used two kinds of networks: one with a regular hidden layer of 6 nodes (so that it becomes comparable with the PCA counterpart), and the other with a regular hidden layer of 100 nodes. Figure IV.16 (Middle) illustrates the superior performance of the PMNNs, due to the feature learning being performed together with the overall phase-dependent parameter optimization. Of the two PMNNs, the one with more nodes in the regular hidden layer performs better, because it can represent the mapping more accurately while not over-fitting to the data. Based on these evaluations, we decided to use PMNNs with one regular hidden layer of 100 nodes and 25 phase-modulated nodes in the final hidden layer for subsequent experiments.

IV.9.3.4 Evaluation of Movement Phase Dependency

In this part, we visualize the trained weight matrix mapping the output of the 100 nodes in the regular hidden layer to the 25 nodes in the final hidden layer being modulated by the phase RBFs. This weight matrix is of dimension 25 × 100, and each row shows how each of the 100 node outputs (or "features") in the regular hidden layer is weighted to become the input of a particular phase RBF-modulated node. In Figure IV.16 (Bottom), we display only the top 10 dominant regular hidden layer node outputs for each phase RBF-modulated node (in yellow), and the rest (in blue) are the less dominant ones. We see that the priority ranking differs between different phase RBF-modulated nodes, suggesting that there is some dependency of the feedback on the movement phase.

IV.9.3.5 Unrolling the Learned Feedback Model on the Robot

(a) 0.0° (b) 0.0° (c) 0.0° (d) 2.0° (e) 0.7° (f) 2.5° (g) 5.7° (h) 3.7°
Figure IV.17: Snapshots of our experiment on the robot, while scraping on the tilt stage with a +10° roll angle.
The first row is unrolling without the coupling term, i.e. unrolling the nominal behavior. The second row is unrolling with the learned coupling term model. The sub-captions show the readings of the Digital Angle Gauge mounted on top of the middle finger of the hand. The first column is at the initial position. The second column is at the end of the first primitive (going down in the z-direction). The third column is at the end of the second primitive (orientation correction). The fourth column is at the end of the third primitive (scraping the board forward in the y-direction). We see that in the second row, orientation correction is applied due to the coupling term being active.

In Figure IV.17, we show the snapshots of our robot unrolling the primitives together with the online correction prediction computed by the trained feedback models, on a setting with a 10° roll angle of the tilt stage. We see that if we turn off the coupling term (nominal behavior execution, first row of the figure set), no correction was applied to the tool-tip orientation, and the scraping result was worse than when the online-computed corrections were applied (second row of the figure set). We also show the coupling term that was computed online by the trained feedback model, alongside the corresponding sensor trace deviation during the robot execution, in Figure IV.18, plotted in red. When applying the coupling term computed by the trained feedback model, the sensor trace deviation stays close to those from the demonstrations, as shown in the bottom row of Figure IV.18. A video can be seen at https://youtu.be/7Dx5imy1Kcw. In the video, we show the robot's scraping task execution at two settings, at 5° and 10° roll angles of the tilt stage, while applying the corrections predicted online by the trained feedback model.
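During these unrollings, the predicted coupling term enters the motion plan as an additive acceleration in the DMP transformation system; below is a minimal one-dimensional sketch, with gains that are standard DMP choices and a formulation simplified from Eq. IV.2.

```python
# One-dimensional DMP transformation system with an additive coupling
# term c:  ydd = alpha * (beta * (g - y) - yd) + f + c.
# alpha = 25, beta = alpha/4 are the common critically-damped choice;
# the time constant tau is omitted here for brevity.

def dmp_step(y, yd, goal, coupling=0.0, forcing=0.0,
             alpha=25.0, beta=6.25, dt=0.001):
    ydd = alpha * (beta * (goal - y) - yd) + forcing + coupling
    return y + yd * dt, yd + ydd * dt

# With the forcing and coupling terms decayed to zero, the spring-damper
# part still drives the plan to the goal (the convergence guarantee):
y, yd = 0.0, 0.0
for _ in range(20000):
    y, yd = dmp_step(y, yd, goal=1.0)
assert abs(y - 1.0) < 1e-3
```

Because the feedback model's coupling term is simply added to this second-order system, any correction it outputs bends the trajectory online without changing the attractor at the goal.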
Figure IV.18: The roll-orientation coupling term (top) versus the sensor trace deviation of the right BioTac finger's electrode #6 (bottom) of primitive 2, during the scraping task on environmental settings with the roll angle of the tilt stage varying from 2.5° (left-most) to 10° (right-most), in +2.5° increments. The comparison is between human demonstrations (blue), unrolling on the robot while applying the coupling term computed online by the trained feedback model (red), unrolling the nominal behavior on the robot (green), the human demonstrations' mean trajectory (dashed black), and the range of the human demonstrations within ±1 standard deviation from the mean trajectory (solid black). In the top plots, we see that the trained feedback model can differentiate between different tilt stage roll-orientations and apply approximately the correct amount of correction/coupling term.

IV.9.4 Reinforcement Learning of Feedback Models

In accordance with the definition in section IV.9.1.2, for all experiments in section IV.9.4, the environmental settings are defined as follows:
- The default setting is the setting with the tilt stage's roll angle at 0°.
- The known/seen settings are the settings with the tilt stage's roll angle at 5°, 6.3°, and 7.5°. These are the settings we performed supervised learning on (as described in section IV.6), to initialize the feedback model before performing reinforcement learning.
- The initially unknown/unseen setting is the setting with the tilt stage's roll angle at 10°. This is the novel setting on which we perform the reinforcement learning approach described in section IV.7.
- The unknown/unseen setting is the setting with the tilt stage's roll angle at 8.8°, which was seen neither during the supervised learning nor during the reinforcement learning. We test the final feedback policy after RL on this never-before-seen setting, in order to evaluate the across-settings generalization capability of the feedback model.
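The per-time-step cost of Eq. IV.34 used in the following RL evaluation can be sketched numerically as follows, assuming unit quaternions stored as [r, q1, q2, q3]; the function names are illustrative.

```python
import numpy as np

def quat_multiply(a, b):
    """Quaternion composition for quaternions stored as [r, q1, q2, q3]."""
    r1, v1 = a[0], np.asarray(a[1:])
    r2, v2 = b[0], np.asarray(b[1:])
    return np.concatenate(([r1 * r2 - v1 @ v2],
                           r1 * v2 + r2 * v1 + np.cross(v1, v2)))

def quat_conjugate(q):
    return np.concatenate(([q[0]], -np.asarray(q[1:])))

def quat_log(q):
    r, v = np.clip(q[0], -1.0, 1.0), np.asarray(q[1:])
    angle = np.arccos(r)
    if np.sin(angle) < 1e-12:
        return v  # near the identity, arccos(r)/sin(arccos(r)) -> 1
    return (angle / np.sin(angle)) * v

def orientation_cost(q_nominal, q_current):
    """J_t: norm of the angular error between two relative orientations."""
    err = quat_multiply(q_nominal, quat_conjugate(q_current))
    return np.linalg.norm(2.0 * quat_log(err))

# Identical orientations incur zero cost; a 90-degree relative rotation
# about x incurs a cost of pi/2.
identity = np.array([1.0, 0.0, 0.0, 0.0])
q90 = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4), 0.0, 0.0])
assert orientation_cost(identity, identity) == 0.0
assert abs(orientation_cost(q90, identity) - np.pi / 2) < 1e-9
```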
Before evaluating our framework for reinforcement learning (RL) refinement of the feedback model, we initialize the feedback model with the supervised training dataset collected at the known/seen settings. Afterwards, first, we evaluate the learning curve of the feedback model RL refinement on the initially unknown/unseen setting. Second, we compare the feedback models' performance on all non-default settings involved before and after the reinforcement learning, including the across-settings generalization performance test on the unknown/unseen setting. Finally, we provide snapshots of the real robot execution, comparing the adaptive behavior before and after the RL refinement on the initially unknown/unseen setting.

IV.9.4.1 Quantitative Evaluation of Training with Reinforcement Learning

Figure IV.19: The learning curves of the RL refinement of the feedback model on the initially unseen setting 10°, for primitive 2 (left) and primitive 3 (right). The learning curves show the mean and standard deviation of the cost of the adaptive behavior after the feedback policy update, over 8 runs on the real robot. The cost at iteration #0 shows the cost before RL is performed.

Figure IV.19 shows the learning curves of the feedback model refinement by reinforcement learning (RL), as used in the adaptive behavior on the initially unseen setting 10°, on primitives 2 and 3, with the mean and standard deviation computed over 8 runs on the real robot. The robot first performs RL on the feedback model of primitive 2, and once its performance has converged on primitive 2, the robot performs RL on the feedback model of primitive 3. We use K = 38 as the number of policy samples taken in the PI²-CMA algorithm, in Algorithms IV.3 and IV.4.
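At the core of PI²-CMA is a cost-exponentiated re-weighting of the K policy samples; the sketch below shows a generic PI²-style weighting and mean update, and the exact temperature and covariance handling in Algorithm IV.3 may differ.

```python
import numpy as np

# Generic PI2-style update: sampled policy parameters are re-weighted by
# exponentiated (range-normalized) costs, and the new policy mean is
# their weighted average. The sensitivity h is a common default, not
# necessarily the value used in Algorithm IV.3.

def pi2_weights(costs, h=10.0):
    costs = np.asarray(costs, dtype=float)
    span = max(costs.max() - costs.min(), 1e-10)
    w = np.exp(-h * (costs - costs.min()) / span)
    return w / w.sum()

def pi2_cma_mean_update(theta_samples, costs):
    return pi2_weights(costs) @ np.asarray(theta_samples)

thetas = np.array([[0.0], [1.0]])
# The lower-cost sample dominates the updated policy mean:
new_mean = pi2_cma_mean_update(thetas, costs=[5.0, 1.0])
assert new_mean[0] > 0.5
```

Because the update only needs the K rollout costs, no gradient of the cost through the robot dynamics is required, which is what makes the method practical on hardware.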
An iteration of the while-loop in Algorithm IV.3 usually takes around 30-45 minutes for the robot to complete (most of the time is taken by the evaluation of the K sampled policies, since the robot needs to unroll each sampled policy in order to evaluate it). The behavior has already converged by the end of the second iteration on both primitives 2 and 3, in the sense that further iterations do not show significant improvements, hence we only show the learning curves up to RL iteration #2. In total, the RL refinement process thus takes around 2-3 hours to complete on the robot. This shows the sample efficiency of our proposed reinforcement learning algorithm for the feedback models, since many current deep reinforcement learning techniques require millions of samples, which is infeasible to execute on a robot.

IV.9.4.2 Feedback Models Performance Before versus After Reinforcement Learning and Across-Settings Generalization Performance

Figure IV.20: The performance comparison in terms of accumulated cost on primitive 2 (left) and on primitive 3 (right), between the nominal behavior without the feedback model (red), the adaptive behavior (including the feedback model) before reinforcement learning (RL) of the feedback model (green), and the adaptive behavior after RL of the feedback model (blue), on all non-default settings. The mean and standard deviation are computed over 8 runs on the real robot.

In Figure IV.20, we compare the performance in terms of the accumulated cost on primitive 2 and primitive 3 during execution on all non-default settings, between the nominal behavior without the feedback model (red), the adaptive behavior before the reinforcement learning (RL) of the feedback model (green), and the adaptive behavior after the RL of the feedback model (blue), with the mean and standard deviation statistics computed over 8 runs on the real robot.
We see that:
- The nominal behavior without the feedback model (red) always performs worse than the adaptive behaviors with the learned feedback models (green and blue), showing the effectiveness of the learned feedback models.
- The performance of the adaptive behavior with the learned feedback model on the initially unseen setting 10° improves significantly after the feedback model is refined by RL.
- The performance of the adaptive behavior with the learned feedback model on the seen settings 5°, 6.3°, and 7.5° does not degrade (and even improves) after the feedback model is refined by RL, as compared to the performance before RL.
- The performance of the adaptive behavior with the learned feedback model on the unseen setting 8.8° (which was seen neither during the initial supervised learning nor during the reinforcement learning of the feedback model) is in general between the performance of its neighboring settings 7.5° and 10°, both before and after RL, which shows some degree of generalization of the learned feedback model across environment settings.

(a) Before Prim. 1 (b) End of Prim. 1 (c) End of Prim. 2 (d) End of Prim. 3
Figure IV.21: Snapshots of our experiment on the real robot, comparing the execution of the closed-loop behavior (the nominal behavior and the learned feedback model) before RL (soft shadow) versus after RL. After RL, the feedback model applied more correction as compared to the one before RL, qualitatively showing the improvement achieved by the RL algorithm.

IV.9.4.3 Qualitative Evaluation of the Real Robot Behavior

Figure IV.21 shows the snapshots of our anthropomorphic robot executing the adaptive behavior (the nominal behavior and the learned feedback model) at different stages in the initially unknown/unseen setting (10°). We compare the behavior before RL (shown as soft shadows) and after RL refinement of the feedback model. As we can see in Fig.
IV.21(c), the feedback model after RL applied more correction than the one before RL, showing the qualitative result of the improvement by the RL algorithm. The quantitative performance improvement due to RL can be seen in Fig. IV.20. The pipeline of the RL experiment can be seen in the video https://youtu.be/WDq1rcupVM0.

IV.10 Summary

We introduced a general framework for learning feedback models for reactive behaviors. At the core of our approach, we presented two expressive learning representations that guarantee convergence of the adapted behaviors to the goal. First, we presented a feed-forward neural network (FFNN) with output post-processing as a learning representation of the feedback model in an obstacle avoidance testbed. However, we realized that hand-designing the form of the output post-processing may be infeasible for some tasks, and hence we introduced the phase-modulated neural network (PMNN). PMNN has an embedded post-processing, whose form is determined automatically when learning from demonstration data. Moreover, PMNN can capture the dependency of the adaptation on the movement phase while ensuring convergence of the adapted motion plan. Furthermore, we showed the superiority of PMNN as compared to approaches from previous work, including FFNN, when applied to the learning tactile feedback model testbed.

Furthermore, we showed the complete pipeline of our framework for learning feedback models for reactive behaviors. We began by introducing a weighted least square method for automated alignment and segmentation of human demonstrations, in order to extract the movement primitives; we complemented this with its algorithmic time complexity analysis and a speed-up possibility. Afterwards, we showed that the learning representation of the feedback models can be initialized by supervised learning from demonstrations on several known settings.
Finally, we presented a sample-efficient reinforcement learning algorithm that can refine the feedback models further, to improve their performance on settings not seen during the supervised training phase, while retaining their performance on the initially-known settings. We also showed that the learned feedback model can generalize to a setting that was seen neither during the supervised training phase nor during the reinforcement learning phase. We demonstrated the effectiveness of our approaches in real-time operation with experiments on an anthropomorphic robot.

Chapter V

Conclusion and Future Work

V.1 Conclusion

In this dissertation, we presented several structured empirical methods (approaches for incorporating prior knowledge as structure within learning representations) in robot control and motion planning tasks.

In the first part, we showed how we can leverage insight about the manifold structure of the point of contact on a tactile skin and utilize a manifold learning technique to learn a Euclidean-structured latent space representation suitable for dynamics prediction and control for tactile servoing. The benefit of such an embedding into the Euclidean latent space is that the control problem for tactile servoing can be solved analytically.

In the second part of the dissertation, we showed how we can incorporate physical constraints into a learning representation for inverse dynamics using a differentiable Newton-Euler algorithm. Moreover, we showed that this structured empirical method achieves superior results in terms of training speed and the generalization capability of the learned dynamics model, as compared to existing unstructured and semi-structured methods.

The last part of this dissertation presented a structured empirical method for learning feedback models for reactive behaviors.
We proposed an expressive, specially-structured neural network called the phase-modulated neural network (PMNN) as the learning representation of feedback models for reactive behaviors. The design of PMNN allows it to capture the phase-dependent adaptation of behaviors, as well as ensuring the convergence of the adapted motion plans to the goal. Moreover, we showed methods for training PMNNs both in a supervised learning setup (by learning from demonstrations of corrected behaviors) as well as in a reinforcement learning setup.

In summary, in this dissertation, we showed the benefits of incorporating prior knowledge as structure in learning robot control and motion planning. The first benefit is that such a method can simplify the control problem and allow for the derivation of the control policy analytically from the learned model. Second, the structured empirical approach improves the training speed and the generalization capability of the model. Finally, structured empirical models can be designed in a way that provides some desirable performance guarantees.

V.2 Future Work

There are several promising directions for future work that can extend the approaches presented in this dissertation.

First, the manifold learning technique presented in the learning tactile servoing work requires an understanding of the manifold characteristics of point contacts on a tactile skin. On the other hand, the established field of Differential Geometry has advanced the understanding of manifold characteristics in general, regardless of domain-specific applications. A promising direction is to infuse learning representations with these insights from Differential Geometry, which will lead towards a more general manifold learning method. Such a method can be used, for example, in learning constraint manifolds for sequential motion planning [150, 151].
Second, in our work on encoding physical constraints in the differentiable Newton-Euler algorithm (DiffNEA), there are unmodelled dynamics in the real robot system, such as static friction. Such discrepancies cause imperfect training of the DiffNEA model on the real robot dynamics dataset. Future work addressing these unmodelled dynamics could substantially improve performance on real robot systems. Finally, in learning feedback models for reactive behaviors, we see that there is still an issue during the supervised training of the feedback models. After supervised training, the behavior sometimes drifts from the expected corrective behaviors present in the training dataset. Such drift is possibly caused by treating the training data points as independent and identically distributed (i.i.d.). A potential way to mitigate this problem is to perform the supervised training inside a differentiable trajectory optimization of the adaptive behavior, with the corrected behavior demonstration as the target of the trajectory optimization. By performing the supervised training of the feedback model this way, it may be possible to eliminate the assumption that the data are i.i.d.

Appendix A Solution of the 1-Horizon Constrained Optimal Control Problem

Given the constrained optimization problem:

min_{a_t, z_{t+1}}  (1/2) ‖z_T − z_{t+1}‖² + (γ/2) ‖a_t‖²
s.t.  z_{t+1} = z_t + (A_t z_t + B_t a_t + c_t) Δt    (A.1)

where γ > 0 is the action regularization weight, this is equivalent to solving the following problem with Lagrange multiplier λ:

min_{a_t, z_{t+1}, λ}  (1/2) ‖z_T − z_{t+1}‖² + (γ/2) ‖a_t‖² + λᵀ (z_{t+1} − z_t − (A_t z_t + B_t a_t + c_t) Δt)    (A.2)

Setting the partial derivatives of:

L_CO(a_t, z_{t+1}, λ) = (1/2) ‖z_T − z_{t+1}‖² + (γ/2) ‖a_t‖² + λᵀ (z_{t+1} − z_t − (A_t z_t + B_t a_t + c_t) Δt)    (A.3)

with respect to the optimization variables to zero gives us:

∂L_CO/∂λ = 0  ⇒  z_{t+1} = z_t + (A_t z_t + B_t a_t + c_t) Δt    (A.4)

∂L_CO/∂a_t = 0  ⇒  γ a_t − Δt B_tᵀ λ = 0  ⇒  a_t = (Δt/γ) B_tᵀ λ    (A.5)

∂L_CO/∂z_{t+1} = 0  ⇒  −(z_T − z_{t+1}) + λ = 0  ⇒  z_{t+1} = z_T − λ    (A.6)

Combining Eq.
A.4, A.6, and A.5:

z_T − λ = z_t + (A_t z_t + B_t a_t + c_t) Δt
z_T − λ = z_t + A_t z_t Δt + (Δt²/γ) B_t B_tᵀ λ + c_t Δt    (A.7)

((Δt²/γ) B_t B_tᵀ + I) λ = z_T − z_t − A_t z_t Δt − c_t Δt
λ = γ (Δt² B_t B_tᵀ + γ I)⁻¹ (z_T − z_t − A_t z_t Δt − c_t Δt)    (A.8)

Finally, substituting Eq. A.8 into Eq. A.5, we get:

a_t = Δt B_tᵀ (Δt² B_t B_tᵀ + γ I)⁻¹ (z_T − z_t − A_t z_t Δt − c_t Δt)    (A.9)

Appendix B Quaternion Algebra

A unit quaternion is a hypercomplex number which can be written as a vector Q = [r qᵀ]ᵀ with ‖Q‖ = 1, where r and q = [q₁ q₂ q₃]ᵀ are the real scalar part and the vector of the three imaginary components of the quaternion, respectively. For computation with orientation trajectories, several operations need to be defined, as follows:

quaternion composition operation:

Q_A ∘ Q_B = [ r_A  −q_{A1}  −q_{A2}  −q_{A3} ;
              q_{A1}   r_A  −q_{A3}   q_{A2} ;
              q_{A2}   q_{A3}   r_A  −q_{A1} ;
              q_{A3}  −q_{A2}   q_{A1}   r_A ] [ r_B ; q_{B1} ; q_{B2} ; q_{B3} ]    (B.1)

quaternion conjugation operation:

Q* = [ r ; −q ]    (B.2)

logarithm mapping (the log(·) operation), which maps an element of SO(3) to so(3), is defined as:

log(Q) = log([ r ; q ]) = (arccos(r) / sin(arccos(r))) q    (B.3)

exponential mapping (the exp(·) operation, the inverse of the log(·) operation), which maps an element of so(3) to SO(3), is defined as:

exp(ω) = [ cos(‖ω‖) ; (sin(‖ω‖)/‖ω‖) ω ]    (B.4)

This numerical implementation of the logarithm mapping is better than those in the previous works by Ude et al. [141] and Kramberger et al. [142], due to the fact that lim_{r→1} arccos(r)/sin(arccos(r)) = lim_{x→0} x/sin(x) = 1, hence it transitions more smoothly in the proximity of Q = [1 0 0 0]ᵀ.

Appendix C Time Complexity Analysis and Quadratic Speed-Up of the Automated Nominal Demonstrations Alignment and Segmentation via Weighted Least Square Dynamic Time Warping

The time complexity of the DTW algorithm [82] is O(MN), where N is the length of the reference segment and M is the length of the initial guessed segment, as defined in Section IV.5.1.
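As an illustration of the quaternion logarithm and exponential mappings of Appendix B (Eqs. B.3 and B.4), the following is a minimal NumPy sketch. The function names quat_log and quat_exp and the eps guard are our own; the fallback to a scale of 1 near the identity quaternion implements the limit arccos(r)/sin(arccos(r)) → 1 discussed above.

```python
import numpy as np

def quat_log(Q, eps=1e-12):
    """Logarithm mapping of a unit quaternion Q = [r, q1, q2, q3] (Eq. B.3).

    Near the identity quaternion, arccos(r)/sin(arccos(r)) -> 1, so we
    fall back to a scale of 1.0 to avoid a 0/0 division."""
    r, q = Q[0], np.asarray(Q[1:], dtype=float)
    theta = np.arccos(np.clip(r, -1.0, 1.0))
    s = np.sin(theta)
    scale = 1.0 if s < eps else theta / s
    return scale * q

def quat_exp(omega, eps=1e-12):
    """Exponential mapping (Eq. B.4), the inverse of quat_log."""
    omega = np.asarray(omega, dtype=float)
    n = np.linalg.norm(omega)
    scale = 1.0 if n < eps else np.sin(n) / n
    return np.concatenate(([np.cos(n)], scale * omega))

# Round trip: exp(log(Q)) recovers Q for a unit quaternion.
Q = np.array([np.cos(0.3), np.sin(0.3), 0.0, 0.0])  # rotation about the x-axis
assert np.allclose(quat_exp(quat_log(Q)), Q)
# Smooth behavior in the proximity of the identity quaternion [1, 0, 0, 0].
assert np.allclose(quat_log(np.array([1.0, 0.0, 0.0, 0.0])), np.zeros(3))
```

The clip on r guards against floating-point values slightly outside [−1, 1] before arccos, which would otherwise produce NaNs for quaternions that are unit only up to rounding error.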
For the weighted least square regression: there are two parameters (the relative time scale α_l and the relative time delay d_l) to be estimated, and the total number of correspondence pairs is K = min(M, N). Thus the dimensionality of A is at most K × 2, the dimensionality of W is K × K, and the dimensionality of b is K × 1.

The time complexity of the matrix-matrix multiplication AᵀW is O(K), because W is a diagonal matrix. The time complexity of the matrix-matrix multiplication (AᵀW)A is O(K). The time complexity of the matrix inversion (AᵀWA)⁻¹, i.e. the inversion of a 2 × 2 matrix, is O(1). The time complexity of the matrix-vector multiplication (AᵀW)b is O(K). The time complexity of the matrix-vector multiplication (AᵀWA)⁻¹(AᵀWb) is O(1).

Thus the time complexity of the overall process is O(MN), i.e. it is dominated by the time complexity of the DTW algorithm. However, we can gain a significant speed-up of the overall process by down-sampling both the reference and the guessed segments together by a factor of g, such that the new length of the reference segment is N′ = N/g and the new length of the guessed segment is M′ = M/g. We require that N′ ≥ 2 and M′ ≥ 2, so that the weighted least square computation remains reasonable. The resulting relative time scale is unchanged, i.e. α′_l = α_l, while the relative time delay is related by d′_l = d_l/g. The overall time complexity then becomes O(M′N′) = O(MN/g²), i.e. a quadratic speed-up compared to the original version.

Appendix D Publications and Presentations

D.1 Publications

(i). Journal Paper: International Journal of Robotics Research (IJRR) (First Author, Submitted on June 25, 2020, currently being peer-reviewed) “Supervised learning and reinforcement learning of feedback models for reactive behaviors: tactile feedback testbed” [152]

(ii). Conference Paper: Learning for Dynamics & Control (L4DC) 2020 (First Author, Published) “Encoding physical constraints in differentiable Newton-Euler algorithm” [6]

(iii).
Conference Paper: IEEE International Conference on Robotics and Automation (ICRA) 2019 (First Author, Published) “Learning latent space dynamics for tactile servoing” [75]

(iv). Conference Paper: IEEE International Conference on Robotics and Automation (ICRA) 2018 (First Author, Published) “Learning sensor feedback models from demonstrations via phase-modulated neural networks” [32]

(v). Conference Paper: IEEE International Conference on Robotics and Automation (ICRA) 2017 (Co-First Author, Published) “Learning feedback terms for reactive planning and control” [70]

D.2 Presentations

(i). Workshop Presentation: RSS 2020 Workshop for Learning (in) Task and Motion Planning “Learning Manifolds for Sequential Motion Planning” [151]

(ii). Poster: USC CS PhD Annual Research Review 2020 “Encoding physical constraints in differentiable Newton-Euler algorithm”

(iii). Workshop Presentation and Poster: ICRA 2019 ViTac: Integrating Vision and Touch for Multimodal and Cross-modal Perception Workshop “Learning latent space dynamics for tactile servoing”

(iv). Poster: Southern California Robotics Symposium 2019 “Learning latent space dynamics for tactile servoing”

(v). Poster: USC CS PhD Annual Research Review 2018 “Learning sensor feedback models from demonstrations via phase-modulated neural networks”

(vi). Poster: Southern California Robotics Symposium 2017 “Learning feedback terms for reactive planning and control”

(vii). Poster: USC CS PhD Annual Research Review 2017 “Learning feedback terms for reactive planning and control”

(viii). Workshop Presentation and Poster: Humanoids 2016 Human Performance and Robotics Workshop “Learning feedback terms for reactive planning and control”

(ix). Abstract: Neural Control of Movement Conference 2016 “Distinct adaptation to abrupt and gradual torque perturbations with a multi-joint exoskeleton robot”

Bibliography

[1] Akshara Rai, Franziska Meier, Auke Jan Ijspeert, and Stefan Schaal.
Learning coupling terms for obstacle avoidance. In IEEE-RAS International Conference on Humanoid Robots, pages 512–518, 2014.

[2] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems, 2015. URL http://download.tensorflow.org/paper/whitepaper2015.pdf.

[4] M. Lutter, C. Ritter, and J. Peters. Deep lagrangian networks: Using physics as model prior for deep learning. In 7th International Conference on Learning Representations (ICLR), May 2019. URL https://openreview.net/pdf?id=BklHpjCqKm.

[5] Jayesh K. Gupta, Kunal Menda, Zachary Manchester, and Mykel J. Kochenderfer. A general framework for structured learning of mechanical systems. CoRR, abs/1902.08705, 2019. URL http://arxiv.org/abs/1902.08705.

[6] Giovanni Sutanto, Austin S. Wang, Yixin Lin, Mustafa Mukadam, Gaurav Sukhatme, Akshara Rai, and Franziska Meier. Encoding physical constraints in differentiable newton-euler algorithm. CoRR, abs/2001.08861, 2020.

[7] Dae-Hyung Park, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields.
In IEEE International Conference on Humanoid Robots, pages 91–98, 2008.

[8] Heiko Hoffmann, Peter Pastor, Dae-Hyung Park, and Stefan Schaal. Biologically-inspired dynamical systems for movement generation: Automatic real-time goal adaptation and obstacle avoidance. In IEEE International Conference on Robotics and Automation, pages 2587–2592, 2009.

[9] Peter Pastor, Heiko Hoffmann, Tamim Asfour, and Stefan Schaal. Learning and generalization of motor skills by learning from demonstration. In IEEE International Conference on Robotics and Automation, pages 763–768, 2009.

[10] Andrej Gams, Auke Jan Ijspeert, Stefan Schaal, and Jadran Lenarčič. On-line learning and modulation of periodic movements with nonlinear dynamical systems. Autonomous robots, 27(1):3–23, 2009.

[11] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine. Collective robot reinforcement learning with distributed asynchronous guided policy search. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 79–86, Sep. 2017. doi: 10.1109/IROS.2017.8202141.

[12] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018. doi: 10.1177/0278364917710318. URL https://doi.org/10.1177/0278364917710318.

[13] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience, 2018.

[14] Roland Johansson. Light a match: Normal, pre-anesthetization performance vs post-anesthetization performance. https://www.youtube.com/watch?v=0LfJ3M3Kn80, 2018. Accessed: 2018-08-04.

[15] Ronald S. Johansson and Göran Westling.
Roles of glabrous skin receptors and sensorimotor memory in automatic control of precision grip when lifting rougher or more slippery objects. Experimental Brain Research, 56(3):550–564, Oct 1984. ISSN 1432-1106. doi: 10.1007/BF00237997. URL https://doi.org/10.1007/BF00237997.

[16] Nicholas Wettels, Veronica J. Santos, Roland S. Johansson, and Gerald E. Loeb. Biomimetic tactile sensor array. Advanced Robotics, 22(8):829–849, 2008. doi: 10.1163/156855308X314533. URL https://doi.org/10.1163/156855308X314533.

[17] Craig Chorley, Chris Melhuish, Tony Pipe, and Jonathan Rossiter. Development of a tactile sensor based on biologically inspired edge encoding. In 2009 International Conference on Advanced Robotics, pages 1–6, June 2009.

[18] Wenzhen Yuan, Siyuan Dong, and Edward H. Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12), 2017. ISSN 1424-8220. doi: 10.3390/s17122762. URL http://www.mdpi.com/1424-8220/17/12/2762.

[19] Philipp Mittendorfer and Gordon Cheng. Humanoid multimodal tactile-sensing modules. IEEE Transactions on Robotics, 27(3):401–410, June 2011. ISSN 1552-3098. doi: 10.1109/TRO.2011.2106330.

[20] Qiang Li, Carsten Schürmann, Robert Haschke, and Helge J. Ritter. A control framework for tactile servoing. In Robotics: Science and Systems, 2013.

[21] Nathan F. Lepora, Kirsty Aquilina, and Luke Cramphorn. Exploratory tactile servoing with active touch. IEEE Robotics and Automation Letters, 2(2):1156–1163, April 2017. ISSN 2377-3766. doi: 10.1109/LRA.2017.2662071.

[22] Arunkumar Byravan, Felix Leeb, Franziska Meier, and Dieter Fox. Se3-pose-nets: Structured deep dynamics models for visuomotor planning and control. CoRR, abs/1710.00489, 2017. URL http://arxiv.org/abs/1710.00489.

[23] Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images.
In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2746–2754, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969442.2969546.

[24] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. CoRR, abs/1606.07419, 2016. URL http://arxiv.org/abs/1606.07419.

[25] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, pages 1735–1742, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2597-0. doi: 10.1109/CVPR.2006.100. URL http://dx.doi.org/10.1109/CVPR.2006.100.

[26] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000. ISSN 0036-8075. doi: 10.1126/science.290.5500.2323. URL https://science.sciencemag.org/content/290/5500/2323.

[27] Gautam Pai, Ronen Talmon, and Ron Kimmel. Parametric manifold learning via sparse multidimensional scaling. CoRR, abs/1711.06011, 2017. URL http://arxiv.org/abs/1711.06011.

[28] Zhe Su, Jeremy Fishel, Tomonori Yamamoto, and Gerald Loeb. Use of tactile feedback to control exploratory movements to characterize object compliance. Frontiers in neurorobotics, 6:7, 07 2012.

[29] Herke van Hoof, Tucker Hermans, Gerhard Neumann, and Jan Peters. Learning robot in-hand manipulation with tactile features. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pages 121–127, Nov 2015. doi: 10.1109/HUMANOIDS.2015.7363524.

[30] Jaeyong Sung, J. Kenneth Salisbury, and Ashutosh Saxena. Learning to represent haptic feedback for partially-observable tasks. In IEEE International Conference on Robotics and Automation, pages 2802–2809, 2017.
[31] Vikash Kumar, Abhishek Gupta, Emanuel Todorov, and Sergey Levine. Learning dexterous manipulation policies from experience and imitation. CoRR, abs/1611.05095, 2016.

[32] Giovanni Sutanto, Zhe Su, Stefan Schaal, and Franziska Meier. Learning sensor feedback models from demonstrations via phase-modulated neural networks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1142–1149, May 2018. doi: 10.1109/ICRA.2018.8460986.

[33] Chia-Hsien Lin, Jeremy A. Fishel, and Gerald E. Loeb. Estimating point of contact, force and torque in a biomimetic tactile sensor with deformable skin. In SynTouch LLC, 2013.

[34] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. CoRR, abs/1708.02596, 2017.

[35] Ajo Fod, Maja J Matarić, and Odest Chadwicke Jenkins. Automated derivation of primitives for movement classification. Autonomous robots, 12(1):39–54, 2002.

[36] Roland Johansson and John Flanagan. Coding and use of tactile signals from the fingertips in object manipulation tasks. Nature reviews. Neuroscience, 10:345–59, 05 2009.

[37] Stefan Schaal. The sl simulation and real-time control software package. Technical report, University of Southern California, Los Angeles, CA, 2009. URL http://www-clmc.usc.edu/publications/S/schaal-TRSL.pdf. clmc.

[38] Nathan D. Ratliff, Jan Issac, Daniel Kappler, Stan Birchfield, and Dieter Fox. Riemannian motion policies. CoRR, abs/1801.02854, 2018. URL http://arxiv.org/abs/1801.02854.

[39] Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo. Robotics: Modelling, Planning and Control. Springer Science & Business Media, 2010.

[40] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[41] Sergey Ioffe and Christian Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[42] Chae H. An, Christopher G. Atkeson, and John M. Hollerbach. Model-based Control of a Robot Manipulator. MIT Press, Cambridge, MA, USA, 1988. ISBN 0-262-01102-6.

[43] Richard M. Murray, S. Shankar Sastry, and Li Zexiang. A Mathematical Introduction to Robotic Manipulation. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 1994. ISBN 0849379814.

[44] Christopher G. Atkeson, Chae H. An, and John M. Hollerbach. Estimation of inertial parameters of manipulator loads and links. The International Journal of Robotics Research, 5(3):101–119, 1986. doi: 10.1177/027836498600500306.

[45] Kevin Hitzler, Franziska Meier, Stefan Schaal, and Tamim Asfour. Learning and adaptation of inverse dynamics models: A comparison. In 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), pages 491–498, 2019.

[46] D. Nguyen-Tuong, J. Peters, M. Seeger, and B. Schölkopf. Learning inverse dynamics: A comparison. In Advances in Computational Intelligence and Learning: Proceedings of the European Symposium on Artificial Neural Networks, pages 13–18, Evere, Belgium, Apr 2008. Max-Planck-Gesellschaft, d-side.

[47] Daniel Kappler, Franziska Meier, Nathan Ratliff, and Stefan Schaal. A new data source for inverse dynamics learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4723–4730, Sep. 2017. doi: 10.1109/IROS.2017.8206345.

[48] F. D. Ledezma and S. Haddadin. First-order-principles-based constructive network topologies: An application to robot inverse dynamics. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pages 438–445, Nov 2017. doi: 10.1109/HUMANOIDS.2017.8246910.

[49] P. M. Wensing, S. Kim, and J. E. Slotine. Linear matrix inequalities for physically consistent inertial parameter identification: A statistical perspective on the mass distribution.
IEEE Robotics and Automation Letters, 3(1):60–67, Jan 2018. ISSN 2377-3774. doi: 10.1109/LRA.2017.2729659.

[50] J. Y. S. Luh, M. W. Walker, and R. P. C. Paul. On-Line Computational Scheme for Mechanical Manipulators. Journal of Dynamic Systems, Measurement, and Control, 102(2):69–76, 06 1980. ISSN 0022-0434. doi: 10.1115/1.3149599. URL https://doi.org/10.1115/1.3149599.

[51] Roy Featherstone. Rigid Body Dynamics Algorithms. Springer-Verlag, Berlin, Heidelberg, 2007. ISBN 0387743146.

[52] M. Jansen. Learning an accurate neural model of the dynamics of a typical industrial robot. In International Conference on Artificial Neural Networks, page 1257–1260, 1994.

[53] Jayesh K. Gupta, Kunal Menda, Zachary Manchester, and Mykel J. Kochenderfer. Structured mechanical models for robot learning and control. CoRR, abs/2004.10301, 2020. URL http://arxiv.org/abs/2004.10301.

[54] S. Traversaro, S. Brossette, A. Escande, and F. Nori. Identification of fully physical consistent inertial parameters using optimization on manifolds. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5446–5451, Oct 2016. doi: 10.1109/IROS.2016.7759801.

[55] Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. CoRR, abs/1906.01563, 2019. URL http://arxiv.org/abs/1906.01563.

[56] M. Mistry, S. Schaal, and K. Yamane. Inertial parameter estimation of floating base humanoid systems using partial force sensing. In 2009 9th IEEE-RAS International Conference on Humanoid Robots, pages 492–497, Dec 2009. doi: 10.1109/ICHR.2009.5379531.

[57] Krzysztof R. Kozlowski. Modelling and identification in robotics. Springer Science & Business Media, 2012.

[58] Jean-Jacques E. Slotine and Weiping Li. On the adaptive control of robot manipulators. The International Journal of Robotics Research, 6(3):49–59, 1987. doi: 10.1177/027836498700600303. URL https://doi.org/10.1177/027836498700600303.

[59] Erwin Coumans and Yunfei Bai.
Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2018.

[60] Jun Nakanishi, Rick Cory, Michael Mistry, Jan Peters, and Stefan Schaal. Operational space control: A theoretical and empirical comparison. The International Journal of Robotics Research, 27(6):737–757, 2008. doi: 10.1177/0278364908091463. URL https://doi.org/10.1177/0278364908091463.

[61] Justin Carpentier, Florian Valenza, Nicolas Mansard, et al. Pinocchio: fast forward and inverse dynamics for poly-articulated systems. https://stack-of-tasks.github.io/pinocchio, 2015–2019.

[62] Chonhyon Park, Jia Pan, and Dinesh Manocha. Itomp: Incremental trajectory optimization for real-time replanning in dynamic environments. In Proceedings of the Twenty-Second International Conference on International Conference on Automated Planning and Scheduling, ICAPS'12, page 207–215. AAAI Press, 2012.

[63] N. Ratliff, M. Zucker, J. A. Bagnell, and S. Srinivasa. Chomp: Gradient optimization techniques for efficient motion planning. In 2009 IEEE International Conference on Robotics and Automation, pages 489–494, May 2009. doi: 10.1109/ROBOT.2009.5152817.

[64] A. Byravan, B. Boots, S. S. Srinivasa, and D. Fox. Space-time functional gradient optimization for motion planning. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 6499–6506, May 2014. doi: 10.1109/ICRA.2014.6907818.

[65] N. Ratliff, M. Toussaint, and S. Schaal. Understanding the geometry of workspace obstacles in motion optimization. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 4202–4209, May 2015. doi: 10.1109/ICRA.2015.7139778.

[66] Mustafa Mukadam, Jing Dong, Xinyan Yan, Frank Dellaert, and Byron Boots. Continuous-time gaussian process motion planning via probabilistic inference. The International Journal of Robotics Research, 37(11):1319–1340, 2018. doi: 10.1177/0278364918790369.
[67] Èric Pairet, Paola Ardón, Michael Mistry, and Yvan R. Petillot. Learning generalisable coupling terms for obstacle avoidance via low-dimensional geometric descriptors. CoRR, abs/1906.09941, 2019.

[68] Jens Kober, Betty Mohler, and Jan Peters. Learning perceptual coupling for motor primitives. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 834–839, 2008.

[69] Peter Pastor, Ludovic Righetti, Mrinal Kalakrishnan, and Stefan Schaal. Online movement adaptation based on previous sensor experiences. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 365–371, 2011.

[70] Akshara Rai, Giovanni Sutanto, Stefan Schaal, and Franziska Meier. Learning feedback terms for reactive planning and control. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2184–2191, May 2017. doi: 10.1109/ICRA.2017.7989252.

[71] Clemens Eppner, Sebastian Höfer, Rico Jonschkowski, Roberto Martín-Martín, Arne Sieverling, Vincent Wall, and Oliver Brock. Lessons from the amazon picking challenge: Four aspects of building robotic systems. In Proceedings of Robotics: Science and Systems, Ann Arbor, Michigan, June 2016. doi: 10.15607/RSS.2016.XII.036.

[72] D. Kappler, F. Meier, J. Issac, J. Mainprice, C. G. Cifuentes, M. Wüthrich, V. Berenz, S. Schaal, N. Ratliff, and J. Bohg. Real-time perception meets reactive motion generation. IEEE Robotics and Automation Letters, 3(3):1864–1871, July 2018. ISSN 2377-3774. doi: 10.1109/LRA.2018.2795645.

[73] S. Tian, F. Ebert, D. Jayaraman, M. Mudigonda, C. Finn, R. Calandra, and S. Levine. Manipulation by feel: Touch-based control with deep predictive models. In 2019 International Conference on Robotics and Automation (ICRA), pages 818–824, 2019.

[74] Heiko Hoffmann, Zhichao Chen, Darren Earl, Derek Mitchell, Behnam Salemi, and Jivko Sinapov. Adaptive robotic tool use under variable grasps. Robotics and Autonomous Systems, 62(6):833–846, 2014.

[75] G.
Sutanto, N. Ratliff, B. Sundaralingam, Y. Chebotar, Z. Su, A. Handa, and D. Fox. Learning latent space dynamics for tactile servoing. In 2019 International Conference on Robotics and Automation (ICRA), pages 3622–3628, May 2019. doi: 10.1109/ICRA.2019.8793520.

[76] Yevgen Chebotar, Oliver Kroemer, and Jan Peters. Learning robot tactile sensing for object manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3368–3375, 2014.

[77] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA), pages 6023–6029, May 2019. doi: 10.1109/ICRA.2019.8794127.

[78] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[79] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine. Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3381–3388, May 2017.

[80] Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2):328–373, 2013.

[81] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[82] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.

[83] Ching-An Cheng, Mustafa Mukadam, Jan Issac, Stan Birchfield, Dieter Fox, Byron Boots, and Nathan Ratliff. RMPflow: A computational graph for automatic motion policy generation. In The 13th International Workshop on the Algorithmic Foundations of Robotics, 2018. URL http://arxiv.org/abs/1811.07049.

[84] Tamar Flash and Neville Hogan.
The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5:1688–703, 08 1985.

[85] Lydia Kavraki, Petr Svestka, J.C. Latombe, and Mark H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, Aug 1996. ISSN 1042-296X. doi: 10.1109/70.508439.

[86] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[87] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation, pages 3406–3413, 2016.

[88] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, and Peter Corke. Towards vision-based deep reinforcement learning for robotic motion control. arXiv preprint arXiv:1511.03791, 2015.

[89] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. arXiv preprint arXiv:1610.00633, 2016.

[90] S. Calinon, E. L. Sauser, A. G. Billard, and D. G. Caldwell. Evaluation of a probabilistic approach to learn and reproduce gestures by imitation. In 2010 IEEE International Conference on Robotics and Automation, pages 2671–2676, May 2010. doi: 10.1109/ROBOT.2010.5509988.

[91] Peter Pastor, Mrinal Kalakrishnan, Franziska Meier, Freek Stulp, Jonas Buchli, Evangelos Theodorou, and Stefan Schaal. From dynamic movement primitives to associative skill memories. Robotics and Autonomous Systems, 61(4):351–361, 2013.

[92] Giovanni Sutanto, Kusprasapta Mutijarsa, and Hilwadi Hindersah. Design and construction of “artro” omnidirectional walking bipedal robot using hierarchical reactive behaviors control architecture.
International Journal on Electrical Engineering and Informatics (IJEEI), 3(1):43–60, 2011. doi: 10.15676/ijeei.2011.3.1.4.

[93] S. Behnke. Online trajectory generation for omnidirectional biped walking. In Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pages 1597–1603, 2006.

[94] Alexandros Paraschos, Christian Daniel, Jan Peters, and Gerhard Neumann. Using probabilistic movement primitives in robotics. Autonomous Robots, 42(3):529–551, Mar 2018. ISSN 1573-7527.

[95] M. Ewerton, G. Maeda, J. Peters, and G. Neumann. Learning motor skills from partially observed movements executed at different speeds. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 456–463, Sep. 2015. doi: 10.1109/IROS.2015.7353412.

[96] Scott Niekum, Sachin Chitta, Andrew Barto, Bhaskara Marthi, and Sarah Osentoski. Incremental semantically grounded learning from demonstration. In Proceedings of Robotics: Science and Systems, Berlin, Germany, June 2013. doi: 10.15607/RSS.2013.IX.048.

[97] Scott Niekum, Sarah Osentoski, George Konidaris, Sachin Chitta, Bhaskara Marthi, and Andrew G Barto. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research, 34(2):131–157, 2015.

[98] F. Meier, E. Theodorou, F. Stulp, and S. Schaal. Movement segmentation using a primitive library. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3407–3412, Sep. 2011. doi: 10.1109/IROS.2011.6094676.

[99] Rudolf Lioutikov, Gerhard Neumann, Guilherme Maeda, and Jan Peters. Probabilistic segmentation applied to an assembly task. In Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on, pages 533–540. IEEE, 2015.

[100] Adam Coates, Pieter Abbeel, and Andrew Y. Ng. Learning for control from multiple demonstrations.
In Proceedings of the 25th International Conference on Machine Learning, ICML '08, page 144–151, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605582054. doi: 10.1145/1390156.1390175.

[101] F. Zhou and F. De la Torre. Generalized time warping for multi-modal alignment of human motion. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1282–1289, June 2012. doi: 10.1109/CVPR.2012.6247812.

[102] Z. Aung, K. Sim, and W. S. Ng. Traj align: A method for precise matching of 3-d trajectories. In 2010 20th International Conference on Pattern Recognition, pages 3818–3821, Aug 2010. doi: 10.1109/ICPR.2010.930.

[103] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range images. Image Vision Comput., 10(3):145–155, 1992.

[104] Brett R. Fajen and William H. Warren. Behavioral dynamics of steering, obstacle avoidance, and route selection. Journal of Experimental Psychology: Human Perception and Performance, 29(2):343–362, 2003.

[105] Seyed Mohammad Khansari-Zadeh and Aude Billard. A dynamical system approach to realtime obstacle avoidance. Auton. Robots, 32(4):433–454, 2012.

[106] B. Sundaralingam, A. S. Lambert, A. Handa, B. Boots, T. Hermans, S. Birchfield, N. Ratliff, and D. Fox. Robust learning of tactile force estimation through robot interaction. In 2019 International Conference on Robotics and Automation (ICRA), pages 9035–9042, 2019.

[107] Filipe Veiga, Benoni Edin, and Jan Peters. Grip stabilization through independent finger tactile feedback control. Sensors, 20(6), 2020. ISSN 1424-8220. doi: 10.3390/s20061748. URL https://www.mdpi.com/1424-8220/20/6/1748.

[108] Mohammad Khansari, Ellen Klingbeil, and Oussama Khatib. Adaptive human-inspired compliant contact primitives to perform surface–surface contact under uncertainty. The International Journal of Robotics Research, 35(13):1651–1675, 2016.

[109] Francois R. Hogan, Jose Ballester, Siyuan Dong, and Alberto Rodriguez.
Tactile dexterity: Manipulation primitives with tactile feedback, 2020. [110] Elliott Donlon, Siyuan Dong, Melody Liu, Jianhua Li, Edward Adelson, and Alberto Ro- driguez. Gelslim: A high-resolution, compact, robust, and calibrated tactile-sensing finger, 2018. [111] Andrej Gams, Bojan Nemec, Auke Jan Ijspeert, and Aleˇ s Ude. Coupling movement primitives: Interaction with the environment and bimanual tasks. IEEE Transactions on Robotics, 30(4):816–830, 2014. [112] N. Likar, B. Nemec, L. ˇ Zlajpah, S. Ando, and A. Ude. Adaptation of bimanual assembly tasks using iterative learning framework. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pages 771–776, Nov 2015. [113] D. A. Bristow, M. Tharayil, and A. G. Alleyne. A survey of iterative learning control. IEEE Control Systems Magazine, 26(3):96–114, June 2006. ISSN 1066-033X. [114] Andrej Gams, Tadej Petric, Bojan Nemec, and Aleˇ s Ude. Learning and adaptation of periodic motion primitives based on force feedback and human coaching interaction. In IEEE-RAS International Conference on Humanoid Robots, pages 166–171, 2014. [115] Fares J. Abu-Dakka, Bojan Nemec, Jimmy A. Jørgensen, Thiusius R. Savarimuthu, Norbert Kr¨ uger, and Aleˇ s Ude. Adaptation of manipulation skills in physical contact with the environment to reference force profiles. Autonomous Robots, 39(2):199–217, Aug 2015. [116] Andrej Gams, Miha Denisa, and Aleˇ s Ude. Learning of parametric coupling terms for robot-environment interaction. In IEEE International Conference on Humanoid Robots, pages 304–309, 2015. [117] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 
Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016. [118] MP. Deisenroth and CE. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 465–472. Omnipress, 2011. [119] Vincent Franc ¸ois-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. An introduction to deep reinforcement learning. Foundations and Trends R in Machine Learning, 11(3-4):219–354, 2018. ISSN 1935-8237. doi: 10.1561/2200000071. 133 [120] Martin Riedmiller. Neural fitted q iteration – first experiences with a data efficient neural re- inforcement learning method. In Jo˜ ao Gama, Rui Camacho, Pavel B. Brazdil, Al´ ıpio M´ ario Jorge, and Lu´ ıs Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-31692-3. [121] V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. [122] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist re- inforcement learning. Mach. Learn., 8(3-4):229–256, May 1992. ISSN 0885-6125. [123] V olodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lilli- crap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Pro- ceedings of The 33rd International Conference on Machine Learning, volume 48 of Pro- ceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR. [124] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomput., 71(7-9):1180–1190, March 2008. ISSN 0925-2312. [125] J. Hwangbo, C. Gehring, H. Sommer, R. Siegwart, and J. Buchli. 
Rock — efficient black- box optimization for policy learning. In 2014 IEEE-RAS International Conference on Humanoid Robots, pages 535–540, Nov 2014. [126] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Ma- chine Learning Research, pages 1889–1897, Lille, France, 07–09 Jul 2015. PMLR. [127] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. [128] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 745–750, New York, NY , USA, 2007. ACM. [129] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learn- ing, 84(1):171–203, Jul 2011. ISSN 1573-0565. [130] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. Learning policy improvements with path integrals. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 828–835, Chia Laguna Resort, Sar- dinia, Italy, 13–15 May 2010. PMLR. 134 [131] Jan Peters, Katharina M¨ ulling, and Yasemin Alt¨ un. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pages 1607–1612. AAAI Press, 2010. [132] Freek Stulp and Olivier Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Coference on International Confer- ence on Machine Learning, ICML’12, pages 1547–1554, USA, 2012. Omnipress. [133] F. Stulp, E. A. Theodorou, and S. Schaal. 
Reinforcement learning with sequences of motion primitives for robust manipulation. IEEE Transactions on Robotics, 28(6):1360–1370, Dec 2012. ISSN 1552-3098. [134] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal. Learning force control policies for compliant manipulation. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4639–4644, Sep. 2011. doi: 10.1109/IROS.2011.6095096. [135] M. Hazara and V . Kyrki. Reinforcement learning for improving imitated in-contact skills. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pages 194–201, Nov 2016. doi: 10.1109/HUMANOIDS.2016.7803277. [136] Carlos E. Celemin, Guilherme Maeda, Javier Ruiz-del Solar, Jan Peters, and Jens Kober. Reinforcement learning of motor skills using policy search and human corrective ad- vice. International Journal of Robotics Research, OnlineFirst, 2019. doi: 10.1177/ 0278364919871998. [137] Aviv Tamar, Sergey Levine, and Pieter Abbeel. Value iteration networks. CoRR, abs/1602.02867, 2016. [138] Y . Chebotar, A. Handa, V . Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox. Clos- ing the sim-to-real loop: Adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pages 8973–8979, May 2019. doi: 10.1109/ICRA.2019.8793789. [139] Artem Molchanov, Tao Chen, Wolfgang H¨ onig, James A. Preiss, Nora Ayanian, and Gau- rav S. Sukhatme. Sim-to-(multi)-real: Transfer of low-level robust control policies to mul- tiple quadrotors. CoRR, abs/1903.04628, 2019. [140] Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Vol- ume 28, ICML’13, pages III–1–III–9. JMLR.org, 2013. URL http://dl.acm.org/ citation.cfm?id=3042817.3042937. [141] Aleˇ s Ude, Bojan Nemec, Tadej Petric, and Jun Morimoto. Orientation in cartesian space dynamic movement primitives. 
In IEEE International Conference on Robotics and Au- tomation, pages 2997–3004, 2014. 135 [142] Aljaˇ z Kramberger, Andrej Gams, Bojan Nemec, and Aleˇ s Ude. Generalization of ori- entational motion in unit quaternion space. In IEEE-RAS International Conference on Humanoid Robots, pages 808–813, 2016. [143] Bojan Nemec and Aleˇ s Ude. Action sequencing using dynamic movement primitives. Robotica, 30(05):837–846, 2012. [144] Chris Bishop. Improving the generalization properties of radial basis function neural net- works. Neural Computation, 3(4):579–588, 1991. [145] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. J. Mach. Learn. Res., 11:3137–3181, December 2010. ISSN 1532-4435. [146] J. F. Queiber, B. Hammer, H. Ishihara, M. Asada, and J. J. Steil. Skill memories for parameterized dynamic action primitives on the pneumatically driven humanoid robot child affetto. In 2018 Joint IEEE 8th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 39–45, 2018. [147] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010. [148] MATLAB. MATLAB and Neural Network Toolbox Release 2015a. The MathWorks Inc., Natick, Massachusetts, 2015. [149] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut- dinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014. [150] Peter Englert, Isabel M. Rayas Fern´ andez, Ragesh K. Ramachandran, and Gaurav S. Sukhatme. Sampling-Based Motion Planning on Manifold Sequences. arXiv:2006.02027, 2020. [151] Isabel M. Rayas Fern´ andez, Giovanni Sutanto, Peter Englert, Ragesh K. Ramachan- dran, and Gaurav S. Sukhatme. Learning Manifolds for Sequential Motion Planning. arXiv:2006.07746, 2020. 
[152] Giovanni Sutanto, Katharina Rombach, Yevgen Chebotar, Zhe Su, Stefan Schaal, Gaurav S. Sukhatme, and Franziska Meier. Supervised Learning and Reinforcement Learning of Feedback Models for Reactive Behaviors: Tactile Feedback Testbed. arXiv:2007.00450, 2020. 136
Asset Metadata
Creator: Sutanto, Giovanni (author)
Core Title: Leveraging structure for learning robot control and reactive planning
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 07/25/2020
Defense Date: 04/21/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: alignment, contact point tracking, control, demonstration, dexterous robotic manipulation, differentiable, differentiable computation, differentiable programming, differential geometry, dynamical movement primitives, exponential mapping, feedback models, inertia matrix, inverse dynamics, latent space dynamics learning, learning and instruction, log mapping, manifold learning, neural networks, OAI-PMH Harvest, obstacle avoidance, phase-modulated neural networks, physical constraints, positive definite, reactive behaviors, reactive planning, recursive Newton-Euler algorithm, reinforcement learning, rigid body parameters, robot, segmentation, special orthogonal groups, structure, supervised learning, tactile feedback, tactile feedback capability, tactile finger, tactile sensing information, tactile sensors, tactile servoing, tactile skin geometry, weighted least square
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Sukhatme, Gaurav S. (committee chair), Culbertson, Heather (committee member), Finley, James (committee member), Meier, Franziska (committee member)
Creator Email: giovanni.sutanto@gmail.com, gsutanto@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-339689
Unique Identifier: UC11665385
Identifier: etd-SutantoGio-8743.pdf (filename), usctheses-c89-339689 (legacy record id)
Legacy Identifier: etd-SutantoGio-8743.pdf
Dmrecord: 339689
Document Type: Dissertation
Rights: Sutanto, Giovanni
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA