Learning Affordances through Interactive Perception and Manipulation

by

David Inkyu Kim

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

Committee:
Prof. Gaurav S. Sukhatme (Chair), Computer Science
Prof. Stefan Schaal, Computer Science
Prof. Satyandra K. Gupta, Aerospace and Mechanical Engineering

August 2018

Copyright 2018 David Inkyu Kim

Dedication

Dedicated to my wife Eunkyung and daughter Jane

Acknowledgements

First and foremost, I would like to thank my parents, who have always been supportive and given me the freedom to pursue whatever I wanted to do. I understand what a huge opportunity I was given, from my childhood until now, that I can have the confidence to continue my studies and beyond.

I could not have finished this dissertation without the guidance and support of my advisor, Professor Gaurav Sukhatme. I especially thank Gaurav for giving me the chance to do research on topics that I particularly wanted to pursue. His clear view of the big picture in robotics showed me how to be a good researcher. I could not have met a better advisor as a graduate student.

I would also like to thank all my lab mates. Special thanks to Oliver Kroemer, who shared his post-doc life with me in both research and daily life. I am sure that Oliver will continue to be a good advisor at Carnegie Mellon. I also thank Stephanie Kemna, who has been my closest friend in the lab. I enjoyed every moment we talked about the various interests in our lives as well as our research topics. Thanks to Geoff Hollinger, Andreas Breitenmoser, Joerg Muller, and Lantao Liu for being wonderful postdocs who advised me on my research. I loved every moment of fun conversation among my lab mates: Jonathan Binney, Harshvardhan Vathsangam, Jnaneshwar Das, Arvind Pereira, Christian Potthast, Megha Gupta, Max Pflueger, Karol Hausman, Hordur Heidarsson, Artem Molchanov, James Priess, and Eric Heiden. I also enjoyed working with master's student Supreeth Subbaraya. I would like to express my gratitude to Lizsl De Leon Spedding, who helped me so much from the very first start to the end of my degree. She is the most incredible person and gave answers and advice for every problem I had!

I also want to thank all my Korean friends, who have gone through all the ups and downs with me. Jaeyoung Bang and Youngmi Lee have been the best couple, and we enjoyed traveling and eating together. I had so much fun with Kanggeon Kim and Jyeun Son, sharing crucial information for raising a baby. Having a coffee break with Sungyong Seo and Jaewon Shin was a delight of afternoon teatime. Soowang Park and Daeun Kim were friendly neighbors around Park La Brea. Ipsae Park was a nice friend who helped me adapt to a new school life. Seoul National University alumni Mooryong Ra, Yoonsik Jo, Kyuengeun Choi, Junyoung Kwak, Byungjoo Cha, Dongwoo Won, Dongwoo Kang, and Inkwon Hwang were great supporters and friends. I especially enjoyed fun coursework with Jeunhyung Kang.

Last, but not least, I want to thank my wife Eunkyung, who has been the best partner throughout my life. Without her kind support and encouragement, I could not have finished my Ph.D. and received the greatest gift of our life, our daughter Jane. I hope our beautiful baby Jane, who has given us the biggest joy, grows up as happy as she is now, enjoying her life with mom and dad.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1  Introduction
  1.1  Outline

Chapter 2  Affordance Learning by Perception
  2.1  Introduction
  2.2  Related work
  2.3  Classifier for semantic labeling
    2.3.1  Segmentation method
    2.3.2  Geometric features
    2.3.3  Training with logistic regression
  2.4  Iterative k-means clustering for entropy minimization
  2.5  Incremental object segmentation refining by merging multiple views
  2.6  Experimental results
    2.6.1  Logistic regression for semantic classification
    2.6.2  Iterative k-means clustering with entropy minimization
    2.6.3  Incremental object segmentation refining by merging multiple views
  2.7  Conclusions

Chapter 3  Affordance Map Building with Interactive Manipulation
  3.1  Introduction
  3.2  Related work
  3.3  Affordance prediction and manipulation planning
    3.3.1  Defining affordance
    3.3.2  Building a grid map and collecting visual data
    3.3.3  Affordance prediction based on geometric features
    3.3.4  Manipulation planning for affordance mapping
  3.4  Experimental results
  3.5  Conclusions

Chapter 4  Hierarchical Affordance Prediction Model Learning using Environmental Context
  4.1  Introduction
  4.2  Related Work
  4.3  Learning Prediction Models for Push Affordances
    4.3.1  Push Affordance Prediction Modeling
    4.3.2  Feature Extraction
    4.3.3  Prediction Model Learning
  4.4  Hierarchical Prediction Modeling
  4.5  Evaluations
    4.5.1  Experimental Setup
    4.5.2  Hierarchical Prediction Models with Selected Features
    4.5.3  Application of Prediction Model for Task Planning
  4.6  Conclusion

Chapter 5  Probabilistic Hierarchical Affordance Prediction Models with Varying Feature Relevances
  5.1  Introduction
  5.2  Related Work
  5.3  Learning Prediction Models for Push Affordances
    5.3.1  Push Affordance Model
    5.3.2  Feature Extraction
  5.4  Hierarchical Prediction Models
    5.4.1  Hierarchical Model Structure
    5.4.2  Model Learning
      5.4.2.1  Initialization
      5.4.2.2  Expectation Step
      5.4.2.3  Maximization Step
  5.5  Evaluations
    5.5.1  Experimental Setup
    5.5.2  Hierarchical Prediction Models with Selected Features
    5.5.3  Discussion
    5.5.4  Application of Prediction Model for Task Planning
  5.6  Conclusion

Chapter 6  Modular Affordance Prediction Models for Multiple Object Interactions
  6.1  Introduction
  6.2  Related Work
  6.3  Modular affordance prediction models
    6.3.1  Affordance prediction using transition model
    6.3.2  Modularity of the prediction model
  6.4  Feature extraction
  6.5  Affordance transition model learning
    6.5.1  Building a prediction model
    6.5.2  Forward and backward components of transition models
  6.6  Experiments
    6.6.1  Experimental Setup
    6.6.2  Forward and backward components of transition model
    6.6.3  Affordance prediction over skill sets
    6.6.4  Application of modular structure to multiple objects scenarios
    6.6.5  Future work
  6.7  Conclusion

Chapter 7  Conclusions

References

List of Tables

2.1  The list of geometric features
2.2  Result of logistic regression for semantic labels

List of Figures

1.1  Understanding affordances for robots such as placing, grasping, cutting, cooking and cleaning is necessary in order to autonomously accomplish tasks.
2.1  Example of the PR2 robot situated in front of unknown obstacles.
2.2  Example of segmentation: 2D image captured from Kinect and the segmented 3D point cloud.
2.3  Scatter, linearity, and planarity of a point cloud, adapted from [1].
2.4  Occupancy binning of neighboring segments.
2.5  Binary ground truth label of 'pushable forward'.
2.6  Concept of iterative k-means clustering with two label vectors.
2.7  Result of predicted classification, with ground truth and prediction probabilities.
2.8  Results of iterative k-means clustering: refined object segmentation and entropy minimization.
2.9  Incremental classification with multiple views.
3.1  A PR2 robot situated in a simulated warehouse with disorganized boxes, and the affordance map learned via interactive manipulation to plan a rearrangement.
3.2  An example point cloud of a single view and one example of the warehouse setting with randomly placed boxes.
3.3  Obtained 2D occupancy grid using ray tracing and the registered 3D world point cloud.
3.4  Ground truth for 4 directions of affordances and predicted pushing-up affordance values per voxel.
3.5  Prediction result for the affordance "pushing up" before and after MRF smoothing.
3.6  A sequence of manipulation steps chosen by maximum information gain.
3.7  Average entropies of the exhaustive and maximum-information-gain approaches.
4.1  A PR2 robot learning a prediction model for the push affordance of a tabletop object.
4.2  The prediction model for push affordance defined in the coordinate frame of the robot's gripper.
4.3  A region of the point cloud used for extracting local descriptors.
4.4  Encoded environmental context features from the pushing frame.
4.5  Clustering 2D effects of samples with different numbers of submodels.
4.6  Prediction errors for push affordance models with different sets of features and numbers of submodels.
4.7  Sampled transition candidates from A* planning and the estimated plan to push the book to the edge of the table.
4.8  PR2 executing the manipulation task with the hierarchical prediction model.
5.1  A PR2 robot learning a prediction model for the push affordance of tabletop objects.
5.2  The prediction model for push affordance defined in the coordinate frame of the robot's hand.
5.3  Extraction of contact features and context features.
5.4  Example of clustering with 3 submodels using a Gaussian Mixture Model.
5.5  Objects used in the experiment.
5.6  Prediction errors for push affordance models with different sets of features and numbers of submodels.
5.7  Example of a two-step push experiment using the learned hierarchical model.
6.1  Push affordances learned from experiences of various pushing actions.
6.2  Definition of the action, object, and contact frames along the end-effector trajectory.
6.3  Global shape features and contact shape features encoded as 3D voxel grids.
6.4  Mean distance errors of prediction models with and without the backward component.
6.5  Objects which are graspable, not graspable, placeable, and not placeable.
6.6  Prediction errors of pick and place affordance prediction models.
6.7  An example of pushing 2 objects over time steps.
6.8  Mean prediction errors of the monolithic and modular models.

Abstract

Robots can plan and accomplish various tasks in unknown environments by understanding the underlying functionalities of the objects around them. These attributes are called affordances, describing action possibilities between a robot and objects in the environment.
Affordance is not a universal property due to its relative nature; therefore, it must be learned from experience. Such learning involves predicting affordances from perception, followed by interactive manipulation. Learned affordance models can be directly applied to robotic tasks, as the model describes how to manipulate and what the consequence will be.

In this thesis, we present several methods to learn affordances with interactive perception and manipulation. Specifically, we introduce learning affordance models from perception and utilizing predicted affordances to generate an interactive manipulation scheme. First, we examine building affordance models from perception only. From 3D point cloud data, visual features are extracted and a prediction of affordance is made. The developed model incorporates relative geometric information of nearby objects, and the predicted labels are utilized for refined object segmentation. Next, we look at planning interactive manipulation based on the predicted affordances to build an affordance map. The robot predicts the affordances of objects and examines them with manipulation. The perception-manipulation loop is iterated with a maximum-information-gain strategy to build a map until convergence. Lastly, we propose three different affordance modeling schemes. The context-based affordance model is introduced to efficiently consider context information when building affordance models, and a hierarchical approach is used for categorizing different types of effects caused by the actions. The model is further developed by finding an optimal number of submodels describing distinctive effects of actions, with weighted predictions. Finally, interaction-based modeling of affordances between entities is studied, where modular predictions can be applied to multi-object scenarios in cluttered environments. The general set of features and effect predictions is also applied to various skills: pushing, picking, and placing. For the developed affordance models, extensive experiments are performed to verify the models and their application to robotic tasks. We have shown that the modeling scheme can be applied to accurately predict the consequences of actions in various environments. Our learning framework with interactive perception and manipulation can be extended to many other robotic applications.

Chapter 1
Introduction

When robots operate in an unknown environment to accomplish tasks, they must understand how to interact with that environment. Especially when deployed in human environments, everyday objects should be utilized to carry out tasks. For this purpose, research has been conducted on recognizing objects' identities, with the main focus on extracting dominant features to match objects against an existing database. However, object recognition has limits when applied to robotics: robots need to physically interact with objects, and such information is not explicitly explained by object identities.

Objects in the human environment are designed for specific functions: a cup for drinking, a chair for sitting, a bed for sleeping, etc. More specialized "tools" have been invented, like wheels and knives, to make human life more convenient (Fig. 1.1). Understanding these functionalities and utilizing them will enhance a robot's capability to perform tasks. For instance, when a robot is asked to bring water, it could use any object that can contain liquid: a cup, a glass, or a mug.
It can be claimed that functions and usages can be derived from object identities, as they are unique properties of the object. However, this is not a consistent characteristic in every scenario, since functionalities are not pre-defined for every object: a thick book can be used to hammer a nail, and a table can be sat on instead of a chair. Additionally, most functions are described from the human's point of view, which might differ from the robot's stance: a staircase cannot easily be climbed by a wheeled robot. Therefore, robots should be capable of understanding the functionalities of objects from various perspectives.

The term affordance was first introduced by psychologist J. J. Gibson in 1977 [2]. He defined affordance as an action possibility formed by the relation between an agent and its environment. The concept of affordance has recently been expanded from psychology to robotics for its advantage in describing functionality. It has been applied to object recognition by finding parts holding certain affordances. Navigation was also improved with the concept by finding proper affordances to plan a path. The human-robot interaction field has also expanded the work by inferring human activities, relating the objects used by humans with their affordances. Affordances have also proved useful in robot task planning, as they can directly describe the relation between applied actions and corresponding effects.

Figure 1.1: Understanding affordances for robots such as placing, grasping, cutting, cooking and cleaning is necessary in order to autonomously accomplish tasks.

The additional component of affordance, aside from the robot and the object, is the environment. Depending on the environmental setting, affordances might differ. For example, a chair cannot be sat on unless it is in an upright pose on a stable floor, and a door cannot be opened when it is blocked by heavy snow. Therefore, robots should not focus solely on objects, but should model the surrounding environment to understand affordances.

Affordances can be learned from experience with real physical interactions. An affordance can be predicted first, and corresponding actions can be performed to examine the difference between the expectation and the outcome. This trial-and-error resembles the developmental stage of human infants. Babies learn how to play with toys by applying various perceptions and actions: they touch, see, even taste and eat. Such behavior is called 'motor babbling', learning from action-effect pairs by applying random actions. As babies grow, action primitives become more strategic by utilizing past experiences. This learning framework can similarly be applied to robots for learning object affordances.

In this thesis, we introduce methods for robots to learn affordances through interactive perception and manipulation. First, a perceptual approach is introduced that semantically maps the point cloud world with affordances and makes use of such affordances for object segmentation. Then, an interactive manipulation scheme to predict and examine affordances via action selection is described. The technique is applied to build an affordance map which can be utilized for a task like rearrangement. Last, three different affordance modeling schemes are proposed to enrich the model by considering context information. The context-based affordance model is introduced to efficiently consider context information when building affordance models. Different effects of actions are categorized into two sets: free-space push and constrained push.
A hierarchical approach is used to predict the labels of effects from extracted features. The model is applied to a sequential manipulation task in which the robot pushes an object around obstacles in order to grasp it at the edge of the table. Second, a model is presented that finds an optimal number of submodels describing distinctive effects of actions, with weighted predictions. A two-step planner is applied to this model to perform a task in which the first object is used to interact with a second object, making the second object reachable. Finally, an interaction model of affordances is built between entities, where modular predictions can be applied to multi-object scenarios in cluttered environments. The general set of features and effect predictions is also applied to various skills like pushing, picking, and placing. For the developed affordance models, a sequential manipulation scheme was examined to correctly predict the action possibilities of objects and to utilize the predictions to perform picking, placing, and pushing in order to obtain desired goal states. Overall, affordance models learned from interactive perception and manipulation enable robots to accomplish complicated tasks autonomously.

1.1 Outline

This thesis document is organized as follows:

Chapter 2 introduces affordance learning by perception, which uses 3D point cloud sensor data to learn affordance models.

In Chapter 3, we present an interactive manipulation strategy to learn affordances and build an affordance map.

In Chapter 4, we show a method to model affordances and forward affordances and apply the models for task planning. Hierarchical models to categorize environmental affordances are described.

In Chapter 5, we develop probabilistic hierarchical models with weights, incorporating different situation contexts along with relevant features.

In Chapter 6, we build a modular affordance prediction model and apply it to multi-object environments. Various skills are learned and used to carry out tasks based on the predicted effects of actions.

In Chapter 7, we summarize and conclude the work.

Chapter 2
Affordance Learning by Perception

2.1 Introduction

Imagine a robot that has just entered a crowded office room, with tables, chairs, and boxes all around (Fig. 2.1). The robot wants to travel to the next door, but obstacles are blocking the way. When a robot is deployed in such an unknown environment and wants to navigate, the very first action should be perceiving the environment to figure out where, what, and how objects are placed. Unless the robot finds an obstacle-free path, if there is any, it must find a way to move things, reorganize, and build a desirable path. This is a complicated problem for a robot because it requires knowledge of object affordance [2], which describes the way an object can be manipulated. Much research has been done in the area of object recognition, especially in the computer vision community, to solve the problem of identifying objects. But in a situation like the above, knowing object affordance, which tells how an object can be moved, is more essential than recognizing the object identity. With knowledge of whether an object is 'pushable forward' or 'liftable', a robot can plan a manipulation pipeline to achieve its desired reconfiguration of objects.

In this chapter, we propose and evaluate an algorithm to explore an unknown environment using semantic labels of object affordance. We use a Microsoft Kinect to acquire a 3D point cloud and apply a region growing method to sort the point cloud into a number of segments.
Various geometric features covering shape, location, and geometric relationships are examined, and a machine learning technique is applied to train a classifier. We compare the results with manually labeled ground truth, yielding an average of 81.8% precision and 81.7% recall. With the obtained classifier, we propose an iterative clustering algorithm to improve object segmentation and reduce manipulation uncertainty. We also show that incrementally merging multiple views refines the object segmentation.

Figure 2.1: Example of the PR2 robot situated in front of unknown obstacles. Unless the PR2 understands each object's affordance, it cannot decide which object to move in order to clear the path.

2.2 Related work

When a robot needs to navigate among movable obstacles [3], it must understand the configurations and characteristics of objects in order to plan the manipulation task. Given an unknown environment, Wu et al. [4] proposed a navigation algorithm which interacts with the obstacles to learn their characteristics. They used a simple pushing motion primitive to test whether an obstacle is static or not. Our work focuses on understanding affordances of obstacles by processing 3D point clouds to predict the semantic affordances of objects with which a robot has to interact.

There has been rigorous research in the area of scene understanding and object recognition from 2D images. Finding good features was one of the main topics, and features such as SIFT (Scale Invariant Feature Transform) [5], HOG (Histogram of Oriented Gradients) [6], and contextual features [7] were developed. But as 2D images are projections of the 3D world, part of the geometric characteristics is lost during the conversion. To overcome this limitation, 3D layouts or sequential multiple views were taken into consideration. Hoiem et al. [8] tried to capture geometric features by modeling the interdependency of objects, surface orientations, and camera viewpoints. However, the approach could only be applied to certain relations of objects, and it still suffered from a lack of geometric representation. Recently, RGB-D cameras like the Microsoft Kinect became popular; they capture the world in depth images along with color. Object recognition has been improved by using such RGB-D cameras with both visual and shape information (e.g. [9]). 3D features like NARF [10] were also developed to capture geometric characteristics in 3D depth images.

In the area of object classification using 3D point clouds, Shapovalov et al. [11] classified outdoor scenes into ground, building, tree, and low vegetation. Xiong et al. [12] labeled indoor scenes as walls, floors, ceiling, and clutter by using a CRF with geometric relationships such as orthogonal, parallel, adjacent, and coplanar. Koppula et al. [13] introduced semantic labels and applied associative relationships between objects to predict labels such as 'chair back rest' and 'chair base'. Our research differs from previous work in that we solve the object classification problem by labeling object affordance. The advantage of affordance labeling is that the labels can be directly adopted for manipulation planning and do not require a prior database of object identities. Also, the predicted labels can be utilized as new features to refine object segmentation and reduce manipulation uncertainty.

2.3 Classifier for semantic labeling

We develop a classifier that predicts the object affordance for a given 3D point cloud. We perform the classification in three steps:

1. Segmentation of the 3D point cloud
2. Extraction of geometric features
3. Training and learning parameters of the classifier

2.3.1 Segmentation method

There are various segmentation methods such as planar, cylindrical, and Euclidean clustering [14]. Since our proposed algorithm aims to label object affordance, the segmentation method does not need to precisely extract the outlines of objects. Therefore, we apply the region growing method from the Point Cloud Library (PCL) [14], which simply captures continuous surfaces of a 3D point cloud. The method starts with randomly selected seed points and grows them by grouping neighboring points of similar normal angle and curvature. As a result, we can sort the point cloud into clusters, shown in Fig. 2.2 with the 2D image from the Kinect and the corresponding segmented 3D point cloud.

Figure 2.2: Example of segmentation: 2D image captured from Kinect (left) and result of segmented 3D point cloud with randomly colored segments (right).
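The segmentation itself comes from PCL; as a self-contained illustration of the region-growing idea (grow a segment over nearest neighbors whose normals are similar, and let only low-curvature points keep growing), the sketch below implements a minimal version with NumPy and SciPy. The neighborhood size and the angle and curvature thresholds are illustrative assumptions, not the thesis settings.

```python
# Simplified region-growing segmentation over a 3D point cloud (NumPy/SciPy).
# k, angle_thresh, and curv_thresh are illustrative, not the thesis values.
import numpy as np
from scipy.spatial import cKDTree


def estimate_normals(points, k=20):
    """Per-point normal and curvature from PCA of the k nearest neighbors."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    normals = np.zeros_like(points)
    curvature = np.zeros(len(points))
    for i, nbrs in enumerate(idx):
        cov = np.cov(points[nbrs].T)
        evals, evecs = np.linalg.eigh(cov)           # ascending eigenvalues
        normals[i] = evecs[:, 0]                     # smallest eigenvector ~ surface normal
        curvature[i] = evals[0] / max(evals.sum(), 1e-12)
    return normals, curvature, idx


def region_growing(points, k=20, angle_thresh=np.deg2rad(10.0), curv_thresh=0.05):
    """Group points into segments of smoothly connected surface."""
    normals, curvature, idx = estimate_normals(points, k)
    labels = -np.ones(len(points), dtype=int)
    current = 0
    for seed in np.argsort(curvature):               # grow from the flattest points first
        if labels[seed] != -1:
            continue
        queue = [seed]
        labels[seed] = current
        while queue:
            p = queue.pop()
            for q in idx[p]:
                if labels[q] != -1:
                    continue
                # Add a neighbor if its normal is close to the current point's normal.
                if np.abs(np.dot(normals[p], normals[q])) >= np.cos(angle_thresh):
                    labels[q] = current
                    if curvature[q] < curv_thresh:   # only smooth points keep growing
                        queue.append(q)
        current += 1
    return labels                                     # segment id per point


if __name__ == "__main__":
    cloud = np.random.rand(2000, 3)                   # stand-in for a Kinect point cloud
    segment_ids = region_growing(cloud)
    print("number of segments:", segment_ids.max() + 1)
```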
2.3.2 Geometric features

Among many features such as color, texture, and material, the geometric properties of objects play a crucial role regarding object affordance. For example, the PR2 robot can push a box forward if the box is large and tall enough to be reachable by the arms of the PR2. Additionally, the surface of the box should be facing the PR2. To be 'graspable', an object must have a thin part that fits within the gripper of the PR2.

To determine the affordance of objects, we consider two types of geometric features. First, we derive unary features from a single point cloud segment in terms of the shape, normal, and location of the segment. Second, we consider pairwise features that capture the geometric relation between neighboring segments. Pairwise geometric features are especially useful in explaining object affordance by capturing the relative position of segments. For example, if there is a table in front of a wall, that table would not be pushable toward the wall. To classify several types of object affordance, we use the unary and pairwise features shown in Table 2.1. In the following, we describe the affordance information contained within each feature and provide details about how to compute the feature values.

Table 2.1: The list of geometric features

Feature     Description
Scatter     How round a segment is
Linearity   How long and thin a segment is
Planarity   How planar a segment is
Normal      Average normal of a segment
Span        How large a segment is and how far it spans in XYZ coordinates
Centroid    XYZ coordinates of the centroid of a segment
Occupancy   Relative locations of neighboring segments

Saliency features of a point cloud describe the shape of the point cloud cluster. They are computed with a method inspired by tensor voting [1, 15]. The symmetric positive definite covariance matrix for a set of N 3D points X_i = (x_i, y_i, z_i)^T is defined as

\frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})^T

where \bar{X} is the centroid of the segment. From the covariance matrix, we obtain three eigenvalues \lambda_0 \geq \lambda_1 \geq \lambda_2 in descending order and three corresponding eigenvectors e_0, e_1, e_2. As shown in Fig. 2.3, \lambda_0, \lambda_0 - \lambda_1, and \lambda_1 - \lambda_2 represent the scatter, linearity, and planarity features of a 3D point cloud segment. These saliency features explain object affordances. For example, if a segment has a high linearity value, it is likely to be graspable.

Figure 2.3: Scatter, Linearity, Planarity of point cloud, adapted from [1]. Each saliency feature can be computed with the eigenvalues and eigenvectors of the covariance matrix.
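As a concrete illustration of the unary features in Table 2.1, the sketch below computes scatter, linearity, planarity, the normal, span, and centroid of one segment from the eigendecomposition of its covariance matrix. It is a minimal sketch assuming the segment is given as an N x 3 array of points; the example values are placeholders.

```python
# Unary geometric features of one point cloud segment, following the
# covariance-eigenvalue construction described above (sketch only).
import numpy as np


def segment_features(points):
    """points: (N, 3) array of XYZ coordinates belonging to one segment."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)        # symmetric positive semi-definite 3x3
    evals, evecs = np.linalg.eigh(cov)               # ascending order
    l2, l1, l0 = evals                                # relabel so that l0 >= l1 >= l2
    normal = evecs[:, 0]                              # eigenvector of the smallest eigenvalue
    return {
        "scatter":   l0,                              # how round the segment is
        "linearity": l0 - l1,                         # how long and thin it is
        "planarity": l1 - l2,                         # how planar it is
        "normal":    normal,                          # facing direction of a planar segment
        "span":      points.max(axis=0) - points.min(axis=0),
        "centroid":  centroid,
    }


if __name__ == "__main__":
    segment = np.random.rand(500, 3) * [1.0, 0.8, 0.02]   # a thin, roughly planar patch
    feats = segment_features(segment)
    print({k: np.round(v, 3) for k, v in feats.items()})
```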
Normal is derived from an eigenvector of the covariance matrix. The normal represents which direction the segment is facing. It is therefore a distinct geometric feature, indicating in which direction the segment can be manipulated.

Span implies how far a segment reaches in the XYZ coordinates. This gives an idea of how large the point cloud is and where it is located. For example, a wall will have a long z-axis span from the floor to the ceiling, implying that it is likely static and not pushable.

Occupancy is the relative pairwise geometric feature between segment neighbors. To express the relative pose, we first compute the nearest neighbors of each segment by computing the shortest distances between segments and thresholding them. Then we bin the direction of the vector between the two centroids of each segment pair into one of 12 bins (3 splits along the z axis, each with 4 directions in the xy plane, shown in Fig. 2.4). The shortest distances between two segments are scaled and normalized to express a measure of closeness.

Figure 2.4: Occupancy binning of neighboring segments. Each vector between centroids of segment pairs is binned into one of these 12 directions (3 layers of planes, each with 4 directions).

2.3.3 Training with logistic regression

Object affordances describe an action a robot can do with an object. For example, a mobile robot could perform simple motions like pushing an object, while a humanoid robot can perform complicated multi-step motions like folding a chair or opening a door by turning the knob. In our work, we define 6 semantic labels with simplified motion primitives: pushable forward, pushable backward, pushable left, pushable right, liftable, and graspable. Each label except 'graspable' is a single-step, omnidirectional motion with discretized orientations (forward, backward, left, right, up). All directions are relative to the current orientation and position of the robot, so we assume a fixed body pose of the robot in this work.

In order to build ground truth samples for training and evaluation, we define the following criteria to manually label point cloud segments:

- All segments belonging to a single object should share the same labels.
- A segment is not pushable in a certain direction if neighboring segments exist in that direction (e.g. a table in front of a box is not pushable forward; a table with an object on it is not liftable).
- A segment is graspable if it has a thin and narrow part that the PR2 can grab with its gripper.
- Labels represent physically feasible primitives (e.g. an object on a table is pushable in all directions, even though it might fall off).

Figure 2.5: Binary ground truth label of 'pushable forward'. The left table in front of the box and the chair in front of the right table are 'not pushable forward' as there are objects behind them (red: pushable forward, blue: not pushable forward).

For training a 3D point cloud classifier, Xiong et al. [12] used a CRF model to label indoor point clouds with geometric labels, and Koppula et al. [13] used an MRF with a log-linear model and pairwise edge potentials for semantic labeling. The shortcoming of those techniques is their computation time: in [13], it took an average of 18 minutes per scene to predict labels with a mixed-integer solver. To speed this up, we use logistic regression, which is simple and fast. Logistic regression predicts the probability of a segment having a label l by computing

p_l(x) = \frac{e^{f_l(x)}}{e^{f_l(x)} + 1}, \qquad f_l(x) = \beta^l_0 + \sum_{i=1}^{n} \beta^l_i x_i

where x = \{x_1, x_2, \ldots, x_n\} and x_i stands for the i-th feature value of the segment. To obtain the parameters \beta^l = \{\beta^l_0, \beta^l_1, \ldots, \beta^l_n\}, we manually labeled point cloud segments with binary classifications and trained on them. Fig. 2.5 shows an example of labeled ground truth data.
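A minimal sketch of this per-label training and prediction step is shown below. It uses scikit-learn's logistic regression as a stand-in for the WEKA tool used in the experiments of Section 2.6, and the feature-matrix layout and placeholder data are assumptions for illustration.

```python
# One independent binary logistic-regression classifier per affordance label,
# trained on manually labeled segments (sketch; the thesis experiments used WEKA).
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = ["pushable_forward", "pushable_backward", "pushable_left",
          "pushable_right", "liftable", "graspable"]


def train_affordance_classifiers(X, Y):
    """X: (num_segments, num_features) matrix of unary + pairwise features.
    Y: (num_segments, 6) binary ground-truth matrix, one column per label."""
    classifiers = {}
    for j, label in enumerate(LABELS):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, j])
        classifiers[label] = clf
    return classifiers


def predict_label_vector(classifiers, x):
    """Return p_l(x) for every affordance label of one segment."""
    x = np.atleast_2d(x)
    return {label: float(clf.predict_proba(x)[0, 1])
            for label, clf in classifiers.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(195, 25))                  # 195 segments, 25 features (as in Sec. 2.6)
    Y = (rng.random((195, 6)) > 0.5).astype(int)    # placeholder ground truth
    clfs = train_affordance_classifiers(X, Y)
    print(predict_label_vector(clfs, X[0]))
```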
2.4 Iterative k-means clustering for entropy minimization

If a robot is to manipulate objects to reconfigure their locations or orientations, it is important to recognize an object as an entity, because when a robot interacts with a certain object, all segments belonging to that object move together. As the region growing segmentation method usually picks up parts of an object, segments should be stitched together to represent a single object. To find such segments, the predicted labels can be utilized as feature vectors. As every segment from a single object shares the same characteristics, those predicted segments should have similar labels as well. So we can refine the object segmentation by grouping segments with similar label vectors.

Figure 2.6: Concept of iterative k-means clustering, with two label vectors V_1, V_2. Set S = {S_1, S_2, S_3} is split into two subsets by k-means clustering and the entropies are computed from the centroids of the sets. If H(C) > H(C_1) or H(C) > H(C_2), the subsets are iteratively split again.

For manipulation planning, it is also crucial to pick a segment with the lowest affordance uncertainty in order to minimize manipulation failure. As each label represents a predicted probability, the uncertainty of an affordance can be expressed by the entropy of the label vector:

H(X) = E[-\ln P(X)] = -\sum_{l=1}^{n} p_l(x) \ln p_l(x)

where p_l(x) is the probability obtained from the classifier for label l. A higher accuracy of prediction (either close to 0 or 1) yields a lower entropy value. So we can find the segment with the highest affordance certainty by searching for the lowest entropy. Algorithm 1 sorts every segment into groups with two objectives: 1) finding similar label vectors to track down a single object, and 2) finding low-entropy segments to reduce manipulation uncertainty.

Algorithm 1: Iterative k-means clustering for entropy minimization
 1: for segment set S = {S_1, S_2, ..., S_n} do
 2:   Split S into two subsets S^1, S^2 by 2-means clustering:
 3:     Start with two random point vectors m_1, m_2
 4:     while (1/N_j) * sum_{i in j} ||S_i - m_j||^2 > d_th do
 5:       if ||S_i - m_1||^2 < ||S_i - m_2||^2 then
 6:         S_i -> S^1
 7:       else
 8:         S_i -> S^2
 9:       end if
10:       Update m_j = (1/N_j) * sum_{i in S^j} S_i
11:     end while
12:   if H(S) > H(S^j) then
13:     Go to 4 and split iteratively
14:   else
15:     Group the segments in the subset as an object
16:   end if
17: end for
18: return groups of segments S^j

The algorithm starts with all segments in one set and tries to separate groups of segments by similarity. The concept of one iteration step is shown in Fig. 2.6. As the similarity between segments can be expressed by the Euclidean distance between label vectors, the k-means clustering algorithm can be applied to group segments. The problem with k-means clustering is that the exact number of objects is unknown, and finding the optimal k is an NP-hard problem. So we set k = 2 and run the clustering iteratively to pick out groups of similar segments at each iteration. The 2-means clustering starts with two random points and splits the segment set S into two subsets by comparing the Euclidean distances of the label vectors from the seed points. The seed points are updated by averaging the label vectors of the subset, and the process is repeated until convergence. We also consider the entropy change caused by each split, where the entropy of a set is computed from the centroid of the set. If splitting lowers the entropy, the subset has a higher certainty of object affordance. Therefore, segments with lower uncertainty can be found through the iterations. The result of the algorithm can be portrayed as a tree structure whose depth increases with each iteration. Object segmentation is obtained at each depth of the tree, and the entropy is minimized at the leaves of the tree.
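A compact, executable sketch of Algorithm 1 is given below. The label vectors are the per-segment probabilities produced by the classifier of Section 2.3; the fixed iteration count and the way convergence is handled are simplifying assumptions.

```python
# Iterative 2-means splitting of segments by their predicted label vectors,
# keeping a split only when it reduces entropy (a sketch of Algorithm 1).
import numpy as np


def entropy_of_centroid(label_vectors):
    """Entropy computed from the centroid of a set of label vectors."""
    c = np.clip(label_vectors.mean(axis=0), 1e-9, 1 - 1e-9)
    return float(-(c * np.log(c)).sum())


def two_means(vectors, iters=50, seed=0):
    """Plain 2-means clustering on label vectors; returns a boolean assignment."""
    rng = np.random.default_rng(seed)
    m = vectors[rng.choice(len(vectors), size=2, replace=False)].copy()
    assign = np.zeros(len(vectors), dtype=bool)
    for _ in range(iters):
        assign = (np.linalg.norm(vectors - m[0], axis=1)
                  > np.linalg.norm(vectors - m[1], axis=1))
        for j in (0, 1):
            if np.any(assign == j):
                m[j] = vectors[assign == j].mean(axis=0)
    return assign                                   # False -> subset 1, True -> subset 2


def iterative_split(indices, vectors, groups):
    """Recursively split a set of segments while the split lowers entropy."""
    if len(indices) < 2:
        groups.append(indices)
        return
    assign = two_means(vectors[indices])
    subsets = [indices[~assign], indices[assign]]
    parent_h = entropy_of_centroid(vectors[indices])
    child_h = [entropy_of_centroid(vectors[s]) for s in subsets if len(s)]
    if len(child_h) == 2 and parent_h > min(child_h):
        for s in subsets:
            iterative_split(s, vectors, groups)
    else:
        groups.append(indices)                      # keep this subset as one object


if __name__ == "__main__":
    label_vectors = np.random.rand(20, 6)           # 20 segments x 6 affordance labels
    groups = []
    iterative_split(np.arange(20), label_vectors, groups)
    print([g.tolist() for g in groups])
```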
2.5 Incremental object segmentation refining by merging multiple views

Object segmentation can also be enhanced by merging predicted segments from multiple views. In each scene, the labels of a single object should be consistent no matter which viewpoint they were seen from. Also, those segments should be close to each other (sometimes they might even overlap).

We propose an algorithm to find groups of segments having both proximity and similarity. We run the classifier on individual scenes and merge all the predicted segments into the global frame. Euclidean distances in XYZ space and in label vector space are computed among segments for proximity and similarity. The process can incrementally refine the object segmentation, as merging more views yields more consensus among the predicted segments.
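The sketch below illustrates this proximity-and-similarity grouping. The 0.05 m distance and 0.7 label-distance thresholds are the values reported later in Section 2.6.3; using centroid distances in place of the shortest point-to-point distances, and a union-find structure for the grouping, are implementation assumptions.

```python
# Group segments from multiple registered views when they are close in XYZ
# space and have similar predicted label vectors (sketch only).
import numpy as np


def merge_multiview_segments(centroids, label_vectors,
                             dist_thresh=0.05, label_thresh=0.7):
    """centroids: (n, 3) segment centroids in the global frame.
    label_vectors: (n, 6) predicted affordance probabilities per segment."""
    n = len(centroids)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(centroids[i] - centroids[j]) < dist_thresh
            similar = np.linalg.norm(label_vectors[i] - label_vectors[j]) < label_thresh
            if close and similar:
                parent[find(i)] = find(j)           # merge the two segments' groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())                    # each group ~ one object
```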
2.6 Experimental results

In our experiments, the PR2 robot from Willow Garage was used. With its two 7-DOF arms equipped with grippers and its mobile base, the PR2 can perform simple tasks like pushing, grasping, and lifting (up to a 1.8 kg payload). All 3D point clouds were captured by a Microsoft Kinect mounted on the head of the PR2, and the algorithms were implemented in the Robot Operating System (ROS).

2.6.1 Logistic regression for semantic classification

In our experiment, the captured 3D point cloud was transformed into the PR2 frame, with the base link of the PR2 located at the origin. In the PR2 frame, the z axis corresponds to upward and the x axis corresponds to the frontal direction of the PR2. We collected 10 office scenes with different configurations of tables, chairs, and boxes. Each scene was segmented by the region growing method, resulting in a total of 195 segments. We manually labeled each segment with the 6 semantic labels and applied the WEKA logistic regression tool [16] to learn the parameters \beta^l_0, \ldots, \beta^l_n for each label l = 1, ..., 6, with the number of features n = 25. 4-fold cross-validation was used to evaluate the classifier, and the result is shown in Table 2.2, with an average of 81.8% precision and 81.7% recall. Fig. 2.7 shows examples of the classifier prediction, with hand-labeled ground truth (left) and the probabilities of the predicted label (right).

Table 2.2: Result of logistic regression for semantic labels

Label               Precision (%)   Recall (%)
Pushable forward    82.0            81.0
Pushable backward   82.4            81.9
Pushable left       82.4            81.9
Pushable right      78.0            78.5
Liftable            78.9            80.4
Graspable           86.8            86.7
Average             81.8            81.7

Figure 2.7: Result of predicted classification for the labels 'pushable forward' (a) and 'liftable' (b). The ground truth is shown (left) with binary classification (red: possible, blue: not possible) and the probabilities of prediction are shown in heatmap style (right).

2.6.2 Iterative k-means clustering with entropy minimization

Fig. 2.8 shows the result of iterative k-means clustering for entropy minimization. In each step, we can see the segments split into two subgroups, and the object segmentation gets refined at each step. In Fig. 2.8(a), two tables and a chair have been separated from the environment first; a wall and a floor were then divided in another iteration. We can also increase the certainty of segment labels (Fig. 2.8(b)). Among the segments belonging to a box, the entropy gets lower with every iteration, and the last two segments show the least prediction uncertainty.

Figure 2.8: Results of iterative k-means clustering. (a) Object segmentation gets refined as two tables and a chair are separated from other segments (left) and a wall and a floor are separated (right). (b) With every split iteration, the entropies get lower. The last segments have the lowest uncertainties among the segments of a box.

2.6.3 Incremental object segmentation refining by merging multiple views

An example of incremental classification refinement is shown in Fig. 2.9. Since our directions of object affordance (e.g. 'pushable forward') are relative to the position and orientation of the PR2 robot, multiple views were collected by rotating the head-mounted sensor with a fixed body position. Individually classified segments were merged into the global frame (Fig. 2.9(b)), and we filtered XYZ Euclidean distances between segments with a threshold of 0.05 m as a proximity measure. We then filtered similarity to be within a threshold of 0.7 (relatively, an 11.6% difference). As shown in Fig. 2.9(c), the segmentation of the two boxes in the merged frame improved compared to the individual segmentations.

Figure 2.9: Incremental classification with multiple views. (a) Two views obtained by rotating the head-mounted Kinect on the PR2. (b) Segments merged into a global frame. (c) Segments representing a big box (top) and a small box (bottom), shown from the left view (left), the right view (right), and the merged cloud (middle).

2.7 Conclusions

In this chapter, we presented a semantic 3D point cloud classifier of object affordance. The point cloud was divided into segments and classified by logistic regression. Object segmentation was refined with iterative k-means clustering and by incrementally merging multiple point clouds. As a result, a robot can map the environment and plan how to interact with objects.

Chapter 3
Affordance Map Building with Interactive Manipulation

3.1 Introduction

Consider a robot tasked with rearranging disorganized boxes (Fig. 3.1). Without prior knowledge of the environment, the robot must explore the world and identify target objects that need to be moved. It must also identify objects that it can move. Such functionalities are defined as affordances. In order to recognize affordances, a robot first needs to utilize its sensors to perceive the environment. The input sensor data are processed to recognize object identities and corresponding affordances. However, visual cues are not sufficient to fully understand affordances, for lack of interaction: whether a box is empty or not is often unknown unless the situation is disambiguated by the robot trying to push the box. Therefore, visual perception and interactive manipulation are both necessary for affordance modeling.

In this chapter, we propose an algorithm to semantically map affordances for robotic tasks as follows:

1. Explore and build a map of the environment.
2. Predict affordances of the map based on sensor observations.
3. Plan a sequence of interactive manipulations to evaluate and confirm the predicted affordances and reduce the overall uncertainty of the affordance map.

3.2 Related work

When a robot is given a task in an unknown environment, it must understand the configurations and characteristics of the objects to accomplish the task.
For navigation toward a goal among movable obstacles [3], knowledge about the physical characteristics of objects is necessary. Wu et al. [4] proposed a navigation algorithm that interacts with obstacles to learn the characteristics of the objects. They used a greedy approach to select an object to test whether it is static. Many techniques have been developed to reveal characteristics of objects like physical properties, colors, shapes, etc. However, object affordance, differing from general properties of objects, is a higher-level abstraction and is directly suitable for manipulation.

Figure 3.1: A PR2 robot situated in a simulated warehouse with disorganized boxes (left). In order to rearrange boxes at desired positions, the robot must push the boxes. But some boxes cannot be pushed because they are blocked either by walls or other boxes. An affordance represents "pushability" in a certain direction. In this chapter, we explore how to learn an affordance map via interactive manipulation in order to plan a rearrangement (right).

Koppula et al. [17] learned human activities and object affordances from RGB-D videos. A human performing an action, like drinking, relates an affordance label to the interacting object, like a cup. So an affordance can be extracted by observing human activities and modeling the relationship between the human and the object. Varadarajan et al. [18] built an affordance network to recognize unknown objects. By searching for the part with a specific affordance (e.g. a handle for graspable), a novel object could be recognized with the matched affordance. These affordances are focused on understanding objects in a human context. Here, we interpret affordances of objects from the robot's view and use them to accomplish robotic tasks.

Among the many approaches for recognizing and learning about unknown objects, there has been prior research on utilizing interaction between the robot and the environment. Dogar et al. [19] used a pushing primitive to grasp objects in a cluttered scene. Hausman et al. [20] tracked the movement of objects in clutter to disambiguate both textured and textureless objects. Chu et al. [21] used haptic feedback from touching objects to learn haptic adjectives. Gupta et al. [22] studied interactive manipulation in a cluttered environment to plan a rearrangement of objects that maximizes the expectation of a hidden target object being revealed. Our research differs from previous work in that we obtain task-specific affordances for a robot and build an affordance map. Instead of applying simple pre-assigned motions for interaction, we plan a manipulation sequence based on a classifier that predicts affordances. We use an MRF model to estimate the consequences of manipulation in order to maximize the information gain from each interaction.

3.3 Affordance prediction and manipulation planning

3.3.1 Defining affordance

Defining affordance for a robotic task is complicated since affordance has semantic meaning [23]. Four components need to be considered in defining an affordance: an agent, a behavior, an entity, and an effect. When an agent performs a behavior on an entity, an effect occurs. The behavior (robot action) and the effect (interaction result) should be described with parameters to utilize the affordance in a robotic application. For example, an affordance "pushable" can be expressed with parameters of direction and magnitude. An affordance can also be sampled from a distribution, accounting for uncertainty and errors in perception and manipulation. In our work, we simplify the parameters of action and effect by fitting them into a grid. Predicting affordance then becomes tractable, and the derived affordance can be directly applied to the planning.
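To make this four-component, grid-simplified view concrete, the small sketch below encodes one possible representation: a unit push in one of four discrete directions as the behavior, a one-cell displacement as the effect, and a probability per cell and direction to capture uncertainty. The class and field names are illustrative assumptions, not the thesis implementation.

```python
# A minimal grid-based affordance representation: per occupied cell, the
# behavior is a unit push in one of four directions, the effect is a one-cell
# displacement, and a probability captures prediction uncertainty.
from dataclasses import dataclass, field
from typing import Dict, Tuple

DIRECTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class CellAffordance:
    cell: Tuple[int, int]                             # (x, y) index in the 2D occupancy grid
    push_prob: Dict[str, float] = field(
        default_factory=lambda: {d: 0.5 for d in DIRECTIONS})  # prior: unknown

    def expected_effect(self, direction: str) -> Tuple[int, int]:
        """Cell the object is expected to occupy after one unit push."""
        dx, dy = DIRECTIONS[direction]
        return (self.cell[0] + dx, self.cell[1] + dy)


if __name__ == "__main__":
    a = CellAffordance(cell=(3, 4))
    a.push_prob["up"] = 0.9                           # e.g. set from the classifier of Sec. 3.3.3
    print(a.expected_effect("up"), a.push_prob)
```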
3.3.2 Building a grid map and collecting visual data

In the work reported here, a PR2 robot with an RGB-D camera is used to obtain a 3D point cloud, from which a 2D occupancy grid map of the environment is created. The previously defined discrete affordances can be utilized on this simplified 2D grid world. Here we assume a PR2 robot situated in a simulated warehouse with randomly distributed boxes. First, the robot explores the world to build a grid map. To find the occupancy of grid cells, we use a ray tracing technique (Fig. 3.3 left). As the robot traverses the environment and discovers unknown occupied cells, each incoming point cloud from a single view (Fig. 3.2 left) is merged into the global point cloud. Registration of the point clouds is aided by an Extended Kalman Filter based on odometry data. A frontier-based approach is used to explore the map by setting a frontier on the boundary between known and unknown regions. The robot is driven through all approachable regions without any physical interaction with objects and covers the map. As a result, we get multiple point cloud snapshots, a merged and fully registered 3D world point cloud (Fig. 3.3 right), and a discretized 2D occupancy grid map.

Figure 3.2: An example point cloud of a single view (left). One example of the warehouse setting with randomly placed boxes (right).

Figure 3.3: Obtained 2D occupancy grid using ray tracing. Black means occupied, light gray means empty, and dark gray means unknown (left). Registered 3D world point cloud built by stitching all obtained point clouds with the EKF (right).
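As a minimal illustration of the occupancy mapping described above, the sketch below casts a ray from the sensor to each measured return on a 2D grid, marking traversed cells free and endpoint cells occupied. The grid resolution and the unknown/free/occupied encoding are assumptions for illustration; point cloud registration and frontier exploration are omitted.

```python
# 2D occupancy grid built by ray tracing from the sensor to each range return.
# Cells start unknown (-1), traversed cells become free (0), endpoints occupied (1).
import numpy as np


def bresenham(x0, y0, x1, y1):
    """Integer cells on the line from (x0, y0) to (x1, y1)."""
    cells = []
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx, sy = (1 if x1 > x0 else -1), (1 if y1 > y0 else -1)
    err = dx - dy
    x, y = x0, y0
    while True:
        cells.append((x, y))
        if (x, y) == (x1, y1):
            break
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy
    return cells


def update_grid(grid, sensor_xy, hits_xy, resolution=0.1):
    """Mark free space along each ray and occupied space at each endpoint."""
    sx, sy = (int(round(c / resolution)) for c in sensor_xy)
    for hx, hy in hits_xy:
        gx, gy = int(round(hx / resolution)), int(round(hy / resolution))
        for cx, cy in bresenham(sx, sy, gx, gy)[:-1]:
            if grid[cy, cx] == -1:
                grid[cy, cx] = 0                      # free
        grid[gy, gx] = 1                              # occupied
    return grid


if __name__ == "__main__":
    grid = -np.ones((50, 50), dtype=int)              # 5 m x 5 m map, all unknown
    hits = [(2.0, 2.0), (2.0, 2.5), (3.0, 1.0)]       # projected box returns (metres)
    update_grid(grid, sensor_xy=(0.5, 0.5), hits_xy=hits)
    print((grid == 1).sum(), "occupied cells,", (grid == 0).sum(), "free cells")
```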
If the sensor data show that there is no obstacle between the sensor and the object, this knowledge of empty space is used to estimate accessibility. Accessibility is one of the relative factors that determine affordance.

Hierarchical Matching Pursuit 3D (HMP3D) features are extracted from the point cloud for object recognition [25]. In each voxelized point cloud, a dictionary is learned by the K-SVD algorithm to express a point cloud as a sparse vector. The sparse vectors are encoded with hierarchical matching pursuit, where feature components are derived from the point cloud voxels at various scales. Their results show that the geometric properties are captured successfully to detect objects. In our work, we apply a simplified version of the HMP3D feature vectors by collecting the surface normals of voxels of different sizes.

The occupancy of the neighboring voxel segments can be encoded in pairwise features. For example, a single object in free space is pushable toward any nearby empty cell. But two contiguously positioned objects cannot be pushed toward each other unless the robot is strong enough to push both together. Also, if a wall is blocking an object, the object cannot be pushed toward the wall. By looking at neighboring voxel segments, we can estimate the influence of neighbors on affordances.

With the extracted unary and pairwise geometric features, a classifier is trained to predict affordance labels. We define the robotic task affordances as follows: pushable up, down, left, and right in 2D. This notation is useful as the robot can directly use the affordances as pushing primitives. In addition to the direction of the affordance, the affordance parameter describing how far an object can be pushed needs to be defined. For simplicity and lower complexity, we set the affordance unit value to be one grid cell.

Among many classifiers, we choose a logistic regression classifier for its simplicity and fast computation. We trained the classifier with sampled voxel segments along with their geometric features. These point cloud examples were gathered by driving the PR2 around the simulated warehouse among various object configurations. Through training, we obtain parameters \theta_l = \{\theta_{l,0}, \theta_{l,1}, \ldots, \theta_{l,m}\}, where \theta_{l,i} stands for the ith parameter over the m features for affordance label l. The logistic regression is expressed as follows:

p_l(f) = \frac{e^{g_l(f)}}{e^{g_l(f)} + 1}, \qquad g_l(f) = \theta_{l,0} + \sum_{i=1}^{m} \theta_{l,i} f_i \qquad (3.1)

where f = \{f_1, f_2, \ldots, f_m\} and f_i stands for the ith feature value of the segment. The function g_l(f) is a linear combination of the parameters \theta_l and the features. The prediction p_l(f) is computed by applying the logistic function to g_l(f), and the resulting value lies between 0 and 1, expressing how probable it is that an object has a particular affordance.
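Equation 3.1 is an ordinary per-label logistic model over the segment features. A minimal Python sketch of the prediction step, assuming the parameter vector for a label has already been trained (the numbers below are illustrative only):

import numpy as np

def predict_affordance(theta, f):
    """Evaluate Eq. 3.1 for one affordance label.

    theta : (m + 1,) array [theta_l0, theta_l1, ..., theta_lm]
    f     : (m,) array of geometric features of a voxel segment
    Returns p_l(f) in (0, 1): how probable it is that the segment has the label.
    """
    g = theta[0] + np.dot(theta[1:], f)      # g_l(f), linear in the features
    return 1.0 / (1.0 + np.exp(-g))          # equals e^g / (e^g + 1)

# Illustrative use with m = 42 geometric features (values are made up).
rng = np.random.default_rng(0)
features = rng.normal(size=42)
theta_push_up = rng.normal(scale=0.1, size=43)
print(predict_affordance(theta_push_up, features))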
3.3.4 Manipulation planning for affordance mapping

Based upon the predicted affordances, we need a plan to verify the affordance of each occupied cell by interactive manipulation. Here, we consider a simple manipulation primitive: pushing. After pushing, we need to validate whether the predicted affordance was correct. For validation, one might acquire haptic feedback from the end effector [21] or check the torque feedback from the mobile base wheels, since pushing an overweight object might result in slippage. In our approach, we perform change detection on the point cloud for verification. After executing one unit of pushing in a given direction, two candidate point clouds are compared: one is the hypothesized state of the environment after the unit push and the other is the observed point cloud. By computing the average Euclidean distance error between the expected and observed point clouds, the affordance can be confirmed.

In order to complete the affordance mapping, a strategy is needed to plan a sequence of manipulations and execute it. The basic approach would be a naive one, which examines all occupied cells exhaustively. This is expensive since the robot must travel all around the map and manipulate (push) every occupied cell. We decrease the number of manipulations by estimating the reduction in uncertainty resulting from each manipulation. The uncertainty of the affordance map can be expressed by entropy, where lower entropy indicates higher certainty. The planning strategy searches for the manipulation that reduces entropy the most.

Additionally, we predict the influence of a manipulation by adopting a Markov Random Field (MRF). A probabilistic graphical structure is constructed by setting each occupied cell to be a node and connecting edges between neighboring cells.

P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{i=1}^{n} \psi_i(x_i) \prod_{i,j} \psi_{ij}(x_i, x_j) \qquad (3.2)

The probability distribution over the occupied cells' affordances, P(x_1, x_2, \ldots, x_n), is computed from the product of unary and pairwise potentials, where n is the number of occupied cells and Z is the partition function. The unary potential \psi_i(x_i) is a prior, namely the affordance predicted by the classifier. The pairwise potential \psi_{ij}(x_i, x_j) encodes the mutual influence between cells x_i and x_j. We define the pairwise potential function as follows:

\psi_{ij}(x_i, x_j) = \exp(J x_i x_j)

where J > 0 is a parameter that describes how compatible cells i and j are. This pairwise factor expresses the similarity between two cells. It is based on the idea that neighboring segments tend to have similar affordances. For example, cells constituting a wall at the top of the grid map would all share the "not pushable up" affordance. Also, a directional pair of affordances, like pushable up and down, of neighboring cells have similar values by symmetry. That is, if a cell is not pushable upward, the cell above should not be pushable downward. This symmetry occurs because the affordances encode both accessibility and space availability: a "pushable up" cell should be accessible from the lower cell and the upper cell should be vacant. As these examples show, the pairwise factor expressing the similarity of neighbors can "smooth" the beliefs about the affordances.

By using an MRF model, we can estimate the entropy reduction from a manipulation both by confirming the affordance of a cell and by estimating the smoothing influence on the probability distribution. We use this model to plan a sequence of manipulations that reveals the affordance map.
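To make the MRF of Equation 3.2 concrete, the following minimal Python sketch builds the unary potentials from the classifier output, uses the Ising-style pairwise potential defined above, and shows how a confident neighbor reduces the entropy of an uncertain cell. It considers only a two-cell neighborhood with exact normalization, as a simplified stand-in for the belief propagation used in the actual system:

import numpy as np

def unary(p):
    """Unary potential psi_i(x_i) from the classifier's affordance probability p_l."""
    return np.array([1.0 - p, p])            # index 0: not pushable, 1: pushable

def pairwise(J=0.4):
    """Ising pairwise potential psi_ij(x_i, x_j) = exp(J * x_i * x_j), x in {-1, +1}."""
    spins = np.array([-1.0, 1.0])
    return np.exp(J * np.outer(spins, spins))

def smoothed_belief(p_i, p_j, J=0.4):
    """Exact marginal of x_i for a two-cell chain: neighbours pull beliefs together."""
    joint = unary(p_i)[:, None] * unary(p_j)[None, :] * pairwise(J)
    joint /= joint.sum()                      # partition function Z
    return joint.sum(axis=1)                  # marginal belief over x_i

def entropy(belief):
    b = np.clip(belief, 1e-12, 1.0)
    return -np.sum(b * np.log2(b))

# A confident neighbour (0.9) sharpens the belief of an uncertain cell (0.6),
# which is exactly the entropy reduction the planner tries to exploit.
print(entropy(unary(0.6)), entropy(smoothed_belief(0.6, 0.9)))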
The problem can be framed as the maximum Traveling Salesman Problem, where we have distances (costs) between nodes (target cells) and rewards (entropy reductions) for visiting each node. As the TSP is NP-hard, we apply a simple heuristic that chooses the maximum information-gaining manipulation at every step. We compute the information gain of each possible manipulation as follows:

G = \alpha R - (1 - \alpha) C
R = p_l \, E(x_i = \mathrm{true}) + (1 - p_l) \, E(x_i = \mathrm{false})
C = D(\mathrm{pose}_{\mathrm{current}}, \mathrm{pose}_{\mathrm{access}})

The information gain G is the scaled sum of the reward R and the cost C with scaling factor \alpha. The reward R is the expected entropy reduction from confirming cell x_i's affordance to be true or false, each case weighted by the predicted affordance p_l from Equation 3.1. The cost C is the distance the robot needs to travel from its current position pose_current to the goal position pose_access from which it can perform the manipulation. The goal position depends on the type of affordance the robot wants to examine. Using the formula above, the robot iteratively chooses the manipulation with the maximum information gain.

3.4 Experimental results

The experiments were conducted in a Gazebo warehouse simulation with the PR2 robot. The PR2 is equipped with a Microsoft Kinect to obtain 3D point clouds and two 7-DOF manipulator arms to interact with objects. To train the affordance classifier, we built 20 maps of a 12 x 12 m virtual warehouse, each with a different configuration of boxes (Fig. 3.2 right). The dimension of a box is 1 x 1 x 2 m, and each box occupies a single 1 x 1 m grid cell. The number of boxes in the scenes varies from 10 to 30, providing varied training inputs of relatively positioned boxes and walls. The boxes and walls are set to have the same textures and heights so that they cannot be distinguished by visual cues alone. The ground truth for the training examples is generated automatically based upon the assumption that walls and two aligned boxes cannot be pushed.

Figure 3.4: Ground truth for the 4 directions of affordances. A green arrow represents a cell being pushable and red represents not pushable (left). Predicted pushing-up affordance values for each voxel, color coded in heatmap style: 0 (not pushable) being blue and 1 (pushable) being red (right).

An example of the ground truth for the 4 affordances, pushable up, down, left, and right, is shown in Fig. 3.4 left. We assume that a robot in a cell can move in one of 4 directions to the neighboring cells. The robot is driven around the map by approaching the nearest frontier (non-explored area) to cover the 2D occupancy grid map and to collect snapshots of point clouds. The robot travels all reachable space and the gathered point clouds are stitched into a 3D world point cloud. For each view, the obtained point cloud is segmented into 1 x 1 x 1 m voxels and geometric features are extracted to predict the affordances (Fig. 3.4 right). Since a single object might span multiple voxels, the predicted affordance values are averaged over one grid cell to represent the estimated affordance of the object occupying the cell (Fig. 3.5 left).

Figure 3.5: Prediction result for the affordance "pushing up" in heatmap color (left). Affordance prediction after MRF smoothing (right).

Logistic regression (using the WEKA tool [16]) was used to learn the parameters \theta_{l,0}, \ldots, \theta_{l,m} for each label l = 1, \ldots, 4 with m = 42 features. 10-fold cross-validation was used to evaluate the classifier. The prediction precision and recall were 81.5% and 79.7% for pushing up and down, and 73.1% and 74.1% for pushing left and right, respectively. As mentioned in Section 3.3, the affordance predictions for directional pairs are the same as they are symmetric.

After obtaining the 2D occupancy grid map and the affordance predictions for each occupied cell, the robot starts to plan a manipulation sequence by computing the information gain. To perform a pushing primitive, the PR2 uses its two manipulator arms, aligning the end effectors vertically to maximize the contact surface with the object. Accessibility needs to be considered in selecting candidate pushes. Accessibility could be exploited via sequences of moving objects, but we define it to be single-step approachability. For a given cell and affordance direction, if there exists a path to the target position from which that affordance can be examined, the accessibility is set to true.
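Putting together the pieces of Section 3.3.4 — the information gain G = \alpha R - (1 - \alpha) C, the entropy-based reward, and the accessibility check — the greedy selection can be sketched in a few lines of Python. The helper names, the use of the cell's current binary entropy as the reward, and the Manhattan-distance travel cost are illustrative simplifications of the actual planner (which also accounts for the MRF smoothing influence):

import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(p_l, travel_cost, alpha=0.8):
    """G = alpha * R - (1 - alpha) * C for one candidate manipulation.

    As a simplification, the reward R is taken to be the entropy currently
    stored in the cell's affordance belief, i.e. what confirming it removes.
    """
    reward = binary_entropy(p_l)
    return alpha * reward - (1 - alpha) * travel_cost

def select_manipulation(candidates, robot_pose, alpha=0.8):
    """candidates: list of (cell, direction, p_l, access_pose). Returns the best one."""
    def gain(c):
        _cell, _direction, p_l, access_pose = c
        cost = abs(access_pose[0] - robot_pose[0]) + abs(access_pose[1] - robot_pose[1])
        return information_gain(p_l, cost, alpha)
    best = max(candidates, key=gain)
    return best if gain(best) > 0 else None   # stop once moving costs more than it reveals

# Two candidates: an uncertain nearby cell beats a nearly-certain far-away one.
cands = [((3, 4), "up", 0.55, (3, 3)), ((9, 9), "left", 0.95, (10, 9))]
print(select_manipulation(cands, robot_pose=(3, 2)))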
The parameter \alpha for computing the information gain is set to 0.8 and the cost of moving one cell is set to 1. The libDAI library [26] is used to build the MRF graph and to run belief propagation to derive the probability distribution P(x) in Equation 3.2. A factor graph is generated by setting each occupied cell to be a node and connecting the neighboring up, down, left, and right cells by edges. The unary potentials are directly imported from the prediction result. The pairwise potentials are computed using the Ising model with J = 0.4.

An example of the predicted affordance "pushing up" is shown in Fig. 3.5 left. The prediction values between 0 and 1 are shown in heatmap colors, with 0 being blue (not pushable) and 1 being red (pushable). The result of MRF smoothing in the first manipulation step is shown in Fig. 3.5 right. The MRF successfully reduces the overall entropy by reflecting the assumption of affordance similarity between neighbors. Sometimes, applying the MRF can "over-smooth" the affordance prediction by treating separate objects as similar (e.g., the pushable-up object at the top left of Fig. 3.5). But its influence degrades as further manipulations are applied to confirm the true affordances.

In Fig. 3.6, a sequence of manipulations is shown. All candidate cells with 4 directional affordances are examined for their estimated information gain from manipulation. The robot selects the manipulation with the most information gain and moves to the position where it can examine that affordance (e.g., the robot moves to the vacant cell to the 'right' of an object to confirm pushable 'left').

Figure 3.6: A sequence of manipulation steps. At each step, the robot chooses the manipulation with the highest information gain, which means a larger entropy reduction with less movement.

After performing a pushing action, the robot returns to the position where it started pushing and detects the change to confirm the affordance. As an affordance depends on the relative positions in the environment, it might change when an object gets pushed. Therefore, affordances should be modified according to the new configuration in the robot's occupancy map. For example, in the second and third manipulation steps in Fig. 3.6, the object that has been pushed to the right 'was' pushable, but it becomes not pushable after the push, due to the wall. The other 3 directional affordances are also affected by a successful push and are updated as well. At the end of each manipulation step, the overall affordance predictions are updated through the MRF with the newly obtained affordance information. In every iteration, accessibility is recomputed according to the configuration change. The experiment is conducted by iterating manipulation selection, action, and affordance confirmation until the maximum information is reached (any further information gain would be negative: the movement costs more than the entropy it reduces).

The results of the two approaches, the exhaustive approach and the maximum information gain method, are compared in Fig. 3.7. The experiments were executed over 20 different maps and the average entropy plot is shown. The exhaustive approach selects a manipulation cell based on proximity: the nearest occupied cell is selected at each step.
The exhaustive method does not utilize the MRF model, since it does not consider how influential each manipulation can be. The second approach selects a manipulation by maximizing the information gain. Each manipulation gains new knowledge about affordances and the effects propagate through the MRF. As shown in the plot, the information-maximizing approach reduces the overall entropy faster than the exhaustive method.

Figure 3.7: Average entropies of the two approaches: the exhaustive approach and the maximum information gain approach.

3.5 Conclusions

In this chapter, we presented a method to build a task-oriented affordance map by interactive manipulation. Through robot exploration, a 2D occupancy grid map was built along with a registered 3D world point cloud. The point cloud was voxelized and affordances were predicted using unary and pairwise geometric features. Based upon the predictions, interactive manipulation was planned and executed using an MRF model to reduce the uncertainty of the affordance map. As a result, a complete affordance map was constructed for the robotic task.

Chapter 4
Hierarchical Affordance Prediction Model Learning using Environmental Context

4.1 Introduction

Robots need to be capable of predicting the effects of their actions to plan manipulations in different situations. The set of actions that can be performed with an object, as well as the corresponding effects, are known as the object's affordances [2, 27]. For example, a robot can perform a grasping action with a graspable object, which will result in the object becoming attached to the robot's manipulator. These affordances are not universal properties and depend upon the state of the object in the environment. For instance, a salt shaker on a shelf can be grasped, but it is not graspable when it is behind other objects. Similarly, a cup is not always a container when it is placed upside down [28]. Therefore, robots should consider the environmental context of the object to determine its affordances.

A prediction model estimates the outcome of an action applied to an object. The affordance model can be learned from experience by performing actions and observing their effects on the objects. For example, to learn the push affordances of a book (Fig. 4.1), the robot can apply various pushes and observe how the book is moved by each action. Certain features of the state and action will have continuous effects on the change of the object state. For example, if the push has a fixed length, then the distance between the starting position of the robot's hand and the location where it makes contact with the object will affect the distance that the object is pushed. A longer distance until contact will result in a shorter pushing movement. In certain situations, the set of relevant features will be different. For example, if an object is placed against an obstacle, then the distance until contact will not be relevant as the motion will be blocked. To capture these situations, the robot could use a hierarchical model with low-level submodels that use different sets of features for predicting the effects.

Figure 4.1: A PR2 robot trying to learn a prediction model for the push affordance of a tabletop object. The effect of the pushing action depends on the environmental context.

We explore learning prediction models with environmental context features for push affordances. First, data are gathered from sample pushes in various contextual situations.
Features describing the characteristics of the object and its environment are extracted. We model the effects of the pushes using Gaussian process regression (GPR). We use the automatic relevance determination (ARD) kernel to determine the influence of each feature on the model's prediction. To allow for different sets of relevant features, we also evaluate a hierarchical model, which uses a high-level SVM to select between low-level GPR submodels. The models were evaluated on a pushing task, as shown in Fig. 4.1. We evaluated different sets of features and varied the number of submodels. The model is validated by using it to plan a series of pushes to move the object to the edge of the table, where it is then grasped.

4.2 Related Work

The concept of affordances has been broadly studied in robotics. Affordances can be used to recognize objects [18, 29, 30, 31]. Varadarajan et al. [18] built an affordance network to recognize unknown objects from the affordances of object parts. Myers et al. [31] studied learning the affordances of tool parts for recognizing unknown tools.

Traversability affordances for mobile robots have also been studied. Similar to push affordances, traversability affordances require the robot to detect obstacles in its environment. Ugur et al. [32] studied traversability affordances with interactive learning, which was useful for robot path planning. Kim et al. [33] studied the traversability of outdoor environments from images, and in [34], reinforcement learning based adaptive traversability was also studied.

Given a prediction model, robots can plan a series of actions to perform a desired task. Barry et al. [35] proposed a planner which could push an object to the edge of a table for grasping. Similarly, the planner of King et al. [36] used push affordances to achieve better grasps of objects. We focus on learning the prediction model for push affordances. We successfully validated our model on a push-to-grasp task.

Work has been done on models for pushing single objects, i.e., without obstacles or interactions with other objects. Hermans et al. [37] studied how to predict the effects of pushes based upon the shapes of the objects. Local and global shape descriptors were evaluated to predict scores for push-stability and rotation variance. Kopicki et al. [38] studied pushing rigid objects using a product-of-experts model to predict the effects of the pushes. The work of Kroemer et al. [39, 40] is the most similar to our own. They present a prediction model that decomposes manipulation tasks into multiple distinct modes, and predicts when the mode switches will occur. Instead of linear models, we use Gaussian process regressors with ARD kernels to automatically determine the influence of the individual features depending on the situation.

Learning different contexts to learn affordances was studied by Ugur et al. [41]. They presented a hierarchical structure of affordances to categorize the effects of actions and built hierarchical clusters over the effect space. Eppner et al. [42] studied environmental constraints to describe the contextual information for generating a plan to grasp an object. Different configurations of environmental constraints supported different affordances (e.g., grasp, contact, slide). An abstract transition graph was built upon approximating the transferable state configurations.

4.3 Learning Prediction Models for Push Affordances

In this section, we define the prediction model for push affordances and present an approach to learn the model.
4.3.1 Push Affordance Prediction Modeling

Rather than learning a binary affordance model, e.g., graspable or not graspable, the robot learns a prediction model to estimate the continuous change in the state of an object when it is pushed. The pose of the object on the tabletop is defined by the 2D position x, y and orientation \theta in the global frame. Pushing actions are defined in the pushing coordinate frame, as shown in Fig. 4.2, where the axes x_p, y_p, and z_p are shown in red, green, and blue respectively. The pushing action moves the gripper by L = 15 cm in the z_p direction, starting at the origin of the pushing frame. Different pushing actions are then defined by placing the pushing frame at different poses in the global frame. The pushing frame is always oriented toward the center of the object.

The effects of the actions, e_1 = \Delta y_p, e_2 = \Delta z_p, and e_3 = \Delta\theta_p, are defined as displacements and rotation in the pushing frame. The effects of the push are then transformed into the global coordinate frame. The prediction model is defined in the pushing coordinate frame as a probabilistic distribution p(e_i | o, a) \sim \mathcal{N}(\mu_i, \sigma_i^2), where a is the pushing action, o is the manipulated object, and \mu_i and \sigma_i^2 are the mean and variance for effect e_i.

Figure 4.2: The prediction model for the push affordance is defined in the coordinate frame of the robot's gripper. The colors of the axes correspond to x_p (red), y_p (green), and z_p (blue). The push is made along the z_p axis and the effects e_1 = \Delta y_p, e_2 = \Delta z_p, and e_3 = \Delta\theta_p are distributions of the displacements along y_p and z_p and the rotation \Delta\theta_p.

4.3.2 Feature Extraction

To predict the effects of the robot's actions, the robot has to extract a set of features describing the state of the object, the interaction between the object and the hand, and the environmental context. We therefore utilize three different types of features: object, contact, and environmental context features.

f = \{f_1, f_2, \ldots, f_n\} = f^o \cup f^c \cup f^x

Object features f^o \in \mathbb{R}^{13} describe the pose and size of the object. We use seven features for the pose, three for the position and four for the quaternion orientation. We use six features that indicate the size of the object by taking the minimum and maximum values of the point cloud along each axis of the pushing frame. These features can be measured from the 3D point cloud captured by the RGB-D camera. The position of the object is derived from the centroid of the object point cloud and the orientation is derived from the eigenvectors of the covariance matrix of the object point cloud. Given the pose, the object model is projected to obtain a full point cloud.

Contact features f^c \in \mathbb{R}^{42} describe how the object makes contact with the robot's manipulator. A local shape descriptor is extracted, which describes the shape of the object around the region of contact. The local shape descriptor uses a volume V_r around the hand, shown as the red region in Fig. 4.3. The point cloud within the region is voxelized with voxel size V_v and the average surface normal of the point cloud within each voxel is computed. The features are then expressed as a multivariate normal distribution with mean \mu_r \in \mathbb{R}^6 and covariance \Sigma_r \in \mathbb{R}^{6 \times 6}, covering the position (\mathbb{R}^3) and average surface normals (\mathbb{R}^3). Each element of the mean and covariance is used as a feature.

Figure 4.3: A region of the point cloud used for extracting local descriptors (red). These local shape descriptors explain how the contact will be made when an object is pushed.

Environmental context features f^x \in \mathbb{R}^{576} capture the relation between the object and its environment.
For the tabletop pushing scenario, the contextual features represent the relation among the object, obstacles, and the table. The table itself bounds the space: any object pushed beyond its edge will fall off. When an object is pushed against an obstacle, its motion will be blocked. We encode the environmental context using a d_c x d_c m^2 grid around the object, aligned with the pushing frame. Each grid cell is associated with a single one-hot label: object, obstacle, table, or free space. The label is computed by projecting the point cloud onto the table plane. All of the points in each grid cell are counted and the majority label is chosen to represent the cell. Thus, the value of a grid cell containing mainly object points will be [1 0 0 0]. Examples of contextual features are shown in Fig. 4.4, where each cell is colored with its corresponding label: green for the object, gray for the table, red for the obstacle, and black for free space. The size of the extracted contextual features is f^x \in \mathbb{R}^{(d_c/n_g) \times (d_c/n_g) \times n_c}, with grid resolution n_g and number of entity categories n_c.

Figure 4.4: Encoded environmental context features in the pushing frame. Each grid cell is colored with its dominant label: green (object), red (obstacle), gray (table), and black (free space).

4.3.3 Prediction Model Learning

Given the extracted features, the robot learns the prediction model to represent the push affordance. The model computes the expected distribution over the action's effects. We assume a Gaussian distribution over the change in state, which we model using Gaussian process regression. With the N observed feature vectors F_N = \{f_1, f_2, \ldots, f_N\} and corresponding effects E_{iN} = \{e_{i1}, e_{i2}, \ldots, e_{iN}\}, the conditional distribution of the effect for a new action is a Gaussian distribution

p(E_{i(N+1)} | E_{iN}) \sim \mathcal{N}(\mu_{i(N+1)}, \sigma^2_{i(N+1)})

with mean and variance

\mu_{i(N+1)} = \mathbf{k}^T (K_N + \beta^{-1} I)^{-1} E_{iN}
\sigma^2_{i(N+1)} = k(f_{N+1}, f_{N+1}) + \beta^{-1} - \mathbf{k}^T (K_N + \beta^{-1} I)^{-1} \mathbf{k}

where K_N is the Gram matrix, \mathbf{k}^T = [k(f_{N+1}, f_1) \; k(f_{N+1}, f_2) \; \ldots \; k(f_{N+1}, f_N)], and \beta is a hyperparameter for the precision of the noise. We use a squared exponential kernel with automatic relevance determination (ARD) with length scales l_s, \forall s \in \{1, 2, \ldots, n\}:

k(f, f') = \sigma_0 \exp\left(-\frac{1}{2} \sum_{s=1}^{n} \frac{(f_s - f'_s)^2}{l_s^2}\right)

The ARD kernel automatically determines the length scale l_s of each individual feature. Irrelevant features receive high values for l_s and thus have a lower impact on the prediction.
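A compact Python sketch of the GPR prediction equations above with the squared exponential ARD kernel is given below. In practice the amplitude, noise precision, and length scales would be fit by maximizing the marginal likelihood rather than fixed as here, and the toy data are purely illustrative:

import numpy as np

def ard_kernel(A, B, sigma0, lengths):
    """k(f, f') = sigma0 * exp(-0.5 * sum_s (f_s - f'_s)^2 / l_s^2)."""
    D = (A[:, None, :] - B[None, :, :]) / lengths       # per-feature scaled differences
    return sigma0 * np.exp(-0.5 * np.sum(D ** 2, axis=2))

def gpr_predict(F, E, f_new, sigma0=0.01, beta=100.0, lengths=None):
    """Predictive mean and variance of one effect dimension at a new feature vector."""
    lengths = np.ones(F.shape[1]) if lengths is None else lengths
    K = ard_kernel(F, F, sigma0, lengths) + (1.0 / beta) * np.eye(len(F))
    k = ard_kernel(F, f_new[None, :], sigma0, lengths)[:, 0]
    alpha = np.linalg.solve(K, E)                        # (K_N + beta^-1 I)^-1 E
    mean = k @ alpha
    var = sigma0 + 1.0 / beta - k @ np.linalg.solve(K, k)
    return mean, var

# Toy data: 20 training pushes with 5 features, effect = displacement along z_p.
rng = np.random.default_rng(1)
F = rng.normal(size=(20, 5))
E = 0.1 * F[:, 0] + 0.01 * rng.normal(size=20)
print(gpr_predict(F, E, rng.normal(size=5)))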
4.4 Hierarchical Prediction Modeling

The ARD kernel determines the relevance of each feature. However, the set of relevant features may change depending on the context. For instance, object and contact features are less relevant than environmental context features for predicting the effect when the object is pushed against an obstacle. To capture these differences, we employ a hierarchical model with multiple GPR submodels that have individual sets of hyperparameter values.

Figure 4.5: Clustering the 2D effects of samples with different numbers of submodels: (left) two, (middle) three, and (right) four. In each cluster, similar effects are grouped and a respective GPR model is trained.

To learn the model, we first cluster the samples according to their effects. We apply a Gaussian mixture model to cluster the effects into m groups in the 2D space e^h = \{\Delta y_p, \Delta z_p\}. Given the clustered data, the robot learns a separate GPR model G_j for each cluster j from its samples. The robot also learns the gating function X(f) = j, \forall j \in \{1, 2, \ldots, m\}, to select the submodel given the current context described by the features f. We use a multi-class SVM trained on the clustered data as the gating function.

4.5 Evaluations

We evaluate the proposed approach using a PR2 robot and a tabletop object pushing task. In the first experiment, we evaluate how different sets of features affect the accuracy of predicting the effects of pushes and how the hierarchical modeling approach performs with different numbers of clusters. In the second experiment, we validate the learned models by applying them to task planning.

4.5.1 Experimental Setup

The target object is a book, which is flat, and the PR2 cannot grasp it unless it is pushed to the edge of the table. The obstacle is a large battery that is too heavy for the PR2 to push. The robot executes 120 pushes in various contextual configurations for training. For computing the contact features, a local volume V_r = 20 x 10 x 10 cm^3 around the gripper is voxelized with V_v = 1 cm^3. For the environmental context features, we used a d_c x d_c = 60 x 60 cm^2 grid with granularity n_g = 5 cm. Overall, a total of n = 13 + 42 + 576 = 621 features are used in the experiment.

Figure 4.6: Prediction errors for push affordance models with different sets of features and different numbers of submodels. Object features (red), object and contact features (green), and object, contact, and environmental context features (blue) are used for training. One, two, three, and four submodels are trained to evaluate which number of clusters best captures the differences in context.

4.5.2 Hierarchical Prediction Models with Selected Features

Using the GPR model, the robot learns the push affordance prediction model for three effect spaces: the displacement off the push axis e_1 = \Delta y_p, the displacement along the push axis e_2 = \Delta z_p, and the rotation e_3 = \Delta\theta_p. Three different sets of features are selected to train prediction models and analyze their influence on the prediction: object features; object and contact features; and object, contact, and environmental context features. For each GPR model, the kernel parameter \sigma_0 is set to 0.01 and the prediction accuracy is tested using 5-fold cross validation.
For hierarchical modeling with multiple GPRs, the number of submodels m is tested from one to four, as shown in Fig. 4.5. In the case of two submodels, the contextual difference between the groups can be understood as one containing free-space pushes and the other containing blocked pushes. With three submodels, blocked pushes are further categorized into fully blocked and partially blocked pushes. With four submodels, a group of prediction outliers appears as an additional cluster.

The prediction errors of GPR models with the selected sets of features over different numbers of submodels are shown in Fig. 4.6 for each effect space. For the prediction RMSE off the push axis, e_1, the accuracy of the models using all of the features is the best. However, it is better only by 0.12 cm and 0.13 cm on average compared to the models trained with object features, and with object and contact features. The contextual features are valid in terms of capturing the environmental differences, but do not significantly improve the accuracy. This result is similar for the cases where hierarchical approaches are applied with different numbers of submodels. Among the various numbers of submodels, three submodels show the best accuracy, indicating that the multiple GPR models capture the differences in context. Using four submodels resulted in decreased performance, most likely due to overfitting to the outliers, as shown in Fig. 4.6.

For the prediction RMSE along the push axis, e_2, the model shows a similar improvement from including environmental context features. Comparably, the accuracy with two and three submodels improves by 1.83 cm and 2.63 cm on average over the model without clustering. As the displacement along the z_p axis describes how far the book has been pushed, the effect highly depends upon whether an interaction with the obstacle happened during the push. By differentiating these contexts, i.e., whether the push is free, partially blocked, or fully blocked, the prediction performance was improved. For the prediction RMSE of the rotation e_3, adding environmental context features improved the model by 13.79%, compared with 8.93% and 9.75% for the \Delta y_p and \Delta z_p cases, respectively. As interactions during the push cause the object to rotate, considering the environmental context results in better predictions.

Figure 4.7: (top) Sampled transition candidates (blue) using the A* planning algorithm. (bottom) Sequence of the estimated plan to push the book to the goal position on the edge of the table.

4.5.3 Application of Prediction Model for Task Planning

The hierarchical prediction model for push affordances can be applied to task planning. The evaluation task is to push the book from the middle of the table to the edge of the table where the robot can grasp it. A pose near the edge of the table is selected as the goal configuration. One of the planned trajectories is shown in Fig. 4.7 on the top. The goal is set to be on the edge of the table and the obstacle restricts straight pushes toward the goal. A* planning generates the plan using the model with two submodels, learned in the previous experiment. For the transition candidates from a state, different predictions of the push effect are sampled from the hierarchical model.
When a sampled push moves the book toward the obstacle, the GPR submodel for blocked pushes is chosen by the gating function, resulting in shorter displacement estimates. These candidates are often discarded by the planner, as it tries to reach the goal in as few steps as possible. Without any predefined collision detection scheme, the planner avoids interaction with the obstacle based on the learned model with contextual features. The resulting sequence of pushes is shown in Fig. 4.7 on the bottom.

The PR2 performed the task 15 times with various configurations of the object and the obstacle. One of the executions is shown in Fig. 4.8. The effects of the push are compared with the prediction after every execution. When the errors are larger than 5 cm, a new plan is computed. Over all the experiments, it took an average of 4.4 steps to push the book to the goal pose. The robot had to re-plan in 3 out of the 15 trials. The average Euclidean distance error between the prediction and the real push was 3.65 cm. The method failed only once (93.3% success rate), when the book was pushed off the table.

Figure 4.8: PR2 executing the manipulation task. The hierarchical prediction model is applied to predict the push effects. The object is pushed toward the edge of the table and then grasped.

4.6 Conclusion

A prediction model for push affordances was learned with context features. Object, contact, and environmental context features were investigated to capture the relative context of the affordance. A hierarchical approach was applied to differentiate the contexts with relevant sets of features, building multiple Gaussian process regression submodels. The learned prediction model was evaluated on a pushing task to move an object to a desired position. The robot successfully performed the push-to-grasp task using the proposed model.

Chapter 5
Probabilistic Hierarchical Affordance Prediction Models with Varying Feature Relevances

5.1 Introduction

Robots need to be capable of predicting the effects of their actions to plan manipulations in different situations. The set of actions that can be performed with an object, as well as the corresponding effects, are known as the object's affordances [2, 27]. For example, a robot can perform a grasping action with a graspable object, which will result in the object becoming attached to the robot's hand. These affordances are however not always applicable, and they depend upon the state of the objects in the environment. For example, a salt shaker on a shelf can be grasped, but it is not graspable when it is behind other objects. Robots should therefore consider the environmental context of objects to determine their valid affordances.

Robots can capture the affordances of an object using prediction models that estimate the effects of actions applied to the object. These models can be learned from experience by performing actions and observing their effects on the objects. For example, to learn push affordances, the robot can apply various pushing actions to objects and observe the resulting movements, as shown in Fig. 5.1. Depending on the arrangement of the objects in the scene, some aspects of the objects' state space will have a greater or lesser influence on the effects of the robot's action. For example, if an object being pushed is blocked by an obstacle, then the exact size and location of the obstacle will affect the object's motion. The model should therefore capture the effects of small changes in the obstacle's position and size when predicting the effects of pushes in these situations.
However, if the obstacle is very far away from the object being pushed, then small changes in the obstacle's position and size are irrelevant for predicting the effects on the pushed object. Thus, the influence of the objects' features changes depending on the arrangement of objects in the scene. In order to learn the affordance prediction model more efficiently, the robot should take into consideration the varying relevances of the features when learning the model.

Figure 5.1: A PR2 robot trying to learn a prediction model for the push affordances of tabletop objects. The effect of the pushing action depends on the context of the objects' configuration.

In this chapter, we explore learning prediction models with context features for push affordances. We model the effects of the pushes using Gaussian process regression (GPR) with a squared exponential kernel. The features' length scales for this kernel are optimized to determine the influence of each feature on the model's predictions. To allow for different sets of relevant features, we propose using a hierarchical model that incorporates multiple GPR submodels, with different sets of hyperparameter values, to model different types of situations. The robot learns to select between the different submodels using a high-level logistic regression activating function. The proposed method was evaluated on a PR2 robot performing a pushing task, as shown in Fig. 5.1. The experiments evaluated different sets of features and varying numbers of submodels. The learned model was validated by using it to plan a series of pushes of one object in order to move a second object that is outside of the robot's workspace.

5.2 Related Work

The concept of affordances has been broadly studied in robotics. Affordances have been used to recognize objects [18, 29, 30, 31]. Varadarajan et al. [18] built an affordance network to recognize unknown objects from the affordances of the objects' parts. Myers et al. [31] studied learning the affordances of tool parts for recognizing unknown tools.

Traversability affordances have been studied for mobile robots. Similar to push affordances, traversability affordances require the robot to detect obstacles in its environment. Ugur et al. [32] studied traversability affordances with interactive learning for robot path planning. Kim et al. [33] studied the traversability of outdoor environments from images, and Zimmermann and Zuzanek [34] studied reinforcement learning based adaptive traversability.

Given a prediction model, robots can plan a series of actions to perform a desired task. Barry et al. [35] proposed a planner which could push an object to the edge of a table to grasp it. Similarly, the planner of King et al. [36] used push affordances to achieve better grasps of objects. We focus on learning the prediction model for push affordances.

Work has been done on models for pushing single objects, i.e., without obstacles or interactions with other objects. Hermans et al. [37] studied how to predict the effects of pushes based upon the shapes of the objects. Local and global shape descriptors were evaluated to predict scores for push-stability and rotation variance. Kopicki et al. [38] studied pushing rigid objects using a product-of-experts model to predict the effects of the pushes. Omrcen et al. [43] studied pushing objects to desired positions using a recurrent neural network. The work of Kroemer et al. [39, 40] is the most similar to our own.
They present a prediction model that decomposes manipulation tasks into multiple distinct modes, and predicts when the mode switches will occur. Instead of linear models, we use Gaussian process regressors with ARD kernels to automatically determine the influence of the individual features depending on the situation.

Discovering effect categories from affordance action-effects was studied by Ugur et al. [41]. They presented a hierarchical structure of affordances to categorize the effects of actions and built hierarchical clusters over the effect space. Eppner et al. [42] studied environmental constraints to describe the contextual information for generating plans to grasp objects. Different configurations of environmental constraints supported different affordances (e.g., grasp, contact, slide). An abstract transition graph was built upon approximating the transferable state configurations.

Our work learns an affordance model using a hierarchical approach. Multiple submodels are learned to capture different interactions between object pairs, each using context information.

5.3 Learning Prediction Models for Push Affordances

In this section, we define the prediction model for push affordances and present a method for learning the model.

5.3.1 Push Affordance Model

Rather than learning a binary affordance model, i.e., pushable or not pushable, the robot learns a prediction model that predicts the continuous change in the state of an object when a pushing action is applied to it. A pushing action defines the hand-relative pushing coordinate frame, as shown in Fig. 5.2, with axes x_p, y_p, and z_p, and the origin at the initial position of the hand. The pushing action moves the gripper along a straight horizontal trajectory in the z_p direction over a distance of d_p = 25 cm.

We assume that there are two objects on the table, object one o_1 and object two o_2. We define the first object o_1 to be the one that is closer to the pushing frame's origin. The effects of the pushing actions on the objects' planar positions [y \; z]^T and orientations \theta are given by e_l = \{\Delta y_{o_1}, \Delta z_{o_1}, \Delta\theta_{o_1}, \Delta y_{o_2}, \Delta z_{o_2}, \Delta\theta_{o_2}\}. The displacement effects e_l are defined in the pushing frame. The prediction model is also defined in the pushing coordinate frame as a probabilistic distribution p(e_l | o, a) = \mathcal{N}(\mu_l, \sigma_l^2), where a is the pushing action, o are the manipulated objects, and \mu_l and \sigma_l^2 are the mean and variance for effect e_l.

5.3.2 Feature Extraction

To predict the effects of the robot's actions, the robot extracts a set of features describing the state of the objects, the interaction between the robot's hand and the objects, and their context. We utilize three different types of features to capture the state of the objects: object, contact, and context features.

f = [f_1 \; f_2 \; \ldots \; f_n]^T = [f^{oT} \; f^{cT} \; f^{xT}]^T

Figure 5.2: The prediction model for the push affordance is defined in the coordinate frame of the robot's hand. The colors of the axes correspond to x_p (red), y_p (green), and z_p (blue). The push follows a horizontal trajectory in the z_p direction and the effects e_l = \{\Delta y_{o_1}, \Delta z_{o_1}, \Delta\theta_{o_1}, \Delta y_{o_2}, \Delta z_{o_2}, \Delta\theta_{o_2}\} are distributions of the displacements in y_p and z_p and the rotations \Delta\theta_p.

Object features f^o \in \mathbb{R}^8 describe the poses of the objects. We use four features for each object, three for the position and one for the rotation: objects are assumed to be placed on the tabletop surface and limited to a single rotation around the table's normal, i.e., we do not model toppling.
In the experiments, the objects are tracked with AR tags and we assume that the robot has coarse 3D point cloud models of the objects. The positions are given by the centroids of these 3D models.

Contact features f^c \in \mathbb{R}^{13} describe the predicted contact between the first object and the robot's hand, as shown in Fig. 5.3 on the left. The contact features encode the distance from the initial hand pose to the expected contact pose (\mathbb{R}^1), and the shape of the contact region (\mathbb{R}^{12}). A local volume of width 5 cm around each of the gripper's fingers is projected along the pushing direction and checked against the objects' point clouds to estimate when contact will be made with the objects. Since the robot hand is equipped with a two-finger gripper, we derive contact features for each finger. The 3D positions and corresponding surface normals are averaged over the region of expected contact, resulting in six features for each finger. If the path of the hand does not intercept the object model, the contact features are assigned the furthest distance in the pushing frame with zero surface normals.

Figure 5.3: Extraction of (left) contact features and (right) context features. For the contact features, the position and shape (i.e., normals) of the expected contact are encoded. For the context features, the relative geometric information of the objects is encoded by binning the layout of the objects' positions and shapes.

The context features f^x \in \mathbb{R}^{80} capture the general geometric relations among the objects in the scene. The purpose of the context features is to provide a coarse description of the scene, which the robot can then use to predict potential interactions between objects. For example, the context features should allow the robot to predict if the robot's hand will interact with the first object, as well as if the two objects will interact with each other as a result of the pushing motion. We compute the context features from bins along the pushing axis z_p, as shown in Fig. 5.3 on the right. For each of the n_b = 20 bins of 2.5 cm width, the robot computes the width and median values of the objects' models as features. The features are measured individually for each object. For the case when there are no point cloud points within a bin, the features are assigned zero values.
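The bin-based context encoding described above can be sketched in Python as follows. This is a simplified illustration: it assumes the width and median statistics are taken over the lateral y_p coordinate of each object's model points, a detail the text leaves open, and the bin layout is a plain uniform discretization of z_p:

import numpy as np

def context_features(points_o1, points_o2, n_bins=20, bin_width=0.025):
    """Bin the two objects' model points along the pushing axis z_p.

    points_* : (N, 2) arrays of (y_p, z_p) coordinates in the pushing frame.
    For each bin we store the lateral extent (width) and the median lateral
    position of the points that fall into it; empty bins are left at zero.
    Returns a vector of length n_bins * 2 * 2 (two statistics, two objects).
    """
    feats = np.zeros((2, n_bins, 2))
    for obj, pts in enumerate((points_o1, points_o2)):
        bins = np.floor(pts[:, 1] / bin_width).astype(int)
        for b in range(n_bins):
            in_bin = pts[bins == b]
            if len(in_bin) > 0:
                feats[obj, b, 0] = in_bin[:, 0].max() - in_bin[:, 0].min()  # width
                feats[obj, b, 1] = np.median(in_bin[:, 0])                  # median
    return feats.ravel()

# Two toy "objects": a 10 cm box right in front of the hand and one 30 cm away.
rng = np.random.default_rng(2)
o1 = np.column_stack([rng.uniform(-0.05, 0.05, 200), rng.uniform(0.05, 0.15, 200)])
o2 = np.column_stack([rng.uniform(-0.05, 0.05, 200), rng.uniform(0.30, 0.40, 200)])
print(context_features(o1, o2).shape)   # (80,) matches f^x in R^80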
5.4 Hierarchical Prediction Models

The relevance of the features for predicting the effects of pushes will vary depending on the arrangement of the objects in the scene. To capture these variations in the features' relevances, we propose using a hierarchical model with multiple submodels that have different sets of hyperparameters. The structure of the model is described in Section 5.4.1. The assignment of samples to submodels is not known. The model parameters are therefore learned using an expectation maximization (EM) method, as described in Section 5.4.2.

5.4.1 Hierarchical Model Structure

In order to learn the prediction model, we assume that the robot has N samples, where the ith sample includes the state s_i \in \mathbb{R}^{d_s}, action a_i \in \mathbb{R}^{d_a}, and next state s'_i \in \mathbb{R}^{d_s}. These states and actions are used to compute the features f_i and effects e_i = s'_i - s_i. Our hierarchical model consists of M low-level GPR submodels and a high-level logistic regression activating distribution. Each training sample corresponds to one of these M submodels, m_i \in \{1, \ldots, M\}. However, the submodel assignments m_i are not known and must be inferred from the training data.

The high-level activating distribution has the form

p(m_i = j | f_i; w) = \frac{\exp(w_j^T f_i)}{\sum_{k=1}^{M} \exp(w_k^T f_i)}

where the vectors w_k determine in which states the different submodels are more likely to be active. Given a sample's submodel assignment m_i, the corresponding submodel predicts the effect using Gaussian process regression, such that

p(e_i | f_i, m_i = j) = \mathcal{N}(\mu_j(f_i), \sigma_j^2(f_i)),

where \mu_j and \sigma_j^2 are the mean and variance functions of the jth submodel's GPR. In our experiments, we evaluate using different sets of features for the activating distribution and the low-level submodels. The GPRs use squared exponential kernels of the form

k(f_i, f_j) = \sigma_f \exp\left(-0.5 (f_i - f_j)^T L^{-1} (f_i - f_j)\right)

where \sigma_f is the amplitude of the kernel, and L is a diagonal matrix of length scales, L = \mathrm{diag}(l_1, l_2, \ldots). These length scale parameters determine the influence of the individual features. For example, if the ith feature has a relatively large length scale l_i, then the feature's value will have little influence on the predicted effect. The amplitude and length scales are hyperparameters of the individual submodels. These hyperparameters are optimized using gradient ascent to maximize the likelihood of the training data. This maximization automatically selects a set of length scale values, which define the relevances of the features for each submodel. This form of kernel is therefore also known as a squared exponential automatic relevance determination (ARD) kernel.

5.4.2 Model Learning

In order to learn the parameters and hyperparameters of the model, the robot uses an expectation-maximization (EM) algorithm. The EM algorithm iterates between two steps: in the expectation step, the robot estimates the distributions over the latent submodel assignment variables m_i; in the maximization step, the robot uses the distributions over the latent variables to update the parameters of the activation distribution and the individual submodels. The EM algorithm is only guaranteed to find a local optimum, and its performance therefore depends on the initialization. The initialization, as well as the expectation and maximization steps of the EM algorithm, are detailed below.

5.4.2.1 Initialization

The EM algorithm is initialized by assigning the training samples to different submodels using a clustering method. We cluster the samples using a Gaussian Mixture Model (GMM) with M Gaussians in the effect space e_i. In this manner, samples with similar effects are initially assigned to the same cluster and thus the same submodel. An example of the initial submodel assignments from the GMM for the case of three submodels is shown in Fig. 5.4. This initial clustering helps the robot to detect situations such as when neither object was pushed (red points), or only the first object was moved (green points). Once the training samples have been clustered and assigned to the corresponding submodels, the robot uses the samples to learn the individual submodels, as well as the high-level activating distribution from the sample assignments. Given the initial model parameters and submodels, the robot can start with the first expectation step.
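Two of the ingredients introduced so far, the softmax activating distribution and the GMM-based initialization of the submodel assignments, can be sketched compactly in Python. The sketch uses scikit-learn's GaussianMixture purely for illustration, and the gating weights and effect data are toy values rather than learned quantities:

import numpy as np
from sklearn.mixture import GaussianMixture

def activation_probs(W, f):
    """p(m = j | f; w) = exp(w_j^T f) / sum_k exp(w_k^T f) for all submodels j."""
    scores = W @ f                                  # W has shape (M, n_features)
    scores -= scores.max()                          # for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def initial_assignments(effects, n_submodels=3, seed=0):
    """Cluster the observed effects e_i with a GMM to get initial submodel labels."""
    gmm = GaussianMixture(n_components=n_submodels, random_state=seed)
    return gmm.fit_predict(effects)

# Toy 2D effects (displacements of the two objects along z_p).
rng = np.random.default_rng(3)
effects = np.vstack([rng.normal([0.00, 0.00], 0.01, (30, 2)),   # nothing moved
                     rng.normal([0.15, 0.00], 0.02, (30, 2)),   # only object one moved
                     rng.normal([0.15, 0.10], 0.02, (30, 2))])  # both objects moved
labels = initial_assignments(effects, n_submodels=3)
W = rng.normal(size=(3, 5))                         # toy gating weights over 5 features
print(np.bincount(labels), activation_probs(W, rng.normal(size=5)))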
5.4.2.2 Expectation Step

In the expectation step, we need to compute the distributions over the latent submodel assignments m_i given the observed variables \{e_i, f_i\}. The model parameters and hyperparameters are fixed. We compute this distribution as

p(m_i = j | e_i, f_i) = \frac{p(m_i = j, e_i, f_i)}{p(e_i, f_i)} = \frac{p(m_i = j, e_i, f_i)}{\sum_{k=1}^{M} p(m_i = k, e_i, f_i)}

We expand the joint distribution as

p(m_i = j, e_i, f_i) = p(e_i | f_i, m_i = j) \, p(m_i = j | f_i) \, p(f_i)

which, when we cancel the distribution p(f_i) in the numerator and denominator, gives us

\frac{p(m_i = j, e_i, f_i)}{\sum_{k=1}^{M} p(m_i = k, e_i, f_i)} = \frac{p(e_i | f_i, m_i = j) \, p(m_i = j | f_i)}{\sum_{k=1}^{M} p(e_i | f_i, m_i = k) \, p(m_i = k | f_i)}

The conditional distributions p(e_i | f_i, m_i = j) and p(m_i = j | f_i) are computed using the Gaussian process regression submodels and the logistic regression gate, respectively.

Figure 5.4: Example of clustering with 3 submodels using a Gaussian Mixture Model. Similar push effects are grouped as the initial submodel distribution. The green group shows pushes where the first object moved but did not contact the second object.

5.4.2.3 Maximization Step

In the maximization step of the EM algorithm, we must compute the parameters and hyperparameters \theta that maximize the expected log-likelihood of the observed and hidden variables:

\theta_{\mathrm{new}} = \arg\max_{\theta} \sum_{m_{1:N}} p(m_{1:N} | e_{1:N}, f_{1:N}, \theta_{\mathrm{old}}) \ln p(e_{1:N}, f_{1:N}, m_{1:N}; \theta)

where the summation is over all possible assignments of the latent variables m_{1:N}. We expand the log of the product as a sum of logs, such that

\ln p(e_{1:N}, f_{1:N}, m_{1:N}; \theta) = \sum_{i=1}^{N} \ln p(e_i | f_i, m_i; \theta) + \ln p(m_i | f_i; \theta) + \ln p(f_i; \theta)

rewrite the maximization using this factorization and the independences between samples, and it simplifies to

\theta_{\mathrm{new}} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{m_i} p(m_i | e_i, f_i, \theta_{\mathrm{old}}) \ln p(e_i | f_i, m_i; \theta) + \sum_{i=1}^{N} \sum_{m_i} p(m_i | e_i, f_i, \theta_{\mathrm{old}}) \ln p(m_i | f_i; \theta)

Hence, when computing the parameters that maximize the log-likelihood of the data for the logistic regression and GPR models, the samples should be weighted by the corresponding submodel assignment probabilities p(m_i | e_i, f_i, \theta_{\mathrm{old}}). For the weighted logistic regression p(m_i = j | f_i; w), we compute the parameters w using gradient descent. Weighted iteratively reweighted least squares with Newton-Raphson updates is used to minimize the cross-entropy error function. For the weighted Gaussian process regression p(e_i | f_i, m_i = j), the predicted outcome becomes

\mu_j(f_i) = \mathbf{k}^T (K + \beta W^{-1})^{-1} e_{1:N}
\sigma_j^2(f_i) = k(f_i, f_i) + \beta - \mathbf{k}^T (K + \beta W^{-1})^{-1} \mathbf{k}

where \mathbf{k} is a vector containing the kernel values between the new input and the training samples, K is the Gram matrix of the training samples, \beta is the noise hyperparameter, and W is a diagonal matrix with sample weights [W]_{ii} = p(m_i | e_i, f_i, \theta_{\mathrm{old}}). The hyperparameters of each GPR model are optimized using gradient descent in each iteration of the EM algorithm.
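The expectation step above reduces to a standard per-sample responsibility computation. A minimal Python sketch, treating the effect as one-dimensional for clarity and assuming each submodel exposes its predictive mean and standard deviation at f_i:

import numpy as np

def gaussian_pdf(e, mu, sigma):
    return np.exp(-0.5 * ((e - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def responsibilities(e_i, gate_probs, means, stds):
    """p(m_i = j | e_i, f_i) for all submodels j.

    gate_probs : p(m_i = j | f_i) from the logistic-regression gate, shape (M,)
    means, stds: predictive mean/std of each submodel's GPR at f_i, shape (M,)
    """
    likelihoods = gaussian_pdf(e_i, means, stds)     # p(e_i | f_i, m_i = j)
    joint = likelihoods * gate_probs
    return joint / joint.sum()

# A sample whose observed effect (0.14 m) matches submodel 1 far better than
# submodel 0, so nearly all responsibility mass is assigned to submodel 1.
print(responsibilities(0.14, gate_probs=np.array([0.5, 0.5]),
                       means=np.array([0.0, 0.15]), stds=np.array([0.02, 0.02])))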
5.5 Evaluations

We evaluated the proposed approach using a PR2 robot and a tabletop object pushing task. In the first experiment, we evaluated how different sets of features and the number of submodels affect the accuracy of predicting the effects of pushes. In the second experiment, we demonstrated how the learned models can be used to plan a series of pushes of one object to move a second object that is outside of the robot's workspace.

5.5.1 Experimental Setup

The tabletop scenes consist of a variety of everyday objects: a cracker box, tool box, phone box, nut container, waterpot lid, and a wood block, as shown in Fig. 5.5. Each push sample is conducted with two objects on the table. The PR2 robot collects samples by pushing from various positions and directions and observing the resulting effects. The length of a push is fixed to 25 cm at a time. The robot executed 321 pushes using various sets of objects and object configurations.

Figure 5.5: Objects used in the experiment.

5.5.2 Hierarchical Prediction Models with Selected Features

For the hierarchical model, we evaluated using M = 1 to M = 6 GPR submodels. Using only one submodel, M = 1, corresponds to a non-hierarchical GPR model. The EM algorithm was run for ten iterations, or terminated earlier if the prediction accuracy on the training examples converged. The activating distribution and submodels were trained on the 321 samples, using 3-fold cross validation for evaluation.

Different combinations of features were tested to analyze how they affect prediction performance. The first three sets of features used the same features for the high-level activating distribution as well as the low-level submodels. These three approaches used: only object features (f^o), object and contact features (f^o & f^c), and all three types of features (f^o & f^c & f^x). The fourth set of features evaluated in the experiment used the context features f^x for the activating distribution, and the object features f^o and contact features f^c for the submodels (f^x / f^o & f^c).

Figure 5.6: Prediction errors for push affordance models with different sets of features and different numbers of submodels. Object features (f^o), object and contact features (f^o & f^c), and object, contact, and context features (f^o & f^c & f^x) are used for training. One to six submodels are trained to evaluate which number of submodels best captures the differences in context. The fourth bar tests using different sets of features for the WLR and WGPR (f^x / f^o & f^c).

The prediction results for the different sets of features and numbers of submodels are shown in Fig. 5.6. The top and bottom rows of the figure correspond to the prediction errors for the first object o_1 and the second object o_2, respectively.

5.5.3 Discussion

The results show that including the context features (f^o & f^c & f^x), instead of using only object features (f^o) or only object and contact features (f^o & f^c), resulted in decreases in the RMSE of 8.71% and 3.67%, respectively.
5.5.3 Discussion

The results show that including context features (f_o & f_c & f_x) instead of using only object features (f_o) or only object and contact features (f_o & f_c) resulted in decreases in the RMSE of 8.71% and 3.67%, respectively. The prediction accuracy for the first object was not significantly better when context features were included. However, the accuracy for the second object improved by 11.8% and 4.37%, respectively. This result suggests that the object and contact features are not sufficient to fully describe the interaction between objects. The context features provide the robot with additional knowledge for determining whether the objects will interact, which results in better mean prediction accuracies, especially for the second object.

Figure 5.7: Example of a two-step push experiment using the learned hierarchical model. The robot aims to move the unreachable white box into the reachable region by pushing the cracker box. The robot successfully pushes the cracker box to move the white box closer to the left hand, within reach. (Each panel shows the initial positions, predicted positions, push actions, and actual executions in the plane of the table.)

Using multiple submodels resulted in better performance compared to one submodel, with accuracy improvements for the f_o & f_c & f_x case of 5.9%, 6.5%, 5.8%, 6.1%, and 6.1% for M = 2 to M = 6 submodels, respectively. Using M = 3 submodels resulted in the best average prediction. The three submodels roughly correspond to moving zero, one, or both objects. The different submodels thus capture the affordances of pushable individual objects and pushable pairs of objects.

The f_x / f_o & f_c approach uses different sets of features for the high-level activation distribution and the low-level submodels' predictions. We evaluated several different splits of the features (not shown) and found this one to achieve the best results. This split of the features performs particularly well for predicting the movement of the first object o_1 in the pushing direction. The results correspond to the intuition that context features are useful for distinguishing different types of interactions, while object and contact features are useful for predicting the precise movements of the objects within these different types of interactions. Including additional features, e.g., using all of the features for both levels of the model, results in a more complicated model that can be difficult to learn.

The proposed method can be applied to different sets of objects, which may be modeled with different numbers of submodels. One limitation of our approach is that the features only describe visual information; physical properties such as non-uniform mass or friction cannot be modeled. A possible extension of this work is to transfer the learned affordance model to other affordances, such as stacking.

5.5.4 Application of the Prediction Model for Task Planning

Hierarchical prediction models for push affordances can be used to perform task planning. In this demonstration task, the robot has to generate a sequence of two pushes in order to move an object that cannot be reached directly by the robot's manipulator, as shown in Fig. 5.7. In the scene, the white box is just outside of the robot's workspace. The goal is to move the red box, like a tool, in order to move the white box within the robot's reach. First, the robot sampled 100 possible pushes of the red box and used the hierarchical model to compute the predicted movement of the box.
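A minimal sketch of this sampling step is given below; the candidate selection that follows is described in the next paragraph. The push parameterization, the predict_fn interface, and the distance-based score are assumptions for illustration, not the actual implementation.

```python
import numpy as np

def plan_first_push(predict_fn, scene, n_samples=100, n_keep=10, rng=None):
    """Sample candidate pushes of the tool object and keep the most promising ones.

    predict_fn(scene, push) -> predicted (tool_pose, target_pose) after the push;
    it stands in for the hierarchical affordance model described above.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    candidates = []
    for _ in range(n_samples):
        # A push is parameterized by a start position and a direction on the table.
        push = {"start": scene["tool_pose"][:2] + rng.uniform(-0.1, 0.1, 2),
                "direction": rng.uniform(0.0, 2.0 * np.pi)}
        tool_pred, target_pred = predict_fn(scene, push)
        # Rank by the predicted distance between the two objects (closer is better).
        score = np.linalg.norm(tool_pred[:2] - target_pred[:2])
        candidates.append((score, push, tool_pred, target_pred))
    candidates.sort(key=lambda c: c[0])
    return candidates[:n_keep]

# Toy usage with a dummy predictor that moves the tool 25 cm along the push direction.
def dummy_predict(scene, push):
    step = 0.25 * np.array([np.cos(push["direction"]), np.sin(push["direction"]), 0.0])
    return scene["tool_pose"] + step, scene["target_pose"]

scene = {"tool_pose": np.array([0.6, 0.0, 0.0]), "target_pose": np.array([0.9, -0.2, 0.0])}
best = plan_first_push(dummy_predict, scene)
print(len(best), "candidate pushes kept")
```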
From the predicted pushes, the robot selected the ten candidate pushes with the smallest distances between the objects. From these candidates, another 100 pushes were sampled, and the robot selected the final sequence of pushes in which the white box was displaced the most to the left and into the workspace of the left manipulator. The initial positions, predicted movements, and actual displacements of the two objects in steps one and two are shown in Fig. 5.7 on the left. The pictures on the right of the figure show the results of the execution on the real robot. As a result of the pushing motion, the white box becomes reachable by the left hand. This experiment demonstrates how the hierarchical model can be used to predict the movement of objects and exploit the predicted interactions between objects.

5.6 Conclusion

We presented a hierarchical prediction model for push affordances. The model is learned using object, contact, and context features. The model captures the effects of manipulating two objects in the scene, including the interactions between the objects. The model consists of a high-level activation distribution and multiple low-level Gaussian process regression submodels. The hierarchical structure of the model allows it to capture variations in the relevance of the features depending on the object configurations. The learned model was successfully evaluated on a pushing task using a PR2 robot. We also demonstrated how the robot could use the learned model in order to indirectly manipulate an object that was outside of the robot's workspace.

Chapter 6
Modular Affordance Prediction Models for Multiple Object Interactions

6.1 Introduction

Robots need to be capable of predicting the effects of their actions to plan manipulations in different situations. The set of actions that can be performed with an object, as well as the corresponding effects, are known as the object's affordances [2, 27]. For example, if a robot performs a grasping action with an object, and the object affords grasping, then the object becomes attached to the robot's hand. These affordances depend upon the state of the object, the agent, and the environment. For example, a cup affords pouring into when it is upright, and whether a block affords grasping depends on the size of the robot's gripper. Affordances are also affected by the environment when multiple objects afford interactions with each other. Therefore, in order to estimate and utilize affordances for manipulation tasks, robots should consider the context of the affordance, consisting of the objects, the agent, and the environment.

Robots can learn about object affordances from experience by performing actions and observing the resulting effects on the objects. This data is used to learn prediction models that estimate the effects of actions. For example, to learn push affordances, the robot can apply various pushing actions to objects in different scenarios: pushing in free space, against a wall, or toward other objects (Fig. 6.1). The robot can thus learn that an object is pushable if the motion is not blocked by any obstacle, or when it is pushed along with another object. The affordance prediction model, learned from actions and their corresponding effects, can be applied to other skills such as grasping or placing. The learned prediction model is modular, predicting the interaction between entities in an action-effect framework. This modularity can be extended to scenarios with multiple objects and chain reactions of interactions.
For instance, when pushing an object against a second object, both object-agent and object-object interactions can be predicted using a modular prediction model.

Figure 6.1: Push affordances can be learned from experiences of various pushing actions: pushing in free space, pushing against a wall, or pushing toward other objects.

In this chapter, we present a modular affordance learning framework and show its application. First, various executions of pushing, picking, and placing are performed on various objects. The robot extracts a set of features from the samples and learns affordance prediction models using Gaussian process regression. The learned model is applied to scenarios where multiple interactions happen among multiple objects. We show the resulting predicted trajectories of pushing two objects in a simulated setup.

6.2 Related Work

The concept of affordances has been broadly studied in robotics. Affordances have been used for recognizing objects [29, 31], planning paths using traversability [32, 33], learning tool affordances [44], and task planning from prediction models [35, 36].

Work has been done on modeling affordances for pushing single objects. Scholz et al. [45] used forward transition models to predict how an object will be pushed. Forward models, defined as normal distributions with uncertainties, were learned separately for each predefined pushing action. Hermans et al. [37] studied how to predict the effects of pushes based upon the shapes of the objects. Local and global shape descriptors were evaluated to predict scores over push-stability and rotation variance. Kopicki et al. [38] studied pushing rigid objects using a product-of-experts model to predict the effects of the pushes. Omrčen et al. [43] studied pushing objects to desired positions using a recurrent neural network.

For more complex pushing situations, approaches have been proposed to model various situations. Effect categorization from affordance action-effects was studied by Ugur et al. [41]. They presented a hierarchical structure of affordances to categorize the effects of actions and built hierarchical clusters over the effect space. Different prediction models were trained for each effect category. Eppner et al. [42] studied environmental constraints to describe the contextual information for affordances. Different configurations of environmental constraints were used to model different affordances (e.g., grasp, contact, slide, etc.). An abstract transition graph was then built upon an approximation of the transferable state configurations. Rather than using different models for specific situations, our approach utilizes the modularity of affordances to predict the interactions.

Learning affordance models for multiple-object scenes has also been studied. Dogar et al. [19] used physics-based push-grasping in multiple-object environments. They proposed a planning framework using sequences of sweeping and push-grasping actions based on physical modeling. Kitaev et al. formulated manipulation in cluttered environments as trajectory optimization [46]. They introduced failure modes while computing the cost to optimize using physics-based simulation. Simulating the physical properties of interacting objects yields more accurate results, but prior knowledge of the parameters is required. Our approach learns the interaction model from experience, which can adapt to unseen objects as well as to the interactions between them.
6.3 Modular affordance prediction models

In this section, we define the affordance prediction models and how the modularity of the models can be applied to multi-object scenarios.

6.3.1 Affordance prediction using a transition model

The definition of an affordance can be interpreted from the effect of an action: by applying a pushing action, an object affords pushing if it was moved. However, binary labels of affordances do not encode the magnitude of the effect: how far it will be pushed. In this chapter, we define a prediction model of an affordance, which is a skill-specific transition model that estimates the effect when an action is applied (Fig. 6.2). Skills like pushing or grasping can be described by the continuous motions of the end-effector. The motions include changing poses of the end-effector as well as opening/closing the fingers. A skill is then defined as piecewise-linear actions in discrete time steps, and at each time step the action frame A_t is defined as P^e_{t+1} - P^e_t, where P^e_t is the position and orientation of the end-effector at time step t. For each piecewise action, the end-effector is controlled by a low-level position controller. The effect e_t caused by the action at time step t is the displacement of the object relative to the action frame A_t. By having the actions and effects encoded in the action frame, the prediction model can estimate the transition regardless of the direction and distance of the action.

Figure 6.2: From the trajectory of the end-effector T_m, an action frame at time step t is defined as A_t, which is the pose difference between t + 1 and t. In the action frame, x and y are defined as the off-action and along-action axes, respectively. The orientations of the object frame B_t and the contact frame C_t are the same as that of A_t. The resulting effect e_t is the displacement of the object relative to the action frame A_t.

6.3.2 Modularity of the prediction model

The affordance prediction model predicts the effects of the interaction between an agent and an object. However, the model is not restricted to object-agent interaction; it can be applied to object-object interaction as well. The input action for the model is defined as the trajectory of the end-effector in the object-agent case. The modularity extends to the object-object case, where the trajectory of the object becomes the action that is applied to the other object. For instance, the robot may manipulate in scenarios with multiple objects where multiple interactions may occur between the objects: when the first object is pushed, it will push the second object, causing another interaction. In such cases, if we want to predict the effects on every object in the scene given an action of the robot, we might develop a monolithic prediction model that incorporates every parameter of the individual objects. However, if we apply the modular structure for prediction, the robot can individually estimate the movement of each object and anticipate interactions between the objects. In detail, if a skill is performed by the robot as a trajectory T_m of the end-effector motion, the robot can estimate which object it will make contact with, given the input T_m. From the prediction model, the resulting movements of object O_i, {e^{O_i}_t, e^{O_i}_{t+1}, ..., e^{O_i}_{t_end}}, will form a trajectory T^{O_i}. Now, T^{O_i} becomes an action of O_i, and the modular prediction model will again predict the interaction with the other object.

Figure 6.3: Global shape features (left) are a 3D voxel grid representation which encodes the global shape and size of the object and the end-effector relative to the object frame B_t. Each voxel is labeled as either 0 or 1. Similarly, contact shape features (right) are encoded over contact points from the contact frame C_t.
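To make the chaining of Sec. 6.3.2 concrete, the following is a schematic sketch of how one transition model can be reused along a chain of interactions. The predict_transition callable is a hypothetical stand-in for the learned GPR affordance model, and the function and variable names are illustrative only.

```python
import numpy as np

def predict_chain(predict_transition, ee_trajectory, objects):
    """Apply one modular transition model along a chain of interactions.

    predict_transition(action_traj, obj_state) -> per-step effects (T, 6) on obj_state
    when the entity following action_traj interacts with it. `objects` is an ordered
    list of object states along the expected chain (first the object the end-effector
    contacts, then the next object, and so on).
    """
    action_traj = ee_trajectory
    predicted_trajectories = []
    for obj in objects:
        effects = predict_transition(action_traj, obj)       # object-agent, then object-object
        obj_traj = np.cumsum(effects, axis=0) + obj["pose"]  # integrate effects into a trajectory
        predicted_trajectories.append(obj_traj)
        action_traj = obj_traj                               # the moved object becomes the next "agent"
    return predicted_trajectories

# Toy usage: a dummy model in which each step moves the object by 1 cm along y.
dummy = lambda traj, obj: np.tile(np.array([0.0, 0.01, 0.0, 0.0, 0.0, 0.0]), (len(traj), 1))
ee = np.zeros((10, 6))
objs = [{"pose": np.zeros(6)}, {"pose": np.array([0.0, 0.3, 0.0, 0.0, 0.0, 0.0])}]
print([t.shape for t in predict_chain(dummy, ee, objs)])
```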
6.4 Feature extraction

To predict the effects of a skill in a given scenario, we first need to represent the current state and action. For manipulation skills, we use object, contact, and action features for predicting effects.

Objects' shapes and poses are features that carry geometric information. Shape describes the size of the object along with its geometric properties. Here, we use labeled voxel grid representations for shape features, as used for shape templates [47]. First, we draw a 3D voxel grid over a volume V_g with voxel size d_g. The point clouds of the object and the end-effector are transformed into the object frame B_t. The position of the object frame is at the center of the object, which is the median of the object point cloud. The rotation of the frame is derived from the action frame A_t, aligning the features to always be in the direction of the action. Each voxel is then labeled as either object (1) or empty space (0) (Fig. 6.3). Additionally, the relative pose between the end-effector and the object is encoded as relative pose features (R^6).

The interaction between objects begins at the moment of contact. Therefore, contact shape features should be extracted to describe the distribution of contacts. At every time step t when an action is executed, contact can be determined from the point clouds of the manipulator and the objects using the criteria below:

||P^{o,i}_t - P^{e,j}_t||_2 < r_d
n^{o,i}_t \cdot n^{e,j}_t < r_n

The contact points that satisfy these conditions on the distance and the inner product of the normals are extracted. The Euclidean distance between the i-th point of the object point cloud P^{o,i}_t and the j-th end-effector point P^{e,j}_t should be within the threshold r_d. The inner product between the normal of the object point n^{o,i}_t and that of the end-effector point n^{e,j}_t should be smaller than the threshold r_n. This condition means that the surfaces of the manipulator and the object are facing opposite to each other. When the number of contact points exceeds the threshold n_c, the objects are considered to be in contact, and the corresponding contact shape features are extracted. We define the contact frame C_t in R^6 with its origin at the mean position of the contact points and the orientation of the action frame A_t. Similar to the global shape features, a 3D voxel grid is drawn over a volume V_c around the contact frame, with each voxel of size d_c labeled as either contact points (1) or empty (0).

The final features are the action features, describing how the action is applied at a given time step. The action features consist of the 6D position and orientation of the action frame A_t at time step t.
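As an illustration of the contact test described above, the following is a minimal numpy sketch. The threshold values r_d, r_n, and n_c are placeholders and not the values used in the experiments, and the returned midpoints are only one way to represent the contact points.

```python
import numpy as np

def detect_contact(obj_points, obj_normals, ee_points, ee_normals,
                   r_d=0.01, r_n=-0.5, n_c=10):
    """Return (in_contact, contact_points) using the distance and normal criteria above.

    obj_points, ee_points:   (N, 3) and (M, 3) point clouds
    obj_normals, ee_normals: (N, 3) and (M, 3) unit surface normals
    """
    # Pairwise Euclidean distances and inner products of normals.
    dists = np.linalg.norm(obj_points[:, None, :] - ee_points[None, :, :], axis=2)
    dots = obj_normals @ ee_normals.T
    mask = (dists < r_d) & (dots < r_n)   # close together and facing each other
    obj_idx, ee_idx = np.nonzero(mask)
    contact_points = 0.5 * (obj_points[obj_idx] + ee_points[ee_idx])
    in_contact = len(contact_points) > n_c
    return in_contact, contact_points

# Toy usage with two small flat patches facing each other.
obj_pts = np.c_[np.random.rand(50, 2) * 0.05, np.zeros(50)]
ee_pts = np.c_[np.random.rand(50, 2) * 0.05, np.full(50, 0.005)]
obj_nrm = np.tile([0.0, 0.0, 1.0], (50, 1))
ee_nrm = np.tile([0.0, 0.0, -1.0], (50, 1))
ok, pts = detect_contact(obj_pts, obj_nrm, ee_pts, ee_nrm)
print(ok, len(pts))
```

The mean of the detected contact points would then serve as the origin of the contact frame C_t from which the contact shape voxels are extracted.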
6.5 Affordance transition model learning

In this section, we describe how to learn the transition affordance model using Gaussian process regression, and the forward and backward components of the prediction models.

6.5.1 Building a prediction model

The transition affordance models predict the effects of the actions for the corresponding affordances. We assume a Gaussian distribution over the change in state (the effects), which we model using Gaussian process regression. To learn the transition affordance model, we extract features from training examples gathered from the various executions of each skill set. The predicted effect e is a Gaussian distribution over the displacement caused by the action, given a set of input features f,

p(e | f) ~ N(\mu, \Sigma)

where \mu is a vector of displacement means in six dimensions, \mu \in R^6, with R^3 for position and R^3 for orientation. The covariance \Sigma is a diagonal matrix in R^{6 \times 6}, assuming independence between the axes. Given N sets of observed features F_N = {f_1, f_2, ..., f_N} with corresponding effects E_N = {e_1, e_2, ..., e_N}, the conditional distribution of the predicted effect e_{N+1} for a new action is also a Gaussian distribution, with mean and covariance

\mu_{N+1} = k_*^T (K_N + \sigma I)^{-1} E_N
\Sigma_{N+1} = k(f_{N+1}, f_{N+1}) - k_*^T (K_N + \sigma I)^{-1} k_*

where K_N is the Gram matrix,

k_*^T = [ k(f_{N+1}, f_1)  k(f_{N+1}, f_2)  ...  k(f_{N+1}, f_N) ]

and \sigma is a hyperparameter for the variance of the noise. We use a squared exponential kernel with automatic relevance determination (ARD), with length scales l_s, \forall s \in {1, 2, ..., n}, for the n features:

k(f, f') = \sigma_0 \exp\left( -\frac{1}{2} \sum_{s=1}^{n} \frac{(f_s - f'_s)^2}{l_s^2} \right)

The ARD kernel automatically determines the length scale l_s of each individual feature. Irrelevant features receive high values for l_s and thus have a lower impact on the prediction.

6.5.2 Forward and backward components of transition models

The affordance interactions happen between two entities. When an action is applied by the agent to the object, yielding an effect, reactive forces are exerted back on the agent. That is, the desired trajectory of the manipulator does not coincide with the executed positions due to the feedback force. The effect on the object is modeled by the forward transition model, and to incorporate the reactive feedback, we include a backward component of the transition model. A similar set of features is used to train the backward model. In addition, we add one more set of features derived from the object's predicted motion, to describe the magnitude of the feedback from the interaction. The output of the backward model is used to estimate the difference between the pose of the end-effector and the desired position commanded by the robot.

6.6 Experiments

6.6.1 Experimental Setup

We use the V-REP [48] simulator environment with a manipulator arm with a 6-DOF end-effector. The end-effector is a simple gripper with two fingers and a span of 15 cm. Three different skill sets are used in the experiment: push, pick, and place. For each skill, we trained an affordance transition model on 20 training scenes and tested on 10 test scenes. Samples of actions and effects were extracted from each training scene when the robot made contact, resulting in a total of 221, 168, and 20 training samples for the push, pick, and place skills, respectively. For the placing skill, one sample was extracted per scene, as the releasing action is a one-time motion that releases the object off of the gripper.

For each skill, the scene consists of one or two cuboid objects of random size (10-25 cm for length, width, and height). The trajectory executed for pushing is a linear trajectory of 35 cm, and objects are placed along the pushing direction with a deviation of 30 cm in the x axis. Grasping starts from 15 cm above the object, goes down toward the grasping pose at half of the object height, attempts the grasp, and then moves upward by 20 cm. The placing action starts with the grasped cuboid of fixed size (10x10x10 cm) 20 cm above the target object, approaches a releasing position 0-10 cm above the target object, and then releases the object.
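Purely for illustration, the scripted skill trajectories just described could be generated as simple end-effector waypoint lists like the following. The function names, signatures, and waypoint conventions are assumptions for this sketch and not the controller interface used in the simulator; only the listed dimensions (35 cm push, 15 cm approach, 20 cm lift and hold heights) follow the text.

```python
import numpy as np

def push_trajectory(start, direction, length=0.35, n_steps=10):
    """Linear push of `length` meters along `direction`, returned as end-effector waypoints."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    start = np.asarray(start, dtype=float)
    return [start + d * length * i / n_steps for i in range(n_steps + 1)]

def pick_trajectory(obj_xy, obj_height, approach=0.15, lift=0.20):
    """Approach from above, descend to a grasp pose at half the object height, then lift."""
    x, y = obj_xy
    start = np.array([x, y, obj_height + approach])
    grasp = np.array([x, y, 0.5 * obj_height])           # fingers close here
    return [start, grasp, grasp + np.array([0.0, 0.0, lift])]

def place_trajectory(target_xy, target_height, start_above=0.20, release_above=0.05):
    """Hold the grasped cube above the target object, descend to the release pose, then open."""
    x, y = target_xy
    return [np.array([x, y, target_height + start_above]),
            np.array([x, y, target_height + release_above])]  # fingers open here

print(len(push_trajectory([0.5, 0.0, 0.05], [0.0, 1.0, 0.0])))
```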
The action frames at the moments of opening and closing the fingers were encoded as finger actions, directed toward and away from the center of the gripper. For the global features, a V_g = 25x25x25 cm volume was divided with a grid size of d_g = 5 cm. For the contact features, a V_c = 20x10x20 cm volume was divided with a grid size of d_c = 5 cm. Overall, n = 125 + 32 + 6 + 6 = 169 features were extracted from each sample. The prediction models are learned separately for each skill using GP regression with the ARD kernel. The hyperparameters of the kernels were initialized with kernel amplitude \sigma_0 = 0.1 and signal noise \sigma = 0.1, and optimized using an iteratively reweighted least squares method.

6.6.2 Forward and backward components of the transition model

As part of our experiments, we evaluated the result of not using both the forward and backward components of the transition model for affordance prediction. Predicting the effects over time without the backward feedback model may yield physically infeasible configurations of the end-effector and the object. In such cases, the robot cannot extract proper contact features.

Figure 6.4: The mean distance errors of the prediction models with and without the backward component of the transition model, over the x, y, and rz dimensions. The model utilizing the backward component has better accuracy on the push-along axis y, resulting from compensating the pose of the end-effector while computing the predictions.

Fig. 6.4 shows the prediction accuracy over 10 test scenes of pushing one object. The errors along the x direction and the rotation rz are not significantly different, while the error in the push direction y is significantly improved, from 5.8 cm to 2.57 cm. This is due to the backward transition, which compensates the pose of the end-effector being pushed back while pushing the object.

6.6.3 Affordance prediction over skill sets

Three different skills, push, pick, and place, were trained using the affordance prediction model. Each model was learned using the same set of features, extracted when an interaction between the end-effector and the object happens. For picking, possible contacts occur when the object is grasped by the gripper or when the gripper slips off the surface of the object because the object is too large to grasp (Fig. 6.5, top). The effect will be the same as the input action when the object is successfully grasped, as the object moves along with the end-effector. On the other hand, the effect will be zero when grasping fails. For the placing action, contact is defined when the releasing action is performed (Fig. 6.5, bottom). If the releasing pose of the end-effector is right above the surface, the effect will be zero, and when the object falls off, the effect will be the distance between the objects.

Figure 6.5: Objects which are graspable (top-left), not graspable (top-right), placeable (bottom-left), and not placeable (bottom-right).

In Fig. 6.6, we show the results of the prediction over the pick and place skills. Out of 10 test cases for each skill, 5 were cases where the object was not pickable or not placeable. For both, the affordance prediction model correctly estimated the displacement of the object to be nearly zero. In the picking case, 2 of the graspable objects were estimated as not pickable, with the predicted displacement of the object being relatively small compared to the real value.
Since the pose of the object is estimated from the prediction of the previous step, failing to estimate the correct affordance at the first moment of contact can alter the predictions at further stages.

Figure 6.6: Prediction errors of the pick and place affordance prediction models over position (x, y, z) and orientation (rx, ry, rz). When the prediction model correctly estimates that an object is not pickable or not placeable, the error is nearly zero.

6.6.4 Application of the modular structure to multiple-object scenarios

Here we show the results from applying the modular structure in multiple-object scenarios. In Fig. 6.7, an example of a two-object pushing manipulation is shown on the left, and the prediction using the modular model is shown on the right. In such a scenario, the prediction is made by estimating the possible interactions and estimating the effects (trajectories) using the modular model. First, the robot estimates the interaction with the first object from the desired trajectory of the end-effector. The modular model with forward and backward components predicts the effects on the object (shown as the blue trajectory) and on the end-effector (magenta trajectory). From the resulting trajectory of the first object, the further feasible interaction between the two objects is then predicted using the same modular structure.

We compare the results of applying the modular structure with predictions from a monolithic model. The monolithic model learns an affordance prediction model from an overall scene-level perspective for each object in the scene. Instead of modeling the individual interactions between objects, the monolithic approach builds an affordance model per object based upon the given scene. Therefore, the features for the monolithic model consist of all features of the scene: the global configuration of the end-effector and the objects, the objects' poses relative to the end-effector, and the action pose features.

Figure 6.7: An example of pushing two objects over time steps. The ground truth execution is shown on the left. On the right, the ground truth trajectories of the end-effector (red) and objects (green) are shown compared to the predicted trajectories of the end-effector (magenta) and objects (blue).

20 different scenes of pushing two objects were used for training, and 10 scenes were tested for both models. The mean distance errors between the predictions and the ground truth for the two approaches are shown in Fig. 6.8. Since the modular structure incorporates both object-agent and object-object interactions, the prediction accuracy on the push-off axis x is worse than that of the monolithic model. However, on the push-along axis y, the modular model performs better for both objects, with errors of 1.89 cm and 1.02 cm compared to the monolithic model's 4.83 cm and 2.49 cm. This is because the monolithic model does not predict based on the interactions. The monolithic model predicted the movement of the first object as an unhindered push and predicted the second object to move even when there was no interaction. In comparison, the modular model correctly estimated the existence of the interaction between the first and the second object.
Additionally, since the monolithic model is trained specifically for the case of two objects, it cannot be extended to scenarios with a different number of objects, while the modular model can adapt.

Figure 6.8: Mean prediction errors of the monolithic and modular models for both objects. The effects on z, rx, and ry are not significant and are omitted, as the push is performed on the table surface. The modular model significantly outperforms the monolithic model on the y axis by accurately predicting the possible interactions between objects.

6.6.5 Future work

The modular structure of affordance prediction can be applied to other skills, which can be extended to task planning. For instance, building a tower by stacking objects can be performed using the prediction model for placing. A planning scheme estimating possible combinations of placeable affordances between objects can be derived from the modular placeable affordance prediction model.

6.7 Conclusion

We presented an exploration of a modular affordance learning framework for multiple object interactions. We showed that the modular affordance prediction model can be learned for push, pick, and place skills. Each model was trained with a set of features describing the shape, contact, and action of the interaction. For more accurate prediction of the interaction, forward and backward components were considered in the transition modeling. The validity of the modular structure was verified in the case of multiple object interactions.

Chapter 7
Conclusions

In this thesis, we have presented learning affordances through interactive perception and manipulation. Data were acquired from an RGB-D sensor, and manipulation samples were gathered from real robot experiments. Various affordance modeling techniques were applied to capture the variety inherent in environmental contexts. Through extensive experiments on robot tasks, the trained models and the learning framework were verified, resulting in robots that can successfully carry out autonomous tasks by understanding affordances. In this chapter, we summarize the contributions made in this thesis.

In Chapter 2, we presented a semantic 3D point cloud object affordance classifier. The point cloud acquired from the Kinect sensor was divided into segments and classified by logistic regression into affordance labels. Object segmentation was further improved by iterative k-means clustering and incremental merging of multiple point clouds. As a result, a robot could map the surrounding environment from the perception data. The advantage of labeling object affordances is that the labels can be directly applied to planning manipulation, without recognizing object identities. Also, the predicted labels could be utilized as new features, which can improve the accuracy of segmenting out the object.

In Chapter 3, we presented a method to build an affordance map for robotic tasks by interactive manipulation. The robot explored an unknown environment to build a 2D occupancy grid map along with a 3D registered world point cloud map. The point cloud was voxelized, and the corresponding affordance for each voxel was predicted from unary and pairwise geometric features. Based upon the predictions, an interactive manipulation scheme was planned and executed using an MRF model, which reduces the uncertainty of the affordance predictions. As a result, a complete affordance map was constructed.
The proposed work successfully enabled a robot to obtain a task-specific affordance map, which can be used in a robotic task such as re-arrangement.

In Chapter 4, a prediction model for push affordances was learned based on context information about the environment. Object, contact, and context features were extracted to capture the relational context information of the scene. A hierarchical approach was applied to differentiate the types of situations by building multiple Gaussian process regression models. Each model automatically determined the influence of each feature. The learned prediction model was evaluated on a pushing task to move an object to a desired position while avoiding collisions with obstacles.

In Chapter 5, hierarchical prediction modeling with feature relevance was studied for push affordances. Each model learned the relevance of object, contact, and context features according to the type of situation. The hierarchical models captured the various types of manipulation effects when pushing two objects in the scene. The high-level activation distribution described which submodel the scene belongs to, and for each submodel a low-level Gaussian process regression model was learned. The hierarchical structure of the modeling allowed the robot to capture an optimized number of interaction types and the corresponding relevance of the features.

In Chapter 6, an exploration of a modular affordance learning framework for multi-object interactions was presented. We showed that the modular affordance prediction model can be learned for a single interaction and extended to predict sequential chain reactions of interactions. The general learning framework can be applied to various robot skills such as pushing, picking, and placing. Each model was trained with a set of features describing shapes, contacts, and applied actions during the interaction. To improve the prediction accuracy, forward and backward components of the interaction were considered. The modular prediction models were verified with simulated experiments of multi-object interactions. Compared to a monolithic modeling approach, which builds a model for every specific scenario, our approach utilizes modularity to predict the various results of interactions.

To summarize, learning affordances through interactive perception and manipulation has been discussed throughout this thesis. The concept of learning affordances and applying them to various robotic tasks was verified with real robot experiments. For future work, affordance learning can be further studied in two directions. First, the robot can use an interactive perception and manipulation strategy to gather scalable amounts of data by itself to improve the modeling accuracy. Data gathering could be done more efficiently by strategically choosing the sampling manipulation that will improve the model the most. Second, affordance learning can be extended to various skills, even complex skills that cannot be labeled by humans. This could pave the way towards robots that automatically distinguish varying contexts for different skills and build models to predict the effects of actions for general manipulation tasks.

Bibliography

[1] J. Lalonde, N. Vandapel, D. Huber, and M. Hebert, "Natural terrain classification using three-dimensional ladar data for ground robot mobility," Journal of Field Robotics, vol. 23, pp. 839-861, 2006.

[2] J. Gibson, "The concept of affordances," Perceiving, Acting, and Knowing, pp. 67-82, 1977.

[3] M. Stilman and J. Kuffner, "Navigation among movable obstacles: Real-time reasoning in complex environments," International Journal of Humanoid Robotics, vol. 2, no. 04, pp. 479-503, 2005.
[4] H. Wu, M. Levihn, and M. Stilman, "Navigation among movable obstacles in unknown environments," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010, pp. 1433-1438.

[5] D. Lowe, "Object recognition from local scale-invariant features," in Proc. of the Int. Conf. on Computer Vision 2, 1999, pp. 1150-1157.

[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886-893.

[7] A. Torralba, "Contextual priming for object detection," International Journal of Computer Vision (IJCV), vol. 53, p. 2003, 2003.

[8] D. Hoiem, A. Efros, and M. Hebert, "Putting objects in perspective," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2137-2144.

[9] S. Gould, P. Baumstarck, M. Quigley, A. Ng, and D. Koller, "Integrating visual and range data for robotic object detection," in Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications, 2008.

[10] B. Steder, R. B. Rusu, K. Konolige, and W. Burgard, "NARF: 3D range image features for object recognition," in Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics at the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), vol. 44, 2010.

[11] R. Shapovalov, A. Velizhev, and O. Barinova, "Non-associative Markov networks for 3D point cloud classification," in Photogrammetric Computer Vision and Image Analysis, 2010.

[12] X. Xiong and D. Huber, "Using context to create semantic 3D models of indoor environments," in Proceedings of the British Machine Vision Conference (BMVC), 2010.

[13] H. Koppula, A. Anand, T. Joachims, and A. Saxena, "Semantic labeling of 3D point clouds for indoor scenes," in Advances in Neural Information Processing Systems, 2011, pp. 244-252.

[14] R. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in IEEE International Conference on Robotics and Automation (ICRA), 2011, pp. 1-4.

[15] G. Medioni, M. Lee, and C. Tang, A Computational Framework for Segmentation and Grouping. Elsevier, 2000.

[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1, 2009.

[17] H. Koppula, R. Gupta, and A. Saxena, "Learning human activities and object affordances from RGB-D videos," International Journal of Robotics Research (IJRR), 2013.

[18] K. Varadarajan and M. Vincze, "AfNet: The affordance network," in Computer Vision ACCV, 2012.

[19] M. Dogar and S. Srinivasa, "A framework for push-grasping in clutter," Robotics: Science and Systems VII, 2011.

[20] K. Hausman, F. Balint-Benczedi, D. Pangercic, Z.-C. Marton, R. Ueda, K. Okada, and M. Beetz, "Tracking-based interactive segmentation of textureless objects," in Robotics and Automation (ICRA), IEEE International Conference on, 2013.

[21] V. Chu, I. McMahon, L. Riano, C. McDonald, Q. He, J. Martinez Perez-Tejada, M. Arrigo, N. Fitter, J. Nappo, T. Darrell, and K. Kuchenbecker, "Using robotic exploratory procedures to learn the meaning of haptic adjectives," in Robotics and Automation (ICRA), IEEE International Conference on, 2013.

[22] M. Gupta, J. Mueller, and G. Sukhatme, "Using manipulation primitives for object sorting in cluttered environments," IEEE Transactions on Automation Science and Engineering, 2013, (to appear).

[23] E. Şahin, M. Çakmak, M. Doğar, E. Uğur, and G. Üçoluk, "To afford or not to afford: A new formalization of affordances toward affordance-based robot control," Adaptive Behavior, vol. 15, no. 4, pp. 447-472, 2007.

[24] D. Kim and G. Sukhatme, "Semantic labeling of 3D point clouds with object affordance for robot manipulation," in IEEE International Conference on Robotics and Automation (ICRA), 2014.

[25] K. Lai, L. Bo, and D. Fox, "Unsupervised feature learning for 3D scene labeling," in IEEE International Conference on Robotics and Automation (ICRA), 2014.

[26] J. Mooij, "libDAI: A free and open source C++ library for discrete approximate inference in graphical models," Journal of Machine Learning Research, vol. 11, pp. 2169-2173, Aug. 2010.

[27] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor, "Learning object affordances: From sensory-motor coordination to imitation," IEEE Transactions on Robotics, vol. 24, no. 1, pp. 15-26, 2008.

[28] S. Griffith, J. Sinapov, V. Sukhoy, and A. Stoytchev, "A behavior-grounded approach to forming object categories: Separating containers from noncontainers," IEEE Transactions on Autonomous Mental Development, pp. 54-69, 2012.

[29] L. Stark and K. Bowyer, "Function-based generic recognition for multiple object categories," CVGIP: Image Understanding, vol. 59, no. 1, pp. 1-21, 1994.

[30] C. Castellini, T. Tommasi, N. Noceti, F. Odone, and B. Caputo, "Using object affordances to improve object recognition," IEEE Transactions on Autonomous Mental Development, vol. 3, no. 3, pp. 207-215, 2011.

[31] A. Myers, C. L. Teo, C. Fermuller, and Y. Aloimonos, "Affordance detection of tool parts from geometric features," in Robotics and Automation (ICRA), IEEE International Conference on, 2015, pp. 1374-1381.

[32] E. Uğur and E. Şahin, "Traversability: A case study for learning and perceiving affordances in robots," Adaptive Behavior, 2010.

[33] D. Kim, J. Sun, S. Oh, J. Rehg, and A. Bobick, "Traversability classification using unsupervised on-line visual learning for outdoor robot navigation," in Robotics and Automation (ICRA), IEEE International Conference on, 2006, pp. 518-525.

[34] K. Zimmermann, P. Zuzanek, M. Reinstein, and V. Hlavac, "Adaptive traversability of unknown complex terrain with obstacles for mobile robots," in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 5177-5182.

[35] J. Barry, K. Hsiao, L. P. Kaelbling, and T. Lozano-Perez, "Manipulation with multiple action types," in International Symposium on Experimental Robotics, June 2012.

[36] J. King, M. Klingensmith, C. Dellin, M. Dogar, P. Velagapudi, N. Pollard, and S. Srinivasa, "Pregrasp manipulation as trajectory optimization," in Robotics: Science and Systems, 2013.

[37] T. Hermans, F. Li, J. Rehg, and A. Bobick, "Learning contact locations for pushing and orienting unknown objects," in Humanoid Robots (Humanoids), 2013 13th IEEE-RAS International Conference on, 2013, pp. 435-442.

[38] M. Kopicki, S. Zurek, R. Stolkin, T. Mörwald, and J. Wyatt, "Learning to predict how rigid objects behave under simple manipulation," in Robotics and Automation (ICRA), 2011 IEEE International Conference on, 2011.

[39] O. Kroemer, H. Van Hoof, G. Neumann, and J. Peters, "Learning to predict phases of manipulation tasks as hidden states," in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 4009-4014.

[40] O. Kroemer, C. Daniel, G. Neumann, H. Van Hoof, and J. Peters, "Towards learning hierarchical skills for multi-phase manipulation tasks," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1503-1510.

[41] E. Uğur, E. Öztop, and E. Şahin, "Goal emulation and planning in perceptual space using learned affordances," Robotics and Autonomous Systems, vol. 59, no. 7, pp. 580-595, 2011.

[42] C. Eppner and O. Brock, "Planning grasp strategies that exploit environmental constraints," in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 4947-4952.

[43] D. Omrčen, C. Böge, T. Asfour, A. Ude, and R. Dillmann, "Autonomous acquisition of pushing actions to support object grasping with a humanoid robot," in 2009 9th IEEE-RAS International Conference on Humanoid Robots, 2009, pp. 277-283.

[44] A. Stoytchev, "Behavior-grounded representation of tool affordances," in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on. IEEE, 2005, pp. 3060-3065.

[45] J. Scholz and M. Stilman, "Combining motion planning and optimization for flexible robot manipulation," in 2010 10th IEEE-RAS International Conference on Humanoid Robots, 2010, pp. 80-85.

[46] N. Kitaev, I. Mordatch, S. Patil, and P. Abbeel, "Physics-based trajectory optimization for grasping in cluttered environments," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 3102-3109.

[47] A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, J. Bohg, T. Asfour, and S. Schaal, "Learning of grasp selection based on shape-templates," Autonomous Robots, vol. 36, no. 1-2, pp. 51-65, 2014.

[48] E. Rohmer, S. Singh, and M. Freese, "V-REP: A versatile and scalable robot simulation framework," in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 1321-1326.
Abstract
Robots can plan and accomplish various tasks in unknown environments by understanding the underlying functionalities of objects around them. These attributes are called affordances, describing action possibilities between a robot and objects in the environment. An affordance is not a universal property due to its relative nature; therefore, it must be learned from experience. Such learning involves predicting affordances from perception, followed by interactive manipulation. Learned affordance models can be directly applied to robotic tasks, as the model describes how to manipulate and what the consequence will be.

In this thesis, we present several methods to learn affordances with interactive perception and manipulation. Specifically, we introduce learning affordance models from perception and utilizing the predicted affordances to generate an interactive manipulation scheme. First, we examine building affordance models from perception only. From 3D point cloud data, visual features are extracted and predictions of affordances are made. The developed model incorporates relative geometric information of nearby objects, and the predicted labels are utilized for refined object segmentation. Next, we look at planning interactive manipulation based on the predicted affordances to build an affordance map. The robot predicts the affordances of objects, and they are examined with manipulation. The perception-manipulation loop is iterated by applying a maximum-information-gain strategy to build a map until convergence. Lastly, we propose three different affordance modeling schemes. The context-based affordance model is introduced to efficiently consider the context information when building affordance models, and a hierarchical approach is used for categorizing the different types of effects caused by the actions. The model is further developed by finding an optimal number of submodels, grouping the distinctive effects of actions into submodels with weighted predictions. Finally, interaction-based modeling of affordances between entities is studied, where modular predictions can be applied to multi-object scenarios in cluttered environments. The general set of features and effect predictions is also applied to various skills: pushing, picking, and placing. For the developed affordance models, extensive experiments are performed to verify the models and their application to robotic tasks. We have shown that the modeling scheme can be applied to accurately predict the consequences of actions in various environments. Our learning framework with interactive perception and manipulation can be extended to many other robotic applications.