FROM ACTIVE TO INTERACTIVE 3D OBJECT RECOGNITION

by

Bharath Sankaran

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2018

Copyright 2018 Bharath Sankaran

Dedication

In fond memory of my dear friend, Varun Kumar Dhanapathy. A journey of a thousand miles starts with a single step. Wish I could let you know that today I have managed to take another small step.

Acknowledgments

First and foremost I would like to thank my advisor Stefan Schaal. I have learned a lot by watching you. Thanks for being the Drona to my Ekalavya. I would like to thank you for creating a collaborative work environment that has fostered so much creativity. I would also like to thank my qualifying and dissertation committee members: Nora Ayanian, Gaurav Sukhatme, Laurent Itti, Keith Jenkins and Geoffrey Spedding. A special thanks to Nora for stepping in to chair my defense in Stefan's absence. I really appreciate the gesture (it also makes me your first official student).

Next I would like to thank my collaborators: Jeannette Bohg, Nathan Ratliff, Nikolay Atanasov and Kostas Daniilidis. I have immensely benefited from all the intellectual discussions we have had over the years. These discussions have opened my mind to new ideas and avenues for scientific pursuit. I would also like to acknowledge my other collaborators on the work done outside the scope of this thesis. Thanks Karol for the numerous discussions we have had on how to characterize Interactive Perception. Thanks Sri and Yuichi for your amazing mentorship at MERL. I would also like to thank David, Marjan and Liron for the long nights spent understanding the depths of submodularity and its implications on the problems of the world.

I also want to thank my CLMC-AMD family for being an integral part of my journey over the last 6 years.
A special mention to Vince, Nick, Sean, Vincent, Alex, Manuel, Daniel, Harry, Yevgen, Franzi, Felix, Alon and Ludo. It was a pleasure to get to know each of you personally, and I truly enjoyed the time we spent together.

Six years is a very long time. You are not informed when you join the program that life still goes on and you have to face the challenges it throws your way, while still staying focused on your PhD. I would not have made it through this journey had it not been for the amazing support of some of my closest friends. First and foremost Lizsl De-Leon Spedding, thanks for being an amazing friend and mentor during these past 6 years. Next, Tobi aka Herr Flach, thanks for all the early gym mornings and the late poker nights. Then Mr. Askew, yeah I am not calling you Dr yet! Jamie, thanks for your continued support and all the unintended pub crawls. Next in line, Dr Sahu. Thanks for your support during my times of personal crisis. Hopefully I can reciprocate it someday. Then, I would like to thank Stuart. Thanks for your friendship and patience. Two more people I cannot leave out of any conversation: Gautam Tikekar and Aswin Sundarakrishnan. Machan Shwin, here's to one more chapter, let's see where life takes us from here. G, words cannot describe how amazing a friend you have been over the years. Thanks for everything so far and all the things to come.

Finally I would like to thank my family for all their support over the years. I cannot explain how grateful I am and I hope I have made you proud. Last but not least, my dear Angèle. I could not have asked for a better partner. Your support over the last year has gotten me to the finish line and now I cannot wait for us to finally start our lives together.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Notation
Abstract

1 Introduction
  1.1 Thesis Outline and Contributions

2 Problem Formulation
  2.1 Stochastic Optimal Control for Active and Interactive Information Acquisition
    2.1.1 Information Measures for Object Recognition
    2.1.2 Capturing the Cost of Actions

3 Tools for Inference and Decision Making
  3.1 Recursive Bayesian Estimation
    3.1.1 Particle Filtering
  3.2 POMDP
  3.3 Max Margin Learning

4 Static 3D Object Recognition
  4.1 Introduction
  4.2 Related Work
  4.3 Viewpoint Pose Tree
    4.3.1 Feature extraction
    4.3.2 Training the VP-Tree
    4.3.3 Online Classification
    4.3.4 Performance of the VP-Tree
  4.4 Correspondence Grouping
  4.5 Global Hypothesis Verification
  4.6 Conclusions

5 Active 3D Object Recognition
  5.1 Introduction
  5.2 Related Work
  5.3 Problem Formulation
    5.3.1 Discrete Space Active Information Acquisition
    5.3.2 Sensing
    5.3.3 Mobility
    5.3.4 Active Object Classification and Pose Estimation
  5.4 Observation Model
  5.5 Active Hypothesis Testing
  5.6 Implementation Details
    5.6.1 Segmentation and data association
    5.6.2 Coupling among objects
  5.7 Performance Evaluation
    5.7.1 Performance evaluation in simulation
    5.7.2 Accuracy of the orientation estimates
    5.7.3 Performance evaluation in real-world experiments
  5.8 Conclusions
  5.9 Challenges in Active Perception

6 Interactive 3D Object Recognition
  6.1 Introduction
  6.2 Related Work
  6.3 Learning Greedy Control Policies from Demonstration
    6.3.1 Low Dimensional Simulation Setup
    6.3.2 High Dimensional Simulation Setup
  6.4 Interactive Information Acquisition for 3D Object Recognition
    6.4.1 Problem Formulation
    6.4.2 Experiments and Observations
  6.5 Conclusions
  6.6 Challenges in Interactive Perception

7 Summary and Future Directions

Reference List

List of Tables

5.1 Simulation results for a bottle detection experiment
5.2 Results for a real-world bottle detection experiment
6.1 Results of Learned Greedy Policy vs Greedy Heuristic Policy

List of Figures

1.1 The images on the left show controlled industrial settings where robots have traditionally been deployed. These include manufacturing plants, which have seen the incorporation of robot manipulators, and warehousing applications, which have seen the deployment of mobile robots. Soon we envision a transition of robots from these controlled settings to more unstructured human environments. These settings are either too hazardous for humans to operate in or too physically demanding (images on the right). Both these settings are also extremely cluttered and noisy for autonomous vision systems to succeed.

1.2 The images on the left show the output of a state-of-the-art object recognition system Zhang et al. (2013). Despite reasoning about geometry and the 3D structure of the scene, the system still makes basic semantic errors in object recognition. The image on the right depicts the performance of a state-of-the-art object recognition system in cluttered environments Hebert and Hsiao (2012). It can be seen that in noisy environments the algorithm is susceptible to false positives despite reasoning about occlusions.

1.3 A mechanical system where movement of Kitten A is replicated onto Kitten P. Both kittens receive the same visual stimuli. Kitten A controls the motion, i.e. it is active. Kitten P is moved by Kitten A, i.e. it is passive. Only the active kittens developed meaningful visually-guided behavior that was tested in separate tasks. Figure adapted from Held and Hein (1963).

2.1 The figures above show the various components of the Information Acquisition pipeline. The image on the left shows the problem components for the Active setup and the figure on the right depicts the Interactive setup. In the active setup the system has access to a sensor motion model as shown in Eq 2.1; in the interactive setting the system has access to a target motion model, Eq 2.2. The information measures (Eq 2.7, 2.6) help determine the utility of a control u_t. With recursive Bayesian estimation we can update our belief regarding the target state with the prediction and the data likelihood. The likelihood of observations is computed from the observation model, Eq 2.3.

4.1 The figure shows the static object detection and pose estimation pipeline. The offline phase consists of extracting point cloud templates from different views of CAD models, extracting descriptors from these templates and building the Viewpoint Pose Tree. The online phase involves sampling key points in the scene, computing descriptors on these key points and matching the computed descriptors to the Viewpoint Pose Tree. These matches are refined by an intermediate robust ICP step. Then these class and pose hypotheses are refined by verifying geometric consistency to reject false positives. Finally, these hypotheses are verified in a final hypothesis verification optimization procedure.

4.2 The sensor position is restricted to a set of points on a sphere centered at the location of the object. Its orientation is fixed so that it points at the centroid of the object. A point cloud is obtained at each viewpoint, key points are selected, and local features are extracted (top right). The features are used to construct a VP-Tree (bottom right).

4.3 The figure shows the online classification phase of the VP-Tree. First key points are sampled on the scene, then descriptors are computed at each one of these key points. These scene descriptors are then matched to the VP-Tree descriptors to recover the corresponding template. The recovered template gives us the corresponding model and pose.

4.4 Confusion matrix for all classes in the VP-Tree. A class is formed from all views associated with an object.

4.5 Effect of signal noise on the classification accuracy of the VP-Tree.

4.6 This figure gives the intuition behind geometric correspondence grouping. The scene correspondences of key points from a particular object model are checked for geometric consistency based on their geometric relationship in the object model. If this constraint is violated by a matched key point, then the match is pruned. With this approach we refine matches between the scene and the object model.

4.7 This figure shows the output of the 3D recognition pipeline shown in Fig 4.2. The green points belong to a recognized target object. The red points belong to other objects that the Viewpoint Pose Tree was trained on. The white points represent data that was either not matched or explained during hypothesis verification.

5.1 Database of models.

5.2 Observation model obtained with seven hypotheses for the Handlebottle model and the planning viewpoints used in the simulation experiments (Sec. 5.7.1). Given a new VP-Tree observation z_{t+1} from the viewpoint x_{t+1}, the observation model is used to determine the data likelihood of the observation and to update the hypotheses' prior by applying Bayes rule.

5.3 Twenty five hypotheses (red dotted lines) were used to decide on the orientation of a Watercan. The top plot shows the error in the orientation estimates as the ground truth orientation varies. The error averaged over the ground truth roll, the hypotheses over the object's yaw (blue dots), and the overall average error (red line) are shown in the bottom plot.

5.4 An example of the experimental setup (left), which contains two instances of the object of interest (Handlebottle). A PR2 robot with an Asus Xtion RGB-D camera attached to the right wrist (middle) employs the nonmyopic view planning approach for active object classification and pose estimation. In the robot's understanding of the scene (right), the object which is currently under evaluation is colored yellow. Once the system makes a decision about an object, it is colored green if it is of interest, i.e. in I, and red otherwise. Hypothesis H(0°) (Handlebottle with yaw 0°) was chosen correctly for the green object. See the video in https://tinyurl.com/y99rpsyg for more details.

6.1 The figures above show the fundamental difference in the process models between active and interactive methods. As shown in the figure on the left, active methods only change viewpoint. This can be easily approximated for tractability to model the behavior of the classification algorithm as shown in Chapter 5.3.3. This helps us derive non-myopic solutions for the active problem setting. In contrast, interactive methods need to model the evolution of an environment as a consequence of contact. When there are multiple objects in contact in the scene this model is non-trivial to approximate over large time horizons. Hence we resort to myopic planning.

6.2 The image on the left shows the simulation environment where the robot is looking for a target object. Instances of the target object in the scene are marked with a green square. Only one of the objects is visible from the viewpoint of the robot, as shown in the right image. This instance is partially occluded (marked green). The second instance is fully occluded (marked red).

6.3 The first figure shows the simulation environment. The learned (cyan) agent has knowledge about the entire state of the environment but can only manipulate the cells in its immediate vicinity (8-connected neighborhood), shown by the red arrows. In the first figure, the dark colored cells are occluded cells and the light colored cells are empty environment cells. The second figure shows the 2D pose of the hidden target object (red cells). The third figure shows a snapshot at time t of the evolution of the agent's belief about the distribution of the poses of the target object. This distribution of possible poses is shown by the green cells and the other occluded cells are shown in red. The blue cells are unoccluded environment cells. The final figure shows the result on convergence, where the agent's belief about the target object pose has converged to a single pose estimate. A video of this simulation can be viewed at https://tinyurl.com/ycsnzncv.

6.4 This figure shows the graphical model for the interactive object detection problem. It captures the conditional independences in the problem.

6.5 The figure above shows the update of the occupancy depth map for a given object pose. The target object is shown in yellow. The pose of the target object evolves from b_t to b_{t+1} on applying control u_t. This updates the occupancy of the pixel from o^i_t to o^i_{t+1}. The sensor is shown as a green circle with sensing rays.

6.6 The figure above shows the recursive update of the distribution of poses. This procedure is explained below.

6.7 The figure above shows the approximation for interactive information acquisition discussed in this section. We first sample poses according to their likelihood (1). Then the sampled poses are propagated according to the different greedy control policies we want to evaluate (2). We then compute the expected observation on the propagated poses (3). Finally we compute the expected information for each of the greedy policies (4).

6.8 The image on the left shows the target impact wrench whose pose we are trying to estimate in the current environment. The image on the right shows the view of the environment from the RGBD sensor in Gazebo. The environment can be manipulated with pushing actions which are simulated by spherical (white) force objects. The force object is meant to emulate a gripper, weighs 5 kg and can exert a force of 5 N. The force object is 5 cm in diameter. The direction of push is shown by the green arrows.

6.9 The images above are views of the environment from the Gazebo physics simulator (left) and the sensor (right). The first row of images shows the algorithm's initial estimate of the target object pose, which is rendered as a green CAD model in the image on the right. The second row of images shows the result of convergence after executing multiple information seeking interactive actions. As can be noted in the figure on the right, the pose estimate converges close to the true pose of the object. A video of this experiment can be viewed at https://tinyurl.com/y9x2psyz.

6.10 The cyan curve is the information acquisition algorithm and the blue curve depicts the greedy open loop control policy's action selection. The figure on the left shows the convergence of the position error measured in cm. The figure on the right shows the mean orientation error (difference of roll, pitch, yaw) measured in degrees. For both policies, 10 interaction actions are evaluated.

Notation

• All functions are denoted by either small or upper case letters with parentheses, e.g. f(), H()
• All scalars are denoted by non-bold Greek letters, e.g. α, or italic small letters, e.g. a
• All vectors are denoted by bold small letters, e.g. x
• All sets are denoted by scripted upper case letters, e.g. I
• Subscripts in scalars, vectors and functions indicate type from a general class of scalars and vectors
• Subscripts in sets indicate the type of variable contained in the set, e.g. if y is the pose, then L_y is the set of poses
• The current time step is always denoted by t and variables relating to time are denoted by T or τ

Abstract

If robots are to successfully migrate from controlled industrial settings to unstructured human environments, they should be capable of robust perception in densely cluttered, noisy environments plagued with poor lighting. They should also be able to reason about inaccurate models of the environment. This transition will require robots to move towards building semantic representations of the environment by detecting objects of interest and accurately estimating their pose. Most of the current state-of-the-art in object recognition is still based on single view recognition, and its performance is limited by occlusions and by ambiguity in appearance and geometry.

In this thesis we explore the use of movement and interaction to solve one of the fundamental problems of computer vision: object detection and pose estimation. We exploit the notion that perception is a process that is both active and exploratory, and reformulate the problem of 3D object recognition as one of movement and interaction. We formalize these problems as information acquisition optimal control problems and adapt them to the active and interactive settings. We develop both myopic and non-myopic solutions to the information acquisition problem for 3D object recognition using tools from optimal control and inverse optimal control. We also introduce efficient approximations to these optimal control problems by exploiting the nature of the information measure we are utilizing or exploiting the behavior of the static recognition system.
Our approximations allow for tractable solutions to otherwise intractable problems.

Chapter 1
Introduction

With the rapid progress in robotics research, the utility of autonomous robots will no longer be restricted to controlled industrial environments. They will soon find themselves in highly unstructured human environments where they will have to deal with very high degrees of clutter and heavy occlusions while also having to cope with noisy sensing (Figure 1.1). The factors that will drive this transition will be the need to deploy robots in environments which are either unsafe for humans or too physically demanding.

Figure 1.1: The images on the left show controlled industrial settings where robots have traditionally been deployed. These include manufacturing plants, which have seen the incorporation of robot manipulators, and warehousing applications, which have seen the deployment of mobile robots. Soon we envision a transition of robots from these controlled settings to more unstructured human environments. These settings are either too hazardous for humans to operate in or too physically demanding (images on the right). Both these settings are also extremely cluttered and noisy for autonomous vision systems to succeed.

In both these settings perception will be a complex and noisy task, as the robotic systems operating in these settings will not have access to accurate models of the environment. In this spirit, robotics will have to start shifting from specific predefined motion in controlled environments to high level interactive tasks in unstructured environments. The effective execution of such tasks requires the addition of semantic information to the traditional traversability representation (metric maps) of the environment. For example, apart from having access to geometric maps for navigation, robots will need to identify objects of interest and estimate their pose accurately. This will enable the successful completion of manipulation tasks. The ability to effectively parse the environment for such tasks is well known to be error prone. The common sources of error include occlusions, clutter, variations in lighting, and imperfect models in complex environments. Such environments are also known to be extremely challenging for current state-of-the-art vision systems, as shown in Figure 1.2.

Figure 1.2: The images on the left show the output of a state-of-the-art object recognition system Zhang et al. (2013). Despite reasoning about geometry and the 3D structure of the scene, the system still makes basic semantic errors in object recognition. The image on the right depicts the performance of a state-of-the-art object recognition system in cluttered environments Hebert and Hsiao (2012). It can be seen that in noisy environments the algorithm is susceptible to false positives despite reasoning about occlusions.

Techniques like probabilistic modeling (Thrun et al., 2005), fuzzy logic, evidence based estimation and learning from data have been used to cope with uncertainty in perception, but recent surveys Bajcsy et al. (2018); Sankaran et al. (2017) have shown that movement and interaction can help simplify complex perception tasks that are otherwise unsolvable. Despite these insights, robot visual perception is still treated as a single view inference problem. This is particularly evident in core problems of visual perception such as object recognition, which are still largely treated as static decision making problems. In fact, state-of-the-art techniques in visual inference are still performed using models that use single images or point clouds as input. Such an approach to visual inference requires the models to: either be learned from datasets containing hundreds of thousands of annotated static images, as in the case of deep learning techniques (ex: Szegedy et al. (2015)); or perform complex single view inference, as in the case of 3D recognition (ex: Aldoma et al. (2016)).
In both these settings, exploration and interaction are not considered to be part of the visual process. This has also historically been the case in computer vision, where the problem of object detection and pose estimation has primarily been addressed with the assumption that the sensing device is static, ex: Lowe (2004), Belongie et al. (2002), Fan et al. (1989). This approach to visual perception persists despite the widely accepted insight that visual perception in humans and animals is an active and exploratory process Gibson (1979); O'Regan and Noë (2001).

Figure 1.3: A mechanical system where movement of Kitten A is replicated onto Kitten P. Both kittens receive the same visual stimuli. Kitten A controls the motion, i.e. it is active. Kitten P is moved by Kitten A, i.e. it is passive. Only the active kittens developed meaningful visually-guided behavior that was tested in separate tasks. Figure adapted from Held and Hein (1963).

There is also strong evidence in biology to support this claim. For instance, Gibson (1966) showed that physical interaction further augments perceptual processing beyond what can be achieved by deliberate pose changes. In a specific experiment, human subjects had to find a reference object among a set of irregularly-shaped, three-dimensional objects. They achieved an average accuracy of 49% if these objects were shown in a single image. This accuracy increased to 72% when subjects viewed rotating versions of the objects. They achieved nearly perfect performance (99%) when touching and rotating the objects in their hands. There is also a strong argument to be made for endowing robots with the ability to actively explore the environment. Held and Hein (1963) analyzed the development of visually-guided behavior in kittens. They found that this development critically depends on the opportunity to learn the relationship between self-produced movement and concurrent visual feedback. The authors conducted an experiment with kittens that were only exposed to daylight when placed in a carousel, as depicted in Fig. 1.3. Through this mechanism, the active kittens (A) transferred their own, deliberate motion to the passive kittens (P) that were sitting in a basket. Although both types of kittens received the same visual stimuli, only the active kittens showed meaningful visually-guided behavior in test situations. Now, with the advent of mobile manipulators, we can finally start addressing visual perception as an active and exploratory process.

Optimizing motion that improves the estimate of the sensed quantity also finds parallels in human biology. Najemnik and Geisler (2005) have shown that in the problem of visual search, the human eye does not fixate randomly, but moves to maximize the posterior probability of the location of the target being searched. The aforementioned study, however, does not show any evidence to support the notion that the cost of movement is penalized during fixation. In contrast, in the formalism that we introduce in Chapter 2 we account for the cost of movement.

With these insights, in this document we address one of the central problems in visual perception, namely object recognition, as a problem of movement and interaction. We follow the definition of an Active perception process as one where the agent has the ability to control a sensor without effecting changes in the environment Bajcsy et al. (2018); Bajcsy (1988). Similarly, we define Interactive perception as the process where an agent can apply a (potentially time-varying) force on the environment with the intent of changing the state of the environment, in order to aid perception Sankaran et al. (2017). This approach to visual perception also allows us to move away from the traditional sense-plan-act paradigm to closed perception-action loops, where we directly reason about the causal links between perception and action and vice versa.
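Returning to Najemnik and Geisler's observation above, that fixations are chosen to maximize the posterior probability of the target's location, a toy Bayesian search loop makes the idea concrete. Every specific below (the 25-location grid, the Gaussian response model with scale 0.5, and the idealized noise-free detector responses) is an illustrative assumption of this sketch, not part of their model or of this thesis.

```python
import numpy as np

N_LOCATIONS = 25          # candidate target locations (illustrative)
TRUE_LOC = 7              # hidden target location (assumed for the demo)
SIGMA = 0.5               # assumed response noise scale

belief = np.full(N_LOCATIONS, 1.0 / N_LOCATIONS)  # uniform prior over locations

def response(fixation):
    """Idealized (noise-free) detector response: 1 at the target, 0 elsewhere."""
    return 1.0 if fixation == TRUE_LOC else 0.0

def gaussian_lik(z, mean):
    """Unnormalized Gaussian likelihood of response z given its expected mean."""
    return np.exp(-0.5 * ((z - mean) / SIGMA) ** 2)

for _ in range(20):
    # Fixate wherever the posterior probability of the target is highest
    fixation = int(np.argmax(belief))
    z = response(fixation)
    # Under "target at location i", the response mean is 1 only if i == fixation
    means = np.where(np.arange(N_LOCATIONS) == fixation, 1.0, 0.0)
    belief *= gaussian_lik(z, means)
    belief /= belief.sum()            # Bayes rule: renormalize the posterior

print(int(np.argmax(belief)))         # → 7: belief concentrates on the true location
```

Each fixation either demotes an empty location or sharply boosts the true one, so the posterior-maximizing rule sweeps through the grid and then locks on; a movement cost, which this sketch omits, would penalize long saccades in the same loop.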
Recent research in this direction by Kappler et al. (2015) has looked at adapting manipulation skills based on learned reference signals. But they do not explicitly incorporate accurate perception as the objective of their approach. Our approach to perception relies on explicitly defining an optimal control problem where accurate perception is the objective of the robot.

More specifically, our goal in 3D object recognition is to accurately detect instances of a target object in the environment and estimate their three dimensional pose. At this point we would like to clarify the differences between recognition and detection. Detection (or classification) is to measure the similarity between an input shape and shapes stored in a database. In contrast, recognition not only measures similarity but also estimates the transform that maps the input to the recognized model. To accomplish this, we make a few assumptions regarding the inputs and outputs of the system. In both the active and interactive case, we assume we have a static object recognition system that can take as input a single view representation of the environment and recognize the target object. The single view representation can either be an RGB image or an RGBD point cloud. The inputs to our algorithms in both the active and interactive setting are the estimation procedure and the single view representation. The outputs of our algorithms are myopic or non-myopic control policies that select active or interactive actions to reduce the uncertainty regarding the target object. We define this in more formal mathematical terms in Chapter 2. In Active Perception, we define the actions as movements that induce a change in the pose of a mobile sensor. In Interactive Perception, we define actions as interactive actions that can change the physical state of the environment. In Interactive Perception, the pose of the visual sensor used to perceive the environment is assumed to remain unchanged during the interaction.
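The detection/recognition distinction above can be made concrete with a toy nearest-neighbour matcher: detection returns only the most similar model class, while recognition additionally returns the transform of the matched view. The three-entry descriptor database, the descriptor values, and the yaw-only "transform" are all invented for illustration; the actual system in this thesis (the Viewpoint Pose Tree of Chapter 4) is far richer.

```python
import numpy as np

# Toy database: each entry stores a shape descriptor together with the model id
# and the transform (here reduced to a single yaw angle) of the stored view.
# All values are illustrative assumptions, not the thesis's actual models.
database = [
    {"model": "bottle",   "yaw_deg": 0,  "descriptor": np.array([1.0, 0.0, 0.2])},
    {"model": "bottle",   "yaw_deg": 90, "descriptor": np.array([0.8, 0.5, 0.1])},
    {"model": "watercan", "yaw_deg": 0,  "descriptor": np.array([0.1, 0.9, 0.7])},
]

def _best_match(query):
    """Nearest neighbour in descriptor space (Euclidean distance)."""
    return min(database, key=lambda e: np.linalg.norm(e["descriptor"] - query))

def detect(query):
    """Detection/classification: report only the most similar model class."""
    return _best_match(query)["model"]

def recognize(query):
    """Recognition: report the class AND the transform of the matched view."""
    best = _best_match(query)
    return best["model"], best["yaw_deg"]

query = np.array([0.85, 0.45, 0.15])   # descriptor computed from a single view
print(detect(query))                   # → bottle
print(recognize(query))                # → ('bottle', 90): class plus pose estimate
```

The design point is that both functions run the same similarity search; recognition simply refuses to discard the pose information attached to the matched template, which is exactly what the active and interactive planners later consume.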
Challenges pertaining to both these methods are discussed in their respective sections (Sec 5.9, Sec 6.6). 1.1 Thesis Outline and Contributions The research presented in this dissertation is not the outcome of a single mono- lithic research project. It is instead the outcome of a diverse research question which is how to exploit motion and physical interaction to solve traditional prob- lems in computer vision. This question has resulted in multiple research projects during my 6 years at USC. Inasmuch this dissertation has a central theme which is leveraging movement and physical interaction to solve the problem of 3D object recognition. Some of the work presented in this dissertation was performed exclu- sively at USC Chapter 6.4 and other research was done in collaboration with Max Planck Institute for Intelligent Systems Chapter 6.3; and the collaborators in Uni- versity of Pennsylvania GRASP Lab, Chapter 5 4. The work presented in Chap- ter 4, was primarily developed by me. The observation model presented in Sec 5.4 was developed by collaborators at the GRASP Lab based on the work presented in Chapter 4. The active object detection optimization problem was inspired by workdoneinhypothesistestingbyNaghshavarandJavidi(2012);Sankaran(2012). UnlikeSankaran(2012), wehandlethenon-myopicactiveobjectdetectionproblem for multiple classes using the Viewpoint Pose Tree [Sankaran et al. (2013)], instead of the binary class next best view problem addressed in Sankaran (2012). The active object detection optimization problem was jointly developed with collabo- rators at the GRASP Lab at University of Pennsylvania. Apart from the relevant publications Sankaran et al. (2013); Atanasov et al. (2014), the results of this work 7 are also reported in the co-author’s PhD thesis [Atanasov (2015)]. The work in interactive recognition presented in Chapter 6.3 was developed independently by me in a collaborative setup with the Max Planck Institute for Intelligent Systems. 
Finally, the work presented in Chapter 6.4 was developed exclusively by me. In the rest of this dissertation we reformulate the problem of 3D object recognition as a problem of active/interactive information acquisition and use techniques from optimal control and inverse optimal control to derive solutions to these problems. The remainder of this document is organized as follows:

• In Chapter 2, we develop the general optimal control problem for information acquisition through action selection. We then adapt this formalism to the problem of active and interactive information acquisition (Sec 2.1). We also introduce optimizable information measures specific to our target problems in Sec 2.1.1 of Chapter 2.

• Chapter 3 introduces the preliminary tools required for solving the optimization problems we introduced in Chapter 2. The tools we discuss include Recursive Bayesian Estimation (Sec. 3.1), Partially Observable Markov Decision Processes (Sec. 3.2) and Max Margin Learning (Sec. 3.3).

• In Chapter 4, we discuss our work in static 3D object recognition. We introduce a novel 3D recognition algorithm called the Viewpoint Pose Tree. We also discuss the enhancements we have made to the initial algorithm using developments from the state-of-the-art that have been introduced since this work was initially published. The work in this chapter was published in Sankaran et al. (2013).

• In Chapter 5, we discuss our work in active 3D object recognition. When an initial static recognition algorithm (Chapter 4) identifies an object of interest and generates hypotheses about its class and orientation, our novel non-myopic decision making framework allows an active sensor to select a sequence of views balancing the cost of movement and the probability of error. We formulate this as an active hypothesis testing problem that includes sensor mobility and solve it using a point based approximate POMDP algorithm. The work discussed in this chapter was published in Sankaran et al.
(2013); Atanasov et al. (2014).

• In Chapter 6 we discuss our approach to interactive object recognition. We discuss the challenges associated with interactive perception and develop novel myopic solutions to the problem. We first introduce an algorithm for learning greedy control policies using techniques from inverse optimal control in Sec 6.3. This was published in (Sankaran et al., 2015). We then discuss the shortcomings of this approach in dealing with high dimensional systems and introduce our novel formulation for greedy closed loop interactive information acquisition. We discuss the approximations we make to derive a tractable solution to this problem for high dimensional systems. (At the time of submission of this dissertation, this work has not yet been published.)

• Finally in Chapter 7, we summarize this thesis and propose extensions based on the work done in this thesis.

Chapter 2
Problem Formulation

In this chapter we formulate the general optimal control problem for improving object detection and pose estimation. The main ingredients of our problem setup involve dynamic models that characterize the evolution of the physical process, the controllable degrees of freedom of the problem and the relationship between the sensor observation and the target process. These models take as input variables that characterize the state of our robot (which can be a mobile sensor or a manipulator), the state of our target object and the controls that can be applied to either state. Using these models and variables we can formalize our target problems as problems of "Active Information Acquisition via sensor control" in Chapter 5 and "Interactive Information Acquisition via manipulator control" in Chapter 6. In both the Active and the Interactive Information Acquisition setting the objective is to estimate the state of an observed system of interest by planning future configurations of the controllable device in order to improve the estimation process.
More formally, this target estimation process can be defined as follows. Let us characterize the state of our sensor at time t as x_t. The evolution of the sensor state is governed by the sensor motion model

x_{t+1} = f(x_t, u_t, noise)    (2.1)

where u_t is the control input applied at time t. Similarly, the state of the target at time t is y_t. The evolution of the target state is governed by a target motion model given by

y_{t+1} = g(y_t, v_t, noise)    (2.2)

where v_t is a control input applied at time t that impacts the target state. For uncontrolled target evolution, which is the case in Active Information Acquisition, the target evolution is simply governed by

y_{t+1} = g(y_t, noise)

The sensor receives measurements z_t of the target. This process is governed by the following observation model

z_t = h(x_t, y_t, noise)    (2.3)

The observation models can be either learned or analytical. In the Active Information Acquisition problem setting we use a learned model, and in the Interactive Information Acquisition problem we use an extension of the standard beam based model Thrun et al. (2005); Wüthrich et al. (2013).

2.1 Stochastic Optimal Control for Active and Interactive Information Acquisition

With the dynamics models defined in the previous section, the Active Information Acquisition problem can be formulated as follows.

Active Information Acquisition: Given an initial sensor state x_0 and a prior p_0 on the target state, choose control policies u_t for t = 0,...,T that maximize information about the target

max_{u_{0:T}} I(y_{1:T+1}, x_{1:T+1}, z_{1:T+1})    (2.4)
s.t. x_{t+1} = f(x_t, u_t, noise)
     y_{t+1} = g(y_t, noise)
     z_t = h(x_t, y_t, noise)

where I is an information measure. A similar optimization problem for a controllable measurement device was first introduced by Hager and Mintz (1988), and is widely adopted in the active sensing community. Atanasov (2015) also adopts a similar approach to active information acquisition.
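To make the roles of the three models concrete, here is a minimal simulation sketch of Eqs 2.1–2.3, assuming a scalar state, additive dynamics, a range-like observation and Gaussian noise. The specific f, g, h, the noise levels and the fixed control u are illustrative assumptions; the thesis uses learned or analytical models, and the control would in practice be chosen by maximizing the information measure of Eq 2.4.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, noise_std=0.05):
    # Sensor motion model (Eq 2.1): illustrative additive control.
    return x + u + rng.normal(0.0, noise_std)

def g(y, noise_std=0.01):
    # Uncontrolled target motion model (the active setting): a slow random walk.
    return y + rng.normal(0.0, noise_std)

def h(x, y, noise_std=0.1):
    # Observation model (Eq 2.3): noisy range between sensor and target.
    return abs(y - x) + rng.normal(0.0, noise_std)

x, y = 0.0, 2.0
for t in range(3):
    u = 0.5          # a control; in the thesis this is chosen to maximize I
    x = f(x, u)      # sensor moves under control u
    y = g(y)         # target evolves without control
    z = h(x, y)      # sensor measures the target
```

The sketch only exercises the interfaces of the three models; the optimization over u_{0:T} sits on top of this loop.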
Similar to the active setting, the Interactive Information Acquisition problem can be formulated as

Interactive Information Acquisition: Given a stationary sensor with initial state x_0 and a prior p_0 on the target state, choose control policies v_t for t = 0,...,T that maximize information about the target

max_{v_{0:T}} I(y_{1:T+1}, x_{1:T+1}, z_{1:T+1})    (2.5)
s.t. x_{t+1} = f(x_t, noise)
     y_{t+1} = g(y_t, v_t, noise)
     z_t = h(x_t, y_t, noise)

where I is an information measure.

Figure 2.1: The figures above show the various components of the Information Acquisition pipeline. The image on the left shows the problem components for the Active setup and the figure on the right depicts the Interactive setup. In the active setup the system has access to a sensor motion model as shown in Eq 2.1; in the interactive setting the system has access to a target motion model, Eq 2.2. The information measures (Eq 2.7, 2.6) help determine the utility of a control u_t. With recursive Bayesian estimation we can update our belief regarding the target state with the prediction and the data likelihood. The likelihood of observations is computed from the observation model, Eq 2.3.

For both of the problems defined above in Eq 2.4, 2.5, the functions f, g, h can be either linear or nonlinear. In the most general form of the problem, the states can be discrete or continuous and so can the control spaces. The definitions of the active/interactive information acquisition problems also allow for different kinds of noise distributions and information measures. In our target domains of active and interactive object detection and pose estimation, we consider optimization problems that are discrete-time and continuous-state, while our controls are always discrete. The information measures we utilize are discussed below.
2.1.1 Information Measures for Object Recognition

In the optimal control problem discussed in Sec 2.1 we maintain a probability distribution (pdf) over the target state, which we continuously refine based on observations. Given a prior pdf p_0 on the target state y_0, the belief about the target state is updated via Recursive Bayesian Estimation. We discuss the estimation procedure in greater detail in Chapter 3.1. This estimation procedure provides a pdf p_t about the current target state estimate y_t. This pdf quantifies the uncertainty regarding the target state estimate, which can be measured using an information measure I(y_{1:T+1}, x_{1:T+1}, z_{1:T+1}). This information measure helps capture the change in uncertainty of the target state estimate. When different controls are executed, they produce different changes in uncertainty. These changes in uncertainty can be used to solve the optimal control problems defined in Eq 2.4, 2.5 of the previous section. The information measure should ensure that we get the desired behavior of the robot in relation to our estimation problem. Apart from this, the measure should also be fast to compute. This will become evident later in the thesis, where this information measure needs to be evaluated multiple times in a reactive control setting. In this thesis the information measures we use are conditional entropy H(y_{1:T+1}, x_{1:T+1} | z_{1:T+1}) and probability of error. We define these two quantities for discrete random variables below.

Conditional Entropy

If x is a discrete random variable with probability mass function p(x), then the entropy of x is given by

H(x) := − Σ_x p(x) log p(x)

If x and y are discrete random variables with a joint probability mass function p(x,y) and marginal probability mass functions p(x) and p(y) respectively.
Then the entropy of y conditioned on x is given by

H(y|x) := Σ_x p(x) H(y | x = x) = − Σ_{x,y} p(x,y) log ( p(x,y) / p(x) )    (2.6)

Probability of Error

Let x_{1:t} = x_1,..., x_t be a sequence of random variables and y be a discrete random variable with a random conditional probability mass function p_t(y) := P(y = y | x_{1:t}). Then the probability of error of the maximum likelihood estimate at time t, ŷ_t := argmax_y p_t(y), is defined as

p_e(t) := P(y ≠ ŷ_t) = E_{x_{1:t}} [ Σ_y 1{y ≠ argmax_l p_t(l)} p_t(y) ] = E_{x_{1:t}} [ 1 − max_y p_t(y) ]    (2.7)

With these definitions in place we can solve our stochastic optimal control problem for information acquisition. Solving this optimal control problem for information acquisition consists of selecting a target state estimator, an appropriate information measure, a planning horizon and an optimization algorithm. The most general forms of Eq 2.4, 2.5 are considered instances of a partially observable Markov decision process Bertsekas (2000).

2.1.2 Capturing the Cost of Actions

The optimization problems defined in Eq 2.4, 2.5 assume that all controls incur the same cost and only focus on maximizing information about a target. We redefine the general problem of information acquisition as one that also accounts for the cost of selecting a control action, such that it maximizes a decision utility I_d. I_d is a combination of a movement utility I_m and an information measure I_g. Hence the optimization problems in Eq 2.4, 2.5 can be rewritten in terms of a decision utility. In Eq 2.8 this decision utility is shown for the active information acquisition problem.

Decision Utility (Active). The decision utility I_d of reaching state x_{t+1} at time t+1 by executing a control u_t at state x_t can be defined as

I_d(y_{1:T+1}, x_{1:T+1}, z_{1:T+1}, u_{1:T}) = I_g(y_{1:T+1}, x_{1:T+1}, z_{1:T+1}) + λ I_m(x_{1:T}, u_{1:T})    (2.8)

where I_g is the information measure, I_m is the movement utility and λ ≤ 1 is the trade-off between I_g and I_m.
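As a concrete illustration, the measures above and the combination in Eq 2.8 can be evaluated directly for a discrete belief. In this minimal sketch the information term I_g is the negative (unconditional) entropy of a candidate next belief, and the movement utility and trade-off λ are given scalars; these are illustrative assumptions, not the implementation used in later chapters.

```python
import numpy as np

def entropy(p):
    # H(x) = -sum_x p(x) log p(x), skipping zero-probability outcomes.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def prob_error(p):
    # Probability of error of the ML estimate (Eq 2.7): 1 - max_y p_t(y).
    return float(1.0 - np.max(p))

def decision_utility(belief_next, movement_utility, lam=0.5):
    # I_d = I_g + lam * I_m (Eq 2.8), with I_g = -H(belief_next) so that
    # lower-uncertainty outcomes score higher. Illustrative choice of I_g.
    return -entropy(belief_next) + lam * movement_utility

uniform = np.full(4, 0.25)                   # maximally uncertain belief
peaked = np.array([0.97, 0.01, 0.01, 0.01])  # nearly certain belief
```

For equal movement utilities, a control predicted to yield the peaked belief is preferred over one yielding the uniform belief, since both entropy and probability of error are lower.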
Similarly, for the interactive information acquisition problem the decision utility is defined as

Decision Utility (Interactive).

I_d(y_{1:T+1}, x_{1:T+1}, z_{1:T+1}, v_{1:T}) = I_g(y_{1:T+1}, x_{1:T+1}, z_{1:T+1}) + λ I_m(y_{1:T}, v_{1:T})    (2.9)

These costs are commonly referred to as value of information (Krause, 2008, ch. 15) or expected Bayesian risk (Karasev et al. (2012)). A formulation incorporating a decision cost and a control cost was first introduced by Bajcsy (1988) for mobile sensors. The general form of the problem with the updated information measure (Decision Utility) is also a POMDP.

Chapter 3
Tools for Inference and Decision Making

With the general information acquisition problem defined in Section 2.1 of the previous chapter, we can take a look at the tools required to solve these decision making problems. Both the information acquisition problems introduced in the previous chapter are Stochastic Optimal Control problems. In the target domain of object detection and pose estimation studied in this thesis, the active information acquisition problem is a discrete-space discrete-time stochastic optimal control problem. The exact problem definition is detailed in Chapter 5.3. The interactive information acquisition problem is a continuous-space discrete-time stochastic problem, which is defined exactly in Chapter 6.4. Before solving the stochastic optimal control problem for information acquisition, we need to design an estimator that can use sensor observations to update the system's belief about the target state. The estimator propagates a probability density function (pdf) p_t of the target states y_t over time, with motion and observation models. Since we are given a prior pdf p_0 on the target states y_0, we can initialize the estimator with this prior. The actual estimation process can be performed through filtering Huber (2015); Särkkä (2013) or graphical model inference Koller and Friedman (2009).
In this thesis we utilize variants of Recursive Bayesian Estimation or Bayesian Filtering for the estimation process. A primer for Bayes Filtering is provided in Sec 3.1. In our target applications, we utilize a sophisticated variant of the Bayes Filter in the active information acquisition setting, and a Particle Filter (Sec 3.1.1) in the interactive setting.

With the estimation procedure in place, the information acquisition optimization problems can be solved in different ways depending on the planning horizon T (Eq 2.4, 2.5). The planning horizon can be myopic or non-myopic. In the myopic setting the selected control policy is only optimized for the information measure at the next time step. This produces a greedy control policy. Non-myopic plans optimize the information measure over a larger planning horizon (T > 1). If the planning horizon is subject to optimization, we get an optimal stopping problem. Apart from being myopic or non-myopic, the optimization procedure can also be open or closed loop. Open loop control policies are computed by fixing expected values for future measurements and computing the sequence of controls based on the most likely observations. These controls are executed regardless of the actual measurements realized. Closed loop policies compute the appropriate control policy based on the actual measurements. In this thesis we only look at closed loop methods. The solution techniques for the optimization problems vary depending on the problem setup. For the interactive setting we resort to myopic policies due to the nature of the process model. We discuss the complexities associated with the process model in the interactive information acquisition problem in greater detail in Chapter 6.1. To produce these myopic policies, we resort to two different approaches. In the first approach, we learn a function that maps observations to greedy control policies via max margin planning (Sec 6.3).
The max margin greedy control policies are learned with Max Margin Learning, which is detailed in Sec 3.3. In the second approach, we generate myopic policies via brute force optimization by making assumptions about our process model. These assumptions are discussed in Chapter 6.4.1. Since the general form of active information acquisition is a POMDP, we formulate the problem as a POMDP and use a general POMDP solver to compute a non-myopic control policy. The general POMDP formulation is discussed in detail in Sec 3.2.

3.1 Recursive Bayesian Estimation

The process of estimating the probability mass of a random variable over time using measurements, an observation model and a motion model is commonly called Recursive Bayesian Estimation. More formally, given motion and observation models

y_{t+1} = g(y_t, noise),    z_t = h(y_t, noise)    (3.1)

the evolution of the variable of interest can be modeled as a hidden Markov model Koller and Friedman (2009). The transition probabilities p_g(y_{t+1} | y_t) and emission probabilities p_h(z_t | y_t) can be derived from the motion and observation models respectively. Using the Markov assumption, the joint pdf of all states and measurements can be factorized as

p(y_{0:t+1}, z_{0:t+1}) = p_0(y_0) ∏_{j=0}^{t} p_h(z_{j+1} | y_{j+1}) ∏_{j=0}^{t} p_g(y_{j+1} | y_j)    (3.2)

where p_0 is a prior pdf. A Bayes filter can track the evolution of p(y_t | z_{0:t}) in two steps:

Predict: p(y_{t+1} | z_{0:t}) = ∫ p_g(y_{t+1} | y_t) p(y_t | z_{0:t}) dy_t    (3.3)

Update: p(y_{t+1} | z_{0:t+1}) = p_h(z_{t+1} | y_{t+1}) p(y_{t+1} | z_{0:t}) / ∫ p_h(z_{t+1} | y_{t+1}) p(y_{t+1} | z_{0:t}) dy_{t+1}    (3.4)

When the target motion model is linear in the target state with Gaussian noise, and the sensor observation model is linear in the target state with Gaussian noise, the target state can be estimated using a Kalman filter or one of its variants.
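For a discrete target state, the predict/update cycle of Eqs 3.3–3.4 reduces to matrix-vector operations. A minimal sketch, with an illustrative three-state transition matrix and likelihood vector (both hypothetical values chosen only for demonstration):

```python
import numpy as np

def bayes_filter_step(belief, T, likelihood):
    """One predict/update cycle of Eqs 3.3-3.4 for a discrete target state.

    belief:     p(y_t | z_{0:t}), shape (n,)
    T:          transition matrix, T[i, j] = p_g(y_{t+1} = j | y_t = i)
    likelihood: p_h(z_{t+1} | y_{t+1}) at the observed z, shape (n,)
    """
    predicted = belief @ T              # predict step (Eq 3.3)
    posterior = likelihood * predicted  # update numerator (Eq 3.4)
    return posterior / posterior.sum()  # normalize by the evidence

belief = np.array([0.5, 0.3, 0.2])
T = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
likelihood = np.array([0.1, 0.7, 0.2])  # observation favors the middle state
belief = bayes_filter_step(belief, T, likelihood)
```

After one cycle the belief concentrates on the state favored by the observation, while the transition matrix has diffused some mass to neighboring states.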
The information measure required for information acquisition in these scenarios can be mutual information, as it has a closed form solution for Gaussian target pdfs.

3.1.1 Particle Filtering

For target pdfs that are multimodal, particle filters can be used for inference. A particle filter represents p(y_t | z_{0:t}) by a set of weighted samples or particles {y_t^i, w_t^i}_{i=1}^{N}, where N is the number of samples. This is represented as

p(y_t | z_{0:t}) = Σ_{i=1}^{N} w_t^i δ(y − y_t^i)    (3.5)

where δ is the Dirac delta function and the weights sum to 1, i.e., Σ_{i=1}^{N} w_t^i = 1. The mean and covariance of this distribution are given by

μ_t = Σ_{i=1}^{N} w_t^i y_t^i  and  Σ_t = Σ_{i=1}^{N} w_t^i y_t^i (y_t^i)^T − μ_t μ_t^T    (3.6)

Hence the prediction step becomes

Predict: p(y_{t+1} | z_{0:t}) = Σ_{i=1}^{N} w_t^i p_g(y_{t+1} | y_t^i)    (3.7)

The pdf p(y_{t+1} | z_{0:t}) is also approximated with a weighted particle set. The distribution p(y_{t+1} | z_{0:t}) is computed from p(y_t | z_{0:t}) through Sequential Importance Sampling (SIS) Thrun et al. (2005). The procedure is as follows:

• Draw a sample k ∈ {1,...,N} according to the probabilities [w_t^1,...,w_t^N]
• Propagate this sample y_t^k according to the transition probability p_g(· | y_t^k) to get y_{t+1}^k
• Repeat the procedure N times to get the new particle set {y_{t+1}^i, w_{t+1}^i}_{i=1}^{N}, where w_{t+1}^i := p(y_{t+1}^i | z_{0:t})

The SIS procedure discussed above only propagates the most likely samples and gives us p(y_{t+1} | z_{0:t}). This distribution is now approximately equivalent to a weighted particle set:

p(y_{t+1} | z_{0:t}) ≈ Σ_{i=1}^{N} w_{t+1}^i δ(y − y_{t+1}^i)

Now we can update this prediction with the update step

Update: p(y_{t+1} | z_{0:t+1}) = Σ_{i=1}^{N} [ p_h(z_{t+1} | y_{t+1}^i) w_{t+1}^i / Σ_{k=1}^{N} p_h(z_{t+1} | y_{t+1}^k) w_{t+1}^k ] δ(y − y_{t+1}^i)    (3.8)

The version of the particle filter introduced above is a bootstrap filter, where the motion model p_g(· | y_t^i) is used as the proposal distribution.
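The bootstrap predict (Eq 3.7) and update (Eq 3.8) steps can be sketched for a scalar state. The random-walk motion model, the Gaussian observation likelihood and the observed value z are illustrative assumptions, not the models used in later chapters.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500

def gaussian(x, mu, std):
    # Unnormalized-to-normalized Gaussian density, vectorized over mu.
    return np.exp(-0.5 * ((x - mu) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Particle set {y_t^i, w_t^i} representing p(y_t | z_{0:t}) (Eq 3.5).
particles = rng.normal(0.0, 1.0, N)
weights = np.full(N, 1.0 / N)

# SIS predict: draw ancestors by weight, propagate through p_g (Eq 3.7).
ancestors = rng.choice(N, size=N, p=weights)
particles = particles[ancestors] + rng.normal(0.0, 0.1, N)

# Update: reweight by the observation likelihood p_h and normalize (Eq 3.8).
z = 0.5
weights = gaussian(z, particles, 0.2)
weights /= weights.sum()

mean = np.sum(weights * particles)  # posterior mean, as in Eq 3.6
```

With a broad prior and an informative observation at z = 0.5, the weighted posterior mean is pulled from the prior mean toward the observation.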
Applying SIS leads to sample degeneration, i.e. the weights of almost all particles go to zero. To avoid this, a resampling step is added in practice, where N particles are drawn from the set {y_t^i, w_t^i} with replacement. All the new particles have the weight 1/N. There are multiple criteria to trigger the resampling procedure. Common triggers include maintaining the effective number of particles above a particular threshold or measuring the KL divergence between p(y_{t+1} | z_{0:t+1}) and the uniform distribution. Of the different methods used for resampling, we utilize stratified resampling in this thesis. This sampling procedure is optimal as it has low sampling variance and computational complexity (Thrun et al., 2005).

3.2 POMDP

Partially observable Markov decision processes (POMDPs) are models for sequential decision making under uncertainty. They are extensions of the Markov decision process framework Puterman (1994); Howard (1960) to settings where an agent cannot directly observe the system state. A typical POMDP can be specified with a finite set of environment states {s_0, s_1,..., s_η} ∈ S; a finite set of actions that an agent can choose from {a_0, a_1,..., a_α} ∈ A; a set of observations the agent can receive {o_0, o_1,..., o_κ} ∈ O; a transition function t : S × A → p(s_{τ+1} | a_τ, s_τ) that gives a distribution over successor states given the current state and action pair at time τ; a reward function r : S × A → R that provides a numeric reward for each state-action pair; and an observation function z : S × A → p(o_τ | s_τ, a_τ) that gives the probability of receiving a particular observation given the current state-action pair. The value of a specific trajectory ρ in an infinite horizon discounted reward setting is given by a value function as follows Bellman (1957):

v(ρ) = Σ_{τ=1}^{∞} γ^τ r(s_τ, a_τ),    0 ≤ γ < 1

Here γ is the discount factor.
Using the tuple ⟨S, A, O, r, t, z⟩, the optimal sequence of actions can be found by an agent by maximizing the value function v(·) for all states s ∈ S. In the infinite horizon setting this can be formulated as follows:

v*(b(s)) = max_{a∈A} [ r(s, a) + γ Σ_{o∈O} p(o | s, a) v*(t(s, a)) ]

3.3 Max Margin Learning

In supervised classification we learn a function h(·) : X → Y from a set of η i.i.d. samples S = {x_i, y_i = t(x_i)}_{i=1}^{η}. Here t(x_i) is a label function that gives the true label for x_i. These samples S = {x_i, y_i}_{i=1}^{η} are drawn from a distribution D_{X×Y}. The classification function h(·) is generally selected from some parametric family H, where the linear family is a typical choice for such a selection. Given real-valued basis functions f_i : X × Y → R, a classification h_w is defined by a set of coefficients w_i such that:

h_w(x) = argmax_y Σ_i w_i f_i(x, y) = argmax_y w^T f(x, y)

where f(x, y) are features or basis functions and the sum ranges over the coefficients defining the classification function h_w. In single label classification problems, Y = {y_1,...,y_η}. In standard max margin classification we select a w that maximizes the margin between decision boundaries of the various classes. Support vector machines Vapnik (1995) provide an effective method of learning such a max-margin decision boundary for single label binary classification problems. Crammer and Singer (2002) extend this framework to single-label multiclass classification. In structured problems, multiple labels are predicted, hence the loss is the per label loss, i.e. the proportion of incorrect labels predicted. This extension of the max margin framework to the multi-label setting is formulated as follows:

min_w (1/2) w^T w
s.t. w^T ( f(x, t(x)) − f(x, y) ) ≥ Δt_x(y)    ∀x ∈ S, ∀y

where Δt_x(y) = Σ_{i=1}^{λ} Δt_x(y_i) and Δt_x(y_i) ≡ I(y_i ≠ (t(x))_i). λ is the number of labels. This was first shown by Taskar et al. (2004).
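The linear scoring rule h_w(x) = argmax_y w^T f(x, y) can be sketched directly. The block-structured joint feature map f(x, y) below, which copies the input features into the block belonging to class y, is one common illustrative choice, not the feature design used later in this thesis.

```python
import numpy as np

def joint_features(x, y, num_classes):
    # Joint feature map f(x, y): x is placed in the block owned by class y,
    # so w^T f(x, y) scores x against class y's slice of w.
    f = np.zeros(num_classes * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def predict(w, x, num_classes):
    # h_w(x) = argmax_y w^T f(x, y)
    scores = [w @ joint_features(x, y, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))

# Hypothetical learned coefficients: class 0 keys on the first feature,
# class 1 on the second.
w = np.array([1.0, 0.0,   # class-0 block
              0.0, 1.0])  # class-1 block
```

Max margin learning fits w so that, for every training pair, the score of the true label exceeds the score of every other label by the label loss Δt_x(y); prediction itself is just the argmax above.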
Chapter 4
Static 3D Object Recognition

4.1 Introduction

Recent advances in the technology for acquisition of 3D geometry have led to significant progress in object recognition research, one of the most fundamental problems in computer vision. This progress has been primarily driven by low cost RGB-D sensors, which now allow algorithms to incorporate multimodal cues for robust object recognition even in the presence of noise and clutter. Object recognition, unlike object classification/shape retrieval, not only measures similarity between an input shape and shapes stored in a library, but also estimates the transform that maps the input to the recognized model. This helps solve the problem of both detection and pose estimation.

4.2 Related Work

The literature in 3D recognition can be broadly classified into two major domains: feature-based and template-based. Early feature based approaches detected features (Bay et al., 2008; Lowe, 2004) on images and back-projected them to 3D (Lowe, 2001; Pauwels et al., 2013). These were eventually replaced by methods that directly computed features on the 3D point cloud (Mian et al., 2010) with 3D descriptors (Tombari et al., 2010; Rusu et al., 2009). The introduction of 3D feature matching methods has also contributed to the development of robust filtering schemes for 3D correspondences and methods that explain the complete scene geometry to account for occlusions (Hao et al., 2013; Aldoma et al., 2013; Buch et al., 2014). Though these methods have been made scalable (sub-linear) with approximate nearest neighbor schemes (Muja and Lowe, 2014), they are still limited when matching surfaces with poor shape information. In contrast, template-based schemes are often robust to clutter. Hinterstoisser et al. (2013) match templates from rendered views of 3D models embedded with quantized image contours and normal information. This approach was improved with a custom hashing method for this representation by Kehl et al. (2015).
Building discriminative models on these representations to accomplish robust detection and pose estimation via classification has been demonstrated by Malisiewicz et al. (2011); Gu and Ren (2010). Discriminative learning has also been applied to feature based methods, but they require an additional pose estimation step. Such an approach was demonstrated by Li et al. (2016). Though template based methods are robust to clutter, they only scale linearly with the number of models, unlike feature based methods that are sublinear. In this thesis we take a hybrid approach between feature and template based methods and build a Viewpoint Pose Tree. We first introduced the Viewpoint Pose Tree in (Sankaran et al., 2013). The Viewpoint Pose Tree exploits the sublinear scalability of feature based methods by computing features on view sphere templates. The robustness to clutter is achieved by the inherent template based matching of the view sphere templates. The approach is discussed in detail in Section 4.3. We then add extensions to this approach for online pose refinement with a robust ICP stage and a geometric correspondence grouping step, discussed in Section 4.4. Once we receive model level hypotheses, we refine these hypotheses for scene consistency using hypothesis verification, discussed in Section 4.5. An overview of our recognition pipeline is shown in Figure 4.1.

Figure 4.1: The figure shows the static object detection and pose estimation pipeline. The offline phase consists of extracting point cloud templates from different views of CAD models, extracting descriptors from these templates and building the Viewpoint Pose Tree. The online phase involves matching sampled key points in the scene, computing descriptors on these key points and matching the computed descriptors to the Viewpoint Pose Tree. These matches are refined by an intermediate robust ICP step. Then these class and pose hypotheses are refined by verifying geometric consistency to reject false positives.
Finally, these hypotheses are verified in a final hypothesis verification optimization procedure.

4.3 Viewpoint Pose Tree

In this section we introduce our 3D object detector, the viewpoint-pose tree (VP-Tree), which provides coarse pose estimates in addition to recognizing an object's class. The VP-Tree is built on the principles of the vocabulary tree, introduced by Nistér and Stewénius (2006). A vocabulary tree is primarily used for large scale image retrieval, where the number of semantic classes is on the order of a few thousand. The VP-Tree extends the utility of the vocabulary tree to joint recognition and pose estimation in 3D by using point cloud templates extracted from views on a sphere around the models in the database D. The templates serve to discretize the orientation of an object and make it implicit in the detection. Given a query point cloud, the best matching template carries information about both the class and the pose of the object relative to the sensor.

A simulated depth sensor is used to extract templates from a model by observing it from a discrete set {v_1(ρ),...,v_Γ(ρ)} ⊂ V_ρ of viewpoints (Fig. 4.2). The subscript Γ stands for the number of viewpoints. (We used Γ = 48 viewpoints in our experiments.) The obtained point clouds are collected in a training set T := {P_{g,l} | g = 1,...,Γ, l = 1,...,|D|}, where P is a point cloud template, the subscript g corresponds to the viewpoints and l corresponds to the objects in the database D. Features which describe the local surface curvature are extracted for each template as described below and are used to train the VP-Tree. Given a query point cloud at test time, features are extracted and the VP-Tree is used to find the template from T whose features match those of the query the closest.

4.3.1 Feature extraction

It is necessary to identify a set of key points K_P for each template P ∈ T at which to compute local surface features.
Most 3D features are some variation of surface normal estimation and are very sensitive to noise. Using a unique key point estimator would be prone to errors. Instead, the key points are obtained by sampling the point cloud uniformly (Fig. 4.2), which accounts for global appearance and reduces noise sensitivity. Neighboring points within a fixed radius of every key point are used to compute 3D descriptors. In this thesis we have explored the use of Fast Point Feature Histograms Rusu (2009) and the more state-of-the-art SHOT features Tombari et al. (2010). The features are filtered using a pass-through filter and are assembled in the set F_kp associated with kp ∈ K_P.

Figure 4.2: The sensor position is restricted to a set of points on a sphere centered at the location of the object. Its orientation is fixed so that it points at the centroid of the object. A point cloud is obtained at each viewpoint, key points are selected, and local features are extracted (top right). The features are used to construct a VP-Tree (bottom right).

4.3.2 Training the VP-Tree

The features ∪_{P∈T} ∪_{kp∈K_P} F_kp obtained from the training set are quantized hierarchically into visual words, which are defined by k-means clustering (see Nistér and Stewénius (2006) for more details). Instead of performing unsupervised clustering, the initial cluster centers are associated with one feature from each of the models in D. The training set T is partitioned into |D| groups, each consisting of the features closest to a particular cluster center. The same process is applied to each group of features, recursively defining quantization cells by splitting each cell into |D| new parts. The tree is determined level by level, up to a pre-specified maximum number of levels. Given a query point cloud Q at test time, we determine its similarity to a template P by comparing the paths of their features down the VP-Tree.
The relevance of a feature at node i is determined by a weight ω_i := ln(|T| / η_i), where η_i is the number of templates from T with at least one feature path through node i. The weights are used to define a query descriptor q and a template descriptor d_P, with i-th components q_i := n_i ω_i and d_i := m_i ω_i respectively, where n_i and m_i are the number of features of the query and the template with a path through node i. The templates from T are ranked according to a relevance score:

s(q, d_P) := || d_P / ||d_P||_1 − q / ||q||_1 ||_1

The template with the lowest relevance score is the best matching one to Q.

4.3.3 Online Classification

In the online classification stage, we uniformly sample the scene point cloud for key points. Then feature descriptors are computed for the support surface at each one of the scene key points. The scene descriptors are then matched to the Viewpoint Pose Tree descriptors to recover the correct template. The recovered template gives us the corresponding object model. The relative pose between the scene and the object model is recovered from the matched template, as we know the camera pose corresponding to the template and the current sensor pose. The entire online matching and classification phase is illustrated in Fig 4.3.

Figure 4.3: The figure shows the online classification phase of the VP-Tree. First key points are sampled on the scene, then descriptors are computed at each one of these key points. These scene descriptors are then matched to the VP-Tree descriptors to recover the corresponding template. The recovered template gives us the corresponding model and pose.

4.3.4 Performance of the VP-Tree

The performance of the static detector was evaluated by using the templates from T as queries to construct a confusion matrix (Fig. 4.4). If the retrieved template matched the class of the query, it was considered correct regardless of the viewpoint. To analyze the noise sensitivity of the VP-Tree, we gradually increased the noise added to the test set. Gaussian noise with standard deviation varying
Figure 4.4: Confusion matrix for all 21 classes in the VP-Tree (apples, axe, bathroomkit, bigbox, bottles, broom, brush, cups, flowerspray, gastank, glasses, handlebottle, heavyranch, pan, pipe, shovel, spadefork, spraybottle, vases, watercan, wreckbar). A class is formed from all views associated with an object.

Gaussian noise with standard deviation varying from 0.05 to 5 cm on a log scale was added along the direction of the ray cast from the observer's viewpoint. The resulting class retrieval accuracy is shown in Fig. 4.5. As expected, the performance starts to degrade as the amount of noise is increased, but the detector behaves well at typical depth camera noise levels.

Figure 4.5: Effect of signal noise on the classification accuracy of the VP-Tree.

4.4 Correspondence Grouping

Figure 4.6: This figure gives the intuition behind geometric correspondence grouping. The scene correspondences of key points from a particular object model are checked for geometric consistency based on their geometric relationship in the object model. If this constraint is violated by a matched key point, then the match is pruned. With this approach we refine matches between the scene and the object model.

Once we recover the initial matches and pose estimates from the Viewpoint Pose Tree, we reject false positives by verifying geometric consistency. This is illustrated in Figure 4.6. Consider scene points p_{i,s} and p_{j,s} matched to model points p_{i,m} and p_{j,m}. Let the distance between the model points be d_{m_i,m_j}, and the distance between the scene points be d_{s_i,s_j}.
We ensure that the scene and model points satisfy the same geometric relationship by computing the similarity between them. This is accomplished by ensuring that the difference between d_{m_i,m_j} and d_{s_i,s_j} is below a particular threshold, as shown below:

d_{C_i,C_j} = | d_{m_i,m_j} - d_{s_i,s_j} | < \epsilon

where \epsilon is a correspondence threshold and C_i = (m_i, s_i) is a correspondence between scene and model points. In practice the geometric consistency is verified between the centroids of local surface patches computed at the key points, similar to Chen and Bhanu (2007).

4.5 Global Hypothesis Verification

Finally, with the set of all the object and pose hypotheses in the scene, we find the final set of optimal pose estimates. This is done by solving a greedy optimization problem that considers all the object and pose hypotheses at once and only retains the set of hypotheses that best explain the scene. This is accomplished by finding the set of poses that are accepted by an acceptance function. Let a particular hypothesis be represented by a tuple (M, t), where M is the object model and t is the transform from model to scene. The acceptance function measures the quality of a hypothesis (M, t) and consists of a support term and a penalty term. The support term is proportional to the number of transformed model points that fall within a certain threshold of scene points, i.e., points of t(M) that fall within an \epsilon-band of the scene. The penalty term is proportional to the size of the transformed model parts t(M) which occlude the scene. This ensures that a transformed object cannot occlude scene points reconstructed from the same viewpoint. Finally, from the set of accepted hypotheses, conflicting hypotheses are filtered by performing non-maximal suppression on a conflict graph. The conflict graph is constructed from the set of accepted hypotheses, where the nodes are the accepted hypotheses and the edges connect hypotheses that are conflicting.
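The pairwise consistency test above can be sketched as a greedy grouping pass: a match is kept only if its model-side and scene-side distances to all previously kept matches agree within \epsilon. This is a simplified, greedy variant (seeded by the first match) for illustration; the function name `group_correspondences` is assumed, and a full implementation would try multiple seeds as in PCL-style geometric consistency clustering.

```python
import numpy as np

def group_correspondences(scene_pts, model_pts, matches, eps):
    """Prune matches that violate pairwise distance consistency.

    matches: list of correspondences C_i = (m_i, s_i), i.e.
             (model point index, scene point index) pairs
    eps:     correspondence threshold (same units as the point clouds)
    """
    kept = []
    for (mi, si) in matches:
        ok = all(
            abs(np.linalg.norm(model_pts[mi] - model_pts[mj])
                - np.linalg.norm(scene_pts[si] - scene_pts[sj])) < eps
            for (mj, sj) in kept)
        if ok:
            kept.append((mi, si))   # consistent with the group so far
    return kept
```

For a rigid transform between model and scene, all correct correspondences preserve pairwise distances exactly, so outlier matches stand out by violating the check against the kept group.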
Two hypotheses are considered conflicting if the intersection of the point sets explained by them is non-empty. We adopt the RANSAC algorithm introduced by Papazov and Burschka (2011) to solve this optimization problem. The final output of the 3D recognition pipeline is shown in Figure 4.7.

Figure 4.7: This figure shows the output of the 3D recognition pipeline shown in Fig 4.2. The green points belong to a recognized target object. The red points belong to other objects that the Viewpoint Pose Tree was trained on. The white points represent data that was either not matched or explained during hypothesis verification.

4.6 Conclusions

In this chapter we introduced a novel 3D object detector for static object recognition and pose estimation that was developed as part of this thesis. Our 3D object detector, the viewpoint-pose tree (VP-Tree), uses point cloud data from a depth sensor. The VP-Tree provides a pose estimate in addition to detecting an object's class. This is achieved via partial view matching and helps in cases when the object is partially occluded or in contact with another object. Apart from classification with the VP-Tree, our 3D recognition pipeline also refines the pose estimates with geometric correspondence grouping and global hypothesis verification. Though the 3D recognition pipeline introduced in this chapter produces state-of-the-art results for single view recognition, it is still unable to cope with a large degree of occlusion and noise due to the inherent shortcomings of static recognition methods.

Chapter 5

Active 3D Object Recognition

5.1 Introduction

Object classification and pose estimation, one of the central problems in computer vision, is primarily tackled as a problem of static decision making where the sensing device is assumed to be stationary [Lowe (2004), Belongie et al. (2002), Fan et al. (1989)]. This has also historically been the case, as highlighted by Andreopoulos and Tsotsos (2013).
Such an approach to visual inference requires the models either to be learned from datasets containing hundreds of thousands of annotated static images, as in the case of deep learning techniques Szegedy et al. (2015), or to perform complex single view inference, as in the case of 3D recognition Aldoma et al. (2016). In both these settings, exploration and interaction are not considered to be part of the visual process. Recent progress in this domain has shown approaches being benchmarked on datasets containing hundreds of thousands of images, such as Pascal VOC (Everingham et al., 2010) or ImageNet (Russakovsky et al., 2014). These datasets contain images where both the environment and the camera are naturally static. However, occlusions, clutter, variations in lighting, and imperfect object models in complex environments decrease the accuracy of these single-view detectors. Active perception approaches circumvent these issues by utilizing appropriate sensing settings to gain more information about the scene. The research in sensor management Hero III and Cochran (2011) presents a structured approach to controlling the degrees of freedom in sensor systems in order to improve the information acquisition process. However, most of the work either assumes a simplified model for the detection process Spaan et al. (2010); Jenkins (2010) or avoids the problem altogether and concentrates on estimating a target's state after its detection Kreucher et al. (2005); Sommerlade and Reid (2008).

In this chapter we present a novel approach to active perception for object detection and pose estimation by casting the problem as an active information acquisition problem as introduced in Chap 2.1. This research is aimed at bridging the gap between the research in sensor management and the recent advances in 3D object recognition that have been enabled by the advent of low-cost RGB-D cameras and open-source point cloud libraries Rusu and Cousins (2011).
We have published the techniques discussed in this chapter in Sankaran et al. (2013); Atanasov et al. (2014). In our approach to active perception, we exploit the sensor's ability to move to increase the confidence in its decision rather than placing the burden of providing perfect results on a static detector. We consider the following problem. A mobile sensor has access to the models of several objects of interest. Its task is to determine which, if any, of the objects are present in a cluttered scene and estimate their poses. The sensor has to balance the detection accuracy with the amount of time spent observing the objects. The problem can be split into a static detection stage followed by a planning stage, which selects a sequence of views minimizing the error made by the observer.

The rest of the chapter is organized as follows. The next section presents related approaches to active perception and summarizes our contribution. In Sec. 5.3.1, we set up the general problem for object detection and pose estimation as a discrete space active information acquisition problem. Then in Sec 5.3 we set up hypotheses about the class and orientation of an unknown object and formulate the active recognition problem precisely. An observation model which assigns a confidence measure to the static detections and in turn to the class-pose hypotheses is discussed in Sec. 5.4. Sec. 5.5 describes how to choose a sequence of views for the sensor, which tests the hypotheses and balances sensing time with decision accuracy. Implementation details are in Sec. 5.6. Finally, Sec. 5.7 presents results from simulation and real-world experiments and compares the performance of our active approach to those of static detection and the widely-used greedy view selection.

5.2 Related Work

Some of the earliest work in active perception is by Bajcsy (1988); Krotkov and Bajcsy (1993).
It focused on improving the estimates of objects' 3D positions by controlling the intrinsic parameters of a stereo camera. Similarly, Pito (1999) addresses the problem of selecting the resolution of a camera to improve surface reconstruction by maximizing information gain. Since then many active vision works have utilized information theoretic criteria for viewpoint and sensor parameter selection. Approaches in sensor management Hero III and Cochran (2011); Huber (2009) can be classified according to different categories. According to sensor type they can be classified into mobile (sensors have dynamic states) and stationary (sensors have fixed states). The objectives may be to identify a target and estimate its state or simply to improve the state estimate of a detected target. The targets might be mobile or stationary as well. The process of choosing sensor configurations may be myopic (quantifies the utility of the next configuration only) or nonmyopic (optimizes over a sequence of future configurations).

A large body of work assumes fixed sensor positions and focuses on choosing effective sensor parameters. For example, Sommerlade and Reid (2008) control a pan-zoom-tilt camera with a fixed position to track mobile targets based on myopic minimization of conditional entropy. This simplifies the problem considerably because the trade-off between minimizing the sensor movement energy and maximizing viewpoint informativeness is avoided. Methods which deal with active selection of views for realistic sensor models typically resort to myopic planning Browatzki et al. (2012); Borotschnig et al. (2000); Potthast and Sukhatme (2014). Similarly, Denzler and Brown (2002) select the focal length, pan and tilt angles, and the viewpoint of a camera on a hemisphere around an object of interest via greedy maximization of entropy. Borotschnig et al.
(2000) represent object appearance via parametric eigenspaces and use probability distributions in the eigenspace to greedily select discriminative views. Greedy approaches for active object recognition have also been used in semantic mapping and localization Ekvall et al. (2006).

Golovin and Krause (2011) showed that myopic planning for an adaptively submodular objective function is worse than the optimal strategy by at most a constant factor. Mutual information is submodular but not adaptively submodular, and as a result the performance guarantees for greedy policies hold in the non-adaptive setting only, i.e., if the sequence of sensor views were selected offline and applied regardless of the observations received online. Such open-loop planning performs well only with linear Gaussian sensor models Krause et al. (2008). Instead of mutual information, we use a criterion which quantifies the trade-off between the sensor movement energy expenditure and the expected cost of an incorrect object classification (probability of error, Sec 2.1.1). The cost function is called value of information (Krause, 2008, Ch.15) or expected Bayesian risk Karasev et al. (2012). Unfortunately, Krause (Krause, 2008, Prop.15.1) shows that it is not adaptively submodular. Even with a fixed sensor state, a myopic strategy can perform arbitrarily worse than the optimal one in an adaptive setting Naghshavar and Javidi (2012). In light of these observations, the goal of this work is to devise a nonmyopic view planning strategy and compare its performance to the myopic information theoretic approaches. We represent the active object classification and pose estimation problem as a partially observable Markov decision process (POMDP) and use a point-based approximate algorithm Kurniawati et al. (2008); Ong et al. (2009) to solve it.

The work that is closest to ours is Eidenberger and Scharinger (2010), which uses a mobile sensor to classify stationary objects and estimate their poses.
Static detection is performed using SIFT matching and the object pose distributions are represented with Gaussian mixtures. Similar to our approach, the problem is encoded by a POMDP, but instead of an approximate nonmyopic policy, the authors resort to a myopic approach to reduce the differential entropy in the pose and class distributions. Karasev et al. (2012) plan the path of a mobile sensor for visual search of an object in an otherwise known and static scene. The authors hypothesize about the object's pose and apply greedy maximization of the conditional entropy of the next measurement. Paletta and Pinz (2000) describe a reinforcement learning approach for obtaining a sequence of views which maximally discriminate objects of various classes at different orientations. An approximate policy which maps a sequence of received measurements to a discriminative viewpoint is obtained offline.

Velez et al. (2012) consider detecting doorways along the path of a mobile sensor traveling towards a fixed goal. The unknown state of a candidate detection is binary: "present" or "not present". Deformable part models Felzenszwalb et al. (2008) are used for static detection; stereo disparity and plane fitting for pose estimation. The pose estimates are not optimized during the planning stage. An entropy field is computed empirically offline for all viewpoints in the workspace and is used to nonmyopically select locations with high information gain. The planning is open-loop because the object state distributions change online but only the precomputed entropy field is used. We use a depth sensor, which validates the assumption that position estimates need not be optimized. However, the orientation estimates can be improved through active planning.
Inspired by the work on hypothesis testing Naghshavar and Javidi (2012); Sankaran (2012), we introduce a rough discretization of the space of orientations so that the hidden object state takes on several values, one for "object not present" and the rest for "object present" with a specific orientation. Using several hypotheses necessitates closed-loop planning because the ranking of the viewpoints in terms of informativeness changes depending on which hypothesis is more likely. In the POMDP formulation we assume that the sensor observations are independent across different viewpoints, which is frequently violated in practice.

Potthast et al. (2016) present a low cost online approach to active multi-view object detection that performs online feature selection to improve decision making. Similar to our work, they trade off between the recognition accuracy at the current viewpoint and future viewpoints based on possible observations, but their decision making framework is restricted to a single step look-ahead. Their myopic approach also makes assumptions about the expected change in information: they only reason about the most probable observation in the next time step. In contrast, our approach reasons about all possible observations over a non-myopic horizon. Also, their experiments were only performed with single isolated objects. An extension to this work is presented in Potthast and Sukhatme (2016), where non-myopic planning is achieved through online trajectory optimization. In this approach the recognition uncertainty is approximated by the distance between the sensor and the target object and does not incorporate the true recognition accuracy provided by the object detection framework. Once again, the proposed approach only deals with single isolated objects.

In more recent work deep CNNs have been used for active object recognition Malmir et al. (2017); Malmir and Cottrell (2017). In Malmir et al.
(2017) a deep network is trained to jointly predict object labels and action values using a variant of Q-learning. The labels and actions are predicted by the network when it is presented with an input RGB image. The network embeds a generative Dirichlet model of object similarities to encode the state of the system. Using deep networks to jointly learn actions and labels for object recognition requires a lot of training data per object class. Moreover, the learned policies tend to over-fit to the training data and hence do not generalize well to the test data. To deal with this issue, in Malmir and Cottrell (2017) the authors adapt Belief Tree Search (BTS) to train an LSTM for active object recognition. In this approach a deep CNN is trained for object pose estimation. The trained deep CNN produces likelihoods for input images. These likelihoods are used in a BTS algorithm to find near optimal policies for active object recognition on the training set. The policies produced by the BTS are rolled out and these roll-outs are used to train an LSTM which directly predicts actions given object pose likelihoods (beliefs). The trained LSTM is used online to predict the best action given an object belief. Though deep learning based approaches find non-myopic policies for active object recognition and generalize well to novel objects, they are extremely data intensive and require a lot of training data to learn near optimal policies.

Another approach based on hypothesis testing is Laporte and Arbel (2006). To disambiguate competing hypotheses the authors myopically select views to maximize the dissimilarity between the distributions of the expected measurements. In recent years, the active vision problem has received significant attention in the robotics community as well. Hanheide et al. (2011) present an approach for object search and place categorization in large indoor environments.
A novel probabilistic model is used to encode structural relations among objects and places (e.g., cereal boxes are often located in kitchens). An object search task is then represented by pairing the probabilistic conceptual map with the visual appearance of the object of interest. A sequence of views is planned using a POMDP abstraction with conditional entropy as the reward function Göbelbecker et al. (2011). Other recent work which does active visual search in a similar spirit is Aydemir et al. (2011). Probabilistic spatial relations and static properties of rooms are used to pose the problem as a fully-observable Markov decision process (MDP). A greedy next-best-view approach is used to determine if an object is present or not at a specific location, while the MDP is used to synthesize a sequence of good locations to search. Sridharan et al. (2010) plan visual sensing actions for scene understanding and disambiguation. Similar to our approach, a POMDP captures the tradeoff between plan reliability and execution time and enables a robot to simultaneously decide which region in the scene to focus on and what processing to perform.

Contributions: We propose a nonmyopic view planning approach to improve the static detection and pose estimation by moving the sensor to more informative viewpoints. Our optimization criterion captures the trade-off between gaining more certainty about the correct object class and pose and the cost of moving the sensor. This encodes the requirements of an object recognition task more precisely than the mutual information criterion.

5.3 Problem Formulation

In the static object recognition system introduced in Chap 4, the recognized classes and poses of objects are associated with discrete camera viewpoints of the sensor. Due to this nature of the static recognition system, we formulate the active information acquisition problem for discrete spaces.
An algorithm that computes the optimal control policy for a discrete space problem can be obtained by considering the discrete-space active information acquisition problem as an instance of a partially observable Markov decision process (POMDP). We utilize an offline point-based approximate POMDP solver (Kurniawati et al. (2008)) instead of an exact algorithm to obtain a non-greedy camera control policy in Sec 5.3. We do not use an exact algorithm since exact algorithms do not scale well with the cardinality of the state space.

5.3.1 Discrete Space Active Information Acquisition

Preliminaries: Let the sensor state space X, the target state space Y, and the measurement space Z be finite sets. With the sensor motion model introduced in Eq 2.1 of Chap 2, the evolution of the sensor state can be captured as follows:

x_{t+1} = f(x_t, u_t),  u_t \in U

where U is a finite space of admissible controls. The target transition probabilities and the sensor emission probabilities can be described by probability mass functions over the sets Y and Z:

y_{t+1} \sim p_g(\cdot \mid y_t),  z_t \sim p_h(\cdot \mid x_t, y_t)

A prior pmf p_t \in P(Y) := \{ b \in [0,1]^{|Y|} : \sum_{y \in Y} b(y) = 1 \} describes the target estimate at time t. Given that the state and measurement spaces are discrete, we can formulate the recursive estimation procedure for the target state estimate using Recursive Bayesian Estimation as discussed in Chap 3.1. The prediction and update steps are as follows:

Predict: p_{t+1|t}(y) = \sum_{s \in Y} p_g(y \mid s) p_t(s)   (5.1)

Update: p_{t+1}(y) = \frac{ p_h(z_{t+1} \mid x_{t+1}, y) \, p_{t+1|t}(y) }{ \sum_{s \in Y} p_h(z_{t+1} \mid x_{t+1}, s) \, p_{t+1|t}(s) }   (5.2)

Since the target state is discrete, we compute the probability of error (Eq 2.7, Chap 2.1.1), which is the probability that the maximum likelihood estimate \hat{y} := \arg\max_{y \in Y} p_t(y) is not the true state y_t:

p_e(t) := E_{z_{1:t}} \left[ 1 - \max_{s \in Y} p_t(s) \right]   (5.3)

Now we define a cost associated with sensor motion as a movement cost, m(x_t, x_{t+1}).
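The discrete filter of Eqs. 5.1-5.2 and the error term of Eq. 5.3 are straightforward to implement over pmf vectors. A minimal sketch (the matrix conventions `Pg[y, s] = p_g(y|s)` and `Ph[x][z, y] = p_h(z|x, y)` and the function names are assumptions for illustration; `error_probability` evaluates the bracketed term of Eq. 5.3 for one realized observation history rather than the full expectation):

```python
import numpy as np

def predict(p, Pg):
    """Eq. 5.1: p_{t+1|t}(y) = sum_s p_g(y|s) p_t(s)."""
    return Pg @ p

def update(p_pred, Ph, x, z):
    """Eq. 5.2: Bayes update with likelihood row Ph[x][z]."""
    post = Ph[x][z] * p_pred
    return post / post.sum()

def error_probability(p):
    """1 - max_s p_t(s): error of the maximum likelihood estimate."""
    return 1.0 - p.max()
```

For a static target, Pg is the identity and repeated agreeing observations sharpen the belief, driving the error probability down.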
With these definitions we can formulate the discrete space active information acquisition problem as follows.

Problem: Given an initial sensor state x_0 \in X and a prior p_0 \in P(Y) on the true target state y_0, choose a stopping time \tau and a sequence of functions \mu_t(H_t) \in U for t = 0, \ldots, \tau that minimize the total cost:

\min_{\tau, \mu_{0:\tau}} \sum_{t=0}^{\tau-1} m(x_t, x_{t+1}) + \lambda p_e(\tau)   (5.4)

s.t.  x_{t+1} = f(x_t, \mu_t(H_t)),  t = 0, \ldots, \tau-1
      x_{t+1} \notin \{x_0, \ldots, x_t\},  t = 0, \ldots, \tau-1
      z_t \sim \sum_{y \in Y} p_h(\cdot \mid x_t, y) p_t(y),  t = 0, \ldots, \tau
      p_{t+1} = T(p_t, z_{t+1}, x_{t+1}),  t = 0, \ldots, \tau-1

where \lambda is the trade-off between the cost of movement and the probability of an error (as defined in Chap 2.1.1), and T(p_t, z_{t+1}, x_{t+1}) is the posterior pmf obtained from Eq 5.1 and 5.2. H_0 := (z_0, x_0) and H_t := (z_{0:t}, x_{0:t}, u_{0:t-1}) for t > 0 are the state, measurement, and control histories. The second constraint in the optimization problem ensures independence of measurements as required for Recursive Bayesian Estimation. In the subsequent sections we show how this formulation for active information acquisition in discrete spaces is adapted to the problem of object detection and pose estimation.

5.3.2 Sensing

Consider a mobile depth sensor, which observes a static scene containing unknown objects. The sensor has access to a finite database D of object models (Fig. 5.1) and a subset I of them are designated as objects of interest. We assume that an object class has a single model associated with it and use the words model and class interchangeably. This is necessary because our static detector, the VP-Tree, works with instances. However, the view planning approach is independent of the static detector and can be used with other class-based detectors.
Figure 5.1: Database of object models constructed using kinect fusion (left) Rusu and Cousins (2011) and an example of a scene we used to evaluate our framework in simulation (right). The training object database contains: axe, bigbox, broom, brush, flowerspray, gastank, handlebottle, heavyranch, pan, pipe, shovel, spadefork, spraybottle, watercan, wreckbar, apples, bathroomkit, bottles, cups, glasses, and vases.

The task of the sensor is to detect all objects from I which are present in the scene and to estimate their poses. Note that the detection is against not only known objects from the database but also clutter and background. At each time step the sensor obtains a point cloud from the scene, splits it into separate surfaces (segmentation) and associates them with either new or previously seen objects (data association). These procedures are not the focus of this research but we detail them in Sec. 5.6.1. We assume that they estimate the object positions accurately.

We formulate hypotheses about the class and orientation of an unknown object by choosing a small finite set of discrete orientations R_\alpha \subset SO(3)¹ for each object class \alpha \in I. To denote the possibility that an object is not of interest we introduce a dummy class \alpha_\emptyset and a dummy orientation R_\emptyset = \{\theta_\emptyset\}. The sensor needs to decide among the following hypotheses:

H(\alpha_\emptyset, \theta_\emptyset): the object does not belong to I,
H(\alpha, \theta): the object class is \alpha \in I with orientation \theta \in R_\alpha.

Decision Cost

In accordance with the formalism we introduced in Chapter 2, to measure the accuracy of the sensor's decisions we introduce a cost for choosing H(\hat\alpha, \hat\theta) when H(\alpha, \theta) is correct:

j_D(\hat\alpha, \hat\theta, \alpha, \theta) := \begin{cases} k(\hat\theta, \theta), & \hat\alpha = \alpha \\ \kappa^+, & \hat\alpha \in I, \alpha \notin I \\ \kappa^-, & \hat\alpha \neq \alpha, \alpha \in I, \end{cases}

where \kappa^+ and \kappa^- are costs for making false positive and false negative mistakes respectively, and k(\cdot, \cdot) is a function that computes a cost for an incorrect orientation estimate when the class is correct.

¹ SO(3) denotes the 3D rotation Lie group.

Example.
Suppose that the task is to look for chairs (\alpha_1) and tables (\alpha_2) regardless of orientation (k(\hat\theta, \theta) := 0). The decision cost can be represented with the matrix (rows: decision \hat\alpha, columns: true class \alpha):

j_D(\hat\alpha, \hat\theta, \alpha, \theta):

  \hat\alpha \ \alpha  |  \alpha_\emptyset   \alpha_1   \alpha_2
  \alpha_\emptyset     |       0            \kappa^-   \kappa^-
  \alpha_1             |    \kappa^+           0       \kappa^-
  \alpha_2             |    \kappa^+        \kappa^-      0

In static detection, it is customary to run a chair classifier first to distinguish between \alpha_\emptyset and \alpha_1 and then a table classifier to distinguish between \alpha_\emptyset and \alpha_2. Our framework requires moving the sensor around the object to distinguish among the hypotheses and it is necessary to process them concurrently.

5.3.3 Mobility

We are interested in choosing a sequence of views for the sensor which has an optimal trade-off between the energy used to move and the expected cost of incorrect decisions. Doing this with respect to all objects in the scene simultaneously results in a complex joint optimization problem. Instead, we treat the objects independently and process them sequentially, which simplifies the task to choosing a sequence of sensor poses to observe a single object. Further, we restrict the motion of the sensor to a sphere of radius \rho, centered at the location of the object. The sensor's orientation is fixed so that it points at the centroid of the object. We denote this space of sensor poses by V_\rho and refer to it as a viewsphere. A sensor pose x \in V_\rho is called a viewpoint. See Fig. 4.2 for an example. At a high-level planning stage we assume that we can work with a fully actuated model of the sensor dynamics. The viewsphere is discretized into a set of viewpoints X_\rho, described by the nodes of a graph. The edges connect nodes which are reachable within a single time step from the current location based on the kinematic restrictions of the sensor.
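On such a viewpoint graph, the all-pairs movement costs used for planning can be precomputed with the Floyd-Warshall algorithm, which the thesis uses for the control cost. A minimal sketch over an adjacency matrix of one-step motion costs (the names `all_pairs_costs` and `plan_cost` and the use of `np.inf` for unreachable pairs are illustration choices):

```python
import numpy as np

def all_pairs_costs(dM, delta0):
    """Floyd-Warshall over one-step motion costs dM (np.inf = unreachable).

    Returns d(x, x') = shortest motion cost plus the fixed measurement
    cost delta0 charged for taking another observation.
    """
    d = dM.astype(float).copy()
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    for k in range(n):
        # relax every pair through intermediate viewpoint k
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d + delta0

def plan_cost(d, path):
    """j_M(tau) = sum_{t=2}^{tau} d(x_{t-1}, x_t) for viewpoint indices."""
    return sum(d[a, b] for a, b in zip(path[:-1], path[1:]))
```

Note that d(x, x) = delta0 > 0, so even re-observing from the current viewpoint is charged, which prevents the sensor from taking infinitely many free measurements.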
Control Cost

Since the motion graph is known a priori, the Floyd-Warshall algorithm can be used to precompute the all-pairs movement cost between viewpoints:

d(x, x') = d_M(x, x') + \Delta_0 = cost of moving from x to x' on the viewsphere X_\rho and taking another observation,

where \Delta_0 > 0 is a fixed measurement cost, which prevents the sensor from obtaining an infinite number of measurements without moving, and d_M(x, x') is a function that computes the cost of motion from x to x'. As a result, a motion plan of length \tau for the sensor consists of a sequence of viewpoints x_1, \ldots, x_\tau \in X_\rho on the graph and its cost is:

j_M(\tau) := \sum_{t=2}^{\tau} d(x_{t-1}, x_t)

5.3.4 Active Object Classification and Pose Estimation

Problem. Let the initial pose of the mobile sensor be x_1 \in X_\rho. Given an object with unknown class \alpha and orientation \theta, we choose a stopping time \tau, a sequence of viewpoints x_2, \ldots, x_\tau \in X_\rho, and a hypothesis H(\hat\alpha, \hat\theta), which minimizes the total cost:

E \{ j_M(\tau) + \lambda j_D(\hat\alpha, \hat\theta, \alpha, \theta) \},   (5.5)

where \lambda \geq 0 determines the relative importance of a correct decision versus the cost of movement. The expectation is over the correct hypothesis and the observations collected by the sensor.

Our approach to solving the active object classification and pose estimation problem consists of two stages. First, we use the VP-Tree to perform static detection in 3D as described in the next section. Since the detection scores are affected by noise and occlusions, they are not used directly. Instead, we use the hypotheses about the detection outcome and maintain them in a probabilistic framework. In the second stage, we use nonmyopic planning to select better viewpoints for the static detector and update the probabilities of the hypotheses.

5.4 Observation Model

Statistics about the operation of the VP-Tree detector for different viewpoints and object classes are needed to maintain a probability distribution over the object hypotheses.
Using the VP-Tree output as the sensor observation reduces the observation space from all possible point clouds to the space of VP-Tree outputs and includes the operation of the vision algorithm in the statistics. Given a query point cloud, suppose that the VP-Tree returns template P_{g,l} as the top match. Assume that the templates in T are indexed so that those obtained from models in I have a lower l index than the rest. We take the linear index of P_{g,l} as the observation if the match is an object of interest. Otherwise, we record only the model index l, ignoring the viewpoint g:

Z = \begin{cases} (l-1)\Gamma + g, & l \leq |I| \\ \Gamma|I| + (l - |I|), & l > |I|. \end{cases}

This makes the observation space one dimensional. In order to compute the likelihood of an observation offline, we introduce an occlusion state \psi for a point cloud. Suppose that the z-axis in the sensor frame measures depth and the xy-plane is the image plane. Given parameters \epsilon and E, we say that a point cloud is occluded from left if it has less than E points in the image plane to the left of the line x = -\epsilon. If it has less than E points in the image plane above the line y = \epsilon, it is occluded from top. Similarly, we define occluded from bottom, occluded from right, and combinations of them (left-right, left-top, etc.). Let \Psi denote the set of occlusion states, including the non-occluded (\psi_\emptyset) and the fully-occluded cases. Then, the likelihood of a VP-Tree observation z for a given sensor pose x \in X_\rho, hypothesis H(\alpha, \theta), and occlusion \psi \in \Psi is:

h_z(x, \alpha, \theta, \psi) := P(Z = z \mid x, H(\alpha, \theta), \psi)

The function h is called the observation model of the static detector. It can be obtained offline because for a given occlusion state it only depends on the characteristics of the sensor and the vision algorithm. Since all variables are discrete, h can be represented with a histogram, which we compute from the training set T.
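The piecewise mapping from the top match (g, l) to the scalar observation Z can be written directly; a small sketch, assuming 1-based indices and with `Gamma` denoting the number of viewpoints per model (the function name is illustrative):

```python
def observation_index(g, l, Gamma, num_interest):
    """Map the top VP-Tree match (viewpoint g, model l) to a scalar Z.

    Templates of the |I| objects of interest keep their full linear index;
    all other models are collapsed to a single observation per model,
    discarding the viewpoint g.
    """
    if l <= num_interest:                     # object of interest
        return (l - 1) * Gamma + g
    return Gamma * num_interest + (l - num_interest)
```

The collapse for l > |I| is what keeps the observation space small: viewpoint detail is only retained where it matters for the hypotheses.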
Note, however, that the observation model depends on the choice of planning viewpoints and hypotheses, which means that it needs to be recomputed if they change. Ideally, it should be computed once for a given training set and then be able to handle scenarios with different sets of hypotheses and planning viewpoints. To make the computation of the observation model independent of the choice of hypotheses and planning views, we discretize the viewsphere $V_\rho$ very finely into a new set of viewpoints $V_\rho^o$ with coordinates in the object frame. A nominal observation model:

$$h_z^o(v, \alpha, \psi) := \mathbb{P}(Z = z \mid v, \alpha, \psi), \quad v \in V_\rho^o,\ \alpha \in \mathcal{D},\ \psi \in \Psi$$

is computed and used to obtain $h_z(x, \alpha, \theta, \psi)$ as follows:

1. Determine the sensor's pose $w(x, \theta)$ in the object frame.
2. Find the closest viewpoint $v \in V_\rho^o$ to $w(x, \theta)$ (the fine discretization avoids a large error).
3. Rotate the lines associated with $\psi$ to the object frame of $\alpha$ to get the new occlusion region. Obtain a point cloud from $v$, remove the points within the occlusion region, and determine the occlusion state $\psi^o$ in the object frame.
4. Copy the values from the nominal observation model: $h_z(x, \alpha, \theta, \psi) = h_z^o(v, \alpha, \psi^o)$.

As a result, it is necessary to compute only the nominal observation model $h_z^o(v, \alpha, \psi^o)$. The histogram representing $h^o$ was obtained in simulation. A viewsphere with radius $\rho = 1$ m was discretized uniformly into 128 viewpoints (the set $V_\rho^o$). A simulated depth sensor was used to obtain 20 independent scores from the VP-Tree for every viewpoint $v \in V_\rho^o$, every model $\alpha \in \mathcal{D}$, and every occlusion state $\psi \in \Psi$. Fig. 5.2 shows an example of the final observation model obtained from the nominal one with the planning viewpoints and hypotheses used in some of our experiments.

Figure 5.2: Observation model obtained with seven hypotheses for the Handlebottle model and the planning viewpoints used in the simulation experiments (Sec. 5.7.1).
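Step 2 of the lookup above is a nearest-neighbor query over the fine discretization. A minimal sketch, assuming the viewpoints are stored as Cartesian coordinates in the object frame (function and variable names are illustrative):

```python
import numpy as np

def closest_nominal_view(w, V):
    """Return the index of the viewpoint in V (the fine discretization
    V_rho^o, one 3D position per row) closest to the sensor position w
    expressed in the object frame."""
    return int(np.argmin(np.linalg.norm(V - w, axis=1)))
```

Because $V_\rho^o$ is finely discretized (128 viewpoints in our setup), the quantization error of this lookup is small.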
Given a new VP-Tree observation $z_{t+1}$ from the viewpoint $x_{t+1}$, the observation model is used to determine the data likelihood of the observation and to update the hypotheses' prior by applying Bayes' rule.

5.5 Active Hypothesis Testing

In this section we provide a dynamic programming formulation for the single-object optimization problem in Equation (5.5). Let $\bar{\mathcal{I}} := \mathcal{I} \cup \{\alpha_\emptyset\}$ denote the set of all hypothesized classes and $M := \sum_{\alpha \in \bar{\mathcal{I}}} |R_\alpha|$ the total number of hypotheses. The state of the problem at time $t$ consists of the sensor pose $x_t \in \mathcal{X}_\rho$ and the information state $p_t \in [0,1]^M$, containing the probabilities of each hypothesis $H(\alpha, \theta)$:

$$p_t(\alpha, \theta) := \mathbb{P}(H(\alpha, \theta) \mid x_{1:t}, z_{1:t}, \psi_{1:t}) \in [0, 1],$$

conditioned on the past sensor trajectory $x_{1:t}$, the past VP-Tree observations $z_{1:t}$, and the occlusion states of the observed point clouds $\psi_{1:t}$. Suppose that the sensor decides to continue observing by moving to a new viewpoint $x_{t+1} \in \mathcal{X}_\rho$. The newly observed point cloud is used to determine the VP-Tree score $z_{t+1}$ and the occlusion state $\psi_{t+1}$. Then, the probabilities in $p_t$ are updated according to Bayes' rule:

$$p_{t+1}(\alpha, \theta) = \mathbb{P}(H(\alpha, \theta) \mid x_{1:(t+1)}, z_{1:(t+1)}, \psi_{1:(t+1)}) = \frac{\mathbb{P}(Z_{t+1} = z_{t+1} \mid x_{t+1}, H(\alpha, \theta), \psi_{t+1})\, p_t(\alpha, \theta)}{\mathbb{P}(Z_{t+1} = z_{t+1} \mid x_{t+1}, \psi_{t+1})} = \frac{h_{z_{t+1}}(x_{t+1}, \alpha, \theta, \psi_{t+1})\, p_t(\alpha, \theta)}{\sum_{\alpha' \in \bar{\mathcal{I}}} \sum_{\theta' \in R_{\alpha'}} h_{z_{t+1}}(x_{t+1}, \alpha', \theta', \psi_{t+1})\, p_t(\alpha', \theta')},$$

using the observation model obtained in Sec. 5.4 and the assumption of independent successive observations. Fig. 5.2 illustrates the update. Let $T(p_t, x_{t+1}, z_{t+1}, \psi_{t+1})$ denote the Bayesian operator above, which maps $p_t$ to $p_{t+1}$ given a view $x_{t+1}$, a VP-Tree score $z_{t+1}$, and an occlusion state $\psi_{t+1}$. The future sequence of views is planned with the assumption that there are no occlusions, i.e. $\psi_s = \psi_\emptyset$ for $s > t+1$. This choice makes the planned sequence independent of the observed scene and allows computing it offline.
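Numerically, the Bayesian operator $T$ is a pointwise product followed by a normalization. A minimal sketch, assuming the prior and the (already looked-up) observation likelihoods are vectors with one entry per hypothesis:

```python
import numpy as np

def bayes_update(p, likelihood):
    """One step of the Bayesian operator T.

    p: prior p_t over the M hypotheses (sums to 1).
    likelihood: h_{z_{t+1}}(x_{t+1}, alpha, theta, psi_{t+1}) evaluated
    at the received observation, one entry per hypothesis H(alpha, theta).
    Returns the posterior p_{t+1}.
    """
    post = likelihood * p
    return post / post.sum()  # denominator is P(Z_{t+1} = z_{t+1} | ...)
```

The normalizer is exactly the marginal observation likelihood appearing in the denominator of the update above.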
Suppose for a moment that the stopping time $\tau$ is known; then the sensor is not allowed to move and has to make a decision at time $\tau$. After the last observation $z_\tau$ has been incorporated into the posterior $p_\tau$, the terminal cost of the problem is:

$$V_\tau(x_\tau, p_\tau) = \min_{\hat{\alpha} \in \bar{\mathcal{I}},\, \hat{\theta} \in R_{\hat{\alpha}}} \mathbb{E}_{\alpha,\theta}\{\lambda j_D(\hat{\alpha}, \hat{\theta}, \alpha, \theta)\} = \min_{\hat{\alpha} \in \bar{\mathcal{I}},\, \hat{\theta} \in R_{\hat{\alpha}}} \sum_{\alpha \in \bar{\mathcal{I}}} \sum_{\theta \in R_\alpha} \lambda j_D(\hat{\alpha}, \hat{\theta}, \alpha, \theta)\, p_\tau(\alpha, \theta).$$

The intermediate stage costs for $t = (\tau-1), \ldots, 0$ are:

$$V_t(x_t, p_t) = \min_{v \in \mathcal{X}_\rho} \left\{ d(x_t, v) + \mathbb{E}_{Z_{t+1}} V_{t+1}(v, T(p_t, v, Z_{t+1}, \psi_\emptyset)) \right\}.$$

Letting $\tau$ be random again and $t$ go to infinity, we get the following infinite-horizon dynamic programming equation:

$$V(x, p) = \min\left\{ \min_{\hat{\alpha} \in \bar{\mathcal{I}},\, \hat{\theta} \in R_{\hat{\alpha}}} \sum_{\alpha \in \bar{\mathcal{I}}} \sum_{\theta \in R_\alpha} \lambda j_D(\hat{\alpha}, \hat{\theta}, \alpha, \theta)\, p(\alpha, \theta),\ \min_{v \in \mathcal{X}_\rho} d(x, v) + \mathbb{E}_Z\{V(v, T(p, v, Z, \psi_\emptyset))\} \right\}, \qquad (5.6)$$

which is well-posed by Propositions 9.8 and 9.10 in Bertsekas and Shreve (2007). Equation (5.6) gives an intuition about the relationship between the cost functions $d(\cdot,\cdot)$ ($j_M$), $j_D$, and the stopping time $\tau$. If at time $t$ the expected cost of making a mistake, given by $\min_{\hat{\alpha} \in \bar{\mathcal{I}},\, \hat{\theta} \in R_{\hat{\alpha}}} \sum_{\alpha \in \bar{\mathcal{I}}} \sum_{\theta \in R_\alpha} \lambda j_D(\hat{\alpha}, \hat{\theta}, \alpha, \theta)\, p_t(\alpha, \theta)$, is smaller than the cost of taking one more measurement, then the sensor stops and chooses the minimizing hypothesis; otherwise it continues observing the scene.

To determine the value function $V(x, p)$, we resort to numerical approximation techniques, which work well when the state space of the problem is sufficiently small. Define the set $\mathcal{A} := \{(\alpha, \theta) \mid \alpha \in \mathcal{I},\ \theta \in R_\alpha\} \cup \{(\alpha_\emptyset, \theta_\emptyset)\}$ of all hypothesized class-orientation pairs. Then, for $s_1, s_2 \in \mathcal{X}_\rho \cup \mathcal{A}$, redefine the cost of movement and the state transition function:

$$d'(s_1, p, s_2) = \begin{cases} d(s_1, s_2), & s_1, s_2 \in \mathcal{X}_\rho, \\ \sum_{\alpha \in \bar{\mathcal{I}}} \sum_{\theta \in R_\alpha} \lambda j_D(\alpha', \theta', \alpha, \theta)\, p(\alpha, \theta), & s_1 \in \mathcal{X}_\rho,\ s_2 = (\alpha', \theta') \in \mathcal{A}, \\ 0, & s_1 = s_2 \in \mathcal{A}, \\ \infty, & \text{otherwise}, \end{cases}$$

$$T'(p, s, z, \psi_\emptyset) = \begin{cases} T(p, s, z, \psi_\emptyset), & s \in \mathcal{X}_\rho, \\ p, & s \in \mathcal{A}. \end{cases}$$
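With the 0-1 decision cost used later in the experiments (Eq. 5.7), the stopping term has a closed form: declaring hypothesis $i$ costs $\lambda(1 - p(i))$ in expectation, so the best declaration is the MAP hypothesis. A sketch under that assumption (the function name is illustrative):

```python
import numpy as np

def terminal_cost(p, lam):
    """Expected cost of stopping now under a 0-1 decision cost j_D.

    With j_D = 1 if the declared hypothesis is wrong and 0 otherwise,
    the expected cost of declaring hypothesis i is lam * (1 - p[i]);
    the minimizer is the MAP hypothesis. Returns (best index, cost).
    """
    i = int(np.argmax(p))
    return i, lam * (1.0 - p[i])
```

Comparing this quantity against the cost of one more measurement implements the stopping intuition described after Equation (5.6).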
Using the new definitions, we can rewrite Equation (5.6) as the usual Bellman optimality equation for a POMDP:

$$V(s, p) = \min_{s' \in \mathcal{X}_\rho \cup \mathcal{A}} \left\{ d'(s, p, s') + \mathbb{E}_Z\{V(s', T'(p, s', Z, \psi_\emptyset))\} \right\}.$$

The state space of the POMDP is the discrete space of sensor poses $\mathcal{X}_\rho$ and the continuous space $\mathcal{B} := [0,1]^M$ of distributions over the $M$ hypotheses. Since the viewpoints are chosen locally around the object, the space $\mathcal{X}_\rho$ is very small in practice (only 42 views were used in our experiments). The main computational challenge comes from the exponential growth of the size of $\mathcal{B}$ with the number of hypotheses $M$. To alleviate this difficulty, we apply a point-based POMDP algorithm Kurniawati et al. (2008); Ong et al. (2009), which uses samples to compute successive approximations to the optimally reachable part of $\mathcal{B}$. The algorithm computes an approximate stationary policy $\hat{\mu} : \mathcal{X}_\rho \times \mathcal{B} \to \mathcal{X}_\rho \cup \mathcal{A}$, which maps the current sensor viewpoint $x_t$ and the hypotheses' probabilities $p_t$ to a future viewpoint or a guess of the correct hypothesis. In practice, there is some control over the size of $M$. In most applications, the number of objects of interest is small, and we show in Sec. 5.7.2 that a very sparse discretization of the orientation space is sufficient to obtain accurate orientation estimates.

5.6 Implementation Details

5.6.1 Segmentation and data association

Our experiments were carried out in a tabletop setting, which simplifies the problems of segmentation and data association. Point clouds obtained from the scene were clustered according to Euclidean distance using a Kd-tree. An occupancy grid representing the 2D table surface was maintained in order to associate the clustered surfaces with new or previously seen objects. Each cell of the grid could be unoccupied or associated with the ID of an existing object. The centroid of a newly obtained object surface was projected onto the table and compared with the occupied cells (if any).
If the cell corresponding to the new centroid was close enough to a cell associated with an existing object, the new surface was associated with that object and its cell was indexed by the existing object's ID. Otherwise, a new object with a unique ID was instantiated. Since segmentation was not the focus of this research, we did not explicitly address the case where objects were touching. In such situations the detection outcome would be inconsistent and dependent on the chosen viewpoints.

5.6.2 Coupling among objects

The optimization problem in Equation (5.5) is with respect to a single object, but while executing it, the sensor obtains surfaces from other objects within its field of view. To utilize these observations, we have the sensor turn towards the centroid of every visible object and update the probabilities of the hypotheses associated with that object. The turning is required because the observation model was trained only for a sensor facing the centroid of the object. Removing this assumption requires more training data and complicates the observation model computation. The energy used for these turns is not included in the optimization in Equation (5.5).

The scores obtained from the VP-Tree are not affected significantly by scaling. This allows us to vary the radius $\rho$ of the viewsphere in order to ease the sensor movement and to update hypotheses for other objects within the field of view. The radius is set to 1 meter by default, but if the next viewpoint is not reachable, it can be adapted to accommodate obstacles and the sensor dynamics. Algorithm 1 summarizes the complete view planning framework.

5.7 Performance Evaluation

The VP-Tree was trained on templates extracted using a simulated depth sensor from 48 viewpoints, uniformly distributed on a viewsphere of radius $\rho = 1$ m (Fig. 4.2). The observation model was trained as described in the last paragraph of Sec. 5.4.
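The data-association scheme of Sec. 5.6.1 reduces, in essence, to a nearest-centroid match against the tracked objects. The sketch below simplifies the occupancy grid away and matches centroids directly; the function, the dictionary encoding, and the 0.1 m radius are all illustrative assumptions, not values from the text.

```python
import numpy as np

def associate(centroid_xy, objects, radius=0.1):
    """Associate a newly observed surface with a tracked object.

    centroid_xy: (x, y) of the new surface centroid projected onto the
    table. objects: dict mapping object ID -> last known centroid.
    Returns an existing ID if the centroid falls within `radius` of a
    tracked object; otherwise instantiates and returns a new unique ID.
    """
    for oid, c in objects.items():
        if np.hypot(centroid_xy[0] - c[0], centroid_xy[1] - c[1]) <= radius:
            return oid
    nid = max(objects, default=-1) + 1  # fresh unique ID
    objects[nid] = tuple(centroid_xy)
    return nid
```

As in the text, this scheme breaks down when objects touch and cluster into a single surface, which is outside the scope of this chapter.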
Since the VP-Tree scores are not significantly affected by scaling, the score likelihoods remain similar as the viewsphere radius varies. We simplify the training process by using a fixed viewsphere radius, which limits the number of sensor poses at which we train the observation model. The reason for using a simulated sensor is also simply pragmatic. Many point clouds, each accompanied by a ground-truth sensor pose, are needed for every model in the database in order to train the observation model. This extensive training is simpler to do in simulation but adversely affects the recognition results in real scenes.

Algorithm 1 View Planning for Active Object Recognition
1: Input: Initial sensor pose $x_1 = (x_1^p, x_1^\theta) \in \mathbb{R}^3 \ltimes SO(3)$, object models of interest $\mathcal{I}$, vector of priors $p_0 \in [0,1]^M$
2: Output: Decision $\hat{\alpha}_i \in \bar{\mathcal{I}}$, $\hat{\theta}_i \in R(\hat{\alpha}_i)$ for every object $i$ in the scene
3: Priority queue $pq \leftarrow \emptyset$; current object ID $i \leftarrow$ unassigned
4: for $t = 1$ to $\infty$ do
5:   Obtain a point cloud $Q_t$ from $x_t$
6:   Cluster $Q_t$ and update the table occupancy grid
7:   for every undecided object $j$ seen in $Q_t$ do
8:     Rotate the sensor so that $x_t^r$ faces the centroid of $j$
9:     Get viewsphere radius: $\rho \leftarrow \|x_t^p - \mathrm{centroid}(j)\|$
10:    Get closest viewpoint: $v_j \leftarrow \arg\min_{v \in \mathcal{X}_\rho} \|x_t^p - v\|$
11:    Obtain a point cloud $Q_j$
12:    Get VP-Tree score $z_j$ and occlusion state $\psi_j$ from $Q_j$
13:    Update probabilities for object $j$: $p_t^j \leftarrow T(p_{t-1}^j, v_j, z_j, \psi_j)$
14:    if $j \notin pq$ then
15:      Insert $j$ in $pq$ according to the probability that $j \in \mathcal{I}$: $1 - p_t^j(\alpha_\emptyset, \theta_\emptyset)$
16:    end if
17:  end for
18:  if $i$ is unassigned then
19:    if $pq$ is not empty then
20:      $i \leftarrow pq.\mathrm{pop}()$
21:    else   ▷ All objects seen so far have been processed.
22:      if whole scene explored then
23:        break
24:      else
25:        Move sensor to an unexplored area and start over
26:      end if
27:    end if
28:  end if
29:  $x_{t+1} \leftarrow \hat{\mu}(v_i, p_t^i)$
30:  if $x_{t+1} == (\alpha, \theta) \in \mathcal{A}$ then
31:    $\hat{\alpha}_i \leftarrow \alpha$, $\hat{\theta}_i \leftarrow \theta$, $i \leftarrow$ unassigned, go to line 19
32:  end if
33:  Move sensor to $x_{t+1}$
34: end for
We used $|\mathcal{X}_\rho| = 42$ planning viewpoints in the upper hemisphere of the viewsphere to avoid placing the sensor under the table. The following costs were used in all experiments:

$$\lambda = 75, \qquad j_D(\hat{\alpha}, \hat{\theta}, \alpha, \theta) = \begin{cases} 0, & \hat{\alpha} = \alpha \text{ and } \hat{\theta} = \theta, \\ 1, & \text{otherwise}, \end{cases} \qquad d(x, x') = \mathrm{gcd}(x, x') + d_0, \qquad (5.7)$$

where $\mathrm{gcd}(\cdot,\cdot)$ is the great-circle distance between two viewpoints $x, x' \in \mathcal{X}_\rho$ and $d_0 = 1$ is the measurement cost. The parameter $\lambda$ was set high (heuristically) in order to favor correct decisions over speed and to emphasize the advantage of active view planning over static detection. If necessary, a more principled approach to choosing $\lambda$, such as cross-validation, can be used.

5.7.1 Performance evaluation in simulation

A single object of interest (Handlebottle) was used: $\mathcal{I} = \{\alpha_H\}$. Keeping the pitch and roll zero, the space of object yaws was discretized into 6 bins to formulate the hypotheses:

$H(\emptyset) := H(\alpha_\emptyset, \theta_\emptyset)$ = the object is not a Handlebottle
$H(r) := H(\alpha_H, \theta)$ = the object is a Handlebottle with yaw $\theta \in \{0°, 60°, 120°, 180°, 240°, 300°\}$

Seventy synthetic scenes were generated, with 10 true positives for each of the seven hypotheses. The true positive object was placed in the middle of the table, while the rest of the objects served as occluders. Fig. 5.1 shows an example of the scenes used in the simulation.

Four approaches for selecting sequences of views from $\mathcal{X}_\rho$ were compared. The static approach takes a single measurement from the starting viewpoint and makes a decision based on the output from the VP-Tree. This is the traditional approach in machine perception. The second approach is our nonmyopic view planning (NVP). Note that the NVP policy is computed offline and used as a look-up table, which makes the online planning decisions instantaneous.

The third approach (random) is a random walk on the viewsphere, which avoids revisiting viewpoints. It ranks the viewpoints which have not been visited yet according to the great-circle distance from the current viewpoint.
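The movement cost of Eq. (5.7) is straightforward to compute for viewpoints given as Cartesian coordinates on the viewsphere. A minimal sketch (function name and argument encoding are illustrative):

```python
import numpy as np

def movement_cost(x, xp, rho=1.0, d0=1.0):
    """d(x, x') = great-circle distance + measurement cost d_0 (Eq. 5.7).

    x, xp: 3D viewpoint positions on a viewsphere of radius rho.
    The clip guards against floating-point values slightly outside
    [-1, 1] before arccos.
    """
    c = np.dot(x, xp) / rho**2
    return rho * np.arccos(np.clip(c, -1.0, 1.0)) + d0
```

This same quantity is used both by the planner and by the random-walk baseline when ranking unvisited viewpoints by distance.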
Then, it selects a viewpoint at random among the closest ones. The observation model is used to update the hypotheses' probabilities over time. The experiment is terminated when the probability of one hypothesis exceeds 60%, i.e. $\tau = \inf\{t \geq 0 \mid \exists (\alpha, \theta) \in \mathcal{A} \text{ such that } p_t(\alpha, \theta) \geq 0.6\}$, and that hypothesis is chosen as the sensor's decision. This stopping rule was chosen empirically so that the random approach makes about the same number of measurements as NVP with the costs given in (5.7). This allows us to compare the informativeness of the chosen sensor views.

Last is the widely-used greedy mutual information (GMI) approach. Specialized to our setting, the GMI policy takes the following form:

$$\mu_{GMI}(x, p) = \arg\max_{x' \in NV} \frac{I(H(\alpha, \theta); Z)}{d(x, x')} = \arg\min_{x' \in NV} \frac{\mathbb{H}(H(\alpha, \theta) \mid Z)}{d(x, x')}$$
$$= \arg\min_{x' \in NV} \frac{1}{d(x, x')} \sum_{z \in \mathcal{Z}} \sum_{\alpha \in \bar{\mathcal{I}}} \sum_{\theta \in R_\alpha} p(\alpha, \theta)\, h_z(x', \alpha, \theta, \psi_\emptyset) \log_2\!\left( \frac{\sum_{\alpha' \in \bar{\mathcal{I}}} \sum_{\theta' \in R_{\alpha'}} p(\alpha', \theta')\, h_z(x', \alpha', \theta', \psi_\emptyset)}{p(\alpha, \theta)\, h_z(x', \alpha, \theta, \psi_\emptyset)} \right),$$

where $NV := \{x \in \mathcal{X}_\rho \mid x \text{ has not been visited}\}$, $H(\alpha, \theta)$ is the true hypothesis, $I(\cdot;\cdot)$ is mutual information, $\mathbb{H}(\cdot\mid\cdot)$ is conditional entropy, and $\mathcal{Z}$ is the space of observations as defined in Sec. 5.4. The same stopping rule as for the random approach was used, so that the number of measurements made by GMI is roughly the same as those for random and NVP.

Fifty repetitions with different starting sensor poses were carried out on every scene. For each hypothesis, the measurement cost $\sum_{t=1}^\tau d_0$, the movement cost $\sum_{t=2}^\tau \mathrm{gcd}(x_t, x_{t-1})$, and the decision cost $j_D$ were averaged over all repetitions. The accuracy of each approach and the average costs are presented in Table 5.1. The following conclusions can be made:

• The active approaches for object classification and pose estimation significantly outperform the traditional single-view approach in terms of accuracy. In most cases, by making 1-2 extra measurements, they are able to choose the correct hypothesis more than 20% more frequently.
• There is a steady improvement in performance when going from random viewpoint selection, to greedy view planning, and finally to nonmyopic view planning. Compared with the random and GMI approaches, our NVP method needs less movement and fewer measurements on average and, as demonstrated by its lower average decision cost, is able to select more informative views.

• The performance gain of NVP over GMI is not significant. In some scenarios it might not justify the complicated offline training. For example, it is much easier to include additional constraints, such as occlusion avoidance, with greedy planning.

• The most notable advantage of NVP comes from the adaptive stopping criterion. This is especially evident when the observed object is clutter ($H(\emptyset)$ is correct). In this case, the scores provided by the VP-Tree are not consistent and cause the probabilities of various hypotheses to increase and decrease frequently. As a result, the GMI and random approaches need many measurements to reach their prespecified stopping time. In contrast, NVP employs a longer planning horizon and recognizes that if the clutter class is likely, it is better to stop sooner than to attempt to increase the confidence, as many (costly) measurements would be needed. The main advantage of NVP over GMI is not that it selects much more informative views but that it optimizes its stopping criterion.

5.7.2 Accuracy of the orientation estimates

Since the object orientations in a real scene are not discrete, a refinement step is needed if the algorithm detects an object of interest, i.e. decides on $\hat{\alpha} \neq \alpha_\emptyset$. The surfaces observed from an object are accumulated over time. After a decision, these surfaces are aligned, using an iterative closest point (ICP) algorithm, with the surface of the database model corresponding to $H(\hat{\alpha}, \hat{\theta})$. Thus, the final decision includes both a class and a continuous pose estimate.
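Returning to the GMI baseline of Sec. 5.7.1, its selection rule can be sketched in a few lines as mutual information per unit movement cost (the form in the first line of the GMI equation). The array layout `h[x, z, m]` for the occlusion-free observation model and the function name are illustrative assumptions:

```python
import numpy as np

def gmi_next_view(p, h, costs, visited):
    """Greedy selection of the next view: argmax of I(H; Z) / d(x, x').

    p: (M,) hypothesis probabilities.
    h: (X, Z, M) array, h[x, z, m] = P(Z = z | view x, hypothesis m,
       no occlusion).
    costs: (X,) movement cost d(x_current, x') per candidate view.
    visited: (X,) boolean mask of already-visited viewpoints.
    """
    best, best_score = None, -np.inf
    for x in range(h.shape[0]):
        if visited[x]:
            continue
        joint = h[x] * p[None, :]                 # P(H = m) P(Z = z | m)
        pz = joint.sum(axis=1, keepdims=True)     # P(Z = z | view x)
        with np.errstate(divide="ignore", invalid="ignore"):
            term = joint * np.log2(h[x] / pz)     # contributions to I(H; Z)
        score = np.nansum(term) / costs[x]        # zero-probability terms -> 0
        if score > best_score:
            best, best_score = x, score
    return best
```

A view whose likelihoods discriminate between hypotheses gets a high mutual-information score, while a view whose likelihoods are identical across hypotheses scores zero.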
Table 5.1: Simulation results for a bottle detection experiment. For each method and true hypothesis, the first seven columns give the predicted-hypothesis percentages; the last four give the average number of measurements, movement cost, decision cost, and total cost.

Predicted (%):  H(0°)  H(60°) H(120°) H(180°) H(240°) H(300°)  H(∅) | Meas  Move   Dec    Total

Static
  H(0°)         60.35   3.86   1.00    2.19    1.48    2.19   28.92 | 1.00  0.00  29.74  30.74
  H(60°)         5.53  53.90   2.19    1.00    1.48    1.95   33.94 | 1.00  0.00  34.57  35.57
  H(120°)        4.86   4.62  51.49    3.90    2.21    1.24   31.68 | 1.00  0.00  36.38  37.38
  H(180°)        4.34   4.34   6.01   49.13    1.95    1.24   32.98 | 1.00  0.00  38.15  39.15
  H(240°)        3.88   1.96   1.24    2.20   56.11    1.24   33.37 | 1.00  0.00  32.92  33.92
  H(300°)        5.07   1.24   2.44    2.44    1.72   54.29   32.82 | 1.00  0.00  34.28  35.28
  H(∅)           0.56   1.09   3.11    1.93    0.32    3.13   89.87 | 1.00  0.00   7.60   8.60
  Overall average total cost: 31.52

Random
  H(0°)         73.78   3.17   1.24    2.21    1.48    1.24   16.87 | 2.00  1.26  19.66  22.93
  H(60°)         1.96  70.34   2.20    1.72    1.00    1.48   21.31 | 2.36  1.71  22.25  26.31
  H(120°)        1.00   1.49  70.75    3.43    1.00    1.24   21.09 | 2.30  1.64  21.94  25.87
  H(180°)        1.48   1.73   3.66   66.97    1.97    1.48   22.71 | 2.71  2.16  24.78  29.64
  H(240°)        1.48   1.24   1.48    2.45   68.76    1.72   22.87 | 2.41  1.77  23.43  27.62
  H(300°)        1.72   1.97   1.00    1.24    1.97   71.85   20.25 | 2.60  2.02  21.11  25.74
  H(∅)           0.07   2.11   2.00    1.53    1.59    0.37   92.33 | 4.95  4.93   5.76  15.64
  Overall average total cost: 24.82

Greedy MI
  H(0°)         82.63   2.93   0.76    1.61    0.83    0.40   10.85 | 1.96  1.20  13.03  16.19
  H(60°)         0.80  80.14   1.05    1.07    0.14    1.16   15.64 | 2.26  1.58  14.89  18.73
  H(120°)        1.09   1.05  76.93    2.64    0.83    0.82   16.66 | 2.30  1.64  17.31  21.25
  H(180°)        1.47   1.25   3.62   75.60    0.71    0.50   16.84 | 2.79  2.25  18.30  23.34
  H(240°)        0.49   1.15   0.82    2.58   75.29    1.71   17.96 | 2.37  1.72  18.53  22.62
  H(300°)        1.79   0.50   0.12    0.86    1.21   81.78   13.74 | 2.59  2.00  13.66  18.25
  H(∅)           0.72   1.35   2.23    0.39    0.25    0.41   94.65 | 5.29  5.37   4.01  14.67
  Overall average total cost: 19.29

NVP
  H(0°)         87.98   0.48   0.24    0.24    0.24    0.48   10.34 | 2.06  1.45   9.01  12.51
  H(60°)         0.00  83.78   0.97    0.24    0.24    0.24   14.53 | 2.28  1.73  12.17  16.17
  H(120°)        0.48   0.00  82.81    1.21    0.00    0.00   15.50 | 2.37  1.86  12.89  17.12
  H(180°)        0.00   0.00   0.97   82.61    1.21    0.24   14.98 | 2.50  2.05  13.04  17.60
  H(240°)        0.49   0.24   0.00    0.49   78.73    0.00   20.05 | 2.57  2.18  15.95  20.71
  H(300°)        0.00   0.24   0.24    0.73    0.48   81.60   16.71 | 2.60  2.15  13.80  18.55
  H(∅)           1.49   1.58   1.37    0.37    0.74    1.25   93.20 | 2.08  1.50   5.10   8.68
  Overall average total cost: 15.91

Simulations were carried out to evaluate the accuracy of the continuous orientation estimates with respect to the ground truth. The following distance metric on $SO(3)$ was used to measure the error between two orientations represented by quaternions $q_1$ and $q_2$:

$$f(q_1, q_2) = \cos^{-1}\!\left( 2\langle q_1, q_2 \rangle^2 - 1 \right),$$

where $\langle a_1 + b_1 i + c_1 j + d_1 k,\ a_2 + b_2 i + c_2 j + d_2 k \rangle = a_1 a_2 + b_1 b_2 + c_1 c_2 + d_1 d_2$ denotes the quaternion inner product.

A single object of interest (Watercan) was used: $\mathcal{I} = \{\alpha_W\}$. The ground-truth yaw ($\psi$) and roll ($\phi$) of the Watercan were varied from 0° to 360° at 7.5° increments. The pitch ($\gamma$) was kept at zero. Synthetic scenes were generated for each orientation.

Figure 5.3: Twenty-five hypotheses (red dotted lines) were used to decide on the orientation of a Watercan. The top plot shows the error in the orientation estimates as the ground-truth orientation varies. The error averaged over the ground-truth roll, the hypotheses over the object's yaw (blue dots), and the overall average error (red line, 38.91°) are shown in the bottom plot.

Hypotheses were formulated by discretizing the yaw space into 6 bins and the roll space into 4 bins:

$H(\alpha_\emptyset, \theta_\emptyset)$ = the object is not a Watercan
$H(\alpha_W, \theta)$ = the object is a Watercan with orientation $\theta = (\psi, \phi, \gamma) \in \{(i_y \cdot 60°,\ 0,\ i_r \cdot 90°) \mid i_y = 0, \ldots, 5,\ i_r = 0, \ldots, 3\}$

Fifty repetitions with different starting sensor poses were carried out on every test scene.
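The quaternion distance metric above is a one-liner in code; a sketch (the function name and the $(w, x, y, z)$ component ordering are assumptions):

```python
import numpy as np

def quat_distance(q1, q2):
    """Distance on SO(3): f(q1, q2) = arccos(2 <q1, q2>^2 - 1).

    q1, q2: unit quaternions as 4-vectors (w, x, y, z). Squaring the
    inner product makes the metric invariant to the sign ambiguity,
    so q and -q (the same rotation) are at distance 0. The clip guards
    against floating-point values slightly outside [-1, 1].
    """
    d = np.dot(q1, q2)
    return np.arccos(np.clip(2.0 * d * d - 1.0, -1.0, 1.0))
```

For example, a 90° rotation about any axis is at distance π/2 from the identity under this metric.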
The errors in the orientation estimates were averaged, and the results are presented in Fig. 5.3. As expected, the orientation estimates get worse for ground-truth orientations which are further away from the hypothesized orientations. In the bottom plot, it can be seen that the hypothesized yaws correspond to local minima in the orientation error. This suggests that the number of hypotheses needs to be increased if a better orientation estimate is desired. Large errors result when the guessed hypothesis is not correct. Even when it is correct, if the real (continuous) pose of the object is far from the hypothesized one, the ICP algorithm might not perform well because it is very sensitive to the initialization. Still, a rather sparse set of hypothesized orientations was sufficient to obtain an average error of 39°. For these experiments, the average number of measurements was 2.85 and the average movement cost was 2.61.

5.7.3 Performance evaluation in real-world experiments

In this section, we demonstrate that the real-world performance of NVP is similar to the simulation. We expect the same to be true for the rest of the view planning methods and did not carry out additional real experiments; it is unlikely that their behavior differs from the trends observed in Sec. 5.7.1.
Table 5.2: Results for a real-world bottle detection experiment. As in Table 5.1, the first seven columns give the predicted-hypothesis percentages; the last four give the average number of measurements, movement cost, decision cost, and total cost.

Predicted (%):  H(0°)  H(60°) H(120°) H(180°) H(240°) H(300°)  H(∅) | Meas  Move   Dec    Total

  H(0°)         87.5    2.5    5.0     0.0     0.0     0.0     5.0  | 2.53  2.81   9.38  14.72
  H(60°)         2.5   80.0    0.0     0.0     0.0     0.0    17.5  | 2.66  2.52  15.00  20.18
  H(120°)        7.5    0.0   72.5     0.0     0.0     0.0    20.0  | 3.16  3.43  20.63  27.22
  H(180°)        0.0    0.0    0.0    70.0    10.0     2.5    17.5  | 2.20  1.72  22.50  26.42
  H(240°)        0.0    0.0    0.0     2.5    75.0     2.5    20.0  | 2.39  2.51  18.75  23.65
  H(300°)        0.0    0.0    0.0     0.0     5.0    72.5    22.5  | 2.57  2.18  20.63  25.38
  H(∅)           0.0    0.0    0.97    0.0     0.0     0.97   98.05 | 2.17  1.93   1.46   5.56
  Overall average total cost: 20.45

An Asus Xtion RGB-D camera attached to the right wrist of a PR2 robot was used as the mobile sensor. As before, the sensor's task was to detect if any Handlebottles ($\mathcal{I} = \{\alpha_H\}$) are present on a cluttered table and estimate their poses. Fig. 5.4 shows the experimental setup. Twelve table setups were used, each containing 2 instances of the object of interest and 8-10 other objects. Ten repetitions were carried out for each setup, which in total corresponded to 40 true positive cases for every hypothesis. The results are summarized in Table 5.2.

Figure 5.4: An example of the experimental setup (left), which contains two instances of the object of interest (Handlebottle). A PR2 robot with an Asus Xtion RGB-D camera attached to the right wrist (middle) employs the nonmyopic view planning approach for active object classification and pose estimation. In the robot's understanding of the scene (right), the object which is currently under evaluation is colored yellow. Once the system makes a decision about an object, it is colored green if it is of interest, i.e. in $\mathcal{I}$, and red otherwise. Hypothesis $H(0°)$ (Handlebottle with yaw 0°) was chosen correctly for the green object. See the video at https://tinyurl.com/y99rpsyg for more details.
The performance obtained in the real experiments is comparable to the simulation results. On average, more movement and more measurements were required to make a decision in practice than in simulation, which can be attributed to the fact that the VP-Tree and the observation model were trained in simulation but were used to process real observations. The VP-Tree scores were sometimes inconsistent, which caused the hypotheses' probabilities to fluctuate, and the sensor took longer to make decisions. Still, the results from the experiments are very satisfactory, with an average accuracy of 76% for true positives and 98% for true negatives.

To demonstrate that our approach can handle more complicated scenarios, several experiments were performed with two objects of interest (Handlebottle and Watercan), $\mathcal{I} = \{\alpha_H, \alpha_W\}$, and 53 hypotheses associated with likely poses for the two objects. See the video from Fig. 5.4 for more details.

The detection process demonstrated in the video takes a long time. One reason is that the cost of an incorrect decision was set very high compared to the cost of moving, in order to minimize the mistakes made by the observer. As a result, the sensor takes many measurements, but changing this behavior simply requires adjusting $\lambda$. There are several other aspects of our framework that slow down the processing and need improvement, however. First, the occlusion model should be used in the planning stage to avoid visiting viewpoints with limited visibility. Second, as an artifact of the way we trained the observation model, the sensor has to turn towards the centroid of every object in its field of view. This is slow and undesirable. The observation model can be modified, at the expense of a more demanding training stage, to include sensor poses which do not face the object's centroid.
Finally, an unavoidable computational cost is due to the feature extraction from the observed surfaces and the point cloud registration needed to localize the sensor in the global frame, as our method assumes that the sensor has accurate self-localization.

5.8 Conclusions

This work addressed the problem of classification and pose estimation of semantically important objects by actively controlling the viewpoint of a mobile depth sensor. To alleviate the difficulties associated with single-view recognition, we formulated hypotheses about the class and orientation of an unknown object and proposed a soft detection strategy, in which the sensor moves to increase its confidence in the correct hypothesis. Nonmyopic planning was used to select views which balance the amount of energy spent for sensor motion with the benefit of decreasing the probability of an incorrect decision.

The validity of our approach was verified both in simulation and in real-world experiments with an RGB-D camera attached to the wrist of a PR2 robot. The performance of the nonmyopic view planning approach was compared to greedy view selection and to traditional static detection. The results show that the active approaches provide a significant improvement over static detection, while the nonmyopic approach outperforms the greedy method, but not significantly. The main advantage of nonmyopic planning over the greedy approach is the adaptive stopping criterion, which depends on the observations received online. Our framework has several other advantages over existing work. The idea of quantifying the likelihood of the sensor observations using a probabilistic observation model (Sec. 5.4) is general and applicable to real sensors. The proposed planning framework is independent of the static object detector and can be used with various existing algorithms in machine perception. Finally, instead of using an information-theoretic cost, the probability of error is minimized directly.
The drawback of our approach is that it requires an accurate estimate of the sensor pose and contains no explicit mechanism to handle occlusions during the planning stage. Moreover, the sequence of views is selected with respect to a single object instead of all objects within the field of view.

5.9 Challenges in Active Perception

Despite the demonstrated advantages of our nonmyopic approach over conventional static and greedy detection methods, there are shortcomings that are inherent to active approaches. Active approaches are fundamentally not capable of dealing with heavy occlusion or the problem of segmentation (Section 5.6.1). In the event of light clutter, an occlusion model can be included in the planning stage, but this will necessitate re-planning, as the motion policy will no longer be computable offline. In the presence of heavy clutter, even a high-fidelity occlusion model falls short, since there are objects directly in the line of view of the sensor that need to be physically moved out of the way for the object of interest to be revealed.

Since Active Perception methods cannot physically alter the state of the environment, they fail when objects are not well separated or are occluded by clutter. To address some of these problems, we look to leverage the benefits of Interactive Perception for pose estimation in the next chapter.

Chapter 6 Interactive 3D Object Recognition

6.1 Introduction

For a robot to autonomously manipulate its environment, it needs to have the ability to detect objects and estimate their pose. Typically, this problem is treated as a static estimation problem, where given a single-view representation (image or point cloud) as input, the pose of the target object is estimated. Though state-of-the-art approaches to this problem can handle reasonable amounts of clutter and occlusions, when objects are heavily occluded, single-view methods fail by producing false positives or by entirely missing object instances.
Methods dealing with single-view object recognition in 3D and their shortcomings have been discussed in great detail in Chapter 4. Similarly, as highlighted in Section 5.9 and in Atanasov et al. (2014); Sankaran et al. (2013), active methods are also inadequate when the occluding objects need to be moved out of view to uncover a target object. Active methods also fail when segmentation needs to be addressed before the question of classification or pose estimation. When objects are touching each other in a cluttered scene, the segmentation problem can only be addressed by separating the objects through physical contact. These actions are naturally outside the realm of active methods.

To alleviate these shortcomings, we turn to Interactive Perception to study the utility of forceful physical interaction with the environment in order to aid perception. In the Interactive Perception setting, the robot has the ability to induce physical changes in the state of the environment through actions that establish contact with the environment. Recent surveys Sankaran et al. (2017); Bajcsy et al. (2018) have shown that complex perception problems can be simplified with the aid of movement and interaction. While active and interactive approaches may make the associated perception problem easier by optimizing the sensory input or the environment structure, the challenge is to find an optimal sequence of actions.

In this chapter, we exploit this insight to enable a robot to interactively gather information about a target object in an unknown environment in order to estimate its pose. We formulate the problem using the interactive information acquisition formulation introduced in Eq. 2.5 in Chapter 2.1. This is unlike active view selection approaches, where non-myopic solutions have been proposed for the problem of object recognition Atanasov et al. (2014); Karasev et al. (2012); Doumanoglou et al.
(2016), solving for non-myopic solutions to the recognition problem in the interactive domain is not tractable. This is due to the lack of accurate models capable of predicting the evolution of the environment over large time horizons, as demonstrated in Fig 6.1. Such forward models are critical for non-myopic planning, and constructing them is especially intractable for a large collection of objects in the presence of contact. Hence, such non-myopic solutions do not translate well to interactive information gathering. Moreover, the ability of a robot to interactively gather information becomes especially crucial when the target object is occluded from all angles.

Figure 6.1: The figures above show the fundamental difference in the process models between active and interactive methods. As shown in the figure on the left, active methods only change the view point. This can be easily approximated for tractability to model the behavior of the classification algorithm as shown in Chapter 5.3.3. This helps us derive non-myopic solutions for the active problem setting. In contrast, interactive methods need to model the evolution of an environment as a consequence of contact. When there are multiple objects in contact in the scene this model is non-trivial to approximate over large time horizons. Hence we resort to myopic planning.

For these reasons, we resort to greedy solutions for the interactive recognition problem. In this thesis, we present two distinct methods for interactive recognition. We first learn greedy control policies through max margin learning. This is discussed in Sec 6.3. Though this works well for the simplified problem setting, it does not scale well to the real world problem. To alleviate this issue, we later develop an information theoretic approach for greedy closed loop planning. This is discussed in detail in Sec 6.4. We demonstrate our approach on an interactive object recognition problem in dense clutter using the Gazebo simulator.
6.2 Related Work

Interactive Recognition and Pose Estimation, though an area of robotics research still in its infancy, holds its roots in ideas explored in the late 80s & early 90s. The earliest research to propose the use of physical interaction to aid perception was demonstrated by Tsikos and Bajcsy (1991). They used a robot arm to make a scene simpler for a vision system. They specifically looked into the problem of separating a random heap of objects into sets of similar shapes. During the same period Bajcsy (1989); Bajcsy and Sinha (1989) proposed a looker and feeler system that performed material recognition for recognizing potential footholds for legged locomotion. Approaches addressing Interactive Recognition in this era either looked at very simplified versions of the problem or used hand-crafted exploration strategies to aid perception.

More recently Interactive Pose Estimation has seen considerable progress in the realm of haptic perception. For instance, Hebert et al. (2013); Petrovskaya and Khatib (2011) demonstrate approaches for pose estimation and model based localization via haptic recognition. Similarly Koval et al. (2015) proposed a manifold particle filter for solving the problem of haptic pose estimation. These approaches do not actively select actions to aid perception, but merely use contact as a consequence for decision making. In contrast, Javdani et al. (2013) exploit notions of submodularity to address the problem of action selection to aid perception. All of the aforementioned approaches, including Javdani et al. (2013), assume the object is isolated for haptic recognition and do not consider clutter, occlusions or other sources of error, which are the traditional sources of error for visual perception. The problem of pose estimation for isolated objects is a fairly well solved problem in visual perception (Chapter 4).

In Interactive Recognition Hausman et al.
(2014) propose an action selection strategy that minimizes expected entropy, but their approach works only with discrete poses and a limited set of objects. Hence it does not scale to continuous poses or a large database of objects. There are approaches that can handle a large database of objects, such as Sinapov and Stoytchev (2013), but this approach does not perform any action selection or reason about object-action relations. They partition sensorimotor experience to disambiguate objects that have already been singulated. Both Hausman et al. (2014); Sinapov and Stoytchev (2013) cannot handle clutter, occlusions or unmodeled objects, which are the main sources of uncertainty in visual recognition. There are approaches such as the one proposed by Dogar et al. (2013) which do not require objects to be singulated, but this approach does not incorporate feedback from manipulation to improve perception. Other methods in the realm of "search for target objects", such as Wong et al. (2013), incorporate feedback from manipulation, but their complex co-occurrence modeling only reasons about object types and not poses. Both the aforementioned methods, Dogar et al. (2013); Wong et al. (2013), cannot handle a large degree of clutter in the scene or imperfect segmentation that occurs as a consequence of objects being in contact. Then there are approaches that analyze the relations between objects and actions, such as Kjellström et al. (2010), where affordances are learned by observing human demonstrations. But in this approach the actor and the sensor are independent agents, hence the modeling does not readily translate to active or interactive systems. Finally there are approaches that do not explicitly perform object recognition or pose estimation, but use notions of active and interactive actions to enable scene exploration (Bohg et al., 2010).
Though there has been work in interactive perception for object segmentation or reconstruction, interactive perception as an approach has not yet been applied to the object recognition problem (Sankaran et al., 2017). To demonstrate the utility of our approach, we endow the robot with the ability to interact on top of the state-of-the-art object detection and pose estimation pipeline discussed in Chapter 4. We show that our approach succeeds in pose estimation where the state-of-the-art recognition pipeline fails due to its single view approach to the problem. This is demonstrated in Figure 6.2. The image on the left shows the target objects in the scene, which are highlighted by a green bounding box. The figure on the right shows the camera view received by the robot, where one of the target objects is entirely occluded from view. This is a typical case where the ability to interact with the environment can help the recognition problem.

Figure 6.2: The image on the left shows the simulation environment where the robot is looking for a target object. Instances of the target object in the scene are marked with a green square. As shown, only one of the objects is visible from the viewpoint of the robot, as shown in the right image. This instance is partially occluded (marked green). The second instance is fully occluded (marked red).

The remainder of this chapter is organized as follows. First we discuss a max margin formulation for learning greedy control policies from demonstration in Sec 6.3. Then in Sec 6.4 we introduce a formulation for interactive information acquisition for object recognition. We then summarize the chapter and draw conclusions in Sec 6.5. Finally, in Sec 6.6 we discuss the challenges associated with interactive perception.
6.3 Learning Greedy Control Policies from Demonstration

As shown in Figure 6.1 and discussed in Section 6.1, modeling the evolution of the environment is a significant and intractable challenge when producing non-myopic solutions to the optimization problem introduced in Eq 2.5. Methods have been proposed to predict the evolution of the environment in SE(3) under affine perturbations (Byravan and Fox, 2017). These approaches are limited by the number of objects they can reason about in the environment. They are also not capable of predicting the movement of objects in or out of occlusion. Due to these limitations, we ask ourselves if it is possible to circumvent the problem of reasoning about the environment and construct functions that directly map observations of the environment to control policies in order to solve the problem of pose estimation in clutter.

Approaches like (Ratliff et al., 2009; Ross and Bagnell, 2010) have shown how to apply the principles of max margin learning (Chap 3.3) to structured prediction problems. These are typically called max-margin planning solutions, where the functions that map observations to control policies are learned from expert demonstrations. We adopt a similar approach to learn a linear mapping function m that maps observations z_i to greedy control policies u_i. We assume that our space of observations and controls is discrete, i.e., z \in Z and u \in U. More formally, the learning problem can be written as

Learn m : Z \to U, \quad m_w(z) = \arg\max_{u} \sum_{i=1}^{k} w_i \, n_i(z, u) = \arg\max_{u} w^\top n(z, u) \quad (6.1)

where n are the basis or feature functions, k is the number of feature functions, and w are the coefficients of the feature functions n that define our linear mapping function m. We learn this function from expert demonstrations by solving the following Inverse Optimal Control problem. Let our observations z be point cloud templates from the environment.
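As a minimal sketch of the argmax policy in Eq 6.1 (with a hypothetical toy feature function n and a small discrete control set, not the thesis's actual point-cloud template features):

```python
import numpy as np

def greedy_policy(w, n, z, controls):
    """Eq 6.1: pick the control u maximizing the linear score w^T n(z, u)."""
    scores = [float(w @ n(z, u)) for u in controls]
    return controls[int(np.argmax(scores))]

# Toy feature function (purely illustrative): the observed value at the cell a
# control would act on, plus an indicator feature for control 1.
def n(z, u):
    return np.array([z[u], float(u == 1)])

w = np.array([1.0, 0.5])          # learned weights
z = np.array([0.2, 0.9, 0.4])     # toy observation
u_star = greedy_policy(w, n, z, [0, 1, 2])   # control 1 scores highest here
```

The policy is nothing more than an exhaustive scoring of the discrete control set, which is why the quality of the learned weights w (Eq 6.2) fully determines its behavior.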
For each of these templates the expert picks a greedy control policy u to optimize the performance of the static object recognition system (Chap 4). Here we make two assumptions about the model:

• The expert's policy through the state space is optimal, where the expert maximizes the single step decision utility of the general decision utility introduced in Eq 2.9
• The state is fully observable

With a collection of expert policies u for each point cloud template z we construct a library {(z_0, u_0), ..., (z_N, u_N)} with N expert demonstrations. This library consists of expert labeled observations, where we have a greedy control policy picked by the expert for each observation. With this library we can learn the weights for the mapping function m : Z \to U using a standard max-margin network learning algorithm Tsochantaridis et al. (2005) in the following manner:

\min_{w, \beta, \delta} \; \frac{1}{2} w^\top w + \alpha \sum_{i=1}^{N} \delta_i \quad (6.2)

s.t. \; w^\top n(z_i, u_i) + \beta \geq 1 - \delta_i
\quad\; w^\top n(z_i, u_j) + \beta \leq -1 + \delta_i, \; \forall j \neq i
\quad\; \delta_i \geq 0

where \delta_i is the slack variable and \alpha is the soft margin parameter in standard max-margin network learning algorithms, w is the set of weights we learn, and \beta is the bias term.

We demonstrated the validity of this approach in a proof-of-concept low dimensional simulation setup for object pose estimation in dense clutter where the target object is entirely occluded (Sankaran et al., 2015). The details of these experiments are discussed in Section 6.3.1.

6.3.1 Low Dimensional Simulation Setup

We emulate the problem of interactive action selection for object pose estimation in clutter with a 2 dimensional simulation setup. In this simplified scenario a target object of known shape (H-shaped structure) is entirely occluded in a 2D grid world. The objective of the system is to accurately determine the pose of this hidden target object by uncovering the 2D grid world with greedy control policies. This is shown in Figure 6.3.
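A minimal subgradient sketch of the program in Eq 6.2, on toy features and a two-control set (the thesis instead uses the solver of Tsochantaridis et al. (2005) on point cloud template features; the dataset below is hypothetical):

```python
import numpy as np

def train_max_margin(examples, n_controls, alpha=1.0, lr=0.05, epochs=200):
    """Subgradient descent on Eq 6.2: drive the expert control's score above +1
    and every other control's score below -1 (shared bias beta, soft margin alpha).
    examples: list of (feats, u_star) where feats[u] = n(z_i, u)."""
    d = len(examples[0][0][0])
    w, beta = np.zeros(d), 0.0
    for _ in range(epochs):
        for feats, u_star in examples:
            for u in range(n_controls):
                s = w @ feats[u] + beta
                if u == u_star and s < 1.0:      # violated expert-margin constraint
                    w += lr * alpha * feats[u]
                    beta += lr * alpha
                elif u != u_star and s > -1.0:   # violated competitor constraint
                    w -= lr * alpha * feats[u]
                    beta -= lr * alpha
            w -= lr * w                          # gradient of the (1/2) w^T w term
    return w, beta

# Two linearly separable toy demonstrations: feats[u] = n(z_i, u).
examples = [
    ([np.array([1.0, 0.0]), np.array([0.0, 1.0])], 0),
    ([np.array([0.0, 1.0]), np.array([1.0, 0.0])], 1),
]
w, beta = train_max_margin(examples, n_controls=2)
```

After training, the learned weights reproduce the expert's greedy choice on each demonstration via the argmax of w^T n(z, u) + beta.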
In this simulation setup, the interaction actions available to the cyan colored agent are to uncover one of the 8 grid cells around its current location. Depending on the cells uncovered, the belief regarding the distribution of poses of the target object is updated. This agent is learned from expert demonstrations through the max margin learning framework given in Eq 6.2. The expert has the same set of actions available to the agent.

Figure 6.3: The first figure shows the simulation environment. The learned (cyan) agent has knowledge about the entire state of the environment, but can only manipulate the cells in its immediate vicinity (8-connected neighborhood), shown by the red arrows. In the first figure, the dark colored cells are occluded cells and the light colored cells are empty environment cells. The second figure shows the 2D pose of the hidden target object (red cells). The third figure shows a snapshot at time t of the evolution of the agent's belief about the distribution of the poses of the target object. This distribution of possible poses is shown by the green cells and the other occluded cells are shown in red. The blue cells are unoccluded environment cells. The final figure shows the result on convergence, where the agent's belief about the target object pose has converged to a single pose estimate. A video of this simulation can be viewed at https://tinyurl.com/ycsnzncv

The first figure in Figure 6.3 shows the simulation environment visible to the agent. Though the agent has knowledge about the entire state of the environment, it can only manipulate the cells in its immediate vicinity (8-connected neighborhood). In the first figure the dark colored cells are occluded cells and the light colored cells are empty environment cells. The second figure shows the 2D pose of the hidden target object. The third figure shows a snapshot at time t of the evolution
This distribution of poses is dictated by the uncovered cells in the environment and the possible poses of the target object in the occluded environment. The final figure in Figure 6.3, shows the result on convergence, when the agent’s belief about the target object pose converges to a single pose estimate. We represent the distribution of the target object’s pose by a discrete set and prune this set based on the state of the environment. We learn the agent on inverse distance transform features computed on the environment from the current distribution of poses. We then template patches on this feature representation. These templates are used as the observations z for the expert selected greedy control policy u. We then learn the function m :Z→U from these expert demonstrations. We compare the learned greedy policym ∗ learned against a simple greedy heuristic m greedy . In the greedy heuristic maximizes the feature value from the agent’s action set u = argmax l∈U n(z, l) We compared the two greedy policies over 100 different trials with 10 random poses of the hidden target object and each of the 10 poses had 10 different initial- izations for the agent. The results are tabulated in Table 6.1. The results show the mean number of actions taken over the successful trials, out of the 10 trials. The mean for m ∗ learned was 9.73 as compared to m greedy ’s 13.58 and the standard deviation form ∗ learned 2.29 as opposed tom greedy ’s 8.34 which is four times as much. We can see that the learned greedy policy behaves better than a well engineered heuristic. 82 1 2 3 4 5 6 7 8 9 10 m ∗ learned 12.3 8.4 8.1 10.1 7.1 8.3 7.1 10.1 12.6 13.2 m greedy 25.6 9.3 8 6.4 7.8 13.8 8.3 7.1 28.5 21 Table 6.1: Results of Learned Greedy Policy vs Greedy Heuristic Policy 6.3.2 High Dimensional Simulation Setup We then extended the setup discussed in Sec 6.3.1 to a real world manipulation problem with the same max margin learning framework introduced in Sec 6.3. 
The goal here is to find a sequence of actions to manipulate the environment in order to accurately detect and estimate the pose of a target object. In this real world setup, the pose of the target object is estimated by the state-of-the-art static object recognition framework introduced in Chapter 4. The objective of the learned greedy control policy was to learn manipulation actions that reduce the uncertainty of the static recognition algorithm.

In our high dimensional simulation setup we used the PR2 robot fitted with an RGBD sensor in a Gazebo simulation environment (Koenig and Howard, 2004). This setup is shown in Figure 6.2. The observations z in this setup were fixed size point cloud templates around the location of manipulation. The actions u were parametrized by a 3D location on the surface of the point cloud and a push direction represented as a quaternion. In the training phase, the expert demonstrated actions that optimized the performance of the static recognition system. Our initial experiments showed that the learning problem was ill posed and did not scale well to the real world setup. The greedy control policies learned from expert demonstration did not help the static recognition algorithm converge on the pose of the target object.

6.4 Interactive Information Acquisition for 3D Object Recognition

As we saw in Section 6.3, the problem of learning greedy control policies for interactive object recognition is ill posed for high dimensional systems. To circumvent this issue, we pose the problem of interactive object recognition as a problem of interactive information acquisition as shown in Eq 2.5, 2.9 and use conditional entropy (Eq 2.1.1) as our information measure to optimize. With this setup in place we define a procedure for greedy closed loop planning by sequentially selecting actions that maximize the expected decision utility introduced in Eq 2.9.
In the rest of this section we first define the recursive estimation problem in Sec 6.4.1. We then discuss the observation model used in the estimation procedure, the information measure, and the approximations we make to tractably compute this information measure, also in Sec 6.4.1. Finally, we discuss the experiments we performed to validate our approach in Sec 6.4.2.

6.4.1 Problem Formulation

Let b be the three dimensional pose of a known target object, i.e., we have a three dimensional CAD model of the object of interest. The environment is represented as an occupancy depth map m, where the occupancy of each pixel of the depth map is modeled by a binary occupancy variable o_t = {o_t^i}; o_t^i = 0 means the pixel m^i is empty and o_t^i = 1 indicates that it is occupied. The observations z_t are range measurements from a depth sensor. The controls u_t alter the physical state of the environment. Our objective is to estimate the joint probability of the pose of the target and the occupancy of the depth map given the history of observations and controls, i.e., p(b_{1:t+1}, o_{1:t+1} | z_{1:t+1}, u_{1:t+1}). The sensor state is fixed, i.e., x_{t+1} = x_t.

Figure 6.4: This figure shows the graphical model for the interactive object detection problem. It captures the independences in the problem.

The independences of our problem are captured by the graphical model shown in Fig 6.4. Due to the form of our observation model discussed in Sec 6.4.1 and the process model, the depth map occupancy variables o_{1:t+1} can be marginalized out, but the pose variables b_{1:t+1} cannot. Hence the joint posterior we are trying to estimate becomes p(b_{1:t+1}, o_{t+1} | z_{1:t+1}, u_{1:t+1}).
Given the graphical model in Figure 6.4, our estimation problem can be defined as estimating the posterior distribution

p(b_{1:t+1}, o_{t+1} | z_{1:t+1}, u_{1:t+1}) = p(o_{t+1} | b_{1:t+1}, z_{1:t+1}, u_{1:t+1}) \, p(b_{1:t+1} | z_{1:t+1}, u_{1:t+1}) \quad (6.3)

Recursive Estimation of Depth Map Occupancy

The estimation problem in Eq 6.3 has two quantities of interest that we need to compute: the occupancy depth map given the distribution of object poses, p(o_{t+1} | b_{1:t+1}, z_{1:t+1}, u_{1:t+1}), and the distribution of object poses, p(b_{1:t+1} | z_{1:t+1}, u_{1:t+1}). From the independence assumptions in Fig 6.4, the occupancy depth map can be factorized as p(o_{t+1} | b_{1:t+1}, z_{1:t+1}, u_{1:t+1}) = \prod_i p(o_{t+1}^i | z_{1:t+1}^i, b_{1:t+1}, u_{1:t+1}).

Since the primary quantity of interest is the distribution of object poses given the history of measurements and controls, we perform inference using a Rao-Blackwellized Particle Filter where the poses b^l are distributed according to p(b_{1:t+1} | z_{1:t+1}, u_{1:t+1}). The depth map occupancy term can be computed recursively using the Markov assumption made earlier and Bayes rule, as shown in Eq 6.4. For ease of readability we denote B_{t+1} = b_{1:t+1}, B_t = b_{1:t}, U_t = u_{1:t}.

p(o_{t+1}^i | Z_{t+1}^i, B_{t+1}, U_{t+1}) = \frac{\sum_{o_t^i} \left[ p(z_{t+1}^i | b_{t+1}, o_{t+1}^i) \, p(o_{t+1}^i | o_t^i) \, p(o_t^i | Z_t^i, B_t, U_t) \right]}{\sum_{o_{t+1}^i, o_t^i} \left[ p(z_{t+1}^i | b_{t+1}, o_{t+1}^i) \, p(o_{t+1}^i | o_t^i) \, p(o_t^i | Z_t^i, B_t, U_t) \right]} \quad (6.4)

The occupancy transition p(o_{t+1}^i | o_t^i) is modeled similar to Wüthrich et al. (2013). The observation model p(z_{t+1}^i | b_{t+1}, o_{t+1}^i) is discussed in Sec 6.4.1. We compute the probability of an observation assuming a particular pose b_k. In the recursive filter described above, the occupancy beliefs are updated depending on the best control action u*_t. The various control actions available to the sensor are evaluated, and the one that maximizes the information gain of the system is executed. This is discussed in Sec 6.4.1.
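A minimal numeric sketch of the per-pixel update in Eq 6.4, for a single depth-map pixel and a single pose sample (the transition matrix and likelihood values below are illustrative placeholders, not the models used in the thesis):

```python
import numpy as np

def occupancy_update(prior_occ, p_trans, p_z_given_o):
    """One recursive update of a single pixel's occupancy belief (Eq 6.4).
    prior_occ:    p(o_t^i = 1 | Z_t^i, B_t, U_t), a scalar in [0, 1]
    p_trans:      2x2 matrix with p_trans[j, k] = p(o_{t+1}^i = j | o_t^i = k)
    p_z_given_o:  [p(z_{t+1}^i | b_{t+1}, o=0), p(z_{t+1}^i | b_{t+1}, o=1)]
    Returns p(o_{t+1}^i = 1 | Z_{t+1}^i, B_{t+1}, U_{t+1})."""
    prior = np.array([1.0 - prior_occ, prior_occ])   # belief over o_t in {0, 1}
    predicted = p_trans @ prior                      # inner sum over o_t (numerator)
    joint = np.asarray(p_z_given_o) * predicted      # multiply in the observation model
    return float(joint[1] / joint.sum())             # normalize over o_{t+1} (denominator)

# Static occupancy transition and a measurement that favors "occupied".
posterior = occupancy_update(0.5, np.eye(2), [0.2, 0.8])
```

With an uninformative 0.5 prior and an identity transition, the posterior simply follows the relative measurement likelihoods, which is a quick sanity check of the normalization in Eq 6.4.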
A 2D illustration of the occupancy depth map recursive update is shown in Figure 6.5.

Occupancy Aware Stochastic Sensor Model

We model the probability of a sensor measurement using a beam model as defined in Thrun et al. (2005). The actual measurement at a particular depth map pixel, p(z^i | b, o^i), is modeled using a variant of the beam based model introduced in Wüthrich et al. (2013), without the heavy tail assumption. The original measurement model was used to model occlusions; we adapt this model to model occupancy given a particular object pose b^l. Using this model, the recursive update of the occupancy of a depth map pixel cell p(o_{t+1}^i | Z_{t+1}^i, B_{t+1}, U_{t+1}) can be computed from the observation model p(z_{t+1}^i | b_{t+1}, o_{t+1}^i) and the previous value p(z_t^i | b_t, o_t^i).

Figure 6.5: The figure above shows the update of the occupancy depth map for a given object pose. The target object is shown in yellow. The pose of the target object evolves from b_t to b_{t+1} on applying control u_t. This updates the occupancy of the pixel o_t^i to o_{t+1}^i. The sensor is shown as a green circle with sensing rays.

Recursive Estimation of Pose Distribution

With the definitions and the independence assumptions in Fig 6.4, the distribution of poses p(b_{1:t+1} | z_{1:t+1}, u_{1:t+1}) can be decomposed as follows

p(b_{1:t+1} | Z_{t+1}, U_{t+1}) \propto p(z_{t+1} | Z_t, B_{t+1}) \, p(b_{t+1} | u_{t+1}, b_t) \, p(B_t | Z_t, U_t)

The process model p(b_{t+1} | u_{t+1}, b_t) is computed by adding independent zero mean Gaussian noise to the translation and rotation of the object pose after applying a particular control input u_{t+1}. We compute p(B_{t+1} | Z_{t+1}, U_{t+1}) by propagating samples from the previous time step according to the process model p(b_{t+1} | u_{t+1}, b_t) and re-sampling them according to the likelihood p(z_{t+1} | Z_t, B_{t+1}).
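One propagate-weight-resample step of the pose filter can be sketched as follows; a toy 1-D "pose" and a hypothetical Gaussian likelihood function stand in for the 6-DOF pose and the occupancy-aware likelihood of the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbpf_pose_step(poses, weights, control, likelihood_fn, noise=0.05):
    """Propagate pose samples through the noisy process model p(b_{t+1}|u_{t+1},b_t),
    weight them by the measurement likelihood p(z_{t+1} | Z_t, B_{t+1}),
    and resample proportionally to the weights."""
    poses = poses + control + rng.normal(0.0, noise, size=poses.shape)
    w = np.array([likelihood_fn(b) for b in poses]) * weights
    w = w / w.sum()
    idx = rng.choice(len(poses), size=len(poses), p=w)
    return poses[idx], np.full(len(poses), 1.0 / len(poses))

# Toy example: the true (hidden) pose is 1.0 and the control applies no motion.
true_pose = 1.0
likelihood = lambda b: np.exp(-0.5 * ((b - true_pose) / 0.1) ** 2)
poses = rng.uniform(-2.0, 2.0, size=200)
weights = np.full(200, 1.0 / 200)
for _ in range(3):
    poses, weights = rbpf_pose_step(poses, weights, 0.0, likelihood)
```

After a few steps the particle set concentrates around the true pose, which is the behavior the filter relies on once informative observations arrive.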
Since the observations are dependent on the map occupancy, we can compute the likelihood as

p(z_{t+1}^i | Z_t^i, B_{t+1}) = \sum_{o_{t+1}^i, o_t^i} \left[ p(z_{t+1}^i | b_{t+1}, o_{t+1}^i) \, p(o_{t+1}^i | o_t^i) \, p(o_t^i | Z_t^i, B_t, U_t) \right]

This gives us the observation likelihood for a single depth map occupancy cell for a given pose sample. To extend this to the entire occupancy depth map we can rewrite it as

p(z_{t+1} | Z_t, B_{t+1}) = \prod_i \sum_{o_{t+1}^i, o_t^i} \left[ p(z_{t+1}^i | b_{t+1}, o_{t+1}^i) \, p(o_{t+1}^i | o_t^i) \, p(o_t^i | Z_t^i, B_t, U_t) \right]

Hence the estimate of the pose depends on the estimate of the occupancy depth map through the likelihood. The particle filter algorithm for recursively updating the distribution of poses is as follows:

• Sample poses p(b_{1:t} | z_{1:t}, u_{1:t}) → (1)
• Maintain and recursively update the occupancy depth map per object pose
– Update the pose samples using the process model p(b_{t+1} | u_{t+1}, b_t) → (2)
– Update the occupancy depth maps p(o_{t+1} | b_{1:t+1}, z_{1:t+1}, u_{1:t+1}) → (3) based on observations z_{t+1} and compute the likelihood p(z_{t+1} | z_{1:t}, b_{1:t+1}) → (4)
– Resample the object poses according to the likelihood to get the updated object poses p(b_{1:t+1} | z_{1:t+1}, u_{1:t+1}) → (5)

Figure 6.6: The figure above shows the recursive update of the distribution of poses. This procedure is explained below.

The recursive particle filter update procedure explained above can be visualized in Figure 6.6.

Information Gain for Interactive Object Detection in a Rao-Blackwellized Particle Filter

With our recursive estimation in place, we need to pick a greedy control policy that maximizes an information measure of choice to solve the interactive information acquisition problem defined in Eq 2.5, 2.9. We use conditional entropy (Eq 2.1.1) as our information measure to optimize and pick control policies that maximize information gain for our interactive object pose estimation problem.
To compute the information gain for interactive object pose estimation, we need to compute the expected change in entropy of the joint distribution p(b_t, o_t | U_t, Z_t) after executing control \hat{u}_{t+1}. Executing control \hat{u}_{t+1} updates the current object pose b_t to \hat{b}. In order to evaluate all controls u \in U, such that we can select the next environment state that maximizes our decision utility (Eq 2.9), we need to be able to compute the expectation of this quantity. The expectation is dependent on the expected information gain and an action utility associated with each control u \in U.

Now, to compute the expected information gain of the joint distribution p(b_t, o_t | U_t, Z_t), let us posit that \hat{z} is the expected observation we receive after executing control u_t. Then the change in entropy can be calculated as

I(\hat{z}, \hat{u}_{t+1}) = H(p(b_t, o_t | Z_t, U_t)) - H(p(b_t, o_t, \hat{b} | Z_t, U_t, \hat{z}, \hat{u}_{t+1}))

Since we don't know which measurement we will receive, we need to integrate over all possible measurements to compute the expected information gain. This can be computed as follows

E[I(\hat{u}_{t+1})] = \int_{\hat{z}} p(\hat{z} | \hat{u}_{t+1}, Z_t, U_t) \, I(\hat{z}, \hat{u}_{t+1}) \, d\hat{z} \quad (6.5)

The probability of an expected observation p(\hat{z} | \hat{u}_{t+1}, Z_t, U_t) can be approximated by integrating over all possible poses and depth map occupancies:

p(\hat{z} | \hat{u}_{t+1}, Z_t, U_t) \quad (6.6)
= \int_{b_{1:t}, o_t} p(\hat{z} | \hat{u}_{t+1}, b_{1:t}, o_t, Z_t, U_t) \, p(b_{1:t}, o_t | Z_t, U_t) \, db_{1:t} \, do_t
= \int_{b_{1:t}, o_t} p(\hat{z} | \hat{u}_{t+1}, b_{1:t}, o_t, Z_t, U_t) \, p(b_{1:t} | Z_t, U_t) \, p(o_t | b_{1:t}, Z_t, U_t) \, db_{1:t} \, do_t

If we assume that our posterior is represented by a set of particles, we can rewrite Eq 6.6 as

p(\hat{z} | \hat{u}_{t+1}, Z_t, U_t) \approx \sum_{l=1}^{N} p(\hat{z} | \hat{u}_{t+1}, b_{1:t}^{[l]}, o_t^{[l]}, Z_t, U_t) \, w^{[l]} \, p(o_t^{[l]} | b_{1:t}^{[l]}, Z_t, U_t) \quad (6.7)

where N is the number of particles. With the assumptions of the Rao-Blackwellized particle filter, the term p(o_t^{[l]} | b_{1:t}^{[l]}, Z_t, U_t) can be computed analytically.
The term p(\hat{z} | \hat{u}_{t+1}, b_{1:t}^{[l]}, o_t^{[l]}, Z_t, U_t) can be computed by ray casting for a particular object pose b_{1:t}^{[l]}. Hence the discrete posterior over the possible observations can be computed with the depth occupancy map containing that particle, weighted by the likelihood of that particle.

As the number of particles increases this operation quickly becomes intractable. Instead, we approximate this operation similar to Stachniss et al. (2005) by drawing a pose b_k from the particle set with a probability proportional to its weight. We then make a crucial assumption to make our interactive information acquisition problem tractable.

Assumption 6.1: A simulated control \hat{u} only induces a minor rigid transformation on the pose b_t of the target object.

With this assumption we can easily propagate the sampled pose b_k for a particular greedy control policy. We then generate an observation \hat{z} with this particle. We use this observation to approximate E[I(\hat{u}_{t+1})] using Eq 6.5, where instead of computing the integral we compute the weighted sum, in which the information gain computation of the sampled particle is weighted by the likelihood of that particle. In practice we only sample a single particle according to its likelihood. We can repeat this procedure for different controls u \in U and pick the control policy \hat{u}_{t+1} that maximizes the decision utility in Eq 2.9. This gives us a greedy closed loop policy for interactive information acquisition for object recognition. Once we execute the greedy policy, we can update the posterior over the joint distribution p(b_{t+1}, o_{t+1} | U_{t+1}, Z_{t+1}) using the actual observation received on executing the greedy control policy \hat{u}_{t+1}.
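The single-particle approximation described above can be sketched as follows; a discrete toy belief over candidate poses and a hypothetical simulated-update function replace the full occupancy filter and ray-cast observations:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_info_gain(belief, w_l, simulate_update, control):
    """E[I(u-hat)] ~ w^[l] * I(z-hat, u-hat): the entropy drop caused by the
    observation simulated for one sampled particle, weighted by its likelihood."""
    posterior = simulate_update(belief, control)
    return w_l * (entropy(belief) - entropy(posterior))

def select_greedy_control(belief, w_l, simulate_update, controls):
    gains = [expected_info_gain(belief, w_l, simulate_update, u) for u in controls]
    return controls[int(np.argmax(gains))]

# Toy scene: four candidate poses; a "revealing" push disambiguates them,
# a "useless" push leaves the belief unchanged (both controls are hypothetical).
def simulate_update(belief, control):
    return np.array([1.0, 0.0, 0.0, 0.0]) if control == "reveal" else np.asarray(belief)

belief = np.full(4, 0.25)
best = select_greedy_control(belief, w_l=0.9, simulate_update=simulate_update,
                             controls=["useless", "reveal"])
```

Because all pushing actions carry equal cost in our experiments, maximizing the decision utility reduces to picking the control with the largest weighted entropy drop, as the sketch does.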
In summary, the interactive information acquisition problem is executed as follows:

• Sample a pose b_l from the particle set according to likelihood w^{[l]} → (1)
• Evaluate greedy control policies \hat{u} \in U
– Propagate particle b_t^l → b_{t+1}^l according to the control policy \hat{u} → (2)
– Compute the observation \hat{z} on the propagated particle b_{t+1}^l → (3)
– Compute the expected information gain on the observation \hat{z} by weighting it by the likelihood of the particle, E[I(\hat{u}_{t+1})] = w^{[l]} I(\hat{z}, \hat{u}_{t+1}) → (4)
• Pick the \hat{u}*_{t+1} that maximizes the decision utility and execute it → (5)

This entire procedure is illustrated in Figure 6.7.

Figure 6.7: The figure above shows the approximation for interactive information acquisition discussed in this section. We first sample poses according to their likelihood (1). Then the sampled poses are propagated according to the different greedy control policies we want to evaluate (2). We then compute the expected observation on the propagated poses (3). Finally, we compute the expected information gain for each of the greedy policies (4).

6.4.2 Experiments and Observations

We evaluated our approach in a Gazebo simulation environment (Koenig and Howard, 2004). The system is tasked with accurately estimating the pose of an occluded target impact wrench with a sequence of pushing actions, as shown in Fig 6.8. The utility of the actions is evaluated according to the Interactive Information Acquisition algorithm introduced in Eq 6.5 in Sec 6.4.1. The target object pose estimates are initialized based on the output of the static recognition pipeline, i.e., we sample poses around the unknown objects and detected instances of the target object. The pose of the target object is recursively estimated using the particle filter algorithm introduced in Sec 6.4.1. An assumption we make is that there is only one instance of the target object in the scene.
Our proposed approach can be extended with variations of the particle filter to estimate the poses of multiple target objects in the scene (Wüthrich et al., 2015), but currently we limit our focus to only one target object.

Figure 6.8: The image on the left shows the target impact wrench whose pose we are trying to estimate in the current environment. The image on the right shows the view of the environment from the RGBD sensor in Gazebo. The environment can be manipulated with pushing actions, which are simulated by spherical (white) force objects. The force object is meant to emulate a gripper; it weighs 5 kg, can exert a force of 5 N, and is 5 cm in diameter. The direction of push is shown by the green arrows.

The experimental setup can work with pose estimation for any target object for which we have access to a CAD model (Chap 4).

Implementation Details

In the exact experimental setup we initialize a tabletop environment in Gazebo with a random collection of objects which includes the target object. The environment is sensed using an RGBD sensor. The point cloud of the sensed environment is pre-segmented to extract the tabletop objects, similar to Sec 5.6.1.

Once the scene is initialized, we receive initial pose estimates for the target object from the static object recognition system s() (Chap 4). Based on the output of the static recognition system, we sample poses b^{[l]} around the output pose estimates of s() and the unrecognized parts (Fig 4.7) of the scene. The recursive estimation algorithm estimates the target object pose using a (pre-segmented) depth image I_s of the current scene. We render a depth image I_b for each sampled pose b^{[l]} and compute the likelihood p(z_{t+1} | z_{1:t}, b_{1:t+1}) of this pose using an observation (depth image) I_s from the scene.
This is done by comparing the observed depth image I_s with the predicted depth image I_b using the observation model (Sec 6.4.1). We then sample the most likely pose b* as shown in Fig 6.7, after which we sample pushing actions û on the surface of the point cloud of the scene such that the direction of push intersects the location of the sampled pose b*. The applied force is kept constant for all pushes (5 N). Before simulating the pushing actions, we store the current state of the environment q_t. We then use the Gazebo physics engine to simulate each of the sampled pushes {û_1, ..., û_N} and record a new depth image for each of these actions, as illustrated in Fig 6.7. The pushing actions are executed using a spherical force object, as explained in Figure 6.8. With the depth image recorded for each simulated push we can compute the likelihood, and hence the expected information gain, for that pushing action. We then select the pushing action û* that maximizes our decision utility (Eq 2.9). Note: in our current experiments, all pushing actions are given the same cost, hence the utility depends only on the information gain.

Finally, we reset the environment to its stored state q_t and execute the information-maximizing action û*. We repeat this procedure until the pose estimate of the particle filter converges to the true pose of the target object. Note: in a real-world experiment, the state of the environment q_t would not be stored. Instead, we can simulate the effect of an action using a physics engine or assume a minor rigid transformation on all objects (Assumption 6.1).

The overall algorithm for interactive information acquisition is given in Algorithm 2.
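The likelihood computation described above, comparing the observed depth image I_s against the depth image I_b rendered for a sampled pose, might look as follows. This is a hedged sketch: the pixel-wise Gaussian error model, the noise scale `sigma`, the flat-list image representation, and the handling of invalid pixels are all assumptions for illustration, not the exact observation model of Sec 6.4.1.

```python
import math

def depth_likelihood(I_s, I_b, sigma=0.02):
    """Likelihood of a pose hypothesis from observed (I_s) vs. rendered (I_b)
    depth images, given as flat lists of per-pixel depths in metres.
    Invalid / occluded pixels are marked as None and skipped."""
    sq_err, n = 0.0, 0
    for d_obs, d_pred in zip(I_s, I_b):
        if d_obs is None or d_pred is None:  # skip pixels with no depth reading
            continue
        sq_err += (d_obs - d_pred) ** 2
        n += 1
    if n == 0:                               # no overlap: no support for this pose
        return 0.0
    # Gaussian error model, normalized by the number of valid pixels
    return math.exp(-sq_err / (2.0 * sigma ** 2 * n))
```

A perfect match yields likelihood 1.0, and the likelihood decays as the rendered depth diverges from the observation, which is the behavior the resampling step in the particle filter needs.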
Algorithm 2: Interactive Information Acquisition for Object Pose Estimation
1: Input: Target object CAD model T
2: Output: Target object pose b̂
3: Initialize poses p(b_{1:t} | z_{1:t}, u_{1:t}) from static object recognition s().
4: for t = 1 to convergence do
5:   Sample a pose b^l from the particle set according to likelihood w^[l]
6:   Save the state of the environment q_t
7:   for each û ∈ U do
8:     Propagate particle b^l_t → b^l_{t+1} according to the control policy û in Gazebo
9:     Compute the observation ẑ on the propagated particle b^l_{t+1}
10:    Compute the expected information gain E[I(û_{t+1})] = w^[l] I(ẑ, û_{t+1}) on ẑ
11:  end for
12:  û*_{t+1} = argmax_{û_{t+1} ∈ U} E[I(û_{t+1})]
13:  Reset environment to q_t
14:  Execute û*_{t+1}
15:  Update the pose samples using the process model p(b_{t+1} | u*_{t+1}, b_t)
16:  Update p(o_{t+1} | b_{1:t+1}, z_{1:t+1}, u_{1:t+1}) based on observations z_{t+1}
17:  Compute the likelihood p(z_{t+1} | z_{1:t}, b_{1:t+1})
18:  Resample the poses according to the likelihood to get p(b_{1:t+1} | z_{1:t+1}, u_{1:t+1})
19:  Sample poses b^[l*] from static object recognition s()
20:  if | (1/N) Σ_{i=1}^N b^[i*] − (Σ_{j=1}^N w^[j] b^[j]) / (Σ_{j=1}^N w^[j]) | < τ then
21:    pose estimate converged, go to line 24
22:  end if
23: end for
24: Report b* = (Σ_{j=1}^N w^[j] b^[j]) / (Σ_{j=1}^N w^[j]) as the final pose

The threshold τ in line 20 of Algorithm 2 is used to measure the similarity between the pose estimated by the particle filter and the pose recognized by the static object recognition system. An example of the algorithm's convergence is shown in Figure 6.9. The images on the left are from the Gazebo physics simulator; the images on the right are from the sensor's view. The first row of images shows the algorithm's initial estimate of the target object pose, which is rendered as a green CAD model in the image on the right in the first row. The second row of images shows the result of convergence after executing multiple information-seeking interactive actions.
As can be noted in the figure on the right in the second row of images, the pose estimate converges close to the true pose of the object.

To analyze the performance of our approach, we compare it against a heuristic interactive algorithm. In the heuristic version, the system picks a greedy open-loop control policy: a push action applied to the point on the point cloud that intersects the largest number of target object pose samples. If there are multiple actions that intersect the same number of target object samples, we pick one at random. Hence, in Figure 6.10 we refer to the heuristic version of the algorithm as random.

Figure 6.9: The images above are views of the environment from the Gazebo physics simulator (left) and the sensor (right). The first row of images shows the algorithm's initial estimate of the target object pose, which is rendered as a green CAD model in the image on the right. The second row of images shows the result of convergence after executing multiple information-seeking interactive actions. As can be noted in the figure on the right, the pose estimate converges close to the true pose of the object. A video of this experiment can be viewed at https://tinyurl.com/y9x2psyz

To compare our interactive information acquisition to the greedy open-loop policy, we evaluate the tracking error over a fixed number of actions for both approaches. The tracking error is computed as the error between the ground truth pose and the estimated pose from the particle filter. We show the position and orientation estimate errors, averaged over 10 experiments, in Fig 6.10.
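The tracking-error bookkeeping behind Fig 6.10 (position error in cm, mean roll/pitch/yaw error in degrees) can be sketched as below. The (x, y, z, roll, pitch, yaw) pose tuple and the angle wrapping are assumptions about the bookkeeping, not details taken from the text.

```python
import math

def wrap_angle(a):
    """Wrap an angle difference into (-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def tracking_error(gt_pose, est_pose):
    """Position error (cm) and mean absolute roll/pitch/yaw error (degrees)
    between a ground-truth pose and an estimated pose, both given as
    (x, y, z, roll, pitch, yaw) with positions in metres and angles in radians."""
    # Euclidean position error, converted from metres to centimetres
    pos_err_cm = 100.0 * math.sqrt(
        sum((g - e) ** 2 for g, e in zip(gt_pose[:3], est_pose[:3])))
    # mean absolute wrapped angle error over roll, pitch, yaw, in degrees
    ang_errs = [abs(wrap_angle(g - e)) for g, e in zip(gt_pose[3:], est_pose[3:])]
    ang_err_deg = math.degrees(sum(ang_errs) / 3.0)
    return pos_err_cm, ang_err_deg
```

Averaging these two numbers over repeated runs at each interaction step gives curves of the kind shown for both policies in Fig 6.10.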
Observations:
• The interactive information acquisition algorithm converges to the true pose, in contrast to the heuristic algorithm.
• Despite its myopic nature, the information acquisition algorithm rarely gets stuck in local minima, compared to the heuristic algorithm.
• Local minima effects (convergence to a bad pose estimate) are eventually rectified by further manipulation in the information acquisition case. This is because the low-variance resampling in the particle filter helps correct the estimate in the presence of new evidence; the heuristic algorithm does not exploit this new evidence, and a non-interactive setting would not be able to exploit it either.

Figure 6.10: The cyan curve is the information acquisition algorithm and the blue curve depicts the greedy open-loop control policy's action selection. The figure on the left shows the convergence of the position error, measured in cm. The figure on the right shows the mean orientation error (difference of roll, pitch, yaw), measured in degrees. For both policies, 10 interaction actions are evaluated.

Other Implementation Details
• We deal with particle depletion by using a low-variance resampling strategy suggested by Thrun et al. (2005), namely stratified resampling.
• To ensure that we explore all of the unexplored pose space, we replenish the particle filter with random samples from the occlusion volume of the point cloud.
• In the current implementation, the simulations used to evaluate the utility of the various control policies are performed on a single core, which makes the implementation slow. This can be optimized by parallelizing the simulation across multiple cores.

6.5 Conclusions

This work addressed the problem of interactive object recognition using greedy control policies. We saw in Section 6.3 that policies learned from expert demonstrations do not generalize well to high-dimensional settings.
In contrast, the information-theoretic approach to action selection generalized well to the high-dimensional settings that emulated the real-world problem. The proposed framework for interactive information acquisition, along with the approximation for simulating the effect of greedy control policies, can easily be applied to different problem domains in interactive perception (Sankaran et al., 2017). Our formulation also provides a formal mechanism to combine active and interactive strategies in an information-theoretic setting. There are computational bottlenecks to interactive information acquisition; these mainly involve propagating the state of the environment forward in time. With advances in physics engines and GPUs, these problems can be addressed with greater ease.

6.6 Challenges in Interactive Perception

Our current approaches to interactive perception are myopic, as they do not reason about complex scene dynamics over large time horizons. In order to combine active and interactive perception methods, we need to develop efficient approximations that do not require large look-ahead horizons but can still produce near-optimal policies.

Chapter 7
Summary and Future Directions

Revisiting our initial claim that perception is a process that needs to be both active and exploratory, in this thesis we developed approaches to robot perception, specifically for object recognition, that utilize both active and interactive strategies. We developed formalisms and algorithms that address these problems in an information-theoretic framework, so that the uncertainty regarding the target we are trying to estimate can be reduced with information-seeking actions (Chapter 2). We then introduced the tools required to address these problems in the myopic and non-myopic settings in Chapter 3. In Chapter 4 we presented our work on state-of-the-art 3D recognition and the improvements we have made since the work was first published.
In Chapter 5 we adapted the general problem formulation for active information acquisition (Eq: 2.4) to active 3D object recognition, where we developed a non-myopic strategy for information acquisition. We compared our approach to other state-of-the-art algorithms at the time of publication and reported results on simulation and real-world experiments. Finally, in Chapter 6 we addressed interactive 3D object recognition, the challenges associated with modeling this problem, and why we are currently limited to myopic solutions. We introduced a method to learn greedy control policies from expert demonstration in Sec 6.3 and showed that this approach does not scale well to high-dimensional settings. We then developed an interactive information acquisition algorithm using the formalism developed in Chapter 2, and made efficient approximations to this problem to make it tractable and to find greedy closed-loop policies for interactive 3D object recognition (Chapter 6.4). We evaluated our approach in a high-dimensional, close-to-real-world setting using a physics simulator and real-world sensing.

To extend the work presented in this thesis, we need to develop a combined strategy for active and interactive information acquisition. In solving the joint problem, we must consider that it has the same limitations as the interactive perception problem; hence we would have to leverage the techniques introduced in Chapter 6 to explore the joint action-selection problem.

Learning greedy closed-loop controllers for joint active and interactive recognition

For robots to truly emulate human-level perception, their visual process should be both active and exploratory (Gibson, 1979). As a first step in this direction, a system capable of both active and interactive decision making will be limited by the constraints and challenges of the individual problems (Sec 5.9, 6.6).
With the approximations we have introduced for interactive recognition, we can now define a consistent state (images), actions, and rewards (pose uncertainty) for a joint active and interactive system. We hypothesize that we should be able to learn controllers for the joint problem through direct policy search techniques. Recent advances in model-free reinforcement learning (Levine et al., 2016; Schulman et al., 2015; van Hoof et al., 2015) have shown that such controllers can be learned efficiently for very high-dimensional input data such as images.

Reference List

Aldoma, A., Tombari, F., Prankl, J., Richtsfeld, A., Stefano, L. D., and Vincze, M. (2013). Multimodal cue integration through hypotheses verification for rgb-d object recognition and 6dof pose estimation. In 2013 IEEE International Conference on Robotics and Automation, pages 2104–2111.

Aldoma, A., Tombari, F., Stefano, L. D., and Vincze, M. (2016). A global hypothesis verification framework for 3d object recognition in clutter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1383–1396.

Andreopoulos, A. and Tsotsos, J. K. (2013). 50 years of object recognition: Directions forward. Computer Vision and Image Understanding, 117(8):827–891.

Atanasov, N. (2015). Active Information Acquisition with Mobile Robots. PhD thesis, University of Pennsylvania.

Atanasov, N., Sankaran, B., Le Ny, J., Pappas, G. J., and Daniilidis, K. (2014). Nonmyopic view planning for active object classification and pose estimation. IEEE Transactions on Robotics, 30(5):1078–1090.

Aydemir, A., Sjöö, K., Folkesson, J., Pronobis, A., and Jensfelt, P. (2011). Search in the Real World: Active Visual Object Search Based on Spatial Relations. In IEEE Int. Conf. on Robotics and Automation (ICRA).

Bajcsy, R. (1988). Active perception. Proceedings of the IEEE, Special Issue on Computer Vision, 76(7).

Bajcsy, R. (1989). Active perception and exploratory robotics. Technical report, University of Pennsylvania.
Bajcsy, R., Aloimonos, Y., and Tsotsos, J. K. (2018). Revisiting active perception. Auton. Robots, 42(2):177–196.

Bajcsy, R. and Sinha, P. R. (1989). Exploration of surfaces for robot mobility. In Proceedings of the Fourth International Conference on CAD, CAM, Robotics and Factories of the Future. Tata McGraw-Hill.

Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust features (SURF). Comput. Vis. Image Underst., 110(3):346–359.

Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1 edition.

Belongie, S., Malik, J., and Puzicha, J. (2002). Shape Matching and Object Recognition Using Shape Contexts. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(4).

Bertsekas, D. and Shreve, S. (2007). Stochastic Optimal Control: the Discrete-time Case. Athena Scientific.

Bertsekas, D. P. (2000). Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition.

Bohg, J., Johnson-Roberson, M., Björkman, M., and Kragic, D. (2010). Strategies for multi-modal scene exploration. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 4509–4515.

Borotschnig, H., Paletta, L., Prantl, M., and Pinz, A. (2000). Appearance-based Active Object Recognition. Image and Vision Computing, 18(9).

Browatzki, B., Tikhanoff, V., Metta, G., Bulthoff, H., and Wallraven, C. (2012). Active Object Recognition on a Humanoid Robot. In IEEE Int. Conf. on Robotics and Automation (ICRA).

Buch, A. G., Yang, Y., Krüger, N., and Petersen, H. G. (2014). In search of inliers: 3d correspondence by local and global voting. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2075–2082.

Byravan, A. and Fox, D. (2017). SE3-nets: Learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, May 29 - June 3, 2017, pages 173–180.

Chen, H. and Bhanu, B. (2007).
3d free-form object recognition in range images using local surface patches. Pattern Recognition Letters, 28(10):1252–1262.

Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292.

Denzler, J. and Brown, C. (2002). Information Theoretic Sensor Data Selection for Active Object Recognition and State Estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(2).

Dogar, M., Koval, M., Tallavajhula, A., and Srinivasa, S. (2013). Object search by manipulation. In IEEE International Conference on Robotics and Automation.

Doumanoglou, A., Kouskouridas, R., Malassiotis, S., and Kim, T.-K. (2016). Recovering 6d object pose and predicting next-best-view in the crowd. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Eidenberger, R. and Scharinger, J. (2010). Active Perception and Scene Modeling by Planning with Probabilistic 6D Object Poses. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS).

Ekvall, S., Jensfelt, P., and Kragic, D. (2006). Integrating Active Mobile Robot Object Recognition and SLAM in Natural Environments. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems.

Everingham, M., Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338.

Fan, T.-J., Medioni, G., and Nevatia, R. (1989). Recognizing 3-D Objects Using Surface Descriptions. IEEE Trans. Pattern Analysis and Machine Intelligence, 11(11).

Felzenszwalb, P., Girshick, R., and McAllester, D. (2008). A Discriminatively Trained, Multiscale, Deformable Part Model. In IEEE Computer Vision and Pattern Recognition (CVPR).

Gibson, J. J. (1966). The Senses Considered as Perceptual Systems. Houghton Mifflin, Boston.

Gibson, J. J. (1979). The Ecological Approach to Visual Perception. Houghton Mifflin.

Göbelbecker, M., Gretton, C., and Dearden, R. (2011).
A Switching Planner for Combined Task and Observation Planning. In AAAI Conference on Artificial Intelligence.

Golovin, D. and Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. J. Artif. Int. Res., 42(1):427–486.

Gu, C. and Ren, X. (2010). Discriminative mixture-of-templates for viewpoint classification. In Proceedings of the 11th European Conference on Computer Vision: Part V, ECCV'10, pages 408–421, Berlin, Heidelberg. Springer-Verlag.

Hager, G. D. and Mintz, M. (1988). Estimation procedures for robust sensor control. Int. J. Approx. Reasoning, 2(3):335–336.

Hanheide, M., Gretton, C., Dearden, R., Hawes, N., Wyatt, J., Pronobis, A., Aydemir, A., Göbelbecker, M., and Zender, H. (2011). Exploiting Probabilistic Knowledge under Uncertain Sensing for Efficient Robot Behaviour. In Int. Joint Conf. on Artificial Intelligence (IJCAI).

Hao, Q., Cai, R., Li, Z., Zhang, L., Pang, Y., Wu, F., and Rui, Y. (2013). Efficient 2d-to-3d correspondence filtering for scalable 3d object recognition. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 899–906.

Hausman, K., Corcos, C., Müller, J., Sha, F., and Sukhatme, G. (2014). Towards interactive object recognition. In IROS 2014 Workshop on Robots in Clutter: Perception and Interaction in Clutter, Chicago, IL, USA.

Hebert, M. and Hsiao, E. (2012). Occlusion reasoning for object detection under arbitrary viewpoint. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3146–3153.

Hebert, P., Howard, T., Hudson, N., Ma, J., and Burdick, J. (2013). The next best touch for model-based localization. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 99–106.

Held, R. and Hein, A. (1963). Movement-produced stimulation in the development of visually guided behaviour. Journal of Comparative and Physiological Psychology, (56):872–876.

Hero III, A. and Cochran, D. (2011).
Sensor Management: Past, Present, and Future. IEEE Sensors Journal, 11(12).

Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., and Navab, N. (2013). Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Proceedings of the 11th Asian Conference on Computer Vision - Volume Part I, ACCV'12, pages 548–562, Berlin, Heidelberg. Springer-Verlag.

Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.

Huber, M. (2009). Probabilistic Framework for Sensor Management. PhD thesis, Universität Karlsruhe (TH).

Huber, M. (2015). Nonlinear Gaussian Filtering: Theory, Algorithms, and Applications.

Javdani, S., Klingensmith, M., Bagnell, J. A., Pollard, N. S., and Srinivasa, S. S. (2013). Efficient touch based localization through submodularity. In IEEE International Conference on Robotics and Automation (ICRA).

Jenkins, K. (2010). Fast Adaptive Sensor Management for Feature-based Classification. PhD thesis, Boston University.

Kappler, D., Pastor, P., Kalakrishnan, M., Wüthrich, M., and Schaal, S. (2015). Data-driven online decision making for autonomous manipulation. In Robotics: Science and Systems.

Karasev, V., Chiuso, A., and Soatto, S. (2012). Controlled recognition bounds for visual learning and exploration. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 2915–2923. Curran Associates, Inc.

Kehl, W., Tombari, F., Navab, N., Ilic, S., and Lepetit, V. (2015). Hashmod: A hashing method for scalable 3d object detection. In BMVC.

Kjellström, H., Romero, J., and Kragic, D. (2010). Visual object-action recognition: Inferring object affordances from human demonstration. Computer Vision and Image Understanding, pages 81–90.

Koenig, N. and Howard, A. (2004). Design and use paradigms for gazebo, an open-source multi-robot simulator.
In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2149–2154.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press.

Koval, M. C., Pollard, N. S., and Srinivasa, S. S. (2015). Pose estimation for planar contact manipulation with manifold particle filters. The International Journal of Robotics Research, 34(7):922–945.

Krause, A. (2008). Optimizing Sensing: Theory and Applications. PhD thesis, Carnegie Mellon University.

Krause, A., Singh, A., and Guestrin, C. (2008). Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9.

Kreucher, C., Kastella, K., and Hero III, A. (2005). Sensor management using an active sensing approach. Signal Processing, 85(3).

Krotkov, E. and Bajcsy, R. (1993). Active Vision for Reliable Ranging: Cooperating Focus, Stereo, and Vergence. International Journal of Computer Vision, 11(2).

Kurniawati, H., Hsu, D., and Lee, W. (2008). SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces. Robotics: Science and Systems.

Laporte, C. and Arbel, T. (2006). Efficient Discriminant Viewpoint Selection for Active Bayesian Recognition. International Journal of Computer Vision, 68(3).

Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. J. Mach. Learn. Res., 17(1):1334–1373.

Li, C., Bohren, J., Carlson, E., and Hager, G. D. (2016). Hierarchical semantic parsing for object pose estimation in densely cluttered scenes. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5068–5075.

Lowe, D. (2001). Local feature view clustering for 3d object recognition. In Proc 2001 IEEE Comput Soc Conf Comput Vis Pattern Recogn, volume 1, pages I–682.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Int. J.
Comput. Vision, 60(2):91–110.

Malisiewicz, T., Gupta, A., and Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 89–96, Washington, DC, USA. IEEE Computer Society.

Malmir, M. and Cottrell, G. W. (2017). Belief tree search for active object recognition. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4276–4283.

Malmir, M., Sikka, K., Forster, D., Fasel, I., Movellan, J. R., and Cottrell, G. W. (2017). Deep active object recognition by joint label and action prediction. Computer Vision and Image Understanding, 156:128–137. Image and Video Understanding in Big Data.

Mian, A., Bennamoun, M., and Owens, R. (2010). On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes. Int. J. Comput. Vision, 89(2-3):348–361.

Muja, M. and Lowe, D. G. (2014). Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240.

Naghshavar, M. and Javidi, T. (2012). Active Sequential Hypothesis Testing. ArXiv: 1203.4626.

Najemnik, J. and Geisler, W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434(7031):387–391.

Nistér, D. and Stewénius, H. (2006). Scalable Recognition with a Vocabulary Tree. In Computer Vision and Pattern Recognition (CVPR).

Ong, S., Png, S., Hsu, D., and Lee, W. (2009). POMDPs for Robotic Tasks with Mixed Observability. Robotics: Science and Systems.

O'Regan, J. K. and Noë, A. (2001). A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24:939–973.

Paletta, L. and Pinz, A. (2000). Active Object Recognition by View Integration and Reinforcement Learning. Robotics and Autonomous Systems, 31(1–2).

Papazov, C. and Burschka, D. (2011). An efficient RANSAC for 3d object recognition in noisy and occluded scenes.
In Proceedings of the 10th Asian Conference on Computer Vision - Volume Part I, ACCV'10, pages 135–148, Berlin, Heidelberg. Springer-Verlag.

Pauwels, K., Rubio, L., Díaz, J., and Ros, E. (2013). Real-time model-based rigid object pose estimation and tracking combining dense and sparse visual cues. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2347–2354.

Petrovskaya, A. and Khatib, O. (2011). Global localization of objects via touch. Trans. Rob., 27(3):569–585.

Pito, R. (1999). A Solution to the Next Best View Problem for Automated Surface Acquisition. IEEE Trans. Pattern Analysis and Machine Intelligence, 21(10).

Potthast, C., Breitenmoser, A., Sha, F., and Sukhatme, G. S. (2016). Active multi-view object recognition. Robot. Auton. Syst., 84(C):31–47.

Potthast, C. and Sukhatme, G. (2014). A Probabilistic Framework for Next Best View Estimation in a Cluttered Environment. Journal of Visual Communication and Image Representation, 25(1).

Potthast, C. and Sukhatme, G. S. (2016). Online trajectory optimization to improve object recognition. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4765–4772.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, first edition.

Ratliff, N. D., Silver, D., and Bagnell, J. A. (2009). Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53.

Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 661–668.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014). ImageNet Large Scale Visual Recognition Challenge.

Rusu, R. (2009).
Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Technische Universität München.

Rusu, R. B. and Cousins, S. (2011). 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China.

Rusu, R. B., Holzbach, A., Blodow, N., and Beetz, M. (2009). Fast geometric point labeling using conditional random fields. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS'09, pages 7–12, Piscataway, NJ, USA. IEEE Press.

Sankaran, B. (2012). Sequential Hypothesis Testing for Next Best View Estimation. Master's thesis, University of Pennsylvania.

Sankaran, B., Atanasov, N., Le Ny, J., Koletschka, T., Pappas, G., and Daniilidis, K. (2013). Hypothesis testing framework for active object detection. In IEEE International Conference on Robotics and Automation (ICRA).

Sankaran, B., Bohg, J., Ratliff, N. D., and Schaal, S. (2015). Policy learning with hypothesis based local action selection. In Proceedings of the Reinforcement Learning and Decision Making (RLDM).

Sankaran, B., Hausman, K., Bohg, J., Brock, O., Kragic, D., Schaal, S., and Sukhatme, G. (2017). Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on Robotics, 33:1273–1291.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In Blei, D. and Bach, F., editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897. JMLR Workshop and Conference Proceedings.

Sinapov, J. and Stoytchev, A. (2013). Grounded object individuation by a humanoid robot. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 4981–4988.

Sommerlade, E. and Reid, I. (2008). Information Theoretic Active Scene Exploration. In IEEE Computer Vision and Pattern Recognition.

Spaan, M., Veiga, T., and Lima, P. (2010).
Active Cooperative Perception in Network Robot Systems using POMDPs. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS).

Sridharan, M., Wyatt, J., and Dearden, R. (2010). Planning to See: A Hierarchical Approach to Planning Visual Actions on a Robot using POMDPs. Artificial Intelligence, 174(11).

Särkkä, S. (2013). Bayesian Filtering and Smoothing. Cambridge University Press, New York, NY, USA.

Stachniss, C., Grisetti, G., and Burgard, W. (2005). Information gain-based exploration using Rao-Blackwellized particle filters. In RSS, pages 65–72.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.

Taskar, B., Guestrin, C., and Koller, D. (2004). Max-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS 2003), Vancouver, Canada. Winner of the Best Student Paper Award.

Thrun, S., Burgard, W., and Fox, D. (2005). Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press.

Tombari, F., Salti, S., and Di Stefano, L. (2010). Unique signatures of histograms for local surface description. In Proceedings of the 11th European Conference on Computer Vision: Part III, ECCV'10, pages 356–369, Berlin, Heidelberg. Springer-Verlag.

Tsikos, C. J. and Bajcsy, R. (1991). Segmentation via manipulation. IEEE T. Robotics and Automation, 7(3):306–319.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR).

van Hoof, H., Peters, J., and Neumann, G. (2015). Learning of non-parametric control policies with high-dimensional state features. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, volume 38, pages 995–1003. JMLR.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory.
Springer-Verlag New York, Inc., New York, NY, USA.

Velez, J., Hemann, G., Huang, A., Posner, I., and Roy, N. (2012). Modelling Observation Correlations for Active Exploration and Robust Object Detection. Journal of Artificial Intelligence Research, 44.

Wong, L., Kaelbling, L., and Lozano-Perez, T. (2013). Manipulation-based Active Search for Occluded Objects. In IEEE International Conference on Robotics and Automation (ICRA), pages 2814–2819.

Wüthrich, M., Bohg, J., Kappler, D., Pfreundt, C., and Schaal, S. (2015). The coordinate particle filter - a novel particle filter for high dimensional systems. In Proceedings of the IEEE International Conference on Robotics and Automation.

Wüthrich, M., Pastor, P., Kalakrishnan, M., Bohg, J., and Schaal, S. (2013). Probabilistic object tracking using a range camera. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3195–3202.

Zhang, J., Kan, C., Schwing, A. G., and Urtasun, R. (2013). Estimating the 3d layout of indoor scenes and its clutter from depth sensors. In The IEEE International Conference on Computer Vision (ICCV).
Asset Metadata
Creator: Sankaran, Bharath (author)
Core Title: From active to interactive 3D object recognition
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 10/15/2018
Defense Date: 07/25/2018
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: 3D perception, active perception, humanoids, information acquisition, interactive perception, mobile manipulation, OAI-PMH Harvest, object recognition, optimal control, robotics
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Ayanian, Nora (committee chair), Schaal, Stefan Kai (committee chair), Spedding, Geoffrey (committee member), Sukhatme, Gaurav (committee member)
Creator Email: bharath@scaledrobotics.com, bsankara@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-78585
Unique identifier: UC11668798
Identifier: etd-SankaranBh-6827.pdf (filename), usctheses-c89-78585 (legacy record id)
Legacy Identifier: etd-SankaranBh-6827.pdf
Dmrecord: 78585
Document Type: Dissertation
Rights: Sankaran, Bharath
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA