Intelligent Robotic Manipulation of Cluttered Environments
by
Megha Gupta
A Dissertation Presented to the
Faculty of the USC Graduate School
UNIVERSITY OF SOUTHERN CALIFORNIA
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
(Computer Science)
Committee:
Prof. Gaurav S. Sukhatme (Chair) Computer Science
Prof. Stefan Schaal Computer Science
Prof. Bhaskar Krishnamachari Electrical Engineering
December 2014
Copyright 2014 Megha Gupta
Dedicated to my family
Acknowledgements
First and foremost, I would like to thank my parents who have always believed in me
and given me the freedom to pursue whatever I have wanted to. I realize that this freedom
is a huge privilege and am grateful for it every day. I am blessed to have their constant
support and encouragement.
This dissertation would not have been possible without the guidance and support of
my advisor, Prof. Gaurav Sukhatme. I never cease to be amazed by his clear view of
the big picture and his ability to always ask the right questions. I am thankful for his
insights whenever I was stuck and for the many things that I have learned from him over
the course of my graduate studies.
I cannot imagine my graduate school life without my labmates who have shared many
ups and downs of research with me. Their warm friendship and kindness made the lab a fun
place for me. I am lucky to have been mentored by Sameera Poduri (once again after
my undergraduate years) during my first year at USC. Karthik Dantu introduced me to
the world of robotic sensor networks and helped me set up my very rst experiments on
robots. Jonathan Binney and Jonathan Kelly were always there to provide the much-
needed humor. I will miss having `enlightening' discussions on anything and everything
under the sun with Hordur Heidarsson and Jnaneshwar Das. Arvind Pereira, Christian
Potthast, Harshvardhan Vathsangam, Max Pflueger, David Kim, and Geoffrey Hollinger
have always been very generous with their time and advice. I can safely say that I have
never met a person as energetic as Stephanie Kemna, and I am thankful for her company
in the lab. Thanks to Jörg Müller and Andreas Breitenmoser for taking an active interest
in my research and advising me on all kinds of issues ranging from writing to swimming. I
want to thank Mrinal Kalakrishnan and Peter Pastor for patiently answering my questions
about the PR2 and robotic manipulation. Karol Hausman and Aleksandra Waltos were
the first ones to take care of me during my internship in Munich and I am ever grateful to
them for first being such gracious hosts, and then great neighbors. Thomas Rühr helped
me settle down in the Munich robotics lab and I am thankful for all his help, whether it
be with getting a stove, a phone card, or buying groceries, in addition to his contribution
to Chapter 3 in this dissertation. I thank Yi-Hsuan Kao for being a patient and sincere
collaborator, and for helping me come up with a fitting research problem to wrap up my
thesis with.
No amount of praise and gratitude is enough for what Lizsl De Leon Spedding has done
for me (and innumerable other students!) as a PhD advisor. She goes to great lengths to
fix things up for us when we forget our deadlines or are too lazy to read the guidelines.
She inspires me every day to be patient and kind.
My long stint at USC was made completely worthwhile by my `bros' Vivek Kumar
Singh, Manish Jain, Ripple Goyal, Pramod Sharma, Prithviraj Banerjee, Maheswaran
Sathiamoorthy, Kartik Audhkhasi, Nupur Kothari, Uday Khankhoje, Shweta Agrawal,
Nilesh Mishra, Priyanka Pandey, and Kuhu. Special thanks to Vivek for being who he is,
Uday and Shweta for being my partners in crime, and Manish and Ripple for being like
my family away from home. I also want to thank Vibhu Jindal, Rashi Garg, and Neeraj
Tripathi for always being there for me.
The level of craziness in my life went up several notches when I joined Vidushak. Not
only did I learn to act, but it also became my ultimate stress buster. Through Vidushak, I met
some of the warmest, funniest, and the most creative people at USC, and forged friendships
that would last a lifetime. I will cherish memories of every practice, every show, every
play, and every movie that we have done together. Heartfelt thanks to Adarsh Shekhar,
Vikram Ramanarayanan, Yamini Jangir, Mallika Sanyal, Krishnakali Dasgupta, Vishnu
Vardhan, Pankaj Rajak, Satyanarayan Rao, and Sananda Mukherjee – you guys rock!
Last, but in no imaginable way the least, I want to thank my sisters, Shweta Gupta
and Pragya Gupta, for being my mentors, confidantes, and friends. I thank my in-laws,
who are the best parents that anyone could wish for, for their encouragement, love, and
patience these past few years. And to my husband, Nikhil Karamchandani, I can only say
that you are my pillar of strength. Thanks for standing with me through thick and thin.
Table of Contents
Acknowledgements iii
List of Figures vii
List of Tables x
Abstract xi
1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Sorting in Clutter 7
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.1 Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.2 Decluttering using Manipulation Primitives . . . . . . . . . . . . . . 18
2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Exploration in Clutter 30
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Interactive Environment Exploration . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Interactive Object Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Contextual Object Search in Clutter 49
4.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Modeling the Object Placements on the Grid . . . . . . . . . . . . . 55
4.4 Contextual Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 Gain of a Categorization Action . . . . . . . . . . . . . . . . . . . . 59
4.4.2 Gain of a Move Action . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7.1 Markov Decision Process Formulation . . . . . . . . . . . . . . . . . 69
4.7.2 Category Recognition using the Cloud . . . . . . . . . . . . . . . . . 71
5 Conclusion 74
5.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
BIBLIOGRAPHY 76
List of Figures
1.1 Personal robots of the future will be expected to serve as personal
assistants in homes and offices, and caregivers to the elderly. . . . . . . . . 2
1.2 Our homes and offices are cluttered because they have a lot of objects of
all kinds, shapes, and sizes, in close proximity. . . . . . . . . . . . . . . . . 3
1.3 General pipeline for the systems presented in this dissertation. . . . . . . 5
2.1 The object sorting problem . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Sorting pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 The three different states a region could be in . . . . . . . . . . . . . . . . 12
2.4 An overview of the object sorting algorithm combining perception and
manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Experimental setup: the PR2 robot with a head-mounted Kinect looks at
Duplo bricks on a flat table-top. The task is to sort the bricks by color
and size into the three bins placed next to the table. . . . . . . . . . . . . 14
2.6 An example of the point cloud processing in our pipeline: the input point
cloud is processed to extract the point clouds of individual objects. . . . . 15
2.7 Examples showing how the object point cloud is divided into different
regions. Objects within a region are very close to each other while any
two objects in different regions are a minimum distance apart. The red
rectangle shows the bounding box for the robot's estimate of the area
occupied by the objects. The blue rectangles show the bounding boxes
for each region. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.8 The perception component of the object sorting pipeline . . . . . . . . . . 18
2.9 The pick and drop primitive . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.10 Calculating spread and tumble directions. The blue rectangles show the
bounding boxes for the regions and the white rectangles mark the object
that is manipulated. Spread and tumble directions are chosen as the ones
shown by black arrows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.11 Effect of the spread motion primitive on a cluttered region . . . . . . . . . 20
2.12 Effect of the tumble and spread motion primitives on a piled up region . . 21
2.13 Snapshots from the sorting experiment of a cluttered scene . . . . . . . . 22
2.14 Results of sorting Duplo bricks by color and size for an uncluttered scene
using our pipeline on the PR2 robot . . . . . . . . . . . . . . . . . . . . . 23
2.15 Comparison of the performances of the naïve and manipulation-aided
sorting approaches on various metrics was done for the ten
configurations shown in this figure. The results are reported for a single
trial for each configuration in Fig. 2.16. . . . . . . . . . . . . . . . . . . . . 23
2.16 Comparison of the performances of the naïve and the manipulation-aided
sorting on various metrics for the ten configurations shown in Fig. 2.15.
The results are reported for a single trial for each configuration. . . . . . . 24
2.17 The naïve and the manipulation-aided sorting algorithms were compared
for 5 trials on each of these 3 configurations of increasing occupancy density. 25
2.18 Fraction of spread and tumble operations applied by manipulation-aided
sorting for the 10 configurations in Fig. 2.16 . . . . . . . . . . . . . . . . . 26
2.19 Pathological case where naïve sorting results in a large number of failures 28
3.1 Cluttered environments require manipulation for exploration. . . . . . . . 30
3.2 Conceptual overview of the exploration planner. . . . . . . . . . . . . . . . 36
3.3 Simulation of the adaptive horizon exploration on a 3 × 4 grid with 6
objects and a horizon of 2. The camera is placed along the lower boundary
of the grid. Circles represent objects that are visible, squares represent
objects that have never been seen so far and are thus, unknown, diamonds
represent objects that have been seen at least once so far and so, their
locations are exactly known, and crosses represent grid cells whose state
is unknown. The unmarked cells with no objects are known to be free. In
the final configuration, all cell states are known, as shown by the absence
of cross marks. This exploration needed a total of 6 actions and 1.4 seconds. 37
3.4 Comparison of the performance of the adaptive horizon exploration with
varying horizons on a 3 × 4 grid with 5 objects. . . . . . . . . . . . . . . . 38
3.5 Comparison of the performance of the adaptive horizon exploration with
random planning. X-axis shows the number of objects in a 3 × 4 grid-
world and the Y-axis shows the ratio of the number of actions required for
complete exploration for the random planning algorithm to the adaptive
look-ahead exploration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Comparison of the performance of the adaptive horizon exploration with
fixed horizon exploration (horizon = 1) on a 4 × 4 grid with varying number
of objects. All numbers are averaged over 20 different object configurations. 40
3.7 Figure showing the average number of times a particular value of horizon
results in information gain (and hence, a plan) with respect to degree of
clutter for Algorithm 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.8 Experimental setup. The PR2 with a head-mounted Kinect views the
objects on a shelf from the front. The task is to search for a target object
by rearranging other objects on the shelf. . . . . . . . . . . . . . . . . . . 41
3.9 An overview of the implementation pipeline on the PR2 robot searching
for an object in a real world scenario. . . . . . . . . . . . . . . . . . . . . . 42
3.10 RViz snapshots of the algorithm running on a scene of 5 objects - 4 visible
and 1 hidden. The planning horizon is 1 to begin with in this example and
the dimensions of the target object (a salt shaker) were 0.04 m × 0.04 m × 0.1 m. 46
4.1 Examples of everyday environments following organizational principles in
the storage of different kinds of objects. . . . . . . . . . . . . . . . . . . . 51
4.2 Conceptual overview of the contextual planner . . . . . . . . . . . . . . . 52
4.3 An example grid world: top view . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Example of a factor graph for the MRF on a 2 x 2 grid world. . . . . . . . 57
4.5 An example run of the different planners for object search on a 4 × 5 grid
with 5 categories of objects (`Mugs & Glasses', `Bowls', `Plates', `Breakfast
Food', and `Breakfast Drinks'). a is the ground truth and b is what the
robot sees. The cells whose state is unknown to the robot because they
have never been seen are shown in black. The cells currently visible are
shown in white. Grey cells represent those that are known to be free or
occupied but are currently hidden from view. The target object (pack of
tea bags) is marked in green in a. The sequence of actions taken by the
move-only planner is shown in (c) - (f), (g) - (i) show the plan generated
by the contextual-SC planner, and (j) - (k) show the contextual-MC plan. 64
4.6 Comparison of the average performance of the different planners with
varying target depth and the total number of objects in the scene. . . . . 67
4.7 A more detailed comparison of the performance of the different planners
with target depth and total number of objects. . . . . . . . . . . . . . . . 68
4.8 A sample Human Intelligence task (HIT) for object categorization on
Amazon's Mechanical Turk . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.9 Scatter plot for the response time vs. reward (5 trials for each reward)
for an object categorization HIT (Human Intelligence task) on Amazon's
Mechanical Turk. The dotted line shows the median response time for
each reward amount. The median response time roughly goes down as
the reward is increased. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
List of Tables
2.1 Performance comparison of the naïve and the manipulation-aided sorting
approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Average number of different kinds of failures using the two approaches for
the pathological case in Fig. 2.19 . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 p-values for paired t-tests between each pair of planners . . . . . . . . . . . 66
Abstract
Robotic household assistants of the future will need to understand their environment
in real-time with high accuracy. There are two problems that make this challenging for
robots. First, human environments are typically cluttered, containing a lot of objects
of all kinds, shapes and sizes, in close proximity. This introduces errors in the robot's
perception and manipulation. Second, human environments are highly varied. Improving
a robot's perceptual abilities can tackle these challenges only partially. A robot's ability to
manipulate its environment can help in enabling and overcoming the limits of perception.
We test this idea of manipulation-aided perception in the context of sorting and
searching in cluttered, bounded, and partially observable environments. The inherent
uncertainty in the world state forces the robot to adopt an observe-plan-act strategy
where perception, planning, and execution are interleaved. Since execution of an action
may result in revealing information about the world that was unknown hitherto, a new
plan needs to be generated as a consequence of the robot's actions. Since manipulation is
typically expensive on a robot, our goal is to reduce the number of object manipulations
required to complete the desired task.
This thesis presents planning algorithms for a robot's intelligent physical interaction
with its cluttered environment. The focus is on using simple manipulation primitives to
declutter the world and on making task completion easier and faster using the environment's
local structure or context. For object sorting, we present a robust pipeline that combines
manipulation-aided perception and grasping to achieve more reliable and accurate sorting
results. In the context of environment exploration, we present an adaptive look-ahead
algorithm for exploration by prehensile and non-prehensile manipulation of the objects in
the environment. This algorithm is then applied to manipulation-based object search in
the real world. Finally, we add contextual structure to the world in the form of object-object
co-occurrence relations and present an algorithm that uses context to guide the object
search.
We evaluate the performance and applicability of our planners through extensive
simulations and real-world experiments on the PR2 robot. Our results show that
purposeful manipulation of clutter to aid perception becomes increasingly useful (and
essential) as the clutter in the environment increases, and that intelligent manipulation
of a cluttered environment improves the efficiency of robotic tasks.
Chapter 1
Introduction
Personal robots of the future will not only be expected to assist humans with everyday
household chores like cleaning, cooking, and grocery shopping, but also to serve as
caregivers to the elderly (Fig. 1.1). This necessitates a capability to understand their
surroundings in real-time with high reliability. There are two problems that make this
very challenging for robots. First, human environments are typically very cluttered, i.e., they
contain a lot of objects of all kinds, shapes and sizes, in close proximity, which introduces
errors in the robot's perception and manipulation. Second, human environments are
highly varied - each home is organized differently and even within the same home, an
object may not be found at the same location from one day to the next. Thus, an
important aspect of self-sufficiency of robots is their ability to deal with this clutter since
even very organized homes and offices have a lot of objects (Fig. 1.2). Traditionally,
perception guides manipulation in robotic manipulation tasks. Having sophisticated
perceptual abilities is, thus, very important to deal with the large variety and complexity
of objects that a robot may encounter. However, clutter introduces uncertainty and
errors in robotic perception leading to failures in a manipulation task.
Manipulation of the world to reduce clutter, and thus enabling efficient perception,
could serve as an elegant tool in such scenarios. Using physical interaction with our
environment to complete a task or make sense of our surroundings comes very naturally
to us. We do it all the time. Think about looking for your car keys, that book that
you need to return to the library, a matching pair of socks in your drawer, making room
for a milk carton in your refrigerator, sorting your laundry, groceries, or snail mail, solving
a jigsaw puzzle - the examples are endless. Although clutter makes these tasks tougher,
we usually succeed at them in reasonable amounts of time because of our ability to pick
Figure 1.1: Personal robots of the future will be expected to serve as personal assistants
in homes and offices, and caregivers to the elderly.
Figure 1.2: Our homes and offices are cluttered because they have a lot of objects of all
kinds, shapes, and sizes, in close proximity.
up objects, push them aside, and shue through them. Not only are these manipulation
maneuvers eective, they are few and simple too. This motivates our research in using
intelligent manipulation of the environment as an aid to perception, and thus ultimately
to purposeful grasping and manipulation.
This approach is in direct contrast to the traditional robotics approach of considering
contact with any object in the environment that is not directly involved in the task as a
collision. However, in many tasks, such contact is acceptable and not catastrophic either
for the robot or the user. In fact, it is often necessary for task completion.
Our vision is to introduce humanoid robots into our homes and oces as personal
assistants and their ability to deal intelligently with clutter would be critical to making
this a reality. The goal of this dissertation is to design planning algorithms for a
robot's intelligent physical interaction with its cluttered environment. The
focus will be on using simple manipulation primitives to declutter the world and on
making task completion easier and faster using the environment's local structure or
context.
1.1 Contributions
This dissertation explores the idea that deliberate, intelligent manipulation of a cluttered
environment by a robot can enable efficient perception, planning, and manipulation, thus
improving performance on the overall robotic task. We address this idea specically in the
contexts of sorting, exploration, and search in cluttered environments. The environments
being dealt with are bounded and small (e.g., a kitchen shelf); however, clutter renders
them partially observable. This inherent uncertainty in the world's state forces the robot
to adopt an observe-plan-act strategy where perception, planning, and execution have to
be interleaved since execution of an action may result in revealing information about the
world that was unknown hitherto, and hence a new plan needs to be generated. Since
manipulation is typically expensive on a robot, our goal is to reduce the number of object
manipulations required to complete the desired task.
The contributions of this thesis are in the following three areas:
1. Object Sorting
Given a pile of small objects of a similar type on a tabletop, the objects have to be
sorted by color or size. We present a robust pipeline that combines manipulation-
aided perception and grasping to achieve reliable and accurate sorting.
2. Environment Exploration
A sequence of rearrangements of objects in a small and bounded cluttered
environment has to be planned to explore it and potentially search for a target
object. We present an adaptive look-ahead algorithm for exploration by prehensile
and non-prehensile manipulation of the objects in the world. We then use it for
object search in the real world.
3. Contextual Object Search
Given certain object-object co-occurrence relations and a target object, a sequence of
actions has to be planned to search for the target eciently. We present an algorithm
that uses context to guide the object search and results in fewer manipulations than
a purely manipulation-based search without the use of any context.
Figure 1.3: General pipeline for the systems presented in this dissertation.
In the following chapters, we present complete systems that solve the problems listed
above. All of these systems use an observe-plan-act strategy for the corresponding
partially observable world till the desired goal is achieved. In general, we could represent
these systems as shown in Fig. 1.3. Our main algorithmic contributions lie in designing
the planning component of this pipeline. The planner gets as inputs an observation of
the world, the state space S, and the action space A, and outputs a suitable action to
manipulate the world. As we progress through the chapters, we add more inputs to the
planner to enable it to do more sophisticated decision-making.
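To make the loop of Fig. 1.3 concrete, the sketch below shows one way the observe-plan-act cycle could be organized in code. It is illustrative only: the helper callables (take_observation, update_state, goal_reached, select_action, execute) are hypothetical placeholders for the perception, planning, and manipulation components developed in the later chapters, not the actual PR2 implementation.

from typing import Any, Callable

def observe_plan_act(take_observation: Callable[[], Any],
                     update_state: Callable[[Any, Any], Any],
                     goal_reached: Callable[[Any], bool],
                     select_action: Callable[[Any], Any],
                     execute: Callable[[Any], None],
                     max_iterations: int = 100) -> Any:
    """Generic observe-plan-act loop of Fig. 1.3 (illustrative sketch only)."""
    state = None
    for _ in range(max_iterations):
        observation = take_observation()          # perception: sense the partially observable world
        state = update_state(state, observation)  # update the estimate of the state in S
        if goal_reached(state):                   # e.g., table cleared or target object found
            break
        action = select_action(state)             # planning: choose a suitable action from A
        execute(action)                           # manipulation: act, possibly revealing new information
    return state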
We evaluate our planners through simulations and real-world experiments on the PR2
robot using various metrics like planning time and number of actions required, and show
that purposeful manipulation of clutter to aid perception becomes increasingly useful (and
essential) as the clutter in the environment increases.
Chapter 2
Sorting in Clutter
Figure 2.1: The object sorting problem
We first explore the idea of manipulation-aided perception and grasping in the context
of sorting small objects on a cluttered tabletop (Fig. 2.1). We present a robust pipeline
that combines perception and manipulation to accurately sort objects by some property
(e.g., color, size, shape, etc.). The pipeline uses two motion primitives to manipulate the
scene in ways that help the robot to improve its perception and grasps. This results in
the ability to sort cluttered object piles accurately. We also present an implementation
on the PR2 robot which applies our algorithm to sort Duplo bricks by color and size, and
compare our method to brick sorting without the aid of manipulation. The experimental
results demonstrate the benets of our approach, particularly in environments with a high
degree of clutter.
2.1 Motivation
Imagine you are searching for a pen on your untidy desk. You look at the desk but
can't see a pen. What is the next thing you do? You pick up and move other objects
on the desk until you find the pen. In this process, you displace the objects that you
are not interested in, but this is acceptable. Physically interacting with the environment
to achieve a goal is a natural behavior for humans. Shuffling through the pieces of a
jigsaw puzzle, sorting the laundry, moving aside clutter to see what lies beneath – the
examples are endless. This process is appealing since it is highly effective and the applied
manipulation maneuvers are few and simple, e.g., picking up an object, pushing things
aside, and shuffling through them.
This motivates our research in robots manipulating the environment as an aid to
perceive and understand it, which eventually enables them to purposefully grasp and
manipulate objects. All over the world, research in the field of personal robots has been
growing, with the aim of introducing humanoid robots into our households to assist us in
daily chores such as cleaning, cooking, and keeping things in order. This requires that
robots can understand their environment in real-time with high accuracy. Improving their
perceptual abilities in terms of accurate sensors and inference, however, is only one side
of the coin. Their ability to manipulate the environment can substantially contribute to
overcoming perceptual shortcomings, since objects are usually easier to detect and grasp
when they are isolated compared to when they are part of clutter.
In this work, we consider the robotic task of sorting objects that are piled up or
scattered on a table. We propose an algorithm that enables a robot to rearrange the
objects using a set of manipulation primitives in order to reliably perceive and accurately
grasp the individual objects. In contrast to purely perceptual approaches, our method
can quickly reduce clutter and resolve occlusions of objects that may, otherwise, lead
to incorrect perception and failed grasps in the sorting process. We demonstrate the
effectiveness of our approach with a PR2 robot sorting Duplo bricks into different bins
according to size or color. Our experiments show that sorting is more consistent and
reliable when cluttered parts of the scene are first manipulated to spread them out, as
opposed to when objects are attempted to be picked up directly from clutter.
2.2 Related Work
Several efforts have been directed at successful manipulation in cluttered spaces. Cohen,
Chitta, and Likhachev [1] use heuristics and simple motion primitives to tackle the high-
dimensional planning problem. Jang et al. [2] present visibility-based techniques for real-
time planning for object manipulation in cluttered environments. Hirano, Kitahama,
and Yoshizawa [3] describe an approach using RRTs to come up with grasps even in
cluttered scenes. Dogar and Srinivasa introduced the concept of push-grasping, using
which a robot manipulates and pushes obstacles aside to plan a more direct grasp for an
object [4].
At the other end of the spectrum, the problem of improving perception in cluttered
environments has also been studied. Thorpe et al. [5] present a sensor-fusion system
to monitor an urban environment and safely drive through it. Various techniques using
multiple views, sensors, and computer vision algorithms have been developed to improve
perceptual abilities of a robot in clutter [6, 7, 8].
However, these two classes of work focus on either accurate grasping or object
recognition, and consider collisions with obstacles unacceptable. In a similar vein, there
has also been research on achieving high perception accuracy so that grasp planning can
benet from it. These approaches study perception techniques to aid manipulation while
we are aiming at the converse.
There has been some work on manipulation-aided perception, also referred to as
interactive perception in literature. Using manipulation for object segmentation and
feature detection has been studied by several groups [9, 10, 11, 12]. Katz and Brock [13]
use manipulation to build a kinematic model of an unknown articulated object.
Recently, some approaches have been designed to isolate objects using small deliberate
interactions with a pile of unknown objects. Chang et al. [14] introduce small disturbances
in an object pile and track the optical flow of textured objects to separate unknown rigid
objects. Katz et al. [15] use a similar idea where they detect how certain features of an
object pile change after small perturbations and use it to segment objects. The goal of
these two methods is not sorting but to clear away the pile by segmenting and separating
the objects. These could be useful when dealing with an unknown set of objects but
the first method needs a large number (4-9) of perturbations per object while the second
method uses a more sophisticated Barrett hand that is capable of more complex grasps.
Our approach to object sorting is more suited to standard robotic manipulators with
simple grippers and is efficient in terms of the average number of actions required to sort
an object.
The classic work on the Freddy II robot [16], which succeeded at assembling a variety of
simple structures from a heap of parts on a table, is quite similar in flavor to our proposed
pipeline though we differ in our assumptions and implementation details. They used
detailed object models to segment and singulate objects from a pile. Any unsegmented
heap was attacked as a whole to decompose it. We compare our approach to this kind of
naïve decomposition of a pile in Section 2.6.
The DARRT algorithm [17] also presents a planning approach based on diverse actions.
Given a set of goal locations for a set of movable objects, DARRT plans a sequence
of actions based on their utilities given the current state of the world. However, the
outcomes of actions are assumed to be deterministic making later action choices in the
plan dependent on earlier ones. The challenge here is to come up with valid long plans
in constrained high-dimensional spaces. In contrast, our manipulation primitives are not
model-based and inherently non-deterministic, and are assumed to be independent of
each other. Thus, our plans are short and re-observation of the state of the world and the
generation of a new plan after execution of a set of actions becomes necessary.
Parts sorting is an old field of research and highly accurate vibration-based parts
feeders have been designed in industry [18, 19]. A parts feeder can not only sort but also
position and orient bulk parts before they are fed to an assembly station and thus help
in micro-assembly. Work has also been done on sorting parts by shapes using Bayesian
techniques and parallel-jaw grippers [20, 21]. However, these methods assume availability
of very specic equipment or require parts to be fed to the feeder one by one. Robotic
bin-picking is also a classic problem of robotics. Work in this eld has, however, focused
on sophisticated hardware combined with specically tailored computer vision algorithms
[22, 23, 24, 25]. In contrast, our approach does not require any specialized hardware and
has been designed with personal robotic assistants, which have a standard suite of sensors
and will have to function in an unstructured environment, in mind.
Figure 2.2: Sorting pipeline
2.3 Algorithm
We consider the problem of sorting objects scattered or piled up on a table by their
properties (e.g., color, size, shape, etc.) in separate bins and eventually clearing the table.
We present a complete pipeline for real-world scenarios that combines perception and
manipulation, interleaving the following four steps repeatedly:
1. Perception and object segmentation
2. High-level planning of an action sequence
3. Low-level planning of arm motion for the next action
4. Plan execution and control
In this paper, we focus on the high-level planning of an action sequence and contribute
a novel algorithm to reliably sort objects through manipulation-aided perception. In each
iteration, our algorithm analyzes the scene and chooses actions that sort sufficiently isolated
objects into the corresponding bins, and that improve the visibility, segmentation, and
graspability of objects in cluttered areas, as described below:
1. Segmentation into regions: Let us assume that we have a way to segment the
scene into spatial clusters, or regions. Given this segmentation, we classify each
region to be in one of the following states (Fig. 2.3):
Uncluttered: No two objects in the region are in contact or too close to each
other to be reliably segmented and grasped.
Cluttered: Each object is close to or in contact with at least one other object.
However, all the objects in the region lie directly on the table.
Piled up: The objects are cluttered and piled up, i.e., some of them lie on top
of other objects.
(a) Uncluttered (b) Cluttered (c) Piled up
Figure 2.3: The three different states a region could be in
2. Manipulation of regions: To perturb a region, we define a library of simple
manipulation primitives that are generic enough to be applicable to a variety of
objects and standard manipulators. Based on the state of each region, an appropriate
manipulation primitive is used:
Uncluttered: Pick every object in the region and drop it in the corresponding
bin to clear the region.
Cluttered: Spread out the objects (increase their average separation) to make
the region uncluttered.
Piled up: Decompose the pile such that all objects lie directly on the tabletop.
3. Repeat until the table is cleared.
Our algorithm repeatedly plans an action sequence to manipulate the clutter until objects
are scattered enough to be easily grasped. Thus, it substantially improves the graspability
of objects and enables the robot to sort the objects accurately.
Fig. 2.4 gives an overview of the algorithm.
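Before turning to the experimental setup, the following sketch summarizes the loop of Section 2.3 in code form. It is a schematic outline rather than the actual implementation: the helper callables (segment_regions, classify_region, pick_and_drop, spread, tumble) are hypothetical placeholders for the perception and manipulation components described in Section 2.5.

from typing import Callable, Sequence

UNCLUTTERED, CLUTTERED, PILED_UP = "uncluttered", "cluttered", "piled_up"

def sort_table(segment_regions: Callable[[], Sequence[object]],
               classify_region: Callable[[object], str],
               pick_and_drop: Callable[[object], None],
               spread: Callable[[object], None],
               tumble: Callable[[object], None],
               max_iterations: int = 50) -> None:
    """High-level sorting loop of Section 2.3 (schematic; helpers are placeholders)."""
    for _ in range(max_iterations):
        regions = segment_regions()      # step 1: re-observe the scene and segment it into regions
        if not regions:                  # no objects left: the table has been cleared
            return
        for region in regions:           # step 2: apply one primitive per region, chosen by its state
            state = classify_region(region)
            if state == UNCLUTTERED:
                pick_and_drop(region)    # sort every isolated object into its bin
            elif state == CLUTTERED:
                spread(region)           # increase the average separation between objects
            else:
                tumble(region)           # decompose the pile onto the tabletop
        # step 3: repeat until the table is cleared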
12
Input point
cloud
Divide into
regions
Assign states
to regions
Use a manipulation
primitive for each
region
Figure 2.4: An overview of the object sorting algorithm combing perception and
manipulation
2.4 Experimental Setup
To demonstrate the capabilities of our approach, we implemented and experimentally
evaluated our algorithm using the PR2 robot, a semi-humanoid robotic platform developed
by Willow Garage. The PR2 is equipped with a variety of sensors including several
cameras and laser range finders. For the experiments presented in this paper, we used the
colored depth images of the Microsoft Kinect sensor mounted on the head of the robot.
For manipulation, the PR2 has a simple two-finger gripper that enables it to execute only
simple parallel-fingered grasps. This requires sufficient space on both sides of the target
object and makes grasping of objects from clutter a challenging problem.
In our experiments, we consider the task of sorting Duplo bricks as an example to show
that manipulation can improve robotic perception and grasping in clutter. Fig. 2.5 shows
our experimental setup. Duplos are scattered on a table and the PR2 is positioned at one
of the edges of the table with its head looking down. Bins for collecting sorted Duplos are
placed on one side of the table at known locations. For the rest of the paper, we assume
the tabletop to lie in the XY-plane with the X and Y axes as shown in Fig. 2.5.
2.5 Implementation
We implemented our approach on the PR2 robot based on ROS [26], an open-source robot
operating system developed by Willow Garage. ROS provides a set of mature algorithms
for perception and manipulation on the PR2, and we combined those with our planning
Figure 2.5: Experimental setup: the PR2 robot with a head-mounted Kinect looks at
Duplo bricks on a flat table-top. The task is to sort the bricks by color and size into the
three bins placed next to the table.
algorithm to create a robust pipeline that can declutter a pile of objects in an intelligent
way, and sort it accurately and reliably. The different components of our object sorting
pipeline are presented in detail below.
2.5.1 Perception
The perception component of our pipeline receives a point cloud from the head-mounted
Kinect sensor and processes it using the following steps:
Object Cloud Extraction
Standard algorithms available in the Point Cloud Library (PCL) [27] are used for
preprocessing. The point cloud is first filtered to only contain the table for which an
approximate bounding box is assumed to be known. In the second step, we extract the
table as the largest plane from the point cloud by planar segmentation using RANSAC.
Its dimensions are used for another filtering, which removes all outliers and produces the
point cloud associated with only the tabletop objects (Fig. 2.6a - 2.6c).
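The actual pipeline relies on PCL's planar segmentation; purely as an illustration of the idea (the largest plane is the table, the remaining points belong to the objects), a minimal NumPy re-implementation of RANSAC plane extraction might look as follows. The iteration count and inlier threshold are assumed values, not the tuned parameters of our system.

import numpy as np

def extract_tabletop_objects(cloud: np.ndarray,
                             n_iters: int = 200,
                             inlier_thresh: float = 0.01) -> tuple:
    """Split an (N, 3) point cloud into table inliers and object points via a
    schematic RANSAC plane fit (illustrative stand-in for PCL's planar segmentation)."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(cloud), dtype=bool)
    for _ in range(n_iters):
        p0, p1, p2 = cloud[rng.choice(len(cloud), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                        # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        dist = np.abs((cloud - p0) @ normal)   # point-to-plane distance for every point
        inliers = dist < inlier_thresh
        if inliers.sum() > best_inliers.sum(): # keep the plane with the most support
            best_inliers = inliers
    return cloud[best_inliers], cloud[~best_inliers]   # (table cloud, objects cloud)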
(a) Filtered point cloud (b) Extracted tabletop
(c) Objects point cloud (d) Point clouds of individual objects
Figure 2.6: An example of the point cloud processing in our pipeline: the input point
cloud is processed to extract the point clouds of individual objects.
Dividing into Regions
We divide the filtered point cloud into regions based on the spatial relationships of the
corresponding objects. In particular, we apply the spatial clustering algorithm available
in PCL [28] to the objects point cloud such that the minimum Euclidean distance between
any pair of points in different clusters exceeds a predefined threshold. Fig. 2.7 shows the
regions generated by the algorithm for different scenes. Note that our method adaptively
selects the number of regions depending on the spatial configuration of the objects.
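Schematically, this clustering step can be thought of as a flood fill over a fixed-radius neighborhood graph of the object points. The sketch below re-implements that idea with SciPy's k-d tree; it is not the PCL routine used in our pipeline, and the distance threshold shown is an assumed value.

import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(points: np.ndarray, min_dist: float = 0.02) -> list:
    """Group an (N, 3) object cloud into regions: two points belong to the same
    region iff they are connected by a chain of neighbors closer than min_dist."""
    tree = cKDTree(points)
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:                               # flood fill from the seed point
            idx = frontier.pop()
            for nbr in tree.query_ball_point(points[idx], r=min_dist):
                if nbr in unvisited:
                    unvisited.remove(nbr)
                    cluster.append(nbr)
                    frontier.append(nbr)
        clusters.append(np.asarray(cluster))          # point indices of one region
    return clusters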
Assigning States to Regions
To assign a state to each region, we rst segment the region into individual objects. For
Duplos, we exploit the fact that Duplos are single-colored objects, which substantially
simplifies object segmentation. We adapt the spatial clustering algorithm to additionally
take into account the color of the points in the distance metric, which allows two Duplos
of different colors to be distinguished even if they are in contact. Fig. 2.6d shows the
result of spatial color-based cluster extraction. We see that spatial color-based clustering
gives good results even in the presence of clutter.
Note that we use single-colored objects to simplify object segmentation because the
focus of this paper is not segmentation. For more complex objects (including multi-colored
objects), one could apply a more sophisticated object segmentation algorithm available in
the literature and still use our planning algorithm to reduce clutter and sort objects using
manipulation primitives.
For each object in a region, we compute the number of objects it is close to or in
contact with. According to our definition of uncluttered, cluttered, and piled up regions,
we determine the state of each region based on the distance between neighboring objects
and the height of the point cloud, as shown in Fig. 2.7d.
Fig. 2.8 gives an overview of the perception component of the object sorting pipeline.
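A minimal version of the state assignment could look like the sketch below, given the individual object clouds of one region. The thresholds (a contact distance and a nominal brick height) are illustrative assumptions, not the tuned values used in the implementation.

import numpy as np
from scipy.spatial import cKDTree

def region_state(object_clouds: list,
                 contact_dist: float = 0.01,
                 brick_height: float = 0.02) -> str:
    """Assign one of the three states of Fig. 2.3 to a region (schematic sketch)."""
    all_pts = np.vstack(object_clouds)
    # Piled up: the cloud extends well above a single object height.
    if all_pts[:, 2].max() - all_pts[:, 2].min() > 1.5 * brick_height:
        return "piled_up"
    # Cluttered: at least one pair of objects is in contact or nearly so.
    for i in range(len(object_clouds)):
        tree = cKDTree(object_clouds[i])
        for j in range(i + 1, len(object_clouds)):
            gap = tree.query(object_clouds[j], k=1)[0].min()   # closest point-to-point distance
            if gap < contact_dist:
                return "cluttered"
    return "uncluttered"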
(a) 6 regions (b) 4 regions
(c) 4 regions (d) 5 regions
Figure 2.7: Examples showing how the object point cloud is divided into different regions.
Objects within a region are very close to each other while any two objects in different
regions are a minimum distance apart. The red rectangle shows the bounding box for
the robot's estimate of the area occupied by the objects. The blue rectangles show the
bounding boxes for each region.
Figure 2.8: The perception component of the object sorting pipeline
2.5.2 Decluttering using Manipulation Primitives
For each region, the robot applies one of the three motion primitives described in
Section 2.3 depending on the region's state.
Pick and Drop
For clearing away an uncluttered region, the PR2 tabletop manipulation pipeline of ROS
already provides an implementation of all essential tasks, namely tabletop detection,
collision map building, arm navigation, and grasping (Fig. 2.9). Since collisions of the
compliant manipulator with objects and the table are acceptable in the context of
sorting, the standard pipeline is slightly modified to incorporate this. The collision map
includes only the table but none of the objects on it and collisions with the table while
grasping and lifting up an object are allowed. Based on the adapted collision detection,
the grasp server of the manipulation pipeline attempts to successively pick up all objects
in the region. Once an object is picked up, the arm moves to a predefined position above
the bin with the corresponding color or size, and drops the object in it. Unless a grasp
fails, this action sequence is likely to sort the objects and to clear the uncluttered region.
Figure 2.9: The pick and drop primitive
Spread
For a cluttered region, our algorithm first determines the object that is in contact with the
maximum number of other objects. Let us denote this object by B. Perturbing B from its
original position moves the neighboring objects as well and may result in isolating several
objects for easy pick up. The robot places its fingers on the center of B and moves it
once along the X-axis and then along the Y-axis. Moving B in two orthogonal directions
helps to not only move it out of its region's bounding box but also to isolate it from other
objects that may have moved with it during the first perturbation along the X-axis. The
directions of these orthogonal movements are calculated from B's position in the region
and chosen to be the ones that are more likely to take B away from others with a small
amount of movement. Specifically, the directions are chosen such that the extents of the
point cloud around B are smaller in those directions. These small perturbations are a way
to minimize clashes with other regions since we do not analyze the outcome of the spread
action.
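As an illustration of this rule, the sketch below picks the signs of the two orthogonal pushes from the region's bounding box and B's position; the coordinates are assumed to be expressed in the tabletop frame of Fig. 2.5, and the function is a schematic reading of the rule above rather than the deployed code.

import numpy as np

def spread_directions(region_cloud: np.ndarray, b_center: np.ndarray) -> tuple:
    """Choose the X and Y push directions for the spread primitive: push object B
    toward whichever side of the region's bounding box is nearer to it, so that a
    small motion takes it out of the clutter (schematic sketch)."""
    x_min, y_min = region_cloud[:, 0].min(), region_cloud[:, 1].min()
    x_max, y_max = region_cloud[:, 0].max(), region_cloud[:, 1].max()
    # +1 pushes toward increasing coordinate values, -1 toward decreasing values.
    x_dir = 1.0 if (x_max - b_center[0]) < (b_center[0] - x_min) else -1.0
    y_dir = 1.0 if (y_max - b_center[1]) < (b_center[1] - y_min) else -1.0
    return x_dir, y_dir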
(a) Spread directions (b) Tumble direction
Figure 2.10: Calculating spread and tumble directions. The blue rectangles show
the bounding boxes for the regions and the white rectangles mark the object that is
manipulated. Spread and tumble directions are chosen as the ones shown by black arrows.
(a) Before spread action (b) After spread action
Figure 2.11: Effect of the spread motion primitive on a cluttered region
For example, in Fig. 2.10a, the blue rectangle around the objects shows the bounding
box for that region and the object outlined by a white box in the center is the one in
contact with the maximum number of other objects. The spread directions are chosen as
the ones shown by black arrows because the boundaries of the bounding box are closer
in those directions. This action is likely to spread out the objects making the region
less cluttered than before. Fig. 2.11 shows the result of applying two consecutive spread
actions to a cluttered region.
For cluttered regions with a large number of objects, finding B is computationally
expensive. Also, one spread action on this large cluttered region may not be enough.
To speed up the decluttering of such a region, we divide the point cloud into four equal
quadrants and spread out each quadrant separately.
Tumble
For a piled up region, the centroid p of the topmost object in the region is located. Given
that the X-axis points forward (away from the robot), three waypoints are defined for the
robot arm: a point in front of p, p itself, and a point behind p. Depending on the extent of
the point cloud, the robot moves its hand across these waypoints either forward or backward
in an attempt to tumble the pile. In particular, the direction of this movement is the one of
smaller extent of the point cloud around p along the X-axis, because it would be easier for
the pile to tumble in that direction (see Fig. 2.10b for an example). This action is likely
to decompose the pile such that all objects lie directly on the tabletop. If it fails because
an inverse kinematic solution is not found, the robot tries to tumble the pile by moving
(a) Before tumble & spread actions (b) After tumble & spread actions
Figure 2.12: Effect of the tumble and spread motion primitives on a piled up region
its hand along the Y-axis. If that fails too, our algorithm applies the spread action as a
fallback for this region.
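The waypoint construction for the tumble primitive can be sketched as follows; the sweep length is an assumed value, and the fallback to the Y-axis and to the spread action is omitted for brevity.

import numpy as np

def tumble_waypoints(region_cloud: np.ndarray, p: np.ndarray,
                     reach: float = 0.06) -> list:
    """Three waypoints for the tumble primitive along the X-axis: a point before p,
    p itself, and a point after p, ordered so that the hand sweeps toward the side
    where the pile's extent around p is smaller (schematic sketch)."""
    ahead_extent = region_cloud[:, 0].max() - p[0]    # pile extent in front of p
    behind_extent = p[0] - region_cloud[:, 0].min()   # pile extent behind p
    step = np.array([reach, 0.0, 0.0])
    if ahead_extent < behind_extent:
        return [p - step, p, p + step]    # sweep forward, toward the smaller extent
    return [p + step, p, p - step]        # sweep backward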
Fig. 2.12 shows the result of using a tumble action followed by two spread actions on
a piled up region. Note that in our current implementation, all objects isolated as a result
of spreading or tumbling are first picked up and sorted before the next set of spread and
tumble primitives is applied. This prevents any neighboring isolated objects from coming
close (due to a spread or tumble action) and forming a new cluttered region again.
After all regions have been manipulated, the perception pipeline is invoked again and
the whole procedure is repeated as shown in Fig. 2.4.
2.6 Results
Fig. 2.13 shows a sequence of snapshots of the table while the pipeline was running.
We compared our manipulation-aided approach to a naïve sorting approach in which
the perception pipeline essentially remains the same as in Fig. 2.8 but the pick and drop
action is called for all regions irrespective of their states. Thus, in the naïve approach,
the robot tries to pick and drop every object it sees, irrespective of the degree of clutter.
Our evaluation metric is the number of successful grasps. We categorize failures as the
following:
Empty grasp: A grasp is attempted and the gripper moves to the object but fails
to grasp it.
Double grasp: The gripper grasps two objects instead of just the one it intended.
This usually occurs when an object is surrounded by other objects and a non-optimal
Figure 2.13: Snapshots from the sorting experiment of a cluttered scene
grasp is planned. Double grasp of objects of the same color may also occur because the
spatial color-based clustering algorithm cannot distinguish between them. Although
a double grasp is acceptable when sorting by color, it is unacceptable when sorting
by size and for applications that require single object extraction.
Lost object: A grasp is not attempted because the target object is out of reach
(too far on the table or dropped on the floor). Such an object is never picked up
and is lost.
We use occupancy density as a measure of the degree of clutter in the scene. Let A
be the area of the bounding box of all objects in the scene and let n be the number of
objects. Then, the occupancy density is given by n/A. Although clutter is a subjective
term and this is only one of the many possible measures that could be defined for the
degree of clutter, it gives us an estimate of how densely packed the objects are.
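As a concrete instance, using the numbers reported later in this section for the scene in Fig. 2.17a:
\[
n = 18, \qquad A = 0.31\,\mathrm{m} \times 0.21\,\mathrm{m} \approx 0.0651\,\mathrm{m}^2, \qquad
\text{occupancy density} = \frac{n}{A} = \frac{18}{0.0651} \approx 276.5\ \mathrm{Duplos/m^2}.
\]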
For uncluttered scenes, both approaches reduce to the same algorithm and are
successful in sorting all objects correctly. Fig. 2.14 shows the results of sorting by color
and object length for uncluttered scenes.
We also tested our algorithm on 10 other configurations of varying occupancy densities
and number of objects. Fig. 2.15 shows these configurations and compares the naïve sorting
with our manipulation-aided algorithm for a single trial for each of them.
(a) Sorted by color (b) Sorted by length
Figure 2.14: Results of sorting Duplo bricks by color and size for an uncluttered scene
using our pipeline on the PR2 robot
Figure 2.15: Comparison of the performances of the naïve and manipulation-aided sorting
approaches on various metrics was done for the ten configurations shown in this figure.
The results are reported for a single trial for each configuration in Fig. 2.16.
Fig. 2.16a shows the percentage of failures (w.r.t. the total number of objects) for
different scenes using the two algorithms. The naïve approach results in an average of
37% failures (with at least 20% failures) while our approach results in an average of 12%
failures (minimum 0 failures). The number of failures of our approach is significantly lower
than that of naïve sorting (paired t-test: p < 0.001). Fig. 2.16b compares the percentage
of objects that were successfully sorted. The naïve approach results in an average of
90.3% (and minimum 73.3%) success rate while our approach results in an average of
97.8% (and minimum 93.3%) success rate. Since naïve sorting also ends up disturbing the
scene while attempting grasps, some of these accidental perturbations may result in useful
decluttering, thus aiding successful grasping and sorting. However, our algorithm provides
a more structured and reliable method of perturbing the scene resulting in a significantly
(a) Percentage of failures (b) Percentage of successfully sorted objects (c) Number of
pickup attempts as a fraction of the total number of objects (ideal = 1) (d) Time taken
(seconds)
Figure 2.16: Comparison of the performances of the naïve and the manipulation-aided
sorting on various metrics for the ten configurations shown in Fig. 2.15. The results are
reported for a single trial for each configuration.
(a) A relatively uncluttered scene (occupancy density = 276.5 Duplos/m²) (b) A more
cluttered scene (occupancy density = 526.3 Duplos/m²) (c) A piled up scene (occupancy
density = 857.1 Duplos/m²)
Figure 2.17: The naïve and the manipulation-aided sorting algorithms were compared for
5 trials on each of these 3 configurations of increasing occupancy density.
higher number of successfully sorted objects (paired t-test, p < 0.05). Thus, our approach
cuts down failures to a third while increasing success rate by an average of 7%.
Fig. 2.16c compares the number of pickup attempts made as a fraction of the total
number of objects. A fraction of 1 is the ideal value because it means there were exactly as
many pickup attempts as there were objects – no more and no less. We observed that this
fraction stays close to 1 for our algorithm while it varies a lot for the naïve algorithm. In
general, we see that the naïve algorithm requires an average of more than 1 pickup attempt
per brick; when it requires fewer, it is because there have been some double grasps.
Finally, Fig. 2.16d compares the time taken to sort all objects by the two approaches.
The naïve approach requires an average of 1.34 manipulations and 33.5 seconds per
successfully sorted Duplo. On the other hand, our approach requires an average of 44.7
seconds and 1.78 manipulations per successfully sorted Duplo. Our algorithm needs
nearly 50% more time and 33% more manipulations on average to sort an object
correctly than naïve sorting because it introduces more manipulation maneuvers into the
pipeline to spread out the clutter. The naïve sorting pipeline takes approximately 30
seconds to sort each brick (about 6 seconds for perceptual processing and 24 seconds for
grasp planning, moving the arm to the pickup location, picking the brick up, moving the
arm to the bin location, and placing the brick). Grasp planning and arm navigation are
the components that slow down the pipeline. The manipulation-aided sorting approach
additionally manipulates the clutter and hence adds more instances of arm navigation to
the pipeline, thus imposing an additional cost in terms of time. One spread action takes
about 7 seconds while one tumble action takes about 10 seconds. However, for highly cluttered
scenes, the naïve sorting pipeline results in many failed grasps and retries. In contrast,
our algorithm takes additional time to spread out the clutter in a structured, predictable
way and thus results in more robust and reliable sorting.
Fig. 2.18 shows the number of spread and tumble operations applied as a fraction of
the total number of objects for the 10 configurations in Fig. 2.16.
Figure 2.18: Fraction of spread and tumble operations applied by manipulation-aided
sorting for the 10 configurations in Fig. 2.16
In a second set of experiments, we repeatedly examined object sorting of three scenes
with different degrees of clutter as shown in Fig. 2.17a-2.17c. We carried out 5 trials for
each approach and each scene (by carefully reconstructing the scene manually before each
run) and evaluated the performance of each trial.
1. Fig. 2.17a shows a relatively uncluttered scene of dimensions 0.31 m × 0.21 m
containing 18 objects, giving an occupancy density of 276.5 Duplos/m². On average,
the naïve approach failed in 12.4 grasps while our approach only failed in 1.8.
2. Fig. 2.17b shows a cluttered scene of dimensions 0.18 m × 0.19 m containing 18
objects, giving an occupancy density of 526.3 Duplos/m². On average, the naïve
approach failed in 7.8 grasps while our approach only failed in 1.4.
3. Fig. 2.17c shows a piled up scene of dimensions 0.15 m × 0.14 m containing 18
objects, giving an occupancy density of 857.1 Duplos/m². On average, the naïve
approach failed in 8.4 grasps while our approach only failed in 1.6.
Figure Index | Occupancy Density (Duplos/m²) | Approach | Empty Grasps | Double Grasps | Lost Objects | Total Number of Failures | Successfully Sorted (out of 18) | Total Number of Manipulations
Fig. 2.17a | 276.5 | Naïve Method | 10 | 1.2 | 1.2 | 12.4 | 15.2 | 25.8
Fig. 2.17a | 276.5 | Our Method | 1.8 | 0 | 0 | 1.8 | 18 | 27
Fig. 2.17b | 526.3 | Naïve Method | 6.2 | 1.4 | 0.2 | 7.8 | 14.8 | 22.8
Fig. 2.17b | 526.3 | Our Method | 1 | 0.2 | 0.2 | 1.4 | 17.6 | 27.8
Fig. 2.17c | 857.1 | Naïve Method | 5.8 | 1.4 | 1.2 | 8.4 | 14.8 | 21.6
Fig. 2.17c | 857.1 | Our Method | 1.4 | 0 | 0.2 | 1.6 | 17.6 | 29.8
Table 2.1: Performance comparison of the naïve and the manipulation-aided sorting
approaches
Figure Index | Occupancy Density (Duplos/m²) | Approach | Empty Grasps | Double Grasps | Lost Objects | Total Number of Failures | Successfully Sorted (out of 18) | Total Number of Manipulations
Fig. 2.19 | 526.3 | Naïve Method | 17 | 0 | 1 | 18 | 17 | 34
Fig. 2.19 | 526.3 | Our Method | 1 | 0 | 0 | 1 | 18 | 29
Table 2.2: Average number of different kinds of failures using the two approaches for the
pathological case in Fig. 2.19
Table 2.1 shows the detailed results for each scene averaged over the 5 trials. We
observe that our approach performs better than the naïve approach on several metrics
for all the scenes. In particular, the number of failures is much lower for manipulation-
aided sorting, because it first manipulates the scene to make it less cluttered (as shown
in Fig. 2.13) and only then attempts any grasps. Our approach also results in more accurate
sorting with a success rate of close to 100%. It, however, uses more manipulations than
the naïve approach on average, particularly as the degree of clutter increases.
Compared to other existing techniques dealing with object singulation from a pile, our
approach is more efficient in the average number of manipulations per object. In contrast
to Chang et al. [29] (4-9 perturbations per object) and Katz et al. [15] (average of 1.9
perturbations per object), our approach needed an average of 1.76 manipulations to sort
an object.
There may be some pathological cases where all objects are packed tightly together and
the upper faces of all of them are level with each other. For example, in the configuration
shown in Fig. 2.19, naïve sorting performed exceptionally poorly in terms of the number of
failed pickup attempts. It resulted in a total of 18 failures (17 empty grasps, 1 lost object)
and 34 manipulations while our algorithm resulted in only 1 failure (1 empty grasp) and
29 manipulations. 17 out of 18 objects were correctly sorted by the naïve algorithm in
14:18 min while our approach sorted all of them correctly in 12:58 min.
Figure 2.19: Pathological case where naïve sorting results in a large number of failures
Please see the supplemental video for a demonstration of how the two algorithms
work. The clips for the spread and tumble actions in the video are only snapshots of
the complete sorting. Note that every time a set of spread and tumble primitives are
applied in an iteration, a few objects get isolated. These objects are first picked up and
sorted before a new set of spread and tumble actions is executed in the next iteration.
However, these pick-up actions are not shown in the video. A brick was manually removed
from the scene and counted as a lost object only when it was too far for the robot to reach
or had been dropped on the floor.
2.7 Conclusion
This work explored manipulation-aided perception and grasping in the context of sorting
small objects on a tabletop. We presented a novel planning algorithm that combines
perception and manipulation to accurately and robustly sort arbitrarily cluttered objects
by some property (e.g., color, size, or shape). To achieve reliable perception and grasping
in the sorting process, our algorithm uses two motion primitives to manipulate the scene
in ways that help the robot to reduce clutter and resolve occlusions of objects. This
substantially enhances the sorting capabilities of a robot with a simple parallel-fingered
hand in the case of cluttered and piled up objects. In addition, we implemented our algorithm
on the PR2 robot and presented its successful application to sorting Duplo bricks by color
and size. The experimental results demonstrate that manipulation-aided sorting provides
a more consistent and reliable approach to accurate sorting as compared to object sorting
without the aid of manipulation.
2.8 Discussion
There is one important limitation of manipulation-aided sorting: it is often slower than
naïve sorting as it introduces more manipulations into the pipeline. While these deliberate
and controlled manipulations of the pile increase reliability and accuracy of sorting, they
also increase latency in the system. We argue that in domestic settings, where personal
robots would operate, this is not a deal-breaker. The scale of domestic tasks would not
be as large as in an industrial setting and, thus, trading off throughput for safety and
accuracy would be acceptable.
Speed is nevertheless a desirable trait. We suggest a few ways in which higher
throughput could be achieved without sacrificing reliability. First, the spread action
could be made smarter by choosing spreading directions based on the region's minimum
bounding box. Locations of other regions could also be taken into account to ensure that
a spread action does not interfere with the neighboring regions. Second, for a robot with
two arms, both arms may be used, one for spreading and tumbling, and the other for
picking up isolated objects. This would, however, need careful arm motion planning so
that the arms do not collide. Finally, the sequence of spread and tumble operations
could be decided such that a region that is more isolated from others could be disturbed
and cleared away first, creating space for spreading out other regions.
Note that our technique of reducing clutter using manipulation would be applicable
even for complex objects as long as they can be reasonably segmented. We used single-
colored objects in our experiments only to simplify segmentation. In fact, any oracle that
could tell us the states of different regions is sufficient for our algorithm to work.
It is also worth mentioning that the broad idea of planning with manipulation
primitives presented here is extensible to domains other than sorting. Any problem that
has as input a set of movable objects, a set of goals, a set of manipulation primitives,
and a definition for the utility of a primitive could be tackled by an approach similar to
ours. Some examples of such problems include setting and clearing a dining table,
searching for an object in cluttered spaces, serving food and drinks, and cleaning a
messy room. The library of manipulation primitives for these tasks might consist of
actions like pick and place, push and pull, pour, and carry to name a few.
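To make this recipe concrete, a minimal Python sketch of such a primitive-selection loop is given below. All names here (observe, primitives, utility, execute, goal_reached) are illustrative placeholders for this generic pattern, not components of the sorting system described above.

# Hedged sketch of a generic observe-plan-act loop over a library of manipulation
# primitives; all names are illustrative placeholders for the generic pattern.

def plan_with_primitives(observe, primitives, utility, execute, goal_reached,
                         max_steps=50):
    """observe() -> current world state; primitives: list of functions mapping a
    state to the applicable actions of one primitive type (e.g. pick-and-place,
    push, pour); utility(state, action) -> estimated usefulness of the action;
    execute(action) -> command the robot; goal_reached(state) -> bool."""
    for _ in range(max_steps):
        state = observe()
        if goal_reached(state):
            return True
        # Enumerate every applicable instantiation of every primitive.
        candidates = [a for prim in primitives for a in prim(state)]
        if not candidates:
            return False                     # no feasible action left
        execute(max(candidates, key=lambda a: utility(state, a)))
    return False

The design choice is the same as in the sorting planner: the task-specific knowledge lives entirely in the primitive library and the utility definition, while the loop itself stays generic.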
Chapter 3
Exploration in Clutter
Figure 3.1: Cluttered environments require manipulation for exploration.
Exploration of our environment is a commonplace occurrence in our everyday lives.
Think about looking for a matching pair of socks in your drawer, finding a spot to place
a milk carton in your already overstocked refrigerator, searching for and counting coins
to do your laundry - the examples are endless. The exploration is made tougher by the
fact that our homes are cluttered - unstructured, and containing a lot of objects. The
location of the same object may also vary from home to home. The good news is that we
usually succeed at the task and find what we were looking for in reasonable amounts of
time. An important factor that facilitates or even enables this exploration is our ability
to manipulate the world around us.
It is this idea of interactive exploration in clutter that we address in this chapter.
Robotic exploration of cluttered environments is a challenging problem.
The number and variety of objects present not only make perception very difficult but
also introduce many constraints for robot navigation and manipulation. In this work, we
investigate the idea of exploring a small, bounded environment (e.g., the shelf of a home
refrigerator) by prehensile and non-prehensile manipulation of the objects it contains.
The presence of multiple objects results in partial and occluded views of the scene. The
robot can manipulate the world to observe more of it but the kind of actions allowed
include only rearrangement of the objects - they cannot be permanently removed. This
inherent uncertainty in the scene's state forces the robot to adopt an observe-plan-act
strategy where perception, planning, and execution have to be interleaved since
execution of an action may result in revealing information about the world that was
unknown hitherto, and hence a new plan needs to be generated.
We present a multi-step lookahead algorithm which plans for a sequence of actions that
will give the maximum reward under the current information about the world state. We
evaluate our planner for planning time and number of moves required using simulations
over a large set of scenarios. Simulation results show that as compared to rearranging
objects greedily or randomly, our proposed algorithm guarantees complete exploration of
the environment (unless no valid moves are left) and requires fewer moves
to do so. These savings in the number of actions become even more significant as the number of
objects in the scene increases. This is important since manipulation by a real robot is
usually very slow and thus, we want to minimize the number of moves.
We then present an implementation of the algorithm on the PR2 robot and apply it
to object search. The robot has a fixed and partial view of the scene but has access to
two manipulation primitives - pick and place, and push. We present preliminary results
with a pipeline that uses these simple motion primitives to rearrange a cluttered scene in
ways that eventually reveal the target object.
Our algorithm interleaves adaptive lookahead planning with object manipulation and
has direct applications in real world problems like object search, object counting, and
scene mapping. Our work simultaneously falls in the categories of search and exploration,
sequential decision making, and interactive perception. We discuss the relevant related
work in each category in the next section.
3.1 Related Work
The problem of exploration and object search in different kinds of environments has been
studied extensively. Active visual search ([30], [31], [32]) searches for a target object lying
out in the open but the challenge is to decide where to move the camera to locate the
target. In pursuit-evasion ([33], [34], [35]), the problem is to come up with an exploration
strategy for the pursuer such that all evaders are detected quickly. This has been studied
under various constraints on the number of pursuers and evaders, visibility range, discrete
and continuous world, etc. Work on search and rescue operations ([36], [37]) focuses on
efficient algorithms for robot navigation and coordination of a team of robots to search a
disaster site thoroughly.
Partially Observable Markov Decision Processes, or POMDPs, are a popular planning
framework for sequential decision making under uncertainty in states and state transitions.
However, a POMDP becomes intractable very quickly as the size of the state space grows.
Heuristic-based algorithms like A*, D*, and D* Lite ([38], [39]) work well when the goal
state is known and a good heuristic is used. Our problem of environment exploration
requires planning under uncertainty but the goal state is unknown, and there is no prior
on object locations.
Traditionally, research on mobile manipulation considers collisions with obstacles
unacceptable. More recently, there has been some work on interactive perception where
the robot actively manipulates the world to complete the overall task at hand.
Fitzpatrick [9], Chang [29], and Schiebener et al. [12] use interaction with objects for
segmentation and recognition of unfamiliar objects; Katz and Brock [13] employ it to
obtain a kinematic model of an unknown object and use it for purposeful manipulation;
and Dogar and Srinivasa [40] use it for more effective grasping in clutter.
There has been some very recent work on object search using manipulation. Wong,
Kaelbling, and Lozano-Pérez [41] use spatial and object co-occurrence constraints to guide
the search. Objects are permanently removed from the scene in the course of the search.
Dogar et al. [42] prove theoretical guarantees about their algorithm being optimal within
certain conditions, such as a perfect segmentation of the scene, recognition of the objects
(that have known shape) and the possibility to remove all objects from the scene. They
assume that the target is the only hidden object in the scene - if not, then all other hidden
objects are smaller than the target in size. Our planning algorithm uses rearrangement of
objects as opposed to permanent removal and employs two different types of maneuvers -
pick and place, and push - depending on the object to be manipulated. We do not impose
any constraints on the number or size of hidden objects. We do not require object shapes
to be known either.
3.2 Interactive Environment Exploration
The goal of interactive environment exploration in clutter is to explore a small, bounded,
cluttered environment (e.g., a shelf of a refrigerator or a book shelf), by rearranging
the objects till the whole environment has been explored. We assume objects cannot be
removed permanently from the scene, such as by placing them on another table. Here,
we are not interested in identifying the individual objects in the search area, but only in
knowing the state of all areas of the volume as occupied or free. It is important to note
that the total number of objects in the environment is not known a priori. The exploration
algorithm could also be applied to object search or object counting. Since the scene is
only partially visible due to object occlusions, an observe-plan-act strategy is called for,
as new objects could be discovered that the current plan does not yet account for.
To begin understanding the nature of this problem, we study a simplied setup with
the following assumptions:
1. The size of the world is (approximately) known.
2. The environment is a grid-world with at most one object in each cell. Additionally,
each object occupies only one cell.
3. An object is visible to the camera only if there is no object in front of it in the grid.
Therefore, no object is partly occluded by another object.
4. Objects are not in contact with each other.
5. The robot has a fixed point of view.
We will remove some of these assumptions in Section 3.4 where an adaptation to
problems in a real, continuous world is discussed.
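As a small illustration of assumptions 2, 3, and 5, the Python sketch below computes which objects in such a grid are visible to a camera placed along the lower (front) boundary. The grid encoding and the function name are our own and only meant to make the occlusion rule explicit.

# Hedged sketch of the visibility rule in the grid world: an object is visible
# only if no other object occupies a cell between it and the camera (front side).
# grid[r][c] is True if cell (r, c) holds an object; row 0 is the front row.

def visible_objects(grid):
    visible = []
    n_rows, n_cols = len(grid), len(grid[0])
    for c in range(n_cols):
        for r in range(n_rows):          # scan each column from front to back
            if grid[r][c]:
                visible.append((r, c))   # frontmost object in this column
                break                    # everything behind it is occluded
    return visible

example = [
    [False, True,  False, True ],   # front row (closest to the camera)
    [True,  False, False, True ],
    [False, True,  True,  False],
]
print(visible_objects(example))     # -> [(1, 0), (0, 1), (2, 2), (0, 3)]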
We propose an adaptive lookahead exploration algorithm that guarantees complete
exploration of the environment unless there are no more actions possible. To start with,
our algorithm assigns a state of free, occupied, or unknown to every grid cell. The goal is
to know the state of every cell and only the objects that are visible at a particular instant
can be moved by the robot. Allowed actions are to move a visible object from its current
cell to any cell known to be free as long as the free cell is not occluded by an object in
front. The algorithm takes a starting horizon (or lookahead) value, H_0, as input and in every iteration,
chooses the sequence of actions of length H_0 that would reveal the most unknown cells
among all action sequences. We refer to the number of unknown cells whose state is
revealed by an action sequence as the information gain of that action sequence. If all H_0-length
plans result in no information gain, the horizon is incremented by 1 until a non-zero
information gain is obtained. The algorithm then plans with the longer horizon, executes
the plan, and falls back to H_0 for the next iteration. This procedure is repeated until all
cell states are known (free or occupied).
The details of the algorithm are given in Algorithm 1. The inputs to the algorithm
are the number of rows (N_r) and the number of columns (N_c) in the discretized world
grid, and the starting horizon value (H_0). The algorithm has the property that it chooses
a longer planning horizon, which requires more planning time, only when it is absolutely
necessary and hence, it is adaptive. It guarantees complete exploration of the environment
unless no valid moves are left.
The Adaptive Horizon Exploration algorithm tries to maximize the information gain of
the search with the given planning horizon. The motivation behind this is to minimize the
number of actions required to completely explore the environment by exploring as much
of the environment as possible in each iteration. However, it should be pointed out that the
environment is partially observable and the total number of objects unknown. Planning is
done with partial information and thus, may end up being non-optimal if hidden objects
are revealed during the exploration. Choosing a longer horizon does not necessarily reduce
the number of actions required for complete exploration for the same reason. Another
important point is that, when a longer planning horizon is chosen, the planner favors actions
whose resulting object placements do not block other objects that are visible
(and hence, movable) at the time of planning.
Algorithm 1 Adaptive Horizon Exploration(N_r, N_c, H_0)
 1: while true do
 2:     V = Find-Visible-Objects
 3:     Occ = Update-Occupied-Cells
 4:     Free = Update-Free-Cells
 5:     N(unknown) = N_r × N_c − (Occ + Free)
 6:     if N(unknown) = 0 then
 7:         Environment explored. Exit.
 8:     end if
 9:     H_curr = H_0                              ▹ H_curr - current horizon
10:     [IG, A] = plan(V, Occ, Free, H_curr)
11:                                               ▹ IG - information gain, A - action sequence
12:     if IG = 0 then
13:         H_curr = H_curr + 1
14:         Re-plan. Go to step 10.
15:     end if
16:     Execute all actions in A.
17: end while
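The Python sketch below mirrors the control flow of Algorithm 1. The helper functions observe_grid, plan, and execute stand in for the perception, planning, and manipulation components, and the max_horizon cap is an addition of this illustration (the pseudocode above keeps incrementing the horizon instead); none of these names belong to the actual implementation.

# Hedged sketch of the adaptive-horizon loop from Algorithm 1. The helpers
# observe_grid(), plan() and execute() are hypothetical stand-ins, and the
# max_horizon cap is an assumption of this sketch only.

UNKNOWN = 0

def adaptive_horizon_exploration(observe_grid, plan, execute, h0=1, max_horizon=5):
    while True:
        cells = observe_grid()                      # dict: (row, col) -> cell state
        if all(s != UNKNOWN for s in cells.values()):
            return True                             # environment fully explored
        horizon = h0
        info_gain, actions = plan(cells, horizon)   # best action sequence of this length
        while info_gain == 0:
            horizon += 1                            # adapt: look further ahead
            if horizon > max_horizon:
                return False                        # treat as "no useful moves left"
            info_gain, actions = plan(cells, horizon)
        execute(actions)                            # next iteration plans with h0 again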
A conceptual overview of the complete exploration pipeline is shown in Fig. 3.2. Note
that the planner has an additional input, the definition of gain of an action, and an action
is selected based on the current state as well as its gain.
3.3 Simulation Results
We carried out simulations of Algorithm 1 in MATLAB. Fig. 3.3 depicts the steps of the
algorithm for a 3 × 4 grid with 6 objects and an initial horizon of 2. This environment
was fully explored in 1.4 seconds using 6 actions. This time, of course, does not include
any time required for the actual manipulation of objects.
Fig. 3.4 shows the effect of starting with different horizon values (H_0) on the number of
actions required and time taken (averaged over 5 different placements of objects) to explore
the whole grid. The test environment was a 3 × 4 grid with 5 objects. We observe that
though the number of planning iterations goes down as the horizon increases, each iteration
takes longer and in fact, the number of actions required increases monotonically. Note that
the horizon is adaptive and so, it may not stay at H_0 all the time during exploration.
[Figure 3.2 block diagram - Perception: take observation; Planning: find feasible actions, find gain of each action, increase horizon if max gain = 0; Manipulation: execute action with maximum gain; update state]
Figure 3.2: Conceptual overview of the exploration planner.
It is worthwhile to reiterate here that since planning is done with partial information about
the environment, the plans may end up being non-optimal. Therefore, a longer horizon
may not necessarily result in fewer actions.
Many other simulations on environments of varying sizes and degrees of clutter suggest
that a starting horizon of 1 or 2 is often the best choice. Keeping the planning speed
in mind, it may be a good idea to start with H_0 = 1 and let the algorithm switch to a
horizon of 2 as and when needed.
We also compare our approach to a random planning approach where a sequence of
actions is chosen randomly from the available set of actions at any time. Results comparing
the performance of the two approaches on a 3 × 4 grid-world and a horizon of 2 with
increasing clutter are shown in Fig. 3.5. We plot the ratio of the average (over 50 different
placements of objects) number of actions taken for random planning to those for adaptive
horizon planning. We see that random planning needs nearly 2 to 6 times more actions
than our approach before the whole grid is explored. Since manipulation dominates the
execution time on an actual robot, exploration using random planning would clearly be
much slower, particularly as the degree of clutter increases.
Fig. 3.6 compares our algorithm to an algorithm with a fixed horizon of 1 on a 4 × 4
grid with a varying number of objects.
[Figure 3.3 panels (a)-(g): (a) starting configuration, (b)-(f) intermediate steps, (g) final configuration]
Figure 3.3: Simulation of the adaptive horizon exploration on a 3 × 4 grid with 6 objects
and a horizon of 2. The camera is placed along the lower boundary of the grid. Circles
represent objects that are visible, squares represent objects that have never been seen so
far and are thus unknown, diamonds represent objects that have been seen at least once
so far and so their locations are exactly known, and crosses represent grid cells whose
state is unknown. The unmarked cells with no objects are known to be free. In the final
configuration, all cell states are known, as shown by the absence of cross marks. This
exploration needed a total of 6 actions and 1.4 seconds.
[Figure 3.4 plots vs. horizon length (1-4): (a) number of iterations and actions (total iterations, total actions); (b) exploration time (s)]
Figure 3.4: Comparison of the performance of the adaptive horizon exploration with
varying horizons on a 3 × 4 grid with 5 objects.
Fig. 3.6a plots the number of actions required (averaged over 20 different placements of objects) by the two algorithms to explore as
much of the environment as possible (i.e., before no more information gain is possible or
no moves are left) as the number of objects in the environment increases, while Fig. 3.6b
plots the average number of unexplored cells with increasing degree of clutter. Fig. 3.6a
seems to indicate that the fixed horizon algorithm uses fewer actions than our approach to
explore the environment, but we see from Fig. 3.6b that it consistently fails to completely
explore the environment. This happens because fixed horizon plans soon result in no
more information gain, thus ending the exploration prematurely. As clutter increases,
more and more of the environment remains unexplored by the fixed horizon algorithm.
For very cluttered scenes with most of the cells in the grid occupied, the adaptive horizon
algorithm also fails to complete exploration because no valid moves are possible.
Fig. 3.7 shows the number of times (averaged over 20 configurations) a horizon value
is used by the adaptive horizon algorithm as the degree of clutter in a 4 × 4 environment
increases.
Figure 3.5: Comparison of the performance of the adaptive horizon exploration with
random planning. The x-axis shows the number of objects in a 3 × 4 grid-world and the y-axis
shows the ratio of the number of actions required for complete exploration by the random
planning algorithm to that of the adaptive look-ahead exploration.
We see that higher planning horizons are needed more and more as clutter
increases, while the usage of lower horizons goes up as well. This shows that adapting
the horizon is essential to achieve complete exploration, particularly in highly cluttered
scenes.
The simulation results on a simple grid world presented in this section helped us
evaluate the nature of the environment exploration problem and analyze various aspects
of our adaptive horizon exploration algorithm. These results indicate that our algorithm
performs better than random and greedy approaches on average in terms of number of
actions. They also suggest that starting with an initial horizon of 1 and adapting it
as and when needed may be better than choosing a longer horizon to begin with. We
present the application of our proposed algorithm to object search in a real world cluttered
environment in the next section.
3.4 Interactive Object Search
In robotics, object search has typically been confined to active visual search, which refers
to moving the camera so as to locate a target that is otherwise lying out in the open,
not occluded by other objects.
[Figure 3.6 plots vs. number of objects (4-14), adaptive horizon vs. fixed horizon of 1: (a) average number of actions taken; (b) average number of unexplored cells]
Figure 3.6: Comparison of the performance of the adaptive horizon exploration with fixed
horizon exploration (horizon = 1) on a 4 × 4 grid with a varying number of objects. All
numbers are averaged over 20 different object configurations.
Here we show how robotic manipulation can be used to
interact with the world and locate the target.
In practice, the assumptions of a grid world and absence of partial occlusions made
in Section 3.2 are unrealistic. Moreover, it may be possible to manipulate an object in
different ways depending on its size and surrounding environment. While an object that
is not surrounded closely by other objects may be picked up and placed at a new location,
it may only be possible to push and slide objects around in heavy clutter. Some objects
may simply be too large for the robot to grasp.
[Figure 3.7 plot vs. number of objects (4-12): counts for Horizon = 1, Horizon = 2, Horizon = 3]
Figure 3.7: Figure showing the average number of times a particular value of horizon
results in information gain (and hence, a plan) with respect to degree of clutter for
Algorithm 1.
We, therefore, relax the assumptions of a grid world and no partial occlusions, and introduce two kinds of manipulation (pick and
place, and push) for implementation on a real robot.
We implemented our planner on the PR2 robot [43] to test its feasibility in the real
world. PR2 is a semi-humanoid robotic platform developed by Willow Garage. Fig. 3.8
shows our experimental setup. We mounted a Microsoft Kinect sensor on the robot head.
All perception data in this chapter are from this sensor. Several objects of different shapes
and sizes are placed on a high shelf in front of the PR2 such that the robot can see only
some of the objects. The shelf is high enough to deny the robot a top view of the objects
but not too high to hinder manipulation.
The next section gives the details of the algorithm as applied to a real world
environment.
3.4.1 Implementation
The planner was tested on the PR2 by building a ROS package that implements the
pipeline depicted in Fig. 3.9. All steps of the implementation are detailed below.
Figure 3.8: Experimental setup. The PR2 with a head-mounted Kinect views the objects
on a shelf from the front. The task is to search for a target object by rearranging other
objects on the shelf.
Perception
a) Read in a point cloud from the Kinect.
b) Use thresholding to retain only that region from the point cloud that contains the
shelf (Fig. 3.10a, 3.10b) and then separate the planar surface from the objects
(Fig. 3.10d) using planar segmentation. The approximate size and location of the
shelf need to be known for this step. Reduce the size of the object cloud using
downsampling. These algorithms are available in the Point Cloud Library
(PCL) [27].
c) Calculate the dimensions of the planar surface. This will constitute our planning
bounding box as it defines the area in which the objects may be located (Fig. 3.10d).
d) Extract individual object clouds using spatial clustering of the objects' point cloud.
Note that not all objects may be fully visible and for most objects, only the front
surface is visible to the camera. Therefore, these point clouds are incomplete and
may not represent the objects correctly. To take this into account, bounding boxes
are defined for each cluster. Since object depth is often not perceived correctly in
this setup, the bounding box is given some additional depth if the depth of an object
cluster is below a threshold (Fig. 3.10e). Also, if the point cloud does not touch the
table, it corresponds to an occluded object and its bounding box is extended to the
table.
[Figure 3.9 block diagram - Perception: input point cloud, planar segmentation and object cloud extraction, shelf bounding box computation, spatial clustering to find visible objects; Planning: build/update voxel grid of free, occupied, and hidden voxels, sample hidden space for valid target poses, identify movable objects, sample free space for their new poses, calculate gain for each feasible move, find a sequence of moves that maximizes gain; Manipulation: execute the planned sequence of moves]
Figure 3.9: An overview of the implementation pipeline on the PR2 robot searching for
an object in a real world scenario.
A number of points are then generated to fill this object bounding box and
this new point cloud is then used for planning instead of the actual object clusters.
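As a rough, self-contained illustration of the planar-segmentation step above (done on the robot with PCL's RANSAC-based routines), the NumPy sketch below fits a dominant plane to a point cloud and splits off the remaining object points. The thresholds and the simple random-sampling scheme are illustrative choices of this sketch, not the values or code used in the implementation.

import numpy as np

# Hedged sketch: RANSAC-style plane fit to separate the shelf surface from the
# object points, loosely mirroring the PCL planar-segmentation step described above.

def segment_plane(points, dist_thresh=0.01, iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iterations):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                       # degenerate (collinear) sample
            continue
        normal /= norm
        dists = np.abs((points - p1) @ normal)
        inliers = dists < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    plane_points = points[best_inliers]       # shelf surface
    object_points = points[~best_inliers]     # points belonging to the objects
    return plane_points, object_points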
Planning
a) Build a voxel grid using an octree [44] from the object cloud to represent free,
occupied, and unknown voxels in the planning scene bounding box. In Fig. 3.10f,
the voxels marked as red encode the occupied and hidden space. The rest of the
voxels in the planning bounding box are encoded as free (not shown in the figure).
b) Among the detected clusters, clusters corresponding to movable objects are
identified. These are objects that are not partially occluded by other objects and
offer a clear path for the gripper for a front grasp (Fig. 3.10g). In this work, we
use only front grasps for all objects.
c) Sample the hidden space for valid poses of the target object (Fig. 3.10h). The size
of the target object is assumed to be known and it is modeled as a point cloud. A
valid pose is one for which none of the points in the target point cloud are in the
free space (and hence, not visible to the camera), the object touches the support
surface, and is upright. Rotation about an axis perpendicular to the support surface
is allowed.
d) For each movable object, sample the free space for locations where the object may
be relocated. If the object can be grasped, its new location could be anywhere in
the free space. If the object is too big to be grasped, its new location could only be
in the neighboring free space where it can be pushed to. These locations constitute
the set of possible moves (Fig. 3.10i). Consecutive moves of the same object in the
same planning iteration are not allowed.
e) Simulate each possible move and calculate the percentage of sampled target poses
that are revealed by it. We will refer to this percentage as the information gain of
the move.
f) Use Algorithm 1 to find a sequence of actions that maximizes information gain
(Fig. 3.10j) and send it to the manipulation node for plan execution.
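A minimal sketch of step e) is given below: it counts the fraction of sampled target poses that would no longer fit entirely inside the hidden space after a simulated move, which is the information gain used to rank candidate moves. The voxel-set representation and the helper simulate_move are assumptions made for the illustration.

# Hedged sketch of step e): the information gain of a candidate move is the
# fraction of sampled target poses that would no longer fit entirely inside
# hidden space after the move. Voxel sets and simulate_move() are assumptions.

def information_gain(move, hidden_voxels, target_pose_samples, simulate_move):
    """hidden_voxels: set of voxel indices that are occupied or occluded.
    target_pose_samples: list of frozensets of voxels, one per sampled pose.
    simulate_move(move, hidden_voxels) -> hidden voxel set after the move."""
    new_hidden = simulate_move(move, hidden_voxels)
    revealed = sum(1 for pose in target_pose_samples if not pose <= new_hidden)
    return revealed / len(target_pose_samples)

# The planner can then rank candidate moves, e.g.:
# best = max(moves, key=lambda m: information_gain(m, hidden, samples, simulate_move))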
Manipulation
a) Each move contains information regarding the point cloud to be manipulated, its
origin and destination pose, and the kind of action to be taken (pick and place, or
push). Depending on the kind of action and the direction of displacement, waypoints
are generated for the robot hand to move to, thus displacing the object in the process.
Whether the left arm is used or the right is decided based on the destination of the
object. If the destination lies in the left half of the planning bounding box, the left
arm is used; otherwise the right arm is used.
A new point cloud is obtained from the Kinect after plan execution (Fig. 3.10k) and the
process is repeated. The target object is currently identified by its size or color (assuming
it is the only object in the scene of that size or color) to simplify object recognition.
The search is considered complete when the target object becomes visible to the camera.
If the target object cannot be found, the search stops when there are no hidden areas
left where the target could be located. Since execution of an action may occlude space
earlier observed to have been free or occupied, the octree is updated after successful
execution of moves to retain information of occupied and free voxels across planning
iterations (Fig. 3.10l).
The accompanying video shows the PR2 searching for a small salt shaker in a kitchen
shelf containing 3 other objects of varying size. The starting horizon was 1 in this example.
Note how the first move is a push because the object is too big to grasp. The second
object is, however, picked up and placed at a new location, resulting in the salt shaker
being revealed.
3.5 Complexity Analysis
We now analyze the running time of our algorithm. Let us introduce some notation first:
let N be the total number of objects; p the maximum number of points in the downsampled
point cloud of an object; k_v the sample size for poses of movable objects; k_t the sample size for poses of the
target object; and H the horizon length. The worst case complexity of the various steps in
one iteration of the planner is as follows:
a) Sample target poses: a = O(p k_t)
b) Find movable objects: b = O(pN) + O(N^2)
c) Find possible moves: c = O(p k_v N)
d) For each possible move, simulating the move and checking whether it results in a collision
is d = O(p^2 N). Then, the planner is recursively called with a smaller horizon.
Therefore, if m is the maximum number of possible moves in a recursion, the complexity
W with horizon H is:

W = (a + b + c + md) \sum_{i=0}^{H-1} m^i + m^H d
  = (a + b + c + md) \frac{m^H - 1}{m - 1} + m^H d
  = O\left(p^2 N^{H+2} k_v^{H+1} + p N^{H-1} k_v^{H-1} k_t\right)

Thus, we see that the complexity scales rapidly with increasing horizon, as expected.
However, downsampling the object point clouds will speed up the planner.
(a) Point cloud of the shelf
obtained from the head-mounted
Kinect on the PR2
(b) Top view of the point cloud seen
by the robot shows the occluded
areas of the shelf.
(c) The actual scene has the
target object, a small salt
shaker, hidden behind the yellow
can.
(d) The largest plane detected
(the shelf) is shown in white
while the brown points belong
to objects on the shelf. Pink
lines define the bounding box for
planning.
(e) Bounding boxes of visible objects.
If the depth of an object point
cloud is below a threshold, a fixed
depth is added to its bounding box
for realistic planning of moves and
collision checks.
(f) Side view of the octree -
representation of the space that
is either occupied or hidden and
thus, could be hiding the target
object.
(g) Movable Objects - these
objects are not occluded by
others.
(h) The green arrows depict the
valid target object poses sampled in
the hidden space.
(i) An example valid move.
Purple: old positions, green:
new positions
(j) Move output by the planner -
the axes show the source and the
destination poses of the object to
be moved.
(k) Point cloud obtained from
Kinect after executing the plan
(l) Side view of the occupied and
hidden voxels in the octree after
executing the plan. Note that
the information about the voxels
behind the moved object being
free is retained.
Figure 3.10: RViz snapshots of the algorithm running on a scene of 5 objects - 4 visible
and 1 hidden. The planning horizon is 1 to begin with in this example and the dimensions
of the target object (a salt shaker) were 0.04 m × 0.04 m × 0.1 m.
In our experiments, we found that as few as 100 points were enough to correctly represent the
boundary and volume of a typical kitchen object. Also, dense sampling of the free space
for new object locations does not help much because many samples are then close
together, resulting in similar information gains. So, we can afford to choose a small k_v.
We carried out experiments with a sample size of 20,000 for the target object pose
(k_t) and 50 for poses of movable objects (k_v). With p, k_v, k_t as constants, the complexity
of each iteration becomes O(N^{H+2}). Thus, if complete exploration requires i iterations,
the overall complexity is O(i N^{H+2}). Since each planning iteration results in non-zero
information gain (i.e., at least one target sample is revealed), there can be a maximum of
k_t iterations. In practice, each action will reveal several target samples and thus, i would
be much smaller than k_t. For a scene containing 4-6 objects, the average planning time
for each iteration on the PR2 robot is 16 seconds with a planning horizon of 1 and 60
seconds with a horizon of 2.
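To give a feel for how quickly the per-iteration bound grows with the horizon, the short computation below evaluates the leading term p^2 N^{H+2} k_v^{H+1} for the constants quoted above (p = 100, k_v = 50) and a small scene of N = 5 objects. The absolute numbers are only indicative of scaling, not of actual planning times.

# Hedged sketch: evaluate the leading term of the per-iteration complexity bound
# p^2 * N^(H+2) * k_v^(H+1) for a few horizons, using the constants quoted above.

p, k_v, N = 100, 50, 5          # points per object, move samples, objects
for H in (1, 2, 3):
    bound = p**2 * N**(H + 2) * k_v**(H + 1)
    print(f"H = {H}: leading term ~ {bound:.2e} elementary operations")
# Each horizon increment multiplies the bound by roughly N * k_v (250 here),
# which is consistent with planning for H = 2 being far slower than for H = 1.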
3.6 Discussion
Our solution is effective because meaningful simplifications were made that offer a good
compromise between computational effort and solving power for the presented domain.
However, we consider only a small subset of possible environment interactions (picking
from the front, pushing to the side), and other kinds of maneuvers might be necessary to
solve a given problem. Incorrect initial segmentation is partially accounted for in the
current algorithm, as the segmentation gets updated when changes occur due to
manipulation. There are still paradoxical situations though. For example, if there are two
objects close to each other and the only way to move one of them is to push it into the
other, the current algorithm will not allow this to happen as we explicitly forbid
object-object collisions. If the two were incorrectly segmented as one cluster, however, the robot
happily pushes both together. We are looking into modeling the physical
interactions between objects and allowing pushing one object into another, which is a
powerful manipulation strategy in many real world situations. Uncertainty in a pushing
action's outcome will then come into play and will have to be accounted for.
3.7 Conclusion
We presented an algorithm for environment exploration in small cluttered environments,
e.g., a kitchen shelf, using manipulation of objects. Simulation results show that as
clutter increases, significantly fewer moves are required to completely explore
the environment using our algorithm as opposed to rearranging objects randomly. This
is important since manipulation by a real robot is usually very slow and thus, we want to
minimize the number of moves. Our algorithm also guarantees complete exploration, as
opposed to a greedy exploration, as long as there are valid moves left. We then presented
an implementation of the algorithm on the PR2 robot and applied it to object search.
The robot has a fixed and partial view of the scene but has access to two different
manipulation primitives. We presented preliminary experimental results with a pipeline
that manipulates a cluttered scene in ways that help the robot to eventually locate the target
object. Our algorithm interleaves adaptive look-ahead planning with object
manipulation and has direct applications in real world problems like object search,
object counting, and scene mapping.
In discrete domains, planning can often be optimized by pruning since identical states are
expanded multiple times. Additionally, the state space is constrained due to the discrete
number of possible manipulation primitives and object positions, while the problem of
high dimensionality prevails. The fact that in continuous space there are infinitely many
manipulation primitives as well as infinitely many states constitutes a distinct challenge
that we can only meet by dropping the claim of completeness.
In order to provide an algorithm that would be probabilistically complete with fewer
simplifications to the domain, we are looking into sampling directly from the large unconstrained
set of possible push and pick-and-place primitives. This could be done in an
anytime manner such as RRT*. The tree could initially be filled using a small set of
proven manipulation primitives before allowing the algorithm to explore the vast space
of possible solutions when more time is available and a solution has not yet been found. This
approach will require considering grasp and push planning as well as the modeling of
arbitrary physical interactions. In turn, it would allow the most complex problems to be
solved.
Chapter 4
Contextual Object Search in Clutter
Object search is a frequently encountered problem that the personal robots of our future
homes will have to master. There are two problems that make this task very challenging
for robots. First, our homes and offices typically contain a lot of objects of all kinds,
shapes, and sizes in close proximity, which introduces errors in the robot's perception and
manipulation. Second, each home is organized differently and even within the same home,
an object may not have a fixed location. However, we tend to follow certain organizational
principles while placing objects and these are surprisingly consistent across homes. In this
chapter, we present an object search algorithm that uses this contextual structure, or co-
occurrences, to guide the search with the goal of reducing the number of moves required to
find the target object, since manipulation is typically expensive on a robot. Using large-
scale simulations, we show that using context helps especially when the search environment
is densely packed and multiple layers of objects occlude the target object.
4.1 The Problem
Imagine your friend is visiting you but the day she arrives, you cannot be home with her
because you have to be at work. You ask your friend to make herself comfortable and she
does. In fact, she even prepares lunch in your kitchen though she has never been in it
before. In the process, she has to look for every object and ingredient needed to cook the
meal, but she manages fine for two reasons. First, she uses a mental model of where
various objects are kept in a typical kitchen and how different objects are placed relative
to each other. For example, vegetables are likely to be in the refrigerator while pasta and
sauce are likely to be close to each other. Secondly, she manipulates the environment to
observe the occluded areas and reveals objects that were hidden.
This kind of object search in small but cluttered environments by a robot, where
the manipulation of occlusions becomes necessary, is the problem we are trying to solve
in this chapter. Robotic household assistants of the future would be expected to not
only assist humans in their daily chores but also serve as caregivers to the elderly. They
would, therefore, need to be capable of reliably understanding and manipulating their
surroundings. Since a large variety of objects are found in our homes and many of them
do not have fixed locations, object search would be a common task for the robot. Given
a target object, the robot should be able to search for it in its cluttered and partially
observable environment where the target object may be occluded from view by other
objects. However, an advantage of these human environments is that each everyday object
has a relatively consistent location with respect to other objects. For example, in a kitchen,
a mug is most likely to be found next to other mugs or cups and
glasses. In a garage, a pair of pliers is likely to be close to a hammer and a wrench. In
an office, a stapler is most likely to be found next to other stationery like tape or printing
paper.
Fig. 4.1a and Fig. 4.1b show such examples of cluttered shelves from real kitchens.
Fig. 4.1c shows the under-the-sink shelf in a bathroom. We can see from these pictures
that although these spaces contain a lot of objects, some of which are occluded, objects
that serve a similar purpose have been kept together. For example, bottles and jars are
together, cleaning supplies are next to each other, various dairy products are close by.
Thus, while the clutter in these environments poses a challenge to object search, the
organization principles governing the spatial relations between various objects provide a
tool to guide the search.
In this work, we assume that there are a finite number of categories that all the
objects in the environment could be categorized into. Any pair of objects either belongs
to the same object category (e.g., ketchup and mustard both belong to the category
sauces), different but correlated categories (e.g., a mug (category: mugs and glasses) and
a bowl (category: plates and bowls)), or different and unrelated categories (e.g., a cabbage
(category: vegetables) and pepper (category: condiments)). We assume that we know
these relationships between all category pairs and will refer to these as the co-occurrence
constraints.
Figure 4.1: Examples of everyday environments following organizational principles in the
storage of dierent kinds of objects.
The robot is armed with these co-occurrence constraints and has access to
a perfect object categorization algorithm. Since object categorization implicitly involves
recognition, we will use the terms recognition/categorization interchangeably in the rest of
the chapter. Although several object recognition algorithms already exist, their performance
relies on the amount of training data and the quality of features that the model is trained
on. They are also susceptible to external factors like the lighting conditions and the object
orientation. We, instead, propose to use the Amazon Mechanical Turk (MTurk) service [45]
for object categorization. MTurk is an online marketplace where tasks requiring human
intelligence can be posted for a small fee. Using human workforce for categorization
ensures that the results are accurate without the need for training any classifiers.
We present a planner for object search that uses co-occurrences among object
categories to predict the most likely location of the target object. It then decides
whether to relocate, temporarily remove, replace, or categorize an object to either
improve its prediction or to access the target. Since the environment is partially
observable, the robot iteratively observes, plans, and executes an action. The goal is to
reduce the number of object manipulations required to retrieve the target object since
manipulation is typically expensive on a robot. We show using simulations that using
context especially helps when the search environment is densely packed and the target
object is occluded by multiple layers of objects in front.
A conceptual overview of our contextual search pipeline is shown in Fig. 4.2. Note
that the planner has an additional input, the co-occurrences ψ, and an action is selected
based on the current state as well as its gain, which in turn depends on ψ.
[Figure 4.2 block diagram - Perception: take observation; Planning: find feasible actions, find gain of each action (using the co-occurrences ψ), increase horizon if max gain = 0; Manipulation: execute action with maximum gain; update state]
Figure 4.2: Conceptual overview of the contextual planner
4.2 Related Work
Active Visual Search: Most of the research on object search in cluttered environments
has been done in the context of active visual search (AVS), where the target object is
assumed to be lying out in the open and the task is to find a camera view that has the target
in it. It involves active adjustment of the sensor parameters such that when the target is
in the field of view, its image is of sufficient quality to enable accurate object recognition.
Ye and Tsotsos [31] first approached this as a sensor placement problem using a 3D
visibility map and showed that it is NP-complete. Their work was built upon and improved
by several other researchers. For example, López et al. [46] extend it to the integrated
problem of view planning and multiple object search in an environment using monocular
vision. Ma et al. [32] use a Bayesian framework and a grid-based probability map along
with a combination of global/local search techniques to make the search computationally
inexpensive. However, none of them consider target objects that are hidden from view by
other objects in an environment where changing the camera position alone cannot help,
and object manipulation is necessary to reveal the target. In addition, object recognition
required by AVS has been so far done by training classifiers on the objects to be searched
for.
Using Context for Search: The idea of using a semantic map or high-level object
labels to speed up search has been around for quite some time now. There has been a
lot of work on AVS using context. Kollar and Roy [47] use object-object and object-
scene co-occurrences to predict locations of novel objects. They derive the co-occurrences
using image tags on Flickr and then use them in combination with a prior map of the
environment and several object detectors to search for a new object. Samadi et al. [48]
showed how the Web could be searched to infer object-scene co-occurrences which, in turn,
could be used to drive the robot to a location that is likely to contain the target object.
They use human help to detect and retrieve the object. Aydemir et al. [49] use spatial
relations between objects and landmarks to speed up their active visual search. They use
the concept of indirect search, first proposed in [50], that finds a landmark first and then
uses the landmark to guide the search for the target object. Joho and Burgard [51] use
various object attributes and spatial relations learned from real world data to compute a
belief over the target object location and then plan for the next location to be visited by
the robot in an unknown environment. Kunze et al. [52] show how the semantic model
of a large-scale environment can be used for an efficient object search. They present a
decision-theoretic approach where the robot considers the utility of each search location
to decide where to go next.
Manipulation-based Search: This work is most closely related to the more recent
work on manipulation-based search where the target object is occluded from view by
other objects in front of it so that manipulation of these occlusions becomes necessary to
locate the target. The focus is then on planning a sequence of object manipulations
that would reveal the target. Wong et al. [41] consider the search environment to consist
of a number of independent containers. They define co-occurrences as the likelihood of
different object types being in the same container and use this information to generate the
contents of a container based on the robot's observations. The robot then looks for the
target object using the co-occurrence and spatial constraints based on the target object's
size, removing occluding objects from the scene permanently. However, they assume
that the object types and their models are known. Dogar et al. [53] consider a tabletop
environment instead and focus on the problem of gaining visibility and accessibility to the
target object. They consider objects with known geometries and poses in their setup and
calculate the order of permanent object removal from the scene by analyzing the occluded
volume and the expected volume revealed. They also give optimality guarantees for their
greedy search in some special scenarios.
Moldovan and De Raedt [54] use co-occurrences to search for an object that affords a
certain action where the environment may contain a number of objects that afford that
action. They learn an object affordance model mapping object properties to affordances
by using a database of objects and a fixed (but extensible) library of affordances. The
co-occurrence model is learned using images of shelves from Google Images. At each step
of the search, their planner outputs which shelf is most likely to contain the desired object
but not where that object is likely to be in that shelf. Therefore, they remove all the
visible objects (in no particular sequence) from the shelf most likely to contain the target
object to reveal more of the shelf. Since their focus is not on deciding the sequence of
actions in which the occluding objects should be manipulated to reveal the target object
quickly, their work is closer to active visual search than to manipulation-based search.
In the previous chapter, we have shown how a cluttered environment could be searched
for a target without prior knowledge of the number of objects in the environment, the
object models, and the co-occurrences under the constraint that the objects can only be
rearranged (and not permanently removed) in the world. Here, we extend our previous
work by adding context to the search environment. We assume that all objects can be
categorized into one of a finite number of categories, and that we know the spatial co-
occurrence relationships among various correlated object categories. There has been some
prior work on learning the object-object and object-scene co-occurrences using data from
supermarkets [51], tagged images on the Web [55], or simulated and real kitchens [54], [56].
However, we do not focus on learning these co-occurrences in this work.
Figure 4.3: An example grid world: top view
We present an algorithm for object search using categorization and manipulation of
occluding objects with these co-occurrences as a guide. We also show how the Amazon
Mechanical Turk service could be used for perfect object recognition and categorization,
thus abstracting out the problem of training an object detector and making our algorithm
extensible to unknown environments and novel objects. Our planner uses a library of
manipulation primitives such as rearrangement, temporary removal, and replacement of
an object that lend themselves naturally to cluttered environments.
4.3 Problem Formulation
We consider a world discretized into a uniform grid with a known size. Each cell can
contain at most one object and any object can occupy only a single cell (See Fig. 4.3 for
an example). The robot looks at the world from the front and at any time, sees only the
frontmost layer of objects. Objects in the deeper layers are hidden from the robot's view
if they have another object in front of them. We assume that objects are not in contact with
each other and that the robot has a fixed point of view.
4.3.1 Modeling the Object Placements on the Grid
In this chapter, we will consider a co-occurrence, ψ, between two object categories to
represent how likely they are to be placed at a certain distance from each other. Thus,
each ψ will be a function of the object categories and the distance between them, and can
be more clearly expressed as ψ(d_ab, c_a, c_b), where d_ab is the distance between two objects
a and b, and c_a and c_b are their respective categories. Intuitively, for two closely placed
objects, ψ would be high if they belong to the same category, medium for correlated
categories, and low for uncorrelated categories. For objects that are neither too far nor
too close to each other, ψ will be high for correlated categories, and medium for
uncorrelated categories and the same category. And finally, for two objects placed far
from each other, ψ will be high for uncorrelated categories, medium for correlated, and
low for the same category.
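A toy version of such a co-occurrence potential is sketched below. The specific distance bands (near and far cut-offs) and the high/medium/low values are invented for illustration; in our setting these values are assumed to be given rather than constructed this way.

# Hedged sketch of a co-occurrence potential psi(d, c_a, c_b). The distance
# bands and the numeric values are illustrative assumptions only.

SAME, CORRELATED, UNRELATED = "same", "correlated", "unrelated"

def relation(c_a, c_b, correlated_pairs):
    if c_a == c_b:
        return SAME
    if frozenset((c_a, c_b)) in correlated_pairs:
        return CORRELATED
    return UNRELATED

def psi(d, c_a, c_b, correlated_pairs, near=0.15, far=0.60):
    rel = relation(c_a, c_b, correlated_pairs)
    if d < near:                       # closely placed objects
        return {SAME: 0.8, CORRELATED: 0.5, UNRELATED: 0.2}[rel]
    if d < far:                        # neither too close nor too far
        return {SAME: 0.5, CORRELATED: 0.8, UNRELATED: 0.5}[rel]
    return {SAME: 0.2, CORRELATED: 0.5, UNRELATED: 0.8}[rel]   # far apart

correlated_pairs = {frozenset(("mugs and glasses", "plates and bowls"))}
print(psi(0.10, "sauces", "sauces", correlated_pairs))                        # high
print(psi(0.30, "mugs and glasses", "plates and bowls", correlated_pairs))    # high
print(psi(0.10, "vegetables", "condiments", correlated_pairs))                # low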
Thus, for each pair of objects, there are dependencies between their categories
and the distance between them. We model these spatial co-occurrence relations as a
Markov Random Field (MRF), a popular framework for modeling dependencies between
random variables. Let us begin by considering a grid world with m cells. Let
I = \{1, 2, \ldots, m\} be the set of cell indices. Define a random variable X_i for every cell
i \in I. X_i represents the category of the object present in cell i before any actions are
taken and can take on values in C = \{0, 1, \ldots, N_c\}, where N_c is the number of categories.
X_i = 0 implies that cell i is unoccupied. Now consider a fully-connected undirected
graph G = (V, E) where V is the set of these m random variables and E is the set of edges
between each pair of these random variables. G represents an MRF over the variables in
V. Fig. 4.4a and Fig. 4.4b show a simple 2 × 2 grid and the corresponding G respectively.
Define a multivariate random variable X = \{X_v\}_{v=1}^{m}. A joint assignment of these
variables, \tilde{x} = (x_v)_{v=1}^{m}, represents a state of the world. We will use P(x_1, x_2, \ldots, x_m)
to represent the probability of a world state. Using a pairwise Markov Random Field
assumption, the joint probability P(x_1, x_2, \ldots, x_m) can be represented as a product of
pairwise potentials over the edges in E:

P(x_1, x_2, \ldots, x_m) = \frac{1}{Z} \prod_{(i,j) \in E} \psi_{ij} = \frac{1}{Z} \prod_{(i,j) \in E} \psi(d_{ij}, x_i, x_j)    (4.1)

where Z is the normalization factor, d_{ij} is the distance between cells i and j, and we have
used the co-occurrence constraint ψ_{ij} = ψ(d_{ij}, x_i, x_j) as the factor potential over the cells
i and j. This potential represents the likelihood of categories x_i and x_j occurring together
in cells i and j respectively. We assume the values of these co-occurrences, ψ, to be given.
[Figure 4.4 panels: (a) Grid world with cells 1-4; (b) Markov Random Field over X_1, X_2, X_3, X_4; (c) Factor graph with pairwise factors ψ_12, ψ_13, ψ_14, ψ_23, ψ_24, ψ_34]
Figure 4.4: Example of a factor graph for the MRF on a 2 × 2 grid world.
This MRF can be represented as a factor graph F = (V \cup F, E) where
V = \{X_1, X_2, \ldots, X_m\} is the set of variable nodes, and F = \{ψ_1, ψ_2, \ldots, ψ_n\} (n = \binom{m}{2})
are the factor nodes, one factor for each pair of cells in the grid. Fig. 4.4c shows the
factor graph for the MRF in Fig. 4.4b. Given a grid with m cells, each joint assignment
of the variables, \{x_1, x_2, \ldots, x_m\}, represents an object placement on the grid that satisfies
the co-occurrence constraints defined by the potentials ψ.
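The sketch below spells out Eq. 4.1 for a tiny grid by brute-force enumeration: it computes the unnormalized product of pairwise potentials for every joint assignment of cell categories and normalizes to obtain a distribution over object placements. The toy psi function and the cell coordinates are assumptions of this illustration; grids of realistic size would of course require approximate inference rather than enumeration.

import itertools
import math

# Hedged sketch of Eq. 4.1 on a tiny grid: enumerate all joint assignments of
# cell categories, score each with the product of pairwise potentials psi, and
# normalize. The psi used here is a toy stand-in.

cells = [(0, 0), (0, 1), (1, 0), (1, 1)]        # 2 x 2 grid, cell coordinates
categories = [0, 1, 2]                          # 0 = empty, plus two categories

def psi(d, c_a, c_b):
    if c_a == 0 or c_b == 0:                    # empty cells are uninformative
        return 1.0
    same = (c_a == c_b)
    return (0.9 if same else 0.3) if d <= 1.0 else (0.3 if same else 0.9)

def dist(i, j):
    (r1, c1), (r2, c2) = cells[i], cells[j]
    return math.hypot(r1 - r2, c1 - c2)

def unnormalized(assignment):
    score = 1.0
    for i, j in itertools.combinations(range(len(cells)), 2):
        score *= psi(dist(i, j), assignment[i], assignment[j])
    return score

assignments = list(itertools.product(categories, repeat=len(cells)))
scores = [unnormalized(a) for a in assignments]
Z = sum(scores)                                  # normalization factor in Eq. 4.1
probs = {a: s / Z for a, s in zip(assignments, scores)}
print(max(probs, key=probs.get))                 # most likely joint placement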
4.4 Contextual Planner
This problem could be framed as a Markov Decision Process (MDP), which is a commonly
used framework for sequential decision-making problems. However, our state space for
the MDP grows exponentially with the size of the grid (see Appendix 4.7.1). Thus,
even for very small grids and a few categories, finding an optimal solution to the MDP
becomes infeasible. To deal with larger worlds, we present a contextual planner that
decides which action to take at every step by estimating the gain of every feasible action
(or action sequence) using the information that is already available about the world. The
distribution over X represents our current belief over the state of the world and we will
denote it by b(X). We update this belief as the search proceeds and we observe more of
the world. An overview of the contextual object search pipeline is shown in Fig. 4.2. The
planner requires four inputs: the state space S, the action space A, the co-occurrences ,
and the gain criterion to evaluate the utility of each action. We will now clearly dene
each of these inputs.
Each grid cell could either be free, occupied by an object of known category, occupied
by an object of an unknown category, or its state could be unknown. These cell states
form our state space S.
The following actions are available to the planner:

1. categorize: We assume that we have access to an object recognition and
   categorization method that will accurately categorize a visible object into one of the
   specified object categories. One such system is discussed in Appendix 4.7.2.

2. move: There are three ways in which an object may be moved.

   - relocate: The robot can move a visible object to any of the visible free cells,
     thus changing its location and potentially revealing the state of one or more
     unknown state cells behind the object's original location.

   - remove: The robot can temporarily remove a visible object from the world. We
     assume that there is no place available where the robot could keep this object
     after removal and thus, it has to hold it in one of its grippers. The robot keeps
     its other gripper free to manipulate the rest of the world. This means that at
     most one object can be removed from the world at any time.

   - replace: The robot can replace a temporarily removed object by placing it on
     a visible free cell.
We assume that the search ends as soon as the target object becomes visible.
To define the gain of an action, we will introduce a random variable, Y, defined over
the location of the target object. Thus, Y can take on values in {1, ..., m}. We will use
P_Y(i) to represent the probability that cell i contains the target object. Let U be the set
of all cells whose object categories are not yet known. We assume that one of the cells
must contain the target object. Thus,

    ∑_{i∈V} P_Y(i) = 1
    ⇒ ∑_{i∈U} P_Y(i) + ∑_{i∉U} P_Y(i) = 1

The cells that have been seen at least once are known not to contain the target, hence
P_Y(i) = 0 ∀ i ∉ U. Hence,

    ∑_{i∈U} P_Y(i) = 1        (4.2)

For an unknown state cell, since the only clue we have to the location of the target
object is our belief that the cell contains the target category c*, P_Y(i) ∀ i ∈ U will be
proportional to the probability of i containing c* given our current belief over the world
state, i.e., P(X_i = c*). To satisfy Eq. 4.2, we define:

    P_Y(i) = P(X_i = c*) / ∑_{j∈U} P(X_j = c*)        ∀ i ∈ U        (4.3)
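As a small illustration (with assumed names rather than the dissertation's implementation), the normalization of Eq. 4.3 can be computed directly from per-cell marginals of the current belief; in the full system those marginals would come from inference over the MRF.

def target_location_distribution(marginals, unknown_cells, target_cat):
    # P_Y(i) from Eq. 4.3: renormalize P(X_i = c*) over the unknown-state cells.
    # marginals: dict cell -> dict category -> probability (the current belief b(X));
    # unknown_cells: the set U of cells whose object category is not yet known;
    # target_cat: the target category c*.
    scores = {i: marginals[i].get(target_cat, 0.0) for i in unknown_cells}
    total = sum(scores.values())
    if total == 0.0:                      # degenerate belief: fall back to uniform over U
        n = len(scores)
        return {i: 1.0 / n for i in scores}
    return {i: s / total for i, s in scores.items()}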
Since an action may result in one of many possible next states of the world, let us
define a set S′ that contains all the possible resultant states of the world. The entropy of
Y, H(Y), is defined as in Eq. 4.4 and we define the information gain of an action as the
expected entropy reduction in Y as a result of taking that action (Eq. 4.5).

    H(Y) = −∑_{i∈V} P_Y(i) ln P_Y(i)        (4.4)

    E[ΔH(Y)] = ∑_{x̃∈S′} p(x̃) (H(Y) − H(Y | x̃))        (4.5)

Details for the calculation of the next states and the resulting entropy are given in
Sections 4.4.1 and 4.4.2.
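The sketch below shows how Eqs. 4.4 and 4.5 combine in practice: the gain of an action is the average, over its possible next world states, of the entropy drop in the target-location distribution. Representing the outcomes as (probability, posterior) pairs is an assumption made for illustration.

import math

def entropy(p_y):
    # H(Y) over the target-location distribution (Eq. 4.4).
    return -sum(p * math.log(p) for p in p_y.values() if p > 0.0)

def expected_information_gain(p_y_now, outcomes):
    # Eq. 4.5: sum over possible next states of p(next_state) * (H(Y) - H(Y | next_state)).
    # p_y_now: dict cell -> P_Y(i) before the action;
    # outcomes: list of (probability_of_next_state, p_y_after_that_state) pairs.
    h_now = entropy(p_y_now)
    return sum(p * (h_now - entropy(p_y_after)) for p, p_y_after in outcomes)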
4.4.1 Gain of a Categorization Action
Let obj be a visible object whose category is unknown and w be the cell where obj was
originally located. We want to calculate the gain of categorizing it given our current belief
about the state of the world, b(X). Let K be the set of all cells whose object categories are
known before taking this action and c_i be the true known object category for cell i ∈ K.
Let O be the set of those cells j that are known to have X_j ≠ 0 before taking this action.
Since obj could belong to any of the N_c categories, there are N_c possible next states after
taking this action. The probability of any such next state given our current belief about
the state of the world is given by:

    P(X_w = q | b(X))
      = P(X_w = q | X_i = c_i ∀ i ∈ K, X_j ≠ 0 ∀ j ∈ O)
      = P(X_w = q, X_i = c_i ∀ i ∈ K, X_j ≠ 0 ∀ j ∈ O) / P(X_i = c_i ∀ i ∈ K, X_j ≠ 0 ∀ j ∈ O)        (4.6)

To calculate the probability in Eq. 4.6, we need to marginalize over the variables
X_u ∀ u ∉ K and X_j ∀ j ∈ O. However, exact inference is computationally expensive.
We, therefore, do discrete approximate inference using Loopy Belief Propagation
(LBP) [57], [58], [59]. LBP is a well-known technique for approximate inference in
graphical models with loops and has been seen to perform well in practice [60]. It uses
the message-passing algorithm where the i-th cell's neighbors send it their respective
estimates of its marginal, P(x_i).

Thus, at the first level of inference, we will reason about P(X_w = q | b(X)). Then, at
the second level of inference, we will update our belief on X and Y given that obj belongs
to category q (i.e., cell w would also be added to K). In this way, we can calculate the
probability of each possible next state x̃ and the conditional entropy, H(Y | x̃), associated
with it. These can then be plugged into Eq. 4.5 to calculate the gain for this categorize
action.
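For very small grids, the conditional in Eq. 4.6 can be checked by brute-force marginalization over the joint of Eq. 4.1, as in the hypothetical sketch below; it reuses the unnormalized_prob helper from the earlier sketch and stands in for the LBP inference actually used.

import itertools

def categorize_posterior(m, n_cats, psi, n_cols, known, occupied, w):
    # P(X_w = q | evidence) of Eq. 4.6 by brute-force marginalization over the MRF of Eq. 4.1.
    # known: dict cell -> category for cells in K (category already observed, 0 for known-free);
    # occupied: set of cells in O known to be occupied (X_j != 0) but not yet categorized;
    # w: the cell holding the object we are about to categorize.
    weights = {q: 0.0 for q in range(1, n_cats + 1)}
    domains = []
    for i in range(m):
        if i in known:
            domains.append([known[i]])
        elif i in occupied or i == w:
            domains.append(list(range(1, n_cats + 1)))   # occupied, category unknown
        else:
            domains.append(list(range(0, n_cats + 1)))   # anything, including free
    for x in itertools.product(*domains):
        weights[x[w]] += unnormalized_prob(x, psi, n_cols)   # helper from the earlier sketch
    total = sum(weights.values())
    return {q: v / total for q, v in weights.items()} if total > 0 else weights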
4.4.2 Gain of a Move Action
While a replace action will result in zero information gain, a relocate or a remove action
may reveal the states of one or more unknown state cells in the same column as the object
being moved. At most one of the revealed cells can be occupied. Consider an unknown
state cell i behind the object being moved, which may be free or occupied. If it is free, the
unknown cell behind it would be seen, which could also be free or occupied. Thus, there
are many possible next states that this move could result in. Consider one such possible
next state where we denote the unknown state cell revealed to be occupied as f̄ and the
set of N_f cells revealed to be free as F_r = {f_1, f_2, ..., f_{N_f}}. Define E_f̄ to be the event
that f̄ is occupied, i.e., X_f̄ ≠ 0. Also, for every f_i ∈ F_r, define E_i to be the event that
X_{f_i} = 0.
Algorithm 2 Contextual Object Search Algorithm
 1: procedure Contextual Search(Grid, target T, target category c*, co-occurrences ψ)
 2:     cat ← ∅                                  ▷ Set of categorized objects
 3:     occ ← ∅                                  ▷ Set of occupied cells
 4:     vis ← ∅                                  ▷ Set of currently visible cells
 5:     free ← ∅                                 ▷ Set of currently free cells
 6:     planning_horizon ← 1                     ▷ Number of future steps to plan for
 7:     Take observation and update occ, vis, free
 8:     while T not visible do                   ▷ Target hasn't been found
 9:         Update current belief b(X) and P_Y
10:         if no action has been taken then
11:             A ← find_categorize_actions(vis, free, occ, cat, P_Y)   ▷ Force first action to be categorize
12:         else
13:             A ← find_actions(vis, free, occ, cat, P_Y)   ▷ Find all valid move and categorize actions
14:         end if
15:         g* ← max_{a∈A} find_action_gain(a, c*, ψ, b(X), P_Y, Grid)
16:         if g* > 0 then                       ▷ Informative action found
17:             a* ← arg max_{a∈A} find_action_gain(a, c*, ψ, b(X), P_Y, Grid)
18:             Execute a*
19:             Take observation and update occ, vis, free, cat
20:             planning_horizon ← 1
21:         else                                 ▷ No informative action found
22:             planning_horizon ← planning_horizon + 1   ▷ Plan further into the future
23:             Go to line 10.
24:         end if
25:     end while
26: end procedure
Then the probability of this particular next state given our current belief of the
state of the world is given by:

    P(E_f̄, ∩_{i=1}^{N_f} E_i | b(X)) = P(E_f̄ | b(X)) ∏_{i=1}^{N_f} P(E_i | E_f̄, ∩_{j=1}^{i−1} E_j, b(X))        (4.7)

We find the probability of each next state given b(X) using Eq. 4.7, and then for each
next state x̃, we calculate the conditional entropy, H(Y | x̃). Finally, we find the gain for
the move using Eq. 4.5.
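A minimal sketch of the chain-rule computation in Eq. 4.7 follows; the conditional occupancy query is abstracted behind an assumed callable, since in the full system it would be answered by inference over the belief b(X).

def next_state_probability(free_cells, occupied_cell, cond_free_prob):
    # Chain-rule probability of one possible revealed-cells outcome of a move (Eq. 4.7).
    # free_cells: ordered list [f_1, ..., f_Nf] of cells revealed to be free;
    # occupied_cell: the cell revealed to be occupied, or None if no cell is;
    # cond_free_prob: callable(cell, revealed_so_far) -> P(cell is free | evidence so far, b(X)),
    #                 an assumed stand-in for inference over the current belief.
    revealed = []
    if occupied_cell is not None:
        p = 1.0 - cond_free_prob(occupied_cell, tuple(revealed))   # P(occupied | b(X))
        revealed.append((occupied_cell, "occupied"))
    else:
        p = 1.0
    for f in free_cells:                                           # P(E_i | earlier events, b(X))
        p *= cond_free_prob(f, tuple(revealed))
        revealed.append((f, "free"))
    return p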
With the gain for each action defined as above, we can now decide which action to take
from the available pool of actions at any planning iteration, execute that action, update
our belief based on the new observations, and repeat until the target object has been found.
However, to use contextual information correctly, we need a good prior on X to start with.
Therefore, we enforce that the first action is always object categorization. Also, whenever
P_Y(i) exceeds a certain threshold for a particular cell i, indicating that cell i has a high
probability of containing the target object, we focus on making cell i visible instead of
trying to reduce the entropy of Y further. Thus, we calculate the sequence of actions
that will reveal cell i in the fewest number of moves. Additionally, when no action has a
positive gain, we increment the planning horizon and re-plan.

The contextual object search algorithm is summarized in Algorithm 2.
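For readers who prefer code to pseudocode, the following Python skeleton mirrors the loop of Algorithm 2; every helper it calls (find_actions, action_gain, and so on) is an assumed placeholder for the corresponding step of the pipeline, not the dissertation's implementation.

def contextual_search(world, target_category, psi, find_actions, action_gain,
                      take_observation, execute, update_belief, target_visible):
    # Skeleton of Algorithm 2; all arguments other than psi are assumed helper callables.
    horizon = 1
    state = take_observation(world)                 # occ, vis, free, cat
    first_action_taken = False
    while not target_visible(state):
        belief, p_y = update_belief(state, psi, target_category)
        actions = find_actions(state, p_y, categorize_only=not first_action_taken,
                               horizon=horizon)
        gains = {a: action_gain(a, target_category, psi, belief, p_y, world) for a in actions}
        best = max(gains, key=gains.get) if gains else None
        if best is not None and gains[best] > 0:    # informative action found
            execute(best, world)
            state = take_observation(world)
            first_action_taken = True
            horizon = 1
        else:                                       # no informative action: plan further ahead
            horizon += 1
    return state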
4.5 Simulation Results
We carried out over 1000 simulations over grid worlds of varying sizes, number of
categories, and number of objects in the world. In this section, we compare the
performance of four planners:

- Random: This planner randomly selects which object to move and where, and serves
  as a baseline for comparison. The co-occurrences and the categorize action are not
  used here.

- Move-only: This planner does not have access to the co-occurrences either and only
  uses move actions to search for the target. The gain of a move here is given by the
  maximum number of unknown state cells that will be revealed by the move.

- Contextual-SC: This is the planner presented in Algorithm 2. The planner uses co-
  occurrences and both categorize and move actions to search. Categorization of at
  most one of the visible objects (the one with the maximum categorization gain) is
  allowed by this planner.

- Contextual-MC: This is a variant of Algorithm 2 where we allow categorization of
  all visible objects to establish a stronger prior.
As mentioned earlier, we assume in this chapter that the factor potentials for the MRF
are given to us. For our simulations, we designed factor potentials that resulted in
environments with clustered groups of correlated categories. This is what we expect real-
world environments to look like. We then used Gibbs sampling to sample from the
high-dimensional distribution given in Eq. 4.1. Each such sample represented a
simulated grid world. For each sample, we ran a simulation with a different hidden
object as the target each time.
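A minimal Gibbs sampler over the pairwise MRF of Eq. 4.1 might look as follows (illustrative only; it reuses the cell_distance and toy_psi helpers sketched earlier, and the sweep count is an arbitrary choice).

import random

def gibbs_sample_world(m, n_cats, psi, n_cols, sweeps=200, seed=0):
    # Draw one grid world from the MRF of Eq. 4.1 by Gibbs sampling: repeatedly resample
    # each cell's category from its conditional given all other cells, which for a pairwise
    # MRF only involves the potentials touching that cell.
    rng = random.Random(seed)
    x = [rng.randrange(0, n_cats + 1) for _ in range(m)]      # random initial assignment
    for _ in range(sweeps):
        for i in range(m):
            weights = []
            for cat in range(0, n_cats + 1):                  # 0 = free cell
                w = 1.0
                for j in range(m):
                    if j != i:
                        w *= psi(cell_distance(i, j, n_cols), cat, x[j])
                weights.append(w)
            total = sum(weights)
            if total == 0.0:
                continue                                      # keep the old value for this cell
            r, acc = rng.random() * total, 0.0
            for cat, w in enumerate(weights):
                acc += w
                if r <= acc:
                    x[i] = cat
                    break
    return x

# Example: sample a 4 x 5 world with 5 categories using the toy potential from earlier.
# print(gibbs_sample_world(20, 5, toy_psi, n_cols=5))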
Fig. 4.5 shows all the steps of an example object search on a 4 × 5 grid for the move-only
and the contextual planners. Fig. 4.5a shows a top view of the ground truth. There are
5 object categories present in this world: `Mugs & Glasses', `Bowls', `Plates', `Breakfast
Food', and `Breakfast Drinks'. The cell marked in green contains the target object (a
pack of tea bags) belonging to the category `Breakfast Drinks'. The first 3 categories are
correlated to each other and the last two categories are also correlated; however, these two
groups of correlated categories are uncorrelated. Fig. 4.5b shows what the robot sees in
the beginning. The cells whose states are unknown to the robot because they have never
been seen are shown in black. The cells currently visible to the robot are shown in white.
Grey cells represent those that are known to be free or occupied but are currently hidden
from view.

The subsequent figures show the sequence of actions taken by the different planners.
The move-only planner needs 4 moves as shown in Fig. 4.5(c)-(f), the contextual-SC
planner needs 2 (Fig. 4.5(g)-(i)), and the contextual-MC planner needs only one move
(Fig. 4.5(j)-(k)) to find the target.
Figure 4.5: An example run of the different planners for object search on a 4 × 5 grid
with 5 categories of objects (`Mugs & Glasses', `Bowls', `Plates', `Breakfast Food', and
`Breakfast Drinks'). (a) is the ground truth and (b) is what the robot sees. The cells whose
state is unknown to the robot because they have never been seen are shown in black. The
cells currently visible are shown in white. Grey cells represent those that are known to be
free or occupied but are currently hidden from view. The target object (pack of tea bags)
is marked in green in (a). The sequence of actions taken by the move-only planner is shown
in (c)-(f), (g)-(i) show the plan generated by the contextual-SC planner, and (j)-(k) show
the contextual-MC plan.
We can see that since the move-only planner takes decisions based only on the number
of unknown state cells (shown in black) behind an object, it chooses to move an object
that is quite far from the target object. On the other hand, the contextual-SC planner
first chooses to categorize an object (the red mug in this case) and realizes that since the
category of the red mug (`Mugs & Glasses') is uncorrelated to the target object category
(`Breakfast Drinks'), it should be looking in a different area of the world. It, therefore,
decides to move an object located far away from the red mug. Now that it is searching
in the correct area, it manages to find the target object in just 2 moves. Finally, the
contextual-MC planner categorizes all the five visible objects and acquires a stronger prior
on the location of the target object. It then reveals the target object in a single move.
Fig. 4.6(a) and Fig. 4.6(b) compare the average performance of the different planners as
the depth of the target object and the number of objects in the environment change,
respectively. We define the depth of an object as the number of objects in front of it at
the beginning of the search. Thus, if the target object is at a depth d, a minimum of
d moves will be needed by any planner to reveal the target object. These results are
from running each planner on 210 different searches. From both these figures, it is easy
to see that the random planner is always the worst, which is not surprising. We also see
that the contextual planners consistently outperform the move-only planner by locating
the target in fewer moves. Moreover, we found that an average of just one
categorization was needed by the contextual planners across all the tests. This is encouraging
because it implies that just one categorization is good enough to set an informative prior
for the rest of the search. Thus, even if categorization is done on the cloud (as suggested in
Appendix 4.7.2), which incurs a cost in the form of latency and a fee for using the online
service, the overall cost will be small.
Let us take a closer look at the performance of the planners using the plots in Fig. 4.7.
Fig. 4.7(a) and Fig. 4.7(b) compare the variation in the number of moves required to find
the target over all samples by the four planners with respect to target depth and the total
number of objects in the world respectively. It is easy to see that the performance of the
random planner is inconsistent and unpredictable. It also requires the maximum number
of moves in most of the cases. We will only compare the remaining three planners from
here on.
Paired t-test                          Target Depth               Number of objects
                                       1        2        3        14       15       16
Move-only & Contextual-SC              0.1      1e-05    0.17     0.002    1.7e-03  0.03
Move-only & Contextual-MC              7e-05    8e-09    3e-05    3e-05    5e-08    0.04
Contextual-SC & Contextual-MC          4e-05    9e-06    7e-05    0.002    1.7e-03  0.03

Table 4.1: p-values for paired t-tests between each pair of planners
Notice that not only the average but also the median number of moves for the contextual
planners is lower than that for the move-only planner. In addition, the spread is more
tightly packed around the median for the contextual planners, particularly the contextual-
MC planner. Clearly, the categorization of all visible objects at once in the first planning
iteration of the contextual-MC planner sets up a very informative prior on X and thus
leads to the target faster. We found that the contextual-MC planner performs
significantly better than the contextual-SC planner, which in turn performs significantly
better than the move-only planner. To test the statistical significance of our results, we
carried out paired t-tests between each pair of planners. A paired t-test is a method of
statistical hypothesis testing used to determine if two sets of data are significantly different
from each other when the data consists of a sample of matched pairs. The p-values for the
paired t-tests between each pair of planners are reported in Table 4.1. The p-value represents
the probability of the observed result having occurred by chance alone.
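For reference, such a test is a one-liner with SciPy; the move counts below are made-up placeholders, not data from our experiments.

# Illustrative only: paired t-test on matched per-search move counts for two planners.
from scipy import stats

moves_move_only  = [6, 9, 4, 7, 11, 5, 8, 10]   # moves per search, move-only planner (placeholder data)
moves_contextual = [3, 5, 4, 4,  6, 3, 5,  6]   # same searches, contextual planner (placeholder data)

t_stat, p_value = stats.ttest_rel(moves_move_only, moves_contextual)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")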
4.6 Conclusion
We presented an algorithm for object search in typical home and office environments that
are cluttered with a variety of objects. In such environments, it is often necessary to move
occluding objects before the target object can be seen and retrieved. Since manipulation is
typically expensive on a robot, our goal is to reveal the target in as few moves as possible.
Figure 4.6: Comparison of the average performance of the different planners with varying
target depth and the total number of objects in the scene: (a) average number of moves
as the target depth increases; (b) average number of moves as the total number of objects
in the environment increases.

Figure 4.7: A more detailed comparison of the performance of the different planners with
target depth and total number of objects: (a) number of moves vs. target depth; (b)
number of moves vs. number of objects.
We exploited the fact that we tend to follow certain organizational principles while storing
objects: objects that serve a similar purpose are often kept together. We showed that
this inherent contextual structure in the spatial placement of objects, or co-occurrences,
can serve as a tool to guide our search. Our contextual search algorithm uses these co-
occurrences to decide at each step whether to move an object or categorize it. Using
simulations, we showed that search using context requires significantly fewer moves to
locate the target as compared to a purely manipulation-based search without
the knowledge of co-occurrences. Our simulations also suggest that context especially
helps when the search environment is densely packed and the target object is occluded by
multiple layers of objects in front of it.
4.7 Appendix
4.7.1 Markov Decision Process Formulation
Any MDP has four components: the state space S, the action space A, the transition
model T, and the rewards R. Given these four components, the solution to the MDP
gives an optimal policy that defines which action should be taken at each state to obtain
the maximum expected reward until one of the goal states is reached. Let there be m cells
in the grid and N_c categories of objects. Let T be the target object belonging to category
c* ∈ {1, 2, ..., N_c}. For each cell i ∈ {1, ..., m}, the cell state
x_i ∈ {−1, 0, 1, ..., m, m+1, ..., (N_c + 1)m}, where

- x_i = −1 represents that the cell state is unknown,
- x_i = 0 represents that the cell is free,
- x_i = j where j ∈ {1, ..., m} specifies that cell i is occupied by an object whose
  category is unknown and that initially occupied cell j,
- x_i = mc + j where j ∈ {1, ..., m} and c ∈ {1, ..., N_c} specifies that cell i is occupied
  by an object of category c that initially occupied cell j.

As the grid has m cells, the state of the world is an m-dimensional vector with the
states of all cells stacked together: x̃ = (x_1, ..., x_m). However, some combinations are
invalid and hence appear with zero probability. Let us estimate the size of state space,
|S|, excluding these invalid states. As the states of any two cells on different columns of
the grid are independent of each other, we only consider the invalid combinations on the
same column. Let r be the number of cells on a column and c_r denote the number of
valid r-dimensional vectors, (x_1, ..., x_r). We now do a recursive analysis of c_r and derive
its upper bound. For this, let us list all possible states of the last cell in the column, x_r,
and count the possible valid r-dimensional vectors for each case:

- When x_r = −1 (the state of the last cell in the column is unknown), then x_{r−1} cannot
  be 0. There are a total of c_{r−1} r-dimensional vectors with x_r = −1, out of which c_{r−2}
  of them have x_{r−1} = 0. Thus, there will be c_{r−1} − c_{r−2} valid r-dimensional vectors
  with x_r = −1.

- When x_r = 0 (the last cell is known to be free), x_{r−1} = −1 is not possible. Following
  the same logic as before, there will be c_{r−1} − c_{r−2} valid r-dimensional vectors with
  x_r = 0.

- When x_r > 0 (the last cell is known to be occupied), x_{r−1} = −1 is again not possible.
  In this case, there will be m(N_c + 1)(c_{r−1} − c_{r−2}) valid vectors.

To sum up all possible combinations, we get the recursive formula as:

    c_r = ((N_c + 1)m + 2)(c_{r−1} − c_{r−2}) = O(m N_c) c_{r−1}.

Similarly, c_{r−1} follows the same derivation except for the occupied case, where a cell i can
now take only (N_c + 1)(m − 1) possible values when x_i > 0. That is,

    c_{r−1} = ((N_c + 1)(m − 1) + 2)(c_{r−2} − c_{r−3}) = O((m − 1) N_c) c_{r−2}.

Given c_1 = (m − r + 1) N_c + 1, we have

    c_r = O(m (m − 1) ⋯ (m − r + 1) N_c^r).

The remaining m/r columns follow the same analysis. Finally, the number of all possible
grid states is the multiplication over all possible combinations of each column. That is,

    |S| = O(m! N_c^m).
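A quick back-of-the-envelope check of this bound (illustrative only) shows why solving the MDP exactly is hopeless even for modest grids.

import math

def mdp_state_space_bound(m, n_c):
    # Order-of-magnitude bound |S| = O(m! * N_c^m) on the number of grid states.
    return math.factorial(m) * n_c ** m

for m, n_c in [(4, 3), (6, 3), (9, 5), (20, 5)]:
    print(f"m = {m:2d}, N_c = {n_c}:  ~{mdp_state_space_bound(m, n_c):.2e} states")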
One of two actions can be applied to any object that is visible to the robot: 1)
categorize it if its category is unknown, or 2) move it from one location to another. Let A_cat
and A_move denote the set of all such feasible categorize and move actions respectively. A
categorize action chooses one of the unknown objects to recognize. That is, |A_cat| = O(m).
On the other hand, a move action relocates, removes, or replaces an object, i.e., |A_move| =
O(m^2). Thus, |A| = |A_cat ∪ A_move| = O(m^2). Our state space, S, grows exponentially
with the size of the grid, m. Thus, even for very small grids and a few categories, finding
a solution to the MDP becomes infeasible.
4.7.2 Category Recognition using the Cloud
In order to identify the category of an object in the world accurately, we suggest the use
of the human workforce through Amazon Mechanical Turk (MTurk) [45]. MTurk is a
crowdsourcing Internet marketplace that enables any person or business (Requester) to
create Human Intelligence Tasks (HITs) that a computer currently cannot do reliably.
These HITs are then available to registered MTurk Workers (typically called Turkers) to
browse and complete anonymously for a monetary payment as low as $0.01 for each HIT.
MTurk also allows the Requesters to make their HITs available only to those Turkers who
have done well in the past. Such Turkers are known as Masters.
In our object categorization HITs, the Turkers are given the object image and a list
of 10 categories to choose from. One such HIT is shown in Fig. 4.8. The use of MTurk
for categorization allows us to do away with the need to train any object detectors or
classifiers, making our algorithm robust, accurate, and extensible to novel objects and
environments. We used Categorization Masters for our HITs and obtained 100% accurate
results. Another way to ensure accuracy is to obtain responses from multiple Masters and
then take a majority vote [61].
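A majority vote over redundant HIT responses is straightforward to implement; the sketch below is illustrative and assumes each response is simply the category string chosen by a Turker.

from collections import Counter

def majority_category(responses):
    # Aggregate several Turkers' labels for the same object by majority vote;
    # returns the winning category and the fraction of responses that agreed with it.
    counts = Counter(responses)
    category, votes = counts.most_common(1)[0]
    return category, votes / len(responses)

# Example: three Masters label the same image.
print(majority_category(["Breakfast Drinks", "Breakfast Drinks", "Breakfast Food"]))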
Since using the cloud for object categorization introduces latency in the search, we
performed some preliminary analysis to see if the MTurk response time is affected by
the reward paid to the Worker. The response times for various reward values per HIT
on Amazon's Mechanical Turk are reported in Fig. 4.9. We see that the median response
time, shown by the dotted line, roughly decreases as the reward is increased. We obtained
the minimum median response time for a reward of $0.06 per HIT.
Figure 4.8: A sample Human Intelligence task (HIT) for object categorization on Amazon's
Mechanical Turk
Figure 4.9: Scatter plot of the response time vs. reward (5 trials for each reward) for
an object categorization HIT (Human Intelligence task) on Amazon's Mechanical Turk;
the reward per HIT ranges from $0.01 to $0.08 and the response time per HIT is measured
in minutes. The dotted line shows the median response time for each reward amount. The
median response time roughly goes down as the reward is increased.
Chapter 5
Conclusion
This thesis explored the idea of manipulation as an aid to perception, and consequently,
to the overall robotic manipulation task in cluttered environments. The environments
being dealt with are bounded and small (e.g., a kitchen shelf); however, clutter renders
them partially observable. This inherent uncertainty in the world's state forces the robot
to adopt an observe-plan-act strategy where perception, planning, and execution have
to be interleaved since execution of an action may result in revealing information about
the world that was unknown hitherto, and hence a new plan needs to be generated.
Since manipulation is typically expensive on a robot, our goal is to reduce the number
of object manipulations required to complete the desired task. We studied its utility in
three contexts and presented planning algorithms that generate a sequence of actions to
manipulate the world and complete the desired task efficiently.
5.1 Summary of Contributions
The contributions of this thesis are in three areas:
1. Object Sorting
Given a pile of small objects of a similar type on a tabletop, the objects have to be
sorted by color or size. We presented a robust pipeline that combines manipulation-
aided perception and grasping to achieve reliable and accurate sorting. We validated
its feasibility through extensive experiments on the PR2 robot, and demonstrated
that manipulation-aided sorting becomes increasingly useful as the clutter increases.
2. Environment Exploration
A sequence of rearrangements of objects in a small and bounded cluttered
environment has to be planned to explore it and potentially search for a target
object. We presented an adaptive look-ahead algorithm for exploration by
prehensile and non-prehensile manipulation of the objects in the world. We then
used it for object search in the real world using the PR2 robot.
3. Contextual Object Search
Given certain object-object co-occurrence relations and a target object, a sequence
of actions has to be planned to search for the target eciently. We presented an
algorithm that uses context to guide the object search and results in fewer
manipulations than a purely manipulation-based search without the use of any
context.
This thesis shows that intelligent manipulation of clutter to aid perception improves
the efficiency of robotic tasks in everyday environments like households and offices, and
could serve as an important tool in the field of personal robots. The focus was on using
simple manipulation primitives to declutter the world, making task completion easier
and faster using the environment's local structure or context.
through simulations and real-world experiments on the PR2 robot (using various metrics
like planning time and number of actions required) indicates that purposeful manipulation
of clutter to aid perception becomes increasingly useful (and essential) as the clutter in
the environment increases.
BIBLIOGRAPHY
[1] B. Cohen, S. Chitta, and M. Likhachev, "Search-based planning for manipulation with motion primitives," in Proc. of the IEEE Intl. Conference on Robotics and Automation, 2010, pp. 2902–2908.

[2] H. Jang, H. Moradi, P. Le Minh, S. Lee, and J. Han, "Visibility-based spatial reasoning for object manipulation in cluttered environments," Computer-Aided Design, vol. 40, no. 4, pp. 422–438, 2008.

[3] Y. Hirano, K. Kitahama, and S. Yoshizawa, "Image-based object recognition and dexterous hand/arm motion planning using RRTs for grasping in cluttered scene," in Proc. of the IEEE/RSJ Intl. Conference on Intelligent Robots and Systems, 2005, pp. 2041–2046.

[4] M. Dogar and S. Srinivasa, "Push-grasping with dexterous hands: Mechanics and a method," in Proc. of the IEEE/RSJ Intl. Conference on Intelligent Robots and Systems, 2010, pp. 2123–2130.

[5] C. Thorpe, J. Carlson, D. Duggins, J. Gowdy, R. MacLachlan, C. Mertz, A. Suppe, and B. Wang, "Safe robot driving in cluttered environments," Robotics Research, pp. 271–280, 2005.

[6] S. Gould, P. Baumstarck, M. Quigley, A. Ng, and D. Koller, "Integrating visual and range data for robotic object detection," in ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008.

[7] P. Forssén, D. Meger, K. Lai, S. Helmer, J. Little, and D. Lowe, "Informed visual search: Combining attention and object recognition," in Proc. of the IEEE Intl. Conference on Robotics and Automation, 2008, pp. 935–942.

[8] D. Kragic, M. Björkman, H. Christensen, and J. Eklundh, "Vision for robotic object manipulation in domestic settings," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 85–100, 2005.

[9] P. Fitzpatrick, "First contact: an active vision approach to segmentation," in Proc. of the IEEE/RSJ Intl. Conference on Intelligent Robots and Systems, vol. 3, 2003, pp. 2161–2166.

[10] W. Li and L. Kleeman, "Segmentation and modeling of visually symmetric objects by robot actions," The Intl. Journal of Robotics Research, vol. 30, no. 9, pp. 1124–1142, 2011.
[11] E. Kuzmic and A. Ude, "Object segmentation and learning through feature grouping and manipulation," in 10th IEEE-RAS Intl. Conference on Humanoid Robots (Humanoids), 2010, pp. 371–378.

[12] D. Schiebener, A. Ude, J. Morimoto, T. Asfour, and R. Dillmann, "Segmentation and learning of unknown objects through physical interaction," in 11th IEEE-RAS Intl. Conference on Humanoid Robots (Humanoids), 2011, pp. 500–506.

[13] D. Katz and O. Brock, "Manipulating articulated objects with interactive perception," in IEEE International Conference on Robotics and Automation (ICRA), 2008, pp. 272–277.

[14] L. Chang, J. Smith, and D. Fox, "Interactive singulation of objects from a pile," in IEEE International Conference on Robotics and Automation (ICRA), May 2012, pp. 3875–3882.

[15] D. Katz, M. Kazemi, J. A. Bagnell, and A. Stentz, "Clearing a pile of unknown objects using interactive perception," in IEEE International Conference on Robotics and Automation (ICRA), May 2013, pp. 154–161.

[16] A. Ambler, H. G. Barrow, C. M. Brown, R. M. Burstall, and R. J. Popplestone, "A versatile system for computer-controlled assembly," Artificial Intelligence, vol. 6, no. 2, pp. 129–156, 1975.

[17] J. L. Barry, "Manipulation with diverse actions," Ph.D. dissertation, Massachusetts Institute of Technology, 2013.

[18] K. Bohringer, K. Goldberg, M. Cohn, R. Howe, and A. Pisano, "Parallel microassembly with electrostatic force fields," in Proc. of the IEEE Intl. Conference on Robotics and Automation, vol. 2, 1998, pp. 1204–1211.

[19] R. Taylor, M. Mason, and K. Goldberg, "Sensor-based manipulation planning as a game with nature," in Proc. of the 4th Intl. Symposium on Robotics Research, 1988, pp. 421–429.

[20] D. Kang and K. Goldberg, "Sorting parts by random grasping," IEEE Transactions on Robotics and Automation, vol. 11, no. 1, pp. 146–152, 1995.

[21] A. Rao and K. Goldberg, "Shape from diameter," The Intl. Journal of Robotics Research, vol. 13, no. 1, p. 16, 1994.

[22] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, Y. Taguchi, T. K. Marks, and R. Chellappa, "Fast object localization and pose estimation in heavy clutter for robotic bin picking," The International Journal of Robotics Research, vol. 31, no. 8, pp. 951–973, 2012.

[23] A. Pochyly, T. Kubela, M. Kozak, and P. Cihak, "Robotic vision for bin-picking applications of various objects," in 41st International Symposium on Robotics (ISR) and 6th German Conference on Robotics (ROBOTIK), 2010, pp. 1–5.

[24] S. Fuchs, S. Haddadin, M. Keller, S. Parusel, A. Kolb, and M. Suppa, "Cooperative bin-picking with time-of-flight camera and impedance controlled DLR lightweight robot III," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010, pp. 4862–4867.
[25] M. Nieuwenhuisen, D. Droeschel, D. Holz, J. Stückler, A. Berner, J. Li, R. Klein, and S. Behnke, "Mobile bin picking with an anthropomorphic service robot," in IEEE International Conference on Robotics and Automation (ICRA), 2013.

[26] "Robot Operating System (ROS)." [Online]. Available: http://ros.org/wiki

[27] R. Rusu, S. Cousins, and W. Garage, "3D is here: Point Cloud Library (PCL)," in Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 2011.

[28] R. B. Rusu, "Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments," Ph.D. dissertation, Computer Science Department, Technische Universität München, Germany, October 2009.

[29] L. Chang, J. Smith, and D. Fox, "Interactive singulation of objects from a pile," in Proc. of the IEEE/RSJ Intl. Conference on Intelligent Robots and Systems, 2011.

[30] A. Aydemir, M. Göbelbecker, A. Pronobis, K. Sjöö, and P. Jensfelt, "Plan-based Object Search and Exploration Using Semantic Spatial Knowledge in the Real World," in Proc. of the European Conference on Mobile Robotics, 2011.

[31] Y. Ye and J. Tsotsos, "Sensor planning for 3D object search," Computer Vision and Image Understanding, vol. 73, no. 2, pp. 145–168, 1999.

[32] J. Ma, T. Chung, and J. Burdick, "A probabilistic framework for object search with 6-DOF pose estimation," The Intl. Journal of Robotics Research, vol. 30, no. 10, pp. 1209–1228, 2011.

[33] T. Parsons, "Pursuit-evasion in a graph," Theory and Applications of Graphs, pp. 426–441, 1978.

[34] I. Suzuki and M. Yamashita, "Searching for a mobile intruder in a polygonal region," SIAM Journal on Computing, vol. 21, no. 5, pp. 863–888, 1992.

[35] B. Gerkey, S. Thrun, and G. Gordon, "Visibility-based pursuit-evasion with limited field of view," The Intl. Journal of Robotics Research, vol. 25, no. 4, pp. 299–315, 2006.

[36] V. Kumar, D. Rus, and S. Singh, "Robot and sensor networks for first responders," Pervasive Computing, vol. 3, no. 4, pp. 24–33, 2004.

[37] D. Calisi, A. Farinelli, L. Iocchi, and D. Nardi, "Multi-objective exploration and search for autonomous rescue robots," Journal of Field Robotics, vol. 24, no. 8-9, pp. 763–777, 2007.

[38] S. Koenig and M. Likhachev, "D* Lite," in Proceedings of the National Conference on Artificial Intelligence, AAAI Press / MIT Press, 2002, pp. 476–483.

[39] C. Tovey, S. Greenberg, and S. Koenig, "Improved analysis of D*," in IEEE International Conference on Robotics and Automation, vol. 3, 2003, pp. 3371–3378.
[40] M. Dogar and S. Srinivasa, "A framework for push-grasping in clutter," in Robotics: Science and Systems, 2011.

[41] L. L. S. Wong, L. P. Kaelbling, and T. Lozano-Pérez, "Manipulation-based Active Search for Occluded Objects," in Proc. of the IEEE Intl. Conference on Robotics and Automation, 2013.

[42] M. R. Dogar, M. C. Koval, A. Tallavajhula, and S. S. Srinivasa, "Object Search by Manipulation," in Proc. of the IEEE Intl. Conference on Robotics and Automation, 2013.

[43] W. Garage, "Overview of the PR2 robot," 2009.

[44] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Autonomous Robots, 2013. Software available at http://octomap.github.com.

[45] "Amazon Mechanical Turk." [Online]. Available: http://www.mturk.com/

[46] D. G. López, K. Sjöö, C. Paul, and P. Jensfelt, "Hybrid laser and vision based object search and localization," in IEEE International Conference on Robotics and Automation, 2008, pp. 2636–2643.

[47] T. Kollar and N. Roy, "Utilizing object-object and object-scene context when planning to find things," in IEEE International Conference on Robotics and Automation (ICRA), 2009, pp. 2168–2173.

[48] M. Samadi, T. Kollar, and M. M. Veloso, "Using the Web to Interactively Learn to Find Objects," in 26th AAAI Conference on Artificial Intelligence, 2012.

[49] A. Aydemir, K. Sjöö, J. Folkesson, A. Pronobis, and P. Jensfelt, "Search in the real world: Active visual object search based on spatial relations," in IEEE International Conference on Robotics and Automation (ICRA), 2011, pp. 2818–2824.

[50] T. D. Garvey, "Perceptual strategies for purposive vision," Ph.D. dissertation, Stanford University, 1976.

[51] D. Joho and W. Burgard, "Searching for objects: Combining multiple cues to object locations using a maximum entropy model," in IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 723–728.

[52] L. Kunze, M. Beetz, M. Saito, H. Azuma, K. Okada, and M. Inaba, "Searching objects in large-scale indoor environments: A decision-theoretic approach," in IEEE International Conference on Robotics and Automation (ICRA), 2012, pp. 4385–4390.

[53] M. R. Dogar, M. C. Koval, A. Tallavajhula, and S. S. Srinivasa, "Object search by manipulation," Autonomous Robots, vol. 36, no. 1-2, pp. 153–167, 2014.

[54] B. Moldovan and L. De Raedt, "Occluded object search by relational affordances," in IEEE International Conference on Robotics and Automation (ICRA), June 2014.
[55] P. Chumtong, Y. Mae, K. Ohara, T. Takubo, and T. Arai, "Object search using object co-occurrence relations derived from web content mining," Intelligent Service Robotics, pp. 1–13, 2013.

[56] M. J. Schuster, D. Jain, M. Tenorth, and M. Beetz, "Learning organizational principles in human environments," in IEEE International Conference on Robotics and Automation (ICRA), 2012, pp. 3867–3874.

[57] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[58] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001.

[59] J. M. Mooij, "libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models," The Journal of Machine Learning Research, vol. 11, pp. 2169–2173, 2010.

[60] C. Berrou and A. Glavieux, "Turbo codes," Encyclopedia of Telecommunications, 2003.

[61] M. Marge, S. Banerjee, and A. I. Rudnicky, "Using the Amazon Mechanical Turk for transcription of spoken language," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 5270–5273.