PLANNING AND LEARNING FOR LONG-HORIZON COLLABORATIVE MANIPULATION
TASKS
by
Hejia Zhang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2024
Copyright 2024 Hejia Zhang
Dedication
I dedicate this thesis to my parents,
for their unconditional love and support.
Acknowledgements
These past four years have been incredible. I am deeply grateful to the friends, mentors, and colleagues
I have encountered on this journey. I have learned and grown a lot both as a person and as a researcher
during my Ph.D.
First and foremost, I am extremely grateful to my advisor, Stefanos. He taught me some of the most
important skills and mindsets for becoming a good researcher - writing, speaking, reading, formulating
research problems, and summarizing research insights, to name a few - and has been extremely patient
during this process. He has been a role-model for me as a roboticist, an academic, a mentor and a person.
Thank you for inspiring me to approach challenging research problems fearlessly, teaching me to remain
detached from the outcomes, and to focus on the most crucial actions that can be taken. Thank you for
introducing and motivating me to work on an important real-world robot application - robot-assisted
roof bolting. Through this project, I realized the importance of good mechanical design in addition to
robot algorithms. More importantly, throughout this project, I had the opportunity to spend time with
you, sharing drinks and conversations. I will always remember your lesson on the meaning of life: “The meaning of life is not to be as happy as possible, but to do good things and take responsibility.” As I begin
my next journey, I am committed to doing good and taking responsibility, both as a researcher and as a
person.
I am grateful to my former advisors for my master’s program at USC, Gaurav Sukhatme and Stefan
Schaal, and their Ph.D. students, Eric Heiden and Ryan Julian. Thank you for introducing me to the field
of robotics research and the fascinating world of robot learning.
I am grateful to Daniel Seita, who is probably the most disciplined person I have ever seen. Thank you
for bringing me into the world of foundation models for robotics. Thank you for teaching me some useful
tools to organize my research thoughts and results.
I would like to express my gratitude to the other USC professors I have collaborated with. I want to
thank Sven Koenig for providing some of the most detailed paper feedback that I have seen, despite his
responsibilities as a conference chair at the same time. I want to thank C.-C. Jay Kuo for serving as part
of my defense committee - thank you for providing your guidance and insights.
I am grateful to my comrades, the ICAROS lab members, for their friendship. I am thankful to Heramb
and Sophie, who were always there to have a discussion; to Matt, for inspiring me to combine mixed-integer
linear programming with neural networks; to Nathan, for providing insights about good user study and
human-robot interaction research.
Beyond my research, I am grateful to my brother, Xianan. He has been a role-model for me since my
childhood. As an academic, he has offered invaluable mental and academic guidance. I am also grateful to
my furry friend, Nancy, who has been a supportive companion throughout my Ph.D. journey.
Finally, I am profoundly thankful to my parents for their unwavering love and support. They have
always placed my education above all else. Without their encouragement, I would not have achieved this
much.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Imitating Human Collaborative Manipulation Plans in YouTube Videos
  2.1 Detecting collaborative manipulation plans from YouTube videos
  2.2 Evaluation
    2.2.1 Action recognition with commonsense reasoning
    2.2.2 Collaborative YouTube videos
  2.3 Discussion
  2.4 Related work
Chapter 3: Multi-Robot Geometric Task-and-Motion Planning
  3.1 Multi-Robot GTAMP problem formulation
  3.2 A planning framework for multi-robot geometric task-and-motion planning
    3.2.1 Computing collaborative manipulation information
    3.2.2 Searching for task-and-motion plans
      3.2.2.1 Key Component 1: Generating promising task skeletons
      3.2.2.2 Key Component 2: Task-skeleton grounding
  3.3 Evaluation
    3.3.1 Baselines
    3.3.2 Benchmark domains
    3.3.3 Results
  3.4 Related work
Chapter 4: Guided Object Placement Sampling for Efficient Geometric Task-and-Motion Planning
  4.1 Preliminaries
    4.1.1 Monotone GTAMP formulation
    4.1.2 Reverse-time search strategy for monotone GTAMP
    4.1.3 Object placement sampling for monotone GTAMP
  4.2 GOPS: Guided object placement sampling for GTAMP
    4.2.1 Pre-computing process
    4.2.2 Neural network-based collision checking
    4.2.3 Computing object placement cost maps
    4.2.4 Integration with GTAMP planners
  4.3 Evaluation
    4.3.1 Benchmark domains
    4.3.2 Baselines
    4.3.3 Evaluation on problems with seen object shapes
    4.3.4 Ablation studies
    4.3.5 Evaluation on problems with unseen object shapes
  4.4 Discussion
  4.5 Related work
    4.5.1 Object placement sampling in GTAMP
    4.5.2 Learning-guided collision avoidance
Chapter 5: Multi-Robot Planning for Automated Roof Bolting in Underground Mining
  5.1 Design of the automated roof bolting machine
    5.1.1 Integration of ABB IRB 1600 robotic arm
    5.1.2 Components of the automated roof bolting machine
  5.2 Multi-robot planning for roof bolting
  5.3 Two example scenarios
  5.4 Planning and execution details
  5.5 Related work
Chapter 6: Conclusion
Bibliography
List of Tables
2.1 Comparison of the proposed system with a baseline system regarding the precision, recall and number of correctly detected actions in the collaborative cooking videos.
2.2 Number of Correct Trees for Different Types of Error. A: Action Recognition Error, HO: Hand-Object Association Error, OO: Object-Object Association Error.
3.1 Comparison of the proposed method with two baseline methods in the two benchmark domains regarding the success rate, planning time, makespan and motion cost. The numbers in the names of the problem instances indicate the numbers of the goal objects and the movable objects besides the goal objects. In PA5, PA7 and PA10, each problem instance has 3 goal objects and 2 robots. We omit the planning time and solution quality results for Ap2 on PA10 and BO8 because its success rate is significantly lower than those of the other two methods.
3.2 The results of the proposed method in domain PA regarding the success rate, planning time, makespan and motion cost. The numbers in the names of the problem instances indicate the numbers of the robots.
4.1 Comparison of the proposed method with two baseline methods in the two benchmark domains regarding the success rate, planning time, and plan length. The numbers in the names of the problems indicate the numbers of the movable objects besides the goal object.
4.2 Comparison of the samplers that use different object placement cost maps in the two benchmark domains regarding the success rate, planning time, and plan length. The numbers in the names of the problems indicate the numbers of the movable objects besides the goal object.
4.3 Comparison of GOPS with UNIFORM on benchmarking problems with object shapes that are unseen in the training dataset for the neural network-based collision checking model.
List of Figures
2.1 The robots execute the action sequence shown in the video.
2.2 Segmentation of the two test videos. For each video we show the ground truth for the adult (P1-G), our result for the adult (P1-O), the ground truth for the child (P2-G) and our result for the child (P2-O). The bottom color bar shows different colors for different actions.
2.3 A Collaborative Manipulation Action Context-Free Grammar
2.4 Generated task graph for the (Rita) video. Different actions are shown in different colors using the colorbar of Fig. 2.2. P1 indicates an action executed by the woman in the video and P2 by the girl. The dotted line indicates a collaboration between the two actors.
2.5 Normalized confusion matrix of the action recognition with commonsense reasoning. The rows indicate true values, while the columns are the predictions of the action recognition module.
2.6 Example frames of the video (Rita) showing two humans cooking together and corresponding snapshots in simulation environment of two robots executing the same actions. We visualize the learned and executed action graphs on the upper right corner of each frame and snapshot.
2.7 Example frames and generated trees of 3 failure cases. The captions depict the ground-truth descriptions of each case.
2.8 Example frames, snapshots in simulation environment of two robots executing the same actions with humans and generated action trees of 5 successful cases. The captions depict the ground-truth descriptions of each successful case.
3.1 Left: Sorting colored objects into boxes of corresponding colors. Right: Moving the colored boxes to the green region. In both scenarios, white objects are only allowed to be relocated within their current regions (red). We use PyBullet (Coumans and Bai, 2016–2019) as our simulator.
3.2 Overview of the proposed framework. Fig. 3.3 provides a more detailed visualization and description of Phase 2.
3.3 Visualization of the search process in the second phase of our framework. We show the initialization stage of the search process (left) and two example search iterations (middle, right) that lead to different evaluation outcomes. Left: Blue arrows represent the workflow for initializing the search tree. Middle: Yellow arrows represent a search iteration that results in an updated set of objects to be moved and thus a new set of task skeletons to be grounded. Right: Red arrows represent a search iteration that results in an executable task-and-motion plan.
3.4 (Left) An example scenario where we want to generate task skeletons to move object M1 given an empty sequence of grounded joint actions. (Right) The corresponding collaborative manipulation task graph for moving object M1. The rounded rectangular nodes are action nodes. The circular nodes are object nodes. The red circular nodes represent objects that are specified to be moved. The yellow arrows represent action edges. The purple arrows represent block-place edges, and the blue arrow represents a block-pick edge.
4.1 Left: Bring the knife (dashed box) from the storage region (purple) to the human workspace region (red). Right: Retrieve the green bottle from the bookshelf to the storage region (purple). In both scenarios, other movable obstacles are only allowed to be relocated within their current regions.
4.2 Visualization of an object placement sampling problem instance (a), the set of failure sampled object placements (b - e) according to different feasibility requirements (Sec. 4.1.3) for object placement sampling in monotone GTAMP, and an object placement that satisfies all the feasibility requirements (c). For efficient monotone GTAMP, we also need to consider an effectiveness requirement, which will be described in Sec. 4.1.3 and Sec. 4.2.3.
4.3 Visualization of the neural network architecture of our collision checker, and a forward pass through the network with a batch of three object height maps and two robot configurations.
4.4 Visualization of the computed cost maps. The rotation space of object placement is evenly discretized to a discrete set of 16 rotations. Left: An object placement sampling problem instance (part of the shelf is omitted for better visualization), where the target is to retrieve the green object from the shelf. The yellow objects are fixed obstacles and the white objects are movable objects. During the reverse-time search, the action for moving the goal object (green) has been planned. We need to relocate the magenta object (dashed box). The robot configurations for picking and placing the goal object (green), and the robot configuration for picking the magenta object, are visualized. Middle: Visualization of the feasibility object placement cost maps and the effectiveness object placement cost map. They are composed to construct an overall object placement score map. Right: A sampled feasible and effective object placement. Instead of selecting an object placement near the upper-right corner, the sampler prioritizes a placement that has low compactness cost based on the computed compactness cost map.
4.5 Success rates of GOPS and two baselines for solving object placement sampling problems in 50, 150, and 250 trials.
5.1 A human operator is installing a bolt into the roof bolter.∗
5.2 ABB IRB 1600 Robotic Arm next to a roof bolter unit.
5.3 The plate feeder.
5.4 The bolt feeder.
5.5 The wrench system.
5.6 The roof bolting system.
5.7 Example scenario 1 in the simulation.
5.8 Generated CMTGs for example scenario 1. R1 and R2 represent the ABB robot arm and the roof bolter.
5.9 Example scenario 2 in the simulation.
5.10 Generated CMTGs for example scenario 2. R1 and R2 represent the ABB robot arm and the roof bolter.
5.11 Frames showing the execution of the generated plan for example scenario 1 in both simulation (Left) and real-world (Right).
5.12 Frames showing the execution of the generated plan for example scenario 2 in both simulation (Left) and real-world (Right).
5.13 Integration of autonomous and remote operations improves efficiency.
Abstract
How can we enable robots that operate in unstructured human environments such as homes, hospitals, and
underground mines to effectively collaborate with other agents such as robots and humans? This is a challenging problem that necessitates improvements across various aspects of robot capabilities. For instance,
it requires multiple robots to devise effective collaboration strategies for intelligent task decomposition
and allocation. Additionally, robots must comprehend human actions to offer appropriate assistance.
In this thesis, we focus on robotic collaborative manipulation tasks where robots interact not only with
their environments but also with other agents. We explore three important research questions arising
in robotic collaborative manipulation tasks: (1) enabling robots to comprehend and imitate human collaborative manipulation actions; (2) optimizing robot collaboration considering their distinct manipulation
capabilities; (3) devising efficient robot responses to environmental changes induced by humans.
One direction for addressing these research questions is to leverage recent advances in deep learning. The drawback, however, is that deep learning requires large amounts of data and can be brittle: if a model encounters a data point that is very different from those seen in the training set, it is likely to make mistakes.
Our approach is to combine deep learning-based methods with traditional symbolic reasoning. We propose multi-robot planning algorithms, neuro-symbolic task-and-motion planning algorithms, and human collaborative manipulation action representations that address these three research questions. We show that the proposed frameworks address these questions effectively and lead to many interesting future directions.
Chapter 1
Introduction
Enabling robots that operate in unstructured human environments, such as homes, hospitals, and underground mines, to effectively collaborate with other agents, such as robots and humans, can not only improve the overall efficiency of task execution but also enhance robots’ abilities to satisfy human preferences and needs.
There has been tremendous progress in enabling robots to perform complex tasks in isolation. Many
complex robotics tasks can be formalized as task-and-motion planning (TAMP) problems. TAMP is the
problem of combining task and motion planning to divide an objective, such as assembling a table, into a series of robot-executable motion trajectories. It is an extremely challenging problem, as it involves a hybrid search space, long planning horizons, a large number of objects, and expensive action feasibility checks.
Multi-robot TAMP (MR-TAMP), where multiple robots are involved and required to collaborate intelligently to perform tasks efficiently and effectively, is more challenging. To enable multiple robots to
collaborate intelligently, effective collaboration strategies have to be planned. An effective collaboration
strategy should intelligently decompose complex tasks into subtasks and assign them to different agents
wisely.
Human-aware TAMP (HA-TAMP), where humans are not just obstacles but also intelligent agents
that have their own goals and expectations about the world and the robot, is even more challenging. On
the one hand, robots have to recognize different human actions that have many variations in different
environments and manipulation tasks. On the other hand, robots have to frequently replan on-the-fly
since human actions are uncertain and often invalidate current robot plans.
To address these challenges, we first focus on the problem of robots imitating human collaborative manipulation tasks from YouTube videos (Chapter 2). Previous work has shown that the space of human
manipulation actions has a linguistic, hierarchical structure that relates actions to manipulated objects
and tools. Building upon this theory of language for action, we propose a system for understanding and
executing demonstrated action sequences from full-length, real-world cooking videos on the web. We use
temporal-spatial information of human hand trajectories to detect collaborative actions such as handover
and holding actions. We also leverage the rich semantic information encoded in language for human
manipulation action reasoning.
The proposed video imitation framework can extract high-level action sequences, such as “grasp the knife then cut,” from YouTube videos. To enable robots to execute these high-level action sequences in any environment, we then move on to study task-and-motion planning problems. We focus on an
important subclass of TAMP problems, namely geometric task-and-motion planning (GTAMP) problems.
The goal of the GTAMP problem is to move objects to goal regions in the presence of other movable objects.
This thesis first proposes multi-robot GTAMP (MR-GTAMP), which extends GTAMP problems to multi-robot domains, and presents a novel framework to address MR-GTAMP problems efficiently and effectively (Chapter 3). We then focus on the problem of efficient continuous-parameter sampling for GTAMP and present a novel framework that constructs discrete sampling distributions for continuous parameters by leveraging a learning-based collision checker and a set of convolution kernels constructed following hand-crafted heuristics (Chapter 4).
Besides our efforts on developing generalist collaborative robots, we also explore the application of modern robotics to an important underground mining operation, the roof-bolting operation (Chapter 5). We design novel mechanical components for automating the traditional roof-bolting machine. We then integrate the proposed efficient and effective MR-GTAMP framework into the decision-making loop of the roof-bolting machine. We conduct an application study on the proposed automated roof-bolting system and show that the roof-bolting task can be formulated as MR-GTAMP problems and addressed
efficiently with the proposed planning framework.
Chapter 2
Imitating Human Collaborative Manipulation Plans in YouTube Videos
In this chapter, we focus on the problem of enabling the robot to “watch” unconstrained videos on the
web, extract the action sequences shown in the videos and convert them to an executable plan that it can
perform either independently, or as part of a human-robot or robot-robot team.
Learning from online videos is hard, particularly in collaborative settings: it requires recognizing the
actions executed, together with manipulated tools and objects. In many collaborative tasks these actions
include handing objects over or holding an object for the other person to manipulate. There is a very large
variation in how the actions are performed and collaborative actions may overlap spatially and temporally (Hoffman, 2019).
On the other hand, actions exhibit a syntactic-like structure that is common in language. Yang and
Aloimonos (Yang et al., 2014; Yang et al., 2015) build upon this theoretical insight to propose a manipulation
action grammar, where symbolic information of manipulated objects and tools is parsed to interpret
action sequences from 12 YouTube video clips.
We build upon this theory of language for action to propose a system for zero-shot understanding of both single and collaborative actions from full-length YouTube videos. Our key insight is that hands contain both the spatial and temporal information of the demonstrated actions. This allows us to use hand trajectories to temporally segment full-length videos into short clips, derive hand-object and object-object associations, and
infer the demonstrated actions.
Figure 2.1: The robots execute the action sequence shown in the video.
Our system takes as input a YouTube video showing a collaborative task from start to end. We assume
that the objects in the video are annotated with labels and bounding boxes, e.g., by running a YOLOR
algorithm (Wang, Yeh, and Liao, 2021) (Fig. 2.1). We also assume a skill library that associates a detected
action with skill-specific motion primitives. The YouTube video has not been seen before and no information from the video has been included in any training set; thus, we refer to our system as a zero-shot imitation system.
We focus on cooking tasks because of the variety in manipulation actions and their importance in home
service robotics.
2.1 Detecting collaborative manipulation plans from YouTube videos
The input to the system is a full-length, previously unseen video from the web. We assume that objects
in the video are labeled and a bounding box is provided for each object, e.g., using a state-of-the-art object
detection algorithm (Wang, Yeh, and Liao, 2021). We base this assumption on the tremendous progress
of recent object-detection algorithms and the availability of large data-sets. This is the only labeled input
data provided to the system. ∗
Hand Detection. Our work is based on the insight that hands are the main driving force of manipulation
actions (Yang et al., 2014). We use OpenPose (Cao et al., 2018), which detects jointly the human body and
hands. We use the detected hands to (1) segment videos by tracking the hand trajectory, and (2) detect
which objects are manipulated at a given point in time.
Video Segmentation. We temporally segment the video into short clips using the trajectories of the detected hands as time-series data, performing a separate segmentation for each hand of each actor in the video. The trajectory of a hand is a sequence of hand positions on the image frames. We use a greedy approach (Hallac, Nystrup, and Boyd, 2018), which formulates the segmentation of a trajectory as a covariance-regularized maximum-likelihood problem of finding the segment boundaries. The resulting segmentation for a hand is a list of sub-sequences of hand positions. We then generate a new sequence
of segments for the whole video as the union of individual segments for different hands, which we will
use for action recognition. This method results in over-segmentation with some actions spanning multiple
segments, which is common in segmentation algorithms (Grundmann et al., 2010). Therefore, we merge
segments with identical action trees in the action graph generation phase.
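A minimal sketch of the whole-video segmentation step just described, assuming each hand's trajectory has already been segmented (e.g., with the greedy Gaussian segmentation of Hallac et al.); the function, variable names, and example values are illustrative, not the thesis implementation.

```python
# Sketch: combine per-hand segment boundaries into whole-video segments by
# taking the union of all boundaries, as described above.
from typing import List, Tuple

def union_of_segments(per_hand_boundaries: List[List[int]], num_frames: int) -> List[Tuple[int, int]]:
    """per_hand_boundaries: for each hand, the sorted frame indices where a new
    segment starts (the first segment implicitly starts at frame 0).
    Returns (start_frame, end_frame) tuples covering the whole video."""
    boundaries = {0, num_frames}
    for hand in per_hand_boundaries:
        boundaries.update(hand)
    ordered = sorted(boundaries)
    return list(zip(ordered[:-1], ordered[1:]))

# Hypothetical example: left hand changes behavior at frames 120 and 300,
# right hand at frames 180 and 300.
print(union_of_segments([[120, 300], [180, 300]], num_frames=450))
# -> [(0, 120), (120, 180), (180, 300), (300, 450)]
```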
Object Association. After video segmentation, we extract objects that are relevant to actions in each
segment. We do this by associating objects with hands and with other objects based on their relative
positions in the frame. We introduce a semantic hierarchy of objects, by assigning them to three classes:
tools that manipulate other objects, e.g., knife and fork, containers that can contain other objects, e.g., pot
and bowl, and ingredients, e.g., banana and lemon, that cannot contain other objects. For robustness,
we only keep the hand and object associations retained for at least a pre-defined number of consecutive
frames.
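For concreteness, a minimal sketch of this semantic hierarchy as a lookup table; the class membership below is only illustrative and depends on the labels produced by the object detector.

```python
# Sketch of the three-class semantic hierarchy used for association.
OBJECT_CLASSES = {
    "tool":       {"knife", "fork", "spoon", "spatula", "rolling-pin"},
    "container":  {"pot", "bowl", "plate", "board", "baking-dish"},
    "ingredient": {"banana", "lemon", "tomato", "onion", "meat", "flour"},
}

def object_class(label: str) -> str:
    for cls, members in OBJECT_CLASSES.items():
        if label in members:
            return cls
    return "ingredient"  # assumption: treat unknown labels as non-containers

print(object_class("knife"))  # -> tool
print(object_class("bowl"))   # -> container
```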
Ingredient-Container Association. We associate container objects with ingredients, if there is an overlap in
the bounding boxes of the two. We use containers to detect transfer of objects from one container to the
∗While the current implementation of the system works for videos of one person working independently or two persons
collaborating, an extension to three or more persons interacting in pairs is straightforward and left for future work.
Figure 2.2: Segmentation of the two test videos. For each video we show the ground truth for the adult (P1-G), our result for the adult (P1-O), the ground truth for the child (P2-G) and our result for the child (P2-O). The bottom color bar shows different colors for different actions (cut, stir, sprinkle, squeeze, heat, wrap, roll, pour, holding, handover, transfer, dip, none, unknown).
other, e.g., transferring a tomato from a bowl to a chopping board. We use the Jaccard index (Jaccard, n.d.)
between the bounding boxes of the two objects to pair an ingredient with a container that most likely
contains that ingredient.
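A small sketch of this ingredient-container pairing rule, assuming axis-aligned bounding boxes in (x_min, y_min, x_max, y_max) format; the boxes and names in the usage example are hypothetical.

```python
# Sketch: Jaccard index (intersection over union) between two boxes, and the
# rule that pairs an ingredient with the container that overlaps it the most.
def jaccard(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def best_container(ingredient_box, container_boxes):
    scores = {name: jaccard(ingredient_box, box) for name, box in container_boxes.items()}
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return name if score > 0 else None

# A tomato overlapping a bowl more than a chopping board (made-up coordinates).
print(best_container((40, 40, 60, 60), {"bowl": (30, 30, 80, 80), "board": (55, 10, 120, 70)}))
# -> "bowl"
```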
Hand-Object Association. We want to detect the objects grasped by the hands and then propagate this
association to nearby objects that can inform the action recognition. This allows us to infer which objects
are directly manipulated or used as tools to manipulate other objects. We associate detected hands with
objects whose bounding boxes overlap with the box of the hand. In the case of multiple overlaps, we
associate the hand with the object whose bounding box is the closest to the bounding box of the hand.
If the closest object is an ingredient that is contained by another object, we associate the hand with its
container instead.
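A sketch of the hand-object association rule under the same bounding-box assumptions; the helper names (associate_hand, container_of) are hypothetical and not taken from the thesis code.

```python
# Sketch: associate a hand with the closest overlapping object; if that object
# is a contained ingredient, redirect the association to its container.
def _center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def _overlaps(a, b):
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])

def associate_hand(hand_box, object_boxes, object_class, container_of):
    candidates = [(n, b) for n, b in object_boxes.items() if _overlaps(hand_box, b)]
    if not candidates:
        return None
    hx, hy = _center(hand_box)
    name, _ = min(candidates, key=lambda nb: (_center(nb[1])[0] - hx) ** 2
                                             + (_center(nb[1])[1] - hy) ** 2)
    if object_class(name) == "ingredient" and name in container_of:
        return container_of[name]   # redirect to the ingredient's container
    return name

# Hypothetical frame: the hand overlaps the spoon, not the bowl or tomato.
boxes = {"spoon": (100, 100, 140, 160), "bowl": (90, 150, 180, 220), "tomato": (110, 160, 130, 180)}
print(associate_hand((95, 90, 135, 150), boxes,
                     lambda n: {"spoon": "tool", "bowl": "container"}.get(n, "ingredient"),
                     {"tomato": "bowl"}))  # -> "spoon"
```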
Object-Object Association. For each object that has been associated with a hand, we look to associate that
object with other objects that are possibly manipulated. For instance, if a hand grasps a spoon, we wish to
see if the spoon is used to stir a pot nearby. We do this by associating objects associated with each hand
with other containers or ingredients based on the Euclidean distance between the bounding boxes of the
two objects. We don’t associate the grasped object with any tool object based on the assumption that a
tool object will not be the receiver of any action.
Action Recognition. After segmenting the video into clips and pairing objects with hands, we recognize
actions performed by humans in the videos. We have two types of actions: actions performed by a single person, which we name individual or single actions, and collaborative actions performed by a pair of humans in the video. As a special case of individual actions, we introduce transfer actions, which occur when an object moves from one container to another. This allows us to detect the transfer of an ingredient between
containers.
Individual Actions. We recognize commonsense actions (Yang et al., 2015), using a trained language model
from a recipe corpus (Marin et al., 2019). Given a set of candidate actions and a set of candidate objects, we
extract P(Object | Action) for each possible bigram consisting of one object word and one action word in the corpus. We then formulate the action recognition task in terms of approximate sampling and prediction from
the posterior distribution:
HP → H O | HP AP | HP CP | H OP | H AP    (1)
AP → A O | A OP | A HP                    (2)
CP → C HP                                 (3)
OP → O O | O OP                           (4)
H  → Hand                                 (5)
C  → Collaboration                        (6)
O  → Object                               (7)
A  → Action                               (8)
Figure 2.3: A Collaborative Manipulation Action Context-Free Grammar
$P(A \mid O_1, \ldots, O_n) \sim \prod_{k=1}^{n} P(O_k \mid A)\, P(A)$    (2.1)
where $A$ is the performed action and $O_1, \ldots, O_n$ are the objects involved in the action. We then select the most likely action.
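The selection step can be sketched as follows, with made-up bigram statistics standing in for the probabilities estimated from the recipe corpus; the scoring applies the prior once, i.e., P(A) times the product of P(O_k | A), in the spirit of Eq. (2.1).

```python
# Sketch: commonsense action recognition from associated objects.
import math

P_ACTION = {"cut": 0.3, "stir": 0.5, "pour": 0.2}              # P(A), hypothetical values
P_OBJ_GIVEN_ACTION = {                                          # P(O | A), hypothetical values
    "cut":  {"knife": 0.6, "board": 0.3, "pot": 0.01},
    "stir": {"spoon": 0.5, "pot": 0.4, "knife": 0.01},
    "pour": {"bowl": 0.4, "pot": 0.3, "knife": 0.01},
}

def recognize_action(objects, eps=1e-6):
    """Return the action maximizing P(A) * prod_k P(O_k | A) over candidates."""
    def log_score(action):
        score = math.log(P_ACTION[action])
        for obj in objects:
            score += math.log(P_OBJ_GIVEN_ACTION[action].get(obj, eps))
        return score
    return max(P_ACTION, key=log_score)

print(recognize_action(["knife", "board"]))  # -> "cut"
print(recognize_action(["spoon", "pot"]))    # -> "stir"
```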
Transfer Actions. We treat transfer actions separately from the other individual actions, since they occur
when an object is moved from one container to the next and thus require tracking an object’s association
temporally. These actions are critical in keeping track of the location of the food in the cooking task.
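A minimal sketch of this temporal tracking, assuming the per-segment ingredient-container associations computed by the earlier association step; the function name and the example sequence are illustrative, not the thesis implementation.

```python
# Sketch: a transfer action is detected when an ingredient's container
# association changes from one segment to the next.
def detect_transfers(container_by_segment):
    """container_by_segment: list of dicts {ingredient: container}, one per segment."""
    transfers = []
    last = {}
    for idx, assoc in enumerate(container_by_segment):
        for ingredient, container in assoc.items():
            if ingredient in last and last[ingredient] != container:
                transfers.append((idx, ingredient, last[ingredient], container))
            last[ingredient] = container
    return transfers

segments = [{"tomato": "bowl"}, {"tomato": "bowl"}, {"tomato": "board"}]
print(detect_transfers(segments))  # -> [(2, 'tomato', 'bowl', 'board')]
```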
Collaborative Actions. We detect a collaboration (1) when two persons grasp the same object, or (2) when the object grasped by one person is used as a tool to manipulate an object grasped by another person. In case (1), we check over time which hands grasp the object and detect a handover if the person grasping the object changes. Otherwise, we detect a holding action, for instance when one person assists the other person in stirring a pot by holding the pot as well.
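A sketch of rule (1), classifying a shared-grasp episode as a handover or a holding action from the chronological sets of graspers; rule (2) (a tool held by one person acting on an object held by the other) would be handled separately. The inputs below are hypothetical.

```python
# Sketch: classify one collaboration episode for a shared object.
def classify_shared_grasp(grasper_sets):
    """grasper_sets: chronological list of sets of persons grasping the object.
    Returns 'handover' if the sole grasper changes across the episode,
    otherwise 'holding'."""
    sole = [next(iter(s)) for s in grasper_sets if len(s) == 1]
    return "handover" if len(set(sole)) > 1 else "holding"

print(classify_shared_grasp([{"P1"}, {"P1", "P2"}, {"P2"}]))  # -> handover
print(classify_shared_grasp([{"P1"}, {"P1", "P2"}, {"P1"}]))  # -> holding
```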
Action Grammar Parsing. We need to represent the structure of the recognized actions for a robot to
execute them. We use a manipulation action grammar (Yang et al., 2015) which assumes that hands (H) are
the driving force of both single manipulation actions (A) and collaborative actions (C). A hand phrase HP
contains an action phrase AP, or a collaborative action phrase CP. We also introduce an object phrase
OP, which we use to indicate container - ingredient relationships between objects, e.g. a tomato in the
bowl, as well as transfer actions from one container to another. The grammar is given in Fig. 2.3. The rules
(5)-(8) are terminal, with Hand taking the values “LH_P1”, “RH_P1”, “LH_P2” and “RH_P2”, where “LH_P1” is the left hand of the first person, and so on. We use a context-free grammar parser (Church, 1989) to parse
the constructed visual sentences (Zhang et al., 2019) and output a parse tree of the specific manipulation
action. The robot can then execute the action by reversely parsing the tree (Yang et al., 2015). Fig. 2.8
shows the constructed trees from five different action clips.
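As an illustration, the grammar of Fig. 2.3 can be written down and parsed with an off-the-shelf CFG parser (here NLTK's chart parser, which is not necessarily the parser used in the thesis); the terminal vocabulary below is a small illustrative subset.

```python
# Sketch: the collaborative manipulation action grammar as an NLTK CFG,
# and a parse of one "visual sentence".
import nltk

grammar = nltk.CFG.fromstring("""
HP -> H O | HP AP | HP CP | H OP | H AP
AP -> A O | A OP | A HP
CP -> C HP
OP -> O O | O OP
H  -> 'RH_P1' | 'LH_P1' | 'RH_P2' | 'LH_P2'
C  -> 'holding' | 'handover'
O  -> 'knife' | 'board' | 'meat'
A  -> 'cut'
""")

parser = nltk.ChartParser(grammar)
# "P2's right hand uses the knife to cut the meat on the board, while
#  P1's right hand holds the board" as a visual sentence.
sentence = "RH_P1 board holding RH_P2 knife cut meat board".split()
for tree in parser.parse(sentence):
    tree.pretty_print()
    break
```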
Action Graph Generation and Execution. Because of over-segmentation, we end up with multiple
consecutive segments that are parts of the same action. Therefore, we first merge consecutive segments
from the video with identical actions and manipulated objects. We do not require identical ingredient
objects, since they may not be visible in some of the segments.
We then generate an action graph that combines the generated action trees to action sequences, each
corresponding to each person in the video. We then decompose each action into motion primitives. We
P2
P2
P2
P1
P1
P1 P2
Figure 2.4: Generated task graph for the (Rita) video. Different actions are shown in different colors using
the colorbar of Fig. 2.2. P1 indicates an action executed by the woman in the video and P2 by the girl. The
dotted line indicates a collaboration between the two actors.
define four primitives (Holladay, Lozano-Pérez, and Rodriguez, n.d.): grasp, engage, actuate and place. For
instance, a transfer action of a food from a plate to a bowl with a spoon includes grasping the spoon (grasp),
moving it close to the food (engage), performing the scooping motion (actuate), moving the spoon close to
the bowl (engage), turning it to remove the food (actuate), and placing it back in its initial position (place).
We use Task Space Regions (TSRs) (Berenson, Srinivasa, and Kuffner, n.d.) to specify feasible regions of
target poses of the robot’s end effector in the grasp, engage and place primitives and use bidirectional
rapidly-exploring random trees (LaValle and Kuffner, 2001) to plan collision-free paths.
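An illustrative decomposition of the transfer example above into the four primitives (a hypothetical helper, not the thesis code); the actual parameters of each primitive, such as the TSRs and grasps, are computed by the planner.

```python
# Sketch: decompose "transfer food from src to dst using tool" into primitives.
from typing import List, Tuple

def transfer_primitives(tool: str, src: str, dst: str) -> List[Tuple[str, str]]:
    return [
        ("grasp",   tool),  # grasp the tool (e.g., a spoon)
        ("engage",  src),   # move the tool close to the food in the source container
        ("actuate", src),   # scooping motion
        ("engage",  dst),   # move the tool close to the destination container
        ("actuate", dst),   # turn the tool to release the food
        ("place",   tool),  # put the tool back at its initial position
    ]

print(transfer_primitives("spoon", "plate", "bowl"))
```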
We implement the action graph as an open-source platform that enables collaborative task execution
in the cooking domain.† Fig. 2.4 shows the action graph generated for the (Rita) video.‡ The graph includes
one action (stir) that was not detected by the visual processing system.
2.2 Evaluation
Our first experiment evaluates the commonsense reasoning-based module for recognizing single actions in
the cooking domain based on a large annotated dataset of 100 YouTube videos that include single-person
actions. Equipped with these results, we test our system and compare it with a baseline system that consists
of a state-of-the-art action detection framework (Tang et al., 2020) on six YouTube videos that show two
persons collaborating in various cooking tasks.
2.2.1 Action recognition with commonsense reasoning
Based on the fact that in cooking people use specific tools for specific actions, such as a knife to cut and a
rolling pin to roll, we hypothesize that the system will be able to recognize most actions from the objects
and tools in the scene.
Dataset and Experiment Settings. We use the publicly available manipulation knowledge representation dataset called the Functional Object-Oriented Network (FOON) (Paulius, Jelodar, and Sun, 2018). The
dataset was prepared from 100 public domain YouTube cooking videos with a third-person view, and it
† https://github.com/icaros-usc/wecook
‡Rita: https://www.youtube.com/watch?v=d3SZH7NFDjc&t=69s
Figure 2.5: Normalized confusion matrix of the action recognition with commonsense reasoning. The rows
indicate true values, while the columns are the predictions of the action recognition module.
Table 2.1: Comparison of the proposed system with a baseline system regarding the precision, recall and
number of correctly detected actions in the collaborative cooking videos.

            Precision        Recall           Correct
            Ours    Ap1      Ours    Ap1      Ours    Ap1
Sal         0.52    0.31     0.48    0.35     11      8
Robalo      0.38    0.22     0.50    0.22     9       4
Rita        1.00    0.30     0.83    0.50     5       3
Peixe       0.65    0.18     0.76    0.18     13      3
Massa       0.62    0.60     0.41    0.19     13      6
Rissois     0.68    0.33     0.43    0.23     13      7
Total       0.58    0.30     0.51    0.25     64      31
includes annotated actions and objects involved in these actions. We restricted the action set to a set of
common candidate actions similarly to previous work (Yang et al., 2015). We trained our language model
on the Recipe1M+ dataset (Marin et al., 2019) and the One-Billion-Words dataset (Chelba et al., 2013).
Results. The average precision over all actions was 41% and the recall was 69%. Fig. 2.5 shows the
normalized confusion matrix for our action set, with the true positive rate (recall) for each action. The
actions sprinkle and pour were often misclassified as stir because of the prevalence of container objects, such as bowl and pot, which skewed the inference towards the stir action. On the other hand, the performance for cut was high because of the presence of the knife tool. These results show the effectiveness
of the commonsense reasoning-based module in the cooking domain.
2.2.2 Collaborative YouTube videos
The FOON dataset does not include object bounding boxes and does not have videos with collaborative
actions. We show the applicability of the entire system, shown in Fig. 3.2, in six public YouTube videos,
where we manually did the object annotation.§ Although there is no existing complete system that learns
§Masa: https://www.youtube.com/watch?v=1p2wBBmhPmk,
Rissois: https://www.youtube.com/watch?v=jAhQfH1PspU,
Table 2.2: Number of Correct Trees for Different Types of Error. A: Action Recognition Error, HO: Hand-Object Association Error, OO: Object-Object Association Error.

            Total    Correct    A     HO    OO
Sal         21       11         8     1     1
Robalo      24       9          5     7     3
Rita        5        5          0     0     0
Peixe       20       13         0     3     4
Massa       21       13         4     2     2
Rissois     19       13         3     1     2
Total       110      64         20    14    12
Figure 2.6: Example frames of the video (Rita) showing two humans cooking together and corresponding snapshots in simulation environment of two robots executing the same actions. We visualize the learned and executed action graphs on the upper right corner of each frame and snapshot.
the collaborative manipulation plans like ours, we construct a baseline system that consists of a state-of-the-art action detection framework (Tang et al., 2020) and compare our system with it. The results show
that our system achieves significantly higher action detection accuracy.
Dataset and Experiment Settings. We set up a start and end time for each video, annotated the objects
and set bounding boxes. We only annotated objects that were clearly visible, skipping objects that were
heavily occluded. For training the baseline system, we annotated the action labels for each one-second clip of the videos and human body bounding boxes for each frame, similar to the AVA dataset (Gu et al., 2018) used in the baseline method paper (Tang et al., 2020). The videos include a total of 37,030 frames and 126
executed actions of 12 different action types.
Baseline System (Ap1). We construct a baseline system based on AIA (Tang et al., 2020), a state-of-the-art end-to-end deep learning-based action detection framework. We select AIA as our baseline method because it focuses on reasoning over person-person and person-object interactions. Given the ground-truth object and human body annotations, the baseline system detects the actions of each person in each one-second clip of the test videos using AIA. The system then aggregates the results for each video segment. The video segments are generated using the video segmentation module of the proposed system for a fair comparison. We use a model pre-trained on the Kinetics-700 dataset (Carreira and Zisserman, 2017)
Sal: https://www.youtube.com/watch?v=GYr9Mw4ml5U,
Robalo: https://www.youtube.com/watch?v=wRBgiq_kzsY,
Peixe: https://www.youtube.com/watch?v=Jm77G7Xycyw,
Rita: https://www.youtube.com/watch?v=d3SZH7NFDjc
Figure 2.7: Example frames and generated trees of 3 failure cases. The captions depict the ground-truth descriptions of each case: (a) the girl is sprinkling onion to the pot (no tree generated); (b) the man is placing the olive oil back; (c) the girl is not performing any action.
for the action classification task, as proposed in the AIA paper (Tang et al., 2020). For each test video, we then
fine-tune the pre-trained model on the other 5 videos. ¶
Results. We evaluate the performance of the system with respect to the percentage of correctly learned
action trees. We define a correct action tree when the structure and the included persons, actions and
objects are identical to the ground-truth (Yang et al., 2015), and the segment corresponding to that tree
has a non-zero temporal overlap with the ground-truth segment. We specify the precision as the number
of action trees the system returns correctly out of the total number of detected instances, and the recall as
the number of action trees the system returns correctly out of the total number of ground-truth trees.
Fig. 2.8 shows the action-trees and snapshots of the action tree executions by two robotic arms in the
open-source platform. The proposed method successfully reproduces a variety of individual and collaborative actions.
Table 2.1 shows that the proposed system achieved 0.58 precision and 0.51 recall while the baseline
system (Ap1) only achieved 0.30 precision and 0.25 recall. It should be noted that for Ap1 a detection
counts as correct if the output action name is correct since Ap1 only generates action names instead of
executable action trees. Our system works better mainly because 1) our action detection module leverages commonsense knowledge to recognize actions, while Ap1 detects actions only based on the visual features of the objects and human bodies and thus requires a huge amount of annotated data, which is expensive to
¶Dataset and code for training and testing the baseline system is available at https://bit.ly/3v524w6.
Figure 2.8: Example frames, snapshots in simulation environment of two robots executing the same actions with humans and generated action trees of 5 successful cases. The captions depict the ground-truth descriptions of each successful case: (a) the woman is handing over a lemon to the girl; (b) the woman is holding the chopping board for the girl to cut the meat; (c) the woman is transferring meat from the chopping board to the pot; (d) the woman is stirring the pot while the girl is transferring the flour from the bowl to the pot; (e) the girl is rolling the dough.
obtain, to fine-tune the pre-trained model; and 2) by tracking hand-object and object-object associations, our approach is better at detecting collaborative actions and transfer actions.
Post-hoc Analysis. We perform a post-hoc analysis of the data to identify different causes of failures of
our system in the generated trees. Table 2.2 shows the number of correct trees that we generated and the
number of trees that were erroneous for different reasons. We identified three sources of error: 1) action recognition errors (A); 2) hand-object association errors (HO): some objects were not visible and we did not include them in the object annotations, and there were also cases where a hand was close to an object without grasping it, resulting in an erroneous hand-object association; 3) object-object association errors (OO): similarly to the hand-object associations, the bounding boxes of two objects would overlap for consecutive frames even though the objects were not interacting.
We observe that a main source of error was hand-object associations. This occurred often when an
object was not visible and we did not include it in the object annotations. For instance, in Fig. 2.7a, the
system failed to generate an action tree for the sprinkle action since the chopped onion was not annotated.
A second source of error in hand-object associations was because of the distance between bounding boxes.
For example, in Fig. 2.7c, the system incorrectly infers that the girl is grasping the spoon, although the girl’s
hand does not make actual contact with the spoon. Errors occurred also in the object-object associations;
for example, in Fig. 2.7b, the system incorrectly associates the oil with the dish, although the man is
bringing the object back.
2.3 Discussion
In this chapter, we presented a framework named GOPS to guide object placement sampling for efficiently
addressing monotone MAMO problems. Instead of learning a monolithic object placement sampling model,
we propose to leverage a neural network-based collision checker and a set of human-crafted convolution
kernels to efficiently compute object placement cost maps for individual feasibility requirements and effectiveness requirements, and compose them for overall sampling. We showed that our framework outperforms the standard uniform sampling approach and an end-to-end learning-based approach in two
challenging MAMO domains.
Limitations. Our work is limited in many ways. The pre-computing process (Sec. 4.2.1) of our framework
requires expensive IK solving and collision checking. To reduce these computational demands, we plan to
integrate learning modules (He et al., 2023; Bensadoun et al., 2022) in future work. We also plan to account
for more complex object placement constraints, such as the constraints on the relative transformations
between different objects (Yang et al., 2023). We have also assumed full observability of the scene and therefore cannot handle uncertainties and noise in robot perception (Muguira-Iturralde et al., 2022). We plan
to account for sensing limitations in the future (Nikolaidis, Dragan, and Srinivasa, 2016).
2.4 Related work
In this chapter we propose a system for converting full-length unconstrained videos from the Internet to
collaborative action plans executed by one or more robots. Most relevant to ours is prior work on temporal
activity segmentation and action understanding.
Temporal Activity Segmentation. Work in activity segmentation includes learning Gaussian Mixture
Models (Sener and Yao, 2018), specifying cost-functions incorporating spatial and temporal features of the
trajectory (Lasota and Shah, 2019) and combining classifier outputs with a symbolic grammar to parse
sequence data (Qi, Jia, and Zhu, 2018). Zhou et al. propose an end-to-end method for procedure segmentation (Zhou, Xu, and Corso, 2018). More recently, temporal convolutional networks have been proposed
to capture long-range dependencies for this task (Wang et al., 2020; Li, Abu Farha, and Gall, 2021; Huang,
Sugano, and Sato, 2020). However, these approaches rely on expensive data annotation. Thus, more and more
more unsupervised and self-supervised learning approaches (Wang et al., 2022; Li and Todorovic, 2021)
have been proposed to reduce the dependence on labeled data.
While our system can accommodate any state-of-the-art temporal segmentation technique (Li and
Todorovic, 2021; Li, Abu Farha, and Gall, 2021; Wang et al., 2022), we applied the Greedy Gaussian segmentation algorithm by Hallac et al. (Hallac, Nystrup, and Boyd, 2018). The algorithm is applicable to
general multivariate time-series data, and we adopt it for segmenting videos into clips using the extracted
hand trajectories, based on the insight that hands contain temporal and spatial information about manipulation actions.
Human Activity Understanding. There has been a lot of work on human activity recognition (Tang et
al., 2020; Wu et al., 2019; Tirupattur et al., 2021). Recent work on deep learning approaches has enabled the
generation of natural language (Venugopalan et al., 2014), individual robot commands (Nguyen et al., 2018),
low-level robot policies (Finn et al., 2017; Yu et al., 2018; Chen, Nair, and Finn, 2021; Zhang et al., 2018)
and neural programs (Sun et al., 2018) using manually annotated datasets or visual human demonstrations.
Generation of collaborative actions has been achieved by representing them as social affordances (Shu et
al., 2017) from data recorded in a lab setting. Pastra et al. (Pastra and Aloimonos, 2012) discuss a minimalist
grammar for action understanding, inspired by the suggestion by Chomsky (Chomsky, n.d.). Based on this
theoretical insight, Yang et al. (Yang et al., 2015; Yang et al., 2014) proposed a system that uses deep neural
networks for hand and object detection and association, while leveraging a language corpus for action
recognition. Performance was shown on 12 selected clips. In our preliminary work (Zhang et al., 2019), we
presented a collaborative grammar whose functionality was tested only qualitatively in a small number of
manually selected simple clips, rather than in full-length unconstrained videos.
Contrary to the aforementioned work, our system includes a full pipeline that takes as input full-length, previously unseen videos with annotated objects and bounding boxes, infers manipulated objects
and tools based on hand detections, recognizes single and collaborative actions representing their structure
using a collaborative action grammar and concatenates the actions into a temporal sequence of executable
commands. Importantly, the proposed system is modular so that individual components can be replaced
or combined with state-of-the-art techniques, such as CNN-based methods for action recognition (Wang
et al., 2021; Chen et al., 2021). Moreover, different from previous work that learns to generate deep neural
network robot policies (Finn et al., 2017; Yu et al., 2018) and reward functions (Chen, Nair, and Finn, 2021),
our framework generates human-interpretable action trees since human-interpretability is important for
robots that work in human environments.
Chapter 3
Multi-Robot Geometric Task-and-Motion Planning
With the visual imitation system proposed in Chapter 2, we can extract high-level manipulation action
sequences, such as “grasp a knife and cut the meat,” from online videos. However, it is challenging to execute these high-level action sequences in different environments because different environments usually have different geometric details. Moreover, robots that work in the real world usually need to generate
secondary actions, such as moving an obstacle out of the way, to complete the instructed task goal.
Task-and-motion planning (TAMP) problems approach these challenges by combining task planning
and motion planning. GTAMP is an important subclass of TAMP where the robot has to move several
objects to regions in the presence of other movable objects (Kim, Kaelbling, and Lozano-Pérez, 2019).
In this chapter, we focus on multi-robot geometric task-and-motion planning (MR-GTAMP) problems, which extend geometric task-and-motion planning (GTAMP) problems to multi-robot domains. GTAMP has been addressed efficiently in single-robot domains (Kim, Kaelbling, and Lozano-Pérez, 2019; Kim and Shimanuki, 2020; Kim et al., 2022b). In MR-GTAMP, several robots have to collaboratively move several objects
to regions in the presence of other movable obstacles.
MR-GTAMP naturally arises in many multi-robot manipulation domains, such as multi-robot construction, multi-robot assembly and autonomous warehousing (Chen et al., 2022; Hartmann et al., 2021).
MR-GTAMP is interesting as multi-robot systems can perform manipulation tasks more effectively than
single-robot systems and can also perform manipulation tasks that are beyond the capabilities of single-robot systems (Shome et al., 2021). For example, in a product-packaging task, a single robot may have to
move a lot of objects to clear a path to grasp an object, while a two-robot system can easily perform a
handover action to increase the effectiveness of task execution.
Examples of MR-GTAMP problem instances are shown in Fig. 3.1. The example task shown in Fig. 3.1
(left) requires multiple robotic arms to sort colored objects into boxes of corresponding colors in a confined
workspace. The example task shown in Fig. 3.1 (right) requires multiple mobile manipulators to move
green objects to the green region. In both tasks, white objects are movable obstacles and are only allowed
to be relocated within their current regions. These example tasks embody the key challenges that MR-GTAMP aims to address. First, they are in a hybrid discrete-continuous planning space, which is extremely large when multiple robots are involved (Pan et al., 2021; Kim et al., 2022b). This involves high-level task planning, which decides which robot should move which objects and in what sequence, and low-level motion planning, which decides the positions to which objects should be relocated and the motion
Figure 3.1: Left: Sorting colored objects into boxes of corresponding colors. Right: Moving the colored
boxes to the green region. In both scenarios, white objects are only allowed to be relocated within their
current regions (red). We use PyBullet (Coumans and Bai, 2016–2019) as our simulator.
trajectories robots should follow. Second, in both scenarios, robots work in a confined workspace and
have to consider geometric constraints imposed by the environments and the tasks carefully. Finally,
robots must collaborate intelligently to perform tasks effectively. For example, robots can achieve their
targets more quickly by concurrently manipulating multiple objects, and they can avoid relocating too
many objects by performing handover actions.
3.1 Multi-Robot GTAMP problem formulation
In an MR-GTAMP problem, we have a set of n_R robots R = {R_1, . . . , R_nR}, a set of fixed rigid objects F, a set of n_M movable rigid objects M = {M_1, . . . , M_nM} and a set of n_Re regions Re = {Re_1, . . . , Re_nRe}. We assume that all objects and regions have known and fixed shapes. The focus of our work is not on grasp planning (Quispe, Amor, and Christensen, 2016). So, for simplicity, we assume a fixed set of grasps Gr_{M,R} for each object M ∈ M and robot R ∈ R pair. Gr is the union of the sets of grasps for all object and robot pairs.
Each object has a configuration, which includes its position and orientation. Each robot has a configuration defined in its base pose space and joint space. We are given the initial configurations of all robots,
objects, and a goal specification G in the form of a conjunction of statements of the form InRegion(M, Re),
which is true iff object M ∈ M is contained entirely in region Re ∈ Re. An example goal specification
is (InRegion(M1, Re1) ∧ InRegion(M2, Re1)) which indicates the target that we want to move objects
M1 and M2 to region Re1.
We define a grounded joint action as a set of n_R actions and motions performed by all the robots at one time step. We denote the grounded joint action at time step j as the n_R-tuple s_j = ⟨(a^j_R1, ξ^j_R1), (a^j_R2, ξ^j_R2), . . . , (a^j_RnR, ξ^j_RnR)⟩, where each action a is a pick-and-place action or a wait∗ action that the corresponding robot executes, and motion ξ is a trajectory that the corresponding robot executes, specified as a sequence of robot configurations. In this work, we focus on pick-and-place actions because of their importance in robotic manipulation in cluttered space. Each pick-and-place action is a tuple of the form ⟨M, Re, R_pick, R_place, g_pick, g_place, P^place_M⟩, where M represents the object to move; Re represents the target region for M; R_pick and R_place represent the robots that pick and place M, respectively; g_pick and g_place represent the grasps used by R_pick and R_place, respectively; and P^place_M represents the configuration at which to place M.
∗As in (Pan et al., 2021), a robot with a wait action does not have to do anything but can move to avoid other robots.
[Figure 3.2 depicts the proposed two-phase framework on an example problem instance with a target of the form “move object M1 to region Re1”. Phase 1 computes collaborative manipulation information (Sec. 3.2.1), e.g., which objects block a robot from picking up or placing a given object and which regions each robot can reach. Phase 2 performs a tree search for collaborative robot plans (Sec. 3.2.2): the search iteratively selects a task skeleton for grounding (Sec. 3.2.2.2) and generates new task skeletons with the MIP-based task-skeleton generation component (Sec. 3.2.2.1) to resolve conflicts that emerge during grounding, producing robot-executable task-and-motion plans.]
Figure 3.2: Overview of the proposed framework. Fig. 3.3 provides a more detailed visualization and description of Phase 2.
Moreover, we call a pick-and-place action whose R_pick is different from R_place a handover action. Each grounded joint action maps the configurations of the movable objects to new configurations, and the unaffected objects remain at their old configurations.
We define a partially grounded joint action as an n_R-tuple of the form ⟨ā_R1, . . . , ā_RnR⟩, where ā is a wait action or a pick-and-place action without the placement information P^place_M. We refer to a pick-and-place action without the placement information as a partially grounded pick-and-place action, since it has only the information about the grasps that will be used.
We define a task skeleton S̄ as a sequence of partially grounded joint actions. We want to find a task-and-motion plan, i.e., a sequence of grounded joint actions S that changes the configurations of the objects to satisfy G.
We denote the process of finding feasible object placements and motion trajectories for a task skeleton as grounding.
A task-and-motion plan is valid iff, at each time step j: (1) the corresponding multi-robot trajectory Ξ^j = ⟨ξ^j_R1, ξ^j_R2, . . . , ξ^j_RnR⟩ is collision-free; (2) the robots can use the corresponding motion trajectories and grasp poses to grasp the target objects and place them at their target configurations without collisions; and (3) all handover actions can be performed without inducing collisions. The considered collisions include collisions between robots, collisions between an object and a robot, and collisions between objects.
3.2 A planning framework for multi-robot geometric task-and-motion
planning
We present our two-phase MR-GTAMP framework (Fig. 3.2) in this section. In the first phase, we compute
the collaborative manipulation information, i.e., the occlusion and reachability information for individual
robots and whether two robots can perform a handover action for an object (Sec. 3.2.1). In the second phase,
we use a Monte-Carlo Tree Search exploration strategy to search for task-and-motion plans (Sec. 3.2.2).
The search process depends on a key component that generates promising task skeletons (Sec. 3.2.2.1)
and a key component that finds collision-free object placements and trajectories for the task skeletons to
construct valid task-and-motion plans (Sec. 3.2.2.2).
3.2.1 Computing collaborative manipulation information
Given an MR-GTAMP problem instance and the initial configurations of all objects and robots, our framework first computes the occlusion and reachability information for individual robots, e.g., whether an
object blocks a robot from manipulating another object and whether a robot can reach a region to place
an object there. We also compute whether two robots can perform a handover action for an object by
computing whether they can both reach a predefined handover point to transfer the object. In this work,
we consider only handover actions for objects that are named in goal specification G for computational
simplicity. We assume that all robots return to their initial configurations after each time step. Inspired
by (Kim et al., 2022b), we use the conjunction of all true instances of a set of predicates to represent the
computed information. To define these predicates, we define two volumes of the workspace, similar to (Stilman et al., 2007a; Kim et al., 2022b). The first volume, V_pick(M, g, R, ξ), is the volume swept by robot R to grasp object M with grasp g following trajectory ξ. The second volume, V_place(M, g, R, P^place_M, ξ), is the volume swept by robot R and object M to transfer the object to configuration P^place_M following trajectory ξ. Our predicates are as follows:
• OccludesPick(M1, M2, g, R) is true iff object M1 overlaps with the swept volume V_pick(M2, g, R, ξ), where ξ is chosen to be collision-free with all the objects except M2, if possible;
• OccludesGoalPlace(M1, M2, Re, g, R) is true iff object M1 overlaps with the swept volume V_place(M2, g, R, P^place_M2, ξ), where P^place_M2 and ξ are chosen to be collision-free with all the objects except M2, if possible, and the pair ⟨M2, Re⟩ is named in goal specification G;
• ReachablePick(M, g, R) is true iff there exists a trajectory for robot R to pick object M with grasp g;
• ReachablePlace(M, Re, g, R) is true iff there exists a trajectory for robot R to place object M into region Re with grasp g; and
[Figure 3.3 depicts the search tree over task skeletons and summarizes the two key components. Key Component 1 (Sec. 3.2.2.1) generates promising task skeletons: it takes a set of objects to move M∗ and a sequence of grounded joint actions S as input, and outputs a set of task skeletons {S̄}. Key Component 2 (Sec. 3.2.2.2) grounds a task skeleton: it takes a sequence of grounded joint actions S and a task skeleton S̄ as input, and outputs one of three possible outcomes: (1) a sequence of grounded joint actions S∗; (2) a sequence of grounded joint actions S′ and a set of objects M∗ representing the conflicts that emerged during grounding; or (3) a failure flag.]
Figure 3.3: Visualization of the search process in the second phase of our framework. We show the initialization stage of the search process (left) and two example search iterations (middle, right) that lead to
different evaluation outcomes. Left: Blue arrows represent the workflow for initializing the search tree.
Middle: Yellow arrows represent a search iteration that results in an updated set of objects to be moved
and thus a new set of task skeletons to be grounded. Right: Red arrows represent a search iteration that
results in an executable task-and-motion plan.
• EnableGoalHandover(M, g1, g2, R1, R2) is true iff robots R1 and R2 can both reach a predefined
handover point for object M with grasps g1 and g2, respectively, and the object M is named in goal
specification G.
For a predicate instance to be true, the corresponding trajectories are required to be collision-free with
respect to all fixed objects. For a predicate instance of EnableGoalHandover to be true, the two robots
are required to not collide with each other.
The values of all predicate instances can be computed with existing inverse-kinematics solvers (Diankov, 2010) and motion planners (LaValle, 2006). Ideally, we wish to find trajectories for the robots that
have the minimum number of collisions with all objects, i.e., the minimum constraint removal (Hauser,
2013) trajectories. However, this is known to be very time consuming. Thus, we follow previous work (Kim
et al., 2022b) and first attempt to find a collision-free trajectory with respect to all movable and fixed objects. If we fail, we attempt to find a collision-free trajectory with respect to only the fixed objects.
In our implementation, we efficiently compute the predicates – with the exception of EnableGoalHandover – in parallel for all robots by creating an identical simulation environment for each robot.
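To make this concrete, the following is a minimal Python sketch, and not the thesis implementation, of how such predicate queries can be farmed out to one worker process per robot, so that each worker owns its own simulation environment; the robot and object names, the grasp set and the reachable_pick stub are placeholder assumptions standing in for the inverse-kinematics and motion-planning queries described above.

from multiprocessing import Pool

ROBOTS = ["R1", "R2"]                     # assumed robot names
OBJECTS = ["M1", "M2", "M3", "M4", "M5"]  # assumed movable objects
GRASPS = ["g1", "g2"]                     # assumed fixed grasp set per object-robot pair

def reachable_pick(robot, obj, grasp):
    # Stand-in for an IK + motion-planning query executed inside a private
    # simulation environment (e.g., a separate physics client per worker).
    return hash((robot, obj, grasp)) % 3 != 0

def predicates_for_robot(robot):
    # Each worker would create its own copy of the scene here, so queries for
    # different robots can run concurrently without sharing any state.
    true_instances = []
    for obj in OBJECTS:
        for grasp in GRASPS:
            if reachable_pick(robot, obj, grasp):
                true_instances.append(("ReachablePick", obj, grasp, robot))
    return true_instances

if __name__ == "__main__":
    with Pool(len(ROBOTS)) as pool:
        per_robot = pool.map(predicates_for_robot, ROBOTS)
    predicates = [inst for robot_insts in per_robot for inst in robot_insts]
    print(len(predicates), "true ReachablePick instances")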
3.2.2 Searching for task-and-motion plans
We now describe our search process (Fig. 3.3) for efficiently finding effective collaborative task-and-motion
plans. Our search process is initialized with a set of task skeletons that are generated for moving the set of objects named in the goal specification, utilizing the computed collaborative manipulation information
(Sec. 3.2.1). We will describe our key component for generating task skeletons in detail in Sec. 3.2.2.1. We
then generate a search tree with a root node, denoted as D0 as shown in Fig. 3.3 (left). We associate an
empty sequence of grounded joint actions with node D0, denoted as D0.S = ∅. We use the “.” operator
to denote the association relationship. This implies that at node D0, we do not have any grounded joint
actions. We then create edges originating from node D0, with each edge storing a distinct initial task
skeleton.
Throughout our search process, at each search iteration, we select an edge that has not been evaluated
yet, and we evaluate it by trying to ground the task skeleton associated with it. As previously defined,
the term grounding refers to the process of finding feasible object placements and motion trajectories for
a task skeleton to be executable. After each evaluation, we compute a reward based on the evaluation
result. The reward will then be propagated back up the search tree, with each edge in the path from the
root node to the selected edge having its value updated based on the reward. We use a Monte-Carlo Tree
Search (MCTS) exploration strategy to balance exploration (exploring different unevaluated edges) and
exploitation (biasing the search towards the branches that have received high rewards).
We use a reverse search algorithm inspired by (Stilman et al., 2007a) to ground task skeletons. We will
describe our key component for task-skeleton grounding in detail in Sec. 3.2.2.2. The insight behind the
reverse search algorithm is to use the grounded future joint actions as the artificial constraints to guide
the grounding for the current actions. Therefore, throughout our search process, we save the grounding
results and use them as artificial constraints for subsequent grounding tasks. We use two examples, as
shown in Fig. 3.3 (middle, right), to illustrate the idea.
In the first example (Fig. 3.3 (middle)), we select edge E2 for evaluation. We create a new node, denoted as D2, to serve as the head node of edge E2. The tail node of edge E2 is the root node D0 whose
associated sequence of grounded joint actions is empty. This means that we can attempt to ground the
task skeleton associated with E2, denoted as E2.S¯, without any artificial constraints. Ideally, if we manage
to ground task skeleton E2.S¯ successfully, we would get an executable task-and-motion plan to perform
the task. However, in many situations, we can only ground the task skeleton partially. This implies that
there are conflicts that emerge during task-skeleton grounding. For example, there would not be enough
space to place objects unless we relocate some objects that were not planned to be moved initially. Such
situations can arise as we cannot account for all geometric specifics during task-skeleton generation. In
such situations, we generate new task-skeletons to address the emerged conflicts, and we expand the tree
by creating new edges, with each edge storing a distinct new task skeleton. In our first example, we create
new edges originating from node D2. Moreover, we store the sequence of joint actions that have been
grounded to this point in node D2, denoted as D2.S. It should be noted that D2.S contains D0.S and the
grounded part of E2.S¯.
In the second example (Fig. 3.3 (right)), we select edge E2.1 for evaluation. The grounding of the task
skeleton associated with edge E2.1, denoted as E2.1.S¯, should consider D2.S as artificial constraints which
is the sequence of joint actions that have been grounded to this point. If we successfully ground E2.1.S¯,
we can get an executable task-and-motion plan by concatenating the grounded task-skeleton with D2.S.
At each search iteration, we have four phases: selection, expansion, evaluation and backpropagation.
Notation. We use |S| and |S¯| to denote the number of objects intended to be moved in sequences of
grounded joint actions S and task skeletons S¯, respectively.
Selection phase. In the selection phase, we start at the root node and recursively select the edge with the highest Upper Confidence Bound (UCB) value until we reach an edge Ei with a task skeleton that has not been grounded yet. We denote the tail node of edge Ei as Dj. We follow the UCB value formula used in (Silver et al., 2017). The UCB value of the pair of node Dj and edge Ei is

Q(Dj, Ei) = Ei.value / (Ei.visits + 1) + c × Ei.prior × √(Dj.visits) / (Ei.visits + 1),

where Ei.value is the cumulative reward edge Ei has received so far, Dj.visits and Ei.visits are the numbers of times Dj and Ei have been selected, c is a constant to balance exploration and exploitation, and Ei.prior is used to bias the search with domain knowledge (Silver et al., 2017). In our implementation, we set Ei.prior to 1/|Ei.S̄| to prioritize grounding task skeletons with fewer objects to move. The value Ei.value of an edge is initialized to 0.
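For illustration, the edge-selection rule can be written in a few lines of Python; the Edge and Node containers below and the example numbers are our own simplifications, not the thesis code, and only mirror the UCB formula and the 1/|E.S̄| prior described above.

import math
from dataclasses import dataclass, field

C_EXPLORE = 1.0  # the exploration constant c (value assumed for illustration)

@dataclass
class Edge:
    skeleton_size: int   # |E.S̄|: number of objects the stored task skeleton moves
    value: float = 0.0   # cumulative reward received so far
    visits: int = 0

    @property
    def prior(self):     # prior = 1 / |E.S̄| biases the search toward small skeletons
        return 1.0 / self.skeleton_size

@dataclass
class Node:
    visits: int = 0
    edges: list = field(default_factory=list)

def ucb(node, edge):
    # Q(D, E) = E.value / (E.visits + 1) + c * E.prior * sqrt(D.visits) / (E.visits + 1)
    return (edge.value / (edge.visits + 1)
            + C_EXPLORE * edge.prior * math.sqrt(node.visits) / (edge.visits + 1))

root = Node(visits=3, edges=[Edge(skeleton_size=2, value=1.2, visits=2),
                             Edge(skeleton_size=4)])
best_edge = max(root.edges, key=lambda e: ucb(root, e))

Unvisited edges keep a nonzero exploration term, so they are still selected occasionally even after some siblings have accumulated reward.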
Assume that we select edge Ei from node Dj in the selection phase.
Expansion phase. In the expansion phase, we create a new node Di as the head node of edge Ei.
Evaluation phase. In the evaluation phase, we use the task-skeleton grounding component to ground task skeleton Ei.S̄ associated with Ei to compute reward r for selecting edge Ei (Sec. 3.2.2.2). Note that node Dj is the tail node of edge Ei and the grounded sequence of joint actions stored in node Dj is denoted as Dj.S. There are three possible outcomes: (1) If we fail at grounding, we set r to 0. (2) If we obtain a sequence of grounded joint actions S∗, then we have found a valid task-and-motion plan. In this case, we set r to 1 + α · (1/|S∗|), where α is a constant hyperparameter used to balance the two terms of the reward and is set to 1 in our experiments (Sec. 3.3). The first term of the reward incentivizes the search algorithm to select edges where more actions have been grounded, and the second term incentivizes the search algorithm to select edges that move fewer objects. (3) In the third case, task skeleton Ei.S̄ cannot be fully grounded without relocating some objects that are not planned to be moved in Ei.S̄. In this case, we obtain a sequence of grounded joint actions S′ and a set of objects M∗ from the grounding process. Here, S′ consists of Dj.S and the grounded part of task skeleton Ei.S̄. We use M∗ to represent the set of objects for which we need to find a sequence of grounded joint actions, denoted as S_M∗, to relocate, so that we can construct a final task-and-motion plan for the problem by concatenating S_M∗ with S′. We then call the task-skeleton generating component (Sec. 3.2.2.1) to move M∗. If we cannot find any task skeleton to move M∗, then we set r to 0. However, if we find a set of task skeletons {S̄}, then we set r to S′.length / (S′.length + S̄∗.length) + α · (1 / (|S′| + |S̄∗|)), where S̄∗ is the task skeleton with the minimum number of time steps among all task skeletons {S̄}, and S′.length and S̄∗.length represent the numbers of time steps of S′ and S̄∗, respectively.
We would like to point out that the reward in the second possible outcome represents a special case
of the reward in the third possible outcome. Both rewards use their first terms to incentivize the search
algorithm to select edges where more actions have been grounded, and their second terms to incentivize
the search algorithm to select edges that move fewer objects.
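The three outcomes can be summarized by a small helper; the sketch below is our paraphrase of the reward definitions above (with α = 1), and the argument names are chosen for illustration only.

ALPHA = 1.0  # the hyperparameter alpha balancing the two reward terms

def grounding_reward(outcome, n_objects_moved=None,
                     grounded_steps=None, skeleton_steps=None):
    # outcome: "failure", "full" (a complete plan S*) or "partial" (S' plus new skeletons).
    # n_objects_moved: |S*| for a full grounding, or |S'| + |S̄*| for a partial one.
    # grounded_steps, skeleton_steps: S'.length and S̄*.length for a partial grounding.
    if outcome == "failure":
        return 0.0
    if outcome == "full":
        return 1.0 + ALPHA * (1.0 / n_objects_moved)
    progress = grounded_steps / (grounded_steps + skeleton_steps)
    return progress + ALPHA * (1.0 / n_objects_moved)

r_full = grounding_reward("full", n_objects_moved=4)             # 1.25
r_partial = grounding_reward("partial", n_objects_moved=6,
                             grounded_steps=3, skeleton_steps=2)  # about 0.77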
We use node Di to store the returned grounded joint actions S′ as Di.S. In the third scenario, if we find new task skeletons, then we create new edges to store them for node Di. If no new edge is created, then we mark node Di as a terminal node.
Backpropagation phase. In the backpropagation phase, we update the cumulative reward of the selected
edges {Esel} with the computed reward r according to Esel.value = Esel.value + r. We also increment
the number of visits of the selected edges and nodes by 1.
In our implementation, we track the grounding failures for different task skeletons similarly to (Ren,
Chalvatzaki, and Peters, 2021), so that we can skip over those branches where grounding their task skeletons is known to be infeasible.
3.2.2.1 Key Component 1: Generating promising task skeletons
One key component in the second phase of our framework is to generate promising task skeletons {S̄} for moving a set of objects M∗ given a sequence of already grounded joint actions S′. As previously defined, the term task skeleton refers to a sequence of actions without the placement and trajectory information. This key component is used in two situations. It is first called at the initialization stage of the search process (Fig. 3.3 (left)). In this situation, we set S′ as empty and set M∗ as the set of objects named in the goal specification of the problem instance. We use the generated task skeletons to initialize the search tree as shown in Fig. 3.3 (left). The second situation in which this component is called is when we can only ground part of a task skeleton in the evaluation phase during the search process. Fig. 3.3 (middle) depicts one example search iteration where this situation happens. In this example search iteration, we set S′ as S2 and set M∗ as M∗2. We use this key component to generate task skeletons to relocate M∗2.
We take S′ as input because we should exclude from our task-skeleton generation the objects that are already planned to be moved in S′. The task-skeleton generation algorithm is designed to utilize the collaborative manipulation information computed in the first phase (Sec. 3.2.1) to eliminate task skeletons that include infeasible actions and to prioritize motion planning for effective task plans that have fewer time steps and fewer objects to be moved.
Notation. Assume that we want to generate task skeletons to move objects M∗ given a sequence of grounded joint actions S′. The set of objects included in S′ cannot be moved again because of the monotone assumption. For simplicity of presentation, we slightly abuse M to denote the movable objects not included in S′.
Building the collaborative manipulation task graph. To reason about the collaborative manipulation capabilities of the individual robots, we encode the computed information as a graph. We build a collaborative manipulation task graph (CMTG) to capture the precedence of the manipulations of different objects, i.e., we can only move an object after we move the obstacles that block the pick-and-place action we are going to execute, based on the computed information from the first phase (Sec. 3.2.1). Since we only compute occlusion information for placing objects named in the goal specification, the precedence encoded in the CMTG lacks occlusion information for relocating objects that are not named in the goal specification. Instead, we assume that we will always find feasible places to relocate these objects. We determine the exact object placements during task-skeleton grounding (Sec. 3.2.2.2).
A CMTG (Fig. 3.4) has two types of nodes: an object node represents an object M ∈ M, and an action node represents a partially grounded pick-and-place action ā, i.e., a pick-and-place action without placement information. A CMTG has three types of edges: An action edge is an edge from an object node to an action node. It represents moving the object represented by the object node with the action represented by the action node.
Figure 3.4: (Left) An example scenario where we want to generate task skeletons to move object M1 given
an empty sequence of grounded joint actions. (Right) The corresponding collaborative manipulation task
graph for moving object M1. The rounded rectangular nodes are action nodes. The circular nodes are
object nodes. The red circular nodes represent objects that are specified to be moved. The yellow arrows
represent action edges. The purple arrows represent block-place edges, and the blue arrow represents a
block-pick edge.
A block-pick edge is an edge from an action node to an object node. It represents that the object represented by the object node obstructs the pick action of the action represented by the action node. A block-place edge is an edge from an action node to an object node. It represents that the object represented by the object node obstructs the place action of the action represented by the action node. All block-place edges are connected to the action nodes that move the objects named in the goal specification. A CMTG has a set of object nodes that represent the input objects M∗ that must be moved.
Given the computed collaborative manipulation information and a set of objects M∗ to move, we incrementally construct a CMTG by iteratively adding each object M ∈ M∗ to the CMTG with Alg. 1. Given the CMTG C built so far and an object M to add, we first add an object node representing M to C (Alg. 1, line 5). Then, for each pair of a robot R ∈ R and its grasp g_{M,R} ∈ Gr_{M,R}, we find all partially grounded pick-and-place actions ā that move object M to its target region Re_M with R as the pick robot (Alg. 1, lines 6-27). For each partially grounded pick-and-place action ā, we find all movable objects that block the pick action of ā and add the corresponding block-pick edges (Alg. 1, lines 35-40). If M is named in goal specification G, then we also find all movable objects that block the place action of ā and add the corresponding block-place edges (Alg. 1, lines 41-48). We recursively add the blocking objects in a similar way (Alg. 1, lines 37 and 44).
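For readers who prefer code, a minimal sketch of the graph containers that Alg. 1 populates could look as follows; the class and field names are our own and simply mirror the node and edge types described above.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class PartialAction:
    # A partially grounded pick-and-place action: no placement configuration yet.
    obj: str
    region: str
    robot_pick: str
    robot_place: str
    grasp_pick: str
    grasp_place: str

@dataclass
class CMTG:
    # Collaborative manipulation task graph, as built incrementally by Alg. 1.
    object_nodes: set = field(default_factory=set)       # objects M
    action_nodes: set = field(default_factory=set)       # PartialAction instances
    action_edges: set = field(default_factory=set)       # (object, action) pairs
    block_pick_edges: set = field(default_factory=set)   # (action, blocking object)
    block_place_edges: set = field(default_factory=set)  # (action, blocking object)

Because PartialAction is frozen (and therefore hashable), inserting the same action tuple twice collapses to a single action node, which matches the set semantics assumed in Alg. 1.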
Mixed-integer linear program formulation and solving. Given a CMTG C, we find a set of task
skeletons that specify which robot will move which object at each time step. We assume that each object
will be moved at most once, i.e., we assume that the problem instances are monotone. Given a time step
limit T, we cast the problem of finding a task skeleton that has a minimum number of objects to be moved
as a mixed-integer linear program (MIP). We encode the precedence of manipulating different objects as
formal constraints in the MIP such that we can generate task skeletons that are promising to be successfully
grounded. We incrementally increase the time step limit T. In our implementation, the maximum time
step limit is a hyperparameter.
Algorithm 1 AddObject(M, C)
1: input: an object M; the collaborative manipulation task graph built so far, denoted as C.
2: if M ∈ C.object_nodes then
3:     return
4: end if
5: C.object_nodes.add(M)
6: if M is named in goal specification G then
7:     Re_M = GetGoalRegion(M)
8: else
9:     Re_M = GetCurrentRegion(M)
10: end if
11: for R_pick ∈ R do
12:     for g_{M,R_pick} ∈ Gr_{M,R_pick} do
13:         ā = {}
14:         if ReachablePick(M, g_{M,R_pick}, R_pick) then
15:             if ReachablePlace(M, Re_M, g_{M,R_pick}, R_pick) then
16:                 ā.add((M, Re_M, R_pick, R_pick, g_{M,R_pick}, g_{M,R_pick}))
17:             end if
18:             if M is named in goal specification G then
19:                 for R_place ∈ R \ {R_pick} do
20:                     for g_{M,R_place} ∈ Gr_{M,R_place} do
21:                         if EnableGoalHandover(M, g_{M,R_pick}, g_{M,R_place}, R_pick, R_place) and
22:                            ReachablePlace(M, Re_M, g_{M,R_place}, R_place) then
23:                             ā.add((M, Re_M, R_pick, R_place, g_{M,R_pick}, g_{M,R_place}))
24:                         end if
25:                     end for
26:                 end for
27:             end if
28:             for ā ∈ ā do
29:                 R^pick_ā is the robot to pick M in ā
30:                 g^pick_ā is the grasp used by R^pick_ā in ā
31:                 R^place_ā is the robot to place M in ā
32:                 g^place_ā is the grasp used by R^place_ā in ā
33:                 C.action_nodes.add(ā)
34:                 C.action_edges.add(M, ā)
35:                 for M_j ∈ M do
36:                     if OccludesPick(M_j, M, g^pick_ā, R^pick_ā) then
37:                         AddObject(M_j, C)
38:                         C.block_pick_edges.add(ā, M_j)
39:                     end if
40:                 end for
41:                 if M is named in goal specification G then
42:                     for M_j ∈ M do
43:                         if OccludesGoalPlace(M_j, M, Re_M, g^place_ā, R^place_ā) then
44:                             AddObject(M_j, C)
45:                             C.block_place_edges.add(ā, M_j)
46:                         end if
47:                     end for
48:                 end if
49:             end for
50:         end if
51:     end for
52: end for
For simplicity of presentation, we slightly abuse M again to denote the objects in C. We use M∗ ⊆ M to denote the objects that are intended to be moved. We slightly abuse ā to denote the set of partially grounded pick-and-place actions in C. We use E_ā = {(M, ā)} to denote the set of action edges in C. We use E_B^pick = {(ā, M)} to denote the set of block-pick edges and E_B^place = {(ā, M)} to denote the set of block-place edges in C, with E_B = E_B^pick ∪ E_B^place, where M ∈ M and ā ∈ ā. We define the binary variables X^t_{M,ā} and X^t_{ā,M}, where t ∈ [1, . . . , T], (M, ā) ∈ E_ā and (ā, M) ∈ E_B. X^t_{M,ā} = 1 implies that action ā is executed at a time step t′ such that t′ ≥ t. X^t_{ā,M} = 1 implies that object M can be considered for being moved at time step t since it blocks action ā, which is executed at or after time step t.
Our MIP model is shown in the following. The implications in constraint (11) and constraint (12) are compiled to linear constraints using the big-M method (Griva, Nash, and Sofer, 2009):

minimize Σ_{(M,ā)∈E_ā} X^1_{M,ā}

subject to

X^t_{M,ā} ≥ X^{t+1}_{M,ā},   ∀(M, ā) ∈ E_ā, t ∈ [1, T − 1]   (3.1)
X^t_{M,ā} = X^t_{ā,M′},   ∀(M, ā) ∈ E_ā, (ā, M′) ∈ E_B, t ∈ [1, T]   (3.2)
X^t_{M,ā′} ≤ Σ_{(ā,M)∈E_B} X^t_{ā,M},   ∀M ∈ M \ M∗, (M, ā′) ∈ E_ā, t ∈ [1, T]   (3.3)
Σ_{(M,ā)∈E_ā s.t. R in ā} X^T_{M,ā} ≤ 1,   ∀R ∈ R   (3.4)
Σ_{(M,ā)∈E_ā} X^T_{M,ā} ≥ 1   (3.5)
Σ_{(M,ā)∈E_ā s.t. R in ā} X^t_{M,ā} ≤ 1 + Σ_{(M,ā)∈E_ā s.t. R in ā} X^{t+1}_{M,ā},   ∀R ∈ R, t ∈ [1, T − 1]   (3.6)
Σ_{(M,ā)∈E_ā} X^t_{M,ā} ≥ 1 + Σ_{(M,ā)∈E_ā} X^{t+1}_{M,ā},   t ∈ [1, T − 1]   (3.7)
Σ_{(M,ā)∈E_ā} X^1_{M,ā} = 1,   ∀M ∈ M∗   (3.8)
Σ_{(M,ā′)∈E_ā} X^1_{M,ā′} ≥ X^1_{ā,M},   ∀(ā, M) ∈ E_B   (3.9)
Σ_{(M,ā)∈E_ā} X^1_{M,ā} ≤ 1,   ∀M ∈ M   (3.10)
X^1_{ā,M} = 1 ⟹ Σ_{t∈[1,...,T]} X^t_{ā,M} ≥ (Σ_{(M,ā′)∈E_ā} Σ_{t∈[1,...,T]} X^t_{M,ā′}) + 1,   ∀(ā, M) ∈ E_B^pick   (3.11)
X^1_{ā,M} = 1 ⟹ Σ_{t∈[1,...,T]} X^t_{ā,M} ≥ Σ_{(M,ā′)∈E_ā} Σ_{t∈[1,...,T]} X^t_{M,ā′},   ∀(ā, M) ∈ E_B^place   (3.12)
Constraint (1) enforces that X^t_{M,ā} indicates whether we have selected ā at or after time step t. Constraint (2) enforces that, if an action is selected, then the objects that obstruct it are also moved. Constraint (3) enforces that, besides the objects in M∗, we only move objects that obstruct the actions we have selected. Constraints (4)-(7) enforce that, at each time step, we select at least one action, while each robot executes at most one action. Constraint (8) enforces that the objects in M∗ are moved. Constraint (9) enforces that all obstacles for the selected actions are moved, while constraint (10) enforces that each object is moved only once. Constraint (11) enforces that each object is moved after the obstacles for its pick action have been moved. Constraint (12) enforces that each object is moved after the obstacles for its place action have been moved. The objective function represents the number of moved objects.
From a MIP solution, we construct a task skeleton which is grounded later. Moreover, we want to
construct multiple task skeletons since some task skeletons may be impossible to ground. Every time we
obtain a solution, we add a constraint to the MIP model to enforce that we find a different solution from
the existing ones until we collect enough task skeletons (Danna et al., 2007). In our implementation, the
maximum number of task skeletons is a hyperparameter that varies for different problem instances.
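The following toy example, written with the PuLP modeling library over a made-up three-edge graph and a horizon of T = 2, illustrates the mechanism rather than reproduces the thesis implementation: a few of the constraints above are posed over binary variables, and after each solve a solution-exclusion cut in the spirit of (Danna et al., 2007) is added so that the next solve must return a different assignment.

import pulp

T = 2
action_edges = [("M1", "a1"), ("M2", "a2"), ("M2", "a3")]  # assumed action edges E_a
goal_objects = {"M1"}                                       # assumed goal objects M*

prob = pulp.LpProblem("task_skeletons", pulp.LpMinimize)
X = {(e, t): pulp.LpVariable(f"X_{e[0]}_{e[1]}_{t}", cat="Binary")
     for e in action_edges for t in range(1, T + 1)}

# Objective: number of moved objects (sum of the time-step-1 indicators).
prob += pulp.lpSum(X[e, 1] for e in action_edges)
# Constraint (3.1): X^t >= X^{t+1} for every action edge.
for e in action_edges:
    for t in range(1, T):
        prob += X[e, t] >= X[e, t + 1]
# Constraint (3.8): every goal object is moved by exactly one of its actions.
for M in goal_objects:
    prob += pulp.lpSum(X[e, 1] for e in action_edges if e[0] == M) == 1
# Constraint (3.5): at least one action is executed at the last time step.
prob += pulp.lpSum(X[e, T] for e in action_edges) >= 1

skeletons = []
for _ in range(3):  # collect up to three distinct task skeletons
    if prob.solve(pulp.PULP_CBC_CMD(msg=False)) != pulp.LpStatusOptimal:
        break
    chosen = [key for key, var in X.items() if var.value() > 0.5]
    skeletons.append(chosen)
    # Solution-exclusion cut: the next solution must change at least one variable.
    prob += (pulp.lpSum(X[key] for key in chosen)
             - pulp.lpSum(var for key, var in X.items() if key not in chosen)
             <= len(chosen) - 1)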
Algorithm 2 Task-Skeleton Grounding(S̄, S_fut, M_fut, V_fut, M_out)
1: input: a task skeleton S̄; a sequence of grounded joint actions S_fut; the set of objects that are planned to be moved in S_fut, denoted as M_fut; the volume of workspace that is occupied by S_fut, denoted as V_fut; the set of movable objects that are not planned to be moved in S_fut and S̄, denoted as M_out.
2: result: three possible returns: (1) a sequence of grounded joint actions S∗, indicating that we successfully found an executable task-and-motion plan; (2) a sequence of grounded joint actions S′ and a set of objects M∗, indicating that we can only partially ground task skeleton S̄ and have to relocate objects M∗; (3) a failure flag.
3: notation: We denote the sequence-concatenating operator as ⊕.
4: G = goal specification of the MR-GTAMP problem instance
5: M = the set of movable objects of the MR-GTAMP problem instance
6: for t ∈ [T, . . . , 1] do
7:     S̄[t] = PartiallyGroundedJointActionAt(S̄, t)
8:     M^t, R^t = ObjectsAndRobotsToMove(S̄[t])
9:     P = FindPlacements(M^t, M_out ∪ M_fut ∪ F ∪ V_fut, S̄[t])
10:    if P is None then
11:        P = FindPlacements(M^t, M_fut ∪ F ∪ V_fut, S̄[t])
12:        if P is None then
13:            return failure flag
14:        end if
15:        Ξ = FindTrajectories(M^t, R^t, P, M_fut ∪ F, S̄[t])
16:        if Ξ is None then
17:            return failure flag
18:        end if
19:        S_fut = CreateGroundedJointAction(S̄[t], Ξ, P) ⊕ S_fut
20:        M∗ = HaveNotBeenMoved(G, S_fut) ∪ MovablesOcclude(M, S_fut)
21:        S′ = S_fut
22:        return S′, M∗
23:    end if
24:    Ξ = FindTrajectories(M^t, R^t, P, M_out ∪ M_fut ∪ F, S̄[t])
25:    if Ξ is None then
26:        Ξ = FindTrajectories(M^t, R^t, P, M_fut ∪ F, S̄[t])
27:        if Ξ is None then
28:            return failure flag
29:        end if
30:        S_fut = CreateGroundedJointAction(S̄, t, Ξ, P) ⊕ S_fut
31:        M∗ = HaveNotBeenMoved(G, S_fut) ∪ MovablesOcclude(M, S_fut)
32:        S′ = S_fut
33:        return S′, M∗
34:    end if
35:    M_fut = M_fut ∪ M^t
36:    S_fut = CreateGroundedJointAction(S̄, t, Ξ, P) ⊕ S_fut
37:    V_fut = V_fut.append(SweptVolume(Ξ, M^t, R^t))
38: end for
39: S∗ = S_fut
40: return S∗
3.2.2.2 Key Component 2: Task-skeleton grounding
The second key component in the search phase (Sec. 3.2.2) is to ground the task skeletons, i.e., to find
the object placements and motion trajectories for the partially grounded pick-and-place actions. We use a
reverse search algorithm inspired by (Stilman et al., 2007a) since forward search for continuous parameters
of long-horizon task skeletons without any guidance is very challenging (Kim, Kaelbling, and Lozano-Pérez, 2019). The insight behind the reverse search strategy is to use the grounded future joint actions as
the artificial constraints to guide the grounding for the present time step.
The input to this component is a task skeleton S̄ of T time steps and a sequence S_fut of future grounded joint actions. We use S_fut as artificial constraints to guide the grounding for the current actions, so that we can efficiently find geometrically feasible long-horizon plans (Stilman et al., 2007a). Ideally, if we manage to ground task skeleton S̄ successfully, we will get a fully executable task-and-motion plan. However, in many situations, since we cannot account for all geometric specifics during task-skeleton generation, we can only ground the task skeleton partially. In such cases, we will get a set of objects, denoted as M∗, for which we have to generate new task skeletons to relocate. We will then return the sequence of grounded joint actions together with objects M∗. Furthermore, in certain situations, the grounding may fail entirely. In such cases, we simply return a failure flag.
We denote the volume of workspace occupied by grounded joint actions S_fut as V_fut. We denote the set of movable objects that will be moved by grounded joint actions S_fut as M_fut. We denote the set of movable objects that will not be moved by task skeleton S̄ and grounded joint actions S_fut as M_out. For time step t ∈ [1, . . . , T], we denote the set of objects that are planned to be moved as M^t and the set of robots that are planned to move them as R^t. Recall that we denote the goal specification and the set of movable objects as G and M, respectively.
The detailed grounding algorithm is as follows (Alg. 2). The grounding starts at the last time step T. For time step t, we first sample placements for objects M^t that are collision-free with respect to objects M_out ∪ M_fut at their initial poses, fixed objects F and volume V_fut (Alg. 2, line 9). The sampled placements should not collide with volume V_fut because, otherwise, they will prevent the execution of future grounded joint actions that occupy V_fut.
Given the placements, we plan pick trajectories, place trajectories and handover trajectories for objects M^t and robots R^t that are collision-free with respect to objects F ∪ M_fut ∪ M_out at their initial poses (Alg. 2, line 24). We note that, in addition to the fixed objects F and the objects M_out, the planned trajectories should not collide with the objects M_fut that are moved in future grounded joint actions.
Since we may move multiple robots and objects concurrently, we do not allow collisions between the
robots, collisions between the moved objects and collisions between a robot and a moved object that is not
intended to be manipulated by that robot.
If we succeed in grounding the joint action at time step t, then we expand volume V_fut with the volume occupied by the newly planned robot and object trajectories, expand the set M_fut with the moved objects M^t and expand the grounded joint actions S_fut with the newly grounded joint action (Alg. 2, lines 35-37). We then start to ground the joint action at time step t − 1. If we succeed in grounding the joint actions at every time step, we return an executable task-and-motion plan S∗ = S_fut. However, if we fail at grounding the joint action at time step t, we relax the collision constraints by allowing the sampled placements and trajectories to collide with the objects M_out, since we can generate new skeletons to move them later (Alg. 2, lines 10-23 and lines 25-34). If we succeed after relaxing the constraints, then we terminate the grounding and return the sequence of the grounded joint actions S′ = S_fut and a set of objects M∗. The set of objects M∗ consists of the objects that are named in the goal specification G but have not yet been moved and the movable objects in the environment that occlude the grounded joint actions S′ (Alg. 2, line 20 and line 31). During the search process (Sec. 3.2.2), the returned S′ and M∗ are then used as input to the first key component (Sec. 3.2.2.1) to generate new task skeletons. If, after relaxing the collision constraints, we still cannot find feasible placements and paths, then we simply return failure.
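As a schematic illustration of this relax-then-report logic (and not the actual grounding code), the helper below first samples placements that avoid M_out as well as the hard obstacles and, failing that, retries while ignoring M_out and reports which of those objects the chosen placements overlap; the sampler and collides callables are assumed stand-ins for the geometric placement sampler and collision checker.

def sample_step_placements(objects, m_out, hard_obstacles, sampler, collides, tries=100):
    # First pass treats M_out as obstacles; the relaxed second pass ignores M_out
    # and reports which M_out objects now conflict with the chosen placements.
    for forbidden, relaxed in ((set(m_out) | set(hard_obstacles), False),
                               (set(hard_obstacles), True)):
        placements = {}
        for obj in objects:
            pose = next((p for p in (sampler(obj) for _ in range(tries))
                         if not collides(obj, p, forbidden)), None)
            if pose is None:
                placements = None
                break
            placements[obj] = pose
        if placements is not None:
            conflicts = set()
            if relaxed:
                conflicts = {m for m in m_out
                             if any(collides(o, p, {m}) for o, p in placements.items())}
            return placements, conflicts
    return None, None  # grounding fails at this time step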
3.3 Evaluation
We empirically evaluate our framework in two challenging domains and show that it can generate effective
collaborative task-and-motion plans more efficiently than two baselines.
3.3.1 Baselines
We compare our framework with two state-of-the-art TAMP frameworks. We provide both baseline planners with information about the reachable regions of each robot.
Ap1 is a multi-robot extension of the RSC algorithm (Stilman et al., 2007a) by assuming that the robots
form a single composite robot. The action space includes all possible combinations of the single-robot
actions and collaboration actions. Unlike our framework, which eliminates infeasible task plans using
computed information about the manipulation capabilities of individual robots (Sec. 3.2.1), thereby pruning the search space, Ap1 would require searching through a large space of all possible combinations of
multi-agent actions. Moreover, the focus of Ap1 is on feasibility of the task-and-motion plans, rather than
on the plan length and number of objects moved. In contrast, our framework uses the intermediate grounding results (Sec. 3.2.2) to guide the search towards more effective task-and-motion plans, considering the
resulting plan length and the number of objects moved.
Ap2 is a general MR-TAMP framework (Pan et al., 2021) that is efficient in searching for promising task
plans based on the constraints incurred during motion planning. We implement the planner in a way such
that geometric constraints can be utilized efficiently, e.g., the planner can identify that it needs to move the
blocking objects away before it can manipulate the blocked objects. Unlike our framework, which guides
the search for feasible positions for object relocation using sampled future actions (Sec. 3.2.2.2), Ap2 does
not include guidance for finding feasible positions for object relocation, which can facilitate finding feasible
plans in confined settings.
Table 3.1: Comparison of the proposed method with two baseline methods in the two benchmark domains
regarding the success rate, planning time, makespan and motion cost. The numbers in the names of the
problem instances indicate the numbers of the goal objects and the movable objects besides the goal objects.
In PA5, PA7 and PA10, each problem instance has 3 goal objects and 2 robots. We omit the planning time
and solution quality results for Ap2 on PA10 and BO8 because its success rate is significantly lower than
those of the other two methods.
Problem instance Success rate % Planning time (s) Makespan Motion cost
Ap1 Ap2 Ours Ap1 Ap2 Ours Ap1 Ap2 Ours Ap1 Ap2 Ours
PA5 100.0 80.0 100.0 5.6 (±1.3) 6.1 (±2.1) 2.4 (±0.2) 3.0 (±0.2) 2.9 (±0.2) 2.8 (±0.2) 3.8 (±0.2) 3.6 (±0.2) 3.6 (±0.2)
PA7 80.0 70.0 100.0 39.8 (±12.8) 10.5 (±2.9) 4.0 (±0.9) 3.7 (±0.3) 3.0 (±0.3) 3.1 (±0.2) 4.8 (±0.3) 4.3 (±0.2) 4.1 (±0.2)
PA10 55.0 40.0 90.0 129.2 (±58.2) N/A 19.6 (±6.1) 4.6 (±0.6) N/A 4.2 (±0.3) 5.6 (±0.6) N/A 5.2 (±0.4)
BO8 85.0 35.0 100.0 246.5 (±54.2) N/A 182.2 (±48.3) 4.8 (±0.2) N/A 3.4 (±0.3) 7.6 (±0.1) N/A 5.0 (±0.6)
3.3.2 Benchmark domains
We evaluate the efficiency and effectiveness of our method and the two baselines in the packaging domain
shown in Fig. 3.1 (left) and the box-moving domain shown in Fig. 3.1 (right).
Packaging (PA): In this domain, each problem instance includes 2 to 6 robots, 3 to 5 goal objects, 2 to 13
movable objects besides the goal objects, 1 start region and 3 goal regions. As in (Kim et al., 2022b), we
omit motion planning and simply check for collisions at the picking and placing configurations computed
by inverse kinematics solvers in this domain, because collisions in this domain mainly constrain the space
of feasible picking and placing configurations. We use Kinova Gen2 lightweight robotic arms. For each
benchmark problem instance, we conduct 20 trials with a timeout of 1,200 seconds. For all methods, we also count a trial as failed if all possible task plans have been tried.
Box-moving (BO): In this domain, we evaluate our framework on mobile manipulation tasks where the
robots have to move target objects from one room to the other room (Fig. 3.1 (right)). We use simulated
PR2 robots. In this domain, each problem instance includes 2 robots, 2 goal objects, 6 movable objects
besides the goal objects, 1 start region and 1 goal region. For simplification, we do not consider handover
actions. For each benchmark problem instance, we conduct 20 trials with a timeout of 1,200 seconds. For all methods, we also count a trial as failed if all possible task plans have been tried.
We use bidirectional rapidly-exploring random trees (LaValle, 2006) for motion planning and IKFast (Diankov, 2010) for inverse kinematics solving. All methods share the same grasp sets, the same sets of
single-robot actions and the same sets of collaboration actions. All experiments were run on an AMD
Ryzen Threadripper PRO 3995WX processor with 64 GB of memory.
3.3.3 Results
We refer to the number of time steps as makespan and the number of moved objects as motion cost.
Planning time and success rate. Table 3.1 shows that our method outperforms both baseline methods
on all problem instances with different numbers of goal objects and movable objects with respect to both
the planning times and success rates. Ap1 and our method achieved higher success rates on all problem instances than Ap2 because the reverse search strategy (Sec. 3.2.2.2) utilized in Ap1 and our method can find
feasible object placements much more efficiently than the forward search strategy used in Ap2. Moreover,
Ap2 can generate task plans that include irrelevant objects, while Ap1 and our method focus on manipulating the important objects, such as goal objects and the obstacles that block necessary manipulations. Our method
achieved higher success rates with shorter planning times than Ap1 on the difficult problem instances PA7,
PA10 and BO8 because our method first generates promising task skeletons (Sec. 3.2.2.1) that use the information about the collaborative manipulation capabilities of the individual robots to prune the task plan
search space, which can be extremely large when there are many objects and multiple robots (Pan et al.,
2021). The main cause of failure of our method is running out of task skeletons, which can be addressed by
incrementally adding more task skeletons during the search process.
Solution quality. Table 3.1 shows that our method can generate effective task-and-motion plans with respect to the motion cost and the makespan. Our method first generates task skeletons with short makespans by incrementally increasing the time step limit and with low motion costs by incorporating the
motion cost into the objective function of the MIP formulation (Sec. 3.2.2.1). On the other hand, our MCTS
exploration strategy motivates the planner to search for effective plans with small numbers of moved objects. It should be noted that, although Ap2 generated plans with shorter makespans for PA7, it has lower
success rates and longer planning times than our method. Also, Ap1 generated plans that move significantly more objects for PA7, PA10 and BO8 than our method because it uses a depth-first search strategy
for finding feasible plans (Stilman et al., 2007a).
Table 3.2: The results of the proposed method in domain PA regarding the success rate, planning time,
makespan and motion cost. The numbers in the names of the problem instances indicate the numbers of
the robots.
Problem instance Success rate % Planning time (s) Makespan Motion cost
2 robots 60.0 148.4 (±36.8) 6.1 (±0.4) 8.9 (±0.4)
3 robots 80.0 99.0 (±48.6) 4.9 (±0.3) 8.2 (±0.5)
4 robots 85.0 109.1 (±33.6) 4.7 (±0.3) 8.2 (±0.4)
5 robots 75.0 207.0 (±48.7) 4.1 (± 0.2) 8.0 (± 0.3)
6 robots 70.0 362.7 (± 64.6) 3.4 (± 0.2) 7.7 (± 0.4)
Scalability evaluation. We evaluated the scalability of our method in the PA domain with 18 movable
objects, including 5 goal objects and 2 to 6 robots. Table 3.2 shows that our method can solve these large
problem instances. For problem instances with 3 or more robots, it achieved higher success rates compared to the problem instances with 2 robots. Moreover, our method can achieve shorter makespans and
lower motion costs when more robots are involved. These results show that our method can effectively
utilize multiple robots to address challenging planning problem instances and generate intelligent collaboration strategies for multiple robots.
However, in our experiment, the success rates for problem instances with 5 and 6 robots are lower than
the success rates for problem instances with 3 and 4 robots. The required planning time also increases when
more robots are added, starting from the problem instances with 3 robots. This is because adding more
robots into the system will lead to more cluttered environments and more difficult collision avoidance
between robots. In future work, we will explore mitigating this issue by carefully designing the layout of the robots (Tay and Ngoi, 1996).
3.4 Related work
There has been much work on solving general TAMP problems efficiently. TAMP problems are challenging because they require search in a large hybrid space that consists of task-level search and motion-level
search. Different approaches for TAMP problems focus on different strategies to combine task-level search
and motion-level search. In (Lagriffoul et al., 2014; Bidot et al., 2017), efficient geometric backtracking algorithms are proposed to systematically consider all the combinations of geometric instances of a given
symbolic task plan such that the symbolic task plan can be efficiently rejected if there is no way to instantiate it geometrically. In (Dantam et al., 2018), task-level search is modeled as a constraint satisfaction
problem, and failures on motion-level search are efficiently encoded as new constraints to inform task-level search. In (Srivastava et al., 2014), an extensible planner-independent interface layer is proposed to
combine task and motion planning. In (Garrett, Lozano-Pérez, and Kaelbling, 2020), motion-level facts are
encoded in task-level planning and modern task planners (Hoffmann, 2001) are leveraged to efficiently
search for task-and-motion plans. Recently, more and more work has been focused on utilizing learning
to guide TAMP by ranking task plans (Khodeir, Agro, and Shkurti, 2023), predicting feasibility of task
plans (Yang, Garrett, and Fox, 2022), and ranking object importance in problem instances (Silver et al.,
2020). More comprehensive surveys on TAMP can be found in (Garrett et al., 2021; Mansouri, Pecora, and
Schüller, 2021).
In this work, we focus on GTAMP, which is an important subclass of TAMP. The goal of GTAMP is to move several objects to regions in the presence of other movable objects.
There has been much work on solving single-robot GTAMP (SR-GTAMP) problems efficiently (Kim
et al., 2022b; Kim, Kaelbling, and Lozano-Pérez, 2019; Kim and Shimanuki, 2020) by utilizing learning to
guide planning. However, these approaches cannot be directly applied to multi-robot domains. Several
problem types in the literature can also be seen as versions of the GTAMP problem. In (Stilman et al.,
2007a), the “manipulation among movable obstacles" (MAMO) problem is addressed, in which a robot has
to move objects out of the way to move a specified object to its goal location. Although this approach can be
extended to multi-robot settings straightforwardly, it would require searching through a large space of all
possible combinations of multi-agent actions. Moreover, the focus of this approach is on feasibility of the
task-and-motion plans, rather than on the plan length and number of objects moved. In (Hun Cheong et al.,
2020) and (Nam et al., 2020; Danielczuk et al., 2019), the object retrieval problem is addressed, in which a
robot has to retrieve a target object from clutter by relocating the surrounding objects. In (King, Cognetti,
and Srinivasa, 2016; Krontiris and Bekris, 2016), the rearrangement planning problem is addressed, in
which a robot has to move objects into given goal configurations. However, these methods do not plan
collaboration strategies in multi-robot domains.
There has been work on solving general TAMP with several robots efficiently (Pan et al., 2021; Toussaint and Lopes, 2017a; Mansouri, Pecora, and Schüller, 2021). We focus on a subclass of these problems,
where a robot has to move objects in the presence of movable obstacles. In (Pan et al., 2021), a novel
task scheduling layer, positioned between task planning and motion planning, is proposed to prune task
planning search space. However, since this approach does not focus on geometric aspects of the TAMP
problem, it does not include guidance for finding continuous parameters such as feasible positions for
object relocation. In (Rodríguez and Suárez, 2016; Ahn, Kim, and Nam, 2021), efficient approaches are proposed for the multi-robot object retrieval problem, assuming permanent object removal and considering
one target object at a time, while our planner relocates the obstacles within the workspace and considers several target objects at the same time. Multi-robot rearrangement planning problems (Shome et al.,
2021; Hartmann et al., 2021; Chen et al., 2022) are also closely related to MR-GTAMP. However, the rearrangement planning problems assume that the goal configurations of all the movable objects are given,
while MR-GTAMP requires the planners to decide which objects to move and to which positions. There is
also work that focuses on task allocation and scheduling for multiple robots, assuming that a sequence of
discrete actions to be executed is given (Behrens, Stepanova, and Babuska, 2020). However, MR-GTAMP
requires the planners to decide which discrete actions to execute, e.g., which objects to move.
There has been work on optimization-based TAMP, where TAMP problems are modeled as mixed-integer non-linear programs (Toussaint, 2015a; Toussaint and Lopes, 2017b), mixed-integer linear programs (Kogo, Takaya, and Oyama, 2021) and continuous nonlinear programs (Takano, Oyama, and Yamakita, 2021). However, these frameworks do not focus on scenarios where obstacle avoidance is the
major challenge and objects can be moved to enable the manipulation of other objects.
Chapter 4
Guided Object Placement Sampling for Efficient Geometric
Task-and-Motion Planning
In Chapter 3, we presented a framework for efficient and effective MR-GTAMP. Can we deploy the proposed framework in collaborative manipulation tasks that involve humans? Besides the challenges of
predicting human intentions and motions, the previously proposed GTAMP algorithm is also not efficient
enough for replanning on the fly when the robot shares the same workspace with humans. One significant limitation affecting the efficiency of the previously proposed GTAMP algorithm is the sampling of continuous parameters, particularly the sampling of feasible object placements. In this chapter, we focus on improving the efficiency of object placement sampling for GTAMP.
We focus on GTAMP problems in monotone setups, where each object is moved at most once. It has
also been shown that a monotone planner can be integrated with a global planner to address non-monotone
planning problems (Wang, Miao, and Bekris, 2022). Therefore, any efficiency improvement for monotone GTAMP problems will potentially benefit general GTAMP more broadly. On the other hand, we are particularly interested in addressing GTAMP problems with pick-and-place actions because of their practical relevance and the importance of precise manipulation in human environments.
To efficiently address monotone GTAMP problems, the state-of-the-art planners (Stilman et al., 2007b;
Zhang et al., 2023) often deploy a reverse-time search strategy (Sec. 4.1.2), whose insight is to sample
actions for manipulating the goal objects first in order to constrain the space of placement sampling for
relocating obstacles. Uniform sampling for object placements is commonly used in these planners, which is
inefficient and ineffective for planning problems with cluttered environments. First of all, it is inefficient to
Figure 4.1: Left: Bring the knife (dashed box) from the storage region (purple) to the human workspace
region (red). Right: Retrieve the green bottle from the bookshelf to the storage region (purple). In both
scenarios, other movable obstacles are only allowed to be relocated within their current regions.
[Figure 4.2 panels: (a) an object placement sampling problem instance (movable objects M1-M5, fixed objects F1-F2, the object to relocate, the object moved in the future action, and the robot configurations planned in the future action); (b) the new placement should be reachable by the robot; (c) it should not cause collisions with the fixed objects; (d) it should not cause robot-object collisions; (e) it should not cause collisions with the planned future actions; (f) a placement that satisfies all the feasibility requirements.]
Figure 4.2: Visualization of an object placement sampling problem instance (a), the sets of failed sampled object placements (b-e) according to different feasibility requirements (Sec. 4.1.3) for object placement sampling in monotone GTAMP, and an object placement that satisfies all the feasibility requirements (f). For efficient monotone GTAMP, we also need to consider an effectiveness requirement, which will be described in Sec. 4.1.3 and Sec. 4.2.3.
find feasible object placements (Fig. 4.2), such as the object placements that are collision-free with respect
to the fixed obstacles. Moreover, uniform sampling may generate ineffective object placements that reduce
the overall efficiency of GTAMP. For example, poorly sampled object placements can rapidly deplete the available space for placing other objects and make further object placement sampling difficult.
Guiding object placement sampling for GTAMP is challenging. Although learning-based approaches have shown promising results on guiding continuous-parameter search for TAMP (Kim et al., 2022c; Chitnis et al., 2016; Yang et al., 2023; Ait Bouhsain, Alami, and Simeon, 2023), it can be challenging to learn
a monolithic end-to-end object placement sampler that takes into account all the feasibility requirements
and effectiveness requirements as discussed before (Sec. 4.1.3). Moreover, an effective object placement
sampler should be generalizable to problem instances that contain various shapes and numbers of objects.
Therefore, learning a monolithic end-to-end object placement sampler would require a huge amount of training data, which is expensive to collect in the first place.
To guide object placement sampling for addressing monotone GTAMP problems efficiently and effectively, our key insight is that instead of learning a monolithic end-to-end object placement sampler, we can
efficiently compute object placement cost maps for individual feasibility and effectiveness requirements, and compose them for overall sampling. We propose to learn a neural network-based collision
checker, which has been shown to be robust and generalizable (Yamada et al., 2023; Chase Kew et al., 2020;
Bhardwaj et al., 2022; Danielczuk et al., 2021), and combine it with a set of human-crafted convolution
kernels to compute a set of object placement cost maps according to the feasibility and effectiveness requirements (Sec. 4.1.3) for object placement sampling in monotone GTAMP.
Given an environment consisting of a set of regions and a set of objects of different shapes, our framework,
named GOPS, first performs a pre-computing process (Sec. 4.2.1). During the pre-computing process, we
discretize the configuration space of placing objects to each region into a discrete set of candidate object
placements. We then compute the corresponding robot configuration for each candidate object placement.
We also pre-compute the compactness kernels which will be used to evaluate how much available space for
placing other objects would be depleted if an object placement was selected. It should be noted that, for
different objects of the same shape, the pre-computing only needs to be done once, and the pre-computed
results can be reused in various GTAMP problem instances with different object arrangements. Given an
object placement sampling problem instance (Sec. 4.1.3), GOPS computes object placement cost maps for
individual feasibility and effectiveness requirements (Sec. 4.2.3) and composes them for overall sampling.
We compare GOPS with a standard uniform sampler, and an end-to-end learning-based sampler in two
challenging GTAMP domains (Fig. 4.1). We show that the GTAMP planner, when equipped with GOPS,
can address GTAMP problem instances that are challenging for the baseline methods. We also show that it
outperforms the two baselines with respect to planning time and resulting plan length. Additionally,
we evaluate GOPS on object shapes that are not included in the training dataset of the neural network-based collision checker, demonstrating its generalizability to various object shapes (Sec. 4.3).
4.1 Preliminaries
4.1.1 Monotone GTAMP formulation
In a GTAMP problem, we have a robot manipulator R, a set of n_F fixed rigid objects F = {F_i}_{i=1}^{n_F}, a set
of n_M movable rigid objects M = {M_i}_{i=1}^{n_M}, and a set of n_Re regions Re = {Re_i}_{i=1}^{n_Re}. We assume that
all objects and regions have known and fixed shapes. We assume a fixed set of grasps Gr_S for each object
shape S ∈ S, similar to prior work (Yang et al., 2023; Zhang et al., 2023).
Each object has a configuration, which includes its position and orientation. The robot has a configuration r defined in its joint space. We are given the initial configurations of the robot, objects, and
regions, and a goal specification G in the form of a statement InRegion(M, Re), where M ∈ M
and Re ∈ Re, which is true iff object M is contained entirely in region Re.
We assume each region Re_i ∈ Re is a rectangular surface such as a tabletop, defined by its opposite
bounding corners ((x_min^{Re_i}, y_min^{Re_i}), (x_max^{Re_i}, y_max^{Re_i})). We limit the space of relocating any object to SE(2).
Therefore, the relocated placement of any object on region Re_i can be defined by (x, y, θ), where
x ∈ [x_min^{Re_i}, x_max^{Re_i}], y ∈ [y_min^{Re_i}, y_max^{Re_i}], and θ ∈ (0, 2π].
We focus on pick-and-place actions. We define a grounded pick-and-place action a as a tuple
⟨M, Re, g, P_M, ξ⟩, where M represents the object to move; Re represents the target region for M; g
represents the grasp used by the robot; P_M represents the target configuration for M; and ξ represents the
trajectory the robot follows, specified as a sequence of robot configurations. Each grounded action
maps the configuration of the object M to a new configuration, while the unaffected objects remain at their
old configurations.
We want to find a task-and-motion plan, i.e., a sequence of collision-free grounded pick-and-place
actions ♢ to change the configurations of the objects to satisfy G.
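For concreteness, the entities in this formulation can be captured with a few lightweight data structures. The following Python sketch is purely illustrative; the field names and types are assumptions rather than the planner's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

Pose = Tuple[float, float, float]           # (x, y, theta): SE(2) placement on a region
Config = Tuple[float, ...]                  # robot joint configuration

@dataclass
class Region:
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class PickPlaceAction:
    """Grounded pick-and-place action <M, Re, g, P_M, xi>."""
    obj: str                  # object M to move
    region: str               # target region Re
    grasp: int                # index into the fixed grasp set of M's shape
    placement: Pose           # target configuration P_M for M
    trajectory: List[Config]  # robot trajectory xi

def in_region(footprint: List[Tuple[float, float]], re: Region) -> bool:
    """Goal predicate InRegion(M, Re): true iff M's footprint lies entirely inside Re."""
    return all(re.x_min <= x <= re.x_max and re.y_min <= y <= re.y_max
               for (x, y) in footprint)
```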
4.1.2 Reverse-time search strategy for monotone GTAMP
The reverse-time search strategy (RTSS) (Stilman et al., 2007b) is a practical algorithm for addressing monotone GTAMP. The first call to RTSS attempts to move the goal object M* to its target region. Ignoring all
the other movable objects, RTSS samples a collision-free grounded pick-and-place action for moving the
object M* (Alg. 3, lines 4-11). The algorithm then stores the sampled action (Alg. 3, lines 14-16), identifies the
Algorithm 3 RTSS(M_c, Re, M_PAST, M_FUT, V, ♢)
1: given: the set of movable objects M; the initial configurations of the robot and all objects, denoted as C_0; the fixed objects F.
2: result: an executable task-and-motion plan or a failure flag.
3: notation: We denote the trajectory concatenation operation as ⊕.
4: (g, ξ_pick) ← PlanPick(M_c, M_FUT, C_0, F)  # g is the sampled grasp.
5: if ξ_pick is NIL then return NIL
6: end if
7: P ← FindPlacements(M_c, g, Re, M_FUT, V, ♢, C_0, F)
8: if P is NIL then return NIL
9: end if
10: ξ_place ← PlanPlace(g, P, M_c, M_FUT, C_0, F)
11: if ξ_place is NIL then return NIL
12: end if
13: ξ ← ξ_pick ⊕ ξ_place
14: M_FUT.add(M_c)
15: V.add(OccupiedVolume(ξ, M_c))
16: ♢.add((M_c, Re, g, P, ξ))
17: M_PAST.add(IdentifyBlocking(ξ, P, M_c, M \ ({M_c} ∪ M_FUT), C_0))
18: if M_PAST = ∅ then return ♢
19: end if
20: for M_P ∈ M_PAST do
21:     Re ← TargetRegion(M_P)
22:     ♢′ ← RTSS(M_P, Re, M_PAST \ M_P, M_FUT, V, ♢)
23:     if ♢′ is not NIL then return ♢′
24:     end if
25: end for
return NIL
movable objects that collide with the sampled action, and marks them as blocking objects (Alg. 3, line 17).
The algorithm then attempts to relocate these blocking objects recursively (Alg. 3, line 20-25).
By searching in reverse-time, RTSS focuses on sampling for goal-relevant actions and maintaining the
long-horizon dependencies among actions.
4.1.3 Object placement sampling for monotone GTAMP
Our target is to design an efficient and effective FindPlacements procedure (Alg. 3, line 7) such that we can
efficiently find a placement pose P* for an object M_c that is feasible and effective for the overall efficiency
of GTAMP.
The parameters of the procedure FindPlacements specify the object we want to move M_c, the used
grasp pose g, the target region Re, the objects that have been planned to move M_FUT, the volume occupied
by the sampled future actions V, the initial configurations of the robot and all objects C_0, the sequence of
grounded future actions ♢, and the fixed objects F. The initial and relocated poses of the objects M_FUT
are recorded in C_0 and ♢. We denote the volume occupied by the planned future robot configurations as
V_R.
We want to find object placements that satisfy the feasibility requirements and an effectiveness requirement as follows.
Feasibility Requirements:
(i) Reachability requirement. A feasible object placement should be reachable by the robot with the
grasp pose g.
Algorithm 4 Construct Compactness Kernel (S, S′, M_S, M_{S′})
1: given: a region Re; N_Re discretized rotations denoted as {rot_i}_{i=1}^{N_Re}; the scale factors used to discretize the object placement space for the y axis and x axis, denoted as λ_y and λ_x, respectively.
2: input: an object shape S and an object shape S′; an object M_S with shape S; an object M_{S′} with shape S′.
3: result: a compactness kernel denoted as K_com.
4: S.diag ← the extent of the bounding box of shape S along the diagonal
5: S′.diag ← the extent of the bounding box of shape S′ along the diagonal
6: K_com ← a zero matrix with dimensions N_Re by N_Re by int((S.diag + S′.diag) × λ_y) by int((S.diag + S′.diag) × λ_x)
7: for i ← 1 to N_Re do
8:     Place object M_S to pose (0, 0, rot_i) in simulation
9:     for j ← 1 to K_com.dim(2) do
10:        for k ← 1 to K_com.dim(3) do
11:            for l ← 1 to K_com.dim(4) do
12:                Place object M_{S′} to pose (l × λ_x − K_com.dim(4) × λ_x/2, k × λ_y − K_com.dim(3) × λ_y/2, rot_j) in simulation
13:                if collision_detected(M_S, M_{S′}) then
14:                    K_com[i][j][k][l] = 1
15:                end if
16:            end for
17:        end for
18:    end for
19: end for
return K_com
(ii) Object collision-free requirement. A feasible object placement should not induce collisions between
the object M_c and the fixed objects F, or collisions between the object M_c and the objects M_FUT
either at their initial poses or relocated poses.
(iii) Robot collision-free requirement. A feasible object placement should not induce collisions between
the robot and the fixed objects F, or collisions between the robot and the objects M_FUT at their
initial poses.
(iv) Avoiding-volume requirement. A feasible object placement should not induce collisions between the
object M_c and the volume V_R.
Effectiveness Requirement:
(i) Compactness requirement. An effective sampled object placement should leave as much available
space for placing other objects as possible.
4.2 GOPS: Guided object placement sampling for GTAMP
We start by describing the pre-computing process (Sec. 4.2.1). We then describe the neural network-based
collision checker (Sec. 4.2.2). We finally describe how we compute the object placement cost maps according to the aforementioned feasibility and effectiveness requirements given an object placement sampling
problem instance (Sec. 4.1.3).
4.2.1 Pre-computing process
In this section, we describe our pre-computing process with a robot R, a region Re, and a set of movable
objects M. We denote the set of distinguishing object shapes as S. The pre-computing only needs to be
done once for each object shape S ∈ S. The pre-computed results can be reused for different GTAMP
problem instances with various object arrangements.
[Figure 4.3 components: object height maps → MLP object encoder → repeated object embeddings; robot states → repeated robot states; MLP collision detector → predicted collision labels (e.g., [0, 1, 0, 1, 0, 0]).]
Figure 4.3: Visualization of the neural network architecture of our collision checker, and a forward pass
through the network with a batch of three object height maps and two robot configurations.
During the pre-computing process, we first evenly discretize the space of placing objects onto the
region Re into N_Re × H_Re × W_Re candidate placements P^Re, where N_Re denotes the number of discrete
orientations, and H_Re and W_Re denote the number of discrete cells along the y axis and x axis, respectively.
We denote the scale factors for the y axis and x axis as λ_y and λ_x, respectively.
For each object shape S ∈ S, we filter out all placements from P^Re that would cause an object with
shape S to extend beyond the boundaries of the region Re. We denote the filtered set of placements for
the shape S as P_S^Re. For each pair of shape S ∈ S and grasp g ∈ Gr_S, we compute the configurations for
the robot R to place an object with shape S at each candidate placement in P_S^Re using the grasp g.
We denote the resulting set of configurations as Ψ_{R,S,g}^Re.
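As a rough illustration of this discretization step, the sketch below enumerates candidate SE(2) placements over a rectangular region and filters out those whose rotated bounding box would leave the region. The cell size and the bounding-box approximation are simplifying assumptions, and the per-placement inverse kinematics computation is omitted:

```python
import numpy as np

def candidate_placements(x_min, x_max, y_min, y_max, n_rot, cell_size):
    """Evenly discretize SE(2) placements on a rectangular region into an
    (n_rot, H, W, 3) array of (x, y, theta) candidates."""
    xs = np.arange(x_min + cell_size / 2, x_max, cell_size)
    ys = np.arange(y_min + cell_size / 2, y_max, cell_size)
    thetas = np.linspace(0.0, 2.0 * np.pi, n_rot, endpoint=False)
    grid = np.stack(np.meshgrid(thetas, ys, xs, indexing="ij"), axis=-1)  # channels: (theta, y, x)
    return grid[..., [2, 1, 0]]                                           # reorder to (x, y, theta)

def inside_region_mask(placements, half_extents, x_min, x_max, y_min, y_max):
    """Boolean mask of placements whose rotated bounding box stays inside the region.
    half_extents = (hx, hy) of the object shape's bounding box."""
    x, y, th = placements[..., 0], placements[..., 1], placements[..., 2]
    hx, hy = half_extents
    ex = np.abs(np.cos(th)) * hx + np.abs(np.sin(th)) * hy   # rotated box extent along x
    ey = np.abs(np.sin(th)) * hx + np.abs(np.cos(th)) * hy   # rotated box extent along y
    return ((x - ex >= x_min) & (x + ex <= x_max) &
            (y - ey >= y_min) & (y + ey <= y_max))
```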
During the pre-computing process, for each pair of two different object shapes (S, S′), where S ∈ S, S′ ∈ S, and S ≠ S′, we also compute a compactness kernel (Alg. 4), denoted as K_com^{S,S′}, to efficiently evaluate
how much available space will be depleted for placing objects with the shape S′ on the region Re, if an
object placement is occupied by an object with the shape S. The compactness kernel is designed to operate on an
object placement feasibility map with dimensions N_Re by H_Re by W_Re. Each element of an object placement
feasibility map has a value of 1 if the corresponding placement satisfies all the feasibility requirements
(Sec. 4.1.3) and 0 otherwise. We describe the method to efficiently compute the object placement feasibility
map and the method to apply the computed compactness kernels in Sec. 4.2.3. Given a pair of shapes (S, S′),
where S ≠ S′, we construct the compactness kernel K_com^{S,S′} in simulation. We place an object M_S with the
shape S at N_Re different rotations (Alg. 4, lines 7-8), then iteratively place an object M_{S′} with the shape
S′ at each surrounding pose of the object M_S (Alg. 4, lines 9-12), and check the collision between the
two objects (Alg. 4, line 13). We update the values of the corresponding elements based on the collision
checking results (Alg. 4, lines 13-14).
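When objects are well approximated by their top-down footprints, the simulator-based construction in Alg. 4 can be approximated purely geometrically: the collision check between the two objects reduces to testing whether their rasterized footprints overlap, which a 2-D cross-correlation computes for all offsets at once. The sketch below follows this footprint-overlap simplification (assuming convex footprints with counter-clockwise vertex order), and is not the exact simulation-based procedure:

```python
import numpy as np
from scipy.signal import correlate2d

def rotate_footprint(base_footprint, angle, cell_size, canvas_cells):
    """Rasterize a convex polygon footprint (list of (x, y) vertices in the object frame,
    counter-clockwise order) rotated by `angle` into a square binary grid."""
    c, s = np.cos(angle), np.sin(angle)
    pts = np.array([(c * x - s * y, s * x + c * y) for x, y in base_footprint])
    half = canvas_cells * cell_size / 2.0
    ys, xs = np.mgrid[0:canvas_cells, 0:canvas_cells]
    cx = (xs + 0.5) * cell_size - half
    cy = (ys + 0.5) * cell_size - half
    inside = np.ones_like(cx, dtype=bool)
    for k in range(len(pts)):                       # point-in-convex-polygon test per cell center
        x0, y0 = pts[k]
        x1, y1 = pts[(k + 1) % len(pts)]
        inside &= (x1 - x0) * (cy - y0) - (y1 - y0) * (cx - x0) >= 0
    return inside.astype(np.uint8)

def compactness_kernel(fp_s, fp_sp, n_rot, cell_size, canvas_cells):
    """K[i, j, :, :] = 1 where shape S at rotation i and shape S' at rotation j, offset by the
    corresponding (dy, dx), overlap (a stand-in for the collision check of Alg. 4)."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_rot, endpoint=False)
    masks_s = [rotate_footprint(fp_s, t, cell_size, canvas_cells) for t in thetas]
    masks_sp = [rotate_footprint(fp_sp, t, cell_size, canvas_cells) for t in thetas]
    size = 2 * canvas_cells - 1
    K = np.zeros((n_rot, n_rot, size, size), dtype=np.uint8)
    for i in range(n_rot):
        for j in range(n_rot):
            K[i, j] = correlate2d(masks_s[i], masks_sp[j], mode="full") > 0
    return K
```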
The entire pre-computing process can be computationally expensive when the environment has many
regions and diverse object shapes, since it requires solving inverse-kinematics and checking collisions.
However, the pre-computed results can be reused for GTAMP problem instances with different object
arrangements. Moreover, the inverse kinematics solving and compactness kernel computing can be potentially implemented as neural network-based modules (Bensadoun et al., 2022; He et al., 2023) which have
fast inference speed. We leave this potential extension for future work.
4.2.2 Neural network-based collision checking
We now describe our neural network-based collision-checking model (Fig. 4.3). Leveraging modern
GPUs, the learned collision-checking model allows us to process millions of collision-checking queries in parallel. Collision checking is the major bottleneck when searching
for object placements that satisfy the robot collision-free requirement, avoiding-volume requirement, and
compactness requirement (Sec. 4.2.3).
Data collection. Given a region Re, we randomly sample placements for objects with
various shapes and sample robot configurations in simulation. We use the collision checker provided by
the simulator to check collisions between the robot and the objects to generate ground-truth labels. In this
work, we focus on two categories of object shapes, i.e., cylinders and boxes, since most object shapes can
be approximated with these two primitive shapes (Quintero-Pena et al., 2023). Our model is trained on 3.6
million collision-checking samples.
Object representation. Given an object M ∈ M and its pose on the region Re, we represent the object
with a height map. The height map has the same dimensions as the region Re. From a top-down view, the
pixels occupied by the object have the same value as the object height and the remaining pixels have a value of 0
(Fig. 4.3). A more general object representation is to use object point clouds (Danielczuk et al., 2021);
we leave this for future work.
Robot state representation. A robot state s consists of its joint configuration r, its end-effector position
e_pos, and its end-effector orientation e_ori. Following (Yamada et al., 2023), we use the first two column vectors of the rotation
matrix to represent the end-effector orientation.
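As a small illustration, the state vector can be assembled as follows (a 6-DoF arm and this particular ordering are assumptions):

```python
import numpy as np

def robot_state(joints: np.ndarray, ee_pos: np.ndarray, ee_rot: np.ndarray) -> np.ndarray:
    """Build the robot state s = [r, e_pos, first two columns of the rotation matrix].
    joints: (6,) joint configuration; ee_pos: (3,); ee_rot: (3, 3) rotation matrix."""
    return np.concatenate([joints, ee_pos, ee_rot[:, 0], ee_rot[:, 1]])  # shape (15,)
```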
Neural network architecture and loss function. Following previous works (Yamada et al., 2023; Chase
Kew et al., 2020; Danielczuk et al., 2021), our network architecture (Fig. 4.3) consists of an object encoder
and a collision detector. The object encoder is a fully connected network that takes a batch of object height
maps as input and generates object embeddings. The collision detector takes the object embeddings and
a batch of robot states as input to predict collisions. We train our collision checker using the binary cross-entropy loss with ground-truth collision labels.
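A minimal PyTorch sketch of this architecture is given below; the layer widths, the flattened height-map input, and the pairwise batching scheme are illustrative assumptions rather than the exact network we trained:

```python
import torch
import torch.nn as nn

class CollisionChecker(nn.Module):
    """Object-encoder MLP + collision-detector MLP, trained with binary cross-entropy."""
    def __init__(self, heightmap_dim: int, robot_state_dim: int, embed_dim: int = 128):
        super().__init__()
        self.object_encoder = nn.Sequential(
            nn.Linear(heightmap_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim), nn.ReLU(),
        )
        self.collision_detector = nn.Sequential(
            nn.Linear(embed_dim + robot_state_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),   # logit of P(collision)
        )

    def forward(self, heightmaps: torch.Tensor, robot_states: torch.Tensor) -> torch.Tensor:
        """heightmaps: (n_obj, heightmap_dim) flattened object height maps.
        robot_states: (n_cfg, robot_state_dim). Returns (n_obj * n_cfg,) collision logits,
        one per object/configuration pair (the 'repeated' batching of Fig. 4.3)."""
        obj = self.object_encoder(heightmaps)                        # (n_obj, embed_dim)
        obj_rep = obj.repeat_interleave(robot_states.shape[0], 0)    # repeat object embeddings
        rob_rep = robot_states.repeat(heightmaps.shape[0], 1)        # repeat robot states
        return self.collision_detector(torch.cat([obj_rep, rob_rep], dim=1)).squeeze(-1)

# Training step with simulator-generated labels (sketch):
# loss = nn.BCEWithLogitsLoss()(model(heightmaps, robot_states), labels)
```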
4.2.3 Computing object placement cost maps
In this section, we describe the methods to compute object placement cost maps which will be composed
for sampling object placements that satisfy all the feasibility and effectiveness requirements for monotone
GTAMP (Sec. 4.1.3).
Notation. Recall that, in an object placement sampling problem (Sec. 4.1.3), we have a robot R and an
object M_c with shape S_c, for which we want to find a new placement on a region Re. We have a grasp
pose g that defines the relative transformation of the robot gripper with respect to the object M_c. We have
a set of fixed objects F. We have a set of objects M_FUT that are planned to be moved. We denote the
initial placement pose and the planned relocated placement pose of each object M_FUT ∈ M_FUT as P_FUT^ini and
P_FUT^relo. We also have a set of robot configurations, denoted as V_R = {r_i}, planned for manipulating the
objects M_FUT. After the pre-computing process (Sec. 4.2.1), we have a set of N_Re × H_Re × W_Re candidate
placements obtained by evenly discretizing the continuous configuration space of placing any object onto the region
[Figure 4.4 panels: an object placement sampling problem instance; the reachability, object collision-free, robot collision-free, avoiding-volume, and compactness cost maps, composed into the object placement score map; and a sampled feasible and effective placement.]
Figure 4.4: Visualization of the computed cost maps. The rotation space of object placement is evenly
discretized into a set of 16 rotations. Left: An object placement sampling problem instance (part of
the shelf is omitted for better visualization), where the target is to retrieve the green object from the shelf.
The yellow objects are fixed obstacles and the white objects are movable objects. During the reverse-time
search, the action for moving the goal object (green) has been planned. We need to relocate the magenta
object (dashed box). The robot configurations for picking and placing the goal object (green), and the robot
configuration for picking the magenta object, are visualized. Middle: Visualization of the feasibility object
placement cost maps and the effectiveness object placement cost map. They are composed to construct an
overall object placement score map. Right: A sampled feasible and effective object placement. Instead of
selecting an object placement near the upper-right corner, the sampler prioritizes a placement that has low
compactness cost based on the computed compactness cost map.
Re. For the shape S_c, we filter out all the placements that are not within the region Re and obtain a filtered
set of placements P_{S_c}^Re. We also have a set of robot configurations, denoted as Ψ_{R,S_c,g}^Re, for placing the
object M_c with the shape S_c at each candidate placement in P_{S_c}^Re with the grasp g.
Computing reachability cost maps. It is straightforward to convert the pre-computed robot configurations Ψ_{R,S_c,g}^Re to a reachability cost map, which is a matrix with dimensions N_Re by H_Re by W_Re. We
denote the reachability cost map as C_reach. Each element of the reachability cost map holds a value of 0 if
it has a corresponding robot configuration in Ψ_{R,S_c,g}^Re, and 1 otherwise (Fig. 4.4).
Computing object collision-free cost map. We want to efficiently find a placement for the object M_c
that avoids collisions with the fixed objects F and the objects M_FUT at either their initial poses P_FUT^ini or
relocated poses P_FUT^relo. We first construct a binary image with dimensions H_Re by W_Re to represent the
obstacles in the scene. Each pixel of the binary image has a value of 1 if it is occupied by the obstacles and
a value of 0 otherwise.
We then construct N_Re square binary masks of the footprints of the object M_c corresponding to the N_Re
rotations as our convolution kernel. The convolution kernel assigns a value of 1 to object pixels and 0 to
the pixels surrounding them. It is then applied across the binary image that represents the obstacles in the
scene. We then apply an element-wise clipping to the resulting image such that all elements are capped
at a maximum value of 1. We call the resulting map an object collision-free cost map (Fig. 4.4), denoted as
C_obj, where each element is assigned a value of 0 if placing the object M_c at the corresponding placement
results in no collisions, and a value of 1 if collisions occur.
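This construction maps directly onto standard image operations. A minimal sketch using SciPy 2-D cross-correlation (with the footprint masks from the pre-computing process) is:

```python
import numpy as np
from scipy.signal import correlate2d

def object_collision_free_cost_map(obstacle_image, footprint_masks):
    """obstacle_image: (H, W) binary map of the fixed objects and the objects in M_FUT
    (at their initial and relocated poses). footprint_masks: list of N_Re binary masks,
    one per discrete rotation of the object to place. Returns C_obj of shape (N_Re, H, W)."""
    cost = np.stack([correlate2d(obstacle_image, mask, mode="same")
                     for mask in footprint_masks])
    return np.clip(cost, 0, 1)  # 1 where the placement would overlap an obstacle, 0 otherwise
```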
Computing robot collision-free cost map. We also want to efficiently find object placements whose corresponding robot configurations in Ψ_{R,S_c,g}^Re do not collide with the fixed objects F and the objects M_FUT
at their initial poses. Leveraging our collision-checking model (Fig. 4.3), we efficiently check collisions
between the robot configurations and the obstacles in parallel to construct a robot collision-free cost map
(Fig. 4.4), denoted as C_robot. In the resulting robot collision-free cost map, each element is assigned a value
of 1 if placing the object M_c at the corresponding placement results in collisions between the robot and
the obstacles, and a value of 0 if no robot-object collisions occur or the corresponding placement is not
reachable by the robot.∗
Computing avoiding-volume cost map. During the reverse-time search (Sec. 4.1.2), part of the available
space for relocating objects will be occupied by the planned future robot configurations V_R = {r_i}. Leveraging our collision-checking model, we efficiently check collisions between the robot configurations V_R
and the object M_c at each candidate placement in P_{S_c}^Re in parallel to construct an avoiding-volume cost map
(Fig. 4.4), denoted as C_avoid. In the resulting avoiding-volume cost map, each element is assigned a value
of 1 if placing the object M_c at the corresponding placement results in collisions between the object and
the planned future robot configurations, and a value of 0 if no robot-object collisions occur. For computational
efficiency, we extensively cache the results of object height map drawing and object embedding (Sec. 4.2.2).
Computing compactness cost map. During the reverse-time search (Sec. 4.1.2), an increasing amount
of space for relocating objects will be occupied by the planned future robot actions, including both the
planned robot configurations and the objects that have already been relocated. Therefore, it is important
to find object placements for the object M_c that invalidate as few feasible placements for placing other
objects as possible (Sec. 4.1.3). Recall that during the reverse-time search, we maintain a set of objects
M_PAST. These are the objects we intend to relocate after placing the object M_c (Sec. 4.1.2). We denote
the set of distinguishing shapes of these objects as S_PAST. Using the methods described earlier, we first
compute the object placement cost maps according to the different feasibility requirements for each shape S in
S_PAST. For each shape S, we then compose the computed object placement cost maps using the formula
C_feas^S = (1 − C_reach^S) × (1 − C_obj^S) × (1 − C_robot^S) × (1 − C_avoid^S) to construct an object placement feasibility
map C_feas^S. Each element of the object placement feasibility map has a value of 1 if the corresponding
placement satisfies all the feasibility requirements (Sec. 4.1.3) and 0 otherwise. Then, for each pair of
S ∈ S_PAST and S_c, we apply the pre-computed compactness kernel K_com^{S_c,S} (Sec. 4.2.1) across the object
placement feasibility map for shape S. We aggregate the resulting maps for each shape S ∈ S_PAST by
taking the element-wise average for each corresponding element. We call the resulting map the compactness
cost map (Fig. 4.4), denoted as C_com. Each element of a compactness cost map has a value reflecting how
many feasible placements would be invalidated by placing the object M_c at the corresponding placement.
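Applying a pre-computed compactness kernel then amounts to a bank of 2-D correlations between the rotation slices of a feasibility map and the corresponding kernel slices. A rough sketch for a single other shape S is shown below; the per-shape results are then averaged element-wise:

```python
import numpy as np
from scipy.signal import correlate2d

def compactness_cost_map(feas_map_s, kernel):
    """feas_map_s: (N, H, W) feasibility map for another shape S (1 = feasible).
    kernel: (N, N, h, w) compactness kernel K_com^{Sc,S}.
    Returns an (N, H, W) map counting, for each candidate placement of M_c, how many
    currently feasible placements of S it would invalidate."""
    n_rot, H, W = feas_map_s.shape
    cost = np.zeros((n_rot, H, W))
    for i in range(n_rot):          # rotation of the object being placed (shape S_c)
        for j in range(n_rot):      # rotation of the other shape S
            cost[i] += correlate2d(feas_map_s[j], kernel[i, j], mode="same")
    return cost

# The final compactness cost map averages compactness_cost_map(...) over all shapes in S_PAST.
```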
Composing object placement cost maps. To perform the overall sampling, we compose all the computed object placement cost maps into an object placement score map C, where
C = (1 − C_reach) × (1 − C_obj) × C_robot^{-1} × C_avoid^{-1} × α^{-C_com}.
We use an element-wise clip operation with hyperparameters β_robot and β_avoid to avoid division by zero.
We use a hyperparameter α to modulate the importance of the compactness requirement. We then convert
the object placement score map to a weighted sampler.
∗These placements will be filtered out with the reachability cost map.
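Putting the pieces together, the composition rule and the weighted sampler can be written in a few lines of NumPy. This is a sketch of the rule above under assumed hyperparameter values, not the exact implementation:

```python
import numpy as np

def compose_score_map(c_reach, c_obj, c_robot, c_avoid, c_com,
                      alpha=2.0, beta_robot=1e-3, beta_avoid=1e-3):
    """All inputs have shape (N, H, W). Returns the object placement score map C."""
    return ((1.0 - c_reach) * (1.0 - c_obj)
            / np.clip(c_robot, beta_robot, None)     # clip to avoid division by zero
            / np.clip(c_avoid, beta_avoid, None)
            * alpha ** (-c_com))                     # alpha modulates the compactness term

def sample_placement(score_map, placements, rng=None):
    """Draw one candidate with probability proportional to its score.
    placements: (N, H, W, 3) array of (x, y, theta) candidates; returns a pose or None."""
    rng = rng if rng is not None else np.random.default_rng()
    weights = score_map.ravel().astype(float)
    total = weights.sum()
    if total <= 0:
        return None
    idx = rng.choice(weights.size, p=weights / total)
    return placements.reshape(-1, 3)[idx]
```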
4.2.4 Integration with GTAMP planners
It is straightforward to integrate the proposed object placement sampler with any existing GTAMP planner (Stilman et al., 2007b; Zhang et al., 2023) that deploys the reverse-time search strategy (Sec. 4.1.2), by
implementing the function FindPlacements (Alg. 3, line 7) with the proposed sampler, as sketched below.
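In practice, this integration is a one-function change. The sketch below shows a drop-in FindPlacements that delegates to a GOPS sampler object; the method names on the sampler are hypothetical:

```python
def find_placements(obj, grasp, region, m_fut, volume, plan, c0, fixed, gops):
    """Drop-in FindPlacements (Alg. 3, line 7) backed by GOPS.
    `gops` is assumed to expose compose_score_map(...) and sample(...) as sketched above."""
    score_map, candidates = gops.compose_score_map(obj, grasp, region, m_fut,
                                                   volume, plan, c0, fixed)
    return gops.sample(score_map, candidates)   # a placement pose, or None on failure
```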
4.3 Evaluation
This work presents a guided object placement sampling framework named GOPS, for monotone GTAMP
planners that deploy the reverse-time search strategy (RTSS) (Sec. 4.1.2). In this section, we empirically
evaluate GOPS and two baseline samplers in two challenging tabletop manipulation domains (Sec. 4.3.1).
We use a state-of-the-art monotone GTAMP planner (Zhang et al., 2023) as our backbone planner (Sec. 4.2.4).
We then conduct an ablation study (Sec. 4.3.4) to evaluate the contributions of the feasibility cost maps
and the effectiveness cost map (Sec. 4.2.3). We also evaluate GOPS on object shapes that are unseen in the
training dataset for the neural network-based collision checker (Sec. 4.3.5).
4.3.1 Benchmark domains
We evaluate the efficiency and effectiveness of GOPS and two baselines in the tool-transfer domain
(Fig. 4.1, Left) and the object-retrieval domain (Fig. 4.1, Right).
Tool-transfer (TT): In this domain, the target is to move a tool object from the storage region to a given
goal pose inside the workspace region in the presence of other movable obstacles. The planner needs to
efficiently find placements to relocate movable obstacles. Each problem instance includes 1 fixed robot
arm, 1 tool object on the storage region, and 5 − 7 bowls on the workspace region.
Object-retrieval (OR): In this domain, the target is to move a specified target object from a bookshelf to
the target region in the presence of other movable obstacles. Each problem instance includes 1 fixed robot
arm and 7 − 11 bottles in the workspace region of the bookshelf.
Problem instances are generated by randomly sampling the initial object poses. In each domain, we
generate 10 problem instances for each problem that involves a varying number of movable objects. For
each problem instance, we conduct 10 trials with a timeout of 200 seconds. As in (Kim et al., 2022c; Zhang
et al., 2023), we omit motion planning and simply check for collisions at the picking and placing configurations computed by inverse kinematics solvers in these domains. This is because collisions in these domains
mainly constrain the space of feasible picking and placing configurations. We use IKFast (Diankov, 2010)
for inverse kinematics solving. We use a Kinova Gen2 lightweight robotic arm as our robot. All experiments were run on a 13th Gen Intel Core i7-13700K processor with 128 GB of memory and an NVIDIA
GeForce RTX 4070 Ti GPU.
4.3.2 Baselines
We compare our framework (GOPS) with a standard uniform sampler (UNIFORM) and an end-to-end
learning-based sampler (END2END). We developed the END2END sampler based on a sampler proposed
[Figure 4.5: plot titled "Comparison of Object Placement Samplers"; x axis: Number of Trials (50, 150, 250); y axis: Success Rate (0.0 to 1.0); legend: END2END, UNIFORM, GOPS.]
Figure 4.5: Success rates of GOPS and two baselines for solving object placement sampling problems in 50,
150, and 250 trials.
for the forward search-based geometric task-and-motion planning (GTAMP) (Kim et al., 2022a). Following (Kim et al., 2022a), the END2END sampler maintains a set of key robot configurations collected from
the training dataset. The key robot configurations are used to represent obstacles and planned future robot
actions in the reverse-time search. The END2END sampler is a Wasserstein GAN trained with gradient
penalty (Gulrajani et al., 2017) on 9136 object placement sampling problem data points.
Table 4.1: Comparison of the proposed method with two baseline methods in the two benchmark domains
regarding the success rate, planning time, and plan length. The numbers in the names of the problems
indicate the numbers of the movable objects besides the goal object.
Success rate (%)
Algorithm   OR9    OR10   OR11   TT6    TT7
UNIFORM     94.0   87.0   76.0   99.0   93.0
END2END     92.0   76.0   80.0   N/A    N/A
GOPS        100.0  100.0  100.0  100.0  100.0

Planning time (s)
Algorithm   OR9            OR10           OR11           TT6            TT7
UNIFORM     10.61 (±2.72)  13.68 (±3.13)  15.73 (±3.75)  9.91 (±2.64)   22.81 (±3.69)
END2END     24.73 (±3.86)  25.52 (±5.32)  29.79 (±5.26)  N/A            N/A
GOPS        3.92 (±0.45)   3.60 (±0.50)   4.33 (±0.47)   2.61 (±0.57)   2.61 (±0.51)

Plan length
Algorithm   OR9           OR10          OR11          TT6           TT7
UNIFORM     5.12 (±0.18)  4.98 (±0.20)  4.43 (±0.11)  3.41 (±0.08)  3.85 (±0.09)
END2END     4.75 (±0.16)  4.38 (±0.17)  4.46 (±0.13)  N/A           N/A
GOPS        4.71 (±0.14)  4.84 (±0.19)  4.39 (±0.11)  3.5 (±0.10)   3.61 (±0.06)
4.3.3 Evaluation on problems with seen object shapes
We first evaluated GOPS and two baseline methods on GTAMP problems with object shapes that are included in the training datasets for the neural network-based collision checker (Sec. 4.2.2) and the END2END
sampler.
Evaluation on individual object placement sampling problem instances. We evaluated GOPS and
the baseline methods on 1160 individual object placement sampling problem instances collected by solving GTAMP problems in domain OR (Sec. 4.3.1) with 10 movable objects and 1 goal object. For these object
placement sampling problem instances, a sampled object placement is counted as a success if it satisfies
all the feasibility requirements (Sec. 4.1.3). We count the success rates of different baselines across different trial limits. The results (Fig. 4.5) show that GOPS significantly outperforms the baseline methods,
achieving a nearly 100% success rate in at most 50 trials.
Evaluation on GTAMP problem instances. We then evaluated GOPS and the baselines on GTAMP
problems in domain OR with 9−11 movable objects and 1 goal object, and domain TT with 6−7 movable
objects and 1 goal object. The results (Table 4.1) show that the same GTAMP backbone planner (Zhang et
al., 2023), when equipped with GOPS, achieves higher success rates and shorter planning times compared
to when it is equipped with other baseline samplers. This is because GOPS finds feasible and effective object
placements more efficiently than the baseline samplers. Finding feasible object placements efficiently and
effectively is crucial because, in sampling-based GTAMP, if the planner cannot ground a task skeleton by
finding feasible object placements within a given trial limit, it will attempt to find another task skeleton
to ground, potentially resulting in significant computation time. Finding effective object placements is
crucial because challenging GTAMP problems usually require multiple pick-and-place actions to solve.
Depleting too much available space for relocating obstacles will make the overall planning problem harder.
GOPS also achieves shorter plan lengths than UNIFORM on the problems OR9−11 and TT7, because it can
successfully refine challenging task skeletons by solving challenging object placement sampling problems
in GTAMP. Although END2END performs better than UNIFORM on individual object placement sampling
problem instances (Fig. 4.5), both approaches exhibit comparable success rates when addressing GTAMP
problem instances. This is because, given a sufficient number of trials, UNIFORM has the potential to match
the object placement sampling performance of END2END. On the other hand, END2END needs to perform
expensive collision checks between the set of key configurations (Sec. 4.3.2) and obstacles to accurately
represent the environment, which results in fewer opportunities to refine various task skeletons.
Table 4.2: Comparison of the samplers that use different object placement cost maps in the two benchmark
domains regarding the success rate, planning time, and plan length. The numbers in the names of the
problems indicate the numbers of the movable objects besides the goal object. GOPS-FEAS uses all the
feasibility cost maps; GOPS-ALL uses all the feasibility cost maps and the effectiveness cost map.

Success rate (%)
Algorithm   OR10   OR11   TT7
UNIFORM     87.0   76.0   93.0
GOPS-FEAS   100.0  99.0   100.0
GOPS-ALL    100.0  100.0  100.0

Planning time (s)
Algorithm   OR10           OR11           TT7
UNIFORM     13.68 (±3.13)  15.73 (±3.75)  22.81 (±3.69)
GOPS-FEAS   5.72 (±0.85)   4.80 (±0.57)   4.15 (±0.62)
GOPS-ALL    3.60 (±0.50)   4.33 (±0.47)   2.61 (±0.51)

Plan length
Algorithm   OR10          OR11          TT7
UNIFORM     4.98 (±0.20)  4.43 (±0.11)  3.85 (±0.09)
GOPS-FEAS   4.99 (±0.21)  4.54 (±0.11)  3.65 (±0.07)
GOPS-ALL    4.84 (±0.19)  4.39 (±0.11)  3.61 (±0.06)
4.3.4 Ablation studies
We conduct ablation studies to evaluate the contribution of different object placement cost maps to the
effectiveness and efficiency of our framework. The results (Table 4.2) show that simply incorporating the
feasibility object placement cost maps can already significantly increase the effectiveness and efficiency of
GTAMP. Moreover, we observe that by adding the compactness cost map, we can achieve higher success
rates, shorter planning times, and shorter plan lengths in both benchmarking domains.
4.3.5 Evaluation on problems with unseen object shapes
We conduct additional experiments to study the generalizability of the proposed approach to various object shapes. We compare GOPS with UNIFORM on GTAMP problems in domains OR and TT with movable
objects whose shapes are unseen in the dataset for training the collision checking model (Sec. 4.2.2). The
Table 4.3: Comparison of GOPS with UNIFORM on benchmarking problems with object shapes that are
unseen in the training dataset for the neural network-based collision-checking model.

           Success rate (%)    Planning time (s)               Plan length
Problem    UNIFORM   GOPS      UNIFORM         GOPS            UNIFORM        GOPS
ORV7       95.0      100.0     12.58 (±2.52)   3.58 (±0.27)    4.01 (±0.13)   3.93 (±0.12)
TTV5       100.0     100.0     5.32 (±1.10)    1.92 (±0.10)    3.46 (±0.08)   3.15 (±0.05)
results (Table 4.3) show that GOPS still significantly outperforms UNIFORM on problems with unseen
object shapes, demonstrating its generalizability to various object shapes.
4.4 Discussion
In this chapter, we presented a framework named GOPS, to guide object placement sampling for efficiently
addressing monotone GTAMP problems. Instead of learning a monolithic end-to-end object placement
sampling model, we propose to leverage a neural network-based collision checker and a set of human-crafted convolution kernels to efficiently compute object placement cost maps for individual feasibility
requirements and effectiveness requirements, and compose them for overall sampling. We showed that
our framework outperforms the standard uniform sampling approach and an end-to-end learning-based
approach in two challenging GTAMP domains.
Limitations. Our work is limited in many ways. The pre-computing process (Sec. 4.2.1) of our framework
requires expensive IK solving and collision checking. To reduce these computational demands, we plan to
integrate learning modules (He et al., 2023; Bensadoun et al., 2022) in future work. We also plan to account
for more complex object placement constraints, such as the constraints on the relative transformations
between different objects (Yang et al., 2023). We have also assumed full observability of the scene and
therefore cannot handle uncertainties and noise in robot perception (Muguira-Iturralde et al., 2022). We plan
to account for sensing limitations in the future (Nikolaidis, Dragan, and Srinivasa, 2016).
4.5 Related work
4.5.1 Object placement sampling in GTAMP
GTAMP can be viewed as a subclass of task-and-motion planning (TAMP). The object placement sampling
problem is a crucial subproblem of TAMP (Kim et al., 2022c; Ahn et al., 2021; Stilman et al., 2007b). Previously, uniform sampling within the target region (Stilman et al., 2007b) has been the standard approach. However,
this approach can be ineffective for problems where environments are cluttered and frequent replanning
is required. There have been works on efficiently searching for continuous parameters, including object
placements, for TAMP (Quintero-Pena et al., 2023; Lozano-Pérez and Kaelbling, 2014; Toussaint, 2015b;
Ahn et al., 2021; Nakhimovich, Miao, and Bekris, 2023). If a task plan is given, searching for continuous
parameters can be effectively formulated as a constraint satisfaction problem (Lozano-Pérez and Kaelbling,
2014), a non-convex mixed-integer quadratically constrained program (Quintero-Pena et al., 2023), or a nonlinear optimization problem (Toussaint, 2015b). Our work does not assume a given task plan, and focuses on
searching for effective object placements that also benefit task-level planning.
There have been works on learning to efficiently find continuous parameters, including object placements, for TAMP (Kim et al., 2022c; Yang et al., 2023; Chitnis et al., 2016; Driess et al., 2021). In (Kim et al.,
2022c), an actor-critic algorithm is proposed to learn a monolithic end-to-end neural sampler for sampling
continuous parameters. In this work, we instead propose to compute object placement cost maps for individual
feasibility and effectiveness requirements and compose them for overall sampling. In (Yang
et al., 2023), a framework is proposed that composes a set of diffusion models, each learned for a different constraint,
for overall continuous-parameter sampling. Besides satisfying constraints, i.e., feasibility requirements
(Sec. 4.1.3), our work also attempts to sample object placements that are effective for the overall GTAMP.
4.5.2 Learning-guided collision avoidance
Efficient collision checking has been widely studied in sampling-based motion planning and rearrangement
planning literature (Bhardwaj et al., 2022; Danielczuk et al., 2021; Yamada et al., 2023). In (Chase Kew et al.,
2020), an approach is proposed to leverage the efficiency of neural networks at processing large collision
checking batches to speed up motion planning. Following this line of work, we propose to use a neural
network-based collision checker to improve the efficiency and effectiveness of object placement sampling
for GTAMP. Instead of focusing on motion planning and single-step picking-and-placing tasks, we focus
on complex multi-object manipulation tasks.
Chapter 5
Multi-Robot Planning for Automated Roof Bolting in Underground
Mining
Roof bolting is an essential operation within the underground mining cycle, as it aims to provide support
to the exposed roof and ribs of the new excavation (Peng and Tang, 1984; Mark, 2002) (Fig. 5.1). The roof
bolting operation follows immediately after the extraction task and reinforces the roof to provide a safe
working environment. Roof bolting is utilized in almost all coal mining operations around the world.
Figure 5.1: A human operator is installing a bolt into the roof bolter.∗
The roof bolt binds the unstable roof together, preventing movement in a rock mass. There are several
types of bolt installation techniques, depending on the mechanics of the bolt and the rock. In this thesis, we
focus on a technique where installation of the bolts is done by drilling a hole in the roof, inserting the resin
and inserting the bolt. The roof bolting operation is a labor-intensive task that requires the operators of
the machine to install and replace detachable drill steels and cutting bits, and to hold and position resin
cartridges and 1.2 to 3 meter (4 to 10 foot) long bolts in a pattern that can be half a meter square. During
the roof bolt installation process, the operators are at risk from working in proximity to potentially
unsupported roof, loose bolts, hydraulic-powered equipment, gas, and heavy tools in awkward conditions.
Apart from these safety risks, the operators are also exposed to dust and to noise from the drilling
and bolting processes, which can be traced to the several pumps of the roof bolter machinery (Jiang and
Luo, 2021). The operation of these machines requires constant attention to these risks, which, combined with fatigue,
leads to accidents and injuries, including severe injuries and fatalities. Therefore, more and more research
∗ https://bit.ly/3tfYOMY
efforts have been put into developing robot systems that are capable of carrying out the sequence of roof-bolting operations to achieve a high-impact health and safety intervention for roof-bolter operators (Van
Duin et al., 2013; Schafrik, Kolapo, and Agioutantis, 2022).
Bolting machines have been automated before, but these modifications are not popular with the
community because the resulting autonomous machines are highly restricted in their usage. They are set up for a single-purpose drilling and bolting operation, whereas in most mining and civil construction, flexible installation
is desired.
In this thesis, we first propose a novel design and development of an automated roof bolting machine
(Sec. 5.1) that performs the entire sequence of roof bolting operations in underground coal mining environments. The design of the system was previously published in one of our journal papers (Kolapo et al.,
2024). We then propose to formulate roof-bolting operations (Sec. 5.2) as Multi-Robot Geometric Task-and-Motion Planning (MR-GTAMP) problems. We show that we can generate intelligent multi-robot behaviors
by solving MR-GTAMP problems effectively.
5.1 Design of the automated roof bolting machine
The automated roof bolting machine has to perform roof bolting operations of drilling, drill steel removal,
resin placement, and bolt installation without human intervention. As a result, the operator is assigned a
new role of supervising one or more automated components and also controlling the whole automation
process via the human-machine interface (HMI).
A six-axis robotic arm (Sec. 5.1.1) was integrated into the system to mimic human tasks during the roof
bolting operation to enhance operator safety and improve productivity. During the automation process,
the robot performed tasks such as grasping, moving, lifting and positioning of drill steels, insertion of simulated resin membrane, and bolt installation. A reliable communication system was established between
the robot arm and other components. EtherNet/IP protocol is used to pass messages between components
of the automated roof bolting machine through a CAN bus device.
We have also developed several other novel components (Sec. 5.1.2) for the automated roof bolter, such
as the plate feeder, the bolt feeder, and the wrench. These components were built to support automation
and minimize human intervention during roof bolting operations. These components were linked to the
PLC and controlled by the HMI touchpad.
In the following subsections, we provide more details about the design of the system.
5.1.1 Integration of ABB IRB 1600 robotic arm
The robotic arm was integrated into the automated roof bolter to imitate human tasks. Introducing a
robotic arm not only imitates human tasks but also improves performance and productivity. One of
the advantages of the robotic arm over humans in the bolting process is the ability to adapt and work
under high ambient temperatures, which degrade human performance. The automation of the roof bolting
machine, which comprises a complete cycle of drill steel positioning, drilling, bolt orientation and placement,
resin placement, and bolt securing, was done using an anthropomorphic robotic arm. The robotic arm is
Figure 5.2: ABB IRB 1600 Robotic Arm next to a roof bolter unit.
envisaged to perform human operator tasks in the roof bolting process. That is, the robotic arm will mimic
the functions performed by humans during the roof bolter operation, which would remove operators from
hazardous work environments and assign them a new role to monitor the roof bolting process from a safe
location.
For this project, an ABB IRB 1600 robot (Fig. 5.2) was deployed in the automation process of the roof
bolter machine. This robot comprises two main parts, the manipulator and the controller. The body of the
robotic arm consists of links, joints, and other structures, with a net weight of 250 kg (551 lbs). The robot
arm can imitate the human operator during bolting operations by performing the operator's tasks, such as
grabbing drill steels for drilling, removing drill steels after drilling, installing pumpable resin, and placing
bolts for installation.
5.1.2 Components of the automated roof bolting machine
Several novel components were developed to minimize or eliminate human intervention during the roof
bolting operations. The researchers ensured minimal usage of sensors in the automation process of the roof
bolter to reduce the risk of equipment failure, which is regarded as one of the critical considerations
in underground coal mines. The developed components include the plate feeder, the bolt feeder, and the
wrench system.
The Plate Feeder. The plate feeder (Fig. 5.3) is designed to perform the primary function of feeding the
plate onto the bolt. In a manual roof bolting operation, operators use their hands to fix the square plate on
the bolt. This is an integral step of the roof bolting procedure and has been one of the most challenging
steps for the operator. The plate feeder contains a linear actuator that initiates the movement by pushing the
square plate along a 1.25-inch PVC shaft to drop the plate onto a one-inch shaft.
Figure 5.3: The plate feeder.
A pneumatic SCHUNK gripper was introduced to align the bolt with the one-inch PVC pipe. During the
roof bolting process, the robot places the bolt in between the gripper fingers so that the bolt is aligned with
the plate feeder's PVC pipe for a smooth fixing of the bolt's plate. The pneumatic controller, connected to
the PLC, powers the gripper to close and open. This makes it possible for the system to be controlled from
the iQAN touchpad. The main function of the actuator on the plate feeder is to push the plates until one of
them drops onto the sensor installed on the one-inch pipe; the sensor will, in turn, energize the 180 N
solenoid magnet and produce a magnetic force that holds the remaining plates on the 1.25-inch pipe.
Figure 5.4: The bolt feeder.
The Bolt Feeder. Despite the improvement in mining equipment over the years, roof bolting machines
currently on the mining market do not include a component that can store bolts. This project
developed a bolt feeder (Fig. 5.4), which is strategically placed at a position where the robotic arm can
grab a bolt for installation. The current prototype of the bolt feeder is designed to allow the bolt to slide without
frictional force. The main function of the bolt feeder is to house multiple bolts and position one bolt at a time for
grabbing. The bolt feeder was constructed so that a bolt is only allowed to slide when the actuator
produces a motion that pushes the shaft through the feeder for the robot to grasp.
Figure 5.5: The wrench system.
The Wrench System. To prevent human intervention during the automated roof bolting operation, there
is a need for a tool interchanger to perform the duty of human support. In this project, the tool interchanger
system is referred to as the wrench system. The wrench system was designed to quickly and efficiently
change tools while the roof bolting process was in operation. The wrench is positioned beside the drill
head roof bolter to stabilize the drill steels and the bolt during drilling and bolt installation (Fig. 5.5).
The wrench allows the drill steels and the bolt to be positioned appropriately and prevents possible
disengagement of drill steels during drill steel coupling and bolt installation. More importantly, it serves as
the connector between the drill head and the drilling tools in order to achieve the maximum drilling and bolting
height. The automated wrench comprises a stepper motor connected to a mini controller, which sends high
and low signals to the PLC for task execution. When the communication path is established between the
mini wrench controller and the PLC, the wrench system can be controlled from the IQAN system via the
touchpad.
5.2 Multi-robot planning for roof bolting
Fig. 5.6 shows the complete robot-assisted roof-bolting system constructed in our lab (Schafrik, Kolapo,
and Agioutantis, 2022). In a roof-bolting task, the system performs the following actions step by step: (1) drill a
hole in the roof with a drill steel; (2) remove the drill steel; (3) install resin; (4) install a bolt. To perform
these actions successfully, the roof-bolter operator and the roof bolter need to collaborate seamlessly. The
role of the roof bolter is to drill the roof and install the resin and the bolt into the roof. The role of the
operator is to pick up the drill steel, the resin and the bolt and hand them over to the roof bolter. In our
robot-assisted roof-bolting system, we replace the human roof-bolter operator with an ABB IRB 1600 robot
because of its high accuracy and flexibility.
Figure 5.6: The roof bolting system.
Industrial robots have been widely deployed in factories (Nikolaidis and Shah, 2012) in isolation from
people, where their tasks can be pre-defined in the form of waypoints. However, an underground mine is
usually cluttered and dynamic. For example, human workers who are focused on other tasks may
unintentionally leave tools around. These left-behind tools and other objects in the environment become obstacles
blocking the roof-bolting operation. The robot arm then has to clear its operation space, i.e., move movable
obstacles out of the way. Moreover, to perform roof-bolting tasks, it is critical to coordinate the roof bolter
and robot arm because of their different capabilities. On one hand, we need the roof bolter to drill holes
in the roof and install the bolts into the roof; on the other hand, we need the robot arm to hand bolts,
resins and drill steels over to the roof bolter and rearrange movable obstacles. To automatically generate
manipulation plans to coordinate the roof bolter and the robot arm, the planning framework should first
compute the occlusion and reachability information for the roof bolter and the robot arm (Chapter 3) and
then generate effective manipulation plans accordingly.
We observe that in each step of the roof-bolting operation we have a target object whose target configuration is specified and numerous objects that can be treated as movable obstacles. By treating the roof
bolter as the second robot, we propose to formulate each step of the roof-bolting operation as a multi-robot
geometric task-and-motion planning (MR-GTAMP) problem.
In the roof-bolting task, we need to move the drill steel, the resin and the bolt to their target configurations in the roof. In our application study, we only focus on bolt placement. Other actions can be
formulated as MR-GTAMP problems similarly. We assume the target configuration of the bolt has been
pre-defined. We formulate an MR-GTAMP problem where we have two robots, i.e., a roof bolter and an
ABB IRB 1600 robot arm. These two robots have different reachability: the roof bolter can place the bolt
into its target configuration, whereas the robot arm can pick up the bolt from its initial configuration.
Moreover, the robot arm can reach most of the movable objects in the environment.
Figure 5.7: Example scenario 1 in the simulation.
[Figure content: CMTG nodes M1 : Re1 : R1 : R2 : g_{M1,R1} : g_{M1,R2}; M4 : Re2 : R1 : R1 : g_{M4,R1} : g_{M4,R1}; M3 : Re2 : R1 : R1 : g_{M3,R1} : g_{M3,R1}; M2 : Re2 : R1 : R1 : g_{M2,R1} : g_{M2,R1}.]
Figure 5.8: Generated CMTGs for example scenario 1. R1 and R2 represent the ABB robot arm and the
roof bolter.
Figure 5.9: Example scenario 2 in the simulation.
The reachability of the roof bolter and the robot arm can be computed by calling motion planning
algorithms (Chapter 3, Sec. 3.2.1) and can be easily encoded using collaborative manipulation task graphs
(CMTGs) (Chapter 3, Sec. 3.2.2.1). We will then use our proposed framework to compute executable task-and-motion plans for the roof bolter and the robot arm that account for their different manipulation capabilities.
[Figure content: CMTG nodes M1 : Re1 : R1 : R2 : g_{M1,R1} : g_{M1,R2}; M2 : Re2 : R2 : R1 : g_{M2,R2} : g_{M2,R1}; M3 : Re2 : R1 : R1 : g_{M3,R1} : g_{M3,R1}.]
Figure 5.10: Generated CMTGs for example scenario 2. R1 and R2 represent the ABB robot arm and the
roof bolter.
5.3 Two example scenarios
In our application study, we run our proposed planner for two example scenarios. We show the environment setups in simulation and the built CMTGs (Fig. 5.7, 5.8, 5.9, 5.10). We denote the ABB robot arm and
the roof bolter as R1 and R2. For each action, we denote the object that is moved as M_i, the grasp that is
used by robot R_k as g_{M_i,R_k}, and the region to which the object is moved as Re_j.
Example scenario 1. In the first example scenario, we have the bolt as a target object (object M1) and
three movable obstacles (objects M2, M3, M4) (Fig. 5.7). The CMTG for moving object M1 is shown in
Fig. 5.8 (left). The CMTG shows that to move object M1, the ABB robot arm and the roof bolter have to
perform a handover action. Object M4 blocks the ABB robot arm from picking up object M1 and object M3
blocks the ABB robot arm from picking up object M4. Given the CMTG, we can generate a task skeleton.
While grounding the generated task skeleton (Chapter 3, Sec. 3.2.2.2), the planner finds that object M2
blocks the handover action between the ABB robot arm and the roof bolter. The planner then generates a
new CMTG to move object M2 (Fig. 5.8 (right)).
Example scenario 2. In the second example scenario, we have a target object, bolt (object M1) and two
movable obstacles (objects M2, M3) (Fig. 5.9). The CMTG for moving object M1 is shown in Fig. 5.10
(left). The CMTG shows that to move object M1, the ABB robot arm and the roof bolter have to perform
a handover action. Object M2 blocks the roof bolter from placing object M1 to its target configuration.
While grounding the task skeleton generated from the CMTG (Chapter 3, Sec. 3.2.2.2), the planner
finds that object M3 blocks the handover action between the ABB robot arm and the roof bolter. The
planner then generates a new CMTG to move object M3 (Fig. 5.10 (right)).
5.4 Planning and execution details
Planning. We conduct 5 trials for each scenario on an AMD Ryzen Threadripper PRO 3995WX processor
with 64 GB of memory. The average planning times for example scenario 1 and example scenario 2 are
144.1 (±21.5) seconds and 100.8 (±15.4) seconds. We observe that most of the planning time is spent on
task skeleton grounding (Chapter 3, Sec. 3.2.2.2), where motion planning is extensively called. The average
planning time spent on motion planning for example scenario 1 and example scenario 2 is 143.4 (±21.5)
seconds and 100.2 (±15.4) seconds, respectively. This is because it is challenging to plan collision-free motion trajectories to move large objects such as the bolt and drill steel in a confined workspace. On the other hand,
Figure 5.11: Frames showing the execution of the generated plan for example scenario 1 in both simulation
(Left) and real-world (Right).
it only takes 0.6(±0.0) seconds and 0.6(±0.0) seconds on average to compute task skeletons (Chapter 3,
Sec. 3.2.2.1) for example scenario 1 and example scenario 2.
Figure 5.12: Frames showing the execution of the generated plan for example scenario 2 in both
simulation (Left) and real-world (Right).
Execution. In Fig. 5.11 and Fig. 5.12 we show the execution of the generated plans in simulation and
the real world. We include videos of scenarios 1 and 2 in the supplemental material. The execution times of the
generated plans for example scenario 1 and example scenario 2 are 250.0 seconds and 270.0 seconds. To
execute the planned motion trajectories on the ABB robot, we first manually smooth the motion trajectories
by downsampling their waypoints. We then automatically generate ABB robot
instructions in RAPID (Robotics, 2007) from the waypoints. Each waypoint is a robot configuration defined
in the ABB robot's joint space and is passed as an argument to the MoveJ command in RAPID.
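This conversion can be scripted. The sketch below downsamples a joint-space trajectory and formats each remaining waypoint as a RAPID joint-move instruction; the exact move command (MoveJ vs. MoveAbsJ), speed, zone, and tool data are placeholders that would need to match the controller configuration:

```python
from typing import List, Sequence

def to_rapid_program(trajectory: List[Sequence[float]], keep_every: int = 5,
                     speed: str = "v100", zone: str = "fine", tool: str = "tool0") -> str:
    """Downsample a joint-space trajectory (six axis angles per waypoint, in degrees)
    and format it as a list of RAPID joint-move instructions."""
    waypoints = trajectory[::keep_every] + [trajectory[-1]]   # always keep the final configuration
    lines = ["PROC move_planned()"]
    for q in waypoints:
        joints = ",".join(f"{a:.2f}" for a in q)
        # jointtarget = [[axis angles], [external axes]]; 9E9 marks unused external axes
        target = f"[[{joints}],[9E9,9E9,9E9,9E9,9E9,9E9]]"
        lines.append(f"    MoveAbsJ {target}, {speed}, {zone}, {tool};")
    lines.append("ENDPROC")
    return "\n".join(lines)

# Example: print(to_rapid_program([[0, 0, 0, 0, 30, 0], [10, 5, 0, 0, 30, 0]]))
```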
5.5 Related work
The development of automated technologies in mining has attracted great attention. Automation
is not novel in mining operations; according to McNab and Garcia-Vasquez (McNab and Garcia-Vasquez,
2011), automation has been used for over 50 years in the mining sector to enhance efficiency, remove
operators from hazardous environments, and improve the accuracy and reliability of data collection. In
mining operations, autonomous systems are usually used for monotonous tasks that require precision
and speed that humans cannot match. One of the benefits of deploying autonomous technologies in
mining is that tasks can be executed faster with precision while simultaneously mitigating human risks
by moving operators away from potentially hazardous working environments. The current trends in the
mining industry show that automated technologies and digital solutions are deployed across the mining
value chain, from exploration to processing. These include drilling, blasting, loading, and hauling, as
well as monitoring, control and communication systems. For instance, automated drills were developed
in the United States. When the machines are moved to the site and programmed with a drill pattern, the
machine will be left to execute the task while the operators move to a safer environment to monitor the
process (Bellamy and Pravica, 2011). Since then, there has been significant technological advancement
in data collection, reliable communication systems, robust sensing technology, navigation systems and
imaging technology which makes it possible to have a remote-control center at a greater distance from
mines.
Over the years, there has been a tremendous increase in the design, development, testing, and deployment of autonomous technology in the mining industry around the globe, which in turn promotes the transition to a more automated industry. Attention has been drawn to the introduction of autonomous equipment as a solution to improve mine workers' safety and working conditions. For instance, the implementation of smart systems in automated mining equipment enables the system to navigate rugged terrain and accomplish designated tasks effortlessly. Examples of such equipment are 3D mobile laser scanning technologies and automated haulage systems in underground mine environments. This equipment is designed to automatically perform a function in place of humans, with a minimum of external control. Lynas and Horberry (Lynas and Horberry, 2011) listed the differences between robots used in mining and those used in the manufacturing industry: mining robots are required to move around to perform specific tasks, whereas manufacturing robots only assemble products and pass them to the next production phase. Examples of such robots in mining include autonomous blast hole drilling equipment, rope shovels and hydraulic excavators, automated load-haul-dump (LHD) machines, and automated dragline swings. The study by Ralston et al. (Ralston et al., 2015) describes the technical contributions (milestones) attained worldwide in developing longwall automation, which has influenced the direction and development of present-day longwall technology since its introduction in the 1960s. One pronounced advantage of the Longwall Automation Steering Committee (LASC) longwall automation technology is that it has prevented miners from being exposed to respirable dust (Tyuleneva et al., 2021). Larsson et al. (Larsson, Broxvall, and Saffiotti, 2006) proposed a fully automated navigation system for LHD machines to navigate underground terrain; Connolly and Jessett (Connolly and Jessett, 2014) developed a system that can monitor draglines through the MineWare integrated support center; Marshall et al. (Marshall et al., 2016) designed an automated way to load materials using draglines and shovels.
The absence of satellite coverage presents a challenge in accurately positioning and localizing both mine personnel and automated equipment in underground mining environments. Several studies have proposed solutions to address these problems. Billingsley and Brett (Billingsley and Brett, 2015) developed a method to measure the three-dimensional (3D) shearer path directly through an inertial navigation-based system. Similarly, Xu and Wang (Xu and Wang, 2010) applied a shearer path system based on 3D position and dynamic-static fusion to collect dynamic data, static data from the conveyor shearer, and data from the hydraulic support. Ruiz [20] highlighted the difficulty of collecting horizontal position data using inertial navigation systems, as the measurement errors tend to increase over time. To address this, Fan et al. (Fan et al., 2014) created a wireless sensor network that reduces the position drift error. However, Houshangi and Azizi (Houshangi and Azizi, 2006) estimated the probabilistic distribution from a small set of measurements taken at specifically chosen test points, integrating robot position and orientation with a fiber optic gyroscope and an Unscented Kalman Filter (UKF). Moreover, there has been continuous interest in improving the performance of manipulator control and functionality in automation. Numerous scholars have studied proportional-integral control, utilizing algorithms to attain optimal performance and improve tuning methods. This has been explored in works by Nagaraj and Murugananth (Nagaraj and Murugananth, 2010), Foley et al. (Foley, Julien, and Copeland, 2005), and Iqbal et al. (Iqbal et al., 2014). These studies have been useful in developing robots to improve the performance of personnel in harsh working environments. For instance, Huh et al. (Huh et al., 2011) designed a tele-operated mining robot to replace mine workers in Korean coal mines. In another notable development, Lu (Lu, 2009) demonstrated the conversion of a bucket wheel reclaimer to a robotic arm that can be controlled automatically.
Figure 5.13: Integration of autonomous and remote operations improves efficiency.
As mining operations are declining in terms of quality reserves, mining companies are increasingly adopting autonomous and remotely operated systems to improve efficiency and productivity, address labor shortages, and, especially, improve the health and safety conditions of mine workers. Likewise, the deployment of automation and remote operation systems has reduced risks associated with mining operations. For example, for underground LHD machines, the use of tele-remote systems is being superseded by autonomous navigation, while monitoring can still be done remotely (Nebot and Baiden, 2007). According to McNab and Garcia-Vasquez (McNab and Garcia-Vasquez, 2011), there is a notable improvement in control, accuracy, and efficiency when a remotely controlled system is integrated with other equipment (manned or autonomous), as shown in Fig. 5.13.
Despite the various noticeable impacts of automation in the mining industry, some unanticipated problems and equipment malfunctions have been observed in automated technologies because of sensor failures. The effects of a failed system can lead to automation bias, automation complacency, and obstruction. This has caused a significant setback in the acceptability of automated technologies in mining operations. To address this issue, obstacles such as system failures and the role of the operator in relation to the automated system must be addressed during testing, before automated technologies are deployed in mining operations (Rogers et al., 2019).
Chapter 6
Conclusion
In this thesis, we have identified three fundamental challenges in enabling robots to efficiently and effectively collaborate with other robots and humans in long-horizon collaborative manipulation tasks: human collaborative manipulation action recognition and imitation, high-level task decomposition and assignment, and efficient low-level continuous parameter search. We presented three frameworks to address these challenges efficiently in various setups. We first introduced a framework for robots to extract collaborative action sequences from unconstrained videos on the web. Robots can then convert the extracted action sequences into executable plans that they can perform either independently or as part of a human-robot or robot-robot team. We then considered a smaller yet important subclass of long-horizon manipulation planning problems, called MR-GTAMP, and introduced a novel framework that generates efficient and effective plans by considering the different manipulation capabilities of individual robots. With GTAMP, robots can execute high-level action sequences extracted from the web in more diverse environments. We have shown that the proposed framework can be applied to a critical underground mining operation, namely the roof-bolting operation. We then moved on to addressing the continuous parameter search problem for more efficient GTAMP. We have demonstrated that, by identifying important sampling requirements and designing corresponding heuristics, we can improve the efficiency of continuous parameter sampling and of the overall planning.
The main idea that runs through all of the frameworks developed in this thesis is combining deep learning-based techniques with traditional symbolic reasoning techniques, so that the entire system is more data-efficient, robust, and adaptive. For instance, by combining a neural object detector and a human pose detector with grammar-based common-sense reasoning, we can robustly detect human collaborative manipulation actions from online videos in the wild.
However, our work still has many limitations. In all of our work, we have assumed full observability of the scene and therefore cannot handle uncertainty and noise in robot perception. In our human activity detection framework, we used a simple n-gram language model as our common-sense reasoning engine. We envision that by integrating large language models, the entire system can become more intelligent and robust. We also have not incorporated a human motion prediction framework into our task-and-motion planning framework. We believe this would be a promising step towards a more robust and intelligent human-aware task-and-motion planning framework.
This is a crucial and exciting time for robotics. Enabling robots to collaborate fluently with other agents
is a fundamental challenge for deploying robots in human environments. Over the next few decades, we
believe a tight integration of machine learning and classical robot planning algorithms will be invaluable
in realizing this vision.
Bibliography
Jeeho Ahn, ChangHwan Kim, and Changjoo Nam. “Coordination of two robotic manipulators for
object retrieval in clutter”. In: arXiv preprint arXiv:2109.15220 (2021).
Jeeho Ahn, Jaeho Lee, Sang Hun Cheong, ChangHwan Kim, and Changjoo Nam. “An integrated
approach for determining objects to be relocated and their goal positions inside clutter for object
retrieval”. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2021,
pp. 6408–6414.
Smail Ait Bouhsain, Rachid Alami, and Thierry Simeon. “Simultaneous Action and Grasp Feasibility
Prediction for Task and Motion Planning through Multi-Task Learning”. working paper or preprint.
Mar. 2023. url: https://hal.laas.fr/hal-04016581.
Jan Kristof Behrens, Karla Stepanova, and Robert Babuska. “Simultaneous task allocation and motion
scheduling for complex tasks executed by multiple robots”. In: IEEE International Conference on
Robotics and Automation. 2020, pp. 11443–11449. doi: 10.1109/ICRA40945.2020.9197103.
Drew Bellamy and Luka Pravica. “Assessing the impact of driverless haul trucks in Australian surface
mining”. In: Resources Policy 36.2 (2011), pp. 149–158.
Raphael Bensadoun, Shir Gur, Nitsan Blau, and Lior Wolf. “Neural inverse kinematic”. In: International
Conference on Machine Learning. PMLR. 2022, pp. 1787–1797.
Dmitry Berenson, Siddhartha Srinivasa, and James Kuffner. “Task Space Regions: A Framework for
Pose-constrained Manipulation Planning”. In: Int. J. Rob. Res. ().
Mohak Bhardwaj, Balakumar Sundaralingam, Arsalan Mousavian, Nathan D Ratliff, Dieter Fox,
Fabio Ramos, and Byron Boots. “Storm: An integrated framework for fast joint-space model-predictive
control for reactive manipulation”. In: Conference on Robot Learning. PMLR. 2022, pp. 750–759.
Julien Bidot, Lars Karlsson, Fabien Lagriffoul, and Alessandro Saffiotti. “Geometric backtracking for
combined task and motion planning in robotic systems”. In: Artificial Intelligence 247 (2017),
pp. 229–265.
John Billingsley and Peter Brett. Machine vision and mechatronics in practice. Springer, 2015.
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “OpenPose: realtime
multi-person 2D pose estimation using Part Affinity Fields”. In: arXiv preprint arXiv:1812.08008. 2018.
Joao Carreira and Andrew Zisserman. “Quo vadis, action recognition? a new model and the kinetics
dataset”. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017,
pp. 6299–6308.
J Chase Kew, Brian Ichter, Maryam Bandari, Tsang-Wei Edward Lee, and Aleksandra Faust. “Neural
collision clearance estimator for batched motion planning”. In: International Workshop on the
Algorithmic Foundations of Robotics. Springer. 2020, pp. 73–89.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and
Tony Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.
Tech. rep. Google, 2013. url: http://arxiv.org/abs/1312.3005.
Annie S Chen, Suraj Nair, and Chelsea Finn. “Learning generalizable robotic reward functions from
"in-the-wild" human videos”. In: arXiv preprint arXiv:2103.16817 (2021).
Jingkai Chen, Jiaoyang Li, Yijiang Huang, Caelan Garrett, Dawei Sun, Chuchu Fan, Andreas Hofmann,
Caitlin Mueller, Sven Koenig, and Brian C Williams. “Cooperative Task and Motion Planning for
Multi-Arm Assembly Systems”. In: arXiv preprint arXiv:2203.02475 (2022).
Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. “Channel-Wise
Topology Refinement Graph Convolution for Skeleton-Based Action Recognition”. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision (ICCV). 2021, pp. 13359–13368.
Rohan Chitnis, Dylan Hadfield-Menell, Abhishek Gupta, Siddharth Srivastava, Edward Groshev,
Christopher Lin, and Pieter Abbeel. “Guided search for task and motion plans using learned heuristics”.
In: 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2016, pp. 447–454.
Noam Chomsky. Lectures on government and binding: The Pisa lectures.
Kenneth Ward Church. “A stochastic parts program and noun phrase parser for unrestricted text”. In:
International Conference on Acoustics, Speech, and Signal Processing, 1989.
Mark Connolly and Andrew Jessett. “Integrated Support Centres–The Future of Dragline Fleet
Monitoring”. In: Procedia Engineering 83 (2014), pp. 90–99.
Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics
and machine learning. http://pybullet.org. 2016–2019.
Michael Danielczuk, Andrey Kurenkov, Ashwin Balakrishna, Matthew Matl, David Wang,
Roberto Martín-Martín, Animesh Garg, Silvio Savarese, and Ken Goldberg. “Mechanical Search:
Multi-Step Retrieval of a Target Object Occluded by Clutter”. In: IEEE International Conference on
Robotics and Automation. 2019, pp. 1614–1621. doi: 10.1109/ICRA.2019.8794143.
Michael Danielczuk, Arsalan Mousavian, Clemens Eppner, and Dieter Fox. “Object rearrangement
using learned implicit collision functions”. In: 2021 IEEE International Conference on Robotics and
Automation (ICRA). IEEE. 2021, pp. 6010–6017.
Emilie Danna, Mary Fenelon, Zonghao Gu, and Roland Wunderling. “Generating Multiple Solutions
for Mixed Integer Programming Problems”. In: Integer Programming and Combinatorial Optimization.
2007, pp. 280–294. isbn: 978-3-540-72792-7.
Neil T Dantam, Zachary K Kingston, Swarat Chaudhuri, and Lydia E Kavraki. “An incremental
constraint-based framework for task and motion planning”. In: IJRR 37.10 (2018), pp. 1134–1151.
Rosen Diankov. “Automated Construction of Robotic Manipulation Programs”. PhD thesis. Carnegie
Mellon University, 2010.
Danny Driess, Jung-Su Ha, Russ Tedrake, and Marc Toussaint. “Learning Geometric Reasoning and
Control for Long-Horizon Tasks from Visual Input”. In: Proc. of the IEEE International Conference on
Robotics and Automation (ICRA). 2021.
Qigao Fan, Wei Li, Jing Hui, Lei Wu, Zhenzhong Yu, Wenxu Yan, Lijuan Zhou, et al. “Integrated
positioning for coal mining machinery in enclosed underground mine based on SINS/WSN”. In: The
Scientific World Journal 2014 (2014).
Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. “One-shot visual imitation
learning via meta-learning”. In: Conference on robot learning. PMLR. 2017, pp. 357–368.
Michael W Foley, Rhonda H Julien, and Brian R Copeland. “A comparison of PID controller tuning
methods”. In: The Canadian Journal of Chemical Engineering 83.4 (2005), pp. 712–722.
Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver,
Leslie Pack Kaelbling, and Tomás Lozano-Pérez. “Integrated Task and Motion Planning”. In: Annual
Review of Control, Robotics, and Autonomous Systems 4.1 (2021), pp. 265–293. doi:
10.1146/annurev-control-091420-084139. eprint:
https://doi.org/10.1146/annurev-control-091420-084139.
Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. “PDDLStream: Integrating
symbolic planners and blackbox samplers via optimistic adaptive planning”. In: ICAPS. Vol. 30. 2020,
pp. 440–448.
Igor Griva, Stephen G Nash, and Ariela Sofer. Linear and nonlinear optimization. Vol. 108. Philadelphia,
Pennsylvania, USA: Siam, 2009.
Matthias Grundmann, Vivek Kwatra, Mei Han, and Irfan Essa. “Efficient hierarchical graph-based
video segmentation”. In: CVPR. IEEE. 2010, pp. 2141–2148.
Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li,
Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. “Ava: A video
dataset of spatio-temporally localized atomic visual actions”. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.
“Improved training of wasserstein gans”. In: Advances in neural information processing systems 30
(2017).
David Hallac, Peter Nystrup, and Stephen Boyd. “Greedy Gaussian segmentation of multivariate time
series”. In: Advances in Data Analysis and Classification (2018), pp. 1–25.
Valentin Noah Hartmann, Andreas Orthey, Danny Driess, Ozgur S Oguz, and Marc Toussaint.
“Long-Horizon Multi-Robot Rearrangement Planning for Construction Assembly”. In: arXiv preprint
arXiv:2106.02489 (2021).
Kris Hauser. “Minimum Constraint Displacement Motion Planning”. In: Robotics: Science and Systems.
Berlin, Germany, 2013. doi: 10.15607/RSS.2013.IX.017.
Zhanpeng He, Nikhil Chavan-Dafle, Jinwook Huh, Shuran Song, and Volkan Isler. “Pick2place:
Task-aware 6dof grasp estimation via object-centric perspective affordance”. In: 2023 IEEE International
Conference on Robotics and Automation (ICRA). IEEE. 2023, pp. 7996–8002.
Guy Hoffman. “Evaluating fluency in human–robot collaboration”. In: IEEE Transactions on
Human-Machine Systems 49.3 (2019), pp. 209–218.
Jörg Hoffmann. “FF: The Fast-Forward Planning System”. In: AI Mag. 22 (2001), pp. 57–62.
Rachel Holladay, Tomás Lozano-Pérez, and Alberto Rodriguez. “Force-and-Motion Constrained
Planning for Tool Use”. In: Massachusetts Institute of Technology, 2019. ().
Nasser Houshangi and Farouk Azizi. “Mobile robot position determination using data integration of
odometry and gyroscope”. In: 2006 World Automation Congress. IEEE. 2006, pp. 1–8.
Yifei Huang, Yusuke Sugano, and Yoichi Sato. “Improving action segmentation via graph-based
temporal reasoning”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 2020, pp. 14024–14034.
Sungsik Huh, Unghui Lee, Hyunchul Shim, Jong-Beom Park, and Jong-Ho Noh. “Development of an
unmanned coal mining robot and a tele-operation system”. In: 2011 11th International Conference on
Control, Automation and Systems. IEEE. 2011, pp. 31–35.
Sang Hun Cheong, Brian Y. Cho, Jinhwi Lee, ChangHwan Kim, and Changjoo Nam. “Where to
relocate?: Object rearrangement inside cluttered and confined environments for robotic manipulation”.
In: IEEE International Conference on Robotics and Automation. 2020, pp. 7791–7797. doi:
10.1109/ICRA40945.2020.9197485.
Jamshed Iqbal, Hamza Khan, Nikos G Tsagarakis, and Darwin G Caldwell. “A novel exoskeleton
robotic system for hand rehabilitation–conceptualization to prototyping”. In: Biocybernetics and
biomedical engineering 34.2 (2014), pp. 79–89.
Paul Jaccard. “The distribution of the flora in the alpine zone. 1”. In: New phytologist ().
Hua Jiang and Yi Luo. “Development of a roof bolter drilling control process to reduce the generation
of respirable dust”. In: International Journal of Coal Science & Technology 8.2 (2021), pp. 199–204.
Mohamed Khodeir, Ben Agro, and Florian Shkurti. “Learning to Search in Task and Motion Planning
With Streams”. In: IEEE Robotics and Automation Letters 8.4 (2023), pp. 1983–1990. doi:
10.1109/LRA.2023.3242201.
Beomjoon Kim, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. “Adversarial Actor-Critic Method for
Task and Motion Planning Problems Using Planning Experience”. In: AAAI Conference on Artificial
Intelligence. Vol. 33. 01. 2019, pp. 8017–8024. doi: 10.1609/aaai.v33i01.33018017.
Beomjoon Kim and Luke Shimanuki. “Learning value functions with relational state representations
for guiding task-and-motion planning”. In: Conference on Robot Learning. Vol. 100. 2020, pp. 955–968.
Beomjoon Kim, Luke Shimanuki, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. “Representation,
learning, and planning algorithms for geometric task and motion planning”. In: The International
Journal of Robotics Research 41.2 (2022), pp. 210–231. doi: 10.1177/02783649211038280. eprint:
https://doi.org/10.1177/02783649211038280.
Jennifer E. King, Marco Cognetti, and Siddhartha S. Srinivasa. “Rearrangement planning using
object-centric and robot-centric action spaces”. In: IEEE International Conference on Robotics and
Automation. 2016, pp. 3940–3947. doi: 10.1109/ICRA.2016.7487583.
Takuma Kogo, Kei Takaya, and Hiroyuki Oyama. “Fast MILP-based Task and Motion Planning for
Pick-and-Place with Hard/Soft Constraints of Collision-Free Route”. In: 2021 IEEE SMC. IEEE. 2021,
pp. 1020–1027.
P. Kolapo, S. Schafrik, H. Zhang, S. Nikolaidis, and Z. Agioutantis. “A concept for integrating robotic
systems in an underground roof support machine”. In: Journal of Industrial Safety (2024), p. 100002.
issn: 2950-2764. doi: https://doi.org/10.1016/j.jinse.2024.100002.
Athanasios Krontiris and Kostas E. Bekris. “Efficiently solving general rearrangement tasks: A fast
extension primitive for an incremental sampling-based planner”. In: IEEE International Conference on
Robotics and Automation. 2016, pp. 3924–3931. doi: 10.1109/ICRA.2016.7487581.
Fabien Lagriffoul, Dimitar Dimitrov, Julien Bidot, Alessandro Saffiotti, and Lars Karlsson. “Efficiently
combining task and motion planning using geometric constraints”. In: The International Journal of
Robotics Research 33.14 (2014), pp. 1726–1747.
Johan Larsson, Mathias Broxvall, and Alessandro Saffiotti. “A navigation system for automated loaders
in underground mines”. In: Field and Service Robotics: Results of the 5th International Conference.
Springer. 2006, pp. 129–140.
Przemyslaw A Lasota and Julie A Shah. “Bayesian Estimator for Partial Trajectory Alignment”. In:
2019.
Steven M LaValle. Planning algorithms. Cambridge, United Kingdom: Cambridge University Press, 2006.
Steven M LaValle and James J Kuffner. “Rapidly-exploring random trees: Progress and prospects”. In:
Algorithmic and computational robotics: new directions 5 (2001), pp. 293–308.
Jun Li and Sinisa Todorovic. “Action shuffle alternating learning for unsupervised action
segmentation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2021, pp. 12628–12636.
Zhe Li, Yazan Abu Farha, and Jurgen Gall. “Temporal action segmentation from timestamp
supervision”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2021, pp. 8365–8374.
Tomás Lozano-Pérez and Leslie Pack Kaelbling. “A constraint-based method for solving sequential
manipulation planning problems”. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and
Systems. IEEE. 2014, pp. 3684–3691.
Tien-Fu Lu. “Bucket wheel reclaimer modeling as a robotic arm”. In: 2009 IEEE International Conference
on Robotics and Biomimetics (ROBIO). IEEE. 2009, pp. 263–268.
Danellie Lynas and Tim Horberry. “Human factor issues with automated mining equipment”. In: The
Ergonomics Open Journal 4.1 (2011).
Masoumeh Mansouri, Federico Pecora, and Peter Schüller. “Combining Task and Motion Planning:
Challenges and Guidelines”. In: Frontiers in Robotics and AI 8 (2021). issn: 2296-9144. doi:
10.3389/frobt.2021.637888.
Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber,
and Antonio Torralba. “Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking
Recipes and Food Images”. In: IEEE Trans. Pattern Anal. Mach. Intell. (2019).
Christopher Mark. “The introduction of roof bolting to US underground coal mines (1948-1960): a
cautionary tale”. In: Proceedings of the 21st International Conference on Ground Control in Mining. 2002,
pp. 150–160.
Joshua A Marshall, Adrian Bonchis, Eduardo Nebot, and Steven Scheding. “Robotics in mining”. In:
Springer handbook of robotics (2016), pp. 1549–1576.
Karen McNab and Magaly Garcia-Vasquez. “Autonomous and remote operation technologies in
Australian mining”. In: Brisbane City, Australia: Centre for Social Responsibility in Mining
(CSRM)-Sustainable Minerals Institute, University of Queensland (2011).
Jose Muguira-Iturralde, Aidan Curtis, Yilun Du, Leslie Pack Kaelbling, and Tomás Lozano-Pérez.
“Visibility-Aware Navigation Among Movable Obstacles”. In: 2023 IEEE ICRA (2022).
B Nagaraj and N Murugananth. “A comparative study of PID controller tuning using GA, EP, PSO and
ACO”. In: 2010 International conference on communication control and computing technologies. IEEE.
2010, pp. 305–313.
Daniel Nakhimovich, Yinglong Miao, and Kostas E Bekris. “Resolution Complete In-Place Object
Retrieval given Known Object Models”. In: arXiv preprint arXiv:2303.14562 (2023).
Changjoo Nam, Jinhwi Lee, Sang Hun Cheong, Brian Y. Cho, and ChangHwan Kim. “Fast and resilient
manipulation planning for target retrieval in clutter”. In: IEEE International Conference on Robotics and
Automation. 2020, pp. 3777–3783. doi: 10.1109/ICRA40945.2020.9196652.
Eduardo Nebot and Greg Baiden. Mining Robotics Editorial. 2007.
Anh Nguyen, Dimitrios Kanoulas, Luca Muratore, Darwin G Caldwell, and Nikos G Tsagarakis.
“Translating videos to commands for robotic manipulation with deep recurrent neural networks”. In:
ICRA. IEEE. 2018.
Stefanos Nikolaidis, Anca Dragan, and Siddhartha Srinivasa. “Viewpoint-based legibility optimization”.
In: ACM/IEEE International Conference on Human-Robot Interaction. IEEE. 2016, pp. 271–278.
Stefanos Nikolaidis and Julie Shah. “Human-robot teaming using shared mental models”. In:
Proceedings of IEEE/ACM International Conference on Human-Robot Interaction, Workshop on
Human-Agent-Robot Teamwork. Boston, MA, 2012.
Tianyang Pan, Andrew M. Wells, Rahul Shome, and Lydia E. Kavraki. “A General Task and Motion
Planning Framework For Multiple Manipulators”. In: IEEE/RSJ International Conference on Intelligent
Robots and Systems. 2021, pp. 3168–3174. doi: 10.1109/IROS51168.2021.9636119.
Katerina Pastra and Yiannis Aloimonos. “The minimalist grammar of action”. In: Philosophical
Transactions of the Royal Society B: Biological Sciences 367.1585 (2012), pp. 103–117.
David Paulius, Ahmad B Jelodar, and Yu Sun. “Functional Object-Oriented Network: Construction &
Expansion”. In: ICRA. IEEE. 2018.
Syd S Peng and DHY Tang. “Roof bolting in underground mining: a state-of-the-art review”. In:
International Journal of Mining Engineering 2.1 (1984), pp. 1–42.
Siyuan Qi, Baoxiong Jia, and Song-Chun Zhu. “Generalized earley parser: Bridging symbolic grammars
and sequence data for future prediction”. In: arXiv preprint arXiv:1806.03497 (2018).
Carlos Quintero-Pena, Zachary Kingston, Tianyang Pan, Rahul Shome, Anastasios Kyrillidis, and
Lydia E Kavraki. “Optimal Grasps and Placements for Task and Motion Planning in Clutter”. In: 2023
IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2023, pp. 3707–3713.
Ana Huamán Quispe, Heni Ben Amor, and Henrik I. Christensen. “Combining arm and hand metrics
for sensible grasp selection”. In: IEEE International Conference on Automation Science and Engineering.
2016, pp. 1170–1176. doi: 10.1109/COASE.2016.7743537.
Jonathon C Ralston, David C Reid, Mark T Dunn, and David W Hainsworth. “Longwall automation:
Delivering enabling technology to achieve safer and more productive underground mining”. In:
International Journal of Mining Science and Technology 25.6 (2015), pp. 865–876.
Tianyu Ren, Georgia Chalvatzaki, and Jan Peters. “Extended Tree Search for Robot Task and Motion
Planning”. In: arXiv preprint arXiv:2103.05456 (2021).
ABB Robotics. “Operating manual robotstudio”. In: Västerås, Sweden (2007).
Carlos Rodríguez and Raúl Suárez. “Combining motion planning and task assignment for a dual-arm
system”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. 2016, pp. 4238–4243.
doi: 10.1109/IROS.2016.7759624.
W Pratt Rogers, M Mustafa Kahraman, Frank A Drews, Kody Powell, Joel M Haight, Yaxue Wang,
Kritika Baxla, and Mohit Sobalkar. “Automation in the mining industry: Review of technology,
systems, human factors, and political risk”. In: Mining, metallurgy & exploration 36 (2019), pp. 607–631.
Steven Schafrik, Peter Kolapo, and Zach Agioutantis. “Development of an automated roof bolting
machine for underground coal mines”. In: Proceedings of Annual SOMP Conference. September,
Namibia, 2022.
Fadime Sener and Angela Yao. “Unsupervised learning and segmentation of complex activities from
video”. In: CVPR. 2018.
Rahul Shome, Kiril Solovey, Jingjin Yu, Kostas Bekris, and Dan Halperin. “Fast, High-Quality Two-Arm
Rearrangement in Synchronous, Monotone Tabletop Setups”. In: IEEE Transactions on Automation
Science and Engineering 18.3 (2021), pp. 888–901. doi: 10.1109/TASE.2021.3055144.
Tianmin Shu, Xiaofeng Gao, Michael S Ryoo, and Song-Chun Zhu. “Learning social affordance
grammar from videos: Transferring human interactions to human-robot interactions”. In: arXiv
preprint arXiv:1703.00503 (2017).
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui,
Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. “Mastering the game of
Go without human knowledge”. In: Nature 550 (Oct. 2017), pp. 354–.
Tom Silver, Rohan Chitnis, Aidan Curtis, Joshua Tenenbaum, Tomas Lozano-Perez, and
Leslie Pack Kaelbling. “Planning with learned object importance in large problem instances using
graph neural networks”. In: arXiv preprint arXiv:2009.05613 (2020).
Siddharth Srivastava, Eugene Fang, Lorenzo Riano, Rohan Chitnis, Stuart Russell, and Pieter Abbeel.
“Combined task and motion planning through an extensible planner-independent interface layer”. In:
2014 IEEE ICRA. IEEE. 2014, pp. 639–646.
Mike Stilman, Jan-Ullrich Schamburek, James Kuffner, and Tamim Asfour. “Manipulation Planning
Among Movable Obstacles”. In: IEEE International Conference on Robotics and Automation. 2007,
pp. 3327–3332. doi: 10.1109/ROBOT.2007.363986.
Shao-Hua Sun, Hyeonwoo Noh, Sriram Somasundaram, and Joseph Lim. “Neural program synthesis
from diverse demonstration videos”. In: ICML. 2018, pp. 4797–4806.
Rin Takano, Hiroyuki Oyama, and Masaki Yamakita. “Continuous Optimization-Based Task and
Motion Planning with Signal Temporal Logic Specifications for Sequential Manipulation”. In: 2021 IEEE
International Conference on Robotics and Automation (ICRA). 2021, pp. 8409–8415. doi:
10.1109/ICRA48506.2021.9561209.
Jiajun Tang, Jin Xia, Xinzhi Mu, Bo Pang, and Cewu Lu. “Asynchronous interaction aggregation for
action detection”. In: European Conference on Computer Vision. Springer. 2020, pp. 71–87.
ML Tay and BKA Ngoi. “Optimising robot workcell layout”. In: The International Journal of Advanced
Manufacturing Technology 12 (1996), pp. 377–385.
Praveen Tirupattur, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. “Modeling multi-label action
dependencies for temporal action localization”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2021, pp. 1460–1470.
Marc Toussaint. “Logic-Geometric Programming: An Optimization-Based Approach to Combined Task
and Motion Planning”. In: Proceedings of the 24th International Conference on Artificial Intelligence.
IJCAI’15. Buenos Aires, Argentina: AAAI Press, 2015, 1930–1936. isbn: 9781577357384.
Marc Toussaint and Manuel Lopes. “Multi-bound tree search for logic-geometric programming in
cooperative manipulation domains”. In: IEEE International Conference on Robotics and Automation.
2017, pp. 4044–4051. doi: 10.1109/ICRA.2017.7989464.
Tatiana Tyuleneva, Evgeny Kabanov, Marat Moldazhanov, and Evgeny Plotnikov. “Improving the
professional risk management system for methane and coal dust explosions using a risk-based
approach”. In: E3S Web of Conferences. Vol. 278. EDP Sciences. 2021, p. 01027.
Stephen Van Duin, Luke Meers, Peter Donnelly, and Ian Oxley. “Automated bolting and meshing on a
continuous miner for roadway development”. In: International Journal of Mining Science and
Technology 23.1 (2013), pp. 55–61.
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and
Kate Saenko. “Translating videos to natural language using deep recurrent neural networks”. In: arXiv
preprint arXiv:1412.4729 (2014).
Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. “You Only Learn One Representation: Unified
Network for Multiple Tasks”. In: arXiv preprint arXiv:2105.04206 (2021).
Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. “TDN: Temporal Difference Networks for Efficient
Action Recognition”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). 2021, pp. 1895–1904.
Rui Wang, Yinglong Miao, and Kostas E Bekris. “Efficient and high-quality prehensile rearrangement
in cluttered and confined spaces”. In: 2022 International Conference on Robotics and Automation (ICRA).
IEEE. 2022, pp. 1968–1975.
Zhe Wang, Hao Chen, Xinyu Li, Chunhui Liu, Yuanjun Xiong, Joseph Tighe, and Charless Fowlkes.
“Sscap: Self-supervised co-occurrence action parsing for unsupervised temporal action segmentation”.
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022,
pp. 1819–1828.
Zhenzhi Wang, Ziteng Gao, Limin Wang, Zhifeng Li, and Gangshan Wu. “Boundary-aware cascade
networks for temporal action segmentation”. In: European Conference on Computer Vision. Springer.
2020, pp. 34–51.
Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and
Ross Girshick. “Long-term feature banks for detailed video understanding”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 284–293.
Zhi-peng Xu and Zhong-bin Wang. “Research on the key technology of shearer position”. In: 2010
International Conference on Intelligent System Design and Engineering Application. Vol. 2. IEEE. 2010,
pp. 280–283.
Jun Yamada, Chia-Man Hung, Jack Collins, Ioannis Havoutis, and Ingmar Posner. “Leveraging scene
embeddings for gradient-based motion Planning in latent space”. In: arXiv preprint arXiv:2303.03364
(2023).
Yezhou Yang, Anupam Guha, C Fermuller, and Yiannis Aloimonos. “A cognitive system for
understanding human manipulation actions”. In: Advances in Cognitive Systems 3 (2014), pp. 67–86.
Yezhou Yang, Yi Li, Cornelia Fermüller, and Yiannis Aloimonos. “Robot Learning Manipulation Action
Plans by "Watching" Unconstrained Videos from the World Wide Web.” In: AAAI. 2015, pp. 3686–3693.
Zhutian Yang, Caelan Reed Garrett, and Dieter Fox. “Sequence-Based Plan Feasibility Prediction for
Efficient Task and Motion Planning”. In: arXiv preprint arXiv:2211.01576 (2022).
Zhutian Yang, Jiayuan Mao, Yilun Du, Jiajun Wu, Joshua B. Tenenbaum, Tomás Lozano-Pérez, and
Leslie Pack Kaelbling. “Compositional Diffusion-Based Continuous Constraint Solvers”. In: 7th Annual
Conference on Robot Learning. 2023. url: https://openreview.net/forum?id=BimpCf1rT7.
Tianhe Yu, Pieter Abbeel, Sergey Levine, and Chelsea Finn. “One-shot hierarchical imitation learning
of compound visuomotor tasks”. In: arXiv preprint arXiv:1810.11043 (2018).
Hejia Zhang, Shao-Hung Chan, Jie Zhong, Jiaoyang Li, Peter Kolapo, Sven Koenig, Zach Agioutantis,
Steven Schafrik, and Stefanos Nikolaidis. “Multi-robot geometric task-and-motion planning for
collaborative manipulation tasks”. In: Autonomous Robots 47.8 (2023), pp. 1537–1558.
Hejia Zhang, Eric Heiden, Stefanos Nikolaidis, Joseph J. Lim, and Gaurav S. Sukhatme.
“Auto-conditioned Recurrent Mixture Density Networks for Learning Generalizable Robot Skills”. In:
CoRR abs/1810.00146 (2018). arXiv: 1810.00146. url: http://arxiv.org/abs/1810.00146.
Hejia Zhang, Po-Jen Lai, Sayan Paul, Suraj Kothawade, and Stefanos Nikolaidis. “Learning
Collaborative Action Plans from YouTube Videos”. In: ISRR 2019. Hanoi, Vietnam, 2019.
Luowei Zhou, Chenliang Xu, and Jason J Corso. “Towards automatic learning of procedures from web
instructional videos”. In: Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
Abstract
How can we enable robots that operate in unstructured human environments such as homes, hospitals, and underground mines to effectively collaborate with other agents such as robots, and humans? This is a challenging problem that necessitates improvements across various aspects of robot capabilities. For instance, it requires multiple robots to devise effective collaboration strategies for intelligent task decomposition and allocation. Additionally, robots must comprehend human actions to offer appropriate assistance.
In this thesis, we focus on robotic collaborative manipulation tasks where robots interact not only with their environments but also with other agents. We explored three important research questions arising in robotic collaborative manipulation tasks: (1) enabling robots to comprehend and imitate human collaborative manipulation actions; (2) optimizing robot collaboration considering their distinct manipulation capabilities; (3) devising efficient robot responses to environmental changes induced by humans.
One direction to address these research questions is to leverage recent advances in deep learning. The drawback, however, is that such approaches require large amounts of data and can be brittle: if a model encounters a data point that is very different from those seen in the training set, it is likely to make mistakes.
Our approach is to combine deep learning-based approaches with traditional symbolic reasoning approaches. We propose multi-robot planning algorithms, neuro-symbolic task-and-motion planning algorithms, and human collaborative manipulation action representations to address the three research questions above. We show that the proposed frameworks can address these questions effectively and lead to many interesting future directions.