Learning from Planners to Enable New Robot Capabilities

by

Max Pflueger

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

August 2020

Copyright 2020 Max Pflueger

Dedication

For my father, mother, and brother, who have always encouraged me to pursue my curiosity.

Acknowledgements

A Ph.D. can't happen without the support of a lot of people and organizations. First, thank you to my advisor, Gaurav Sukhatme, who invited me to join his lab as a master's student, and has supported my research ever since. He has helped me find problems to work on, people to work with, and offered sage advice on the practice of doing research, both technical and non-technical. I would not have been able to finish this dissertation without him.

I would like to thank my collaborators and colleagues at JPL, where I was able to develop a major component of this dissertation. Thanks especially to Ali Agha, who offered invaluable support and advice in pursuing my research, and who helped me develop a summer project into a much more significant contribution. Thanks also to Bradd Carey and Sammi Lei, who helped me in data preparation, and Kyohei Otsu and his team, who gave me access to specialized rover planning tools. Also thanks to the many others at JPL who met with me and shared their knowledge as I learned about the specialized problems of driving rovers on Mars.

My studies at USC have been enhanced by professors, postdocs, Ph.D. students, and others who have given me advice, collaborated on projects, participated in reading groups, and been around to chat about ideas. Thanks to Stefan Schaal for guidance on the direction of my research. Thanks to my current and former lab mates, Karol Hausman, Stephanie Kemna, Oliver Kroemer, Yevgen Chebotar, Artem Molchanov, James Preiss, Peter Englert, Jörg Müller, Eric Heiden, Chris Denniston, Ryan Julian, K.R. Zentner, and others who have served as editors and sounding boards for ideas. Thanks to Christian Potthast and Hörður Heiðarsson for teaching me about the PR2. And a special thanks to Arvind Pereira, who advised me on my first project with RESL as a master's student.

During my Ph.D. research I had multiple meaningful internships. I would like to thank my internship hosts at Google, Richard Maher, Chris Clark, and Brian Williams, as well as my host at Bosch, Lorenzo Riano. I was also supported in part by multiple scholarships from the ARCS Foundation.

Finally, I'd like to thank all the friends I made along the way, in RESL, CLMC, the computer science department, and the university at large. Thanks especially to my friends from the USC Ballroom Dance Team, who became an essential part of my life, including fellow Ph.D. students Sarah, Reed, and Jeff, dance partners Sarah, Sarah, Michelle, and Giang, and my coaches Lorena and Jonathan. There are many more, and I cannot possibly list them all here without the hazard of omitting someone important.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction & Motivation
2 Multi-Step Planning
  2.1 Introduction
  2.2 Related Work
    2.2.1 Continuous Space Planners
    2.2.2 Combined Task and Motion Planning
    2.2.3 Object Perception and Manipulation
  2.3 Problem Formulation
    2.3.1 A Toy Problem
    2.3.2 General Formulation
  2.4 Algorithm
    2.4.1 Reachable
    2.4.2 Adjacent
    2.4.3 State Sampling
    2.4.4 State Storage and Reuse
    2.4.5 Heuristic Function
    2.4.6 Optimality
    2.4.7 Heuristic Inflation
  2.5 Simulation
    2.5.1 The Chair
    2.5.2 Continuous Planners
    2.5.3 Tabletop Problem
  2.6 Folding Chair Implementation
    2.6.1 The Chair
    2.6.2 Problem Specifications
    2.6.3 Chair State Estimation
    2.6.4 Adjacent Stance Expansion
    2.6.5 Reachability Checking
    2.6.6 Closed Kinematic Chains
    2.6.7 Path Execution
  2.7 Experiments
    2.7.1 Simulated Chair
    2.7.2 Planning Difficulty
    2.7.3 Speed-Up Factors
    2.7.4 Branching Factors
    2.7.5 Folding Chair Physical Demonstration
    2.7.6 Tabletop Planning
  2.8 Discussion and Future Work
  2.9 Conclusion
3 Inverse Reinforcement Learning with Soft Value Iteration Networks for Planetary Rover Path Planning
  3.1 Introduction
    3.1.1 Planetary Rover Path Planning
  3.2 Related Work
  3.3 Preliminaries
  3.4 Algorithm
    3.4.1 Architecture
    3.4.2 Training
    3.4.3 Plan Execution
    3.4.4 Value Iteration Module
    3.4.5 Soft Action Policy
    3.4.6 Importance of Soft Policies in IRL
  3.5 Experiments
    3.5.1 Grid World
    3.5.2 Jezero Crater
  3.6 Discussion and future work
4 Plan-Space State Embeddings for Improved Reinforcement Learning
  4.1 Introduction
  4.2 Related Work
  4.3 Plan-Space State Embedding
    4.3.1 Trajectory Linearity Model
    4.3.2 Preliminaries
    4.3.3 Direct State Encoding
    4.3.4 Variational State Encoding
    4.3.5 Evaluation
  4.4 Implementation
    4.4.1 Kinodynamic Planner Demonstrations
    4.4.2 Variational State Embedding
    4.4.3 RL in Embedding Space
  4.5 Experiments
    4.5.1 Environments
    4.5.2 Embedding Performance
    4.5.3 RL Performance
  4.6 Conclusion & Discussion
5 Summary and Conclusion
References

List of Tables

2.1 Heuristic Performance Results
2.2 Heuristic Speed-Up Factors
2.3 Tabletop Planning Results
3.1 Jezero Crater Training Parameters

List of Figures

2.1 PR2 robot in the process of folding a chair
2.2 Toy World Initial Conditions
2.3 Toy World Final Conditions
2.4 Toy World State Space
2.5 Chair Initial Configurations
2.6 Tabletop Initial State
2.7 Tabletop Final State
2.8 Chair Grasp Visualization
2.9 Chair Manipulation Sequence
3.1 Jezero Crater Example
3.2 MSL Wheel Damage
3.3 Rover-IRL Information Flow Diagram
3.4 Value Iteration Module
3.5 Gridworld Training Examples
3.6 Gridworld Vector Field Results
3.7 Gridworld Training Curves
3.8 Gridworld Reward & Value Map Example
3.9 Additional Jezero Crater Examples
3.10 Additional Jezero Crater Examples (part 2)
3.11 Jezero Crater Reward Architecture
3.12 Jezero Crater Training Curves
4.1 Path Data Model
4.2 PSSE Information Flow Diagram
4.3 PSSE System Diagram
4.4 PSSE Training Curves
4.5 TRPO Training Curves
4.6 PPO Training Curves

Abstract

Solving complex robotic problems often involves reasoning about behavior multiple steps in the future, yet many robot learning algorithms do not incorporate planning structures into their architecture. In this dissertation we show how we can harness the capabilities of planning algorithms to learn from the structure of the robotic problems we wish to solve, thus expanding beyond what was available from baseline planners. We consider problems in multi-arm manipulation planning, path planning for planetary rovers, and reinforcement learning for torque controlled robots, and show how in each case it is possible to learn from the behavior of planning algorithms that are limited and unable to solve the full generalized problem. Despite not being full solutions, these planners provide useful tools and insights that can be leveraged in larger solutions. In multi-step planning for manipulation we develop a high level planner that can find solutions in difficult spaces by solving sub-problems in sub-spaces of the main planning space. For planetary rovers we show how to use inverse reinforcement learning to learn a new planning algorithm that can function on different (and generally cheaper) input data. Reinforcement learning algorithms often suffer from unstable or unreliable training; we show how this can be mitigated by augmenting the robot state with a state embedding space learned from planner demonstrations. Planning and control algorithms often rely on rigid and prescribed assumptions about the nature of robot problems, which may not be suitable for the generalized and versatile robot systems we wish to build. However, as this dissertation argues, those structures are still useful in informing the behavior of more flexible families of algorithms.

Chapter 1

Introduction & Motivation

In order to build autonomous systems that are useful and practical, we need software agents that are flexible enough to produce useful behavior across the space of tasks we want to do and are reliable in those tasks.
Some of the most successful methods in robotics have relied on constraining the space of problems they can solve in one way or another, in order to get the reliability necessary to be practical. Sampling-based planning algorithms work well in environments with manageable dimensionality, a good model of the state transition dynamics of the system, and the ability to quickly check if a state is valid (usually meaning checking if a state is in collision with the environment). Control algorithms may only be stable for a subset of the possible robot states. Trajectory optimization requires a reward function that can be a function of a whole trajectory, and sometimes a way to calculate a gradient in trajectory space. All of these algorithms can be made reliable and useful, but they are constrained to certain scenarios and can't solve the full space of problems we care about.

By contrast, recent research has created a number of powerful new ideas in robotics that are substantially more flexible than the techniques above, often requiring fewer assumptions and less system knowledge. Unfortunately, in many or most cases they are unreliable along various axes. Reinforcement learning, for example, has produced impressive results and demonstrations, yet often still suffers from instability in training. In this dissertation we have studied three problems where we have reliable but constrained tools for planning, and we wish to leverage those tools to extend our capabilities beyond the existing constraints.

In our work in multi-step planning (Chapter 2), we developed a way to leverage the capabilities of a collection of simple planners, each designed to work with a different set of problem constraints, to solve a longer horizon task for which we would otherwise be unable to plan directly. This is formulated to address a family of problems known as combined task and motion planning (other terms also exist, such as integrated task and motion planning and hybrid planning). In our work on this topic [PS15] we have proposed the multi-step planning approach for manipulation problems. In contrast with other work in the area, this approach avoids making assumptions about the logical problem structure in the high level planner, opting instead to push those assumptions into lower level subroutines. Lower level planners are easier to make and easier to make reliable, and the high level planner maintains generality while following problem structure based on the behavior of the low-level planners.

In some scenarios we would like to exploit a planning behavior that is not always available and only arrives in the form of demonstrations. This may arise because the planner depends on an expensive intermediate data processing stage and is not available in all scenarios, or because it can't be queried interactively, such as a library of human demonstrations. Planetary rover path planning is an example of this, where some data exists in the form of rover drive history, and some planning tools exist, but they depend on pre-prepared terrain labels or other high cost data. Ideally, we could take an image from orbit and then immediately be able to plan on the photographed terrain. Our work on inverse reinforcement learning (discussed in Chapter 3) develops a way to learn the implicit reward function of an existing planning algorithm in a rover path planning scenario, so that planning behavior could be duplicated at lower effort in more scenarios.
The results show how we can learn to visually identify easily traversable terrain based only on demonstrations of behavior in similar environments, and scale those demonstrations to a useful and flexible planning algorithm.

Reinforcement learning (RL) is normally formulated to start from a blank slate, knowing nothing about the system it is learning to control, but in many cases we actually know a lot about the dynamics of the system and how to do some useful things in the space, though perhaps not with the flexibility we desire. Our work in plan-space state embeddings, discussed in Chapter 4, hypothesizes that it is possible to distill the knowledge about a system implied by the available planners into an embedding space of the robot state, which can then be appended to the raw robot state to improve the performance of reinforcement learning algorithms. Experimentally, we see that this augmentation can improve the reliability of the reinforcement learning training process, which can be very important as many RL algorithms suffer from instability or unreliability during training.

These three projects work to support the central thesis of this work: that existing, reliable, short horizon planning techniques for robotics contain significant insights about the nature of the systems they work for, and as we work to develop new ideas to lengthen our useful horizons and expand the versatility of our techniques, we can extract useful insights from the shorter range and more constrained behavior of these planners.

Chapter 2

Multi-Step Planning

For some of the more difficult families of manipulation planning problems, techniques such as sampling-based planning algorithms have been developed that work well for subspaces of the problem we care about, but the full problem is still quite difficult, often as a result of the number of degrees of freedom and the controllability of the space. In this chapter we propose a generalized way to use high-level discrete planning to combine the capabilities of different continuous planners to solve a more complex problem than the individual planners could handle. This chapter includes work from [PS15] as well as additional unpublished work.

2.1 Introduction

Robotic systems capable of dexterous manipulation are becoming increasingly common, driven by precipitous drops in the cost of sophisticated sensing technologies as well as ever more powerful computers. These new sensors allow robots to be used in unstructured settings where objects may arrive in unknown configurations and obstacles may exist. In order to use robots in these environments we need planning technology capable of dealing with these unknowns to achieve a manipulation task.

Figure 2.1: PR2 robot in the process of folding a chair.

Techniques for robot motion planning have come a long way in the last decade or so, with powerful planning algorithms now available essentially ‘off the shelf’ through frameworks such as OMPL [ŞMK12] and MoveIt [SC13]. These planners can effectively create plans to move robot arms to almost any position through most practical environments.

When thinking about practical robotic manipulation we are almost always dealing with 6 or 7 degree-of-freedom robot arms, mounted on a mobile base. These arms have complex kinematics that make analytic solutions to collision checking problems near impossible. Consequently, sampling-based planning algorithms are used to develop collision free trajectories.
These planning algorithms have advanced to a point where ‘off the shelf’ algorithms exist that can produce practical, collision free, motion plans for many arms in diverse cir- cumstances [ŞMK12]. However, they leave something to be desired when we think about the objectives we actually care about in robotic manipulation problems. When we talk about robotic manipulation, we are interested the ability of the robot to change the state of an object in its environment, and not so much in the state of the robot itself. From the perspective of a motion planner, robots tend to be well behaved, with fully actuated degrees of freedom, where as objects themselves are more difficult to plan for. Many effective techniques exist to develop plans for the kinematic motion of robots, but 5 they tend to fall short of a complete solution when we try to develop plans for the object rather than the robot. Traditional robotic motion planning problems focus on the state of the robot. A more useful (though more difficult) problem formulation would specify goal conditions in terms of the state of the environment. To wit, in many useful problems, the state of the robot is not intrinsically important. If we wish to move a bowl on a table by using a robot, we care that the bowl gets to where we want it. Usually (as long as the robot doesn’t get in our way or run into anything) we don’t care where the robot goes to achieve the goal. Combined task and motion planning refers to planning through a combination of the task space (the state of the environment) and the motion of the robot. Although it may seem like a simple matter of adding more degrees of freedom, combined task and motion planning problems are significantly different from the problems usually encountered in robotic motion planning. Holonomic robots are those that are capable of instantaneously moving in any free direction (such as a ground robot with omni-wheels, or a robot arm in joint space). Planning is easier for holonomic robots because it is easy to develop a local algorithm to move the robot from one waypoint to another. Planning for non-holonomic robots is more difficult because of restrictions in how waypoints can be connected (requiring a more sophisticated local planner). We believe that a significant difficulty of planning in this domain comes from the non- holonomic nature of the problem, that is, it is not possible for the system to instantaneously move in any direction in its configuration space. In general, motion planners exist that can deal with non-holonomic problems (RRTs [LK01] and variants are a good example), how- ever these sorts of planners were generally designed for a particular type of non-holonomic problem where even if it is not possible to move instantaneously in any direction, it is possible to get to any goal state that is nearby in a short amount of time. For example 6 a differential drive robot can use a forward and back motion to slide to the side a small distance in a relatively small amount of time. Put another way, these spaces have an easily computed or well defined sense of locality. That turns out not to be the case for manipulation problems, as the state of the objects in the environment cannot be changed until the robot is in a proper position to interact with them. For example, if our planner desired a small motion of an object on a table, the actual time to achieve that may vary widely (and be unbounded) depending on how easily the robot can reach the object. 
Some researchers in the motion planning community have looked at the problem of planning for the state of objects along with the robot that moves them, sometimes referring to this problem as combined task and motion planning, or integrated task and motion plan- ning. Some approaches have involved using logic based solvers [SFR + 14] or pre-designed plan outlines supplied by an expert [NPM + 14]. So far the problem is still very difficult, and without good generalized solutions. The solution we have proposed for solving these combined task and motion planning problems is called multi-step planning. Multi-step planning has been applied previously by other researchers in some other domains, and we believe it is a strong technique for robotic manipulation problems as well. The principle behind multi-step planning is the combination of our known, well studied robot motion planners (also referred to in this work as continuous planners, as they operate on a continuous space) with a discrete planning algorithm able to choose the sequence of continuous plans that must be executed. It relies on the ability of the programmer to do two important things in the space: first, to find discrete points for breaking up plans, and second, to provide continuous planners that can generate plans at all of those discrete points. Our system is agnostic of how the programmer chooses to meet these constraints, but any modern kinematic motion planner should work 7 well, as well as more specialized planners for specific tasks. (We discuss this process in more detail in Section 2.4: Algorithm.) We start by demonstrating how to use multi-step planning to solve the non-trivial problem of manipulating a folding chair. The folding chair problem demonstrates the power of this technique on an interesting planning problem with a relatively small search space. This problem is in some ways easier that other multi-step problems (only one degree of freedom outside the robot), but also demonstrates the power of this approach by showing how it can integrate a specialized planning algorithm in some steps of the plan where it is necessary (two arm manipulation of the articulated object), while also using more simple planning algorithms when the robot is differently constrained (standard sampling based planners for free arms). We will then extend the technique by showing how to improve the discrete search component of the process to provide a significant speed up in planning time as well as shrinking the size of the explored search tree. We evaluate the magnitude of this speed up by running our planner on four simulated configurations of our chair folding problem, and collecting a variety of metrics about the planning process. Additionally, with the increased performance of this new discrete planner, we are able to demonstrate how this technique can cross applications to a simple tabletop manipulation problem. Although the tabletop problem is more common and appears simple, it actually exhibits significantly higher combinatorial complexity than the chair folding problem. 2.2 Related Work Most robot motion planning can be classified as either dynamic (or policy based) planning, or kinematic planning. Our work is entirely based on the kinematic planning model in which the result of a planner is a viable trajectory through state space that the robot is 8 expected to follow. 
Kinematic planners are not well suited for applications where the robot must deal with large uncertainty in state transitions, and so for that reason we have to allow the robot to move slowly so it can stay close to its trajectory. Our work relies on the A* search algorithm for the discrete search component. A* was originally proposed by Hart and Nilsson [HNR68] and has become a standard algorithm since then. Other heuristic search algorithms exist such as D* [Ste94] and its variants [Ste95][KL05], and multi-heuristic A* [ASN + 14]. Examining the properties of these algo- rithms applied to multi-step planning could be a promising future area of research. 2.2.1 Continuous Space Planners Motion planning problems for modern, high degree-of-freedom robots, such as arms and mobile manipulators, are most frequently solved using one of a group of sampling based motion planning algorithms. This includes techniques such as probabilistic road maps [KSLO96] and RRTs [LK01]. RRTs in particular are well suited to the well behaved non-holonomic problems often encountered in robot navigation (discussed earlier), and many variants have been proposed, including RRT* [KF11], RRT-connect [KL00] and others. However, RRTs rely on a notion of locality that works well in these well behaved spaces (specifically, a meaningful distance metric for finding nearest nodes), but tends to break down in manipulation problems such as the type we address with multi-step planning and other combined task and motion planners. While RRT and its variants fundamentally look like search processes, there are also approaches to motion planning based on the idea of trajectory optimization, where we start with some hypothesis trajectory and deform it according to some cost function with 9 techniques such as CHOMP [RZBS09] or STOMP [KCT + 11]. Cost functions can be built to incorporate the need for collision avoidance, as well as other cost metrics. 2.2.2 Combined Task and Motion Planning Other researchers have used a variety of techniques to approach the combined task and motionplanningproblem. Nedunuriet. al. [NPM + 14]developedanapproachthatreliedon plan outlines supplied by an expert. Barry et. al. [BKLP13] have demonstrated a technique that constructs plans in a hierarchical manner, starting with object motions, then moving up to robot motions to satisfy the object trajectory. Srivastava et. al. [SFR + 14] developed a planner based on a new interface layer between known motion planning algorithms and known PDDL based task planning algorithms. In contrast to some of the above approaches, our work does not use specialized task planners, but instead divides the world state into qualitative classes based on the nature of the robot’s interaction with the environment, and asks the programmer to provide motion planners that are valid in each class of world states. We believe our approach is more generalized than others in the field, making very few assumptions about the nature of the world, and instead relying on the implementer to build the necessary constraints into the provided planners. In a practical sense, this requirement would exist with any attempt at combined task and motion planning, but we think it makes sense to push it to the lowest level possible, rather than building structural constraints into the high level algorithm. This approach may come with a computational cost (at least until we can develop improved heuristics), however we are bullish on future improvements in computational power. 
As an example of its versatility, our formulation allows us to provide planners for specialized situations such as two hand manipulations of articulated objects. 10 Multi-step planning has been used before in some other specialized applications, Bretl et. al. [BLLR05] used it to plan a path for a wall climbing robot. In their experiment a robot climbs a wall using rock climbing holds similar to what would be seen in a rock climbing gym. In this context the discrete states represented the set of holds the robot is currently using to attach to the wall, and state transitions have to account for whether the robot can reach a new hold without falling off the wall. Hauser et. al. [HBL05] demonstrated an application of machine learning to improve the efficiency of the multi-step planner for the same robot. The formulation developed by Siméon et. al. [SLCS04] is very similar to our own in that they also look at the space as a high dimensional configuration space including both objects and the robot. They use families of ‘transit’ and ‘transfer’ paths to break up the space into subdimensional manifolds where the robot either holds an object or it does not. These ‘transit’ and ‘transfer’ paths then form a graph that can be used with a probabilistic roadmap (PRM) planner. In our own work we do not distinguish algorithmically between paths where the robot gripper is occupied or not, instead allowing the programmer to provide whatever planners (and thus, subdimensional manifolds) they choose. But, more importantly, we use a heuristic search process, rather than a PRM, which has different performance characteristics and guarantees. 2.2.3 Object Perception and Manipulation Outside of simulated or highly structured environments, any manipulation planning system relies on some form of sensing and perception to detect the presence and position of the objects with which to interact. Our experimental system relies on fiducial markers and knowledge of the structure of the object. At a practical level it can be difficult to supply structured information about articulated objects when a CAD model is not handy. Sturm 11 et. al. studied ways to model and learn the configuration of articulated objects [SSB11]. Katz et. al. have done similar work using a robot to interact with the scene and detect articulated structure [KOB10]. 2.3 Problem Formulation The full planning space of the problem combines the full pose of the robot along with the poses of all objects in the scene. Planning in that space in a raw form would normally be very difficult due to the high dimensionality of the space, and the need to model state transitions that take into account the physical interaction of objects. We simplify this space by assuming that the scene (non-actuated degrees of freedom) is fixed, and only moved under special circumstances. In this case, objects in the scene only move when the robot is holding on to them. A limitation of this approach is that it does not directly model the physics of objects in the world. For example, if the robot deposits a small tabletop object into a location in free space, the object tends to fall to the ground (Robonaut [DMA + 11] will be allowed to disagree). This can be fixed with a constraint that objects may only be released in stable positions, but it requires the programmer to codify these stable positions. 2.3.1 A Toy Problem The need for this solution approach is easy to demonstrate by a simple, 2D toy problem. Consider a robot and object in a 1 dimensional world, such as Fig. 2.2. 
Suppose we wish to achieve a final state as in Fig. 2.3. To visualize a plan we can see the whole state space represented in 2D in Fig. 2.4. We observe that since the cart is not actuated, only certain forms of motion in the state space are possible. A plan must either move along the diagonal line, as when the robot is 12 Figure 2.2: Initial conditions for a 1D toy world, with a robot on the left, and cart on the right. Figure 2.3: Final conditions for the 1D toy world. Figure 2.4: The full 2D state space for the 1D toy world. Note that motion is only possible along the diagonal line (robot holding the cart) or vertically under the diagonal line (robot unattached to the cart). 13 attached to the cart, or it must move vertically under the diagonal line, as when the robot is unattached. We note that most easy notions of locality or distance metrics break down in this problem as two states may be very close, if the cart is near its goal position, but if the robot is not nearby, the actual manipulation distance may be much larger. For this toy problem it is easy to see how to find a solution: travel up to the diagonal line (grab the cart), traverse the diagonal line until vertically aligned with the goal state (move the cart to its goal position), then move vertically down to the goal state (release the cart and move the robot to its own goal position). Of course, this is an ad hoc solution that would only work for this particular problem, below we describe the multi-step planning algorithm for finding solutions in these sorts of spaces. 2.3.2 General Formulation We formulate this problem as one of kinematic planning in a high dimensional configuration space that includes all dimensions of both the robot and any objects. Instead of planning directly through the entire space, we define subdimensional manifolds in the space that we will refer to as stances. The key properties of a stance are that (1) we have a planner that can plan on the manifold, and (2) we have a way to sample states that exist at the intersection of different stances. Sets of disjoint stances can be combined to form a planning space of higher dimensionality. A planning space would be the set of stances that all use the same planner. This might be visualized as a stack of sheets. An example of this could be seen in a tabletop problem with objects A and B. The robot is holding object A and object B is on the table. The robot is in a planning space that includes all cases of holding and moving object A, and the specific stance in that space is parameterized by the position of object B on the table, and which grasp point the robot is 14 using to hold object A. The robot could then sample intersections with the planning space of holding no objects by sampling places to set down object A. The search problem becomes one of finding the sequence of trajectories along stance manifolds and stance transition points that minimize the distance metric of our choice. 2.4 Algorithm Multi-step planning is, fundamentally, a layering of a discrete search algorithm on top of specialized continuous planners. We consider the full state of our world to include all degrees of freedom of our robot, as well as all the degrees of freedom of any objects we will interact with. Some dimensions of our world can be directly controlled (i.e. robot joints), where as others cannot (e.g. the position of an object). The core of our algorithm is the getPlan function outlined in Algorithm 1. 
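Before walking through the details, a compact C++ sketch may help illustrate how such a search loop ties together the adjacent, reachable, heuristic, and distance functions described in the following subsections. The type and interface names below (WorldState, Callbacks) are hypothetical stand-ins rather than the implementation used in our experiments; Algorithm 1 remains the authoritative description.

// A minimal, self-contained C++ sketch of an A*-style search over a
// dynamically generated graph, in the spirit of Algorithm 1. The types and
// callbacks here (WorldState, Callbacks) are hypothetical stand-ins for the
// domain-specific interfaces described in Sections 2.4.1-2.4.3.
#include <functional>
#include <map>
#include <optional>
#include <queue>
#include <set>
#include <utility>
#include <vector>

struct WorldState {
  // Discretized object poses plus the current stance; used only for identity here.
  std::vector<int> key;
  bool operator<(const WorldState& o) const { return key < o.key; }
  bool operator==(const WorldState& o) const { return key == o.key; }
};

struct Callbacks {
  std::function<std::vector<WorldState>(const WorldState&)> adjacent;    // stance-intersection samples
  std::function<bool(const WorldState&, const WorldState&)> reachable;   // does a continuous planner connect the states?
  std::function<double(const WorldState&, const WorldState&)> heuristic; // admissible, consistent (e.g. minSteps)
  std::function<double(const WorldState&, const WorldState&)> distance;  // edge cost (number of steps in this work)
};

std::optional<std::vector<WorldState>> getPlan(const WorldState& init,
                                               const WorldState& goal,
                                               const Callbacks& cb) {
  using Entry = std::pair<double, WorldState>;  // (f value, state)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<>> open;
  std::map<WorldState, double> g;           // best known cost-to-come
  std::map<WorldState, WorldState> parent;  // for tracing the plan backwards
  std::set<WorldState> closed;

  g[init] = 0.0;
  open.push({cb.heuristic(init, goal), init});

  while (!open.empty()) {
    WorldState current = open.top().second;
    open.pop();
    if (closed.count(current)) continue;  // stale queue entry
    closed.insert(current);

    if (current == goal) {  // trace steps backwards to construct the plan
      std::vector<WorldState> path{current};
      while (!(path.back() == init)) path.push_back(parent.at(path.back()));
      return std::vector<WorldState>(path.rbegin(), path.rend());
    }

    for (const WorldState& s : cb.adjacent(current)) {
      if (closed.count(s) || !cb.reachable(current, s)) continue;
      double tentative = g.at(current) + cb.distance(current, s);
      if (!g.count(s) || tentative < g.at(s)) {
        g[s] = tentative;
        parent.insert_or_assign(s, current);
        open.push({tentative + cb.heuristic(s, goal), s});
      }
    }
  }
  return std::nullopt;  // no plan exists in the graph as sampled
}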
We use the A* discrete search algorithm, with a dynamically generated graph. A* is a well known and studied algorithm, but when using it on the dynamically generated graph that is essential to multi-step planning, the behavior of a couple of important functions becomes key to its performance. We discuss those functions below.

Algorithm 1 depends on two implementation specific functions, reachable and heuristic, as well as the adjacent function, which is domain specific. The heuristic function can be any admissible and consistent heuristic for our application that also matches the cost calculation for g values. In this work we will use the number of steps as a distance metric with the matching minSteps heuristic outlined in Algorithm 4.

Algorithm 1 getPlan
1: Input: world states init and goal
2: Output: trajectory, if one exists
3: Let state_queue be a priority queue
4: closed_set ← ∅
5: state_queue.insert(init)
6: while state_queue is not empty do
7:   current ← state_queue.pop()
8:   closed_set.insert(current)
9:   if current == goal then
10:    trace steps backwards to construct trajectory
11:  end if
12:  for all s ∈ adjacent(current) \ closed_set do
13:    if s is reachable from current then
14:      h ← heuristic(s, goal)
15:      g ← current.g_val + distance(current, s)
16:      f ← g + h
17:      if s is not in state_queue then
18:        insert s in state_queue with priority f
19:      else
20:        if f < current priority for s in state_queue then
21:          update s in state_queue with priority f
22:        end if
23:      end if
24:    end if
25:  end for
26: end while
27: return trajectory

2.4.1 Reachable

The reachable function is designed to tell the discrete search algorithm if a valid path exists from one world state to another. To do this it has to verify that any changes in the state are valid (i.e. objects cannot move unless the robot is holding them), and use a continuous planner to verify that a feasible path exists for the robot.

As we noted earlier, the multi-step planning algorithm depends on the programmer to supply a family of planning algorithms that can generate plans in all of the various stances that may exist in our problem. reachable will read the stance and then make a call to the appropriate planner to see if the states can be connected, returning a boolean value about the reachability of the new state, as well as (optionally, depending on the implementation) a distance to the new state.

As a small optimization, if a valid trajectory is found then it will be returned so it can be stored until the end of the planning process, thus preventing the need to replan all the steps in the final multi-step plan, at the memory cost of storing trajectories for each valid edge in our search graph.

2.4.2 Adjacent

An important abstraction here is the idea of an adjacency. Two stances are considered to be adjacent if there exists a world state that is valid in both stances. The implementation of this algorithm requires a way to sample world states that are valid at the intersection of two stances.

In order to do multi-step planning we need the ability to find stances adjacent to our current stance. For manipulation problems we define stances by the set of grasps the robot is using, so finding adjacent stances is a matter of looking at each gripper and either releasing the current grasp or taking a new one, which provides a set of adjacent stances. It is then necessary to sample world states in the intersection of each of those stances with our base stance. This process is detailed in Algorithm 2. In the next section we will discuss the state sampling process.
Algorithm 2 adjacent
1: Input: world state a
2: Output: set of adjacent world states adj
3: adj ← ∅
4: adj_stance ← ∅
5: for g in robot grippers do
6:   if g is holding object then
7:     release ← a.stance with g released
8:     adj_stance ← adj_stance ∪ release
9:   else
10:    for i in available grasp points do
11:      grab ← a.stance with gripper g grabbing i
12:      adj_stance ← adj_stance ∪ grab
13:    end for
14:  end if
15: end for
16: for stance in adj_stance do
17:   adj ← adj ∪ sampleStates(a.stance, stance)
18: end for
19: return adj

2.4.3 State Sampling

Algorithm 3 outlines our state sampling process. The generation of individual samples is a random, application specific process that must be supplied by the programmer for a given application.

Algorithm 3 sampleStates
1: Input: stances a and b
2: Output: set of world states at the intersection of a and b
3: states ← sampled_states.atIntersection(a, b)
4: while states.size < sample_count do
5:   sample ← randomly selected state in intersection of a and b
6:   states ← states ∪ sample
7:   sampled_states.insert(sample in a and b)
8: end while
9: return states

Two other aspects of this function are worth noting. First is the sampled_states data structure. This is a structure that provides an efficient way to store all previously sampled states along with all of the stances of which they are members. Furthermore, it can be queried with a pair of stances to efficiently return all already sampled states that are in both stances. In the next section we will briefly discuss how we implemented a data structure with these specifications.

Second, we have a parameter, sample_count, that tells us how many samples to take before returning. This parameter can have a very large effect on the running time and solution quality of the planner. If it is set too high it will grow the branching factor of the search tree to make it too large. If it is set too low the planner may not be able to find a high quality, or any, solution.

2.4.4 State Storage and Reuse

In order to facilitate storing and retrieving world states by stance intersection, we created a data container we call a VennSet, defined by two operations:

insert(node, set_list): Insert a node that is associated with a list of possibly preexisting sets.

intersect(set_a, set_b): Return all nodes associated with both set_a and set_b.

Internally, the nodes are stored in a dynamically resizable array (C++ std::vector in our implementation), and sets are stored as a mapping from the set to an array of node IDs for the nodes in that set (we use the C++ std::map, which is usually implemented as a red-black tree). Thus insert can be implemented by adding the new node to the end of the node array, and then adding its ID to the end of each array for each set it belongs to.

The intersect query is then implemented by first sorting the arrays of node IDs for each set in place, after which a linear pass produces the intersection. This may be slightly expensive the first time it is called for a given set, but by using a sorting algorithm that runs faster with ‘mostly sorted’ data, this cost can be minimized. This is not the most asymptotically efficient structure possible for these queries (using hash tables for internal structures could provide an asymptotic advantage), but we found it to be sufficiently fast for our application.

2.4.5 Heuristic Function

Algorithm 4 lists the procedure for our proposed minSteps heuristic.
This heuristic is based on a simple intuition: objects must move to the goal if they are not already there, and they do not move unless the robot moves them. As such, if an object is out of place, the state is at least one step away from a solution. Taking this one step further (with the inner if statement), we check if the robot is holding the object that must move. If not, then we are one extra step from the goal, as the robot must first reach to grab the object before it can be moved.

Ultimately, this is a relatively simple heuristic, though, as we will show in our experiments, it provides a significant speed-up.

Algorithm 4 minSteps
1: Input: world states a and b
2: Output: a lower bound on steps to get from a to b
3: steps ← 0
4: for object o in object list do
5:   if a.object_state[o] != b.object_state[o] then
6:     steps ← steps + 1
7:     if robot is not holding o in a then
8:       steps ← steps + 1
9:     end if
10:  end if
11: end for
12: return steps

2.4.6 Optimality

Since our algorithm relies on the A* search algorithm, it inherits the associated guarantees on optimal solution length for the search graph as sampled. However, the random sampling of the search graph prevents us from making a general optimality claim for the multi-step planning algorithm. We mentioned earlier that tuning the sample_count parameter can have a significant impact on solution quality. It is potentially an area of future research to look into ways to ramp up the intersection sampling density until a satisfactory solution is found.

In our experiments we use step count as the measure of solution quality. This works well for our problem since most paths have similar execution time, and there is a significant and constant time cost associated with taking and releasing grasp points. Solutions for other problems might reasonably choose measures of travel time, end effector travel distance, or some other metric of plan cost. Problems with a mobile robot in particular may benefit from different cost metrics. In this case it would also be necessary to develop a new heuristic specific to the cost function being used.

2.4.7 Heuristic Inflation

A common approach to accelerating a heuristic search process is to apply an inflation factor to the heuristic. If starting with an admissible heuristic, this technique comes at the cost of optimality; however, the sub-optimality of the final solution is bounded by the factor used to expand the admissible heuristic.

We have experimented with using a very small inflation factor (1.05) to prevent the search process from needlessly expanding nodes with identical cost. It should be observed that since we use a cost metric of number of steps, and that cost is discrete, for problems with optimal solutions less than 20 steps the factor 1.05 still guarantees an optimal solution.

2.5 Simulation

We have implemented the multi-step planning algorithm for a simulated folding chair, as well as a simulated tabletop manipulation problem.

2.5.1 The Chair

Configuration files provide the planner with a full kinematic model of the chair, as well as a list of all the valid grasp points on the chair. Configuration files also provide information about how finely to discretize the states of the chair during the planning process. The experiments in this work use 7 grasp points on the back of the chair, 10 on the seat, and discretize the joint into 5 states. For the purposes of this planner, we assume the pose of the back of the chair is fixed in space.
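As a concrete illustration, the kind of configuration described above (grasp points tied to chair links, plus a discretization of the folding joint) could be represented roughly as in the following C++ sketch. The struct and field names, and the numeric joint limits, are hypothetical; the actual planner reads this information from its configuration files and the URDF model.

// A hypothetical sketch of the problem configuration described above.
// Field names and the numeric joint limits are illustrative only.
#include <algorithm>
#include <array>
#include <string>
#include <vector>

struct GraspPoint {
  std::string link;            // chair link this grasp is attached to
  std::array<double, 3> xyz;   // gripper position in the link frame
  std::array<double, 4> quat;  // gripper orientation (x, y, z, w)
};

struct ChairPlanningConfig {
  std::string urdf_path;           // kinematic model of the chair
  std::vector<GraspPoint> grasps;  // e.g. 7 on the back, 10 on the seat
  int joint_discretization = 5;    // folding joint discretized into 5 states
  double joint_min = 0.0;          // illustrative joint limits (radians)
  double joint_max = 1.6;
};

// The discretized joint values the high-level planner treats as object states.
std::vector<double> jointStates(const ChairPlanningConfig& cfg) {
  std::vector<double> states;
  const int n = std::max(cfg.joint_discretization, 2);  // at least the endpoints
  for (int i = 0; i < n; ++i) {
    const double t = static_cast<double>(i) / (n - 1);
    states.push_back(cfg.joint_min + t * (cfg.joint_max - cfg.joint_min));
  }
  return states;
}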
Fixing the back of the chair in space is reasonable in some configurations, such as when the chair is resting on the ground in such a way that the robot will not need to support the full weight of the chair while grasping it, as we demonstrate in this work. In our simulated planning problems we are less concerned with the practical realities of such things as gravity (though it poses interesting challenges for future work), and so, as one can see in our Experiments sections, we will consider the planning problem with the chair in some less conventional configurations.

Figure 2.5: Four initial configurations of the chair used for testing the multi-step planner: (a) config 1, (b) config 2, (c) config 3, (d) config 4.

2.5.2 Continuous Planners

As per the requirements of multi-step planning, we must provide a family of continuous planners capable of generating robot trajectories for the various stances that may exist. In this problem there are 3 such planners. A 2 free arm planner and a 1 free arm planner account for 2 of those, and are essentially identical except that in the 1 free arm planner the constrained arm is held stationary. The third planner is application specific to the closed kinematic chain formed when the robot grabs 2 links on the same object that have a joint between them, and works by interpolating through the joint positions of the object.

2.5.3 Tabletop Problem

Our tabletop manipulation problem involves two cylindrical objects on a table within reach of the robot that both need to be repositioned on the table. A single grasp point (from the top) is provided for each object, and the objects can be placed anywhere on the table in a grid of potential poses (we use a grid resolution of 5 cm). Similar continuous planners are provided as with the chair folding problem, except for the closed kinematic chain planner, which is not needed in this problem.

2.6 Folding Chair Implementation

The implementation of this algorithm involves resolving a number of specific technical challenges. Some of these are specific to the nature of the object we are working with, others are not. The sections below will cover the significant technical challenges and assumptions made by our implementation for the physical folding chair demonstration.

Figure 2.6: Initial conditions for a 2 object tabletop manipulation problem.

Figure 2.7: Goal conditions for a 2 object tabletop manipulation problem, both objects have been repositioned.

2.6.1 The Chair

We examine the problem of a robot that folds (or unfolds) a typical folding chair. The initial and goal states will be specified as poses for the chair, leaving the planner to decide where to hold the chair and how to move. We allow ourselves full prior knowledge of the shape and kinematics of the chair, as well as a list of available grasp points on the chair.

We perform this manipulation with the PR2 robot from Willow Garage. As a practical matter, the robot is not strong enough to manipulate the chair arbitrarily, so we add an extra constraint to the state of the chair that requires two legs of the chair to be on the ground at all times, and we assume they will not slip. In most practical configurations this will ensure the robot does not have to support any significant forces with its arms. In this implementation we have also assumed that the pose of the chair base link (the back of the chair) will not change.

2.6.2 Problem Specifications

We allow the algorithm to start with a complete kinematic model of a real chair, supplied in the standard URDF format.
We also provide a list of valid grasp points on the chair. Each grasp point is tied to a particular link on the chair, so it will continue to be valid through manipulation actions. Fig. 2.8 shows a visualization of available grasp points being evaluated during the planning process; note that some of the grasp points are shown for hypothetical configurations of the chair. Grasps are shown green when a valid kinematic solution could be found to perform a grasp, and red when one could not be found, or because that grasp is considered in collision.

Figure 2.8: Visualization of hypothetical grasp points during planning for different chair configurations. Grasps are shown as PR2 grippers, green for valid, red for invalid. Note that some grasps are shown for hypothetical chair positions.

Currently the list of grasp points passed to the algorithm is hand specified; however, it would be sensible to do this process automatically. The running time of the algorithm, as well as the existence and quality of a solution, can depend heavily on how many and where grasp points are provided, so having a principled process here would be quite useful.

The initial state of the object (chair) is observed at the beginning of the planning process, while the final state is specified as the state of the object and robot. Currently we specify the pose of the root link of the object to be unchanged, and only allow the robot to move the articulated joint of the object.

2.6.3 Chair State Estimation

Perceiving the exact state of the chair is also a challenging problem; however, a general solution to that problem is outside the scope of this work. We have attached AR Tag style markers to the chair that can be used with a calibrated 2D camera to perceive the pose of the parts of the chair in the frame of the robot. We used the ar_pose ROS package to perceive these tags. In ROS we treat the chair as another robot with a URDF file, so it is necessary to extract the joint angles of the chair from multiple observed tracking tags. We calculate the observed rotation between two links of the chair, then project that rotation in quaternion space onto the constraint created by the rotational joint in the URDF specification of the chair. Note that an axis of rotation constraint becomes a hyperplane passing through the origin in 4D quaternion space, so we can use standard linear algebra techniques for this.

2.6.4 Adjacent Stance Expansion

For this robotic manipulation task stances can be defined by the state of the robot grippers with respect to what (if anything) they are holding and where they are holding it. This makes it relatively easy to sample stance intersections, since they will always be characterized by a change of state for a gripper (grabbing or releasing something). We maintain an index of sampled world states for every stance intersection, and by choosing an upper limit on how many states may be sampled for a given stance intersection we avoid the issue of oversampling an intersection if it is revisited. The sampled world states are maintained in a data structure that associates each world state with the stances it is in, and allows efficient queries of stance intersections to find all world states in that intersection.

2.6.5 Reachability Checking

Reachability checking involves determining when a state transition is feasible in 1 step. We have to verify three things: kinematics, collisions, and path validity.

To check kinematics we look at the initial and final states wherever the robot is holding an object.
If the robot is holding an object, we use inverse kinematics to verify that a valid kinematic solution exists for the configuration of the object in that state.

Currently we check for collision only by specifically excluding grasp-configuration combinations that are known to cause collisions; however, it makes sense to use a more generalized collision checker here.

Checking path validity can be done by a standard robot arm path planner for whatever change in robot joint state is necessary from initial to final conditions. We are currently using joint space interpolation for unconstrained arms, but in future work this could incorporate a standard collision avoidance path planner. Here we add some additional constraints: the configuration of the object may not change unless the robot is holding it with both arms on different links (this is a constraint specific to the chair, but it is conceptually simple to generalize to other objects: a joint cannot be moved unless the robot is holding both of its links). When the robot does change the configuration of the object, this creates some special constraints as a closed kinematic chain, discussed below.

2.6.6 Closed Kinematic Chains

Solving kinematics for closed chains in general can be difficult. We deal with this problem by prioritizing the pose of the object. When a path must be found where the robot changes the configuration of the object, we perform a linear interpolation in the joint space of the object, and then solve inverse kinematics for the robot to hold the object at points along that path. We keep the joint states of the robot smooth by seeding each IK solver with the solution from the previous state.

2.6.7 Path Execution

We assume that our environment is very predictable, and so after finding a solution trajectory we simply execute it without sensor feedback.

2.7 Experiments

We have been able to demonstrate the multi-step planning algorithm running both in simulation and on the real robot, with successful planning and execution results.

2.7.1 Simulated Chair

We ran the planner for four different chair configurations, and for each chair configuration we tested with 3 heuristics: a null heuristic that always returns 0, thus producing a naive breadth-first search, the minSteps heuristic discussed in Section 2.4.5, and the minSteps heuristic inflated by a factor of 1.05. Fig. 2.5 shows the four chair initial configurations we used. The chair can be seen with a blue back, red seat, and grey rear legs.

Table 2.1 shows the differences between the two search processes. Each configuration-planner combo has been run 5 times, and results are shown as the average of those 5 plans; standard deviation is shown in parentheses. All of the planning results were run on an Intel Core i7-950 desktop with 8 GB of RAM. We evaluate the planners on 3 metrics:

planning time: CPU time spent in the getPlan function.
search tree nodes: total number of reachable nodes added to the search tree.
branching factor: average number of reachable successors from each expanded search node.

Table 2.2 shows the speed-up factor of both non-null heuristics over the null heuristic.

2.7.2 Planning Difficulty

We can see based on the planning times and other metrics that not all the configurations were equally difficult. Specifically, configuration 3 appears to be quite difficult, whereas 2 and 4 are easier.
2.6.7 Path Execution

We assume that our environment is very predictable, and so after finding a solution trajectory we simply execute it without sensor feedback.

2.7 Experiments

We have been able to demonstrate the multi-step planning algorithm running both in simulation and on the real robot, with successful planning and execution results.

2.7.1 Simulated Chair

We ran the planner for four different chair configurations, and for each chair configuration we tested with 3 heuristics: a null heuristic that always returns 0 (producing a naive breadth-first search), the minSteps heuristic discussed in Section 2.4.5, and the minSteps heuristic inflated by a factor of 1.05. Fig. 2.5 shows the four chair initial configurations we used. The chair can be seen with a blue back, red seat, and grey rear legs.

Table 2.1 shows the differences between the search processes. Each configuration-planner combination has been run 5 times, and results are shown as the average of those 5 plans; standard deviation is shown in parentheses. All of the planning results were run on an Intel Core i7-950 desktop with 8 GB of RAM. We evaluate the planners on 3 metrics:

- planning time: CPU time spent in the getPlan function.
- search tree nodes: total number of reachable nodes added to the search tree.
- branching factor: average number of reachable successors from each expanded search node.

Table 2.2 shows the speed-up factor of both non-null heuristics over the null heuristic.

null heuristic
Config   Time (s)       Nodes       Branching Factor
1        19.37 (1.15)   497 (26)    2.16 (0.06)
2        24.02 (0.60)   968 (19)    2.11 (0.06)
3        78.45 (0.53)   1830 (15)   1.42 (0.01)
4        15.86 (0.37)   821 (15)    2.63 (0.05)

minSteps
Config   Time (s)       Nodes       Branching Factor
1        4.39 (0.26)    174 (2.9)   2.84 (0.15)
2        1.96 (0.03)    111 (1.4)   18.27 (0.23)
3        28.61 (2.27)   985 (46)    1.63 (0.07)
4        2.13 (0.07)    112 (2.2)   15.89 (0.32)

minSteps with inflation
Config   Time (s)       Nodes       Branching Factor
1        6.25 (0.44)    261 (20)    3.32 (0.16)
2        1.88 (0.02)    94 (0.8)    15.57 (0.13)
3        25.66 (0.67)   1010 (13)   1.98 (0.03)
4        3.53 (0.18)    308 (37)    9.29 (0.06)

Table 2.1: Running time, search graph nodes, and branching factor for the 4 initial configurations with the null heuristic, the minSteps heuristic, and the minSteps heuristic with a 1.05 inflation factor. Each data point is an average of 5 runs, with the standard deviation shown in parentheses.

Speed-Up Factor
Config   minSteps   minSteps w/ inflation
1        4.4        3.1
2        12.3       12.8
3        2.7        3.1
4        7.4        4.5

Table 2.2: Speed-up factor shows how many times faster the search is with each heuristic over the null heuristic, in terms of running time.

2.7.2 Planning Difficulty

We can see based on the planning times and other metrics that not all the configurations were equally difficult. Specifically, configuration 3 appears to be quite difficult, whereas 2 and 4 are easier. Qualitatively, we note that configurations 1 and 3 require solutions where the robot starts by grasping the seat of the chair from the edge to start the motion, and then has to regrasp from underneath to close the chair. This is due to the grasp from underneath being kinematically impossible in the initial conditions, but the grasp from the edge being in collision with the chair in the final conditions. Motion of this type was also required in the solution presented in our previous work with a real robot and chair. By contrast, configurations 2 and 4 are able to grasp the seat from underneath in the initial conditions and can thus solve the problem in fewer steps.

2.7.3 Speed-Up Factors

Based on the table we can see that while all configurations got a significant speed-up from the minSteps heuristic, some gained a lot more than others. Specifically, our 'easier' problems (2 and 4) got a much bigger speed-up than the harder problems. We suspect this is related to the fact that we are using a relatively simple heuristic. Intuitively, a simple heuristic will be very effective at directing search graph expansion in areas of the problem where decision making is easy, but will be less useful in areas where decision making is hard. Under this intuition, configurations with more 'hard' decisions to make will not be able to shed as much of their planning time. We think this also suggests that there may be further headroom for improvement on difficult problems with more sophisticated heuristics. Heuristic inflation seemed to produce inconsistent results, resulting in better runtimes in some problems, and worse in others.

2.7.4 Branching Factors

We measure branching factor by counting the number of times the planner successfully evaluates the reachability of another state (about line 14 in Algorithm 1) and dividing that by the size of the closed set when planning is complete. Note that this is not the average number of adjacencies, as already closed states are not counted, and neither are calls to reachable that return false. A simplified sketch of this bookkeeping inside the search loop is shown below.
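The following is a simplified illustration only, not a reproduction of Algorithm 1: it assumes unit step costs and hypothetical helper functions (successors, reachable, heuristic), and shows where the successful reachability checks are counted relative to the closed set.

    import heapq
    import itertools

    def get_plan_sketch(start, goal_test, successors, reachable, heuristic,
                        inflation=1.0):
        """Heuristic multi-step search sketch with branching-factor bookkeeping."""
        tie = itertools.count()              # tie-breaker so states never get compared
        open_heap = [(inflation * heuristic(start), next(tie), 0, start, [start])]
        closed = set()
        reachable_successes = 0              # successful reachability evaluations

        while open_heap:
            _, _, g, state, path = heapq.heappop(open_heap)
            if state in closed:
                continue
            closed.add(state)
            if goal_test(state):
                return path, reachable_successes / len(closed)

            for nxt in successors(state):    # candidate adjacent world states
                if nxt in closed:
                    continue                 # already-closed states are not counted
                if not reachable(state, nxt):
                    continue                 # failed reachability checks are not counted
                reachable_successes += 1
                f = g + 1 + inflation * heuristic(nxt)
                heapq.heappush(open_heap, (f, next(tie), g + 1, nxt, path + [nxt]))

        # No plan found: report the branching factor anyway.
        return None, reachable_successes / max(len(closed), 1)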
A property of this metric is that the branching factor starts out quite large, as the early states have many reachable successors, but as states move into the closed set the average branching factor shrinks, to the point that if the planner terminates without finding a plan, it will show a branching factor approaching 1 (though possibly higher, depending on how cyclic the search graph is). This is an interesting metric because it gives us a measure of something related to the exponential difficulty of the problem, as well as how much of the graph is being left unexpanded by our heuristic.

Looking at the branching factors we notice that all configurations with the null heuristic have branching factors around 2, and with the minSteps heuristic configurations 1 and 3 (our 'hard' problems) also have a branching factor around 2 (though in each case higher than their corresponding null heuristic search), but configurations 2 and 4 have radically higher branching factors in the 15-19 range. We think this shows that in these simpler configurations the minSteps heuristic has been extremely successful in cutting out unnecessary exploration, leaving behind a large branching factor. This also corresponds to the significant decreases in computation time for these problems.

2.7.5 Folding Chair Physical Demonstration

The physical demonstration of the algorithm runs on a slightly different code base from the simulation results, but is comparable to the simulation using a null heuristic. Fig. 2.9 shows a sequence of states while executing a resulting plan on the real robot. The chair starts in an unfolded state with a goal configuration to have it folded up. The plan is executed blind and is susceptible to perception errors in the initial state of the chair, or small miscalibrations in the sensors, so as a result some undesired motion of the chair is observed. Despite this, the plan executes successfully and moves the chair into a folded state.

Of particular interest should be the state where the planner has naturally discovered the need to reposition the gripper part way through the motion. This is a result of the fact that in order to fully close the chair the robot must grasp the seat from below to avoid a collision. However, those grasp poses are not kinematically feasible in the unfolded state, so it is necessary to start with a different grasp, fold the chair part way, then change grasp point. The full video of the test can be seen here: http://youtu.be/cmL4UpjyMng.

Figure 2.9: A sequence of robot states from a plan executed in the lab. The initial state of the chair is shown in tile 1, the goal state is reached in tile 10. Observe how the robot moves the seat part way, then transitions to a different grip (tiles 7-8) from which it is possible to finish the manipulation.

2.7.6 Tabletop Planning

In the tabletop planning problem we do not have comparisons with a null heuristic since the problem is significantly more complex. We can, however, observe some comparisons to the chair folding problem. The planning time here is generally significantly larger, and the branching factors are an order of magnitude larger than any we have seen on the chair folding problem. The large branching factors in particular suggest that the improvement here from a useful heuristic is even more important.

Tabletop Planning Problem
Heuristic         Time (s)       Nodes      Branching Factor
minSteps          31.56 (0.19)   1991 (5)   210 (38)
minSteps * 1.05   47.12 (0.10)   2870 (4)   161 (15)

Table 2.3: Planning results for our 2-object tabletop planning problem.
Results are averaged over 5 runs and standard deviations are shown in parentheses.

Intuitively, the larger branching factor is an expected result of the robot having more choices at each adjacency in the tabletop problem. The chair only had one free dimension, whereas our 2-can problem has 2 dimensions per can, greatly multiplying the available world configurations. In our state sampler we discretized the table into a grid with a resolution of 5 cm. Finer discretizations are desirable for completeness; however, in addition to multiplying the number of reachable states from a given state, they also decrease the probability of the search graph revisiting old nodes, which further pushes up the branching factor.

2.8 Discussion and Future Work

Although we believe this multi-step planning approach is quite powerful, there are some situations that present challenges. Without enough samples at a stance intersection, the probability of getting within an acceptable distance of the goal position may become very small. As such, a coarse discretization, particularly for more subtle manipulation tasks, could be difficult to use. Also, as we increase the degrees of freedom of the object, this will multiply the number of samples necessary to achieve the same coverage of a stance intersection, and thus multiply the branching factor of the search tree (in nodes where the object can be moved).

A limitation of this algorithm is that once a plan is generated it must be executed blind, without any updates based on sensor data. This is usually how traditional motion planners are operated, but when multi-step motion involves interacting with objects, the system becomes less predictable. If a robot gripper is slightly ill-positioned then a grasp may fail, or it may succeed but move the object into an unexpected pose. We observed this behavior in our own experiments. Extensions to this algorithm might provide the opportunity for replanning or adjusting plans based on new sensor data.

A reasonable extension of this work could also include an algorithm for generating grasp points on the object. The same branching factor issues would apply as above, but a finer grained selection of grasp points may be necessary for problems with tighter kinematic constraints.

Based on our results in simulation with the 'easy' configurations, we think that further development of the heuristic search process may lead to stronger improvements with the 'hard' configurations. One area that we think is particularly promising in this regard is the use of the multi-heuristic A* algorithm [ASN+14]. Because multi-heuristic A* can remain optimal with inadmissible heuristics, this opens the door to other forms of heuristic generation that may not be able to guarantee admissibility. Furthermore, significant speed-ups in the discrete search process may create opportunities for using multi-step planning on more complex problems with more steps or bigger dimensionality.

2.9 Conclusion

Combined task and motion planning problems represent a new frontier in robotic planning tasks. They allow the programmer to no longer be concerned with the details of where a robot holds an object, or the order of operations necessary for larger problems. In this work we presented a new approach to combined task and motion planning problems using multi-step planning. We demonstrated the efficacy of the approach by implementing it for a complex manipulation problem involving an articulated object, in this case a folding chair.
In this work we have also proposed heuristic search techniques as a significant development in multi-step planning, and demonstrated the performance of this strategy with our minSteps heuristic for our multi-step chair folding problem as well as a new tabletop problem. Using multiple initial configurations we have shown between 2.7x and 12.8x speedup in the planning procedure, depending on the configuration. We believe that heuristic search techniques will be very powerful for any future implementations of multi-step planning, and we hope to see future work use these speed-ups to expand the applicability of this technique to more complex problems.

In a wider context, this work has shown how low-level and specialized planners can encapsulate the complexities of a difficult manipulation task. Even though a direct solution of the full problem may be too difficult, we have shown how to combine the capabilities of lower level planners to solve a higher level and longer horizon problem than we were able to solve directly.

Chapter 3
Inverse Reinforcement Learning with Soft Value Iteration Networks for Planetary Rover Path Planning

In the previous chapter we looked at how to combine a collection of limited planners with a more general high-level planning algorithm to solve complex, longer horizon tasks. Now we look at a problem where we have a planner capable of giving demonstrations, but access to it is limited or expensive. Here we wish to learn to imitate the behavior of that planner, with a learning system that will not share the same constraints as the demonstrating system. In addition to giving us similar planning behavior in a wider range of scenarios, this process can also give us insight into why our demonstrating planner made the decisions that it did. This chapter is primarily a reprint of [PAS19].

3.1 Introduction

Value iteration networks (VIN) are an approximation of the value iteration algorithm [Bel57, Ber95] that were originally proposed by [TWT+16] as a way of incorporating a differentiable planning component into a reinforcement learning architecture. In this work, we apply VIN in an inverse reinforcement learning (IRL) approach for robot path planning problems. The architecture has two main neural network components: the VIN itself, which is an unrolling of the value iteration recurrence to a fixed number of iterations, and the reward network, which transforms terrain and goal data into a reward map that feeds into the VIN. An important feature of this approach is that the learned component of the network exists entirely within the reward network, the output of which must be a well behaved reward function for our planning problem, making human interpretation of the planning results relatively easy.

In this work, we restrict ourselves to 2D path planning problems on grids that have only local neighbors. This is a natural constraint of using convolutional neural networks for implementing the value iteration algorithm, although one could imagine convolutional VINs of higher dimensionality.

Although the use of a reward function in value iteration is quite intuitive, writing down how to calculate a reward function to produce the desired planning results is deceptively difficult, as has been observed by researchers in the past [BBS+10]. Part of the difficulty comes from the need to keep the relative costs and rewards of different types of terrain in balance with each other.
Systems that learn a reward function can side step these manual difficulties by requiring the reward function to perform well and allow the relative weights of different risks to come into balance automatically. Figure 3.1 shows an instance of one of these learned reward functions from our dataset and some resulting path plans. 3.1.1 Planetary Rover Path Planning Planetary rovers, such as those currently being operated on Mars, face difficult navigation problems in both short and long range planning [GAD + 16]. These problems are made more difficult by tight bandwidth constraints and time lag to communicate with Earth that operationally limit the rovers to only making one round trip communication with Earth per 40 Figure 3.1: Test example from our Jezero Crater dataset (best viewed in color). Top Left: Satellite map shown in grey scale, this image covers a 64m x 64m square of terrain with a pixel size of 25cm per pixel. The red star shows the goal position. Top Right: Learned reward function: Darker colors show low reward areas and brighter colors show high reward areas. Note the single bright pixel at the goal position. Bottom Left: Value function: shows calculated expected discounted reward when starting from each location. Dark to light shows low to high reward values. Bottom Right: Satellite map with paths: Example paths from our policy (as implied by the value function) are shown in red, the demonstration path from the dataset is shown in yellow (rendered under the red paths). Red paths are shown starting from 4 corners of the map as well as the start of the yellow demo path. In this example we see one plan trajectory that closely follows the demo, and another that takes an alternate path around an undesirable region of terrain. The region being avoided shows a ripple pattern that is a common feature in this dataset that we see being avoided by both the expert planner and our learned trajectories. 41 Figure 3.2: MSL image showing damage to the wheels from small rocks. The rocks that cause this damage are too small to see in satellite images, but may co-occur with certain mineral types and terrain features that can be identified from orbit. sol (a sol is a Martian day, about 24 hours 40 minutes). Existing rover driving techniques rely mostly on imagery taken on the surface to plan actions, however those plans can only go as far as the rover can see. Rovers on the ground are naturally limited in their ability to see by variations in terrain and obstructions, as well as by limitations of the cameras. Orbital imagery can be pro- vided at effectively unlimited distance, but arrives at lower resolution than images on the ground. Although orbital imagery can be processed to obtain certain characteristics of the terrain, such as coarse elevation, or in some cases estimates of terrain type, relating these characteristics to a cost function relevant to navigation is nontrivial. In particular, many dangers to the Mars Science Laboratory (MSL) “Curiosity" rover are too small to be seen 42 directly in orbital imagery (Figure 3.2), but it might be possible to find terrain or mineral patterns that are associated with dangerous surface features. We observe that this problem possesses two important properties: we can formulate useful algorithms for long range planning if a suitable cost function is available, and short range planning techniques exist (based on surface imagery) that are both sophisticated and reliable. 
In this work, we present an inverse reinforcement learning architecture using a variant on value iteration networks that we call soft value iteration networks. This system implicitly learns to solve navigation planning problems using training data derived from other planners or historical rover driving data. These navigation functions only depend on orbital data to produce a plan, and are thus useful at ranges beyond what is visible from the ground. Because the inverse reinforcement learning architecture we use is structured around value iteration, it implicitly produces useful cost functions for navigation. We propose an architecture that allows this planning product to be integrated with existing human expert-designed local planners to maintain guarantees in safety critical systems.

An additional application of our technique is that, as a data driven planning technique based only on information obtained from orbit, our planner could be used in mission planning scenarios for tasks such as evaluating the value of a landing site. Planners that are currently used for this purpose rely on heuristic rules for travel cost and maps created through a terrain labeling process.

3.2 Related Work

Deep reinforcement learning (RL) techniques have been frequently applied in settings where we wish to model an action policy as a function of robot state for both discrete and continuous action spaces [MKS+15, LHP+15, SML+15]. Reinforcement learning has also been used in challenging space applications for hopping rovers [HP17] using least-squares policy iteration [LP03].

Value iteration networks were first explored by [TWT+16] in the context of reinforcement learning. They developed the value iteration module as an approximation of the value iteration algorithm that could be used inside a reinforcement learning algorithm to allow the algorithm to 'learn to plan'. We demonstrate a different application of the value iteration module from their work: instead of allowing it to have abstract spatial meaning with a learned state transition model and pass through a final policy network, we bind it tightly to the map of our terrain and the available state transitions. By doing this our algorithm is explicitly forced to learn the reward function that produced the observed behavior in our IRL formulation of the problem. This is important to us as it allows the reward map to be interpretable in other contexts. QMDP-net, proposed by [KHL17], uses a planning technique similar to VINs, but adapted for partially observable problem settings. Other researchers have also looked at ways of using neural networks to model dynamic programming and planning techniques [IKW07, SvHH+17].

IRL for navigation has been studied by other groups as well [SBS10, ZMBD08, WWP16]; for example, the LEARCH technique proposed by [RSB09] has a very similar functional input/output design, although it relies on substantially different computational tools. [WWP16] and [WOP15] combined the MaxEnt IRL framework with a deep network similar to our own for computing cost maps. Imitation learning has also been done with deep neural network (DNN) based policies (instead of cost functions) and visual state information [GGC+16].

Our formulation of IRL with VINs shares a substantial mathematical similarity with previous work in maximum entropy inverse reinforcement learning [ZMBD08, WWP16]. In particular, our use of a 'soft' action model (discussed in section 3.4.5) in combination with a softmax cross-entropy loss function makes our optimization objective very similar to the MaxEnt IRL objective.
In particular, our use of a ’soft’ action model (discussed in section 3.4.5) in combination with a softmax cross-entropy loss function makes our optimization objective very similar to the 44 MaxEntIRL objective. The main difference comes from how the probabilistic action model in the maximum entropy approach is normalized over full trajectories, where as our soft action model is normalized over local actions. We will discuss this comparison in more detail in subsection 3.4.6. 3.3 Preliminaries We use the problem formulation of a Markov Decision Process M = (S;A;P;R;) where (s) defines the distribution over initial states, and a robot in state s2S must choose an action a2 A, which will give reward r = R(s;a) and proceed to state s 0 with probability P (s 0 js;a). We say that a policy (ajs) will define a policy distribution over actions condi- tioned on states, and that (s 0 ;a 0 ;s 1 ;a 1 ;:::) is the distribution over trajectories created recursively when s 0 (s), a t (ajs t ), and s t+1 P (s t+1 js t ;a t ). We then define the value function V conditioned on policy as V (s) = E s;a " 1 X t=0 t R(s t ;a t ) # (3.1) for some discount factor 2 [0; 1]. Here the notation Es;a refers to the expected value of theexpressionwithsandadrawnfromthedistribution. Thiscanberewrittenrecursively as V (s) = E a(ajs) " R(s;a) + X s 0 P (s 0 js;a)V (s 0 ) # (3.2) We further define Q and rewrite the value function: Q (s;a) =R(s;a) + X s 0 P (s 0 js;a)V (s 0 ) (3.3) 45 V (s) = E a(ajs) [Q (s;a)] (3.4) We define the optimal value functionV (s) = max V (s) and a policy as an optimal policy if V = V . The value iteration algorithm can be used to find V through the following iterative process: Q n (s;a) =R(s;a) + X s 0 P (s 0 js;a)V n (s 0 ) (3.5) V n+1 (s) = max a Q n (s;a) (3.6) It is well known that as n!1, V n ! V , and an optimal policy can be inferred as (ajs) = 1 argmax a Q1(s;a) (a). 3.4 Algorithm We develop an inverse RL problem formulation based on using Value Iteration Networks (VIN) to backpropagate gradients through the planning algorithm to update the reward function. By contrast to previous work using VINs ([TWT + 16]), we require the value iteration module to operate directly on the state space of our planning problem (rather than an inferred state space as in [TWT + 16]). Although this restricts what can be done in the learning process (the problem must have a 2D grid state space, and linear state transition functions), by forcing a traditional interpretation on the value map we can use it with any control policy that is designed to incorporate this kind of data. In particular this includes the expert designed local control algorithms that provide the safety guarantee necessary for operating high value planetary rovers. 46 Other Layers VI Module f P Action Selection Path Demonstration diff Loss Applications Training Map Goal f R Parameterized by ? Q V - Hybrid Local Planners - Mission Planning Tools Figure 3.3: Information flow diagram for our architecture. Map data may consist of visual overhead images, as well as other data products such as hyperspectral image channels, or elevation data. This is stacked with a one-hot representation of the goal to form a multi- channel image passed to the reward network f R . The reward network is parameterized by , and its exact architecture may vary with the problem domain, our architecture for the JEZ dataset is shown in Figure 3.11. 
3.4.1 Architecture

The data flow of our algorithm is shown in Figure 3.3. We start by stacking visual map data along with a one-hot goal map and any other relevant data layers (such as elevation). These pass through the network f_R to produce the reward map that feeds into the value iteration module. Also feeding into the value iteration module are the state transition and reward kernels, denoted here as f_P.

The output from the VI module is used differently depending on whether we are currently training the algorithm or deploying it on an operational system. During deployment the value map from the VI module can be fed to an expert designed local planner and used in combination with high resolution local information to make planning decisions for a long range goal. During training the output from the VI module is compared with actions from a plan demonstration (which may come either from an expert planning algorithm, or from historical operations on a real platform), and the difference is used to calculate a loss function.

Figure 3.4: Value Iteration Module. The (potentially multi-channel) reward map f_R is stacked with the current estimate of the value function V (typically initialized with 0's). The state transition model f_P is delivered as the convolutional kernel that creates the next estimate of Q. Q has as many channels as available actions. The action model then processes Q into the next estimate of V. Traditionally this is a max pooling operation; however, in subsection 3.4.5 we discuss our proposed soft action model. This process repeats for K iterations before outputting the final estimates of V and Q.

3.4.2 Training

In general this architecture can have trainable parameters in both f_R and f_P; however, in our problem we fix the transition and reward kernels f_P and only train parameters $\theta$ in the network f_R. During training we attempt to minimize over $\theta$ the loss function defined below:

$L = -\sum_{s \in S_y} \sum_{a \in A} y(s,a) \log \frac{\exp Q_\theta(s,a)}{\sum_{i \in A} \exp Q_\theta(s,i)}$   (3.7)

where $S_y$ is the set of states on the training path and $y$ is an indicator function for the action taken at each state on the training path. This may be recognized as a common softmax cross-entropy loss function, implying a probabilistic action policy based on the softmax of Q over actions.
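A minimal sketch of this loss, written directly against a grid of Q values rather than as part of the network graph, is shown below. The array shapes and function name are illustrative assumptions, not the training code used in this work.

    import numpy as np

    def demo_cross_entropy_loss(q_map, demo_states, demo_actions):
        """Sketch of the training loss in Eq. 3.7.

        q_map:        array of shape (H, W, A) holding Q_theta(s, a) for every
                      grid cell and action, as produced by the VI module.
        demo_states:  list of (row, col) cells visited by the demonstration.
        demo_actions: list of action indices taken at those cells.
        """
        loss = 0.0
        for (r, c), a in zip(demo_states, demo_actions):
            q = q_map[r, c]                                   # Q values over actions at s
            # Numerically stable log-softmax over the action dimension.
            m = q.max()
            log_softmax = q - m - np.log(np.sum(np.exp(q - m)))
            loss -= log_softmax[a]                            # negative log-likelihood of demo action
        return loss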
3.4.3 Plan Execution

The network is provided as input an orbital image of the relevant terrain, a goal position, and optionally other data products such as stereo elevation, surface roughness estimates, or hyper-spectral imagery. With a forward pass through the network we calculate the value and Q function estimates for that terrain and goal position. The policy implied by the Q function could be used directly, such as in a mission planning application, or the value and/or Q functions could be used in any other application where such estimates would be useful.

An important example is using the estimated value function as a tool for long range planning in combination with a more accurate or safe short range planner. A rover on the ground has more accurate data for short range planning within its radius of sight than is available from orbit. Within the radius of its sight, the rover is able to use expert algorithms for navigation, but can combine calculated path costs with value estimates at its planning horizon to choose an appropriate plan that considers longer range objectives. Replanning can then happen as frequently as is allowed by on-board resources.

3.4.4 Value Iteration Module

The value iteration module uses the iterative process defined in Eqs. 3.5 & 3.6 to perform an approximation of value iteration for a fixed number of iterations k. The approximate nature of this module derives from the need to choose the fixed number of iterations k a priori, instead of iterating to convergence as required by the traditional value iteration algorithm. A representation of the architecture of the value iteration module is shown in Figure 3.4.

The two inputs f_R and f_P provide the reward map and convolutional kernel respectively. The reward map is stacked with the value map from the previous iteration and then convolved with f_P to produce a map of Q values. The Q channels must then be collapsed into the next value map through a model of the action selection process. Strict adherence to the value iteration algorithm requires that this be a max pooling operation (an optimal policy chooses the action of maximum reward); however, in the next section we propose an alternative approach.

3.4.5 Soft Action Policy

The traditional formulation of the value iteration algorithm requires that updates be done using the optimal action policy of always choosing the action of highest Q value, as in Eq. 3.6 above. This is a theoretically well justified choice for a planning algorithm. However, in value iteration networks we have an additional objective to provide an effective gradient through the algorithm. If we assume a reward function $R_\theta$ parameterized by $\theta$, we can calculate the gradient of the value function with respect to $\theta$ after k iterations as:

$\nabla_\theta V_k(s) = \nabla_\theta R_\theta(s, a^*) + \gamma \sum_{s'} P(s'|s, a^*)\, \nabla_\theta V_{k-1}(s')$   (3.8)

where $a^*$ is the optimal action selected by the max Q value in iteration $k-1$. Assuming a deterministic state transition model P (as is the case for the problems studied in this work), this equation can be further simplified and expanded as:

$\nabla_\theta V_k(s) = \nabla_\theta R_\theta(s, a^*) + \gamma\, \nabla_\theta V_{k-1}(s') = \nabla_\theta R_\theta(s, a^*) + \gamma\, \nabla_\theta R_\theta(s', a'^*) + \gamma^2\, \nabla_\theta V_{k-2}(s'')$   (3.9)

The key observation from Eq. 3.9 is that the gradient through the value function will only incorporate information about states that are reached by the best actions under the current reward function. In this work, we propose a modification to the value iteration algorithm that leads to more effective gradient values, which enhances the network training, particularly in the early training stages. Instead of using the value update from Eq. 3.6, we propose the following:

$V_{n+1}(s) = \sum_a Q_n(s,a)\, \frac{\exp(Q_n(s,a))}{\sum_{i \in A} \exp(Q_n(s,i))} = \mathbb{E}_{a \sim \tilde{\pi}_n} \left[ Q_n(s,a) \right]$   (3.10)

$\tilde{\pi}_n(a|s) = \frac{\exp(Q_n(s,a))}{\sum_{i \in A} \exp(Q_n(s,i))}$   (3.11)

This formulation can be interpreted as performing value iteration under the assumption of a probabilistic action policy $\tilde{\pi}_n$ rather than an optimal action policy (in this MDP formulation, the optimal action policy is the argmax over actions of the Q function, which is always deterministic). In the traditional formulation with deterministic state transitions, the reward gradient cannot carry information about action selections in sub-optimal successor states. However, we can see that if we attempt to take the gradient of Eq. 3.10 with respect to $\theta$, we will not be able to remove the sum over actions and corresponding successor states. As such, all possible future states and actions will (recursively) participate in the gradient. We speculate that this action model may also be beneficial for problems with probabilistic state transitions, although we suspect the benefits would be less pronounced.
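The sketch below contrasts the hard backup of Eq. 3.6 with the soft backup of Eq. 3.10 on a dense grid. It is an illustration only: the shift helper stands in for the convolutional transition model f_P, the per-cell reward stands in for the reward map from f_R, and the default values of gamma and k mirror the training parameters reported later for the Jezero Crater dataset.

    import numpy as np

    def soft_value_iteration(reward, shift, gamma=0.98, k=150, soft=True):
        """Run k iterations of (soft) value iteration on a grid.

        reward: (H, W) reward map.
        shift:  function shift(V, a) -> (H, W) array giving V at the successor
                cell reached from each cell by action a (deterministic moves).
        Returns the final V and Q estimates.
        """
        num_actions = 9                      # 4 straight + 4 diagonal + stay in place
        V = np.zeros(reward.shape)
        for _ in range(k):
            # Q_n(s, a) = R(s) + gamma * V_n(s') for the deterministic successor s'.
            Q = np.stack([reward + gamma * shift(V, a) for a in range(num_actions)],
                         axis=-1)
            if soft:
                # Soft backup (Eq. 3.10): expectation of Q under the softmax policy.
                Qs = Q - Q.max(axis=-1, keepdims=True)        # numerical stability
                pi = np.exp(Qs) / np.exp(Qs).sum(axis=-1, keepdims=True)
                V = (pi * Q).sum(axis=-1)
            else:
                # Hard backup (Eq. 3.6): max over actions.
                V = Q.max(axis=-1)
        return V, Q

In the network itself these operations are implemented with convolutions, as described in Figure 3.4, so that gradients with respect to the reward map flow through every iteration of the backup.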
3.4.6 Importance of Soft Policies in IRL

Inverse reinforcement learning in general has to deal with the problem that optimal policies are deterministic, as we see in the form of traditional value iteration. This is problematic for IRL algorithms, as it makes it difficult to calculate useful gradients that can be used to update a parameterized reward function. Our method and previous work (such as MaxEnt IRL [ZMBD08, WWP16]) aim at solving this problem by softening the policies.

Specifically, methods that rely on the principle of maximum entropy tend to make the policy class probabilistic, which then leads to a derivation of a practical gradient calculation [ZMBD08, WWP16]. These techniques, known as MaxEnt IRL, define their policy as proportional to the exponential of the expected discounted reward of full trajectories. In our work we address this challenge from a different perspective: backpropagation through a VIN allows us to calculate gradients of the deterministic optimal policy. As discussed in previous sections, when those gradients are not useful, we modify our policy class to be probabilistic by making it proportional to the exponential of expected discounted reward. In our approach, however, this stochasticity is added over local actions.

This new approach results in subtly different normalization across trajectories. In our system, trajectories of similar reward can receive different probability as a result of the graph structure. In other words, while the MaxEnt objective attempts to maximize the probability of the "demonstrated trajectories", our model attempts to maximize the probability of the "individual actions" in the trajectory demonstration. While all these soft policy strategies attempt to converge to the same point (i.e., to a deterministic policy that matches the demonstration), they have different reward contours around that optimal point. The SVIN work proposed here provides a new mathematical counterpoint for developing useful IRL gradients, paving the way for new options for future development.

3.5 Experiments

In this section, we demonstrate the performance of the proposed inverse reinforcement learning framework. We test our algorithm with two datasets. The first is a simplistic grid world dataset designed to show that the learning objectives and formulation of our algorithm are realistic. The second is a synthetic set of paths produced with actual Mars terrain data from expert terrain traversability labelings. All our datasets use a grid state space and a deterministic action model with 9 available actions corresponding to 4 straight neighbors, 4 diagonal neighbors, and 1 action to stay in place.

We track the performance of our models using two metrics, loss and accuracy. Loss refers to the standard softmax cross-entropy loss function described earlier. Loss is the function that is being optimized via gradient descent. Accuracy refers to the fraction of steps along the demonstrated path where optimal actions predicted by our network match the actions selected by the path data.
The accuracy metric is not differentiable and hence cannot be optimized directly, however it is a better proxy for the useful performance of the network than the loss metric. Therefore, we primarily track progress in the accuracy value during training. 53 Figure 3.5: Two example maps and paths from our grid world dataset. The star shows the goal position with one example path to that goal. The black cells are considered impassible and example paths are calculated with a standard search based planner. 3.5.1 Grid World The grid world dataset was designed to test the behavior of our algorithm in a simplistic environment that still required non-trivial planning behavior with significant look-ahead in the map. Details Each map has resolution of 32x32 cells. For each map we randomly choose start and goal positions and compute the shortest path. The full dataset consists of 1000 maps with approximately 50 paths per map. (The number of paths per map is approximate because occasionally a start and goal position are chosen for which there is no valid path in the map.) We split the grid world data by maps into 90% training and 10% test. Figure 3.5 shows a couple of sample paths on their corresponding maps from the grid world dataset. 54 Figure 3.6: Vector fields show optimal policy planning results on our grid world dataset. Learned policy actions are shown in red and the goal is denoted by a blue star. A few green arrows (highlighted in the blue circles) show places where the demonstrated actions from the training data deviated from our learned (red) policy. On the left we see a case where our algorithm did not perfectly predict the demonstration because of an identical length path that it took instead. In the map on the right, we can see a different form of error where some regions do not show valid policies that converge to the goal (bottom-right and bottom-left corners). This is an artifact of fixing the number of iterations (i.e., k) in VI. When k is too small, information about the goal location cannot propagate to the whole map. 55 Figure 3.7: Training curves show accuracy and loss on the test set against gradient steps on the grid world dataset. The model in orange uses a standard hard action model, the blue uses our proposed soft action model. Solid lines show a moving average of the shaded raw performance. Loss is described in Eq. 3.7. Accuracy is calculated as the percentage of grid cells in which the calculated optimal action and the action shown in the demon- stration match. Though the accuracy value is more intuitive and easier to track, it is not differentiable like the loss value which is the optimization objective during training. Note the substantial improvement of our SVIN algorithm over standard VIN in accuracy. Figure 3.8: Left: An example grid world obstacle map, the red star denotes the goal position. Middle: The corresponding reward map after training. Reward values are shown in grey scale with dark values being low reward. Note the slightly brighter goal position. Right: Value map produced by the SVIN algorithm. Grey values here show the expected discounted reward for all positions on the map. We see the shapes of the obstacles from the map, as well as ’shadows’ cast by the obstacles across regions of the state space made less accessible (from the goal) by their presence. 56 The reward network f R we use for grid world consists of 2 convolutional layers with 1x1 kernels (spatial context is not important for this problem). 
The first has 16 output channels and relu nonlinearity; the second has a single linear output channel. Results Figure 3.7 shows our training curves on the grid world dataset. While the standard VIN algorithm with a hard action model plateaus around 70-75% accuracy, our SVIN algorithm is able to reach a substantially higher level of performance on this dataset (89%). It is also worth noting that the optimal upper performance bound is likely less than 100% and thus our actual performance is more than 89% of the optimal solution. There are two reasons for this: First, the grid world environment often contains multiple paths of identical length, and, structurally, the network is not capable of learning tie-breaking preferences. Additionally, the network is constrained by the choice of hyper-parameters. In particular, k controls how far information can propagate through the network. Our grid world network is trained with k = 64, which is longer than most paths, but under some conditions still may not be enough to converge rewards across the whole map. Figure 3.6 illustrates the performance of our grid world network, as well as potential failure modes. The learned policy is shown in red. A few green arrows show where the training path deviates from the learned policy. In both maps the goal position can be identified as the major point of convergence of the vector field. The left map shows some instances of deviations of identical length. The right map has some distant areas of the map (bottom-right and bottom-left corners) that show clearly incorrect behavior. Figure 3.8 visualizes the behavior of a trained network. It shows the input map on the left, the result after passing through the reward network in the middle, and the output value map on the right. It can be seen that SVIN correctly identifies the goal position 57 Figure 3.9: Additional test examples from our Jezero Crater dataset, rendered in the same style as Figure 3.1. Left to right: satellite map, reward function, value function, satellite map with paths, example paths from our policy are shown in red, the demonstration path from the dataset is shown in yellow (rendered under the red paths). Row 1: Note how the trajectory instance started with the yellow demo path follows it closely at first, then deviates through a region the demo avoided. The upper right path has started in an area of the map that did not receive sufficient reward signal from the goal, and is thus showing undesirable behavior. Row 2 & 3: Here we see instances were the demo trajectory is simply a straight line, probably because the terrain labels were too coarse to pick out any features here. Nonetheless, our learned model had identified some meaningful terrain variations. Particularly in row 2 we see it steering through a smoother area to avoid a group of rocks. 58 Figure 3.10: Additional test examples from our Jezero Crater dataset, rendered in the same style as Figure 3.9. 59 Figure 3.11: Architecture of the reward network trained for the Jezero Crater dataset. as the brightest square in the reward map (middle graph). Also, in the value map (right graph), we see how the obstacles appear to cast shadows through the space. 3.5.2 Jezero Crater Jezero crater is a candidate landing site for the Mars 2020 rover, and as a part of the landing site evaluation process detailed maps have been created of the area along with a traversiblity tool that can produce minimum time paths between arbitrary waypoints [ORA + 16]. 
These traversibility calculations are based on heuristics written by expert rover drivers as a function of satellite-based terrain classification, rock abundance, and slope. We used this traversibility tool to generate a dataset of 9010 terrain tiles (each 64m square) containing a path from this tool. Approximately 10% of the dataset is held out as a test set. 60 Figure 3.12: Accuracy and loss training curves on the Jezero Crater dataset. Solid lines show a moving average of the shaded raw performance. Loss and accuracy are calculated here in the same way as the grid world dataset shown above. Training Parameters (discount factor) 0.98 k (VIN Iterations) 150 Optimizer Adam Learning Rate 0.0001 Batch Size 20 Table 3.1: Jezero Crater Training Parameters 61 A diagram of the reward network we used with this dataset is shown in Figure 3.11. Table 3.1 lists some important training parameters. Training on this dataset, our model approaches 75% accuracy, as shown in Figure 3.12. As in the grid world example, these accuracy numbers can be difficult to interpret, but to get a qualitative feel for what that 75% looks like, we have shown a representative set of examples of what the paths generated by our policy look like in Figure 3.1 and Figure 3.9. Although a probabilistic action model is used by SVIN for calculating the value function, the paths shown here are generated using a best action policy, so at test time the trajectories are not probabilistic. These figures can be compared with Figure 3.6, except in this case we have only rendered a collection of paths instead of the full vector field for visual clarity. A qualitative analysis of these results suggests that most of the time our network is generating policies that are goal directed and make an effort to avoid unsafe looking terrain. In some cases we have even seen instances where the policy from our network seems to outperform the training data (possibly due to relatively coarse terrain label data). This is surprising observation, that we hope will receive more scrutiny in future work. An important qualitative observation is that even when the shown paths are nearly identical, there can still be a substantial fraction of states where our policy makes a dif- ferent decision from the demonstration trajectory. This goes to suggest that the accuracy metric tracked during training can likely not be pushed all they way to 100%, and even the approximately 75% achieved by the model shown here may be close to the upper limit. A valuable direction for future work may be to develop a less noisy loss metric for training, that better represents policy similarity. 62 3.6 Discussion and future work In this work we have demonstrated an inverse reinforcement learning architecture using soft value iteration networks (SVIN) that can be applied to path planning problems with plan- etary rovers. We have analyzed how the SVIN formulation can improve training gradients for problems with deterministic state transition dynamics, and have seen that improvement empirically on our gridworld dataset. We also applied our technique to a new synthetic set of paths on real Mars terrain, and have seen how we can substantially approximate the performance of the expert planner that was used to generate the data, with some hints that it may even be possible to outperform the training data. 
In the future we think this technique could be used to generate reward networks trained on short paths to produce policies on much larger maps, a capability which would be key for mission planning and long range navigation applications. The algorithmic capability to do this kind of extension is a major strength of this approach, and it shows how by imitating the behavior of an existing planner, we can produce a system with more flexibility and capabilities than what was demonstrated. Available planning techniques don’t always fit the requirements of the way we want to use those planners, but the inverse reinforcement learning architecture we developed and demonstrated here has shown how we can accept diverse and specialized expertise, and generalize it into more flexible and informative tools. 63 Chapter 4 Plan-Space State Embeddings for Improved Reinforcement Learning We now look at a way to use the insight from planning algorithms to improve the perfor- manceofreinforcementlearning(RL)algorithmsthatmayhaveentirelydifferentobjectives. RL algorithms generally start from scratch, with no knowledge about the environment or reward objectives, but they are also very versatile and able to learn policies for otherwise difficult tasks. Although our planning algorithms may not be able to solve every prob- lem we care about with a particular robot, we have shown how we can learn from their demonstrations an alternative geometry for the state space of a robot, in the form of an embedding space. When we use that ’plan-space’ to augment the observed state of the robot we observe an improvement in the performance and reliability of some reinforcement learning algorithms. This chapter is primarily a reprint of [PS20]. 4.1 Introduction Recently, there has been a lot of interest in ways to solve difficult robotics problems using learning. Learning based approaches offer a lot of advantages in flexibility and adaptability, 64 and don’t require the engineer to offer much knowledge about the behavior or structure of the system. Reinforcement learning (RL), a common approach here, is used to solve a prob- lem from the ground up, only guided by a reward function and experience interacting with the world, with no knowledge of the way the world works. Recent advances in reinforcement learning have led to significant success in solving challenging problems, but usually only in spaces where it is possible to leverage enormous data collection capabilities, either with large numbers of physical robots or simulated environments. A lack of world knowledge makes longer horizons increasingly difficult for RL algorithms, so we are interested in look- ing for ways to incorporate knowledge from other sources to improve the performance of our RL learners. Substantial world knowledge is available in the form of existing planning techniques for robots that are able to reason directly about cause and effect relationships in manipulating the world. In many cases they may not be able to provide fully optimized solutions to planning problems, but they can offer a significant prior base of knowledge about what useful motions in an environment look like. We allow them to express this knowledge by providing a set of demonstration trajectories that show useful, goal directed actions in our environment. 
In this work we aim to make this knowledge available to an RL or other planning agent in the form of an embedding of the state space of the robot and environment such that the geometry of the embedding space has a desirable relationship to the demonstration trajectories. The idea of our embedding objective is that if we take a demonstrated trajectory, the path of that trajectory in the embedding space should be linear and constant velocity. Our system pursues this objective on short overlapping segments of the demonstrations, thus driving the full embedded demonstrations towards linearity as well. In this objective, an ideal representation would be one that creates a metric space for our system, with 65 distances representing actual manipulation distance in our system. It is also desirable to do this embedding in a variational framework as this will allow us to avoid the difficult problem of defining a distance metric that makes sense for our raw state space. Although these learned embedding spaces are likely to be imperfect against our objec- tive, we hypothesize that they provide a useful intermediate representation for RL algo- rithms that attempt to learn a policy function. The mechanism here could be a reduction in the learned function distance from our embedding to an ideal target function or smooth- ing of the reward surface enabling more reliable convergence during training. We test this empirically by training embedding networks for sample environments and then using standard RL algorithms to learn behaviors in these environments. We compare the train- ing performance of using our state embeddings against learning from the unmodified state observations and see significant improvements in training performance and reliability. An observation we make in our experiments is that in some problems, by augmenting the observed state with our embedding space we are able to significantly reduce the variance in training performance that results from changes in the initial random seed, which is a significant confounding factor in practical RL as well as in doing good RL research. In Section 4.3 we cover the math behind our formulation, and then address some im- plementation details in Section 4.4. We will then cover the results of our experiments in Section 4.5. 4.2 Related Work Our optimization objective is inspired by work in word embeddings (such as [MCCD13]) for languagemodelswherewordsaremappedtovectorsinalatentspace. Canonicaltechniques in this space use sequence data from natural language and set an optimization objective that words that occur sequentially in language should have embeddings that are nearly 66 co-linear in the latent space. We treat the sequence data from a robot trajectory in a similar way, and develop a new embedding formulation to meet the needs of our robot state embeddings, which we discuss further in Section 4.3. Robot learning from demonstrations exists in multiple other forms. Inverse reinforce- ment learning supposes that the demonstrations are drawn from a distribution optimizing somerewardfunction,andattemptstolearnthatrewardfunction[NR00][ZMBD08][PAS19]. A very similar problem form known as imitation learning or behavior cloning aims to learn to reproduce the behavior of the demonstrator, without explicitly learning the underly- ing reward function. 
Generative adversarial imitation learning (GAIL) [HE16][FCAL16] combines these ideas with inspiration from generative adversarial networks [GPAM+14] by training a discriminative cost function simultaneously with an RL agent. Our approach sits slightly outside these techniques in that while we assume the demonstrator has some form of optimality in movement costs, we do not attempt to learn the demonstrator's reward function per se, and indeed in our RL step we may drop in a reward function of our choice.

Embedding spaces have been applied before in reinforcement learning, such as in the form of action space embeddings to support skill transfer between tasks [HSW+18]. State embeddings, while less common in RL, have been used in robotics applications. In [IP19], Ichter and Pavone construct a way to train embedded state spaces with predictable dynamics in the latent space to enable sampling based planning techniques to function directly in the latent space. An important point of comparison for our work is the embed-to-control algorithm proposed in [WSBR15]. This algorithm uses a variational model to learn state embeddings supporting the objective of having locally linear state transition dynamics in the embedding space. Our approach also takes the high-level form of a variational auto-encoder [KW13], with some specific structure in the distribution of states and embedded states. Although the embed-to-control structure and objective is different from our own, they use some very similar mathematical tools for resolving underlying difficulties and learning the embedding.

4.3 Plan-Space State Embedding

In this section we describe our formulation for learning a plan-space state embedding. This embedding space is built on the concept of constraining demonstrated trajectories to be linear in the embedding space. We present an initial direct formulation of this objective, then show why optimizing that objective may be problematic, and fix those issues with a more sophisticated variational model.

4.3.1 Trajectory Linearity Model

We wish to learn an encoding of the raw state space such that locality in the embedding implies locality in manipulation or control distance. To do this we take advantage of the availability of a dataset of demonstration trajectories believed to be efficient, or ideally near optimal, and due to the optimal substructure property, we note this would also imply that snippets of these trajectories are efficient for their respective sub-goals (final position in the snippet). To satisfy our embedding objective, we attempt to constrain the demonstrated trajectories to form straight lines through our embedding space. We form a loss function on this by sampling evenly spaced triplets of states along a demonstrated trajectory, and rewarding the middle point for having an embedding value close to the average of the embeddings of the edge points.

4.3.2 Preliminaries

We define our robot state as $x \in X$, and we define our embedding space as $Z = \mathbb{R}^n$. The dimensionality $n$ of the embedding space will be a hyperparameter, and can in general be greater or less than the robot state dimensionality. We consider a trajectory to be a vector of states such as $x_0, \ldots, x_f$, and sampled triplets from trajectories are notated as $\mathbf{x} = (x_{t-1}, x_t, x_{t+1})$, although in general the states may be more than a single time step apart from each other. We also define the trajectory distribution of our training data as $\rho(\mathbf{x})$.
4.3.3 Direct State Encoding

A simple and direct form of this problem would be to use a modified auto-encoder structure. We define two parametric functions $h_{enc} : X \mapsto Z$ and $h_{dec} : Z \mapsto X$, for encoding and decoding respectively. We could then define our loss function as

$\mathbb{E}_{(x_{t-1}, x_t, x_{t+1}) \sim \rho} \left[ \left\| x_t - h_{dec}\!\left( \tfrac{1}{2} \left( h_{enc}(x_{t-1}) + h_{enc}(x_{t+1}) \right) \right) \right\|^2 \right]$   (4.1)

This defines our loss in terms of the $\ell_2$-norm of the difference in state values; however, it is not clear that this is a reasonable distance metric in the robot state space, particularly for spaces that include velocities or higher order inertial states. We could make this more general by allowing an arbitrary distance metric $d(x, y)$ in the state space as

$\mathbb{E}_{(x_{t-1}, x_t, x_{t+1}) \sim \rho} \left[ d\!\left( x_t,\; h_{dec}\!\left( \tfrac{1}{2} \left( h_{enc}(x_{t-1}) + h_{enc}(x_{t+1}) \right) \right) \right) \right]$   (4.2)

Unfortunately, in this formulation the distance metric $d$ becomes a powerful hyperparameter without any obviously good choices. In another sense, if we had access to a good distance metric on our space, we would already have a solution to the majority of our problem. In the next section, we propose a probabilistic approach that can be optimized without the need for a distance metric in the robot state space.

4.3.4 Variational State Encoding

Instead of modeling our embedding function as a deterministic function of the state $x$, we make the embedding probabilistic, so $h_{enc}(x) = z \sim q_\phi(z|x)$. In this way we will now be able to more flexibly model uncertainties in paths, as well as no longer having to specify a distance metric in the state space $X$.

Figure 4.1: Our variational model of path data.

We assume that our encoded states $z$ are drawn from a prior distribution $z \sim p(z) = \mathcal{N}(0, 1)$ and robot states $x$ are conditioned on the latent states as $x \sim p(x|z)$. When considering a trajectory triplet $(x_{t-1}, x_t, x_{t+1})$ we say that states $x_t \sim p(x_t|z_t)$ for $z_{t-1}, z_t, z_{t+1}$ respectively. However, while $z_{t-1}, z_{t+1} \sim p(z)$, $z_t$ is defined differently as $z_t = \frac{z_{t-1} + z_{t+1}}{2}$. We create an approximation of the encoding distribution $q_\phi(z|x)$. We also create an approximate inverse decoding distribution $p_\theta(x|z)$.

We see that when we define $\mathbf{x} = (x_{t-1}, x_t, x_{t+1})$ and $\mathbf{z} = (z_{t-1}, z_{t+1})$ our problem takes the form of a variational auto-encoder [KW13], where we wish to maximize the marginal log likelihood of the data:

$\log p_\theta(\mathbf{x}) = D_{KL}\!\left( q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}) \right) + \mathcal{L}(\theta, \phi; \mathbf{x})$   (4.3)

As with other variational methods ([KW13, WSBR15]), instead of optimizing this objective directly we optimize the variational lower bound $\mathcal{L}$, expanded below:

$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})} \left[ -\log q_\phi(\mathbf{z}|\mathbf{x}) + \log p_\theta(\mathbf{x}, \mathbf{z}) \right]$   (4.4)

$= -D_{KL}\!\left( q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}) \right) + \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})} \left[ \log p_\theta(\mathbf{x}|\mathbf{z}) \right]$   (4.5)

We choose to restrict $p_\theta$ and $q_\phi$ to the space of multivariate normal distributions with diagonal covariance matrices. They are parameterized with neural networks which produce $\mu$ and $\sigma$ as a function of $z$ and $x$ for $p_\theta$ and $q_\phi$ respectively. In this form we can use the generic stochastic gradient variational Bayes (SGVB) estimator from [KW13]. If we further approximate $q_\phi(\mathbf{z}|\mathbf{x})$ as

$\tilde{q}_\phi(\mathbf{z}|\mathbf{x}) = q_\phi(z_{t-1}|x_{t-1})\, q_\phi(z_{t+1}|x_{t+1})$   (4.6)

then we can analytically calculate the KL-divergence and use the lower variance form of the SGVB estimator. Unfortunately, in this case we have unbound a key constraint in our system, so we add an additional KL-divergence term between $\tilde{q}_\phi(z_t|\mathbf{x})$ and $q_\phi(z_t|x_t)$, modifying the lower bound as follows:

$\hat{\mathcal{L}} = -D_{KL}\!\left( \tilde{q}_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}) \right) + \mathbb{E}_{\mathbf{z} \sim \tilde{q}_\phi(\mathbf{z}|\mathbf{x})} \left[ \log p_\theta(\mathbf{x}|\mathbf{z}) \right] - \lambda\, D_{KL}\!\left( \tilde{q}_\phi(z_t|\mathbf{x}) \,\|\, q_\phi(z_t|x_t) \right)$   (4.7)

where $\lambda > 0$ is a training hyperparameter.
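The following is a compact sketch of how the modified lower bound of Eq. 4.7 could be assembled for a single state triplet, assuming hypothetical encode/decode functions that return the means and diagonal standard deviations of Gaussians. It is an illustration under those assumptions, not the released implementation, and it omits the variance correction for the averaged latent discussed in Section 4.4.2.

    import numpy as np

    def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
        # KL( N(mu_q, diag(sig_q^2)) || N(mu_p, diag(sig_p^2)) ), summed over dimensions.
        return np.sum(np.log(sig_p / sig_q)
                      + (sig_q**2 + (mu_q - mu_p)**2) / (2.0 * sig_p**2) - 0.5)

    def triplet_elbo(x_prev, x_mid, x_next, encode, decode, lam=0.5, rng=np.random):
        """Sketch of the modified lower bound (Eq. 4.7) for one state triplet.

        encode(x) -> (mu, sigma) and decode(z) -> (mu, sigma) stand in for the
        q_phi and p_theta networks; the names are illustrative only.
        """
        mu_a, sig_a = encode(x_prev)          # q_phi(z_{t-1} | x_{t-1})
        mu_b, sig_b = encode(x_next)          # q_phi(z_{t+1} | x_{t+1})
        mu_t, sig_t = encode(x_mid)           # q_phi(z_t | x_t)

        # KL of the factored encoder distribution against the unit Gaussian prior p(z).
        kl_prior = (gaussian_kl(mu_a, sig_a, 0.0, 1.0)
                    + gaussian_kl(mu_b, sig_b, 0.0, 1.0))

        # Distribution of z_t = (z_{t-1} + z_{t+1}) / 2 implied by the encoder
        # (derived in closed form below as Eq. 4.9).
        mu_hat = 0.5 * (mu_a + mu_b)
        sig_hat = np.sqrt(sig_a**2 + sig_b**2) / 2.0

        def sample(mu, sig):
            # Reparameterization trick: z = mu + sig * eps.
            return mu + sig * rng.standard_normal(mu.shape)

        def log_prob(x, mu, sig):
            # Diagonal Gaussian log-density of x under the decoder output.
            return np.sum(-0.5 * ((x - mu) / sig)**2 - np.log(sig)
                          - 0.5 * np.log(2.0 * np.pi))

        z_a, z_b = sample(mu_a, sig_a), sample(mu_b, sig_b)
        z_hat = 0.5 * (z_a + z_b)
        recon = sum(log_prob(x, *decode(z))
                    for x, z in ((x_prev, z_a), (x_mid, z_hat), (x_next, z_b)))

        # Constraint term keeping the averaged z_t consistent with q_phi(z_t | x_t).
        kl_mid = gaussian_kl(mu_hat, sig_hat, mu_t, sig_t)

        return -kl_prior + recon - lam * kl_mid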
These calculations are visualized in the information flow diagram in Figure 4.2. Note that in this diagram, $z$ values are shown where we internally track the distribution of the respective $z$'s, and these distributions are only sampled when necessary (shown as dotted lines). We note that $\tilde{q}_\phi(z_t \mid \mathbf{x})$ is a distribution over $z_t$, and we refer to values sampled from this distribution as $\hat{z}_t$.

Because we have assumed that our $z$'s are drawn from a normal distribution, we can analytically calculate the distribution $\tilde{q}_\phi(z_t \mid \mathbf{x})$, where $z_t$ is constrained to be the mean value of $z_{t-1}$ and $z_{t+1}$. Our embedding distribution $q_\phi$ gives us

$$z_{t-1} \sim \mathcal{N}(\mu_a, \sigma_a^2) \qquad z_{t+1} \sim \mathcal{N}(\mu_b, \sigma_b^2)$$ (4.8)

where $\mu$ and $\sigma$ are vectors in the dimensionality of our embedding space. The distribution of $\hat{z}_t = \frac{z_{t-1} + z_{t+1}}{2}$ is also a normal random variable as follows:

$$\hat{z}_t \sim \mathcal{N}\!\left( \frac{\mu_a + \mu_b}{2},\; \frac{\sigma_a^2 + \sigma_b^2}{4} \right)$$ (4.9)

Figure 4.2: Information flow diagram for our variational state embedding algorithm. The green boxes with $x$ values represent samples from our training dataset. The $z$ values are stored internally by their mean and variance, and sampled when necessary along the dotted lines. The double red arrows represent points where terms from our evidence lower bound are calculated. Not shown is the calculation of KL divergence of $z$ values from the embedding space prior.

4.3.5 Evaluation

While we optimize the modified lower bound above, we track performance of the embedding against a metric designed to measure whether the embedding space is creating the geometry we desire. We compare $\ell_2$ distances in the embedding space with the trajectory distance between points along a trajectory. To do this analysis we have to account for the fact that there could be an arbitrary scale difference between distances in the two spaces, so we find a scale factor to match the embedding space distances to the trajectory space distances. We calculate the distance integral along each of our demonstration trajectories, denoted $y_i$, and with a current snapshot of the embedding function $q_\phi$ calculate embedded distances $d_i = \| z_i^{init} - z_i^{final} \|$, where $z_i$ is the mean value from $q_\phi(z_i \mid x_i)$. For consistency we normalize the trajectory lengths $y_i$ by dividing them by the mean trajectory length. We then calculate the best matching scale factor $C$ by minimizing the following error term:

$$\sum_i (y_i - C d_i)^2$$ (4.10)

which can be calculated analytically as:

$$C = \frac{\sum_i d_i y_i}{\sum_i d_i^2}$$ (4.11)

We can then look at the error between the scaled embedded distances $C d_i$ and the demonstrated path distances $y_i$. During training of the state embedding we track the mean and standard deviation of the absolute values of these errors on our training dataset.
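This metric is straightforward to compute from a batch of trajectories. The sketch below (using numpy; the array and function names are illustrative) assumes the distance integrals $y_i$ and the embedded endpoint distances $d_i$ have already been collected.

```python
import numpy as np

def embedding_metric(traj_lengths, embedded_dists):
    """Evaluation metric from Section 4.3.5.

    `traj_lengths` holds the distance integral y_i along each demonstration,
    `embedded_dists` the l2 distance d_i between the mean embeddings of each
    trajectory's first and last states.
    """
    y = np.asarray(traj_lengths, dtype=float)
    d = np.asarray(embedded_dists, dtype=float)
    y = y / y.mean()                       # normalize trajectory lengths
    C = np.sum(d * y) / np.sum(d ** 2)     # best scale factor (equation 4.11)
    errors = np.abs(y - C * d)             # scaled embedded vs. path distances
    return errors.mean(), errors.std()
```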
4.4 Implementation

4.4.1 Kinodynamic Planner Demonstrations

Our embedding algorithm relies on a dataset of trajectory demonstrations, but it is agnostic to the source of these trajectories. As such, we might use any available planning algorithm, expert system, or even human demonstrations to gather this data. In this work we use the SST* algorithm [LLB16] implemented in the OMPL motion planning library [ŞMK12]. Our robot environments are specified in and simulated by the MuJoCo robot simulator [TET12]. To produce the necessary trajectory demonstrations for our algorithm, we built an interface layer between MuJoCo and OMPL. Our interface allows us to use MuJoCo in the inner loop of OMPL algorithms as the state propagator, enabling OMPL's kinodynamic planning algorithms to function on a wide variety of spaces specified for MuJoCo with minimal additional problem specification. We are sharing this interface code for academic use at https://github.com/mpflueger/mujoco-ompl.

4.4.2 Variational State Embedding

We implemented the variational form of our state encoder and are sharing our source code at https://github.com/mpflueger/plan-space-state-embedding. Our implementation maximizes the lower variance form of the SGVB estimator using the Adam algorithm [KB14].

Figure 4.3: This system diagram shows the workflow described in this work. We use a kinodynamic planner linked with MuJoCo environment simulation to generate a dataset of demonstration plans in our environment. The plan-space state embedding is trained on that dataset multiple times, and the best resulting encoding is selected to augment the observed state in reinforcement learning problems in our environment. [Diagram blocks: Kinodynamic Planner (SST*), Environment Specification (MuJoCo), Simulator (MuJoCo), Plan Demo Dataset, Plan-Space State Embedding Training, RL Problem Spec (Reward, Termination, Initial States), Reinforcement Learning (TRPO, PPO, etc.), Encoding Network, Select Best.]

For the experiments in this work we parameterized the encoder and decoder networks as fully connected nets with 2 hidden layers of 64 units per layer. The networks produce two sets of outputs, for $\mu$ and for the diagonal terms of the covariance matrix $\sigma$, and the output units are linear. To avoid numerical issues we define the $\sigma$ outputs as standard deviations rather than variances, and bound them to the range $(0, \infty)$ by using the activation function:

$$f(x) = \begin{cases} x + 1 & x \geq 0 \\ \exp(x) & x < 0 \end{cases}$$ (4.12)

We also observe that the distribution of $\hat{z}_t$ is systematically narrower than the distributions of $z$ otherwise predicted by $q_\phi$, as a result of it being distributed as the average of $z_{t-1}$ and $z_{t+1}$. This could create a systematic error in learning, because in equation 4.7 we place a cost on the distributional match of $\tilde{q}_\phi(z_t \mid \mathbf{x})$ to $q_\phi(z_t \mid x_t)$, despite them having the same underlying parameters with $x$'s drawn from the same distribution. From equation 4.9 we can see that the variance of $\hat{z}_t \sim \tilde{q}_\phi(z_t \mid \mathbf{x})$ is expected to be half the variance of $z \sim q_\phi(z \mid x)$, yet we are constraining them to come from the same distribution. We correct this issue by doubling the variance of $\hat{z}_t$ to match the distribution from $q_\phi(z \mid x)$. Lambda ($\lambda$) from equation 4.7 was set to 0.5.

Before training begins, the dataset of demonstration trajectories is transformed into a set of state triplets. The triplets have even but varied spacing along the trajectories; we used step differences of [1, 3, 5, 10, 30]. Figure 4.4 shows the training curves for our plan-space state embedding, using 5 seeds in each configuration. We select the best trained model for evaluation in RL environments based on performance on the metric defined in Section 4.3.5.
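A minimal sketch of this encoder parameterization is given below (in PyTorch, which is an assumption about the framework; the class names are illustrative). The hidden nonlinearity is not specified in the text, so tanh is assumed here; the bounded activation implements equation 4.12.

```python
import torch
import torch.nn as nn

class StdActivation(nn.Module):
    """Bounded positive activation from equation 4.12: x + 1 for x >= 0, exp(x) for x < 0."""
    def forward(self, x):
        # Clamp the argument of exp so the unused branch cannot overflow.
        return torch.where(x >= 0, x + 1.0, torch.exp(torch.clamp(x, max=0.0)))

class Encoder(nn.Module):
    """Sketch of the encoder for q(z|x): two hidden layers of 64 units, with
    linear heads for mu and for the per-dimension standard deviation sigma."""
    def __init__(self, state_dim, z_dim):
        super().__init__()
        # The hidden nonlinearity is an assumption; the text only fixes the layer sizes.
        self.body = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        self.mu_head = nn.Linear(64, z_dim)
        self.sigma_head = nn.Linear(64, z_dim)
        self.std_act = StdActivation()

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.std_act(self.sigma_head(h))

# Example: a 3 dimensional (low-Z) embedding of the 4 dimensional cartpole state.
encoder = Encoder(state_dim=4, z_dim=3)
mu, sigma = encoder(torch.zeros(1, 4))
```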
4.4.3 RL in Embedding Space

To evaluate how using this embedding space affects the performance of reinforcement learning (RL) algorithms, we chose benchmark RL environments where we can modify the observed state space sent to the RL algorithm while keeping other aspects of the system the same. To facilitate this, we used the implementations of trust region policy optimization (TRPO) [SLA+15] and proximal policy optimization (PPO) [SWD+17] provided by the garage library [The19], and created custom environment wrappers using the pre-trained state encoders from our variational state embedding.

Our environment wrappers always return the mean value as the embedded state. Additionally, we can look at the effect of completely replacing the normal state space, or of augmenting it by appending our embedded state vector onto the existing state vector.
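The wrappers themselves are thin. Below is a minimal sketch in the style of a Gym ObservationWrapper; the actual experiments used garage's environment interface, the class name is illustrative, and the sketch assumes the encoder's input is the full raw observation (which holds for the raw cartpole, but not for environments where the embedding sees only a subset of the observation).

```python
import numpy as np
import gym

class EmbeddingAugmentWrapper(gym.ObservationWrapper):
    """Append the mean embedding of the raw state to the observation.

    `encoder` is assumed to be a callable mapping a raw state vector to the
    mean of q(z|x) as a numpy array of length `z_dim`.
    """
    def __init__(self, env, encoder, z_dim):
        super().__init__(env)
        low = np.concatenate([env.observation_space.low, np.full(z_dim, -np.inf)])
        high = np.concatenate([env.observation_space.high, np.full(z_dim, np.inf)])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)
        self.encoder = encoder

    def observation(self, obs):
        z_mean = self.encoder(obs)
        return np.concatenate([obs, z_mean])
```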
4.5 Experiments

4.5.1 Environments

In our experiments we used two primary environments, each with two variants. Our environments are based on standard environments from the OpenAI Gym project; however, in each case we also created modified environments to make the tasks more challenging.

The cartpole environment has a cart that can move along a 1 meter track, with an uncontrolled spinning pole attached to it. The cartpole environment has a 4 dimensional state, including the position and velocity of both the cart and the pole. The cart is controlled by choosing an applied force along the track (a continuous value), with an upper limit on the available force. Two RL tasks are defined for the cartpole environment. In the first task the cart starts near the center of the track with the pole nearly upright, and reward is given for keeping the pole upright as long as possible. The second, much more difficult task starts the cart and pole at random locations with small random velocities, and reward is given for swinging the pole into an upright position and keeping it there. We refer to this variant as the cartpole swingup task.

The reacher environment consists of a 2-link robot arm operating in a plane, with the objective of moving the tip of the second link to a particular position. The target position is encoded as part of the state space of the environment as the desired $x, y$ position of the end point of the robot arm. The robot is controlled by choosing bounded joint torques. Our two reacher variants differ in the observation space presented to the policy function. The initial version from Gym observes the sine and cosine of the joint angles, which we refer to as reacher trig. We also created a version of the task that instead observes the joint angles directly as radians (reacher raw). Both versions observe joint velocities. In both cases, the state input to our embedding function is the raw joint angles and joint velocities (a 4 dimensional space). Note that our state embedding does not observe the target location, so although we include some performance curves of the embedding replacing the normal observation space, the target is not observable in these cases and goal directed behavior is not possible.

4.5.2 Embedding Performance

In Figure 4.4 we show training curves for our plan-space state embeddings for the cartpole and reacher environments. The plots show performance on our evaluation metric from Section 4.3.5 and the variational lower bound $\hat{\mathcal{L}}$ from equation 4.7. For each environment we train with our embedding space Z-dimension set to 10 (the high-Z configuration) and set to 3 (the low-Z configuration). Note that both the cartpole and reacher environments have a native dimensionality of 4. In each configuration we run training 5 times with different random seeds, and choose as our final embedding for each configuration the one with the best performance as measured by our metric from Section 4.3.5.

Figure 4.4: Training curves for our plan-space state embedding in the cartpole and reacher environments. [Four panels: Cartpole PSSE Training - Mean Metric Error; Cartpole PSSE Training - Variational Lower Bound; Reacher PSSE Training - Mean Metric Error; Reacher PSSE Training - Variational Lower Bound.] Solid curves show a rolling window average, with shaded regions showing the range of the unsmoothed curve. 'highz' curves train a 10 dimensional embedding space; 'lowz' curves train a 3 dimensional embedding space. The plots on the left show performance on the embedding metric described in Section 4.3.5. The plots on the right show performance on the variational lower bound described in equation 4.7. Each curve shows training for a separate seed value, and the seed that achieves the best performance on the embedding metric (left) is selected for testing with RL algorithms. The variational lower bound converges noisily but consistently, while there is significant seed variation on the metric.

We see from the training curves for our state embeddings that although loss on the optimization objective $\hat{\mathcal{L}}$ converges consistently, there is still significant seed noise in performance on our evaluation metric. Because this is an offline process, it is easy to deal with this seed noise by training multiple seeds in parallel and selecting the best one; however, we believe that in future work it may make sense to examine the convergence characteristics of this algorithm more closely in order to get more consistent performance.

4.5.3 RL Performance

We examine the performance benefit of using our algorithm with two reinforcement learning algorithms, trust region policy optimization (TRPO) and proximal policy optimization (PPO). We use the reference implementations of these algorithms provided by the garage project [The19].
TRPO and PPO are both policy gradient algorithms, but with different underlying mechanisms. Policy representations were parameterized as neural networks with two fully connected hidden layers of 32 units. Experiments were run with a discount factor of 0.99 for 100 time steps, except for the cartpole swingup, which was given 200 time steps (to allow enough time to swing up and demonstrate balance of the pole). Figure 4.5 shows training curves for TRPO and Figure 4.6 shows the same curves for PPO. All of the plots show average (un-discounted) performance across the training runs.

Figure 4.5: Training curves for the cartpole and reacher environments using trust region policy optimization (TRPO). [Four panels: TRPO Cartpole Swingup; TRPO Cartpole (standard); TRPO Reacher (trig observations); TRPO Reacher (raw observations).] Solid lines show the mean value of smoothed performance across 15 random seeds. Shaded regions show the range from the second worst performing curve to the second best at each training step. Dashed lines show the smoothed performance of the training run that reached the best performance at any point in training. High-Z curves used a 10 dimensional embedding space, while low-Z used a 3 dimensional embedding space. (Both cartpole and reacher have 4 dimensional raw state spaces.) For 'embedded augment' curves the RL agent observes both the robot state and the embedding; 'embedded' curves observe only the embedding space. While in general the embedding spaces achieve best case performance similar to the raw states (in blue), we see significant improvements in reduced seed variance, particularly with the low-Z embeddings (green).

Figure 4.6: Training curves for our cartpole and reacher environments using proximal policy optimization (PPO). [Four panels: PPO Cartpole Swingup; PPO Cartpole (standard); PPO Reacher (trig observations); PPO Reacher (raw observations).] Conventions follow Figure 4.5; please see that caption for further detail. In our experiments PPO generally under-performed TRPO, especially on the reacher, where none of the spaces produced particularly strong policies. In the standard cartpole we see similar results to TRPO, where most spaces find nearly perfect policies, but the cartpole swingup task shows a situation where our embedding spaces produce a slight improvement in average cases and a substantial improvement in the best cases over learning with raw state values.

The cartpole swingup task in particular has some observable qualitative performance levels: policies that achieve near 100 can quickly swing up the pole from any initial state and maintain stable balance for the rest of the episode. In the 0 to -100 range are policies that can swing up the pole but fail to achieve stable balance. In the -200 to -300 range are policies that are putting energy into the pole, but not in a controlled way. In the -600 range are policies that have learned to zero their controls to avoid the control penalty and do nothing. For the reacher task it can generally be said that scores around -6 are effectively achieving the task of reaching to the target point, whereas scores below -14 do not appear to demonstrate significant goal directed behavior.

In the standard cartpole task (where the pole starts in a nearly upright position with low velocity) the baselines are extremely strong, achieving near perfect performance almost immediately.
In these cases augmenting with the embedded state appears to be neutral or slightly destabilizing. Using the embedded state only can still achieve good performance, especially with TRPO, though not as good as the baseline.

In the cartpole swingup task, augmenting with the embedded state offers a significant performance improvement for both TRPO and PPO. With TRPO we see a slightly faster rise time, better average performance from both high-Z and low-Z embeddings, similar best case performance, and reduced seed variance towards the end. With PPO we see substantially better best and second best case performance from both embeddings, and slightly improved average performance, though PPO generally under-performs TRPO in our experiments.

For our reacher tasks with TRPO, we see much smaller variations. With trigonometric observations the baseline reacher appears difficult to outperform: our low-Z embedding is even with it, while the high-Z embedding underperforms early and then catches up. With angular (raw) observations the low-Z embedding offers some early benefits in reduced variance, though high-Z slightly underperforms. With PPO both embeddings appear to slightly underperform, and best case results are mixed; absolute performance on the PPO reacher tasks is poor compared to TRPO.

When training policies that observe only the embedding we generally do not get good performance, which suggests either that these state embeddings are not encoding enough information to fully observe the system state, or that this learning problem is particularly difficult. In the case of the cartpole, some system state can clearly be inferred, given the respectable (though sub-baseline) performance on the standard cartpole task. Also, as noted earlier, the reacher tasks have a changing goal position that is not observed in our embedding-only experiments, so we do not expect to see good performance there.

Overall, we see a couple of patterns in performance when augmenting with embedding spaces. On tasks where the baseline algorithm is already very strong and reliable we don't see improvements, which is reasonable as there may not be much room for improvement. The standard cartpole is firmly in this category, and trigonometric observations of the reacher represent a highly engineered feature space, which may be difficult to improve upon. We see the most improvement on tasks where the baseline can achieve good performance but does not do so reliably, which clearly describes the cartpole swingup and, to a lesser extent, the reacher with angular (raw) observations.

4.6 Conclusion & Discussion

In this work we have proposed a new form of state embedding to learn the geometry of the plan-space of an environment from expert demonstrations, in this case supplied by a motion planner. We have shown how to train these embeddings, and that once trained they appear to improve the performance and reliability of policy gradient RL algorithms when used to augment the observed state space.

Future work may look at the impact on a greater variety of RL algorithms, as well as at more varied problem spaces. The experiments in this work have formulated our embedding from an already observed joint state space; however, it could also be formulated as a function of a more difficult to interpret observation space such as images, where it might also offer particular advantages in encoding information about the environment beyond robot joint states.
Another interesting direction for future work could be to evaluate the usefulness of this technique with human demonstrations; this could be particularly relevant to complex problems where effective planners are not readily available.

The performance improvements we have observed in RL problems suggest the need for a better understanding of the reward contours of learned policy functions. Since our embedding space is a function of the observed state space in these example problems, it could be interpreted as an expansion of the expressiveness of the policy network through a pretrained sub-network. Although we offer some intuition as to why our embedding objective should produce a useful embedding function, the field lacks a more principled understanding of the need for expressiveness in policy functions.

This work has demonstrated how planning algorithms can provide insight into a latent geometry of a robot state space in which good trajectories have simple shapes. Our experiments show that these plan-space state embeddings are useful in improving the reliability of reinforcement learning on some benchmark problems. The RL algorithms that can benefit from this embedding technique are much more general than the planners that produce our demonstrations, and at run time are often much faster, showing how the insights from planning can be leveraged to increase the capabilities of other types of algorithms.

Chapter 5

Summary and Conclusion

This dissertation has presented three projects built around the idea of extracting insights from planning algorithms in ways that enable new capabilities beyond what our baseline planners are capable of. Our work on multi-step planning showed a way to use existing planners as elements in a hierarchical framework that could search more of the reachable state space than any individual planner could address. We can also treat planners as demonstrators and attempt to learn the cost function implicit in their behavior via inverse reinforcement learning, and our work in Chapter 3 showed how to develop a useful formulation of that problem for planar path planning tasks and tested the performance of that formulation for a planetary rover. Finally, we showed how the trajectories produced by path planners imply a geometry in the state space of the robot that we can learn using our plan-space state embedding algorithm. Our evaluations have shown that this embedding space can be useful in improving the reliability of reinforcement learning algorithms working with these robots.

As the field of robotics moves forward to address new and more challenging problems, we may find it useful to remember that we don't have to start from scratch on many of the systems we would like to work with. With clever constructions we can augment our machine learning algorithms to take advantage of the problems we already know how to solve, and hopefully retain the reliability of our baseline techniques on simple problems while we expand our capabilities to include more complex problems and more versatile solutions.

References

[ASN+14] Sandip Aine, Siddharth Swaminathan, Venkatraman Narayanan, Victor Hwang, and Maxim Likhachev. Multi-heuristic A*. In Robotics: Science and Systems (RSS), 2014.

[BBS+10] James Andrew Bagnell, David Bradley, David Silver, Boris Sofman, and Anthony Stentz. Learning for autonomous navigation. IEEE Robotics & Automation Magazine, 17(2):74–84, 2010.

[Bel57] Richard Bellman. Dynamic programming. Princeton University Press, Princeton, NJ, 1957.
[Ber95] Dimitri P Bertsekas. Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 1995.

[BKLP13] John Barry, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. A hierarchical approach to manipulation with diverse actions. In IEEE International Conference on Robotics and Automation (ICRA), pages 1799–1806. IEEE, 2013.

[BLLR05] Tim Bretl, Sanjay Lall, Jean-Claude Latombe, and Stephen Rock. Multi-step motion planning for free-climbing robots. In Algorithmic Foundations of Robotics VI, pages 59–74. Springer, 2005.

[DMA+11] Myron A Diftler, JS Mehling, Muhammad E Abdallah, Nicolaus A Radford, Lyndon B Bridgwater, Adam M Sanders, Roger Scott Askew, D Marty Linn, John D Yamokoski, FA Permenter, et al. Robonaut 2 - the first humanoid robot in space. In IEEE International Conference on Robotics and Automation (ICRA), pages 2178–2183. IEEE, 2011.

[FCAL16] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

[GAD+16] D Gaines, R Anderson, G Doran, W Huffman, H Justice, R Mackey, G Rabideau, A Vasavada, V Verma, T Estlin, et al. Productivity challenges for Mars rover operations. In Proceedings of the 4th Workshop on Planning and Robotics (PlanRob), pages 115–125, London, UK, 2016.

[GGC+16] Alessandro Giusti, Jérôme Guzzi, Dan C Cireşan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2016.

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[HBL05] Kris Hauser, Tim Bretl, and J-C Latombe. Learning-assisted multi-step planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 4575–4580. IEEE, 2005.

[HE16] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[HNR68] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.

[HP17] B Hockman and M Pavone. Stochastic motion planning for hopping rovers on small solar system bodies. In Proceedings of ISRR, 2017.

[HSW+18] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.

[IKW07] Roman Ilin, Robert Kozma, and Paul J Werbos. Efficient learning in cellular simultaneous recurrent neural networks - the case of maze navigation problem. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), pages 324–329. IEEE, 2007.

[IP19] Brian Ichter and Marco Pavone. Robot motion planning in learned latent spaces. IEEE Robotics and Automation Letters, 4(3):2407–2414, 2019.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[KCT+11] Mrinal Kalakrishnan, Sachin Chitta, Evangelos Theodorou, Peter Pastor, and Stefan Schaal. STOMP: Stochastic trajectory optimization for motion planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 4569–4574. IEEE, 2011.

[KF11] Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7):846–894, 2011.

[KHL17] Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-Net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pages 4697–4707, 2017.

[KL00] James J Kuffner and Steven M LaValle. RRT-connect: An efficient approach to single-query path planning. In IEEE International Conference on Robotics and Automation, volume 2, pages 995–1001. IEEE, 2000.

[KL05] Sven Koenig and Maxim Likhachev. Fast replanning for navigation in unknown terrain. IEEE Transactions on Robotics, 21(3):354–363, 2005.

[KOB10] Dov Katz, Andreas Orthey, and Oliver Brock. Interactive perception of articulated objects. In 12th International Symposium on Experimental Robotics, Delhi, India, December 2010.

[KSLO96] L.E. Kavraki, P. Svestka, J.-C. Latombe, and M.H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996.

[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[LHP+15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[LK01] Steven M LaValle and James J Kuffner. Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5):378–400, 2001.

[LLB16] Yanbo Li, Zakary Littlefield, and Kostas E Bekris. Asymptotically optimal sampling-based kinodynamic planning. The International Journal of Robotics Research, 35(5):528–564, 2016.

[LP03] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[MKS+15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[NPM+14] Srinivas Nedunuri, Sailesh Prabhu, Mark Moll, Swarat Chaudhuri, and Lydia E. Kavraki. SMT-based synthesis of integrated task and motion plans from plan outlines. In IEEE International Conference on Robotics and Automation, 2014.

[NR00] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670. Morgan Kaufmann, 2000.

[ORA+16] Masahiro Ono, Brandon Rothrock, Eduardo Almeida, Adnan Ansar, Richard Otero, Andres Huertas, and Matthew Heverly. Data-driven surface traversability analysis for Mars 2020 landing site selection. In IEEE Aerospace Conference, 2016.

[PAS19] Max Pflueger, Ali Agha, and Gaurav S Sukhatme. Rover-IRL: Inverse reinforcement learning with soft value iteration networks for planetary rover path planning. IEEE Robotics and Automation Letters, 4(2):1387–1394, 2019.

[PS15] Max Pflueger and Gaurav S Sukhatme. Multi-step planning for robotic manipulation. In IEEE International Conference on Robotics and Automation (ICRA), pages 2496–2501. IEEE, 2015.

[PS20] Max Pflueger and Gaurav S Sukhatme. Plan-space state embeddings for improved reinforcement learning. arXiv preprint arXiv:2004.14567, 2020.

[RSB09] Nathan D Ratliff, David Silver, and J Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.

[RZBS09] Nathan Ratliff, Matt Zucker, J Andrew Bagnell, and Siddhartha Srinivasa. CHOMP: Gradient optimization techniques for efficient motion planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 489–494. IEEE, 2009.

[SBS10] David Silver, J Andrew Bagnell, and Anthony Stentz. Learning from demonstration for autonomous navigation in complex unstructured terrain. The International Journal of Robotics Research, 29(12):1565–1592, 2010.

[SC13] Ioan A Sucan and Sachin Chitta. MoveIt! Available online: http://moveit.ros.org, 2013.

[SFR+14] Siddharth Srivastava, Eugene Fang, Lorenzo Riano, Rohan Chitnis, Stuart Russell, and Pieter Abbeel. Combined task and motion planning through an extensible planner-independent interface layer. In IEEE International Conference on Robotics and Automation, 2014.

[SLA+15] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[SLCS04] Thierry Siméon, Jean-Paul Laumond, Juan Cortés, and Anis Sahbani. Manipulation planning with probabilistic roadmaps. The International Journal of Robotics Research, 23(7-8):729–746, 2004.

[ŞMK12] Ioan A. Şucan, Mark Moll, and Lydia E. Kavraki. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. http://ompl.kavrakilab.org.

[SML+15] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[SSB11] J. Sturm, C. Stachniss, and W. Burgard. A probabilistic framework for learning kinematic models of articulated objects. Journal of Artificial Intelligence Research (JAIR), 41:477–626, August 2011.

[Ste94] Anthony Stentz. Optimal and efficient path planning for partially-known environments. In IEEE International Conference on Robotics and Automation (ICRA), pages 3310–3317. IEEE, 1994.

[Ste95] Anthony Stentz. The focussed D* algorithm for real-time replanning. In International Joint Conference on Artificial Intelligence (IJCAI), volume 95, pages 1652–1659, 1995.

[SvHH+17] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2017.

[SWD+17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[TET12] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[The19] The Garage Contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.

[TWT+16] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.

[WOP15] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.

[WSBR15] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.

[WWP16] Markus Wulfmeier, Dominic Zeng Wang, and Ingmar Posner. Watch this: Scalable cost-function learning for path planning in urban environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2089–2095. IEEE, 2016.

[ZMBD08] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, Chicago, IL, USA, 2008.
Abstract
Solving complex robotic problems often involves reasoning about behavior multiple steps in the future, yet many robot learning algorithms do not incorporate planning structures into their architecture. In this dissertation we show how we can harness the capabilities of planning algorithms to learn from the structure of the robotic problems we wish to solve, thus expanding beyond what was available from baseline planners. We consider problems in multi-arm manipulation planning, path planning for planetary rovers, and reinforcement learning for torque controlled robots, and show how in each case it is possible to learn from the behavior of planning algorithms that are limited and unable to solve the full generalized problem. Despite not being full solutions, these planners provide useful tools and insights that can be leveraged in larger solutions.

In multi-step planning for manipulation we develop a high level planner that can find solutions in difficult spaces by solving sub-problems in sub-spaces of the main planning space. For planetary rovers we show how to use inverse reinforcement learning to learn a new planning algorithm that can function on different (and generally cheaper) input data. Reinforcement learning algorithms often suffer from unstable or unreliable training; we show how this can be mitigated by augmenting the robot state with a state embedding space learned from planner demonstrations.

Planning and control algorithms often rely on rigid and prescribed assumptions about the nature of robot problems, which may not be suitable for the generalized and versatile robot systems we wish to build. However, as this dissertation argues, those structures are still useful in informing the behavior of more flexible families of algorithms.