Learning Objective Functions for
Autonomous Motion Generation
by
Mrinal Kalakrishnan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
Committee:
Prof. Stefan Schaal (Chair) Computer Science
Prof. Gaurav S. Sukhatme Computer Science
Prof. Francisco Valero-Cuevas Biomedical Engineering
May 16, 2014
Copyright 2014 Mrinal Kalakrishnan
Table of Contents
Abstract

Chapter 1  Introduction
  1.1  Contributions and Outline

Chapter 2  Background and Related Work
  2.1  Optimal Control
  2.2  Motion Planning for Robotic Manipulation
  2.3  Inverse Reinforcement Learning
  2.4  Rough Terrain Locomotion

Chapter 3  Stochastic Trajectory Optimization for Motion Planning
  3.1  Introduction
    3.1.1  Structure
  3.2  The STOMP Algorithm
    3.2.1  Cost Function
    3.2.2  Stochastic Optimization
    3.2.3  Path Integral Stochastic Optimal Control
    3.2.4  Exploration
    3.2.5  Trajectory updates
    3.2.6  Noise adaptation
  3.3  Experiments on 2D environments
    3.3.1  Synthetic 2-D Cost Maps
    3.3.2  Parameter Sensitivity Tests
  3.4  Motion Planning for a Robot Arm
    3.4.1  Cost Function
    3.4.2  Simulation Results
    3.4.3  Real Robot
    3.4.4  Discussion
  3.5  Learning Force Control Policies for Compliant Manipulation
    3.5.1  Learning Force Feedback Control
    3.5.2  Experiments
  3.6  Conclusions

Chapter 4  Learning Objective Functions for Manipulation
  4.1  Introduction
  4.2  Path Integral Inverse Reinforcement Learning
    4.2.1  Path Integral Inverse Reinforcement Learning
    4.2.2  Regularization and Efficient Optimization
  4.3  Learning Inverse Kinematics
    4.3.1  Problem Definition
    4.3.2  Demonstrations
    4.3.3  Features
    4.3.4  Sampling
  4.4  Learning Optimal Motion Planning
    4.4.1  Problem Setup
    4.4.2  Demonstrations
    4.4.3  Features
    4.4.4  Sampling
  4.5  Results
  4.6  Conclusion

Chapter 5  Learning Rough Terrain Locomotion Using Terrain Templates
  5.1  Introduction
  5.2  Background
    5.2.1  Learning Locomotion
    5.2.2  Experimental Setup
    5.2.3  Testing and Evaluation Metrics
  5.3  Control Architecture
    5.3.1  Terrain Reward Map Generation
    5.3.2  Approximate Body Path Planner
    5.3.3  Footstep Planner
    5.3.4  Pose Finder
    5.3.5  Body Trajectory Generator
    5.3.6  Foot Trajectory Planner
    5.3.7  Dynamic Motions for Extreme Terrain
    5.3.8  Sensor Preprocessing
    5.3.9  Execution and Control
  5.4  Learning a Ranking Function for Foothold Selection using Terrain Templates
    5.4.1  Problem Formulation
    5.4.2  Collecting Expert Demonstrations
    5.4.3  Learning From Expert Demonstrations
    5.4.4  Terrain Templates
    5.4.5  Template Extraction
    5.4.6  Template Learning
    5.4.7  Results
    5.4.8  Discussion
  5.5  Evaluations
    5.5.1  Learning Locomotion Test Results
  5.6  Discussion
  5.7  Conclusion

Chapter 6  Conclusions and Future Work
  6.1  Reinforcement Planning
  6.2  Simultaneous Motion and Force Planning

BIBLIOGRAPHY
Abstract
Planning and optimization methods have been widely applied to the problem of trajectory
generation for autonomous robotics. The performance of such methods, however,
is critically dependent on the choice of objective function being optimized, which is
non-trivial to design. On the other end of the spectrum, efforts on learning autonomous
behavior from user-provided demonstrations have largely been focused on reproducing
behavior similar in appearance to the provided demonstrations.
known as Inverse Reinforcement Learning (IRL), is to learn an objective function that
the demonstrations are assumed to be optimal under. With the help of a planner or
trajectory optimizer, such an approach allows the system to synthesize novel behavior in
situations that were not experienced in the demonstrations.
We present novel algorithms for IRL that have successfully been applied in two real-
world, competitive robotics settings: (1) In the domain of rough terrain quadruped loco-
motion, we present an algorithm that learns an objective function for foothold selection
based on "terrain templates". The learner automatically generates and selects the ap-
propriate features which form the objective function, which reduces the need for feature
engineering while attaining a high level of generalization. (2) For the domain of autono-
mous manipulation, we present a probabilistic model of optimal trajectories, which results
in new algorithms for inverse reinforcement learning and trajectory optimization in high-
dimensional settings. We apply this method to two problems in robotic manipulation:
redundancy resolution in inverse kinematics, and trajectory optimization for grasping
and manipulation. Both methods have proven themselves as part of larger integrated
systems in competitive settings against other teams, where testing was conducted by an
independent test team in situations that were not seen during training.
Chapter 1
Introduction
The last fifty years have seen the introduction, acceptance, and wide prevalence of several
life-changing technologies. The advent of personal computers, mobile phones, and the
Internet have completely transformed society as we know it. Despite being the dream
of science fiction writers, scientists, and technologists for years, robots that are truly
autonomous are unfortunately not yet common-place. An autonomous, general-purpose
robot capable of operating in human environments would be tremendously beneficial to
society at large. The wide-spread use of robots still appears to be confined to factory
settings, where their environment is tightly controlled and is considered deterministic.
These robots are capable of achieving high speed and precision in repetitive tasks such as
assembly, pick and place, and welding. Our day-to-day human environment, on the other
hand, tends to be fairly unstructured. Repeated execution of pre-programmed motions
is clearly not sufficient any more. Robots operating in the real world need to be equipped
with a variety of sensors which help them estimate the state of the world around them.
They need to reason about, and synthesize the motions required to change the state
of the world as desired. Finally, they need to generate control commands to execute
these motions, while continually monitoring the state of execution and adapting to any
deviations and newly introduced uncertainties. Biological systems appear to carry out all
of these operations effortlessly. The complexity involved in each of these steps tends to
be taken for granted, until we try to program a robot to act similarly. In essence, we are
trying to reproduce in a machine, the complex behavior of life-forms that are the product
of more than 3 billion years of evolution. This turns out to be a challenging task.
In this thesis, we focus on the problem of generating appropriate control commands
for an autonomous robot, given the situation that it is faced with. Even for a seemingly
simple task like grasping a cup with a robotic arm, the immense range of possible states
of the world, and uncertainties in estimating this state make this problem challenging. It
would be intractable for robot programmers to hand-craft specialized control policies for
all possible scene and task variations that might be expected. Addressing this problem
has been a key area in robotics research over the last few decades.
Learning control policies from demonstration has been an appealing approach to train
autonomous robots [4]. Providing a demonstration of a particular task tends to be a lot
simpler than designing the appropriate control policy by hand. However, the problem of
inferring a general control policy from a few demonstrations is hard. Particularly, the
resulting control policies do not necessarily generalize well to new situations in which
no demonstrations were provided. Quite often, such learning methods impose a specific
method of generalization [33], which may or may not generalize well enough for the task
at hand.
Optimal control and related fields like reinforcement learning have been widely applied
to the problem of generating control policies and trajectories for autonomous robots.
These methods allow the system designer or user to specify the desired task in terms of
an objective function. The objective function is a succinct representation of the problem
at hand, and generally applies to several instances of the problem under many different
conditions. Optimal control algorithms or planners then search for a control policy that
optimizes the given objective. This can be computationally intensive, but advances in
algorithms combined with massive increases in available computing power over the years
have resulted in such techniques being widely applied. The performance of optimization-
based methods is critically dependent on the definition of the objective function, which
may require extensive manual tuning in order for the planner or optimal controller to
achieve acceptable real-world performance.
Inverse reinforcement learning (IRL) and inverse optimal control (IOC) methods seek
to address the above problem. The core idea behind these techniques is to infer a suitable
objective function given observations of optimal behavior. In many robotics problems, it
is significantly easier to provide demonstrations of optimal behavior than it is to design
an objective function that typifies the required behavior. IRL is commonly used as a
method of imitation learning. It is used to learn the underlying intentions of the observed
behavior, allowing the synthesis of novel control policies in new situations. This form
of learning, and subsequent reproduction using a planner promises vastly improved gen-
eralization abilities compared to a mere reproduction of observed controls or states. In
essence, such an approach aims to bring together the ability to quickly learn policies from
demonstrations, as well as the wide generalization abilities of optimization-based control
synthesis methods.
1.1 Contributions and Outline
In this thesis, we present a set of algorithms that learn objective functions for planning
and optimization problems in robotics. We show that the techniques we develop can
(1) significantly improve performance, and (2) make robotics systems easier and more
intuitive to train. Each algorithm is presented within the context of the larger system
that it is a part of. We provide the necessary background and discussions of related work
in Chapter 2.
Stochastic Trajectory Optimization for Motion Planning: We intend to learn
objective functions for trajectory optimization problems in high-dimensional settings.
However, when faced with such high dimensionality, the problem of solving for the opti-
mal control policy or trajectory is itself intractable, and approximate methods need to be
developed. In this chapter, we propose a probabilistic model which represents the distri-
bution over desirable trajectories. We show how smoothness and other general objective
functions can be included in this model, and derive a method to optimize the expected cost
of trajectories under the proposed distribution. The resulting sampling-based trajectory
optimization method, called "STOMP", achieves all of the following qualities: (1) generation
of smooth trajectories suitable for direct execution on robot hardware, (2) optimization of
arbitrary objective criteria (which could possibly be learnt), and (3) speed sufficient for
interactive use in a complete robotic system. We show results in 2-D examples, and on a 7
degree-of-freedom (DOF) robot arm in a manipulation planning setting, to plan collision
free trajectories that satisfy task constraints and minimize motor torques. Comparisons
with gradient-based approaches show that our approach is less prone to getting stuck in
local minima due to its stochastic nature. We also show results from the application of
this approach in a reinforcement learning setting, where we learn force control policies for
compliant manipulation tasks.
Learning Objective Functions for Manipulation: The trajectory optimizer de-
veloped in Chapter 3 provides us with the much-needed flexibility in designing cost
functions that can then be optimized. In Chapter 4, we develop a path-integral inverse
reinforcement learning technique for learning objective functions in high-dimensional settings.
The approach is based on the probabilistic trajectory model developed in Chapter 3. The
key insight is to assume that the expert demonstration is only locally rather than globally
optimal. Given the high dimensionality of the problems that we consider, striving for
global optimality could be considered futile. We apply our method to two core problems
in robotic manipulation: redundancy resolution in inverse kinematics, and motion plan-
ning for grasping and manipulation. Experiments show that our method outperforms
previous algorithms in high-dimensional settings.
Learning Rough Terrain Locomotion using Terrain Templates: Chapters 3
and 4 dealt with optimization and inverse reinforcement learning in the high dimensional
space of trajectories. In Chapter 5, we address the problem of parameterization of the
objective function used in IRL algorithms. This work was developed in the context of an
autonomous quadruped locomotion system intended to walk over very rough terrain. The
development of this work pushed the performance of legged locomotion to regimes that had
never been demonstrated before. While there are several novel contributions that resulted
from this work on the whole, here we focus on the module that learns an objective function
for foothold selection. The key innovation here is the use of terrain templates, inspired by
template matching approaches in computer vision, as a feature representation to construct
a reward function. Our template learning algorithm automatically selects a minimal set
of templates and learns a reward function from foothold demonstrations provided by the
user. We demonstrate the level of generalization thus achieved by presenting results from
testing performed by an independent external test team, on terrain that has never been
shown to us.
Conclusions and Future Work: In Chapter 6, we summarize the contributions of
this thesis, and discuss possibilities for future work.
Chapter 2
Background and Related Work
In this chapter we discuss the foundations and recent developments in several fields which
provide the context for the work presented in this thesis.
2.1 Optimal Control
Much of the work presented in this thesis is related to optimal control, and specifically
trajectory optimization. Optimal control is an extension of the calculus of variations,
which involves the minimization or maximization of a functional. A functional is a map-
ping from the space of functions to a real number. In the most general case of optimal
control, the functional is defined over the space of control policies, which themselves map
the current state of the system to a control command. The functional typically measures
the quality or cost which results from the application of a particular control policy. The
Pontryagin minimum principle [73] and Bellman's dynamic programming [6] formulation
are considered some of the foundational theories of optimal control. In particular, ap-
plication of the Bellman equation has resulted in algorithms that efficiently find optimal
control policies for systems with discrete states and actions. The Bellman equation defines
the optimal value function as the expected cumulative costs incurred by starting from a
given state and following the optimal control policy. The key discovery was that the value
function could be expressed in a recursive form, which stems from the observation that
the optimal action in a given state does not depend on how the system arrived at that
state. The continuous time version of this recursive equation is known as the Hamilton-
Jacobi-Bellman equation. Kalman provided an elegant solution to a specic case of this
problem involving linear dynamics and quadratic costs, known as the Linear Quadratic
Regulator (LQR) [45]. The optimal solution to this problem turns out to be a linear
feedback controller with an optimal gain matrix. However, more general forms of the op-
timal control problem tend to be highly non-linear and typically do not admit analytical
solutions. A potential strategy is to use extensive approximations to the problem until
it becomes analytically tractable. An alternate, more general purpose solution is to use
numerical techniques such as non-linear programming (NLP) to optimize a discretized
version of the original problem. Methods like differential dynamic programming [34] and
iterative LQG [95] find locally optimal trajectories, along with a local feedback control
policy.
Trajectory optimization could be considered a specific sub-problem of optimal control.
Instead of finding a feedback control policy as a function of state, trajectory optimization
methods attempt to find the optimal trajectory as a function of time, along with the
associated sequence of control commands required to achieve this trajectory. Trajectory
optimization methods assume that once the optimal trajectory is computed, it can be
executed without significant disturbances, or that any disturbances can be rejected using a
pre-determined control strategy. Alternately, trajectory optimization can be wrapped in a
model-predictive control (MPC) framework [27], which recomputes the optimal trajectory
at each instant of time, giving rise to a full state-feedback controller. Constrained non-
linear optimization is a popular tool used to solve trajectory optimization problems.
2.2 Motion Planning for Robotic Manipulation
Unfortunately, the application of optimal control and trajectory optimization to robotic
manipulation in unstructured environments has been limited. This is primarily
due to the amount of time required to compute optimal solutions. On the other hand,
motion planning has been a highly successful paradigm in this domain. Motion planners
introduce two relaxations to the classical optimal control problem: (1) they solve for
paths rather than trajectories, and (2) they focus on finding a feasible (collision-free)
solution rather than an optimal one. Sampling-based motion planning algorithms have
proven extremely successful in addressing manipulation problems [8, 18, 20, 57, 83], due
to their efficiency at finding collision-free paths. They have been used to address task
space constraints and to impose torque constraints while lifting heavy objects [9]. While
sampling based planners result in feasible plans, they are often lacking in the quality of the
paths produced. A secondary shortcutting step is often required to smooth and improve
the plans output by these planners [29]. Recent work in sampling-based planning [25] has
led to the development of the RRT* algorithm, which can improve the quality of plans
after finding an initial feasible plan. RRT* can optimize cost functions while planning
paths. However, this algorithm does not scale well with an increase in the dimensionality
of the state space for planning, e.g. when needing to plan for velocities as well. Computing
a distance metric required by RRT* in such spaces is also non-trivial, thus making it
difficult to use RRT* for problems where higher-order derivatives need to be taken into
account.
Recent work in motion planning for manipulation [77] has proposed a covariant gradient
descent technique for trajectory optimization, which has proven to be responsive
enough as a planner for manipulation in unstructured environments. This algorithm
(called CHOMP) minimizes a combination of smoothness and obstacle costs and requires
gradients for both. It used a signed distance eld representation of the environment to
derive the gradients for obstacles in the environment. We adopt a similar obstacle cost
function for the motion planning approach presented in this thesis, but in contrast to
CHOMP, our optimization approach can handle general cost functions for which gradients
are not available. Our algorithm, called STOMP, is closely related to a reinforcement
learning algorithm based on the path integral stochastic optimal control framework [93].
The algorithm, known as Policy Improvement with Path Integrals (PI²), has been successfully
applied to reinforcement learning of several high-dimensional robotic manipulation
tasks [72, 89]. It is able to optimize trajectories encoding positions [72, 89], forces [43], and
even control gains [13]. The STOMP algorithm proposed in this thesis can be derived
from the PI² algorithm, with a specially designed policy formulation and cost function,
and some simplifications to take into account the kinematic nature of motion planning
tasks.
Several black-box optimization methods also bear similarity with our techniques. The
cross-entropy method (CEM) [82] and the CMAES (Covariance Matrix Adaptation Evolution
Strategy) algorithm [28] both share the same flavor as PI² and STOMP, i.e., they
sample parameters locally which are then recombined with different weights based on their
costs. CEM has also been applied to motion planning problems [49]. Our method is different
from these works in that we update different parts of the trajectory independently
from each other, accelerating the performance of the optimizer. We also propose a specific
cost function formulation, update rule, and exploration method that works efficiently for
the types of problems that we consider.
A variant of PI² was introduced [90], which adopts the covariance matrix update
technique from CEM and CMAES, in order to automatically tune the covariance of the
sampling distribution. We leverage this technique with some modifications to automatically
tune the exploration noise parameter in STOMP.
2.3 Inverse Reinforcement Learning
Optimal control and reinforcement learning address the problem of finding optimal trajectories
given a specific cost function. In contrast, inverse optimal control (IOC) and
inverse reinforcement learning (IRL) address the problem of finding a cost function under
which a particular trajectory is optimal. Unfortunately, the problem as stated is generally
ill-posed: there may be an entire range of cost functions which are compatible with the
given trajectory. Most IRL and IOC algorithms thus introduce additional constraints or
preferences in order to arrive at a specific solution. In particular, the maximum entropy
formulation of IRL [100] (MaxEnt) is worth mentioning here, as it arises from a proba-
bilistic approach. It treats the agent as a probabilistic optimizer over entire trajectories,
and tries to match the expected costs incurred by the agent with the cost of the exemplar
trajectory. Among all such possible distributions over trajectories, it chooses the one
with the maximum entropy, i.e., the least committed distribution that still achieves the
expected cost constraint.
Several algorithms to solve the IRL problem have been proposed over the last decade
[1,67,76,100]. Most of them can be described as the following iterative procedure: (a) solve
the optimal control problem under the current estimate of the cost function; (b) update
the cost function estimate using the output of (a). The exact form of the cost function
update differs in each case. These procedures work well in discretized state-action spaces
of low dimensionality. However, in higher dimensional continuous state-action settings,
computation of the full solution of the optimal control problem becomes intractable.
Methods have been proposed that do not involve solving the full forward MDP opti-
mization problem iteratively. [23] proposes an efficient solution which uses samples to
approximate the value function directly based on a linear MDP formulation. [11] uses a
sampling-based approach combined with importance sampling to arrive at an objective
function similar to MaxEnt. In this thesis, we present an IRL method that bears similar-
ities to the MaxEnt method in terms of formulation, but is defined in the local vicinity
of the exemplar trajectory instead of the global space of all trajectories. This localiza-
tion allows our method to deal with high-dimensional spaces like the ones encountered in
robotic manipulation.
2.4 Rough Terrain Locomotion
Early research on robotic legged locomotion focused on gaits that are statically stable, i.e.
the center of gravity of the robot is always kept over the polygon formed by the supporting
feet. McGhee built the first computer-controlled quadruped robot, Phony Pony [62], and
studied the stability properties of various quadruped gait sequences [61]. Hirose et al.
built a series of quadruped robots called TITAN [30] that could climb stairs using static
gaits.
A breakthrough in legged locomotion was achieved by Raibert and colleagues, who built a
series of legged robots that achieved high-speed locomotion with dynamic balancing [31,
75]. This line of research has recently resulted in the BigDog robot [74] built by Boston
Dynamics, which is a quadruped that uses a dynamically balanced trot gait to traverse
rough terrain.
Another important wave of research was created by biologically inspired robot de-
signs [16, 26, 84, 87], which try to overcome the limited ability of multi-legged robots in
rough terrain with biomimetic hardware design principles. In the realm of biologically in-
spired legged locomotion, there has also been research on biologically inspired controllers,
usually in the form of central pattern generators (CPGs) [32]. The CPG approach can be
seen as the control-based analogue to biomimetic robot design, i.e., in both approaches,
robustness of the controller is sought in the emergent attractor dynamics between the
robot, controller, and environment.
Machine learning for locomotion has been used to acquire quadruped locomotion gait
patterns [66] and optimize them for performance [51], primarily over flat ground. Recently,
the DARPA Learning Locomotion program spawned off a line of research on ap-
plying machine learning and optimization techniques to the problem of legged locomotion
over rough terrain, where the obstacles are comparable in size to the leg length of the
robot. Complete control architectures have been presented that perform locomotion over
challenging terrain [39,53,79,103]. Control techniques that enable compliant walking and
disturbance rejection have been demonstrated on rough terrain [12].
Foot placement is arguably one of the most critical aspects of rough terrain locomotion,
making the application of machine learning techniques to this problem appealing. One
approach is to learn a classier for footholds on the terrain as acceptable or unacceptable
using terrain features like slope and proximity to cliffs [79]. Other work involves defining
a reward function over footholds as a weighted linear combination of terrain features like
slope and curvature on different length scales, and subsequently picking the foothold that
maximizes the reward [52]. The weights on the features in the reward function in [52] are
inferred using a learning procedure called hierarchical apprenticeship learning on footholds
and body paths demonstrated by an expert. The performance of such a system, however,
is critically dependent on the careful design of heuristic terrain features which are flexible
enough to model the expert's training data. In this thesis, we present a method for
learning foot placement preferences based on IRL. We focus on the parameterization of
the reward function that is learnt, and propose the concept of terrain templates as a
flexible foothold feature representation. When combined with sparse learning methods,
our method is highly performant without the need to manually program various terrain
heuristics.
Chapter 3
Stochastic Trajectory Optimization for Motion
Planning
3.1 Introduction
Motion planning for mobile manipulation has received widespread attention in recent
years. Multiple approaches to the problem have been developed,
including randomized planners, search-based planners and trajectory optimization-based
techniques. Traditional motion planning techniques have focused on solving the problem
of finding collision-free paths for a mobile manipulator, i.e. paths that the system can
follow without internal collisions or collisions with the environment. Little attention
has been paid to other considerations for motion planning, e.g. minimizing the torques
required for executing a particular task or dealing with orientation constraints on the
end-effectors. Such constraints arise in real-world tasks where conserving energy or keeping
objects level are often also important (in addition to avoiding obstacles).
For motion planners to be useful in real-world situations, they require the ability to
quickly output good quality plans that satisfy all the constraints inherent in an application.
Speed of computation is important to be able to respond quickly to requests from
users or high-level programs. The quality of motion plans, especially their smoothness,
plays a big role in allowing controllers to execute them effectively and also has a large
effect on human perception of the quality of motion of the robot. The ability to satisfy
a wide variety of constraints is essential to deal with the range of tasks that a mobile
manipulation system may be required to perform. Motion planners must also be reli-
able, i.e. they should be capable of solving most motion planning problems that a mobile
manipulation system may face.
In this chapter, we build on and extend our work in [41] and present a stochastic tra-
jectory optimization based approach to motion planning (STOMP). Our approach starts
with an initial (possibly infeasible) trajectory as a candidate solution and a cost func-
tion incorporating the different costs we would like to minimize. An optimization process
iteratively samples in the space around this trajectory and then combines the samples
intelligently to obtain a new trajectory with lower cost. A differentiating characteristic
[Figure 3.1: (a) The PR2 robot manipulating objects in a household environment. (b) Simulation of the PR2 robot avoiding a pole in a torque-optimal fashion.]
of our planner is the ability to incorporate very general cost functions, especially because
our approach does not rely on having access to the gradient of the cost function. We
further extend this approach by incorporating an additional step: a noise adaptation rule
that attempts to automatically tune the exploration noise for faster convergence.
We first demonstrate our approach on a simplified 2D problem, using it to illustrate
and explain the key features and working of our approach. We also use this 2D example
to compare our approach to an existing gradient based approach. We then demonstrate
our approach both in simulation and on a real-world robot for a 7 DOF manipulator
performing manipulation tasks while incorporating sensor data. We compare our approach
to gradient-based approaches, showing that planning times are comparable and that our
planner generates valid plans in far more cases. Finally, we also apply our trajectory
optimization approach to a reinforcement learning problem, wherein we learn a force
control policy for compliant manipulation tasks. We demonstrate this approach on two
different manipulation tasks: opening a door with a lever door handle, and picking up a
pen off the table.
3.1.1 Structure
This chapter is organized as follows. In Section 3.2, we first present details of our motion
planning algorithm itself. This includes a detailed derivation of the update rules that
STOMP uses. We then present an extension to the algorithm that helps in improving
its convergence capabilities. In Section 3.3, we present an application of the algorithm
to a 2D problem, illustrating its working and also comparing it with an existing trajec-
tory optimization-based motion planning algorithm. In Section 3.4, we then present the
application of our approach to motion planning for a 7-DOF manipulator.
3.2 The STOMP Algorithm
Traditionally, motion planning is defined as the problem of finding a collision-free trajectory
from the start configuration to the goal configuration. In addition to finding a
collision-free trajectory, we also attempt to account for other constraints acting on the
system, e.g. torque or orientation constraints. We treat the resulting motion planning
problem as an optimization problem, to search for a smooth trajectory that minimizes a
given cost function, where the cost function is appropriately designed to account for all
the constraints.
Our planner starts with a discretized trajectory as an initial guess for the desired
motion plan. This trajectory may often be infeasible, i.e. it may violate the constraints
imposed on the motion of the robot and may be in collision with the environment. We
consider trajectories of a fixed duration T, discretized into N waypoints, equally spaced in
time. In order to keep the notation simple, we first derive the algorithm for a 1-dimensional
trajectory; this naturally extends later to multiple dimensions. This 1-dimensional discretized
trajectory is represented as a vector \theta \in \mathbb{R}^N. We assume that the start and
goal of the trajectory are given as x_s and x_g respectively, and are kept fixed during the
optimization process.
In the following subsections, we first present the general form of the cost function
and the prior that we impose upon trajectories in our framework. Next, we describe
two alternate ways of arriving at the STOMP update rules: the first using a stochastic
optimization approach, and the second using the principles of path integral stochastic
optimal control. Finally, we present the noise adaptation extension to STOMP.
3.2.1 Cost Function
We define the trajectory cost function J(\theta) as the sum of state costs J_x(\theta) and control
costs J_u(\theta):

J(\theta) = J_x(\theta) + J_u(\theta)    (3.1)

J_x(\theta) = \sum_{t=1}^{N} q(\theta_t)    (3.2)

J_u(\theta) = \frac{1}{2} \theta^T R \, \theta    (3.3)

q(\theta_t) is an arbitrary state-dependent cost function measured at time t. This typically
includes costs due to obstacles, constraint violations, and other task-related objectives.
The control cost J_u(\theta) is quadratic in the parameters, where R is a positive semi-definite
matrix.

We define the control cost to be the sum of squared derivatives along the entire
trajectory, using a set of finite differencing matrices A_1 \ldots A_D:

J_u(\theta) = \sum_{d=1}^{D} w_d \left\| A_d \, [x_s \;\; \theta^T \;\; x_g]^T \right\|^2,    (3.4)

where d is the order of differentiation, and w_d is the weight of each term in the cost
function. In our experiments, we choose to minimize squared accelerations, i.e., w_2 = 1,
and all other weights are set to 0. The corresponding finite differencing matrix A_2 is of
the form:

A_2 = \begin{bmatrix}
 1 &  0 &  0 & \cdots &  0 &  0 &  0 \\
-2 &  1 &  0 & \cdots &  0 &  0 &  0 \\
 1 & -2 &  1 & \cdots &  0 &  0 &  0 \\
   & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 0 &  0 &  0 & \cdots &  1 & -2 &  1 \\
 0 &  0 &  0 & \cdots &  0 &  1 & -2 \\
 0 &  0 &  0 & \cdots &  0 &  0 &  1
\end{bmatrix}    (3.5)

This definition of control cost results in a matrix R (Eq. 3.3) that is symmetric and
positive definite. Terms from Eq. (3.4) that are linear in the parameters can be absorbed
into J_x(\theta) without loss of generality.
3.2.2 Stochastic Optimization
We start with the following optimization problem:

\min_{\theta} \; E\left[ J(\tilde{\theta}) \right] = \min_{\theta} \; E\left[ \sum_{t=1}^{N} q(\tilde{\theta}_t) + \frac{1}{2} \tilde{\theta}^T R \, \tilde{\theta} \right]    (3.6)

where \tilde{\theta} = \mathcal{N}(\theta, \Sigma) is a noisy parameter vector with mean \theta and covariance \Sigma.

Previous work [77] has demonstrated the optimization of the non-stochastic version
of Eq. (3.6) using covariant functional gradient descent techniques. In this work, we
instead optimize it using a derivative-free stochastic optimization method. This allows
us to optimize arbitrary costs q(\tilde{\theta}) for which derivatives are not available, or are
non-differentiable or non-smooth.

Taking the gradient of the expectation in Eq. (3.6) with respect to \theta, we get:

\nabla_{\theta} \left( E\left[ \sum_{i=1}^{N} q(\tilde{\theta}_i) + \frac{1}{2} \tilde{\theta}^T R \, \tilde{\theta} \right] \right) = 0    (3.7)

which leads to:

E(\tilde{\theta}) = -R^{-1} \, \nabla_{\theta} \left( E\left[ \sum_{i=1}^{N} q(\tilde{\theta}_i) \right] \right)    (3.8)

Further analysis results in:

E(\tilde{\theta}) = -R^{-1} \, E\left( \nabla_{\tilde{\theta}} \left[ \sum_{i=1}^{N} q(\tilde{\theta}_i) \right] \right)    (3.9)

The expression above can be written in the form E(\tilde{\theta}) = -R^{-1} \delta\hat{\theta}_G, where \delta\hat{\theta}_G is now
the gradient estimate defined as follows:

\delta\hat{\theta}_G = E\left( \nabla_{\tilde{\theta}} \left[ \sum_{i=1}^{N} q(\tilde{\theta}_i) \right] \right)    (3.10)

Inspired by previous work in the probability matching literature [19] as well as recent
work in the areas of path integral reinforcement learning [92], we propose an estimated
gradient formulated as follows:

\delta\hat{\theta}_G = \int \delta\theta \; dP    (3.11)

Essentially, the equation above is the expectation of \delta\theta (the noise in the parameter vector
\tilde{\theta}) under the probability metric P \propto \exp\left( -\frac{1}{\lambda} J(\tilde{\theta}) \right), where J(\tilde{\theta}) is the trajectory cost
according to Eq. (3.1). Thus the stochastic gradient is now formulated as follows:

\delta\hat{\theta}_G = \int \exp\left( -\frac{1}{\lambda} J(\theta + \delta\theta) \right) \delta\theta \; d(\delta\theta)    (3.12)

In practice, we estimate the above gradient by sampling a finite number of trajectories:

\delta\hat{\theta}_G = \sum_{k=1}^{K} P(\theta + \delta\theta_k) \, \delta\theta_k    (3.13)

P(\theta + \delta\theta_k) = \frac{\exp\left( -\frac{1}{\lambda} J(\theta + \delta\theta_k) \right)}{\sum_{l=1}^{K} \exp\left( -\frac{1}{\lambda} J(\theta + \delta\theta_l) \right)}    (3.14)

The application of this gradient update to the original trajectory is then iterated,
where sample trajectories are re-drawn from the newly updated trajectory at every iteration.
We additionally make the assumption that the state cost at each time step q(\theta_i)
depends only on the parameter \theta_i, and we do not blame future or past costs on
the current state. This allows each parameter to be updated independently, and significantly
accelerates convergence in our experiments. The resulting algorithm is presented
in Table 3.1. In the next sub-section, we show how this algorithm can be derived from
the path integral stochastic optimal control framework.
Table 3.1: The STOMP Algorithm

Given:
- Start and goal positions x_s and x_g
- Initial 1-D discretized trajectory vector \theta
- State-dependent cost function q(\theta_i), control cost matrix R
- Standard deviation of exploration noise \sigma

Repeat until convergence of trajectory cost J(\theta):
1. Create K noisy trajectories \tilde{\theta}_1 \ldots \tilde{\theta}_K with parameters \theta + \epsilon_k, where \epsilon_k = \mathcal{N}(0, \sigma^2 R^{-1})
2. For k = 1 \ldots K and t = 1 \ldots N, compute:
   (a) J(\tilde{\theta}_{k,t}) = q(\tilde{\theta}_{k,t}) + \frac{1}{2N} \tilde{\theta}_k^T R \, \tilde{\theta}_k
   (b) P(\tilde{\theta}_{k,t}) = \frac{e^{-\frac{1}{\lambda} J(\tilde{\theta}_{k,t})}}{\sum_{l=1}^{K} e^{-\frac{1}{\lambda} J(\tilde{\theta}_{l,t})}}
3. For t = 1 \ldots (N-1), compute: [\delta\tilde{\theta}]_t = \sum_{k=1}^{K} P(\tilde{\theta}_{k,t}) [\epsilon_k]_t
4. Update \theta \leftarrow \theta + R^{-1} \delta\tilde{\theta}
5. Compute trajectory cost J(\theta) = \sum_{t=1}^{N} q(\theta_t) + \frac{1}{2} \theta^T R \theta
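The following NumPy sketch of a single STOMP iteration ties the steps of Table 3.1 together; the function, its arguments, and the choice of K are illustrative assumptions, and state_cost stands in for the task-specific state cost q(\theta_t).

import numpy as np

def stomp_iteration(theta, R, R_inv, state_cost, K=20, sigma=1.0, h=10.0):
    # One update of Table 3.1 (a sketch, not the reference implementation).
    # theta: (N,) trajectory; state_cost: maps an (N,) trajectory to (N,) per-step costs.
    N = theta.shape[0]
    # Step 1: K noisy trajectories with smooth noise eps_k ~ N(0, sigma^2 R^-1)
    eps = np.random.multivariate_normal(np.zeros(N), sigma ** 2 * R_inv, size=K)
    noisy = theta[None, :] + eps
    # Step 2(a): per-time-step state cost plus an equal share of the control cost
    J = np.stack([state_cost(th) + (th @ R @ th) / (2.0 * N) for th in noisy])
    # Step 2(b): exponentiated-cost probabilities per time step, using Eq. (3.23)
    Jmin, Jmax = J.min(axis=0), J.max(axis=0)
    P = np.exp(-h * (J - Jmin) / (Jmax - Jmin + 1e-10))
    P /= P.sum(axis=0)
    # Step 3: probability-weighted convex combination of the noise at each time step
    delta = (P * eps).sum(axis=0)
    # Step 4: smooth the update with R^-1 and apply it
    return theta + R_inv @ delta

In practice, the smoothing matrix is often rescaled (see the discussion of the projector M_t in Section 3.2.3) so that updates at different time-steps have comparable magnitudes.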
3.2.3 Path Integral Stochastic Optimal Control
The above gradient estimation process has strong connections to how the gradient of the
value function is computed in the path integral stochastic optimal control framework.
The goal of stochastic optimal control is to find the sequence of controls that minimize
a given performance criterion, in the presence of control and/or observation noise. We
briefly outline the development of the path integral stochastic optimal control framework,
and its application to trajectory-based reinforcement learning in the form of the
Policy Improvement with Path Integrals (PI²) algorithm [93]. We then describe a policy
formulation that reduces the PI² update rules to those of STOMP.
We consider control systems of the following form:

\dot{x}_t = f(x_t, t) + g_t^T (\theta + \epsilon_t),    (3.15)

where x_t denotes the one-dimensional state of the system at time t, f(x_t, t) represents the
passive dynamics, g_t is a time-dependent basis vector which denotes the linear impact
of the controls on the state, \theta \in \mathbb{R}^p are the policy parameters, and \epsilon_t \in \mathbb{R}^p represents
control noise at time t, which is Gaussian with variance \Sigma_\epsilon.

The immediate cost function at time t is defined as:

r_t = q(x, t) + \frac{1}{2} (\theta + \epsilon_t)^T R (\theta + \epsilon_t),    (3.16)

where q(x, t) is an arbitrary state-dependent cost function, and R \in \mathbb{R}^{p \times p} is the positive
semi-definite weight matrix of the quadratic control cost. The total cost of a trajectory
is then defined as:

J(\tau) = \sum_{t=1}^{N} r_t + \phi_N,    (3.17)

where \phi_N is a terminal cost.

The Hamilton-Jacobi-Bellman (HJB) equation associated with this continuous-time
optimal control problem is a non-linear, second order partial differential equation (PDE),
which when solved provides the value function. The HJB equation can be linearized using
a log-transformation of the value function and the assumption \lambda R^{-1} = \Sigma_\epsilon. Application
of the Feynman-Kac theorem to this PDE converts it into a problem of numerical path
integral approximation. We omit the detailed derivation here, and instead refer the reader
to a previous publication for details [93]. Pseudocode for one iteration of the PI² algorithm
to optimize the parameters \theta is listed below:

- Create K roll-outs of the system \tau_1 \ldots \tau_K from the same start state x_0, using
  stochastic parameters \theta + \epsilon_{k,t} at every time step.

- For k = 1 \ldots K, and time-steps t = 1 \ldots N, compute:

  M_t = \frac{R^{-1} g_t g_t^T}{g_t^T R^{-1} g_t}    (3.18)

  J(\tau_{k,t}) = \phi_k + \sum_{j=t}^{N-1} q_{k,j} + \frac{1}{2} \sum_{j=t}^{N-1} (\theta + M_j \epsilon_{k,j})^T R \, (\theta + M_j \epsilon_{k,j})    (3.19)

  P(\tau_{k,t}) = \frac{e^{-\frac{1}{\lambda} J(\tau_{k,t})}}{\sum_{j=1}^{K} e^{-\frac{1}{\lambda} J(\tau_{j,t})}}    (3.20)

- For time-steps t = 1 \ldots N, compute:

  \delta\theta_t = \sum_{k=1}^{K} P(\tau_{k,t}) \, M_t \, \epsilon_{k,t}    (3.21)

- Compute \delta\theta = \frac{1}{N} \sum_{t=1}^{N} \delta\theta_t

- Update \theta \leftarrow \theta + \delta\theta

M_t is a matrix that projects the noise \epsilon_t onto the range space of the basis vector g_t
under the metric R^{-1}, at time t. J(\tau_{k,t}) is the cumulative cost for rollout k from time t
until the end. P(\tau_{k,t}) is its corresponding discrete trajectory probability: the lower the
trajectory cost, the higher the probability, and vice versa.
We can now apply the PI² algorithm to a specially constructed dynamical system that
tracks the desired discretized trajectory:

\dot{x}_t = K (g_t^T \theta - x_t),    (3.22)

where g_t \in \mathbb{R}^{N \times 1} is a vector that contains a 1 at position t and 0 at all other indices.
This system, with a sufficiently high gain K, ensures that the state x tracks the desired
trajectory \theta.

Substitution of this system into the PI² algorithm affords several simplifications and
results in update equations that look very similar in nature to those of STOMP (Table
3.1). The primary difference between the two is in the computation of the per time-step
cost. PI² optimizes the parameters of a dynamical system, in which the actions taken at
a particular time-step affect all future costs. Hence, the cost assigned to each time-step
is the cumulation of the current and all future costs (Eq. 3.19). In STOMP however, we
are able to break this dependency since we optimize the kinematics directly and remove
any dynamics from the system. This allows us to directly optimize the per time-step cost
without cumulation, making the optimization more efficient.
The other difference is the use of the projector M_t at every time-step instead of R^{-1}.
When summing up M_t at each time-step to obtain a single matrix M, we find that it is
equal to R^{-1}, but with each column i scaled by the inverse of its i-th element. The effect
of this scaling is to even out the magnitude of updates at different time-steps.

[Figure 3.2: (a) Each curve depicts a column/row of the symmetric matrix R^{-1}. (b) 20 random samples of \epsilon, drawn from a zero-mean normal distribution with covariance \Sigma_\epsilon = R^{-1}.]
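To make this claim concrete, the following small check (an illustrative sketch; the function name is an assumption) sums the projectors M_t of Eq. (3.18) for one-hot basis vectors g_t and verifies that the result is R^{-1} with each column scaled by the inverse of its diagonal element.

import numpy as np

def summed_projector(R_inv):
    # Sum M_t = (R^-1 g_t g_t^T) / (g_t^T R^-1 g_t) over one-hot basis vectors g_t.
    N = R_inv.shape[0]
    M = np.zeros_like(R_inv)
    for t in range(N):
        g = np.zeros(N)
        g[t] = 1.0
        M += (R_inv @ np.outer(g, g)) / (g @ R_inv @ g)   # Eq. (3.18)
    # Closed form: each column i of R^-1 divided by its i-th (diagonal) element.
    assert np.allclose(M, R_inv / np.diag(R_inv)[None, :])
    return M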
There are a few points which warrant further discussion and are examined in the next
two subsections.
3.2.4 Exploration
In order to keep control costs of noisy trajectories low, we sample the noise \epsilon from a
zero-mean normal distribution with \Sigma_\epsilon = R^{-1} as the covariance matrix, as shown in
Figure 3.2(b). The PI² algorithm is also derived based on this assumption. This is
preferable to sampling with \Sigma_\epsilon = I for several reasons: (1) the samples have a low control
cost \epsilon^T R \epsilon, and thus allow exploration of the state space without significantly impacting
the trajectory control cost; (2) these noisy trajectories may be executed without trouble
on a physical system; (3) the samples do not cause the trajectory to diverge from the start
or goal. Goal-convergent exploration is highly desirable in trajectory-based reinforcement
learning of point-to-point movements, where dynamical systems have been designed that
satisfy this property [33].
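A short sketch of this sampling scheme (the function names and the comparison against isotropic noise are illustrative additions) contrasts the control cost \epsilon^T R \epsilon of samples drawn with covariance \sigma^2 R^{-1} against samples drawn with covariance \sigma^2 I:

import numpy as np

def compare_exploration_noise(R, R_inv, K=20, sigma=1.0, seed=0):
    # Draw K noise vectors from N(0, sigma^2 R^-1) and from N(0, sigma^2 I),
    # and report their average control cost eps^T R eps.
    rng = np.random.default_rng(seed)
    N = R.shape[0]
    smooth = rng.multivariate_normal(np.zeros(N), sigma ** 2 * R_inv, size=K)
    white = sigma * rng.standard_normal((K, N))
    cost = lambda E: float(np.mean([e @ R @ e for e in E]))
    return {"smooth": cost(smooth), "isotropic": cost(white)}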
3.2.5 Trajectory updates
After generating K noisy trajectories, we compute their costs per time-step J(\tilde{\theta}_{k,t}) (Table 3.1, Step 2(a)). In Step 2(b), we compute the probabilities P(\tilde{\theta}_{k,t}) of each noisy
trajectory, per time-step. The parameter \lambda regulates the sensitivity of the exponentiated
cost, and can be automatically optimized per time-step to maximally discriminate
between the experienced costs. We compute the exponential term in Step 2(b) as:

e^{-\frac{1}{\lambda} J(\tilde{\theta}_{k,t})} = e^{-h \frac{J(\tilde{\theta}_{k,t}) - \min J(\tilde{\theta}_{k,t})}{\max J(\tilde{\theta}_{k,t}) - \min J(\tilde{\theta}_{k,t})}},    (3.23)

with h set to a constant, which we chose to be h = 10 in all our evaluations. The max
and min operators are over all noisy trajectories k. The noisy update for each time-step
is then computed in Step 3 as the probability-weighted convex combination of the noisy
parameters for that time-step.

Finally, in Step 4, we smooth the noisy update using the R^{-1} matrix as we update
the trajectory parameters. Multiplication with R^{-1} ensures that the updated
trajectory remains smooth, since it is essentially a projection onto the basis vectors of
R^{-1} shown in Figure 3.2(a).
These trajectory updates can be considered safer than a standard gradient descent
update rule. The new trajectory is essentially a convex combination of the noisy trajecto-
ries which have already been evaluated, i.e. there are no unexpected jumps to unexplored
parts of the state space due to a noisy gradient evaluation. Our iterative update rule is
analogous to an expectation-maximization (EM) algorithm, in which we update the mean
of our trajectory sampling distribution to match the distribution of costs obtained from
sampling in the previous iteration. This procedure guarantees that the average cost is
non-increasing, if the sampling is assumed to be dense [19, 97]. An additional advantage
is that no gradient step-size parameter is required; the only open parameter in this algo-
rithm is the magnitude of the exploration noise. In the next sub-section, we introduce a
noise adaptation rule that allows us to automatically find the optimal exploration noise
magnitude, further simplifying tuning of the algorithm.
3.2.6 Noise adaptation
Steps 3 and 4 of the STOMP algorithm in Table 3.1 update the mean of the trajectory
sampling distribution \theta. The covariance matrix and its magnitude, however, are kept
fixed throughout the process. The magnitude of exploration noise is a parameter that
may have to be tweaked for different tasks. In the original version of STOMP [41], we
exponentially decay the noise at every iteration in order to force the optimization to
converge.

Black-box derivative-free optimization algorithms like the Cross-Entropy Method [82]
and CMAES [28] update both the mean and the covariance of the sampling distribution.
Recently, a covariance matrix adaptation variant of PI² was introduced (PI²-CMA) [90].
We adapt this procedure for STOMP, in order to update the magnitude of the exploration
noise after every iteration.
Table 3.2: The STOMP Algorithm with Noise Adaptation. (The bold lines indicate
new steps introduced by the noise adaptation extension.)

Given:
- Start and goal positions x_s and x_g
- Initial 1-D discretized trajectory vector \theta
- State-dependent cost function q(\theta_i), control cost matrix R
- Initial standard deviation of exploration noise \sigma = \sigma_0
- Noise update inertia parameter \alpha
- Minimum noise standard deviation \sigma_{min}

Repeat until convergence of trajectory cost J(\theta):
1. Create K noisy trajectories \tilde{\theta}_1 \ldots \tilde{\theta}_K with parameters \theta + \epsilon_k, where \epsilon_k = \mathcal{N}(0, \sigma^2 R^{-1})
2. For k = 1 \ldots K and t = 1 \ldots N, compute:
   (a) J(\tilde{\theta}_{k,t}) = q(\tilde{\theta}_{k,t}) + \frac{1}{2N} \tilde{\theta}_k^T R \, \tilde{\theta}_k
   (b) P(\tilde{\theta}_{k,t}) = \frac{e^{-\frac{1}{\lambda} J(\tilde{\theta}_{k,t})}}{\sum_{l=1}^{K} e^{-\frac{1}{\lambda} J(\tilde{\theta}_{l,t})}}
3. For t = 1 \ldots (N-1), compute: [\delta\tilde{\theta}]_t = \sum_{k=1}^{K} P(\tilde{\theta}_{k,t}) [\epsilon_k]_t
4. Update \theta \leftarrow \theta + R^{-1} \delta\tilde{\theta}
5. For k = 1 \ldots K, compute:
   (a) J(\tilde{\theta}_k) = \sum_{t=1}^{N} J(\tilde{\theta}_{k,t})
   (b) P(\tilde{\theta}_k) = \frac{e^{-\frac{1}{\lambda} J(\tilde{\theta}_k)}}{\sum_{l=1}^{K} e^{-\frac{1}{\lambda} J(\tilde{\theta}_l)}}
6. Compute \tilde{\Sigma}_\epsilon = \sum_{k=1}^{K} P(\tilde{\theta}_k) \, \epsilon_k \epsilon_k^T
7. Compute \sigma' = \left( \frac{ \sum_{i=1}^{N} \sum_{j=1}^{N} \tilde{\Sigma}_{\epsilon,i,j} \, R^{-1}_{i,j} }{ \sum_{i=1}^{N} \sum_{j=1}^{N} R^{-1}_{i,j} \, R^{-1}_{i,j} } \right)^{\frac{1}{2}}
8. Update \sigma \leftarrow \max\left( \alpha \sigma' + (1 - \alpha) \sigma, \; \sigma_{min} \right)
9. Compute trajectory cost J(\theta) = \sum_{t=1}^{N} q(\theta_t) + \frac{1}{2} \theta^T R \theta
The mean of the sampling distribution is computed as a weighted convex combination
of the samples, in both PI² and STOMP. Similarly, in PI²-CMA, the covariance matrix is
also computed using a weighted covariance update. In STOMP however, the size of our
parameter vector is quite large (all of our experiments use N = 100), and the corresponding
covariance matrix contains N^2 parameters. Estimation of these parameters from very little
data tends to be unreliable, and results in a rank-deficient covariance matrix. Also, we
would like to maintain the structure of the covariance matrix to be similar to R^{-1}, so that
new trajectory samples do not lose their smoothness or convergence properties. In order
to achieve this, we first estimate the full covariance matrix \tilde{\Sigma}_\epsilon from the weighted samples,
and then solve for the noise magnitude \sigma that minimizes the Frobenius norm between the
matrices \tilde{\Sigma}_\epsilon and \sigma^2 R^{-1}:

\tilde{\Sigma}_\epsilon = \sum_{k=1}^{K} P(\tilde{\theta}_k) \, \epsilon_k \epsilon_k^T    (3.24)

\sigma = \arg\min_{\sigma} \left\| \tilde{\Sigma}_\epsilon - \sigma^2 R^{-1} \right\|_F    (3.25)

Eq. 3.25 is a least squares minimization problem whose solution can be found analytically:

\sigma = \left( \frac{ \sum_{i=1}^{N} \sum_{j=1}^{N} \tilde{\Sigma}_{\epsilon,i,j} \, R^{-1}_{i,j} }{ \sum_{i=1}^{N} \sum_{j=1}^{N} R^{-1}_{i,j} \, R^{-1}_{i,j} } \right)^{\frac{1}{2}}    (3.26)

where the i,j suffix refers to the entry in the i-th row and j-th column of the matrix in
question. In practice, we keep the noise above a minimum threshold value \sigma_{min}, and also
add some inertia to the updates to avoid premature convergence:

\sigma = \max\left( \alpha \sigma_{new} + (1 - \alpha) \sigma_{prev}, \; \sigma_{min} \right)    (3.27)

where \sigma_{new} is the newly estimated noise magnitude from Eq. 3.26, \sigma_{prev} is the noise
magnitude from the previous iteration, and \alpha is an inertia parameter. We set \alpha = 0.2 in
all our experiments.

Table 3.2 shows the complete STOMP algorithm, with the noise adaptation extension.
Experimental results in Sec. 3.3.2 show that the use of this extension significantly improves
the convergence rate, and reduces sensitivity to the initial value of the noise magnitude.
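The noise adaptation step of Eqs. (3.24)-(3.27) reduces to a few lines; in this illustrative sketch, probs are the per-trajectory probabilities P(\tilde{\theta}_k) from Step 5(b) of Table 3.2, and the default value of the floor sigma_min is an assumption (the inertia default follows the \alpha = 0.2 used in our experiments).

import numpy as np

def adapt_noise(eps, probs, R_inv, sigma_prev, alpha=0.2, sigma_min=0.05):
    # eps: (K, N) noise samples of the current iteration; probs: (K,) weights summing to 1.
    # Eq. (3.24): probability-weighted covariance of the noise samples
    Sigma_eps = sum(p * np.outer(e, e) for p, e in zip(probs, eps))
    # Eq. (3.26): closed-form least-squares fit of Sigma_eps by sigma^2 R^-1
    sigma_new = np.sqrt(np.sum(Sigma_eps * R_inv) / np.sum(R_inv * R_inv))
    # Eq. (3.27): inertia and a lower bound to avoid premature convergence
    return max(alpha * sigma_new + (1.0 - alpha) * sigma_prev, sigma_min)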
3.3 Experiments on 2D environments
We first evaluate the STOMP algorithm on a set of synthetic cost functions in a 2-D
environment. This allows us to systematically test the effects of different parameters, and
easily visualize and understand the working of the algorithm.
3.3.1 Synthetic 2-D Cost Maps
We consider the problem of planning the motion of a point in a two dimensional space.
We constructed four different cost maps, as shown in the left column of Figure 3.3.
[Figure 3.3: A set of four two-dimensional cost maps and their boolean transformations: (a) OFFSET, (b) OFFSET-BOOL, (c) CENTERED, (d) CENTERED-BOOL, (e) MAZE, (f) MAZE-BOOL, (g) SADDLE, (h) SADDLE-BOOL. Each plot shows the initial trajectory of the motion of a point from (0,0) to (1,1) before optimization, and after 500 iterations of optimization with CHOMP and STOMP respectively.]
Each cost map (left column) contains a set of ellipses with different centers and radii. The cost
at a position q_{map}(x, y) on the map varies linearly from 1.0 at the center of an ellipse
to 0.0 outside it. Positions that lie outside of the depicted map area are penalized at a
cost equal to the l_1 distance from the map boundary. We also tested the algorithm on
boolean transformations of these cost maps, as shown in the right column of Figure 3.3.
Each plot shows the initial trajectory before optimization, which is a straight line from
the point (0.01, 0.01) to (0.99, 0.99) that minimizes the control cost, defined as the sum
of squared velocities along the trajectory. The trajectory has a duration of 1.0 seconds,
and is discretized into N = 100 time-steps. The total trajectory cost is defined as:
J() =
N
X
t=1
q
map
(x
t
;y
t
) ( _ x
2
t
+ _ y
2
t
)
1
2
+ 0:0001
t
R (3.28)
where the map cost at every time step q
map
(x
t
;y
t
) is multiplied by the instantaneous
velocity of the point at that time. This approximates the cost of the trajectory as an inte-
gration over path length as opposed to integration over time. Without this modication,
the algorithm would be able to reduce the trajectory cost by moving through high-cost
areas of the map very quickly.
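A minimal sketch of this discretized cost (Eq. 3.28) is shown below. The `cost_map` function stands in for the ellipse-based map $q_{\text{map}}$, and the control-cost term is approximated here by a simple squared-velocity penalty; both are assumptions for illustration, not the exact quantities used in this work.

```python
import numpy as np

def trajectory_cost(xs, ys, cost_map, dt, control_weight=1e-4):
    # Finite-difference velocities along the discretized path.
    vx, vy = np.gradient(xs, dt), np.gradient(ys, dt)
    speed = np.sqrt(vx**2 + vy**2)
    map_cost = np.sum(cost_map(xs, ys) * speed)             # map cost weighted by speed
    control_cost = control_weight * np.sum(vx**2 + vy**2)   # simplified theta^T R theta analogue
    return map_cost + control_cost

# Example: straight line through a single circular "obstacle" centered at (0.5, 0.5).
def cost_map(x, y):
    d = np.sqrt((x - 0.5)**2 + (y - 0.5)**2)
    return np.clip(1.0 - d / 0.2, 0.0, 1.0)

t = np.linspace(0.0, 1.0, 100)
print(trajectory_cost(0.01 + 0.98 * t, 0.01 + 0.98 * t, cost_map, dt=t[1] - t[0]))
```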
We executed 500 iterations of STOMP on each of the four maps and their boolean transformations. We used the following settings: number of rollouts K = 50, initial noise standard deviation $\sigma_0 = 1.0$, with the noise adaptation feature switched on. We also ran the CHOMP algorithm, which optimizes the same cost function using covariant gradient descent [77]. For CHOMP, gradients of the map were obtained using finite differentiation, and trajectory gradients were computed using the discretized functional gradient formulation from [77]. The gradient step size was chosen to be 0.1. The final trajectories obtained by the two algorithms are shown in Fig. 3.3. The initial trajectory in the OFFSET map has a clear gradient pointing away from the center of the obstacle, and is easy to optimize for both CHOMP and STOMP. OFFSET-BOOL has no gradients in the interior of the obstacle, but CHOMP is able to utilize the gradients at the obstacle boundary to successfully optimize the initial trajectory. CENTERED and CENTERED-BOOL were constructed such that the initial trajectory passes through the center of the obstacle, resulting in a gradient of zero. This causes CHOMP to fail, while STOMP is still able to successfully find a valid trajectory due to its stochasticity. The optimized trajectory may fall on either side of the obstacle with equal probability. The MAZE map is harder to optimize than OFFSET because the two obstacles push the trajectory in opposing directions. We see that CHOMP fails at MAZE-BOOL, because of unreliable gradient estimates from the obstacle boundaries. Finally, the SADDLE and SADDLE-BOOL maps simulate a situation where the initial trajectory is at a local minimum of the cost function. Again, STOMP is able to break out of this condition due to its stochasticity, while CHOMP is stuck in the local minimum.
[Figure 3.4 panels: (a) Without noise adaptation, (b) With noise adaptation. Each panel plots cost against the number of iterations (0 to 500) for noise settings $\sigma$ (panel a) and initial settings $\sigma_0$ (panel b) of 0.1, 0.25, 0.5, 1.0, 2.5, and 5.0.]
Figure 3.4: Evolution of trajectory costs for STOMP on the MAZE map, without and with
the noise adaptation feature. The use of noise adaptation clearly improves convergence
rate and removes sensitivity to the initial value of noise standard deviation.
3.3.2 Parameter Sensitivity Tests
We use the MAZE map (Fig. 3.3(e)) to test the effects of various parameter settings of the STOMP algorithm. First, we test the effectiveness of the noise adaptation algorithm. We fixed the number of rollouts K to 50, and ran 500 iterations of STOMP without noise adaptation, with six different standard deviations ranging from 0.1 to 5.0. The evolution of trajectory costs over iterations in this setting is shown in Fig. 3.4(a). All data shown is averaged over 100 runs. We see that the lowest setting of $\sigma = 0.1$ results in very slow improvement in trajectory costs, while the highest setting of $\sigma = 5.0$ essentially performs worse than the initial trajectory and never improves. The best results are obtained somewhere in between, for example at $\sigma = 0.5$ and $\sigma = 1.0$. This shows that the noise standard deviation setting is quite critical and may need to be tuned differently for different tasks. Next, we ran the same test, but with the noise adaptation feature switched on. The corresponding results are shown in Fig. 3.4(b). The difference is quite striking: convergence is much faster, and much more insensitive to the initially chosen value of $\sigma_0$. The change in the noise value during the noise adaptation process is shown in Fig. 3.5. We see that $\sigma$ converges to a low value as the cost converges, regardless of the initial setting $\sigma_0$.
Next, we test the effect of varying the number of noisy trajectory rollouts per iteration. We used the MAZE map, with initial noise $\sigma_0 = 1.0$, and the noise adaptation feature enabled. We ran the algorithm with a varying number of rollouts per iteration, ranging from K = 5 to K = 100. The results are shown in Fig. 3.6, where the data is averaged over 100 runs. We find that using a low number of rollouts per iteration results in a poor approximation of the integral in Eq. 3.12, and may cause the algorithm to converge prematurely to a suboptimal solution. The problem is mitigated by increasing the number of samples drawn per iteration, at the expense of higher computational cost. In fact, a
[Figure 3.5 plots the standard deviation of noise $\sigma$ against the number of iterations (0 to 500) for initial settings $\sigma_0$ of 0.1, 0.25, 0.5, 1.0, 2.5, and 5.0.]
Figure 3.5: Evolution of noise standard deviations during noise adaptation on the MAZE cost-map, averaged over 100 trials, from different initial settings $\sigma_0$.
[Figure 3.6 plots cost against the number of iterations (0 to 500) for K = 5, 10, 20, 50, and 100 rollouts per iteration.]
Figure 3.6: Effect of the number of rollouts per iteration (K) on the MAZE cost-map, averaged over 100 runs. Increasing the number of rollouts allows convergence to a lower trajectory cost.
closer look reveals that the trials with a low number of rollouts per iteration actually perform better initially, because the algorithm updates the trajectory more greedily than if trajectories were densely sampled. In many real-world situations, this may be desirable; for example, optimizing the trajectory away from collision is a higher priority than smoothing the trajectory. Our experiments in the next section on 7-DOF robot arm motion planning use a setting of K = 5, because we found that this allows the planner to quickly find collision-free trajectories which are still qualitatively smooth.
3.4 Motion Planning for a Robot Arm
In this section, we discuss the application of the STOMP algorithm to the problem of motion planning for a high-dimensional robot manipulator; a substantial amount of the work in this section has been presented in [41]. We address the design of a cost function that allows planning for obstacle avoidance, optimization of task constraints, and minimization of joint torques. STOMP is an algorithm that performs local optimization, i.e., it finds a locally optimum trajectory rather than a global one. Hence, performance will vary depending on the initial trajectory used for optimization. STOMP cannot be expected to solve typical motion planning benchmark problems like the alpha puzzle [99] in a reasonable amount of time. We evaluate the possibility of using STOMP as a motion planner for a robot arm in typical day-to-day tasks that might be expected of a manipulator robot. We conduct experiments on a simulation of the PR2 robot in a simulated world, followed by a demonstration of performance on the real robot.
3.4.1 Cost Function
We represent trajectories of the robot arm in joint space, with a fixed duration and discretization. We assume that the start and goal configurations are also provided in joint space.
The cost function we use is comprised of obstacle costs $q_o$, constraint costs $q_c$, torque costs $q_t$, joint limit costs $q_j$, and the smoothness cost $\theta^T R\,\theta$:

$J(\theta) = \sum_{t=1}^{N} \left( q_o(\theta_t) + q_c(\theta_t) + q_t(\theta_t) + q_j(\theta_t) \right) + \theta^T R\,\theta$ (3.29)
Obstacle costs
We use an obstacle cost function similar to that used in previous work on optimization-based motion planning [77]. We start with a boolean voxel representation of the environment, obtained either from a laser scanner or from a triangle mesh model. Although our algorithm can optimize such non-smooth boolean-valued cost functions (as shown in Sec. 3.3), faster convergence can be achieved by using a function for which local gradients are available (or can be estimated by sampling). We compute the signed Euclidean Distance Transform (EDT) [98] of this voxel map. The signed EDT d(x), computed throughout the voxel grid, provides information about the distance to the boundary of the closest obstacle, both inside and outside obstacles. Values of the EDT are negative inside obstacles, zero at the boundary, and positive outside obstacles. Thus, the EDT provides discretized information about penetration depth, contact and proximity. Fig. 3.7(a) shows voxels sensed in the environment by a stereo camera, and Fig. 3.7(b) shows the corresponding iso-surface 5 cm away from obstacles, as obtained from the EDT.
We approximate the robot body $B$ as a set of overlapping spheres $b \in B$. We require all points in each sphere to be a distance at least $\epsilon$ away from the closest obstacle. This constraint can be simplified as the center of the sphere being at least $\epsilon + r$ away from obstacles, where $r$ is the sphere radius. Thus, our obstacle cost function is as follows:

$q_o(\theta_t) = \sum_{b \in B} \max\left(\epsilon + r_b - d(x_b),\; 0\right)\,\|\dot{x}_b\|$, (3.30)

where $r_b$ is the radius of sphere $b$, and $x_b$ is the 3-D workspace position of sphere $b$ at time $t$ as computed from the kinematic model of the robot. Similar to the previous section, we multiply the cost by the magnitude of the velocity of the sphere. A straightforward addition of cost magnitudes over time would allow the robot to move through a high-cost area very quickly in an attempt to lower the cost. Multiplication of the cost by the magnitude of the workspace velocity of the sphere ($\|\dot{x}_b\|$) avoids this effect [77].
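A hedged sketch of this obstacle cost (Eq. 3.30) is given below. The `edt_lookup` function, the sphere positions, velocities, and radii are placeholders for a precomputed signed EDT and a sphere approximation of the robot body; they are not the API used in this work.

```python
import numpy as np

def obstacle_cost(sphere_pos, sphere_vel, sphere_radii, edt_lookup, epsilon=0.05):
    cost = 0.0
    for x, xdot, r in zip(sphere_pos, sphere_vel, sphere_radii):
        d = edt_lookup(x)                       # signed distance to the nearest obstacle
        penalty = max(epsilon + r - d, 0.0)     # positive when closer than epsilon + r
        cost += penalty * np.linalg.norm(xdot)  # weight by workspace speed
    return cost

# Toy example: one 5 cm sphere, 3 cm from an obstacle, moving at 0.1 m/s.
print(obstacle_cost(sphere_pos=[np.zeros(3)],
                    sphere_vel=[np.array([0.1, 0.0, 0.0])],
                    sphere_radii=[0.05],
                    edt_lookup=lambda x: 0.03))
```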
Constraint costs
We optimize constraints on the end-effector position and/or orientation by adding the magnitude of constraint violations to the cost function:

$q_c(\theta_t) = \sum_{c \in C} |v_c(\theta_t)|$, (3.31)

where $C$ is the set of all constraints, and $v_c$ is a function that computes the magnitude of constraint violation for constraint $c \in C$. In the experiments below, we optimize a commonly occurring constraint in manipulation tasks: to maintain zero roll and pitch of the gripper (e.g., when the robot is holding a glass of water in its gripper). This is represented as follows:

$q_c(\theta_t) = |v_p(\theta_t)| + |v_r(\theta_t)|$ (3.32)

where $v_p$ and $v_r$ are functions that compute the pitch and roll of the end-effector respectively using forward kinematics.
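A minimal sketch of this constraint cost (Eqs. 3.31 and 3.32) is shown below; the `forward_kinematics` function is an assumed placeholder returning the end-effector roll and pitch for a joint configuration, not part of the planner's actual interface.

```python
import numpy as np

def constraint_cost(theta_t, forward_kinematics):
    roll, pitch = forward_kinematics(theta_t)
    return abs(pitch) + abs(roll)   # |v_p| + |v_r|: penalize deviation from upright

# Example with a dummy kinematics function that tilts linearly with the last joint.
fk = lambda q: (0.1 * q[-1], -0.05 * q[-1])
print(constraint_cost(np.array([0.0, 0.3, -0.2, 1.0]), fk))  # 0.15
```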
Figure 3.7: (top) Voxels sensed in the environment by the PR2's stereo camera. (bottom)
The red cubes show the iso-surface 5 cm away from obstacles, as obtained from the
Euclidean Distance Transform (EDT).
Figure 3.8: (a) Simulation setup used to evaluate STOMP as a robot arm motion plan-
ner. (b) Initial straight-line trajectory between two shelves. (c) Trajectory optimized by
STOMP to avoid collision with the shelf, constrained to maintain the upright orientation
of the gripper.
Torque costs
Given a suitable dynamics model of the robot, we can compute the feed-forward torque required at each joint to track the desired trajectory using inverse dynamics algorithms [24]. The motor torques at every instant of time are a function of the joint states and their derivatives:

$\tau_t = f(x_t, \dot{x}_t, \ddot{x}_t)$, (3.33)

where $x_t = \theta_t$ represents the joint states at time $t$, and $\dot{x}_t, \ddot{x}_t$ are the joint velocities and accelerations respectively, obtained by finite differentiation of $\theta$. Minimization of these torques can be achieved by adding their magnitudes to the cost function:

$q_t(\theta_t) = |\tau_t|$ (3.34)
Joint limit costs
Joint limit violations are penalized by adding a cost term proportional to the amount of violation per time-step:

$q_j(\theta_t) = \begin{cases} \theta_t - \theta_{\max}, & \text{if } \theta_t > \theta_{\max} \\ \theta_{\min} - \theta_t, & \text{if } \theta_t < \theta_{\min} \\ 0, & \text{otherwise} \end{cases}$ (3.35)
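A minimal sketch of this joint-limit penalty, applied per joint and summed over the joints, is shown below; the limits used in the example are illustrative values, not the PR2's actual joint limits.

```python
import numpy as np

def joint_limit_cost(theta_t, theta_min, theta_max):
    over = np.maximum(theta_t - theta_max, 0.0)   # violation above the upper limit
    under = np.maximum(theta_min - theta_t, 0.0)  # violation below the lower limit
    return np.sum(over + under)

print(joint_limit_cost(np.array([0.5, 2.2]),
                       theta_min=np.array([-2.0, -2.0]),
                       theta_max=np.array([2.0, 2.0])))  # 0.2
```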
Table 3.3: Results obtained from planning arm trajectories in the simulated shelf environment shown in Figure 3.8(a).

Scenario                            | STOMP Unconstrained | CHOMP Unconstrained | STOMP Constrained
Number of successful plans          | 210 / 210           | 149 / 210           | 196 / 210
Avg. planning time to success (sec) | 0.88 ± 0.40         | 0.71 ± 0.25         | 1.86 ± 1.25
Avg. iterations to success          | 52.1 ± 26.6         | 167.1 ± 113.8       | 110.1 ± 78.0
Smoothness costs
For robot arm motion planning, we choose to minimize the sum of squared accelerations,
as this results in smoother paths than minimizing squared velocities. If model-based
controllers are used to compute torques sent to each joint, acceleration continuity is
important. This ensures that the torques sent to the motors are also continuous in nature.
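For illustration, a hedged sketch of a squared-acceleration smoothness cost computed by second-order finite differencing of a discretized joint trajectory is shown below; the exact finite-differencing matrix used by STOMP may differ from this simple version.

```python
import numpy as np

def smoothness_cost(theta, dt):
    # theta: (N, num_joints) discretized joint trajectory.
    acc = np.diff(theta, n=2, axis=0) / dt**2   # (N-2, num_joints) accelerations
    return np.sum(acc**2)

traj = np.linspace(0.0, 1.0, 100).reshape(-1, 1)**2   # quadratic ramp on one joint
print(smoothness_cost(traj, dt=0.05))
```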
3.4.2 Simulation Results
We created a simulation world containing a shelf with 15 cabinets, as shown in Figure 3.8(a). Seven of these cabinets were reachable by the 7 degree-of-freedom right arm of the PR2. We tested planning of arm trajectories between all pairs of the reachable cabinets in both directions, resulting in 42 different planning problems. Each problem was repeated 5 times, resulting in a total of 210 trials. In each case, the initial trajectory used was a straight line through configuration space of the robot. Each trajectory was 5 seconds long, discretized into 100 time-steps.
These planning problems were repeated in two scenarios with two different cost functions. The unconstrained scenario used a cost function which consisted of only obstacle costs and smoothness costs. Success in this scenario implies the generation of a collision-free trajectory. In the constrained scenario, we added a constraint cost term to the cost function. The task constraint was to keep the gripper upright at all times, i.e., as though it were holding a glass of water. Specifically, the roll and pitch of the gripper were constrained to be within ±0.2 radians. Success in this scenario involves generation of a collision-free trajectory that satisfies the task constraint within the required tolerance. We also evaluated the performance of the gradient-based optimizer CHOMP [77] on the unconstrained scenario using the same cost function. The gradient descent step size for CHOMP was tuned to achieve good performance without instability, while the initial
Figure 3.9: Planning problem used to evaluate torque minimization. (a) Plan obtained
without torque minimization: arm is stretched. (b,c) Two different plans obtained with
torque minimization. In (b), the entire arm is pulled down, while in (c) the elbow is folded
in: both solutions require lower gravity compensation torques. Figure 3.10(b) shows the
torques required for these movements.
noise $\sigma_0$ for STOMP was set to a reasonable value. Both algorithms were capped at 500 iterations. For STOMP, K = 5 noisy trajectory samples were generated at each iteration.
Table 3.3 shows the results obtained from these experiments. STOMP produced a collision-free trajectory in all trials in the unconstrained scenario. In contrast, CHOMP fails on nearly 30% of them, presumably due to the gradient descent getting stuck in local minima of the cost function (this result was obtained using the standard CHOMP gradient update rule, not the Hamiltonian Monte Carlo variant; please refer to Sec. 3.4.4 for details). The execution times are comparable, even though CHOMP usually requires more iterations to achieve success. This is because each iteration of STOMP requires multiple trajectory cost evaluations, but can make larger steps in a more stable fashion than the CHOMP gradient update.
In the constrained scenario, 93.3% of trials resulted in plans that were both collision-free and satisfied the task constraints. Figure 3.10(a) shows the iterative evolution of trajectory costs for one of the constrained planning problems, averaged over ten trials. These results are obtained when initializing the algorithm with a naïve straight-line trajectory through configuration space, usually infeasible due to collisions with the environment (see Fig. 3.8(b)). The algorithm is able to push the trajectory out of collision with the environment, as well as optimize secondary costs like constraints and torques. The trajectory after optimization is ready for execution on a robot, i.e., no smoothing is necessary as is commonly used in conjunction with sampling-based planners [29].
In order to test the part of the cost function that deals with minimization of torques,
we ran 10 trials on the planning problem shown in Figure 3.9, with and without the
torque term in the cost function. The resulting torques of the optimized movements in
both cases are shown in Figure 3.10(b). Since the movements are rather slow, most of the
[Figure 3.10 panels: (a) trajectory cost, with ±1 standard deviation, against iteration number; (b) sum of absolute joint torques (Nm) over time (sec), with and without torque optimization.]
Figure 3.10: (a) Iterative evolution of trajectory costs for 10 trials of STOMP on a
constrained planning task. (b) Feed-forward torques used in the planning problem shown
in Figure 3.9, with and without torque optimization, averaged over 10 trials.
torques are used in gravity compensation. Hence the torques at the beginning and end of the movement stay the same in both cases. The planner finds parts of the state space towards the middle of the movement which require lower torques to support the weight of the arm. Interestingly, the planner usually finds one of two solutions as shown in Figures 3.9(b) and 3.9(c): (1) the entire arm is pulled down, or (2) the elbow is folded in on itself, both of which require lower torque than the stretched-out configuration.
3.4.3 Real Robot
The attached video shows demonstrations of trajectories planned using STOMP in a
household environment, executed on the PR2 robot. Obstacles in this setup are sensed
using a combination of a tilting laser range-finder and a stereo vision system with a texture
projector.
3.4.4 Discussion
A Hamiltonian Monte Carlo variant of CHOMP is discussed in [77,78], as a principled way
of introducing stochasticity into the CHOMP gradient update rule. While theoretically
sound, we found this method difficult to work with in practice. It introduces additional
parameters which need to be tuned, and requires multiple random restarts to obtain a
successful solution. In contrast, our algorithm requires minimal parameter tuning, does
not need cost function gradients, and uses a stable update rule which under certain
assumptions guarantees that the average cost is non-increasing.
In contrast to other planning approaches like randomized planning which require a
post-processing step to generate smooth paths, STOMP directly outputs trajectories that
Figure 3.11: Force control policies for two different manipulation tasks were learnt using
our method: (a) opening a door, and (b) picking up a pen from the table.
can be executed directly by a controller. STOMP's ability to optimize more complex con-
straints also sets it apart from other motion planners, including trajectory optimization
planners like CHOMP, allowing it to be applied to real-world problems where such con-
straints may often arise.
3.5 Learning Force Control Policies for Compliant Manipulation
Developing robots capable of fine manipulation skills is of major importance in order to
build truly assistive robots. Manipulation tasks imply complex contact interactions with
an unstructured environment and require a controller that is able to handle force interac-
tions in a meaningful way. In order for robots to co-exist in an environment with humans,
safety is a prime consideration. Therefore, touching and manipulating an unstructured
world requires a certain level of compliance while achieving the intended tasks accurately.
Methods for planning kinematic trajectories for manipulators are well-studied and
widely used. Rigid body dynamics models even allow us to plan trajectories that take
the robot dynamics into account. However, once the robot comes into contact with the
environment, planning algorithms would require precise dynamics models of the resulting
contact interactions. These models are usually unavailable, or so imprecise that the
generated plans are unusable. This seems to suggest alternate solutions that can learn
these manipulation skills through trial and error.
Acquisition of manipulation skills using reinforcement learning has been previously
demonstrated [13,48,55,72,91]. In most of these approaches, the policy encodes positions
which are tracked by a controller. These position trajectories are adapted in order to
achieve the task, thus indirectly exerting forces on the objects being manipulated. A
desired position trajectory must penetrate into the object being manipulated, in order
to apply a force on it. This can potentially be dangerous, for example, if the object
is wrongly positioned, and the resulting forces generated are too high. In contrast, we
propose to learn the forces and torques to be controlled at the end-effector in conjunction
with a demonstrated kinematic trajectory. This corresponds more directly to the physical
quantities that need to be controlled. Control of forces instead of positions allows the
system to deal with position and state estimation errors that are unavoidable when acting
in the real world. Furthermore, in tasks that involve contact with the environment,
exploration in high gain position control mode could be damaging to the robot and its
surroundings.
Methods for learning force profiles from demonstrations provided via haptic input
have been proposed [56]. The method does not explicitly consider task performance:
it merely reproduces the demonstrated trajectories. In the context of our work, such
techniques could be used to initialize a force profile for further optimization to improve
task performance.
In this section, we present an approach to learning manipulation tasks on compliant robots through reinforcement learning. An initial kinesthetic demonstration of the task is provided. Execution of this demonstration on a robot with compliant control fails to achieve the task successfully, because the demonstration does not include information about forces to be exerted at the end-effector. We propose to learn the required end-effector forces through trial and error reinforcement learning. This allows the robot to acquire a robust strategy to perform the manipulation task, while remaining compliant in its control. We demonstrate our approach on two different manipulation tasks: opening a door with a lever door handle, and picking up a pen off the table (Fig. 3.11). We show that our approach can learn the force profiles required to achieve both tasks successfully. The contributions of this section are two-fold: (1) we demonstrate that learning force control policies enables compliant execution of manipulation tasks with increased robustness as opposed to stiff position control, and (2) we introduce a policy parameterization that uses finely discretized trajectories coupled with a cost function that ensures smoothness during exploration and learning.
3.5.1 Learning Force Feedback Control
Fig. 3.12 shows a high-level overview of our approach to learning manipulation tasks on a compliant robot. The policy is initialized from a user-provided kinematic demonstration. The PI² reinforcement learning algorithm is used to optimize the policy and acquire the right force/torque profiles through trial and error. We further discuss some of the steps involved:
[Figure 3.12 block diagram: the initial policy combines the kinematic demonstration with zero force/torque profiles.]
Figure 3.12: A high-level overview of our approach to learning force control policies for
manipulation.
Demonstration
We record the end-effector position and orientation trajectories during a kinesthetic demonstration of the task provided by the user. Since forces and torques applied by the robot on the object being manipulated cannot be observed correctly during kinesthetic demonstration, the force-torque profiles are initialized to zero. When controlling zero forces, the robot is maximally compliant, i.e., the end-effector gives in to contact forces easily. This forms a safe starting point for exploration, and ensures that only the forces required to satisfy the task are learnt. We use the discretized policy representation discussed in Sec. 3.2.3, Eq. (3.22), for both position and force trajectories.
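A hedged sketch of how such an initial policy could be assembled is shown below: positions and orientations come from the demonstration, while force and torque profiles start at zero. The array shapes and dictionary layout are assumptions for illustration only.

```python
import numpy as np

def initial_policy(demo_positions, demo_orientations, num_timesteps=100):
    return {
        "position": np.asarray(demo_positions),        # (N, 3) Cartesian positions
        "orientation": np.asarray(demo_orientations),  # (N, 4) quaternions
        "force": np.zeros((num_timesteps, 3)),         # start with zero forces
        "torque": np.zeros((num_timesteps, 3)),        # and zero torques
    }

policy = initial_policy(np.zeros((100, 3)), np.tile([0, 0, 0, 1], (100, 1)))
print(policy["force"].shape)  # (100, 3)
```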
Cost Function
In order to acquire the correct force/torque profiles, the reinforcement learning algorithm requires a measure of task success for every trial that it executes. This feedback could be provided by the user for every trial, but it is more convenient if task success can be evaluated in an automated fashion. PI² has previously been used to optimize boolean cost functions [91], but learning tends to be faster if the cost function has a gradient to follow.
Execution
The combined position/force trajectory needs to be controlled by the robot in a suitable way. Numerous methods can be found in the literature that control forces and positions simultaneously [86]. PI², being a model-free reinforcement learning algorithm, is indeed agnostic to the type of controller used. It simply optimizes the policy parameters to improve the resulting cost function, treating the intermediate controllers and unmodeled system dynamics as a black box. However, the generalization ability of the learnt policy to different parts of the workspace will depend on the force and position tracking performance of the controller.
Rollout reuse
In every iteration of PI², K noisy rollouts are generated, evaluated, and used to update the policy. We instead prefer to preserve a few good rollouts from previous iterations, so that we continue to learn from them in future iterations, and achieve stable convergence. However, if the task itself is stochastic in nature, this procedure may be counter-productive. For example, if a noisy rollout happens to achieve a low cost during its evaluation, but is not repeatable, it nevertheless continues to contribute to future policy updates. In order to mitigate this effect, we reevaluate all the reused rollouts at every iteration. This ensures that only rollouts that consistently generate low costs are carried forward at each iteration. This feature was critical for PI² to converge to a robust policy in the pen grasping experiment presented in Sec. 3.5.2.
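A minimal sketch of this rollout-reuse heuristic is shown below: the best few rollouts are kept, but all reused rollouts are re-evaluated every iteration so that lucky, non-repeatable rollouts do not keep influencing the update. The `evaluate` callable and the rollout representation are placeholders, not the system's actual interfaces.

```python
import random

def select_rollouts(new_rollouts, reused_rollouts, evaluate, num_reuse=3):
    reevaluated = [(evaluate(r), r) for r in reused_rollouts]  # fresh costs for reused rollouts
    fresh = [(evaluate(r), r) for r in new_rollouts]
    ranked = sorted(reevaluated + fresh, key=lambda c_r: c_r[0])
    keep_for_next_iter = [r for _, r in ranked[:num_reuse]]    # lowest-cost rollouts only
    return ranked, keep_for_next_iter

# Toy usage with scalar "rollouts" and a noisy cost.
ranked, reuse = select_rollouts([0.2, 0.5, 0.9], [0.1],
                                lambda r: r + 0.05 * random.random())
print(reuse)
```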
3.5.2 Experiments
Our approach was verified using two different manipulation tasks: opening a door and picking up a pen lying on a table. These tasks were chosen because each one involves significant contact with the environment, and they are thus appropriate test-beds for the use of force control policies. The learning process and final executions of both tasks are shown in the attached video [42].
Both tasks were performed on the 7 degree-of-freedom (DOF) Barrett WAM arm, equipped with a three-fingered Barrett Hand and a 6-DOF force-torque sensor at the wrist. Fig. 3.13 shows an overview of the controllers running on our system. Our control law for the 7-DOF arm is as follows:

$\tau_{\text{arm}} = \tau_{\text{inv.dyn.}} + \tau_{\text{joint}} + \tau_{\text{force}}$

Desired task-space position/orientation trajectories are converted into joint space using the Jacobian pseudo-inverse. The resulting joint velocities are integrated and differentiated to get joint positions and accelerations respectively. $\tau_{\text{inv.dyn.}}$ are the inverse dynamics torques obtained from a recursive Newton-Euler algorithm [24]. $\tau_{\text{joint}}$ is obtained from low-gain joint PD controllers. Finally, $\tau_{\text{force}}$ is obtained from a PI controller in task space on the desired force/torque trajectory, converted to joint space torques using the
[Figure 3.13 block diagram blocks: inverse kinematics, low-gain PD control, inverse dynamics, end-effector force/torque PI control, and finger low-gain PD control, driven by desired Cartesian position, Cartesian force/torque, and finger position trajectories; their torque contributions are summed and sent to the WAM robot.]
Figure 3.13: Overview of the controllers used in our experiments.
Jacobian transpose. The fingers are position controlled using low-gain PD controllers. All our controllers run at a rate of 300 Hz on a desktop computer running the Xenomai real-time operating system.
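A hedged sketch of this control-law combination is shown below. All helper functions, gains, and array shapes are placeholders for illustration; this is not the controller code used on the WAM.

```python
import numpy as np

def arm_torque(q, qd, q_des, qd_des, qdd_des, f_err, f_err_int, jacobian,
               inverse_dynamics, kp=5.0, kd=0.5, kf_p=1.0, kf_i=0.0):
    tau_inv_dyn = inverse_dynamics(q_des, qd_des, qdd_des)   # feed-forward torques
    tau_joint = kp * (q_des - q) + kd * (qd_des - qd)        # low-gain joint PD term
    wrench = kf_p * f_err + kf_i * f_err_int                 # 6-D force/torque PI term
    tau_force = jacobian(q).T @ wrench                       # map wrench to joint torques
    return tau_inv_dyn + tau_joint + tau_force

# Toy usage with 7 joints and dummy models.
n = 7
tau = arm_torque(np.zeros(n), np.zeros(n), np.zeros(n), np.zeros(n), np.zeros(n),
                 np.zeros(6), np.zeros(6),
                 jacobian=lambda q: np.zeros((6, n)),
                 inverse_dynamics=lambda q, qd, qdd: np.zeros(n))
print(tau.shape)  # (7,)
```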
Opening a door
The aim of this experiment is to learn a control policy to successfully operate a lever door
handle and open the door shown in Fig. 3.11(a). A kinesthetic demonstration provides
the desired cartesian positions and orientations, while the desired force/torque trajectories
are initialized to 0. The trajectory was 10 seconds long, discretized into 100 time-steps.
Direct playback of the demonstration fails to achieve the task due to the compliance of
the robot and failure to apply the required forces and torques on the door handle.
In order to measure task success, we attached a MicroStrain 3DM-GX3-25 Inertial
Measurement Unit (IMU) to the door handle. The resulting orientation measurement
gives us the angle of the handle and of the door itself. We recorded the desired trajectories
of the door angle and handle angle during the demonstration: tracking these trajectories
forms the primary task success criterion. The immediate cost function at time t is:
$r_t = 300\,q_{\text{door}} + 100\,q_{\text{handle}} + 100\,q_{\text{pos}} + 10\,q_{\text{orient}} + 0.1\,q_{\text{fmag}} + 0.02\,q_{\text{tmag}} + 0.02\,q_{\text{ttrack}} + 0.01\,q_{\text{ftrack}} + 0.0001\,\theta^T R\,\theta$, where $q_{\text{door}}$ and $q_{\text{handle}}$ are the squared tracking errors of the door and handle angles respectively, $q_{\text{pos}}$ and $q_{\text{orient}}$ are the squared tracking errors of the position and orientation of the hand, $q_{\text{fmag}}$ and $q_{\text{tmag}}$ are the squared magnitudes of the desired forces and torques, $q_{\text{ftrack}}$ and $q_{\text{ttrack}}$ are the squared force and torque tracking errors, and $\theta^T R\,\theta$ is the control cost.
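For illustration, a minimal sketch of this immediate cost, written as a plain weighted sum of the squared-error terms listed above, is given below; the feature values are assumed to be computed elsewhere, and the control-cost term is omitted from the sketch.

```python
WEIGHTS = {"door": 300, "handle": 100, "pos": 100, "orient": 10,
           "fmag": 0.1, "tmag": 0.02, "ttrack": 0.02, "ftrack": 0.01}

def immediate_cost(features):
    """features: dict mapping term name -> squared error / magnitude at time t."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

print(immediate_cost({"door": 0.01, "handle": 0.02, "pos": 0.0, "orient": 0.0,
                      "fmag": 4.0, "tmag": 1.0, "ttrack": 0.5, "ftrack": 0.3}))
```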
We use PI² to learn policies for all 6 force/torque dimensions. A relatively small exploration noise is also simultaneously introduced in the Cartesian position and orientation, to correct for imperfections in the demonstrated trajectory. Fig. 3.15(a) shows the improvement in the cost with the number of trials executed. After 110 trials, we obtained a policy which achieved the task successfully 100% of the time, with a cost of 498.1 ± 5.3, averaged over 23 trials. Learning was terminated when all the noisy rollouts in a particular learning iteration successfully achieved the task, indicating that the learnt policy was robust to the exploration noise. Fig. 3.14 shows the learnt force/torque profiles and their tracking performance. Fig. 3.18 shows snapshots from an execution of the final policy.
For this experiment, the proportional gains for the force controller were set to 1, while integral gains were set to 0. In theory, a proportional gain of 1 should compensate for the difference in desired and sensed forces. However, due to the addition of the joint-space position controller, these forces cannot be perfectly realized. Despite the force tracking inaccuracies seen in Fig. 3.14, the algorithm learns to achieve the task successfully. This is possible because the PI² learning algorithm treats the controllers as a black box; it simply learns the required sequence of control inputs that minimize the task cost function.
Figure 3.14: Learnt force/torque control policy and tracking errors for the door opening
task. The green solid lines show the learnt desired forces and torques respectively. The
blue dashed lines indicate the corresponding measured force/torque values.
Figure 3.15: Evolution of cost functions during learning for the two manipulation tasks:
(a) door opening, and (b) pen grasping.
Figure 3.16: Learnt force/torque control policy and tracking errors for the pen grasping
task. The green solid line shows the learnt desired force in the z axis. The blue dashed
line indicates the corresponding measured force.
Figure 3.17: This table shows the extent of position and orientation errors that the final pen grasping force control policy was able to handle. Position error was introduced along the direction of closing of the fingers. The 'O' in the center marks the original position of the pen during learning. The red 'X' squares indicate failure, while the green empty squares indicate success. Negative position errors were tolerated much better than positive errors, because the hand has two fingers on this side, versus only one on the other.
Figure 3.18: Sequence of images from executions of the learnt policies for the two ma-
nipulation tasks: (top) opening a door with a lever handle, and (bottom) picking up a
pen from the table. The attached video shows the learning process and final executions
of each of these tasks.
Grasping a pen
The next task is to pick up a pen from the table, as shown in Fig. 3.11(b). This task is considered difficult, because the size of the pen is comparable to the size of the finger-tips, making the chance of slippage very high. In addition, this task would be quite difficult to achieve using a pure position-control strategy, since this would require very precise knowledge of the pen position and the table height.
We use the following initial policy: we keep the desired position and orientation of the hand fixed over the pen, while closing the fingers to perform the grasp. The kinematics of the fingers are such that the fingers dig into the table during the motion (see Fig. 3.18 for a visual depiction of the grasping motion). Servoing zero forces at the wrist allows the hand to move up when the fingers come into contact with the table. While this simple strategy works for larger objects, it was not successful at grasping the pen. Hence, we would like to learn a profile of downward forces to be applied into the table that can help in robustly grasping the pen. We use both proportional and integral gains in the force controller in order to achieve good force tracking. We use PI² to learn the desired force in the z axis, while fixing the desired x and y torques to 0. Force control gains for the remaining 3 axes are set to zero.
We detect whether the pen has been grasped or not based on the finger joint angles
crossing a threshold value. We assign a cost for every time-step that the joint angles are
above this threshold. This provides a gradient to the learning algorithm, i.e., the longer
the pen remains grasped without slipping, the lower the cost. Costs are accumulated
during the 6-second grasping phase as well as a subsequent 6-second lifting phase, which verifies the stability of the grasp. As with the previous experiment, trajectories were discretized into 100 time-steps. The immediate cost function at time t is: $r_t = 100\,q_{\text{pen}} + 1.0\,q_{\text{ftrack}} + 0.5\,q_{\text{fingertrack}} + 0.1\,q_{\text{fmag}} + 0.0001\,\theta^T R\,\theta$, where $q_{\text{pen}}$ is an indicator cost which is 1 if the pen has slipped out of the hand (as described above), $q_{\text{ftrack}}$ is the squared force tracking error, $q_{\text{fingertrack}}$ is the squared finger position tracking error, $q_{\text{fmag}}$ is the squared force magnitude, and $\theta^T R\,\theta$ is the control cost. After 90 trials, we converged to a policy for which all noisy rollouts succeeded. This final policy achieved a success rate of 100%, with a cost of 47.1 ± 0.8, averaged over 21 trials. Fig. 3.15(b) shows the evolution of costs during the learning process. For this task, we set the proportional gains of the force controller to 1, but also use an integral gain of 0.1. This allows better force tracking when the fingers are in contact with the table. The learnt force profile and its tracking performance are shown in Fig. 3.16.
During learning, we keep the position and orientation of the pen fixed, i.e., robustness to position and orientation uncertainty was not incorporated in the learning process. We found, however, that the final learnt policy was robust to position and orientation uncertainty. Fig. 3.17 shows the extent of position and orientation errors that could be tolerated by our policy. We believe these robustness results could further be improved upon if uncertainty were to be incorporated in the learning process [91].
Finally, we recorded the Cartesian position and orientation trajectories from a successful grasp, and replayed these trajectories without the force controller. This policy was unsuccessful at grasping the pen, which shows that the learnt force control policy plays a significant role in achieving the task.
3.6 Conclusions
We have presented an algorithm for planning smooth trajectories for high-dimensional robotic manipulators in the presence of obstacles. The planner uses a derivative-free stochastic optimization method to iteratively optimize cost functions that may be non-differentiable and non-smooth. We have demonstrated the algorithm on several 2-D environments, and on a 7-DOF robot arm planning task, both in simulation and on a mobile manipulator. We have demonstrated the use of STOMP for obstacle avoidance, optimization of task constraints, and minimization of motor torques used to execute the trajectory. Experimental results have shown that the approach is able to overcome local minima that gradient-based methods can get stuck in. Further, we have introduced a noise adaptation extension to STOMP that provides automatic tuning of the exploration noise. Results have shown that this addition accelerates convergence and reduces sensitivity to initial settings. We have also demonstrated the application of STOMP/PI² in a reinforcement learning setting. We used trial-and-error learning to acquire a force control policy that can achieve manipulation tasks. We have shown that such control policies allow robots to increase their robustness towards localization errors.
A possibility for future work is to augment this local trajectory optimizer with a trajectory library approach, which can recall previous trajectories used in similar situations, and use them as a starting point for further optimization [7,35]. We are also investigating the use of this planner in conjunction with controllers that are able to predict forces at the contact points [81], in order to plan dynamic maneuvers on floating-base robots like quadrupeds and bipeds.
The primary motivation in developing this planner was to have a fast trajectory optimizer that can work in high-dimensional spaces, and optimize arbitrary cost functions. The resulting flexibility in cost function design can be exploited through machine learning. In the next chapter, we present an inverse reinforcement learning algorithm that uses principles from STOMP, and we apply this algorithm to learn objective functions for manipulation planning from user demonstrations.
Chapter 4
Learning Objective Functions for Manipulation
4.1 Introduction
Optimal control, reinforcement learning and other optimization-based methods have be-
come tools of choice in solving robotics problems. This is due not only to advances in
the algorithms and techniques themselves, but also due to massive increases in available
computing power over the years. State-of-the-art systems for rough terrain quadruped
locomotion rely heavily on search and optimization techniques for foot-hold selection and
trajectory planning [101]. Reinforcement learning has found wide application in the do-
main of robotic manipulation [47,70].
Randomized motion planning has been a very successful paradigm for manipulation,
but results in unpredictable robot behavior because it typically focuses on producing
feasible solutions and not optimal ones. Optimization can help overcome these problems
by resolving the redundancy within the space of feasible solutions. Predictable robot
behavior improves task repeatability, allows for improved human-robot interaction [5],
and produces characteristic sensor signals which can be used to detect failures [70] or
trigger recovery behaviors [71].
Optimization-based methods rely on the fact that it is usually easier to define an objective function that measures the quality of a chosen sequence of controls or parameters, than it is to directly specify the control sequence. The performance of these systems is critically dependent on the definition of this objective function, which may require extensive manual tuning in order for the planner or optimal controller to achieve acceptable real-world performance [54].
Inverse reinforcement learning (IRL) and inverse optimal control (IOC) methods seek
to address the above problem. The core idea behind these techniques is to infer a suitable
cost function given observations of optimal behavior. In many robotics problems, it is significantly easier to provide demonstrations of optimal behavior than it is to design a cost function that typifies the required behavior. IRL is commonly used as a method of
imitation learning. It is used to learn the underlying intentions of the observed behavior,
allowing the synthesis of novel control policies in new situations. This form of learning
Figure 4.1: The ARM robot with the 7-DOF Barrett WAM arm grasping objects using a
learned inverse kinematics cost function.
promises vastly improved generalization abilities compared to a mere reproduction of
observed controls or states.
Several IRL algorithms have been proposed over the last decade [1,67,76,100]. Most of
them can be described as the following iterative procedure: (a) solve the optimal control
problem under the current estimate of the cost function; (b) update the cost function
estimate using the output of (a). The exact form of the cost function update differs in each
case. These procedures work well in discretized state-action spaces of low dimensionality.
However, in higher dimensional continuous state-action settings, computation of the full
solution of the optimal control problem becomes intractable.
Methods have been proposed that do not involve solving the full forward MDP opti-
mization problem iteratively. [23] proposes an efficient solution which uses samples to
approximate the value function directly based on a linear MDP formulation. [11] uses a
sampling-based approach combined with importance sampling to arrive at an objective
function similar to the maximum entropy method [100].
The rest of this chapter is structured as follows. In Sec. 4.2, we extend our previous
work on path integral inverse reinforcement learning [44] with efficient regularization and optimization methods that allow our method to scale to high-dimensional problems with
potentially large amounts of data. We then present two applications to robot manipula-
tion. In Sec. 4.3, we apply our method to learn a cost function for redundancy resolution
in the inverse kinematics problem. In Sec. 4.4, we apply our algorithm to learning cost
functions for optimization-based motion planning for use in grasping and manipulation
tasks. We compare the performance of our algorithms with previous IRL methods, and
discuss the results in Sec. 4.5. Finally, we conclude and present ideas for future work in
Sec. 4.6.
4.2 Path Integral Inverse Reinforcement Learning
Our approach to inverse reinforcement learning comes from path integral stochastic op-
timal control methods [93]. While most previous methods assume that the provided
demonstrations are globally optimal, we aim to discover the cost function that makes the
demonstrated trajectory appear locally optimal. This makes our technique applicable to
high dimensional problems where global optimality may be infeasible. We build upon our
previous work [44], and propose efficient regularization and solution methods that scale to large problems. We also highlight the relationships and differences between our approach,
Maximum Entropy IRL (MaxEntIRL) [100] and Relative Entropy IRL (RelEntIRL) [11].
4.2.1 Path Integral Inverse Reinforcement Learning
Given the path integral reinforcement learning framework from the previous chapter,
there is a surprisingly simple formulation of IRL. We assume that the state-dependent
cost function q is linearly parameterized as:

$q(\mathbf{x}, t) = \mathbf{w}_q^T\,\boldsymbol{\phi}(\mathbf{x}, t)$ (4.1)

where $\boldsymbol{\phi}_t$ are user-provided cost feature values at time $t$, and $\mathbf{w}_q$ are the weights associated with the features, which we are trying to learn. Due to the coupling of exploration noise and command cost in the path integral formalism, we assume that the shape of the quadratic control cost matrix R is given, and we can only learn its scaling factor. We also need to learn the scaling factor on the terminal cost:

$\mathbf{R} = w_R\,\hat{\mathbf{R}}; \qquad q_{t_N} = w_{q_{t_N}}\,\phi_{t_N}$ (4.2)

where $w_R$ is the control cost scaling factor that needs to be learned, $\hat{\mathbf{R}}$ is the shape of the control cost, and $w_{q_{t_N}}$ is the scaling of the terminal cost $\phi_{t_N}$. We can rewrite the cumulative cost of trajectory $\tau_k$ at time $t$, $J(\tau_{k,t})$, as follows:

$J(\tau_{k,t}) = \mathbf{w}^T \begin{bmatrix} \sum_{j=t}^{N} \boldsymbol{\phi}_{k,j} \\ \frac{1}{2}\sum_{j=t}^{N} (\boldsymbol{\theta} + \mathbf{M}_t\,\boldsymbol{\epsilon}_{k,t})^T\,\hat{\mathbf{R}}\,(\boldsymbol{\theta} + \mathbf{M}_t\,\boldsymbol{\epsilon}_{k,t}) \\ \phi^k_{t_N} \end{bmatrix} = \mathbf{w}^T\,\boldsymbol{\Phi}_{k,t}$ (4.3)

where $\mathbf{w} = [\mathbf{w}_q\;\; w_R\;\; w_{q_{t_N}}]^T$ is the weight vector to be learned, and $\boldsymbol{\Phi}_{k,t}$ are sufficient statistics that can be computed for every time-step $t$ for each sample trajectory $\tau_k$.
In order to derive an inverse reinforcement learning algorithm based on path integral stochastic optimal control theory, we start from (3.20), which is a theoretically derived probability distribution over trajectories. It implies that paths with lower costs are exponentially more likely than paths of higher cost. Given a demonstration $\tau^*$ with cumulative feature counts $\boldsymbol{\Phi}^*$
, we can dene the likelihood of the optimal trajectory as:
L(
jw) =
e
w
t
R
e
w
t
d
(4.4)
If sample trajectories are available where control noise is introduced independently at
each time-step, we can dene the overall trajectory likelihood as the following product:
L(
jw) =
N
Y
t=1
e
w
t
t
R
e
w
t
t
d
t
, (4.5)
where
t
refers to the cumulative feature counts starting at timestep t.
At rst glance, the expression for the trajectory likelihood looks similar to that of Max-
EntIRL [100]. However, in [100], the partition function is computed with the assumption
that every path through the MDP with the same value is equally likely. In our case, the
trajectory distribution in the denominator of (4.4) is generated by the dynamics of (3.15).
This distribution over trajectories arises from the addition of normally distributed noise
to the controls (policy parameters). The assumption is that the variance of the control
noise
is proportional to R
1
. This means that trajectories that require a signicant
control eort (as measured by the cost
t
R) do not get sampled as much as trajectories
that are closer to the nominal trajectory. This is the reason our approach is more suitable
in high-dimensional settings when the demonstrations can only be assumed to be locally
optimal. In essence, the optimal trajectory
is only compared against trajectories in its
local vicinity rather than considering the entire space of trajectories uniformly. Locality
in our setting is determined by the amount of control cost incurred in deviating from the
demonstration. RelEntIRL [11] on the other hand samples trajectories from an arbitrary
distribution and normalizes them using importance sampling to approximate a uniform
distribution of trajectories, making the end result similar to MaxEntIRL. Importance
sampling tends to perform poorly in high-dimensional parameter spaces due to the curse
of dimensionality [58]. In [2], a similar IRL approach based on our previous work [44]
is presented. The sampling technique used is similar to ours, but this local optimality
property is not discussed. They also propose an additional step towards active sampling
of trajectories by applying the PI
2
update on the trajectory samples and adding this
sub-optimal trajectory to the sample set. Intuitively, it would seem that the addition
of several trajectories that are very close to the demonstration in the sample set causes
the optimality of the demonstrated trajectory tend to become iteratively more and more
myopic. These conjectures are experimentally tested in Sec. 4.5, where we nd that this
procedure results in high variance during learning.
47
4.2.2 Regularization and Ecient Optimization
Our approach to IRL above requires the parameterization of the cost function as a linear
combination of features. Choosing the right features for each task, such that the learnt cost
function will generalize well is not trivial. As such, we would like to provide the learning
algorithm with an over-complete set of features and allow it to select the appropriate ones.
One way to achieve sparse learning and feature selection is to penalize theL
1
norm of the
weight vector w. L
1
regularized models are known to learn good models that generalize
well even in the presence of a large number of irrelevant features [69]. The sparse solutions
generated as a result of L
1
regularization also allow for fast evaluation of the learnt cost
functions [40]. The nal objective we minimize is a weighted sum of the negative log
likelihood of (4.4) and the l
1
norm of w:
minimize
w
D
X
i=1
log
e
w
t
i
P
K
k=1
e
w
t
i;k
+kwk
1
(4.6)
where the integral in (4.4) was replaced by a nite summation, indicates the strength
of the regularizer,D is the number of demonstrations,
i
is the cumulative feature count
vector for demonstration i, and
i;k
is the cumulative feature count vector for noisy
sample k around demonstration i.
The objective function is convex, but non-dierentiable at zero due to the regular-
ization term. Previous work [2] has used Generalized Iterative Scaling to optimize this
objective function without the added regularization term. It has been shown that for
maximum entropy modeling tasks, gradient-based methods outperform all variants of it-
erative scaling [60]. In particular, the L-BFGS limited memory quasi-Newton method
has shown superior performance on large scale problems [60]. Unfortunately, due to the
non-dierentiability of the L
1
term in the objective (4.6) at zero, vanilla L-BFGS cannot
be applied to this problem. We use the Orthant-Wise Limited-memory Quasi-Newton
(OWL-QN) [3] variant of L-BFGS which deals with the non-dierentiability by using ad-
ditional projection steps and constraining the search to one orthant at a time. We also
evaluated Bayesian model selection methods [15] to choose an optimal value for the regu-
larization parameter , but found that the resulting models were usually too sparse and
did not perform well. Cross-validation was used instead, as it produced good results in
all our experiments.
4.3 Learning Inverse Kinematics
We rst evaluate our approach on the problem of learning a cost function for inverse
kinematics (IK). Even though the problem considered here is a one-shot optimization
problem that does not involve costs over time, we nd that the principles developed in
Sec. 4.2 are equally applicable. Sec. 4.4 deals with learning cost functions for the full
48
trajectory optimization problem. We use the 7 degree of freedom Barrett WAM robot
shown in Fig. 4.1 in all our experiments.
Inverse kinematics refers to the problem of nding a joint angle conguration q2R
m
that results in a desired end eector conguration x
des
2 R
n
. If m > n, the system is
redundant and there is an innite number of solutions q2Q that satises x
des
. Some of
these solutions may be preferable to others for several reasons, such as obstacle avoidance,
reachability, manipulability, and consistency in style. These preferences may be asserted
by dening a cost function C(q) which is to be optimized in addition to satisfying the
task variables x
des
. In this experiment we learn the function C(q) from demonstrations
of preferred arm joint angle congurations.
4.3.1 Problem Denition
The inverse kinematics problem is dened as follows:
q
= argmin
q
C(q); s.t. f(q) = x
des
(4.7)
where f(q) is the non-linear forward kinematics function mapping joint angles q to the
end eector position x. A local optimum of this non-convex minimization problem can
be found by iterating the following update:
q = J
y
(x
des
f(q)) (I J
y
J)
@C(q)
@q
(4.8)
where J is the nm Jacobian matrix containing partial derivatives of f(q) w.r.t. q, J
y
refers to the Moore-Penrose inverse of J, and (I J
y
J) is an orthogonal projector in the
null-space of J. Note that this is a fairly standard way of solving the inverse kinematics
problem [86], except that here we attempt to learn the cost functionC(q) to be minimized
in the null space of the task.
4.3.2 Demonstrations
The goal is to nd good congurations for the robot arm when grasping objects on a
table in front of it. To this end, we kinesthetically demonstrated 18 congurations of the
robot grasping objects using top and side grasps. Note that the object to be grasped is of
no signicance here, since we are trying to learn an optimum conguration for the robot
arm given a desired end eector pose. In addition, we demonstrated 6 full trajectories
of the robot arm reaching from its home conguration to a grasping conguration. We
randomly chose 40 arm congurations from within these trajectories and added them to
the training set. This ensures that the learnt cost function is applicable not only to the
grasping conguration, but also to the trajectory reaching up to that point. Fig. 4.2(a) and
Fig. 4.2(b) show the demonstration process for kinematic congurations and trajectories
respectively.
49
4.3.3 Features
The cost functionC(q) is parameterized as a linear combination of the following features:
Joint limits For each joint, we created a feature that grows linearly from 0 when the
joint angle is 0.3 radians away from the joint limit, to 1 when the joint is at its limit.
Manipulability measure In order to keep the robot arm away from singular congu-
rations, we compute the manipulability measure [86] as: w(q) =
p
det(JJ
t
)
Occlusions When grasping or manipulating objects, it is useful to maintain visual
contact with the end eector and the object being grasped. We include two features that
measure the perpendicular distance of the position of the wrist and the elbow respectively,
to the line joining the head and the end eector.
Elbow position In order to model preferred elbow positions, we created sigmoidal
features centered around 10 x and z positions of the elbow each. (z points up and x
points to the right of the robot).
Joint-space gaussian basis functions Finally, in order to capture stylistic elements
not modeled by the above feature, we created 4-dimensional gaussians centered around
the rst four joint angles of each demonstration in the training set. The rst three joints
of the robot are at the shoulder, the fourth joint is the elbow joint, and the last three are
at the wrist. The aesthetic appearance of the arm is mainly dependent on the rst four
joints, and using only these joints allows these gaussian features to generalize better.
4.3.4 Sampling
In order to estimate the denominator of (4.4), we draw normally distributed samples
around the demonstration, and then iterate (4.8) without the null space term, until con-
vergence. This ensures that each sample drawn satises the constraints in (4.7), making
their costs comparable. Sampling from a Gaussian distribution with a very high covari-
ance would be equivalent to uniform sampling, since samples have to be clipped at the
joint limits of the robot. However, this does not guarantee that the distribution after
projection onto the constraint manifold remains uniform.
4.4 Learning Optimal Motion Planning
Traditional motion planning systems for manipulation consider the task of generating
collision-free paths. Sampling-based motion planning algorithms have proven successful
in this regard [8]. While sampling based planners are extremely fast at producing feasible
plans, they are often lacking in the quality of paths produced. These plans typically
50
(a) (b) (c) (d) (e)
Figure 4.2: (a) Kinesthetic demonstration of a preferred kinematic conguration for grasp-
ing the
ashlight. (b) A sampled set of inverse kinematics solutions (transparent), along
with the optimal conguration (solid). (c) Demonstrated kinematic conguration for
grasping the
ashlight (transparent), along with the optimal conguration after cost func-
tion learning (solid). (d) Demonstration of a full trajectory from the home conguration
to the grasp conguration. (e) Demonstrated end-eector trajectory from home to grasp
conguration (green), 20 samples around the demonstration (blue), and the optimized
trajectory based on the learnt objective function (red).
require smoothing before they can be executed on a robot. It is appealing to use a
trajectory optimization approach which can generate collision-free plans in addition to
optimizing other costs such as smoothness, torque minimization and other task-dependent
constraints. Recently introduced motion planning algorithms such as RRT* [46], CHOMP
[22, 77] and STOMP [41] attempt to minimize a cost function dened over trajectories,
and have successfully been applied to manipulation tasks. In this experiment, we learn
the cost function to be minimized, based on user-provided demonstrations of reaching
motions.
4.4.1 Problem Setup
Our experiment is based on the STOMP [41] motion planner, which is a variant of PI^2
with a specific policy formulation and control cost that allows for efficient exploration of
smooth trajectories. Planning is performed in the configuration space of the robot. The
policy parameters θ (for a single dimension) represent the entire discretized trajectory
from start to goal. The control cost J_q(θ) is defined as the squared acceleration along the
entire trajectory, using a second-order finite differencing matrix A_2:

J_q(\theta) = \left\| A_2 \, [\, x_s \;\; \theta^T \;\; x_g \,]^T \right\|^2 = \theta^T R \, \theta + b^T \theta,   (4.9)

where x_s and x_g are the fixed start and goal points respectively. The matrix A_2 multiplies
the trajectory to produce a vector of accelerations at each timestep (see [41] for details).
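For illustration, the sketch below evaluates this squared-acceleration control cost for a discretized one-dimensional trajectory with fixed endpoints; it is a minimal interpretation of Eq. (4.9), not the exact implementation in [41]. Because the accelerations are linear in θ, the cost is a quadratic function of θ, as in Eq. (4.9).

```python
import numpy as np

def control_cost(theta, x_s, x_g, dt):
    """Squared-acceleration control cost of a discretized 1-D trajectory with
    fixed endpoints x_s and x_g, cf. Eq. (4.9). theta holds the free waypoints."""
    full = np.concatenate(([x_s], np.asarray(theta, dtype=float), [x_g]))
    # Second-order finite differences give the acceleration at each interior point.
    acc = (full[:-2] - 2.0 * full[1:-1] + full[2:]) / dt**2
    return float(acc @ acc)
```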
4.4.2 Demonstrations
As in the previous experiment, we demonstrated 6 full trajectories of the robot reaching
from its home configuration to different top-grasp and side-grasp configurations
distributed around the workspace. The green trajectory in Fig. 4.2(e) shows a demonstrated
trajectory to a grasp configuration.
4.4.3 Features
The cost function J(·) is parameterized as a linear combination of the following features:

Collision Obstacles in the environment are represented as a 3-D boolean voxel map.
We compute the signed Euclidean Distance Transform (EDT) [98] of this voxel map. The
absolute value of each voxel of the EDT represents the distance to the closest obstacle
surface, while its sign represents whether the voxel is inside an obstacle or outside. From
this EDT, we create 6 sigmoidal features, centered around different values of the EDT.
This allows the learner to choose a preferred obstacle clearance distance from the data.
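One possible way to realize these clearance features is sketched below; the sigmoid centers and slope are illustrative placeholders rather than the values used in our experiments.

```python
import numpy as np

def collision_features(signed_distances,
                       centers=(0.0, 0.02, 0.05, 0.10, 0.15, 0.20),
                       steepness=100.0):
    """Map signed distances (m) to the closest obstacle surface, sampled at
    points along the arm, into a set of sigmoid features centered at different
    clearance values. Each feature is close to 1 when the arm is closer than
    its center distance and close to 0 when it is farther away."""
    d = np.asarray(signed_distances, dtype=float)[:, None]   # (num_points, 1)
    c = np.asarray(centers, dtype=float)[None, :]            # (1, num_features)
    sigmoids = 1.0 / (1.0 + np.exp(steepness * (d - c)))
    # Accumulate over the sampled body points to obtain one value per feature.
    return sigmoids.sum(axis=0)
```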
Cartesian Velocities and Accelerations At every time-step, we compute the instantaneous
velocity, acceleration, angular velocity, and angular acceleration of the end-effector
through finite differentiation. The squared magnitude of each component of these
velocities and accelerations is represented by a set of features.

Joint Velocities and Accelerations Similarly, features are created that represent the
squared velocity and acceleration of each joint.

Kinematics Cost Finally, the learned kinematics cost function C(q) from Sec. 4.3 is
introduced as a feature. Sharing of features across the two problems also admits the
possibility of learning both cost functions jointly, in a hierarchical manner similar to [54].
This is left for future work.
4.4.4 Sampling
PI^2 was derived based on the assumption that the covariance of the control noise
is proportional to R^{-1}. Samples from this distribution have a low control cost
θ^T R θ (cf. Eq. (4.9)) and allow exploration of the state space without significantly
impacting the trajectory control cost. These samples do not diverge from the start or
goal, which is important if we are considering the space of trajectories constrained at the
start and goal points. Samples drawn from this distribution around one of the demonstrated
trajectories are shown in Fig. 4.2(e). More details about this sampling procedure may be
found in [41].
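A minimal sketch of this exploration scheme, assuming the acceleration-based control cost above: perturbations are drawn with covariance proportional to R^{-1}, which keeps them smooth and pinned to zero at the start and goal.

```python
import numpy as np

def sample_smooth_perturbations(num_timesteps, num_samples, dt, scale=0.1):
    """Draw trajectory perturbations with covariance proportional to R^{-1},
    where R = A^T A and A is a second-order finite differencing matrix acting
    on the free (interior) waypoints; fixed endpoints contribute zeros."""
    n = num_timesteps
    A = np.zeros((n + 2, n))
    for i in range(n):
        A[i, i] += 1.0
        A[i + 1, i] += -2.0
        A[i + 2, i] += 1.0
    A /= dt**2
    R = A.T @ A
    cov = scale * np.linalg.inv(R)
    # Each row is one smooth perturbation that vanishes at the trajectory ends.
    return np.random.multivariate_normal(np.zeros(n), cov, size=num_samples)
```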
4.5 Results
Since we do not have access to the true cost functions of the human demonstrations, we
assigned a known set of weights w^* for both the above domains and solved the optimization
problems for each instance. These solutions (with their corresponding feature counts) were
treated as the input to the IRL algorithm. After learning the cost function weights w̃, we
then solve each instance of the optimization problem under this learned cost function,
producing a new set of feature counts. We then compare how well these solutions fare
under the original cost function, by evaluating the difference in cost between the two sets
of feature counts under the true weights w^*. Figure 4.3 (a) and (b) show the results for
the two domains, averaged over 20 and 10 runs respectively. The algorithm presented
in this work (labeled PI LOC: path integrals with local optimality) quickly learns cost
functions which produce low-cost trajectories, with relatively low variance. The algorithm
from [2] (labeled PI OPT: path integrals with optimal sampling) differs from ours in that
the samples it uses to approximate the denominator of (4.6) are drawn one at a time
from the optimal trajectory under the currently learnt cost function. This produces a
highly peaked distribution around the demonstration, and renders learning of the cost
function unstable. It does produce good sets of weights on occasion, but exhibits very
high variance, as observed. Finally, we evaluated RelEntIRL [11] (labeled RELENT), in
which each sample is assigned an importance weight in order to approximate a uniform
distribution. In the IK domain, this algorithm is equivalent to ours, since our sampling
distribution is essentially uniform. However, in the trajectory optimization domain, it
shows poor performance due to the high dimensionality of the space. Each trajectory is
comprised of 100 time-steps and 7 joints, thus samples are drawn from a 700-dimensional
normal distribution. The resulting importance weights for the drawn samples tend to
be several orders of magnitude apart from each other, causing the few highest-weighted
samples to dominate over the rest. This results in the high variance seen in Figure 4.3(b).
Next, we learnt cost functions based on the human demonstrations. In the case of
the inverse kinematics problem, we find that 29 of the 94 features are selected by the
algorithm. The manipulability measure and occlusion features are completely ignored
because they were not needed to account for the demonstrations. If such costs must be
minimized for a particular application, their weights may be set to fixed values, and the
rest of the cost function can be relearned. Alternately, some features can potentially be
left out of the L_1 regularization term if they are not required to be sparse. Two of the
joint limit features were selected, for the wrist rotation and wrist flexion-extension joints.
22 out of 58 gaussian basis function features were used, and 5 elbow sigmoid features were
selected. The learnt cost function was qualitatively tested in a set of top-grasp and
side-grasp tasks around the workspace, and all kinematic configurations obtained looked
similar to the demonstrations, i.e. no unexpected configurations were observed. Fig. 4.2(b)
and (c) show a set of samples, the demonstration, and the learnt optimal configuration for
a particular grasp.
When learning the cost function for trajectory optimization from human demonstrations,
we find that just 4 out of 36 features are selected. First, the collision feature that
represents a clearance distance of 10 cm is chosen. Next, angular accelerations of the
end-effector in the x and z axes are used in the cost. Finally, the feature representing
the kinematic cost C(q) is assigned a small positive weight. All weights corresponding to
joint velocities and accelerations were set to 0. It is possible that since all joint velocities
and accelerations contribute to the velocity and acceleration of the end-effector, it is more
parsimonious to assign cost weights at the end-effector rather than at the joint level.
Another possibility is that the human demonstrations are somewhat jerky, and this affects
the learning of velocity and acceleration costs. The green trajectory in Fig. 4.2(e) shows
one instance of the human demonstration, while the red trajectory shows the reproduction
obtained by optimizing the learned cost function. Trajectories optimized using the learnt
cost function were smooth and qualitatively satisfying. In both domains, the proposed
method has successfully learnt and modeled the user's intentions from just a few
demonstrations.

This learning module has been applied as part of a larger autonomous robotic
manipulation system. A video of the grasping and manipulation tasks may be found at
http://youtu.be/VgKoX3RuvB0. Preference learning on such a system is an ongoing
process, in which new demonstrations are added to the dataset whenever suboptimal
behavior is observed.
4.6 Conclusion
We have presented our approach to learning objective functions using path integral inverse
reinforcement learning. Our algorithm is based on sampling a local region around the
optimal trajectory, and hence is suitable for sampling and learning in high-dimensional
spaces where demonstrated trajectories are assumed to be locally but not necessarily
globally optimal. We have successfully applied our algorithm to two key problems in
robotic manipulation, and have demonstrated faster learning with lower variance than
previous algorithms.
[Figure 4.3: two charts plotting cost difference against the number of samples (log scale), for panels (a) and (b) described in the caption below.]

Figure 4.3: Our algorithm (PI LOC) was compared with algorithms PI OPT and RELENT
(see Sec. 4.5 for details), on two domains: (a) inverse kinematics, and (b) optimal
motion planning. The charts show the difference in cost between the solutions obtained
under the learned and the true cost functions, evaluated using the true weights w^*, with
error bars depicting ±1 standard deviation. PI LOC shows fast learning with lower variance
than the other algorithms on both test domains.
Chapter 5
Learning Rough Terrain Locomotion Using
Terrain Templates
5.1 Introduction
Traversing rough terrain with carefully controlled foot placement and the ability to clear
major obstacles is what makes legged locomotion such an appealing, and, at least in
biology, a highly successful concept. Surprisingly, when reviewing the extensive history of
research on robotic legged locomotion, relatively few projects can be found that actually
address walking over rough terrain. Most legged robots walk only over flat or at best
slightly uneven terrain, a domain where wheeled systems are usually superior.
In this chapter, we address the problem of quasi-static quadruped locomotion over
extremely rough terrain, where terrain irregularities are on the order of the leg length. We
present a complete control architecture for autonomous locomotion over such terrain. The
difficulty of the problem we consider is comparable to that of human rock-climbing, where
foot/hand-hold selection is one of the most critical aspects. Central to the performance of
our controller is a module that learns a suitable objective function for foothold selection
from human demonstration. Foothold selection is a problem that requires the robot to
trade off between a number of different objectives. Manually coding these objectives and
tuning their parameters is non-trivial. We introduce the concept of terrain templates as
a flexible foothold feature representation, and propose an algorithm that learns a set of
templates and a corresponding foothold ranking function from expert demonstrations.
The work presented here pushes the performance of quadruped locomotion over rough
terrain well above that of previously published results. This end result is not only due
to the foothold selection module, but also to the introduction of several other novel
sub-systems, and the tight integration of them all. After planning a sequence of footholds
using the learnt reward function, a stable body trajectory is planned. This is achieved by
optimizing these trajectories for smoothness subject to quasi-static stability constraints,
which allows for very fast, stable locomotion over rough terrain. Trajectory execution is
achieved using a novel floating-base inverse dynamics control law. This allows for reduced
control gains, which, when coupled with force control at the feet, enables compliant
locomotion that has the ability to deal with unperceived obstacles. Evaluations of our
controller are carried out on the LittleDog robot (Figure 5.1), over terrain of various
difficulty levels. Additional testing of our software is carried out by an independent
external test team, on terrain that we have never seen, to validate the generalization
abilities of the controller.

Figure 5.1: The LittleDog quadruped robot on rocky terrain
The rest of this chapter is laid out as follows. In Section 5.2, we describe the Learning
Locomotion program, experimental setup, test procedures and evaluation metrics. In
Section 5.3, we discuss the structure of our control architecture, and briefly describe
every component of it. Section 5.4 describes in detail our method for learning foothold
selection using terrain templates. In Section 5.5, we present evaluations of our controller
on the LittleDog quadruped robot, on various kinds of terrain. We present a discussion of
certain points of interest in Section 5.6. Finally, we conclude and discuss ideas for future
work in Section 5.7.
5.2 Background
In the following subsections, we provide some background information and context to the
technical work conducted. We discuss the DARPA Learning Locomotion program, and
describe the standardized experimental setup, test procedures and metrics.
5.2.1 Learning Locomotion
The DARPA Learning Locomotion program was set up to push the performance of robotic
legged locomotion, both in terms of speed and robustness, over very rough terrain. Six
teams were provided identical hardware and test setups, which allows for meaningful
comparisons to be made between the different control strategies used by the teams. Software
developed by the teams was subject to monthly testing by an independent test team,
on terrain that had never been shown to the teams. The program was split into three
phases, with successively increasing metrics for minimum speed and obstacle heights.
These metrics and test procedures are detailed in Sec. 5.2.3.
5.2.2 Experimental Setup
Our experimental setup consists of the LittleDog quadruped robot (Figure 5.1),
manufactured by Boston Dynamics, with ball-like feet that are close to point feet. It is about
0.3 m long, 0.18 m wide, and 0.26 m tall, and weighs approximately 2.5 kg. Each leg has 3
degrees of freedom, actuated by geared electric motors. The robot has a usable leg length
of roughly 13 cm, measured as the perpendicular distance from the bottom of the body to
the tip of the foot with the leg in its maximally stretched configuration. LittleDog has a
3-axis force sensor on each foot, position sensors to measure joint angles, and an on-board
inertial measurement unit (IMU). An external motion capture system (VICON) provides
information about the absolute world position and orientation of the robot at a rate of
100 Hz.
Figure 5.2: LittleDog crossing a variety of terrains
LittleDog's 12 degrees of freedom are controlled by an onboard computer using PD
control at 400 Hz. The rest of our control software runs on an external 4-core Intel
Xeon CPU running at 3 GHz, which sends control commands to LittleDog over a wireless
connection at 100 Hz.
In order to easily test the performance of our controller on different kinds of terrain,
we use a set of interchangeable terrain modules of size 61 × 61 cm. The terrain modules
include flat modules, steps of various step heights, different sizes of barriers, slopes, logs,
and rocky terrain of varying difficulty levels. Figure 5.2 shows a sampling of the terrains
that are available to us for testing.

Each terrain module is scanned beforehand by a laser scanning system, to produce
a high (1 mm) resolution 3D model. A unique arrangement of reflective markers is fixed
on each terrain module, enabling the motion capture system to track the position and
orientation of each module independently, thus providing complete information about the
terrain in the world. This greatly simplifies the perceptual component of rough terrain
locomotion, allowing us to focus on the planning and control problems. The terrain
modules are not moved during each trial, i.e. one run of the robot from the start to the
goal location.
5.2.3 Testing and Evaluation Metrics
In the final tests of Phase 3 of the Learning Locomotion program, teams were required to
submit LittleDog control code to traverse seven different classes of terrains. These terrains
had never been seen by the teams before; only the terrain classes were known. Each terrain
had to be crossed at a minimum speed of 7.2 cm/s (0.24 body lengths/s) and contained
obstacles of up to 10.8 cm (83% of the usable leg length) in height. Three trials were
conducted over each terrain, of which the best two were considered for evaluation. This
test procedure and metric encourages the teams to create controllers that can generalize
to novel situations, behave robustly, and achieve a high locomotion speed.
5.3 Control Architecture
Finding an optimal sequence of motor commands for 12 degrees of freedom that take
the robot from the start to the goal state over rough terrain is considered an intractable
problem. Fortunately, such problems tend to have a natural hierarchical decomposition,
which we exploit in order to obtain tractable solutions within practical time limits. We
perform a limited amount of pre-processing on the terrain before each trial, namely,
calculation of terrain rewards, and planning of a rough body path through the terrain.
All the remaining planning and control is executed online while the robot is walking. This
online approach makes our system applicable to domains where terrain information is not
available beforehand. It also allows for quick re-planning in case of slippage or imprecise
execution, which is inevitable during fast locomotion on rough terrain. An overview of our
control architecture is presented in Figure 5.3.
In the following subsections, we describe in detail each sub-system of our approach to
quadruped locomotion over rough terrain:
5.3.1 Terrain Reward Map Generation
In rough terrain, foothold selection becomes one of the most crucial aspects of robotic
legged locomotion. Optimal footholds must be chosen such that the danger of slipping and
unplanned collisions is minimized, while stability and forward progress are maximized.
However, engineering such a foothold selection function, that scores available footholds
based on the terrain features and the state of the robot, turns out to be very difficult.
Moreover, finding the set of parameters that perform well on many different kinds of
terrain is intractable, due to the number of trade-offs that need to be considered [52].
Instead, we propose an approach that learns a set of terrain features and a foothold
ranking function from good example foothold choices demonstrated by an expert. We
provide only a high-level overview of this approach here; full details can be found in
Section 5.4. The algorithm extracts discretized height maps of small patches of terrain
at different scales, so-called terrain templates [40]. Small terrain templates capture the
quality of the actual foothold to avoid slipping, while large terrain templates capture
information about collision clearance from terrain and distance to drop-offs. A large set
of templates is sampled from a variety of terrain modules. A suitable distance metric
is used to measure the similarity of a candidate foothold with each template, forming a
feature vector that is used to rank footholds.

The template learning algorithm learns a foothold ranking function by converting the
ranking problem into one that can be solved by an off-the-shelf classifier. Feature selection
is simultaneously achieved by choosing a sparse, L_1-regularized logistic regression
classifier, which learns a classification function that uses a small subset of the available
features. The size of the template library is effectively reduced, enabling the footstep
planner described in Section 5.3.3 to evaluate the foothold ranking function in real-time.
[Figure 5.3 is a block diagram of the control architecture. Offline pre-processing (once per trial): expert foothold demonstrations are used to learn a foothold ranking function via template learning, terrain foothold rewards are pre-calculated using the learnt function, and a rough body path is planned that guides the robot through regions with good footholds. Online (once per footstep): the footstep planner chooses the next four footholds that maximize rewards, the pose finder optimizes the robot pose to maximize kinematic reachability and avoid collisions, the body trajectory generator produces a stable yet smooth body trajectory for the next four footsteps, the swing leg trajectory planner plans a collision-free swing leg trajectory using whole-body motions, and inverse kinematics computes the desired stance leg joint trajectories. Real-time (100 Hz): the plan is executed with PD control, inverse dynamics and force control, with accuracy and robustness to perturbation.]

Figure 5.3: An overview of our control architecture for quadruped locomotion over rough
terrain.
Most importantly, the use of terrain templates does away with the need for careful
engineering of terrain features, since the appropriate features in the form of templates are
automatically selected by the learning algorithm.

Training, collection of expert demonstrations, and learning of the foothold ranking
function are performed off-line, while generation of the terrain reward map is done in the
pre-processing phase, just before each trial. However, since our foothold ranking function
also takes the current state of the robot into account to calculate collision scores and other
pose-dependent features, the final foothold rewards are computed online. More information
about the learning process, additional terrain and pose features, and evaluations of
the generalization performance are presented in Section 5.4.
5.3.2 Approximate Body Path Planner
After generating the terrain reward map, we plan an approximate path for the robot body
through the terrain, from the start to the goal. This path is intended to guide the robot
through regions of the terrain that contain a better choice of footholds.

We discretize the entire terrain into 1 cm square grid cells, and assign a reward to each
cell. The cell rewards are obtained by assuming that the center of the robot is in the cell
and the four feet are in their default stance positions. For each leg, the best n (we use
n = 5) foothold rewards in a small search radius around the default position are summed
up to form the final reward for the cell. Using the footholds with the highest rewards
around each leg reflects the intentions of the footstep planner, which is to maximize the
same foothold rewards.

The final path is obtained by generating a policy in this grid using Dijkstra's algorithm
[21], which is equivalent to solving for the value function at every state using
dynamic programming [88]. This provides the optimal path to the goal from every cell
in the grid. This approximate body path is then used to direct the lower-level footstep
planner.
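The sketch below illustrates this pre-processing step under simplifying assumptions (a single foothold-reward grid, 4-connected cells, and a traversal cost obtained by negating the cell rewards): cell rewards sum the best n foothold rewards around each default foot position, and Dijkstra's algorithm then yields a cost-to-go to the goal from every cell.

```python
import heapq
import numpy as np

def cell_rewards(foothold_reward, foot_offsets, search_radius=3, n_best=5):
    """Reward of placing the body center in each cell: for every leg's default
    stance offset, sum the n_best foothold rewards found within a small search
    radius around that offset."""
    H, W = foothold_reward.shape
    rewards = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            total = 0.0
            for dr, dc in foot_offsets:
                rr, cc = r + dr, c + dc
                r0, r1 = max(0, rr - search_radius), min(H, rr + search_radius + 1)
                c0, c1 = max(0, cc - search_radius), min(W, cc + search_radius + 1)
                patch = foothold_reward[r0:r1, c0:c1].ravel()
                total += np.sort(patch)[-n_best:].sum()
            rewards[r, c] = total
    return rewards

def cost_to_go(rewards, goal):
    """Dijkstra from the goal cell over a 4-connected grid; the cost of entering
    a cell is (max reward - reward), so high-reward cells are cheap to traverse."""
    H, W = rewards.shape
    cost = rewards.max() - rewards
    dist = np.full((H, W), np.inf)
    dist[goal] = 0.0
    heap = [(0.0, goal)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < W:
                nd = d + cost[rr, cc]
                if nd < dist[rr, cc]:
                    dist[rr, cc] = nd
                    heapq.heappush(heap, (nd, (rr, cc)))
    return dist
```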
Generation of terrain rewards and planning of the approximate body path are the
only two pre-processing steps that are required. All the other modules described below
are run online while the robot is walking, i.e. planning for the next footstep occurs while
the current footstep is being executed. This allows for quick recovery and replanning in
case of imprecise planning or execution, and results in a very robust locomotion system.*

* The robustness afforded by continuous footstep replanning is difficult to quantify, but it is immediately
clear to the expert observer that generating a complete footstep plan in advance and attempting to track
this plan will fail drastically in case of slippage or other execution errors.
5.3.3 Footstep Planner
Given the cached terrain reward map and the current state of the robot, the footstep
planner calculates optimal foothold choices for the next four steps. Optimality is based on
the foothold reward function which was learnt from expert demonstration, as described in
Section 5.3.1. These rewards are biased to guide the robot towards the approximate body
path plan. Search complexity is reduced by following a fixed sequence of leg movements
(left hind, left front, right hind, right front), that has been shown to be the optimum
sequence that maximizes static stability [61].
Our planner computes the next four optimal foothold choices by using a greedy search
procedure at each step. For every candidate foothold choice, we optimize the pose of
the robot body using an approximate pose finder, described further in Section 5.3.4. The
final reward for each candidate foothold is a combination of the cached terrain reward and
additional features based on the robot pose, such as the in-radius of future support triangles
and collision cost features.

The greedy search technique turns out to be inadequate in certain scenarios that are
kinematically challenging, such as crossing a very large gap. Choosing a footstep in a
greedy fashion can put the robot in a configuration from which it cannot proceed, or
has to make a bad foothold choice to do so. When faced with such terrain, we use the
Anytime Repairing A* (ARA*) algorithm [59] to find a plan that maximizes the sum of
foothold rewards from the start to the goal. ARA* is an anytime variant of the heuristic
search algorithm A*, that starts with a sub-optimal plan and then iteratively improves
the plan until terminated. The anytime property of ARA* allows us to terminate it after
a fixed amount of time, and use the current best plan found by the algorithm. We run the
ARA* planner for 30 seconds in the pre-processing phase, to generate a complete footstep
plan from the start to the goal. We use this planner only to traverse terrain that the
greedy planner is known to perform poorly on, currently only large gaps that are as long
as the legs of the robot (e.g., Figure 5.4). For all other classes of terrain, the advantage
of running ARA* before execution was found to be small, hence we prefer to minimize
pre-processing time by using the online, greedy planner. The terrain class is currently
specified by the user per terrain module; automatic terrain class detection is outside the
scope of this work.
5.3.4 Pose Finder
Given the 3-D locations of the stance feet, the pose finder optimizes the 6-D pose of the
robot body in order to maximize kinematic reachability and to avoid configurations that
collide with the terrain. The x and y positions of the body are fixed based on stability
criteria. This leaves the z position of the body and its roll, pitch and yaw angles as
the search variables for body pose optimization. The kinematics of the LittleDog robot
are such that the 3-D positions of all four feet and the 6-D pose of the robot body are
sufficient to fully specify the 12 joint angles; i.e., there are no extra redundancies that
need to be optimized.

We use two different versions of the pose finder. In the first, we use the cyclic coordinate
descent search technique [10] over a very coarse grid of the free variables (body roll,
pitch, yaw, and z height). This pose finder returns an approximate solution very fast,
and is used to optimize the pose of every candidate foothold evaluated by the foothold
planner.
Figure 5.4: The pose finder optimizes the pose of the robot body in order to maximize
kinematic reachability while avoiding collisions with the terrain.
The second version is a more exact, gradient-based optimizer. We use a floating-base
iterative inverse kinematics algorithm [65] to satisfy the locations of the feet, while
optimizing the secondary objectives of collision avoidance and maintenance of a default
posture in the null space of the first. This fine-grained optimization provides a locally
optimal solution at the expense of higher computation time when compared to the discrete
search. This optimization is only performed on the sequence of footholds selected by the
footstep planner.
5.3.5 Body Trajectory Generator
The body trajectory generator uses the next four planned footholds and creates a body
trajectory. The generated trajectory is smooth, stable with respect to the Zero-Moment
Point (a dynamic stability criterion), and works even over highly irregular foot placement
patterns, which is almost always the case in rough terrain. In contrast to traditional center
of gravity (COG) based trajectory planners, our body trajectory generator eliminates the
four-leg support phase, which is the duration of time spent moving the body with all
four feet on the ground. This ensures that one leg is swinging at all times, resulting in
increased locomotion speed, without sacrificing stability.

We formulate the body trajectory generation problem as an optimization problem.
The COG trajectory is represented as a series of quintic spline segments, and the cost
function is to minimize squared accelerations along the trajectory. This function is quadratic
in the spline coefficients. We use the Zero-Moment Point (ZMP) [96] as our stability
criterion. To attain stability, the ZMP must lie within the polygon formed by the supporting
feet at all times. At each time step, these constraints can be represented as linear functions
of the spline coefficients. The resulting optimization is a convex quadratic program
(QP) that can be solved efficiently by off-the-shelf solvers. This optimization provides a
smooth and stable body trajectory for locomotion over rough terrain.
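To make the structure of this QP explicit, the sketch below builds the objective and constraint matrices for a single spline segment and one horizontal axis, using the linear cart-table ZMP model; the helper is a simplified stand-in for our actual formulation, and the resulting (H, G, h) can be passed to any off-the-shelf QP solver.

```python
import numpy as np

def build_zmp_qp(times, z_height, zmp_min, zmp_max, g=9.81, order=5):
    """Build QP matrices for one quintic spline segment and one horizontal axis
    of the COG trajectory. Decision variables are the spline coefficients c
    (lowest order first). Objective: minimize 0.5 c^T H c, a sampled version of
    the integrated squared acceleration. Constraints: G c <= h enforces
    zmp_min[t] <= zmp(t) <= zmp_max[t] at each sampled time, using the linear
    cart-table model zmp(t) = p(t) - (z_height / g) * a(t)."""
    n = order + 1
    H = np.zeros((n, n))
    G_rows, h_vals = [], []
    for idx, t in enumerate(times):
        pos = np.array([t**k for k in range(n)])
        acc = np.array([k * (k - 1) * t**(k - 2) if k >= 2 else 0.0
                        for k in range(n)])
        H += np.outer(acc, acc)                  # squared-acceleration objective
        zmp = pos - (z_height / g) * acc         # ZMP is linear in coefficients
        G_rows.append(zmp)                       # zmp(t) <= zmp_max[t]
        h_vals.append(zmp_max[idx])
        G_rows.append(-zmp)                      # zmp(t) >= zmp_min[t]
        h_vals.append(-zmp_min[idx])
    return H, np.array(G_rows), np.array(h_vals)
```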
5.3.6 Foot Trajectory Planner
Given a stable body trajectory (provided by the body trajectory generator) and the
desired footstep locations (provided by the footstep planner), the foot trajectory planner
generates a trajectory for the swing leg that avoids collisions with the terrain. An initial
trajectory is generated that guides the end-effector over the terrain from its initial position
to its goal. This trajectory is subsequently optimized to eliminate potential shin or knee
collisions.

The initial trajectory is generated by moving the foot over the terrain in the vertical
plane containing the start and the goal. Points on the convex hull of the terrain along
the plane from the start to the goal are raised by a clearance height. A trajectory is then
generated through these points using piece-wise quintic splines optimized for minimum
acceleration. This trajectory is sampled at 10 ms intervals and converted into joint angles
using the standard analytical inverse kinematics solution for a 3-DOF arm. Kinematic
infeasibilities are simply ignored, since they are taken care of in the subsequent optimization
phase.

The initial trajectory is only guaranteed to avoid collisions of the end-effector with the
terrain; however, it may still contain knee or shin collisions. Moreover, if parts of the initial
trajectory were kinematically infeasible, these configurations may contain collisions of
the end-effector with the terrain. We run a trajectory optimizer called CHOMP (Covariant
Hamiltonian Optimization for Motion Planning) [77] on the trajectory to eliminate
any collisions that remain. CHOMP also serves to smooth the trajectory in joint space
according to a quadratic cost function; we choose a minimum squared jerk cost function
to provide continuous accelerations for inverse dynamics control. CHOMP uses covariant
gradient descent in order to push the trajectory away from obstacles while still maintaining
its smoothness. The cost function for collisions (and its gradients) is defined based
on a signed distance field (SDF), which is a 3-D grid of distances to the closest obstacle
boundary. The sign of an SDF voxel value is positive if the voxel is outside obstacles, and
negative otherwise. The SDF is generated before the start of the trial as a pre-processing
step. Joint limit violations are handled using a covariant update that eliminates them
without impacting smoothness. We run CHOMP on a virtual 8-DOF system: the 3 joints
of the leg, the body z height, and 4 variables that together represent the body quaternion.
This enables the use of whole-body motions such as pitching in order to avoid swing leg
collisions.
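A stripped-down version of such a covariant update is sketched below: the Euclidean gradient of the obstacle cost is pre-multiplied by the inverse of a smoothness metric R = A^T A, which spreads each local correction smoothly over the whole trajectory. The obstacle-cost gradient is left as a user-supplied callable, and the joint-limit and quaternion handling of the full system are omitted.

```python
import numpy as np

def smoothness_metric(n, dt):
    """R = A^T A for a second-order finite differencing matrix with fixed endpoints."""
    A = np.zeros((n + 2, n))
    for i in range(n):
        A[i, i] += 1.0
        A[i + 1, i] += -2.0
        A[i + 2, i] += 1.0
    A /= dt**2
    return A.T @ A

def covariant_update(xi, obstacle_cost_gradient, dt, step_size=0.01, iterations=100):
    """Covariant gradient descent on a trajectory xi of shape (n, dof):
    xi <- xi - step_size * R^{-1} (grad_obstacle + R xi)."""
    n = xi.shape[0]
    R = smoothness_metric(n, dt)
    R_inv = np.linalg.inv(R)
    for _ in range(iterations):
        grad = obstacle_cost_gradient(xi) + R @ xi
        xi = xi - step_size * (R_inv @ grad)
    return xi
```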
In our system, footstep planning and trajectory generation are performed while the
robot is executing a swing motion with one of its legs. Hence, it is critical that the entire
trajectory for the next swing leg be generated before the current swing motion ends.
Our implementation of CHOMP is most often able to produce collision-free, smooth
trajectories in less than 100 iterations, which typically takes no more than 50 ms. The
lower limit on swing leg duration in our system is 300 ms, allowing us to run the optimizer
while the robot is walking, one step in advance, as opposed to pre-generating trajectories
all the way to the goal before execution as in [77]. On the rare occasion that 100 iterations
of CHOMP do not produce a collision-free trajectory, we optimistically execute the final
trajectory as-is, and leave it to the execution module (Section 5.3.9) to maintain stability,
and eventually recover and replan.
5.3.7 Dynamic Motions for Extreme Terrain
Terrains that contain steps and barriers of height up to 12 cm and gaps of width up to 17 cm
push the LittleDog robot beyond its limits of kinematic feasibility. Since the maximum
ground clearance the robot can achieve with its legs fully stretched is roughly 13 cm, it is
not possible to safely walk over such obstacles. Therefore a special series of movements,
in particular a lunging movement and a sliding movement, was designed to enable the
robot to cross such obstacles. Similar dynamic motions have been demonstrated before
on the LittleDog robot [14].
The dynamic lunging motion involves shifting the weight of the robot onto its hind
feet, pitching up its nose, and lifting its feet onto the step or barrier. In the case of lunging
across a gap, the robot executes a bounding-like movement before the lunge to ensure that
the COG reaches the other side of the gap (using its momentum). The sliding movement
involves shifting the robot's weight in front of its front legs and using the robot's nose as a
fifth stance to lift its rear, which allows for collision-free hind leg movements. To account
for different COG positions and different friction properties across different LittleDog
robots, a calibration procedure has been developed that performs a series of lunging
movements to estimate the optimal parameter set for that particular robot.
5.3.8 Sensor Preprocessing
The LittleDog robot has two sources of orientation measurement, the VICON motion
capture system and the on-board inertial measurement unit (IMU). Both provide a
quaternion q of the robot's orientation: q_MOCAP from the motion capture system, and q_IMU
from the IMU. q_MOCAP contains outliers and noise, but provides an absolute orientation
measurement, without drift. q_IMU, on the other hand, is obtained through integration,
which provides a clean signal, but drifts slowly over time due to integration errors. The
performance of control algorithms can be improved if the two sources of orientation can be
combined to compensate for the shortcomings of one another. We developed a Bayesian
outlier detection and removal algorithm, which estimates the offset between q_IMU and
q_MOCAP, which is a slowly changing signal infested with noise and outliers. This Bayesian
regression algorithm operates incrementally, in real-time, providing us with cleaner
orientation and angular velocity estimates than either source provides on its own. Details
and results from this algorithm can be found in a previous publication [94].
5.3.9 Execution and Control
The 6-D pose trajectory of the body is converted into joint angle trajectories for the 3
stance legs using a closed-form inverse kinematics solution for each 3-DOF leg individually.
These trajectories are then combined with the swing leg joint trajectories to form the final
sequence of joint angles to be executed on the robot.

In order to guarantee stability of the robot and accurate foot placement, it is essential
that the planned joint trajectories be executed accurately. At the same time, the robot
has to be able to overcome small perturbations and slips, and robustly handle unperceived
terrain. To achieve this level of robustness, we use a combination of five controllers, as
described below. A graphical overview of these controllers is shown in Figure 5.5.

1. Joint PD control (400 Hz): The desired joint angle positions and velocities are sent
over a wireless link to LittleDog's on-board computer that servos these commands
using PD control at 400 Hz.
[Figure 5.5 is a block diagram showing the planner, inverse dynamics, foot error I control with inverse kinematics, and body error I control with inverse kinematics on the offboard computer, feeding the force P control and joint PD control that run onboard the LittleDog robot at 400 Hz.]

Figure 5.5: Flow diagram of our low-level control software. The joint PD and force P
controllers run on the robot at 400 Hz, while the remaining controllers operate on the
offboard computer and send commands to the robot at 100 Hz.

2. Inverse dynamics (100 Hz): We use a novel, analytical floating-base inverse dynamics
algorithm [63] that does not require knowledge of the contact forces at the feet.
The joint torques thus computed are sent to the LittleDog at 100 Hz, which it then
applies as feed-forward torques, in addition to the torques from the PD controllers.
Using inverse dynamics provides better tracking of trajectories, and allows us to
use lower PD gains than otherwise possible, making the robot more compliant to
perturbations [12].

3. Force P control (400 Hz): The above inverse dynamics algorithm provides us with
the expected contact forces at all four feet. This allows us to control the forces
at each foot using the force sensor data as feedback. The use of force control in
conjunction with inverse dynamics allows us to negotiate unperceived obstacles up
to 4 cm in height.

4. Body Error Integral Control (100 Hz): We monitor the 6-D robot pose as registered
by the motion capture system, and use an integral (I) controller to track the desired
pose. The controller performs floating-base inverse kinematics to servo body position
and orientation in the null space of foot location constraints. This modifies
the desired joint angles that are sent to the PD and inverse dynamics controllers.
Tracking the body pose keeps the robot stable in spite of small slips at the stance
feet.

5. Foot Error Integral Control (100 Hz): We use a similar integral controller on the
swing foot position. This controller significantly improves foot placement accuracy,
which is critical for walking on rough terrain.
5.4 Learning a Ranking Function for Foothold Selection using Terrain Templates
In this section, we describe our approach to learning an optimal foothold selection policy
from expert demonstration. We develop this work in the context of quadruped locomotion
with the LittleDog robot, but this approach in principle scales to any other legged
locomotion system. We consider static or quasi-static locomotion, in which either the
Center of Gravity (COG) or the Zero Moment Point (ZMP) [96], respectively, is always
maintained within the support polygon. The task is to find a suitable foothold for the
swing leg, given the positions of the stance legs. The chosen foothold must:

- Minimize slipping on terrain
- Not result in collisions of the robot with the environment
- Maximize forward progress towards the goal
- Be within the kinematic range of the robot
- Serve to maximize the area of future support polygons for maximum stability
Most often, these goals are conflicting, and trade-offs must be made between them.
Specifying these trade-offs by hand is non-trivial, since it typically involves manual tuning
of a large number of parameters. From our experience, it is nearly impossible to achieve
a system that generalizes well to all kinds of terrain purely by manual parameter tuning,
as the search space is prohibitively large.

We seek to learn the optimal trade-offs for foothold selection, using footholds that are
demonstrated by an expert. To this end, we formally define the foothold selection problem
below, describe the process of collecting expert demonstrations, and then introduce an
algorithm that learns how to select good footholds.
5.4.1 Problem Formulation
The state of the robot is defined as the 3D position of all feet, the 3D body position and
orientation, and the knowledge of which leg is the swing leg. We define the set of n footholds
reachable from this state by F = {f_1, ..., f_n}. Each of these footholds f_i is described by
a feature vector x_i ∈ R^d. We define this feature vector x_i as being composed of two
groups: terrain features, which encode information about the terrain at the foothold, and
pose features that quantify all other relevant information like progress towards the goal,
stability margins, and collision margins.

We define a reward function R over footholds as a weighted linear combination of the
features:

R(f_i) = w^T x_i,   (5.1)

where w ∈ R^d is the weight vector. This reward function can then be used to rank all the
footholds in F, and the one with the highest reward selected as the target foothold.† The
task of learning the trade-offs for foothold selection has thus been reduced to learning the
weight vector w.
5.4.2 Collecting Expert Demonstrations
An expert is shown logs of runs executed by the robot on different kinds of terrain. S/he
inspects each run for sub-optimal foot placement choices made by the robot, and in each
of those situations, labels the correct foothold f_c that s/he thinks is the best greedy choice
from among the entire set of reachable footholds‡ F. Figure 5.7(a) shows a screenshot of
our teaching interface, where the red ball depicts the sub-optimal foot placement choice
made by the robot, and the white ball below shows the optimal foot placement chosen by
the expert.
† Maximizing the immediate reward for the current foothold is an inherently greedy approach. At
the expense of higher computation time, one can achieve better performance in practice by choosing the
foothold that maximizes a multi-step criterion, i.e., the value function, which is the sum of expected future
rewards obtained by choosing a foothold, using standard heuristic graph search algorithms like A*.
‡ Foothold reachability is evaluated by running the pose finder described in Section 5.3.4, on a grid of
footholds around the current leg position.
5.4.3 Learning From Expert Demonstrations
When the expert labels the foothold f_c from the set F as the optimal foothold, s/he is
implicitly providing information that the reward for the chosen foothold f_c is better than
the reward for all others:

w^T x_c > w^T x_i   ∀ i ∈ F, i ≠ c.   (5.2)

This corresponds to the widely studied rank learning or preference learning problem in the
machine learning literature [17]. This problem can be converted into a binary classification
problem [36]. Rearranging Eq. (5.2), we get:

w^T (x_c − x_i) > 0   ∀ i ∈ F, i ≠ c.   (5.3)

The optimum value of w could be found by defining a suitable loss function that penalizes
values of w^T (x_c − x_i) that are lower than zero. However, it is observed that such a loss
function would be identical to that of a linear binary classifier, in which the inputs are the
pairwise difference feature vectors (x_c − x_i), all mapped to the target class +1. Hence, the
weights w can be obtained by running a linear classifier§ like Support Vector Machines
(SVM) or logistic regression (LR) on the pairwise difference feature vectors (x_c − x_i). The
resulting weight vector w is used to evaluate the reward R in Eq. (5.1) on all candidate
footholds f_i ∈ F. The ranking imposed upon F by the reward R will satisfy all the
training data in Eq. (5.2), if and only if the expert's ranking function is truly a linear
function of the features. We show in the following section that this is usually not the
case, especially with conventional feature representations of footholds.
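For concreteness, the conversion of a single expert label into classifier training data might look like the following sketch (variable names are hypothetical): each demonstration contributes one pairwise difference vector per non-chosen foothold, all assigned to the positive class.

```python
import numpy as np

def pairwise_training_data(feature_vectors, chosen_index):
    """Convert one expert demonstration into binary-classification data.
    feature_vectors: array of shape (num_footholds, d) for the reachable set F.
    chosen_index: index c of the expert-selected foothold.
    Returns (X, y), where each row of X is (x_c - x_i) with target label +1."""
    x_c = feature_vectors[chosen_index]
    diffs = [x_c - x_i for i, x_i in enumerate(feature_vectors) if i != chosen_index]
    X = np.stack(diffs)
    y = np.ones(len(diffs))
    return X, y
```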
5.4.4 Terrain Templates
Some of the terrain features previously used for the foothold selection problem are slope,
curvature and standard deviation of the terrain on various spatial scales [52, 53]. However,
since the rewards are defined to be linear in the features, such a feature set can only
represent more complex decisions by careful engineering of nonlinear features. For example,
Figure 5.6 shows three footholds with increasing levels of curvature. The foothold
illustrated in Figure 5.6(b) is preferable to flat terrain (Figure 5.6(a)), because it prevents
foot slippage. The one in Figure 5.6(c), however, is not, because the foot could get stuck
in it. Such a ranking is hard to achieve with a reward function that is a linear function
of curvature.

Although this specific example could possibly be handled by hand-crafting additional
features, one can imagine that many more such situations exist which cannot be foreseen.
The problem thus requires a much more general solution.¶
§ Our choice of a specific classifier is motivated in Section 5.4.4.
¶ Non-linear classifiers, such as the use of radial basis function kernels in SVMs [85], can learn non-linear
ranking functions [36], but are still limited by the original feature representation. Additionally, the
decision function cannot be interpreted as easily as in the linear case.
Figure 5.6: Three footholds with increasing values of curvature: (a) flat terrain; (b)
terrain with mild curvature, preferable to flat terrain since it prevents slipping; (c) terrain
with high curvature, which can cause the foot to get stuck, thus less preferred than (a)
or (b).
We propose the concept of a terrain template, which is a discretized height map of
the terrain in a small area around the foothold. One could imagine an approach in which
the robot maintains a library of such templates, each associated with a reward value.
Subsequently, during execution, each candidate foothold can be assigned the reward from
the closest matching template in the library (using an arbitrary similarity measure). In
the context of Figure 5.6, the robot would store a template for each of the three footholds
shown, along with an appropriate reward for each. Assigning the highest reward to the
template for Figure 5.6(b) would allow the robot to select that foothold over the others.
Manual creation of such a library of templates would be too time-consuming, and it
would be nearly impossible to attain good generalization performance, since the rewards
for each template would have to be tuned carefully. Hence, we propose an algorithm that
uses expert-demonstrated footholds to simultaneously learn a small set of templates, and
a foothold ranking function that uses these templates. We first describe our methodology
for template extraction from expert demonstrations, followed by the template learning
algorithm.
5.4.5 Template Extraction
From each foothold demonstration ⟨F, f_c⟩ made by the expert, we extract a set of templates
on multiple scales. Figure 5.7(b) shows the three scales of templates that were
extracted from the demonstrated foothold (light green ball in Figure 5.7(a)). The scales
that we use are dictated by the geometry of our quadruped robot, and are designed to
independently capture different properties of the terrain that make up its reward function.
The smallest scale (24 × 24 mm) is on the order of the size of the foot, and encodes
information about surface friction properties and micro-features that define the slip
characteristics of the foothold. We intend to capture information about clearance from obstacles
and drop-offs using the medium scale (54 × 54 mm). Finally, the large scale (144 × 54 mm)
accounts for potential shin and knee collisions. The representation of different properties
of the terrain using multiple scales at different resolutions prevents the problem of
over-fitting that can occur if only the largest scale is used at high resolution.

Figure 5.7: (a) Teaching interface used to demonstrate footholds. The red ball at the foot
indicates the chosen (dangerous) foothold, the light green ball below indicates the
demonstrated optimal (safe) foothold; (b) Terrain templates at multiple scales (24 × 24 mm,
54 × 54 mm, and 144 × 54 mm) extracted from the demonstrated foothold. The white
spheres indicate the position of the foot on each template.

Templates on all scales are also extracted from all the other reachable footholds in
the set F, for every expert demonstration. This creates a large library of templates
representing different kinds of terrain. This comprises the input to the following template
learning algorithm, which selects a small subset of these templates and learns a foothold
ranking function using them.
5.4.6 Template Learning
Each template in the library contributes a feature to the feature vector x_i of a foothold
f_i. The feature value represents the similarity between the template and the candidate
foothold. We choose a radial basis function similarity measure – a negative squared
exponential kernel centered at the template, and evaluated at the candidate foothold.
The feature value is given by:

x = \exp\left( -h \sum_{i=1}^{n} (t_i - c_i)^2 \right),   (5.4)

where x is the value of the feature, h is the bandwidth parameter of the kernel‖, n is
the number of terrain points in the discretized height map of the template, t_i is the i-th
height value of the template, and c_i is the i-th height value of the candidate foothold.

‖ Currently, the bandwidth parameter h of the kernel is tuned manually per length scale. We keep the
parameter fixed for all experimental results shown in this chapter.
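The template feature of Eq. (5.4) can be computed directly from the two discretized height maps, as in the small sketch below; the bandwidth value shown is only a placeholder.

```python
import numpy as np

def template_feature(template_heights, candidate_heights, bandwidth=100.0):
    """Radial-basis-function similarity between a terrain template and a
    candidate foothold, cf. Eq. (5.4): exp(-h * sum_i (t_i - c_i)^2).
    Both inputs are discretized height maps over the same grid."""
    t = np.asarray(template_heights, dtype=float).ravel()
    c = np.asarray(candidate_heights, dtype=float).ravel()
    return float(np.exp(-bandwidth * np.sum((t - c) ** 2)))
```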
This feature set forms the input to the algorithm described in Section 5.4.3, that
learns a ranking function for footholds from expert demonstrations. This is done by
converting the expert ranking data into a problem of linear binary classification, and using
the resulting weights as a ranking function. However, since we have a potentially huge
template library, selecting a small subset of these templates for use in the final ranking
function is important for two reasons. Firstly, the use of the entire template set would
allow the learning algorithm to trivially overfit all the training data, resulting in poor
generalization. And secondly, the similarity measure for templates needs to be evaluated
in real-time while the robot is walking, and these computations can be prohibitively
expensive for a large template library.

We combine the steps of template subset selection and weight learning by using a linear
classifier that promotes sparsity in feature space. We run an efficient implementation of
l_1-regularized logistic regression (LR) [50] on the pairwise difference feature vector data
(x_c − x_i). l_1-regularized LR is a linear classifier that can be defined as an optimization
problem to minimize the following cost function:

J = \sum_{i=1}^{m} -\log \frac{1}{1 + \exp(-w^T x_i \, y_i)} + \lambda \|w\|_1,   (5.5)
where m is the number of training examples, x_i ∈ R^d is the i-th input vector, y_i ∈ {−1, 1}
is the label of the i-th input vector, w ∈ R^d is the weight vector, and λ is the regularization
parameter. The predicted label y_test for an input x_test is obtained using:

y_test = \mathrm{sgn}(w^T x_test).   (5.6)

The second term in Eq. (5.5) is a regularization term that penalizes the l_1 norm of the
weight vector. The use of an l_1 regularization term has been known to produce sparse
solutions, i.e. a weight vector w which contains very few non-zero values [68]. It has been
shown that l_1-regularized LR can outperform l_2-regularized LR, especially in cases where
the number of features is much larger than the number of data points [68]. In our case,
each expert demonstration produces a large set of training examples for the classifier.
However, data from footholds that are very close to each other is likely to be very similar,
so the true number of unique training examples that convey novel information is usually
much lower.

The regularization parameter λ can be viewed as a control for the desired amount of
sparsity – higher values of λ typically produce sparser weight vectors. We fix the value
of λ by searching through a range of values and picking the one that minimizes the cross
validation training error.

The sparse weight vector w that we obtain from the l_1-regularized LR allows us to
discard all the templates from the library which have a zero weight. The reward for each
foothold can now be evaluated by calculating the feature values for templates that remain
in the library using Eq. (5.4), and evaluating the reward function in Eq. (5.1) using the
learnt weight vector w.
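One way to realize this step with an off-the-shelf l_1-regularized logistic regression implementation is sketched below. Since generic solvers expect both classes to be present, the pairwise-difference data is symmetrized here by adding negated copies with label −1, which leaves the learned direction of w unchanged; this detail is an implementation choice, not part of the formulation above. Templates whose learned weight is zero are then simply discarded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_sparse_ranking_weights(X_diff, lam=1.0):
    """Fit l1-regularized logistic regression on pairwise-difference feature
    vectors (x_c - x_i). The data is symmetrized with negated copies labeled -1
    so the binary classifier sees both classes."""
    X = np.vstack([X_diff, -X_diff])
    y = np.concatenate([np.ones(len(X_diff)), -np.ones(len(X_diff))])
    clf = LogisticRegression(penalty='l1', C=1.0 / lam, solver='liblinear',
                             fit_intercept=False)
    clf.fit(X, y)
    return clf.coef_.ravel()

def prune_templates(templates, weights, tol=1e-8):
    """Discard templates whose learned weight is (numerically) zero."""
    keep = np.abs(weights) > tol
    return [t for t, k in zip(templates, keep) if k], weights[keep]
```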
5.4.7 Results
In order to evaluate the performance of our template learning approach, we collected
around 100 expert foothold demonstrations across different kinds of terrain of varying
difficulty levels. Three different foothold reward functions were learnt using the following
feature sets: (a) multi-scale terrain features, (b) multi-scale terrain templates and (c)
terrain templates + terrain features. 26 pose features were also used along with each of
the above feature sets. The pose features included measures of progress towards the goal,
stability margins, reachability of the foothold, differences in leg heights, and knee and
body clearance from the terrain.
Table 5.1: Number of templates learnt on each spatial scale, from a total of 946, with and
without augmentation by features.

                         Small scale   Medium scale   Large scale
Templates + Features          1             19            23
Templates Only               10             24            28
The terrain features we used (on each scale) were slope and curvature along the x
and y axes*, and standard deviation of the terrain. This resulted in a total of 9 features
being used per length scale. The same three length scales were used for both feature
computation and template extraction, and are depicted in Figure 5.7(b). The features
and templates are mirrored in the y direction for the left and right legs, but independent
functions were learnt for front and hind legs as their properties tend to be quite different.

* We split each slope and curvature feature into two distinct features for the positive and negative
directions. The x and y axes are aligned with the orientation of the robot, i.e., in body coordinates, with
x pointing forward, and y to the left.
A foothold ranking function was learnt for each of the three feature sets mentioned
above, using the algorithms described in Sections 5.4 and 5.4.4. The l_1 regularization
parameter was set individually in each case by choosing the value that minimized cross
validation error on the training data. Table 5.1 shows the number of templates selected
on each scale, with and without the use of terrain features. Interestingly, almost no
templates were needed on the smallest scale when features were used; however, most of
the templates on the medium and large scales were still required. This suggests that the
expert's ranking function contained a significant amount of non-linearity at the larger
scales that terrain features were unable to capture. At the smallest scale, the information
contained by the templates seems to be identical to that of the features computed at this
scale.
We measured locomotion performance using two metrics: success rate, and average
slip experienced by the robot at each foothold. A run was deemed a success if the robot
crossed the terrain and reached the goal without falling over. Slip was measured as the
distance between the position of a foot at touchdown and the position of the same foot
before it swings again, i.e., one full walking cycle later. We averaged the slip only over
successful runs, since including the slip experienced on runs that failed would render the
statistic meaningless. Table 5.2 shows the results that were obtained by performing 21
runs using the three different ranking functions learnt from the three feature sets. These
runs were performed on a test terrain module which the robot was not trained on. The use
of templates is seen to improve performance over that of terrain features, in terms of both
success rate and slip. It is also observed that combining templates and features results in
the most robust performance and lowest average slip. The baseline slip performance of
the robot walking on flat terrain is also shown, as a lower bound to the achievable amount
of slip on our experimental setup.††

†† The amount of slip experienced on flat terrain might seem rather high. This is largely because the
spherical feet on the LittleDog robot roll forward as the robot moves forward, resulting in a systematic
change in the perceived foot location.

Table 5.2: Success rates and average slip at footholds using different feature sets.
Experimental details can be found in Section 5.4.7.

Method                       Success rate        Average slip on
                             (out of 21 runs)    successful runs (mm)
Features Only                47.6%               24.8 ± 5.9
Templates Only               76.2%               20.2 ± 5.1
Templates + Features         100.0%              17.3 ± 3.3
Baseline (on flat terrain)   100.0%              13.4 ± 0.4

Figure 5.8: Sequence of snapshots of the quadruped robot LittleDog crossing the test
terrain after template learning. (left to right, top to bottom)
The results achieved using a combination of templates and features suggest that
achieving broad generalizations using heuristic features, while simultaneously representing
exceptions to those rules by using templates, results in a highly performant system.
We stress the fact that since the templates are learnt automatically, strong regularization
during the learning process is a key component in achieving generalization and preventing
overfitting to the training data.
5.4.8 Discussion
Our technique for learning good foot placement is inherently a supervised learning
approach, wherein an expert demonstrates good footholds and the algorithm learns how to
choose footholds with similar characteristics. This is in contrast to reinforcement learning
(RL) approaches, where the robot would learn to select good footholds based on its
own past successes and failures. The RL technique can possibly result in performance
superior to that achievable by learning from expert demonstrations, since the expert is
not necessarily always optimal in her/his choices. Indeed, we have experimented with RL
approaches in the early stages of the project. Promising though they might be, they
require a very large number of trials over the terrain in order to learn a reasonable foothold
selection function. This tends to be quite taxing on the robot hardware. When it comes
to very difficult terrain, where good footholds are few and far between, the probability
of a successful trial during RL exploration is low, making learning very difficult. Our
supervised learning approach learns much faster with the help of direct feedback provided
by the expert.
5.5 Evaluations
We now present evaluations of our LittleDog control software on the terrain modules that are present in our lab, as well as results from tests conducted by an external test team on terrain that we had never seen.
In our lab, we tested our controller on rocky terrain of various difficulty levels, round rocks/pebbles (not moveable), and wavy dunes. The most difficult rocky terrain contains obstacles up to 10 cm (76% of the usable leg length) in height. We also tested our software on more structured test scenarios like gaps, steps of varying step heights, slopes, and barriers. Our controller is able to traverse all classes of terrain at speeds ranging from 7.2 cm/s to 13 cm/s (0.24 to 0.43 body lengths/s). The advantage of our ability to replan quickly in case of a deviation from the plan is evident when testing on more challenging rocky terrain, where the robot succeeds with high probability even after slipping or imprecise foot placement. Steps up to a height of 10 cm can be climbed using our regular walking gait, and lunging can be performed to traverse steps up to 12 cm in height. Similarly, lunging is employed to cross barriers up to 12 cm in height. Gaps up to 15 cm wide can be traversed successfully by generating a footstep plan using the ARA* planner, and lunging can again be employed to cross gaps up to 17 cm wide.
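As a quick check of the body-length normalization quoted above, a body length of roughly 30 cm is implied by the stated figures (7.2 cm/s corresponds to 0.24 body lengths/s):

    # Body length of ~30 cm inferred from the quoted 7.2 cm/s = 0.24 BL/s.
    BODY_LENGTH_CM = 30.0
    for speed_cm_s in (7.2, 13.0):
        print(f"{speed_cm_s} cm/s = {speed_cm_s / BODY_LENGTH_CM:.2f} body lengths/s")
    # 7.2 cm/s = 0.24 body lengths/s, 13.0 cm/s = 0.43 body lengths/s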
Inverse dynamics control, coupled with force control and low PD gains, allows us to overcome unperceived obstacles. This property was tested by placing wooden planks on the terrain; these were not sensed by the motion capture system. Our controller was able to handle obstacle heights up to 4 cm without significant difficulty. Additionally, a moveable see-saw was set up on the terrain, once again unsensed by the motion capture system. This scenario posed no problems for our software, demonstrating the robustness added by this combination of controllers.
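The compliance referred to here comes from pairing a model-based feedforward torque with deliberately low feedback gains. A minimal sketch of such a control law is shown below; the gain values and the inverse_dynamics routine are illustrative placeholders, not our actual implementation:

    import numpy as np

    def compute_torques(q, qd, q_des, qd_des, qdd_des, inverse_dynamics,
                        kp=5.0, kd=0.5):
        # Feedforward torque from the (floating-base) inverse dynamics model.
        tau_ff = inverse_dynamics(q, qd, qdd_des)
        # Low-gain PD feedback: since most of the tracking effort comes from the
        # feedforward term, an unexpected contact produces only a small corrective
        # torque, so the leg yields rather than fighting the obstacle.
        tau_fb = kp * (np.asarray(q_des) - q) + kd * (np.asarray(qd_des) - qd)
        return tau_ff + tau_fb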
Figure 5.8 shows the robot traversing one of the test terrain modules. The attached
video (Extension 1, also available at [38]) highlights the key features of our controller, and
contains a compilation of video from test runs of the robot on different kinds of terrain.
Table 5.3: Results from final tests of Phase 3 of the Learning Locomotion program. Average speeds are from the best two trials (out of three); individual run speeds were not provided to us. Testing on the first seven terrains was mandatory; these terrains were not shown to us before testing. The last three terrains were optional tests that we chose to demonstrate our performance on.

Terrain          Average speed
Gap              10.7 cm/s
Barrier          12.1 cm/s
Rocks             5.2 cm/s
Sloped Rocks      7.9 cm/s
Round Rocks      10.0 cm/s
Logs              7.8 cm/s
Large Steps       7.2 cm/s
Wavy Dunes       10.2 cm/s
Small Steps      10.9 cm/s
Mixed Steps       7.9 cm/s
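Individual run speeds were not provided to us, but the scoring reduction described in the caption (average speed over the best two of three runs) amounts to the following small computation:

    def best_two_average(run_speeds_cm_s):
        # Average speed over the best two of three trials, as scored in Table 5.3.
        assert len(run_speeds_cm_s) == 3
        return sum(sorted(run_speeds_cm_s)[-2:]) / 2.0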
5.5.1 Learning Locomotion Test Results
As mentioned in Section 5.2.3, our control software was subject to additional testing by an external test team, on seven different classes of terrain that had never been shown to us. Table 5.3 shows the average speeds obtained from the final tests on these seven classes of terrain, as well as on three optional terrains. Our controller managed to successfully traverse all classes of terrain in at least two out of three trials, with only one terrain class below the metric speed of 7.2 cm/s. Although we cannot disclose the performance of the other teams on the program, our team was the only one that crossed at least six out of seven terrains at or above the metric speed. These results serve as a testament to the generalization ability of our controller.
5.6 Discussion
Our system is made up of a number of components, as discussed throughout the chapter, and depicted in Figure 5.3. Clearly, there are many complex and subtle interactions between these components. In general, when studying the approaches of the other groups involved in the Learning Locomotion project, we see a balance being achieved between planning and control. If planning is performed accurately and at great depth, one can get away with simple high-gain position control to execute this perfect plan. At the other end of the spectrum, one can get away with more approximate planning if the execution of this plan is robust to failure. We believe our approach tends more towards the latter. We use a fast but greedy footstep planner that is myopic in nature, but its deficiencies are compensated for by a good swing leg trajectory planner, and by executing these plans using our compliant control module.
In a large software and control system like the one we have presented, it is inevitable that there are quite a few parameters that need tuning. In our experience, a large number of the parameters have been of the "set-and-forget" variety, i.e., we tuned them once during the initial development of each module and never needed to revisit them later. The one parameter that does need tuning in our system is the heuristic that defines the duration of a swing leg motion. The locomotion speed of the robot depends primarily on this one parameter, which requires tuning per terrain class for optimal performance. If suboptimal speed were acceptable, as in non-competitive daily-life scenarios, no parameter tuning would be required.
Finally, the biggest lesson that we have learnt in the development of this project is that
there is no single component or algorithm in our controller that is primarily responsible
for achieving good locomotion performance. Each component plays an equally important
role, from selecting the right footholds that do not slip, to compliant model-based control
of the robot for increased robustness. It is the combination of all of these techniques
that has rendered our approach successful and competitive in the Learning Locomotion
project.
5.7 Conclusion
We have presented a general quadruped locomotion controller for fast locomotion over rough terrain, and its application on the LittleDog robot. This controller relies on a decomposition of the problem into many sub-systems, in which we apply state-of-the-art learning, planning, optimization and control techniques to achieve high performance. Novel features of our controller include a procedure that learns optimal foothold choices using terrain templates, a body trajectory optimizer based on the ZMP stability criterion, and a floating-base inverse dynamics controller that does not require knowledge of the contact forces. The evaluations presented have shown that our controller is highly performant and generalizes well to many different kinds of terrain. This controller represents our final submission to the Learning Locomotion program, in which five other teams also took part. Our approach has proven to be competitive, achieving the maximum number of terrains crossed successfully above the metric speed set by the program.

Most of the techniques developed in this chapter are generally applicable to similar planning and control problems on other locomotion systems. The template-based foothold learning algorithm is a data-driven technique, and does not make assumptions about the number of legs or the shape of the foot. The ZMP trajectory optimizer can be applied to any robot which can be approximated by the cart-table model, which has been used before for biped robots [37]. Similarly, our floating-base inverse dynamics controller extends to other articulated robots, and has been applied to a simulated biped in other work [64]. In future work we intend to apply these techniques to other platforms such as humanoid robots.
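For reference, the cart-table model mentioned above relates the zero-moment point (p_x, p_y) to the horizontal center-of-mass position (x, y) under the assumption of a constant CoM height z_c and gravitational acceleration g [37]:

    p_x = x - \frac{z_c}{g}\,\ddot{x}, \qquad p_y = y - \frac{z_c}{g}\,\ddot{y}

Any robot whose center of mass can be kept at an approximately constant height during walking can therefore reuse the same body trajectory optimizer.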
Chapter 6
Conclusions and Future Work
As long as Moore's law continues to hold and ever-increasing computing power is cheaply available, planning and optimization methods will continue to be a successful path to enabling autonomous behavior. The objective function is the sole specification of how well these systems perform, and is thus of critical importance. In this thesis, we have investigated algorithms for learning these objective functions from demonstration, with applications to autonomous locomotion and manipulation. We presented a sampling-based trajectory optimizer that is able to generate smooth, locally optimal trajectories with respect to arbitrary cost functions, and demonstrated its application to manipulation planning and reinforcement learning. We developed the path integral inverse reinforcement learning method, which learns a cost function over trajectories by sampling. The assumption that the demonstration is locally optimal makes learning feasible in high-dimensional spaces. We applied this algorithm to learn a redundancy resolution scheme for inverse kinematics, and a cost function for trajectory optimization, both from human demonstrations. Finally, we presented a system developed for quadruped locomotion, the key component of which is a module that learns optimal foothold selection using IRL. Our method learns a reward function for foothold selection in rough terrain using terrain templates, which removes the need for feature engineering and manual weight tuning. The work presented thus far opens up a whole new set of potential research directions. Below, we list the ones that we deem most promising and have already started investigating:
6.1 Reinforcement Planning
In this thesis, we have presented ways of learning objective functions from demonstration, which fall under the umbrella of Inverse Reinforcement Learning. In some cases, observations of optimal behavior may be hard to come by. For example, it may be extremely difficult to provide demonstrations of how a quadruped robot must move all 12 of its joints while maintaining stability. These scenarios usually have a well-understood long-term reward, i.e., does the robot fall or remain stable. However, long optimization horizons lead to slow performance of optimal planners. Breaking up this long-term reward into many short-term rewards can increase performance and make optimization tractable, but the form of these short-term rewards is less clear. We can attempt to approximate and learn these short-term "proxy" rewards through reinforcement learning. Such a method, combined with a short-horizon model predictive control approach, may prove successful in locomotion control. To the best of our knowledge, research on such techniques has been scarce; we know of one paper that called this technique "Reinforcement Planning" [102].
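As a rough sketch of this idea (our own illustration with hypothetical callables, not the algorithm of [102]), an outer reinforcement learning loop could adjust the weights of a linear short-term proxy reward used by a short-horizon planner, driven only by the sparse long-term outcome:

    import numpy as np

    def learn_proxy_reward(rollout_return, make_planner, n_features,
                           n_iters=100, n_samples=16, sigma=0.1, lr=0.5):
        # rollout_return(policy) -> sparse long-term outcome of one episode,
        #   e.g. 1.0 if the robot stays stable and reaches the goal, else 0.0.
        # make_planner(w) -> a short-horizon planner/MPC policy that greedily
        #   optimizes the proxy reward r(s, a) = w . phi(s, a).
        w = np.zeros(n_features)
        for _ in range(n_iters):
            # Perturb the proxy-reward weights and evaluate long-term outcomes.
            eps = np.random.randn(n_samples, n_features)
            returns = np.array([rollout_return(make_planner(w + sigma * e))
                                for e in eps])
            # Evolution-strategies-style update: move the weights toward
            # perturbations that produced better long-term outcomes.
            w = w + lr * ((returns - returns.mean()) @ eps) / (n_samples * sigma)
        return w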
6.2 Simultaneous Motion and Force Planning
Simultaneous planning of motion and contact forces remains a challenging problem. With methods like STOMP and recent developments in inverse dynamics algorithms for optimal contact force distribution [80], we now have an alternative way of addressing such problems, as opposed to traditional methods in optimal control. This would allow the planning of complex, multi-contact motions for humanoid robots that satisfy balance constraints. Including contact forces in the planner will allow a wider range of motion as we move away from the regime of static stability.
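One possible formulation, sketched here under our own assumptions rather than as an existing implementation, is to let a STOMP-style sampler evaluate candidate (motion, contact force) plans with a cost that penalizes non-smoothness, violations of the floating-base dynamics, and forces outside the friction cones:

    import numpy as np

    def motion_and_force_cost(joint_traj, contact_forces, dynamics_residual,
                              mu=0.6, w_smooth=1.0, w_balance=10.0, w_friction=10.0):
        # joint_traj:     (T, n_joints) joint positions over the horizon.
        # contact_forces: (T, n_contacts, 3) planned contact forces (x, y, normal).
        # dynamics_residual(q, f) -> scalar measuring how far the floating-base
        #   dynamics/balance constraints are from being satisfied by forces f.
        acc = np.diff(joint_traj, n=2, axis=0)                   # finite-difference accelerations
        smoothness = w_smooth * np.sum(acc ** 2)
        balance = w_balance * sum(dynamics_residual(q, f)
                                  for q, f in zip(joint_traj, contact_forces))
        normal = contact_forces[..., 2]                          # friction-cone violation
        tangential = np.linalg.norm(contact_forces[..., :2], axis=-1)
        friction = w_friction * np.sum(np.maximum(0.0, tangential - mu * normal))
        return smoothness + balance + friction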
BIBLIOGRAPHY
[1] P. Abbeel and A. Ng. Apprenticeship Learning via Inverse Reinforcement Learning.
In Proc. of the Intl Conf. on Machine Learning, 2004.
[2] N. Aghasadeghi and T. Bretl. Maximum Entropy Inverse Reinforcement Learning in Continuous State Spaces with Path Integrals. In IEEE/RSJ Intl Conf. on Intelligent Robots and Systems, pages 1561–1566, 2011.
[3] G. Andrew and J. Gao. Scalable Training of L1-Regularized Log-Linear Models. In ACM Intl Conf. on Machine Learning, pages 33–40, 2007.
[4] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[5] Michael Beetz, Freek Stulp, Piotr Esden-Tempski, Andreas Fedrizzi, Ulrich Klank, Ingo Kresse, Alexis Maldonado, and Federico Ruiz. Generality and legibility in mobile manipulation. Autonomous Robots, 28(1):21–44, 2010.
[6] Richard Bellman. On the theory of dynamic programming. Proceedings of the
National Academy of Sciences of the United States of America, 38(8):716, 1952.
[7] D. Berenson, P. Abbeel, and K. Goldberg. A robot path planning framework that
learns from experience. In IEEE International Conference on Robotics and Automa-
tion (ICRA), 2012.
[8] D. Berenson, S.S. Srinivasa, D. Ferguson, A. Collet, and J.J. Kuffner. Manipulation Planning with Workspace Goal Regions. In IEEE Intl Conf. on Robotics and Automation, pages 618–624, 2009.
[9] Dmitry Berenson, Siddhartha S. Srinivasa, Dave Ferguson, and James J. Kuffner. Manipulation planning on constraint manifolds. In IEEE International Conference on Robotics and Automation, May 2009.
[10] D.P. Bertsekas, W.W. Hager, and O.L. Mangasarian. Nonlinear programming. Athena Scientific, Belmont, MA, 1999.
[11] A. Boularias, J. Kober, and J. Peters. Relative Entropy Inverse Reinforcement Learning. In ICAPS, volume 15, pages 20–27, 2011.
[12] J. Buchli, M. Kalakrishnan, M. Mistry, P. Pastor, and S. Schaal. Compliant
quadruped locomotion over rough terrain. In IEEE/RSJ International Conference
on Intelligent Robots and Systems, 2009.
[13] Jonas Buchli, Evangelos Theodorou, F. Stulp, and S. Schaal. Variable impedance control - a reinforcement learning approach. In Robotics: Science and Systems, 2010.
[14] Katie Byl, Alec Shkolnik, Sam Prentice, Nick Roy, and Russ Tedrake. Reliable dynamic motions for a stiff quadruped. In Experimental Robotics, pages 319–328, 2009.
[15] G.C. Cawley, N.L.C. Talbot, and M. Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. Advances in Neural Information Processing Systems, 19:209, 2007.
[16] J.G. Cham, J.K. Karpick, and M.R. Cutkosky. Stride period adaptation of a
biomimetic running hexapod. The International Journal of Robotics Research,
23(2):141, 2004.
[17] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243–270, 1999.
[18] Ioan A. Șucan, Mrinal Kalakrishnan, and Sachin Chitta. Combining planning techniques for manipulation using realtime perception. In IEEE International Conference on Robotics and Automation, Anchorage, Alaska, May 2010.
[19] Peter Dayan and Geoffrey E. Hinton. Using EM for reinforcement learning. Neural Computation, 9, 1997.
[20] R. Diankov, N. Ratliff, D. Ferguson, S. Srinivasa, and J.J. Kuffner. Bispace planning: Concurrent multi-space exploration. In Robotics: Science and Systems, Zurich, Switzerland, 2008.
[21] E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.
[22] A.D. Dragan, N.D. Ratliff, and S.S. Srinivasa. Manipulation Planning with Goal Sets using Constrained Trajectory Optimization. In IEEE Intl Conf. on Robotics and Automation, pages 4582–4588, 2011.
[23] K. Dvijotham and E. Todorov. Inverse Optimal Control with Linearly-Solvable MDPs. In Proceedings of the Intl Conf. on Machine Learning, volume 10, pages 335–342, 2010.
[24] R. Featherstone. Rigid body dynamics algorithms. Springer-Verlag New York Inc,
2008.
[25] E. Frazzoli and S. Karaman. Incremental sampling-based algorithms for optimal
motion planning. Int. Journal of Robotics Research, 2010.
[26] Y. Fukuoka, H. Kimura, and A.H. Cohen. Adaptive dynamic walking of a quadruped
robot on irregular terrain based on biological concepts. The International Journal
of Robotics Research, 22(3-4):187, 2003.
[27] Carlos E. Garcia, David M. Prett, and Manfred Morari. Model predictive control: theory and practice – a survey. Automatica, 25(3):335–348, 1989.
[28] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
[29] K. Hauser and V. Ng-Thow-Hing. Fast Smoothing of Manipulator Trajectories
using Optimal Bounded-Acceleration Shortcuts. In IEEE International Conference
on Robotics and Automation, 2010.
[30] S. Hirose, T. Masui, H. Kikuchi, Y. Fukuda, and Y. Umetani. Titan III: a quadruped walking vehicle. In Robotics Research – The Second International Symposium, pages 325–331, 1985.
[31] J.K. Hodgins and M.N. Raibert. Adjusting step length for rough terrain locomotion. IEEE Transactions on Robotics and Automation, 7(3):289–298, 1991.
[32] A.J. Ijspeert. Central pattern generators for locomotion control in animals and robots: a review. Neural Networks, 21(4):642–653, 2008.
[33] Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. Learning attractor landscapes for learning motor primitives. In Advances in Neural Information Processing Systems 15, pages 1547–1554. MIT Press, 2002.
[34] David H. Jacobson. New second-order and first-order algorithms for determining optimal control: A differential dynamic programming approach. Journal of Optimization Theory and Applications, 2(6):411–440, 1968.
[35] N. Jetchev and M. Toussaint. Trajectory Prediction in Cluttered Voxel Environ-
ments. In IEEE International Conference on Robotics and Automation, 2010.
[36] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings
of the eighth ACM SIGKDD international conference on Knowledge discovery and
data mining, 2002.
[37] S. Kajita, F. Kanehiro, K. Kaneko, K. Fujiwara, K. Harada, K. Yokoi, and H. Hirukawa. Biped walking pattern generation by using preview control of zero-moment point. In IEEE International Conference on Robotics and Automation, volume 2, pages 1620–1626, 2003.
[38] M. Kalakrishnan, J. Buchli, P. Pastor, M. Mistry, and S. Schaal. Learning locomo-
tion with littledog. http://www.youtube.com/watch?v=cYo9Whssla8.
[39] M. Kalakrishnan, J. Buchli, P. Pastor, M. Mistry, and S. Schaal. Fast, robust
quadruped locomotion over challenging terrain. In IEEE International Conference
on Robotics and Automation, 2010.
[40] M. Kalakrishnan, J. Buchli, P. Pastor, and S. Schaal. Learning Locomotion over Rough Terrain using Terrain Templates. In IEEE/RSJ Intl Conf. on Intelligent Robots and Systems, pages 167–172, 2009.
[41] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal. STOMP: Stochastic Trajectory Optimization for Motion Planning. In IEEE Intl Conf. on Robotics and Automation, pages 4569–4574, 2011.
[42] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal. Learning force control policies for compliant manipulation. http://www.youtube.com/watch?v=LkwQJ9_i6vQ.
[43] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal. Learning force control policies for compliant manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), 2011.
[44] M. Kalakrishnan, E. Theodorou, and S. Schaal. Inverse Reinforcement Learning
with PI2. In The Snowbird Workshop, 2010.
[45] Rudolf Emil Kalman. Contributions to the theory of optimal control. Bol. Soc. Mat. Mexicana, 5(2):102–119, 1960.
[46] S. Karaman and E. Frazzoli. Incremental Sampling-based Algorithms for Optimal
Motion Planning. In Robotics: Science and Systems, 2010.
[47] J. Kober and J. Peters. Learning motor primitives for robotics. In IEEE Intl Conf. on Robotics and Automation, pages 2112–2118, 2009.
[48] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, pages 1–33, 2009.
[49] M. Kobilarov. Cross-entropy motion planning. The International Journal of Robotics Research, 31(7):855–871, 2012.
[50] Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd, and Yi Lin. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 2007.
[51] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation, volume 3, pages 2619–2624, 2004.
[52] J. Z. Kolter, P. Abbeel, and A. Y. Ng. Hierarchical apprenticeship learning, with
application to quadruped locomotion. In Neural Information Processing Systems,
volume 20, 2007.
[53] J.Z. Kolter, M.P. Rodgers, and A.Y. Ng. A control architecture for quadruped locomotion over rough terrain. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 811–818, 2008.
[54] J.Z. Kolter, P. Abbeel, and A.Y. Ng. Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion. Advances in Neural Information Processing Systems, 20:769–776, 2008.
[55] P. Kormushev, S. Calinon, and D.G. Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In International Conference on Intelligent Robots and Systems, pages 3232–3237, October 2010.
[56] P. Kormushev, S. Calinon, and D.G. Caldwell. Imitation learning of positional and force skills demonstrated via kinesthetic teaching and haptic input. Advanced Robotics, 25(5):581–603, 2011.
[57] J.J. Kuffner, K. Nishiwaki, S. Kagami, M. Inaba, and H. Inoue. Motion planning for humanoid robots. In Proc. 20th Int'l Symp. Robotics Research (ISRR'03), 2003.
[58] Bo Li, Thomas Bengtsson, and Peter Bickel. Curse-of-dimensionality revisited:
Collapse of importance sampling in very large scale systems. Technical report,
Department of Statistics, University of California-Berkeley, 2005.
[59] Maxim Likhachev, Geoff Gordon, and Sebastian Thrun. ARA*: Anytime A* with provable bounds on suboptimality. Advances in Neural Information Processing Systems, 2003.
[60] R. Malouf et al. A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In Proceedings of the Conf. on Natural Language Learning, pages 49–55, 2002.
[61] R.B. McGhee and A.A. Frank. On the stability properties of quadruped creeping gaits. Mathematical Biosciences, 3(1-2):331–351, August 1968.
[62] Robert B. McGhee. Finite state control of quadruped locomotion. SIMULATION, 9(3):135–140, 1967.
[63] M. Mistry. The Representation, Learning, and Control of Dexterous Motor Skills
in Humans and Humanoid Robots. PhD thesis, University of Southern California,
2009.
[64] M. Mistry, J. Buchli, and S. Schaal. Inverse dynamics control of floating base systems using orthogonal decomposition. In Proceedings of the 2010 IEEE Int. Conference on Robotics and Automation, 2010.
[65] Michael Mistry, Jun Nakanishi, Gordon Cheng, and Stefan Schaal. Inverse kinematics with floating base and constraints for full body humanoid robot control. In 8th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2008), pages 22–27, 2008.
[66] H. Murao, H. Tamaki, and S. Kitamura. Walking pattern acquisition for quadruped robot by using modular reinforcement learning. In Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, volume 3, pages 1402–1405, 2001.
[67] A.Y. Ng and S. Russell. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Intl Conf. on Machine Learning, pages 663–670. Morgan Kaufmann, 2000.
[68] Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78, Banff, Alberta, Canada, 2004. ACM.
[69] A.Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In
Intl Conf. on Machine Learning, page 78. ACM, 2004.
[70] P. Pastor, M. Kalakrishnan, S. Chitta, E. Theodorou, and S. Schaal. Skill learning and task outcome prediction for manipulation. In IEEE Intl Conf. on Robotics and Automation, pages 3828–3834, 2011.
[71] P. Pastor, M. Kalakrishnan, L. Righetti, and S. Schaal. Towards associative skill
memories. In IEEE-RAS Intl Conf. on Humanoid Robots, 2012.
[72] Peter Pastor, Mrinal Kalakrishnan, Sachin Chitta, Evangelos Theodorou, and Ste-
fan Schaal. Skill learning and task outcome prediction for manipulation. In IEEE
International Conference on Robotics and Automation, 2011.
[73] Lev Semenovich Pontryagin. The mathematical theory of optimal processes, vol-
ume 4. CRC Press, 1962.
[74] M. Raibert, K. Blankespoor, G. Nelson, and R. Playter. BigDog, the rough-terrain quadruped robot. In Proceedings of the 17th World Congress of the International Federation of Automatic Control (IFAC), pages 10822–10825, April 2008.
[75] M.H. Raibert. Legged robots that balance. MIT press Cambridge, MA, 1986.
[76] N. Ratliff, D. Silver, and J.A. Bagnell. Learning to Search: Functional Gradient Techniques for Imitation Learning. Autonomous Robots, 27(1):25–53, 2009.
[77] N. Ratliff, M. Zucker, J.A. Bagnell, and S. Srinivasa. CHOMP: Gradient optimization techniques for efficient motion planning. In IEEE International Conference on Robotics and Automation, pages 12–17, 2009.
[78] Nathan D. Ratliff. Learning to search: structured prediction techniques for imitation learning. PhD thesis, Carnegie Mellon University, 2009.
[79] J. R. Rebula, P. D. Neuhaus, B. V. Bonnlander, M. J. Johnson, and J. E. Pratt. A
controller for the LittleDog quadruped walking on rough terrain. In Proceedings of
the IEEE International Conference on Robotics and Automation, 2007.
[80] L. Righetti, J. Buchli, M. Mistry, M. Kalakrishnan, and S. Schaal. Optimal distribution of contact forces with inverse dynamics control. The International Journal of Robotics Research, 32(3):280–298, 2013.
[81] L. Righetti, J. Buchli, M. Mistry, and S. Schaal. Control of legged robots with optimal distribution of contact forces. In 2011 11th IEEE-RAS International Conference on Humanoid Robots, pages 318–324, 2011.
[82] R. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999.
[83] Radu Bogdan Rusu, Ioan A. Șucan, Brian Gerkey, Sachin Chitta, Michael Beetz, and Lydia E. Kavraki. Real-time perception guided motion planning for a personal robot. In International Conference on Intelligent Robots and Systems, St. Louis, USA, October 2009.
[84] U. Saranli, M. Buehler, and D.E. Koditschek. Rhex: A simple and highly mobile
hexapod robot. The International Journal of Robotics Research, 20(7):616, 2001.
[85] Bernhard Schölkopf and Alexander Smola. A short introduction to learning with kernels. In Advanced Lectures on Machine Learning, pages 41–64, 2003.
[86] B. Siciliano, L. Sciavicco, and L. Villani. Robotics: modelling, planning and control. Springer Verlag, 2009.
[87] J.A. Smith and I. Poulakakis. Rotary gallop in the untethered quadrupedal robot Scout II. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), volume 3, 2004.
[88] M. Sniedovich. Dijkstra's algorithm revisited: the dynamic programming connexion.
Control and cybernetics, 35(3):599, 2006.
[89] F. Stulp, J. Buchli, E. Theodorou, and S. Schaal. Reinforcement learning of full-body humanoid motor skills. In 10th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 405–410, 2010.
[90] F. Stulp and O. Sigaud. Path integral policy improvement with covariance matrix
adaptation. In International Conference on Machine Learning, 2012.
[91] Freek Stulp, Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. Learning
to grasp under uncertainty. In IEEE International Conference on Robotics and
Automation, 2011.
[92] E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning of motor skills in
high dimensions: a path integral approach. In IEEE International Conference on
Robotics and Automation, 2010.
[93] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, pages 3137–3181, 2010.
[94] J. Ting, A. D'Souza, and S. Schaal. Automatic outlier detection: A Bayesian approach. In IEEE International Conference on Robotics and Automation, 2007.
[95] Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pages 300–306. IEEE, 2005.
[96] M. Vukobratovic and B. Borovac. Zero-moment point – thirty five years of its life. International Journal of Humanoid Robotics, 1(1):157–173, 2004.
[97] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Fitness expectation maximization. In Parallel Problem Solving from Nature – PPSN X, pages 337–346, 2008.
[98] Q.Z. Ye. The signed Euclidean distance transform and its applications. In International Conference on Pattern Recognition, pages 495–499, 1988.
[99] L. Zhang, X. Huang, Y.J. Kim, and D. Manocha. D-Plan: Efficient collision-free path computation for part removal and disassembly. Journal of Computer-Aided Design and Applications, 5(6):774–786, 2008.
[100] Brian D. Ziebart, Andrew Maas, J. Andrew (Drew) Bagnell, and Anind Dey. Max-
imum Entropy Inverse Reinforcement Learning. In Proc. of AAAI, July 2008.
[101] M. Zucker, N. Ratliff, M. Stolle, J. Chestnutt, J.A. Bagnell, C.G. Atkeson, and J. Kuffner. Optimization and Learning for Rough Terrain Legged Locomotion. The International Journal of Robotics Research, 30(2):175–191, 2011.
[102] Matt Zucker and J. Andrew Bagnell. Reinforcement planning: RL for optimal planners. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1850–1855. IEEE, 2012.
[103] Matthew Zucker, J. Andrew (Drew) Bagnell, Chris Atkeson, and James Kuffner. An optimization approach to rough terrain locomotion. In IEEE Conference on Robotics and Automation, May 2010.