MACHINE LEARNING OF MOTOR SKILLS FOR ROBOTICS

by

Jan Reinhard Peters

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2007

Copyright 2007 Jan Reinhard Peters
Dedication

For my grandfather, whose example taught discipline and integrity,
for my father, who raised me to think and to be a scientist,
for my mother, who gave me strength, love and creativity to take along my way,
for my girlfriend, who makes me happy anew every day,
for the best sister I could imagine,
and for my far too kind scientific "big brother",
in gratitude for the last five years.
Acknowledgments

First of all, I have to thank the wonderful people at the University of Southern California. This thesis would never have been possible without the initiation, continuing encouragement, coordination, supervision and understanding help of Stefan Schaal. He is a great 'sensei' and has endured my research roller-coaster ride from my master's thesis to today. Another thanks goes to Firdaus Udwadia and Sethu Vijayakumar, from whom I learned a lot about analytical dynamics and machine learning, respectively. I am also very grateful to the committee members, i.e., Stefan Schaal, Gaurav Sukhatme, Firdaus Udwadia and Chris Atkeson, for reading my thesis and participating in the defense. I am indebted to Mitsuo Kawato and Gordon Cheng, who made two stays in 2000 and 2003 at the Advanced Telecommunication Research Center ATR in Kyoto, Japan, possible. Without the first visit in 2000, I would have never met Stefan Schaal and my life would have taken a very different turn. A special thanks goes to Evangelos Theodorou and Jun Nakanishi, who have helped me a lot by reading this thesis. Finally, I have to thank all my friends and family for leading me to a career in science and for their continuing support. I thank my girlfriend Ayşenur Altingül for her understanding and love. Another thanks goes to my friends from the USC learning labs, i.e., Aaron d'Souza, Aude Billard, Auke Ijspeert, Dimitris Pongas, Michael Mistry, Nerses Ohanyan, Peyman Mohajerian, Rick Cory, Sethu Vijayakumar, Srideep Musuvathy, Vidhya Navalpakkam and many more, for all the advice, support, help, and... the good times and great parties!
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abbreviations
Abstract

Chapter 1: Introduction
1.1 Motivation
1.2 Objective and Approach
1.2.1 Essential Components
1.2.2 Resulting Approach
1.3 Major Contributions
1.3.1 General, Unifying Framework
1.3.2 Novel Learning Algorithms
1.3.3 Robot Application
1.4 Thesis Outline

Chapter 2: A Unifying Framework for Robot Control with Redundant Degrees of Freedom
2.1 Introduction
2.2 A Unifying Methodology for Robot Control
2.2.1 Formulating Robot Control Problems
2.2.2 Point-wise Optimal Control Framework
2.2.3 Necessary Conditions for Stability
2.2.4 Hierarchical Extension
2.3 Robot Control Laws
2.3.1 Joint-Space Trajectory Control
2.3.2 End-effector Trajectory Control
2.3.3 Controlling Constrained Manipulators: Impedance & Hybrid Control
2.4 Evaluations
2.5 Conclusion & Discussion
2.5.1 Contributions of the Chapter
2.5.2 Extension to Infinite Horizon Optimality
2.5.3 Future Work

Chapter 3: Learning Tracking Control in Operational Space
3.1 Introduction
3.1.1 Notation and Remarks
3.1.2 Operational Space Control as an Optimal Control Problem
3.1.3 Why should we learn Operational Space Control?
3.2 Learning Methods for Operational Space Control
3.2.1 Can Operational Space Control be learned?
3.2.2 Combining the Local Controllers and Ensuring Consistent Resolution of Redundancy
3.3 Reformulation as Reinforcement Learning Problem
3.3.1 Reward Transformation
3.3.2 EM Reinforcement Learning with Reward Transformation
3.3.3 Reinforcement Learning by Reward-Weighted Regression
3.4 Evaluations
3.5 Conclusion & Discussion
3.5.1 Contributions of this Chapter
3.5.2 Future Work: Using Inertia-based Metrics in Learning

Chapter 4: Policy Gradient Methods for Motor Primitives and Robotics
4.1 Introduction
4.1.1 General Assumptions and Problem Statement
4.1.2 Motor Primitive Policies
4.2 Policy Gradient Approaches for Parameterized Motor Primitives
4.2.1 Finite-difference Methods
4.2.2 Likelihood Ratio Methods and REINFORCE
4.3 'Vanilla' Policy Gradient Approaches
4.3.1 Policy Gradient Theorem and G(PO)MDP
4.3.2 Optimal Baselines
4.3.3 Compatible Function Approximation
4.4 Natural Actor-Critic
4.4.1 Motivation
4.4.2 Connection to the Compatible Function Approximation
4.4.3 Natural Actor-Critic Algorithms
4.5 Properties of Natural Actor-Critic
4.5.1 On the Covariance of Natural Policy Gradients
4.5.2 NAC's Relation to previous Algorithms
4.6 Experiments & Results
4.6.1 Comparing Policy Gradient Methods on Motor Primitives
4.6.2 Robot Application: Motor Primitive Learning for Baseball
4.7 Conclusion & Discussion
4.7.1 Contributions of this Chapter
4.7.2 Relation to EM-based Approaches
4.7.3 Future Work: Motor Primitive Iteration

Chapter 5: Conclusion
5.1 Summary of the Contributions
5.2 Discussion: The Next Steps for Skill Learning

Bibliography
References

Appendix A: Additional Derivations and Discussion
A.1 Steepest Descent with Respect to a Metric N
A.2 Proof of the Covariance Theorem
A.3 A Discussion of Kimura & Kobayashi's Algorithm
A.4 Derivations of the Two Forms of the eNAC
A.4.1 Derivation of the eNAC1
A.4.2 Derivation of the eNACn
A.5 Motor Primitive Equations
List of Tables

2.1 This table shows the root mean squared error results of the tracking achieved by the different control laws.
3.1 This table shows the complete learning algorithm for Operational Space Control. See text for detailed explanations.
4.1 General setup for policy gradient reinforcement learning.
4.2 Finite difference gradient estimator.
4.3 General likelihood ratio policy gradient estimator "Episodic REINFORCE" with an optimal baseline.
4.4 Specialized likelihood ratio policy gradient estimator "G(PO)MDP"/Policy Gradient with an optimal baseline.
4.5 Natural Actor-Critic Algorithm with LSTD-Q(λ).
4.6 Episodic Natural Actor-Critic.
4.7 Episodic Natural Actor-Critic with a Time-Variant Baseline.
List of Figures

1.1 This figure illustrates our general approach to motor skill learning by dividing it into a motor primitive and a motor control component. For the task execution, fast policy learning methods based on observable error need to be employed, while the task learning is based on slower episodic learning.
1.2 This figure illustrates the outline and relation of the thesis chapters.
2.1 In the presence of disturbances or non-zero initial conditions, stable task dynamics will not result in joint-space stability.
2.2 Sarcos Master Arm robot, as used for the evaluations in our experiments.
2.3 This figure shows the three end-effector trajectory controllers tracking a "figure eight (8)" pattern at 8 seconds per cycle. On the left is the x-z plane with the y-z plane on the right. All units are in meters.
2.4 The same three controllers tracking the same "figure eight (8)" pattern at a faster pace of 4 seconds per cycle. The labels and units remain the same as in Figure 2.3.
2.5 Joint space trajectories for the four major degrees of freedom, i.e., shoulder flexion-extension (SFE), shoulder adduction-abduction (SAA), humeral rotation (HR) and elbow flexion-extension (EBFE), are shown here. Joint angle units are in radians. The labels are identical to the ones in Figure 2.3.
3.1 When applied to the prismatic robot from Example 3.1 shown in (a), the three control laws for the metrics N = I (dash-dot red lines), N = M^{-1} (solid green), N = M^{-2} (dashed blue) result in (b) the same task-space tracking but (c,d) very different joint-space behavior. See Example 3.1 for more information.
3.2 This figure illustrates how (a) different data sets result in different solutions if each data point is treated with equal importance (the blue dash-dot line corresponds to the blue diamonds and the red dashed line to the red circles). If these data points are (b) weighted down using the Gaussian cost function (here indicated with the metric N = M^{-1} as solid thin black lines), the solutions of different data sets will consistently approximate the optimal solutions shown as the solid cyan line. While for the linear prismatic robot one could live with any solution in (a), different local solutions have to create a consistent global solution for nonlinear robots. The horizontal faintly dotted lines in (a) and (b) indicate contour lines of equal task-space acceleration.
3.3 A comparison of fixed and adaptive reward transformation for learning a linear policy π(u|s) = N(u | θ_1 s + θ_2, σ^2) under the transformed reward U(r(s,u)) = exp(−τ(q_1 u^2 + q_2 us + q_3 s^2)). The transformed reward is indicated by the dotted blue ellipses, the variance of the action distribution is indicated by the red thick ellipse, and the mean of the linear policy is shown by the red thick line. With τ being adaptive, significantly faster learning of the optimal policy is achieved. Step 0 shows the initial policy and initial transformed reward; Step 1 shows the initial policy with adapted transformed reward.
3.4 (a) Screen shot of the 3 DOF arm simulator, (b) near-ideal tracking performance for a planar figure-8 pattern for the 3 DOF arm, and (c) a comparison between the analytically obtained optimal control commands and the learned ones for one figure-8 cycle of the 3 DOF arm exhibits that a near-optimal policy is obtained.
3.5 (a) Anthropomorphic Sarcos Master Arm, used as simulated system and in progress of actual robot evaluations. (b) Tracking performance for a planar figure-8 pattern for the simulated Sarcos Master Arm.
4.1 The classical example of LQR can be used to illustrate why 'vanilla' policy gradients reduce the exploration to zero while natural policy gradients go for the optimal solution. The blue circle in (a) indicates equal distance with the standard Euclidean metric, while the one in (b) shows equal distance with respect to the Fisher information metric. The natural policy gradient corresponds to searching the steepest descent on an infinitesimally small Fisher ellipse (b), while the vanilla policy gradient corresponds to steepest descent on a unit circle.
4.2 This figure shows different experiments with motor task learning. In (a,b), we see how the learning system creates minimum motor command goal-achieving plans using both (a) splines and (b) motor primitives. For this problem, the natural actor-critic methods beat all other methods by several orders of magnitude. In (c,d), the plan has to achieve an intermediary goal. While the natural actor-critic methods still outperform previous methods, the gap is lower as the learning problem is easier. Note that these are double logarithmic plots.
4.3 This figure shows (a) the performance of a baseball swing task when using the motor primitives for learning. In (b), the learning system is initialized by imitation learning, in (c) it is initially failing at reproducing the motor behavior, and (d) after several hundred episodes it exhibits a nicely learned batting.
Abbreviations

Throughout this thesis, we use the following mathematical notation:

Notation                    Description
{x_1, x_2, ..., x_n}        set with elements x_1, x_2, ..., x_n
R                           real numbers
x = [x_1, x_2, ..., x_n]    a vector
x_i                         the i-th component of the vector x
A = [a_1, a_2, ..., a_m]    a matrix
a_i                         the i-th vector of a matrix A
a_ij                        the (i,j)-th component of the matrix A
A^{-1}                      matrix inverse
A^{+}                       matrix pseudo-inverse
A^{1/2}                     matrix root
∇_{θ_i} f                   derivative with respect to parameter θ_i
∇_θ f                       derivative with respect to parameters θ
∂f/∂q                       partial derivative
p(x)                        probability density of x
E{x}                        expectation of x
x̄ = ⟨x⟩                     sample average of x
The following symbols will be used in several chapters of this thesis:

Symbol                                  Description
t                                       time (continuous)
x                                       task space position, state of the task
ẋ, ẍ                                    task space velocity, acceleration
ẍ_ref                                   reference acceleration
q, q̇, q̈                                 joint position, velocity, acceleration
q_d, q̇_d, q̈_d                           desired joint position, velocity, acceleration
u                                       motor command, action
x_{1:H}                                 series of states x_i with i ∈ {1, 2, ..., H}
u_{1:H}                                 series of actions u_i with i ∈ {1, 2, ..., H}
τ = [x_{1:H}, u_{1:H}]                  rollout, trajectory, sample path, history
M(q)                                    inertia matrix
F(q,q̇), F(q,q̇,t)                        internal forces
G(q)                                    gravity
C(q,q̇)                                  Coriolis and centripetal forces
ε(q,q̇)                                  unmodeled nonlinearities
h(q,q̇,t) = 0                            constraint task description
A(x,ẋ)ẍ = b(x,ẋ), Aẍ = b                differential constraint task description
N                                       arbitrary metric
K_P = [κ_ij], K_D = [δ_ij]              gains of a PD control law
π(u|x)                                  control policy
r(x,u)                                  reward
U(r)                                    utility transformation of a reward
J(θ)                                    expected return
Q^π(x,u), V^π(x)                        value functions of policy π
d^π(x)                                  state distribution under policy π
F_θ                                     Fisher information matrix
d_KL(θ′,θ) = d_KL(p_θ′(τ) || p_θ(τ))    Kullback-Leibler divergence
∇_θ J                                   'vanilla' policy gradient
∇̃_θ J                                   natural policy gradient
Abstract

Autonomous robots that can assist humans in situations of daily life have been a long-standing vision of robotics, artificial intelligence, and cognitive sciences. A first step towards this goal is to create robots that can accomplish a multitude of different tasks, triggered by environmental context or higher-level instruction. Early approaches to this goal during the heyday of artificial intelligence research in the late 1980s, however, made it clear that an approach purely based on reasoning and human insights would not be able to model all the perceptuomotor tasks that a robot should fulfill. Instead, new hope was put in the growing wake of machine learning that promised fully adaptive control algorithms which learn both by observation and trial-and-error. However, to date, learning techniques have yet to fulfill this promise as only a few methods manage to scale into the high-dimensional domains of manipulator robotics, or even the new upcoming trend of humanoid robotics, and usually scaling was only achieved in precisely pre-structured domains. In this thesis, we investigate the ingredients for a general approach to motor skill learning in order to get one step closer towards human-like performance. For doing so, we study two major components for such an approach, i.e., firstly, a theoretically well-founded general approach to representing the required control structures for task representation and execution and, secondly, appropriate learning algorithms which can be applied in this setting.

As a theoretical foundation, we first study a general framework to generate control laws for real robots with a particular focus on skills represented as dynamical systems in differential constraint form. We present a point-wise optimal control framework resulting from a generalization of Gauss' principle and show how various well-known robot control laws can be derived by modifying the metric of the employed cost function. The framework has been successfully applied to task space tracking control for holonomic systems for several different metrics on the anthropomorphic SARCOS Master Arm.

In order to overcome the limiting requirement of accurate robot models, we then employ learning methods to find learning controllers for task space control. However, when learning to execute a redundant control problem, we face the general problem of the non-convexity of the solution space, which can force the robot to steer into physically impossible configurations if supervised learning methods are employed without further consideration. This problem can be resolved using two major insights, i.e., the learning problem can be treated as locally convex and the cost function of the analytical framework can be used to ensure global consistency. Thus, we derive an immediate reinforcement learning algorithm from the expectation-maximization point of view which leads to a reward-weighted regression technique. This method can be used both for operational space control as well as for general immediate reward reinforcement learning problems. We demonstrate the feasibility of the resulting framework on the problem of redundant end-effector tracking for both a simulated 3 degrees of freedom robot arm as well as for a simulated anthropomorphic SARCOS Master Arm.

While learning to execute tasks in task space is an essential component of a general framework for motor skill learning, learning the actual task is of even higher importance, particularly as this issue is more frequently beyond the abilities of analytical approaches than execution. We focus on the learning of elemental tasks which can serve as the "building blocks of movement generation", called motor primitives. Motor primitives are parameterized task representations based on splines or nonlinear differential equations with desired attractor properties. While imitation learning of parameterized motor primitives is a relatively well-understood problem, the self-improvement by interaction of the system with the environment remains a challenging problem, tackled in the fourth chapter of this thesis. For pursuing this goal, we highlight the difficulties with current reinforcement learning methods, and outline both established and novel algorithms for the gradient-based improvement of parameterized policies. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic, outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm.

In conclusion, in this thesis, we have contributed a general framework for analytically computing robot control laws which can be used for deriving various previous control approaches and serves as foundation as well as inspiration for our learning algorithms. We have introduced two classes of novel reinforcement learning methods, i.e., the Natural Actor-Critic and the Reward-Weighted Regression algorithm. These algorithms have been used in order to replace the analytical components of the theoretical framework by learned representations. Evaluations have been performed on both simulated and real robot arms.
Chapter 1

Introduction

Knowledge and scientific insight are the joy
and the entitlement for the existence of all humanity.
Alexander von Humboldt (Prussian naturalist, 1769-1859)
1.1 Motivation

Despite an increasing number of motor skills exhibited by manipulator and humanoid robots, the general approach to the generation of such motor behaviors has changed little over the last decades (Tsai, 1999; Craig, 2005; Sciavicco & Siciliano, 2007; De Wit, Siciliano, & Bastin, 1996). The roboticist models the task as accurately as possible and uses human understanding of the required motor skills in order to create the desired robot behavior as well as to eliminate all uncertainties of the environment. In most cases, such a process boils down to recording a desired trajectory in a pre-structured environment with precisely placed objects. If inaccuracies remain, the engineer creates exceptions using human understanding of the task. In order to be controllable, the robot tracking the trajectory is usually a heavy structure with non-backdrivable, noncompliant joints, resulting in a low payload-to-weight ratio, low energy efficiency and dangers for its environment (Hirzinger et al., 2002; Sciavicco & Siciliano, 2007). While such highly engineered approaches are feasible in well-structured industrial or research environments, it is obvious that if robots should ever leave factory floors and research environments, we will need to reduce or eliminate the complete reliance on hand-crafted models of the environment and the robots exhibited to date. Instead, we need a general approach which allows us to use compliant robots designed for interaction with less structured and uncertain environments in order to reach domains outside industry. Such an approach cannot solely rely on human understanding of the task but instead has to be acquired and adapted from data generated both by human demonstrations of the skill as well as by trial and error of the robot.
The tremendous progress in machine learning over the last decades offers us the promise of less human-driven approaches to motor skill acquisition. However, despite offering the most general way of thinking about data-driven acquisition of motor skills, generic machine learning techniques, which do not rely on an understanding of motor systems, often do not scale into the domain of manipulator or humanoid robotics due to the high domain dimensionality. Therefore, instead of attempting an unstructured, monolithic machine learning approach to motor skill acquisition, we need to develop approaches suitable for this particular domain with the inherent problems of task representation, learning and execution addressed separately in a coherent framework. Such a general architecture needs to be developed from the perspective of analytical motor control, employing a combination of imitation, reinforcement and model learning in order to cope with the complexities involved in motor skill learning. The advantage of such a concerted approach is that it allows the separation of the main problems of motor skill acquisition, refinement and control. Instead of either having an unstructured, monolithic machine learning approach or creating hand-crafted approaches with pre-specified trajectories, we are capable of acquiring skills from demonstrations and refining them using trial and error. The creation and improvement of such skills takes place through a combination of imitation and reinforcement learning. The acquired skills are represented as policies and can in principle include specifications ranging from positions, velocities and accelerations to applied forces, stiffnesses, etc. When using learning-based approaches for control instead of analytical models of the robot's dynamics and kinematics, we can often achieve accurate control without needing to model the complete system by hand. Furthermore, robots no longer need to be built with the focus on being straightforward to model but can be chosen to fulfill the task's requirements in terms of compliance with the environment, energy efficiency, and other factors.
1.2 Objective and Approach

The principal objective of this thesis is to move closer towards a general learning architecture for motor skills, i.e., to tackle the question

"How can we find a general framework for representing, learning and executing motor skills for robotics?"

As can be observed from this question, the major goal of this thesis requires three building blocks, i.e., (i) appropriate representations for movements, (ii) learning algorithms which can be applied to these representations and (iii) a transformation which allows the execution of the kinematic plans in the respective task space on robots. These essential components will be discussed in detail in Section 1.2.1, while the resulting idea for a general approach is discussed in Section 1.2.2.
1.2.1 Essential Components

In order to make a step towards the objective of finding a general framework for motor skills, we have to address our basic philosophy towards the three essential components, i.e., representation, learning and execution. In this section, we briefly outline which fundamental concepts we will be building on in this thesis and how the different topics relate to each other.

Representation. For the representation of motor skills, we can rely on the insight that humans, while being capable of performing a large variety of complicated movements, restrict themselves to a smaller number of primitive motions (Schaal, Ijspeert, & Billard, 2004). As suggested by Ijspeert et al. (2002, 2003), such primitive movements can be represented by nonlinear dynamic systems. We can represent these in the differential constraint form given by
A_{θ_i}(x_i, ẋ_i, t) ẍ_i = b_{θ_i}(x_i, ẋ_i, t),    (1.1)

where i ∈ N is the index of the motor primitive in a library of movements, θ_i ∈ R^L denotes the parameters of primitive i, t denotes time and x_i, ẋ_i, ẍ_i ∈ R^n denote positions, velocities and accelerations of the dynamic system, respectively.
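To make the form of Equation (1.1) concrete, the following toy sketch (not from the thesis; all function names and constants are illustrative assumptions) instantiates a single primitive as a simple point attractor with parameters θ = (α, β, goal) and integrates it with an Euler step; the primitives actually used in this work (Ijspeert et al., 2002, 2003) are richer nonlinear dynamical systems.

```python
import numpy as np

# Toy instance of a primitive in differential constraint form
# A_theta(x, xdot, t) xddot = b_theta(x, xdot, t): a point attractor
# with parameters theta = (alpha, beta, goal).  Purely illustrative.
def primitive(x, xdot, theta, t=0.0):
    alpha, beta, goal = theta
    A = np.eye(len(x))                        # A_theta(x, xdot, t)
    b = alpha * (beta * (goal - x) - xdot)    # b_theta(x, xdot, t)
    return A, b

# Euler integration of the primitive dynamics.
x, xdot = np.zeros(2), np.zeros(2)
theta = (25.0, 0.25, np.array([1.0, -0.5]))   # illustrative parameter values
dt = 0.01
for step in range(500):
    A, b = primitive(x, xdot, theta, t=step * dt)
    xddot = np.linalg.solve(A, b)             # solve A xddot = b
    xdot += dt * xddot
    x += dt * xdot                            # x converges to the goal
```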
Learning. Learning basic motor skills¹ is achieved by adapting the parameters θ_i of motor primitive i. The high dimensionality of our domain prohibits the exploration of the complete space of all admissible motor behaviors, rendering the application of machine learning techniques which require exhaustive exploration impossible. Instead, we have to rely on a combination of supervised and reinforcement learning in order to acquire motor skills, where supervised learning is used to obtain the initialization of the motor skill while reinforcement learning is used to improve it. Therefore, the acquisition of a novel motor task consists of two phases, i.e., the 'learning robot' first attempts to reproduce the skill acquired through supervised learning and then improves the skill from experience by trial and error, i.e., through reinforcement learning.

¹Learning by sequencing and parallelization of the motor primitives will be treated in future work.
Execution. The execution of motor skills adds another level of complexity. It requires that a mechanical system

u = M(q, q̇, t)q̈ + F(q, q̇, t),    (1.2)

with a mapping x_i = f_i(q, q̇, t), can be forced to execute each motor primitive A_i ẍ_i = b_i in order to fulfill the skill. The motor primitive can be viewed as a mechanical constraint acting upon the system, enforced through accurate computation of the required forces based on analytical models. However, in most cases it is very difficult to obtain accurate models of the mechanical system. Therefore, it can be more suitable to find a policy learning approach which replaces the control law based on the hand-crafted rigid body model. In this thesis, we will follow this approach, which forms the basis for understanding motor skill learning.
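As a minimal illustration of the model-based reading of Equation (1.2) (a sketch under assumed model functions, not the learned controllers developed later in the thesis), executing a primitive amounts to an inverse dynamics computation:

```python
import numpy as np

# Minimal model-based execution sketch for Equation (1.2): given a reference
# acceleration that realizes the primitive, the motor command follows from the
# (assumed known) rigid-body model.  M_fn and F_fn are hypothetical callables
# returning the inertia matrix M(q) and the internal forces F(q, qdot); the
# thesis instead pursues learned replacements for this hand-crafted model.
def execute(q, qdot, qddot_ref, M_fn, F_fn):
    return M_fn(q) @ qddot_ref + F_fn(q, qdot)   # u = M(q) qddot + F(q, qdot)
```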
1.2.2 Resulting Approach

As we have outlined during the discussion of our objective and its essential components, we require an appropriate general motor skill framework which allows us to separate the desired task-space movement generation (represented by the motor primitives) from movement control in the respective actuator space. We need to be able to understand this transformation from an analytical point of view and need to be able to relate it to general techniques in robotics. The resulting framework has to yield the execution component discussed before as well as a basic understanding for the resulting learning frameworks. When turning the analytical approach into a learning framework, we have to consider two components, i.e., we need to determine how to learn the desired behavior represented by the motor primitives as well as the execution represented by the transformation of the motor primitives into motor commands. We need to develop scalable learning algorithms which are both appropriate and efficient when used with the chosen general motor skill learning architecture. Furthermore, we require algorithms for fast immediate policy learning for movement control based on instantly observable rewards in order to enable the system to cope with real-time improvement during the execution. The learning of the task itself, on the other hand, requires the learning of policies which define the long-term evolution of the task, i.e., motor primitives, which are learned on a trial-by-trial basis with episodic improvement using a teacher for demonstration and reinforcement learning for self-improvement. The resulting general concept underlying this thesis is illustrated in Figure 1.1.

[Figure 1.1 (block diagram) with components: Motor Primitive Representation; Control Task Execution; Robot System; Perceptual Triggering; Episodic Learning from Long-Term Rewards; Real-time Learning from Immediate Rewards.]
Figure 1.1: This figure illustrates our general approach to motor skill learning by dividing it into a motor primitive and a motor control component. For the task execution, fast policy learning methods based on observable error need to be employed, while the task learning is based on slower episodic learning.
1.3 Major Contributions

In this thesis, we have made progress towards a general framework for representing, learning and executing motor skills while demonstrating the application of this work to physical and simulated robots. This progress includes contributions to three different but related lines of research, i.e.,

• Machine Learning,
• Motor Skill Representation and Control,
• Robot Application,

hence the name of this thesis is "Machine Learning for Motor Skills in Robotics".

The problem of motor skill learning for robotics is obviously not solved until humanoid robots are able to successfully perform most common motor tasks in people's homes. Nevertheless, we have made significant progress and several major contributions, which we will outline in the next paragraphs. Each of these novel results can be seen as a contribution to all three disciplines, and together they form a coherent and general basis for learning motor skills for robotics.
1.3.1 General, Unifying Framework

As mentioned before, our push for new motor skills largely relies upon a general unifying framework for a class of point-wise optimal control approaches. This framework allows us to derive previous and novel control approaches, e.g., for operational space control, and it serves as the basis for our learning framework.

Point-Wise Optimal Control Approach. For obtaining our generic framework, we start with the general insight by Udwadia (1996, 2003) that Nature enforces constraints on mechanical or structural systems in a similar way to a nonlinear controller, and thus we can obtain control laws in a similar fashion. We formalize this basic insight for robotics with a particular focus on systems with redundant degrees of freedom (DOFs) and rely on a general understanding of the control framework as a special class of point-wise optimal controllers derived from differential constraints. We discuss the resulting requirements for stability as well as an extension for hierarchical task prioritization. In a more limited case, the approach can also be justified from an infinite-horizon point of view.

Unification of previous approaches. The suggested approach offers a promising unification and simplification of nonlinear control law design for robots obeying rigid body dynamics equations, both with and without external constraints. We take this line of reasoning one step further and demonstrate that several well-known and also novel nonlinear robot control laws can be derived from this generic methodology. We show experimental verifications on a Sarcos Master Arm robot for some of the derived controllers.
1.3.2 Novel Learning Algorithms

As outlined before, we need two different styles of policy learning algorithms, i.e., methods for long-term reward optimization and methods for immediate improvement. Thus, we have developed two different classes of algorithms, i.e., the Natural Actor-Critic and the Reward-Weighted Regression.

Natural Actor-Critic. The Natural Actor-Critic algorithms (Peters, Vijayakumar, & Schaal, 2003a, 2005) are the fastest policy gradient methods to date and "the current method of choice" (Aberdeen, 2006). They rely on the insight that we need to maximize the reward while keeping the loss of experience constant, i.e., we need to measure the distance between our current path distribution and the new path distribution created by the policy. This distance can be measured by the Kullback-Leibler divergence and approximated using the Fisher information metric, resulting in a natural policy gradient approach. This natural policy gradient has a connection to the recently introduced compatible function approximation, which allows us to obtain the Natural Actor-Critic. Interestingly, earlier Actor-Critic approaches can be derived from this new approach. In application to motor primitive learning, we can demonstrate that the Natural Actor-Critic outperforms both finite-difference gradients as well as 'vanilla' policy gradient methods with optimal baselines.
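The core computation can be sketched as follows (an illustrative simplification for a linear-Gaussian policy, not the full Episodic Natural Actor-Critic derived in Chapter 4; the policy form, the constant baseline and all constants are assumptions made only for this example):

```python
import numpy as np

# Simplified natural policy gradient step for a linear-Gaussian policy
# pi(u|s) = N(u | theta^T s, sigma^2).  The Fisher matrix is estimated from
# per-episode sums of log-policy gradients, and the 'vanilla' gradient is
# preconditioned with its inverse -- the basic idea behind the Natural
# Actor-Critic; Chapter 4 derives the full algorithms.
def natural_gradient_step(theta, sigma, rollouts, alpha=0.1, reg=1e-6):
    psi, returns = [], []
    for states, actions, episode_return in rollouts:   # states: (T, d), actions: (T,)
        g_log = ((actions - states @ theta)[:, None] * states) / sigma**2
        psi.append(g_log.sum(axis=0))                  # sum_t grad_theta log pi
        returns.append(episode_return)
    psi, returns = np.asarray(psi), np.asarray(returns)

    baseline = returns.mean()                          # crude constant baseline
    g = (psi * (returns - baseline)[:, None]).mean(axis=0)      # 'vanilla' gradient
    F = (psi[:, :, None] * psi[:, None, :]).mean(axis=0)        # Fisher information estimate
    g_nat = np.linalg.solve(F + reg * np.eye(theta.size), g)    # natural gradient
    return theta + alpha * g_nat
```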
Reward-Weighted Regression. In contrast to the Natural Actor-Critic algorithms, the Reward-Weighted Regression algorithm (Peters & Schaal, 2006a, 2007a, 2007b) focuses on immediate reward improvement and employs an adaptation of the expectation-maximization (EM) algorithm for reinforcement learning instead of a gradient-based approach. The key difference here is that when using immediate rewards, we can learn from our actions directly, i.e., use them as training examples similar to a supervised learning problem with a higher priority for samples with a higher reward. Thus, this problem is a reward-weighted regression problem, i.e., it has a well-defined solution which can be obtained using established regression techniques. While we have given a more intuitive explanation of this algorithm, it corresponds to a properly derived maximization-maximization (MM) algorithm which maximizes a lower bound on the immediate reward similar to an EM algorithm. It can be shown to scale to high-dimensional domains and to learn a good policy without any imitation of a human teacher.
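In its simplest form, such an update can be sketched as a weighted least-squares problem (an illustrative sketch for a linear policy with given features; the feature matrix, the regularizer and the reward transformation are assumptions here, and the full derivation appears in Chapter 3):

```python
import numpy as np

# Reward-weighted regression sketch: explored actions U generated around the
# current policy are fitted back with each sample weighted by its (transformed,
# non-negative) reward, which is exactly a weighted least-squares problem.
# Phi are assumed policy features; reg is a small regularizer for stability.
def reward_weighted_regression(Phi, U, rewards, reg=1e-6):
    W = np.diag(rewards)                                 # higher reward -> higher weight
    A = Phi.T @ W @ Phi + reg * np.eye(Phi.shape[1])
    b = Phi.T @ W @ U
    return np.linalg.solve(A, b)                         # new policy parameters theta
```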
1.3.3 Robot Application

The general setup presented in this thesis can be applied in robotics using analytical models as well as the presented learning algorithms. The applications presented in this thesis include motor primitive learning and operational space control.

Operational Space Control. Operational space control is one of the more general frameworks for obtaining task-level control laws in robotics. In this thesis, we present two new contributions to this field. First, we show how both well-established as well as novel task-space tracking algorithms can be obtained by posing the problem as a task-space constraint and simply varying the metric in the point-wise cost function given by our general framework. We can show that all of the presented methods result in roughly the same quality of tracking performance but that the resulting joint-space trajectories differ significantly due to the shifted force distribution on the joints.

Our second contribution is a learning framework for operational space control which is a result of both the general point-wise optimal control framework and our insight into immediate reward reinforcement learning. While the general learning of operational space controllers with redundant degrees of freedom is non-convex and thus global supervised learning techniques cannot be applied straightforwardly, we can gain two insights, i.e., that the problem is locally convex and that our point-wise cost function allows us to ensure global consistency among the local solutions. We show that this can yield the analytically determined optimal solution for simulated three-degrees-of-freedom arms where we can sample the state space sufficiently. Similarly, we can show that the framework works for simulations of the anthropomorphic SARCOS Master Arm.
Motor Primitive Improvement. The main application of our long-term improvement framework is the optimization of motor primitives. Here, we follow essentially the previously outlined idea of acquiring an initial solution by supervised learning and then using reinforcement learning for motor primitive improvement. For this, we demonstrate both comparisons of motor primitive learning with different policy gradient methods, i.e., finite difference methods, 'vanilla' policy gradient methods and the Natural Actor-Critic, as well as an application of the most successful method, the Natural Actor-Critic, to T-Ball learning.

[Figure 1.2 (diagram) with chapter boxes: Chapter 1: Introduction; Chapter 2: A Unifying Framework for Robot Control with Redundant Degrees of Freedom; Chapter 3: Learning Tracking Control in Operational Space; Chapter 4: Policy Gradient Methods for Motor Primitives and Robotics; Chapter 5: Conclusion.]
Figure 1.2: This figure illustrates the outline and relation of the thesis chapters.
1.4 Thesis Outline

In the last section of this introduction, we will briefly outline the structure and connections between the remaining chapters. The relation between the thesis structure and the different chapters is given in Figure 1.2.

In Chapter 2, "A Unifying Framework for Robot Control with Redundant Degrees of Freedom", we present the general architecture which allows us to understand motor skill representation and execution from an analytical point of view and which serves as the basis of our motor skill learning endeavours.

In Chapter 3, "Learning Tracking Control in Operational Space", we focus on learning how to generate joint-space motor commands in order to achieve task-space tracking. While this chapter can be read separately from Chapter 2, it makes use of its main principle, i.e., the point-wise cost function. In this context, we first present the general EM-like algorithm for reinforcement learning and, subsequently, we derive the Reward-Weighted Regression algorithm from it.

While Chapters 2 and 3 assume tasks given in differential form, Chapter 4, "Policy Gradient Methods for Motor Primitives and Robotics", focuses on the learning and/or refinement of tasks represented by motor primitives. To achieve this, we compare reinforcement learning methods for parameterized policies with a strong bias towards high-dimensional systems such as robots and episodic task learning. The best performing method, the Episodic Natural Actor-Critic, is then shown to be efficient in task learning for an anthropomorphic robot.

Finally, in Chapter 5, "Conclusion", we give a conclusion and discuss future work.
Chapter 2

A Unifying Framework for Robot Control with Redundant Degrees of Freedom

Theory attracts practice as the magnet attracts iron.
Carl Friedrich Gauss (German mathematician, 1777-1855)

One of the most important reasons why learning methods are not yet dominating the field of robotics is the depth of human insight into the structure and control of mechanical systems. Pure machine learning approaches which neglect the insights from analytical robotics often end up with ill-posed problems (e.g., learning inverse kinematics by supervised learning can result in inconsistent solutions) or with inferior solutions (e.g., the linear appearance of velocities and accelerations in differential kinematic models is not exploited). Thus, this chapter attempts to make use of that knowledge in order to establish the theoretical basis for a machine learning framework for motor skills from an analytical dynamics point of view. The results presented in this chapter are a key ingredient for Chapter 3. They also serve as the basis for the operational space control learning framework, and they suggest a task learning approach based on motor primitive representations as discussed in Chapter 4.
Recently, Udwadia (2003) suggested deriving tracking controllers for mechanical systems with redundant degrees of freedom (DOFs) using a generalization of Gauss' principle of least constraint. This method allows reformulating control problems as a special class of point-wise optimal controllers. In this thesis chapter, we take this line of reasoning one step further and demonstrate that several well-known and also novel nonlinear robot control laws can be derived from this generic methodology. We show experimental verifications on a Sarcos Master Arm robot for some of the derived controllers. The suggested approach offers a promising unification and simplification of nonlinear control law design for robots obeying rigid body dynamics equations, both with and without external constraints. It can be shown to generalize to hierarchical task prioritization frameworks, and, in a more limited case, it can be justified from an infinite-horizon point of view.
2.1 Introduction

The literature on robot control with redundant degrees of freedom (DOFs) has introduced many different approaches for how to resolve kinematic redundancy in complex robots and how to combine redundancy resolution with appropriate control methods (e.g., see (Nakanishi, Cory, Mistry, Peters, & Schaal, 2005) for an overview). For instance, methods can be classified as operating on velocity-based, acceleration-based, or force-based principles, they can focus on local or global redundancy resolution strategies (Baillieul & Martin, 1990), and they can take a variety of approaches to including optimization criteria to maintain control in the null space of a movement task. When studying the different techniques, it sometimes appears that they were created from ingenious insights of the original researchers, but that there is also a missing thread that links different techniques to a common basic principle.
We follow Udwadia's (2003) suggestion to reinterpret tracking control in terms of constrained mechanics, which was inspired by results from analytical dynamics with constrained motion. The major insight is that tracking control can be reformulated in terms of constraints, which in turn allows the application of a generalization of Gauss' principle of least constraint¹ (Udwadia & Kalaba, 1996; Bruyninckx & Khatib, 2000) to derive a control law. This insight leads to a specialized point-wise optimal control framework for controlled mechanical systems. While it is not applicable to non-mechanical control problems with arbitrary cost functions, it yields an important class of optimal controllers, i.e., the class where the problem requires task achievement under minimal squared motor commands with respect to a specified metric. In this thesis chapter, we develop this line of thinking one step further and show that it can be used as a general way of deriving robot controllers for systems with redundant DOFs, which offers a useful unification of several approaches in the literature. We discuss the necessary conditions for the stability of the controller in task space if the system can be modeled with sufficient precision and the chosen metric is appropriate. For assuring stability in configuration space, further considerations may apply. To explore the feasibility of our framework and to demonstrate the effects of the weighting metric, we evaluate some of the derived controllers on an end-effector tracking task with an anthropomorphic robot arm.

¹Gauss' principle of least constraint (Udwadia & Kalaba, 1996) is a general axiom on the mechanics of constrained motions. It states that if a mechanical system is constrained by another mechanical structure, the resulting acceleration ẍ of the system will be such that it minimizes (ẍ − M^{-1}F)^T M (ẍ − M^{-1}F) while fulfilling the constraint. Here M denotes the inertia matrix and F denotes the internal forces.

This thesis chapter is organized as follows: firstly, an optimal control framework based on (Udwadia, 2003) is presented and analyzed. Secondly, we discuss different robot control problems in this framework, including joint and task space tracking, force and hybrid control. We show how both established and novel controllers can be derived in a unified way. Finally, we evaluate some of these controllers on a Sarcos Master Arm robot.
2.2 A Unifying Methodology for Robot Control

A variety of robot control problems can be motivated by the desire to achieve a task accurately while minimizing the squared motor commands, e.g., we intend to track a trajectory with minimum generated torques. Such problems can be formalized as a type of minimum effort control. In this section, we will show how the robot dynamics and the control problem can be brought into a general form which, subsequently, will allow us to compute the optimal control commands with respect to a desired metric. We will augment this framework such that we can ensure the necessary conditions for stability both in the joint space of the robot as well as in the task space of the problem.
2.2.1 Formulating Robot Control Problems

We assume the well-known rigid-body dynamics model of manipulator robotics with n degrees of freedom given by the equation

u = M(q)q̈ + C(q, q̇) + G(q),    (2.1)

where u ∈ R^n is the vector of motor commands (i.e., torques or forces), q, q̇, q̈ ∈ R^n are the vectors of joint positions, velocities and accelerations, respectively, M(q) ∈ R^{n×n} is the mass or inertia matrix, C(q, q̇) ∈ R^n denotes centripetal and Coriolis forces, and G(q) ∈ R^n denotes forces due to gravity (Yoshikawa, 1990; De Wit et al., 1996). At many points we will write the dynamics equations as M(q)q̈ = u(q, q̇) + F(q, q̇), where F(q, q̇) = −C(q, q̇) − G(q), for notational convenience. We assume that a sufficiently accurate model of our robot system is available.

A task for the robot is assumed to be described in the form of a constraint, i.e., it is given by a function

h(q, q̇, t) = 0,    (2.2)

where h ∈ R^k with an arbitrary dimensionality k. For example, if the robot is supposed to follow a desired trajectory q_des(t) ∈ R^n, we could formulate it by h(q, q̇, t) = q − q_des(t) = 0; this case is analyzed in detail in Section 2.3.1. The function h can be considered a task function in the sense of the framework proposed in (Samson, Borgne, & Espiau, 1991).

We consider only tasks where Equation (2.2) can be reformulated as

A(q, q̇, t)q̈ = b(q, q̇, t),    (2.3)

which can be achieved for most tasks by differentiation of Equation (2.2) with respect to time, assuming that h is sufficiently smooth. For example, our previous task, upon differentiation, becomes q̈ = q̈_des(t) so that A = I and b = q̈_des(t). An advantage of this task formulation is that non-holonomic constraints can be treated in the same general way.
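As a further illustration (an added example, not part of the original text at this point): for a hypothetical end-effector task h(q, t) = f(q) − x_des(t) = 0 with forward kinematics f and Jacobian J = ∂f/∂q, differentiating twice gives the form of Equation (2.3),

```latex
J(q)\,\ddot{q} + \dot{J}(q,\dot{q})\,\dot{q} - \ddot{x}_{des}(t) = 0
\quad\Longrightarrow\quad
A(q) = J(q), \qquad b(q,\dot{q},t) = \ddot{x}_{des}(t) - \dot{J}(q,\dot{q})\,\dot{q},
```

which is the setting treated in Section 2.3.2.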
In Section 2.3, we will give task descriptions first in the general form of Equation (2.2), and then derive the resulting controller, which is linear in the accelerations, as shown in Equation (2.3).
2.2.2 Point-wise Optimal Control Framework

Let us assume that we are given a robot model and a constraint formulation of the task as described in the previous section. In this case, we can characterize the desired properties of the framework as follows: first, the task has to be achieved perfectly, i.e., h(q, q̇, t) = 0, or equivalently, A(q, q̇, t)q̈ = b(q, q̇, t), holds at all times. Second, we intend to minimize the control command with respect to some given metric, i.e., J(t) = u^T N(t) u, at each instant of time, with positive semi-definite matrix N(t). The solution to this point-wise optimal control problem (Spong, Thorp, & Kleinwaks, 1984, 1986) can be derived from a generalization of Gauss' principle, as originally suggested in (Udwadia, 2003). It is also a generalization of the propositions in (Udwadia & Kalaba, 1996; Bruyninckx & Khatib, 2000). We formalize this idea in the following proposition.

Proposition 2.1 The class of controllers which minimizes

J(t) = u^T N(t) u,    (2.4)

for a mechanical system M(q)q̈ = u(q, q̇) + F(q, q̇) while fulfilling the task constraint

A(q, q̇, t)q̈ = b(q, q̇, t),    (2.5)

is given by

u = N^{-1/2} (A M^{-1} N^{-1/2})^+ (b − A M^{-1} F),    (2.6)

where D^+ denotes the pseudo-inverse for a general matrix D, and D^{1/2} denotes the symmetric, positive definite matrix for which D^{1/2} D^{1/2} = D.
Proof. By defining z = N^{1/2} u = N^{1/2}(Mq̈ − F), we obtain the accelerations q̈ = M^{-1} N^{-1/2}(z + N^{1/2} F). Since the task constraint Aq̈ = b has to be fulfilled, we have

A M^{-1} N^{-1/2} z = b − A M^{-1} F.    (2.7)

The vector z which minimizes J(t) = z^T z while fulfilling Equation (2.7) is given by z = (A M^{-1} N^{-1/2})^+ (b − A M^{-1} F), and as the motor command is given by u = N^{-1/2} z, the proposition holds.
The choice of the metric N plays a central role as it determines how the control effort is distributed over the joints. Often, we require a solution which has a kinematic interpretation; such a solution is usually given by a metric like N = (M·M)^{-1} = M^{-2}. In other cases, the control force u may be required to comply with the principle of virtual displacements by d'Alembert, for which the metric N = M^{-1} is more appropriate (Udwadia & Kalaba, 1996; Bruyninckx & Khatib, 2000). In practical cases, one would want to distribute the forces such that joints with stronger motors get a higher workload, which can also be achieved by a metric such as N = diag(τ̂_1^{-2}, τ̂_2^{-2}, ..., τ̂_n^{-2}), where the nominal torques τ̂_i are used for the appropriate distribution of the motor commands. In Section 2.3, we will see how the choice of N results in several different controllers.
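For concreteness, the control law of Proposition 2.1 can be evaluated numerically as follows (a sketch assuming M, F, A, b and N have already been computed at the current state; the helper names and the example metrics listed in the comments are illustrative):

```python
import numpy as np

# Point-wise optimal control command of Proposition 2.1:
#   u = N^{-1/2} (A M^{-1} N^{-1/2})^+ (b - A M^{-1} F).
# sqrtm_spd returns the symmetric positive-definite square root required for
# N^{1/2}; all quantities are assumed evaluated at the current (q, qdot, t).
def sqrtm_spd(N):
    w, V = np.linalg.eigh(N)                   # N = V diag(w) V^T with w > 0
    return V @ np.diag(np.sqrt(w)) @ V.T

def pointwise_optimal_u(M, F, A, b, N):
    N_inv_sqrt = np.linalg.inv(sqrtm_spd(N))
    M_inv = np.linalg.inv(M)
    D = A @ M_inv @ N_inv_sqrt
    return N_inv_sqrt @ np.linalg.pinv(D) @ (b - A @ M_inv @ F)

# Example metric choices discussed above (n = number of DOFs):
#   N = np.eye(n)                                # minimum squared torques
#   N = np.linalg.inv(M)                         # d'Alembert / Gauss' principle
#   N = np.linalg.inv(M) @ np.linalg.inv(M)      # 'kinematic' metric N = M^{-2}
```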
2.2.3 Necessary Conditions for Stability

Up to this point, this framework has been introduced in an idealized fashion, neglecting the possibility of imperfect initial conditions and measurement noise. Therefore, we modify our approach slightly to ensure stability. However, the stability of this framework, as well as that of most related approaches derivable from this framework, cannot be shown conclusively but only in special cases (Hsu, Hauser, & Sastry, 1989; Arimoto, 1996). Therefore, we can only outline the necessary conditions for stability, i.e., (i) the achievement of the task, which will be addressed through a modification of the framework in Section 2.2.3.1, and (ii) the prevention of undesired side-effects in joint space. The latter are a result of under-constrained tasks, i.e., tasks where some degrees of freedom of the robot are redundant for task achievement, which can cause undesired postures or even instability in joint space. This problem will be treated in Section 2.2.3.2.
2.2.3.1 Task Achievement

Up to this point, we have assumed that we always have perfect initial conditions, i.e., that the robot fulfills the constraint in Equation (2.3) at startup, and that we know the robot model perfectly. Here, we treat deviations from these assumptions as disturbances and add means of disturbance rejection to our framework. This disturbance rejection can be achieved by requiring that the desired task is an attractor, e.g., it could be prescribed as a dynamical system in the form

ḣ(q, q̇, t) = f_h(h, t),    (2.8)

where h = 0 is a globally asymptotically stable equilibrium point – or a locally asymptotically stable equilibrium point with a sufficiently large region of attraction. Note that h can be a function of robot variables (as in end-effector trajectory control in Section 2.3.2), but often it suffices to choose it as a function of the state vector (for example, for joint-space trajectory control as in Section 2.3.1). In the case of holonomic tasks (such as tracking control for a robot arm), i.e., h_i(q, t) = 0, i = 1, 2, ..., k, we can make use of a particularly simple form and turn this task into an attractor

ḧ_i + δ_i ḣ_i + κ_i h_i = 0,    (2.9)

where δ_i and κ_i are chosen appropriately. We will make use of this 'trick' in order to derive several algorithms. Obviously, different attractors with more desirable convergence properties (and/or larger basins of attraction) can be obtained by choosing f_h appropriately.
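For instance (an added worked example in the spirit of Section 2.3.2), applying the attractor of Equation (2.9) to the end-effector task h(q, t) = f(q) − x_des(t) with Jacobian J = ∂f/∂q turns the constraint Aq̈ = b into

```latex
A = J(q), \qquad
b = \ddot{x}_{des} + K_{D}\,(\dot{x}_{des} - \dot{x}) + K_{P}\,(x_{des} - x) - \dot{J}(q,\dot{q})\,\dot{q},
```

with gains K_P = [κ_ij] and K_D = [δ_ij] as listed in the symbol table.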
If we have such task-space stabilization, we can assure that the control law will achieve the task at least in a region near the desired trajectory. We demonstrate this issue in the following proposition.

Proposition 2.2 If we can assure the attractor property of the task h(q, q̇, t) = 0, or equivalently, A(q, q̇, t)q̈ = b(q, q̇, t), and if our robot model is accurate, the optimal controller of Equation (2.6) will achieve the task asymptotically.

Proof. When combining the robot dynamics equation with the controller, and after reordering the terms, we obtain

A M^{-1}(Mq̈ − F) = A M^{-1} N^{-1/2} (A M^{-1} N^{-1/2})^+ (b − A M^{-1} F).    (2.10)

Writing D = A M^{-1} N^{-1/2} and noting that D D^+ D = D (so that D D^+ acts as the identity on the range of D), we obtain Aq̈ = D D^+ b = b. The equality follows because the original trajectory defined by Aq̈ = b yields a consistent set of equations. If this equation describes an attractor, we will have asymptotically perfect task achievement.

[Figure 2.1 (illustration): a disturbance or initial velocity is compensated only in task space.]
Figure 2.1: In the presence of disturbances or non-zero initial conditions, stable task dynamics will not result in joint-space stability.

In some cases, such as joint trajectory tracking control discussed in Section 2.3.1, Proposition 2.2 will suffice for a stability proof in a Lyapunov sense (Yoshikawa, 1990; De Wit et al., 1996). However, for certain tasks such as end-effector tracking control discussed in Section 2.3.2, this is not the case and stability can only be assured in special cases (Hsu et al., 1989; Arimoto, 1996).
2.2.3.2 Prevention of Control Problems in Joint-Space

Even if stability in task space can be shown, it is not immediately clear whether the control law is stable in joint space. Example 2.1 illustrates a problematic situation where a redundant robot arm achieves an end-effector tracking task and is provably stable in task space, but nevertheless also provably unstable in joint space.
Example 2.1 Let us assume a simple prismatic robot with two horizontal, parallel links as illustrated in Figure 2.1. The mass matrix of this robot is a constant given by M = diag(m_1, 0) + m_2·1, where 1 denotes a matrix having only ones as entries, and the additional forces are F = C + G = 0. Let us assume the task is to move the end-effector x = q_1 + q_2 along a desired position x_des, i.e., the task can be specified by A = [1, 1] and b = ẍ_des + δ(ẋ_des − ẋ) + κ(x_des − x) after double differentiation and task stabilization. While the task dynamics are obviously stable (which can be verified using the constant eigenvalues of the system), the initial condition q_1(t_0) = x_des(t_0) − q_2(t_0) would result in both q_i(t)'s diverging into opposite directions for any non-zero initial velocities or in the presence of disturbances for arbitrary initial conditions. The reason for this behavior is obvious: the effort of stabilizing in joint space is not task relevant and would increase the cost.
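The divergence can be reproduced in a few lines of simulation (an illustrative sketch with assumed masses, gains and initial conditions, not an experiment from the thesis):

```python
import numpy as np

# Simulation sketch of Example 2.1: the end-effector x = q1 + q2 tracks x_des,
# yet an initial null-space velocity makes the two joints drift apart without
# bound, since stabilizing them is not task relevant under the point-wise cost.
m1, m2 = 1.0, 1.0
M = np.diag([m1, 0.0]) + m2 * np.ones((2, 2))           # constant mass matrix
A = np.array([[1.0, 1.0]])                              # task x = q1 + q2
delta, kappa = 10.0, 25.0                               # task stabilization gains
x_des = 0.5                                             # constant desired position

q = np.array([0.2, 0.3])                                # q1 + q2 = x_des at t0
qdot = np.array([1.0, -1.0])                            # non-zero null-space velocity
dt, M_inv = 0.001, np.linalg.inv(M)
for _ in range(5000):
    x, xdot = q.sum(), qdot.sum()
    b = np.array([delta * (0.0 - xdot) + kappa * (x_des - x)])  # xddot_des = xdot_des = 0
    u = np.linalg.pinv(A @ M_inv) @ b                   # Proposition 2.1 with N = I, F = 0
    qddot = M_inv @ u
    qdot += dt * qddot
    q += dt * qdot
# After 5 s: x stays at x_des while q1 and q2 have drifted in opposite directions.
```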
While this example is similar to problems with non-minimum phase nonlinear control systems (Isidori, 1995), the problems encountered are not the failure of the task controller, but rather due to internal dynamics, e.g., hitting of joint limits. From this example, we see that the basic general framework of constrained mechanics does not always suffice to derive useful control laws, and that it has to be augmented to incorporate joint stabilization for the robot without affecting the task achievement. One possibility is to introduce a joint-space motor command u_1 as an additional component of the motor command u, i.e.,

u = u_1 + u_2(u_1),    (2.11)
where the first component u_1 denotes an arbitrary joint-space motor command for stabilization, while the second component u_2(u_1) denotes the task-space motor command generated with the previously explained equations. The task-space component depends on the joint-space component as it has to compensate for it in the range of the task space. We can show that task achievement Aq̈ = b by the controller is not affected by the choice of the joint-space control law u_1.
Proposition 2.3 For any chosen joint stabilizing control law u_1 = f(q), the resulting task space control law u_2(u_1) ensures that the joint-space stabilization acts in the null-space of the task achievement.
Proof. When determining u_2, we consider u_1 to be part of our additional forces in the rigid body dynamics, i.e., we have F̃ = F + u_1. We obtain u_2 = N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} F̃) using Proposition 2.1. By reordering the complete control law u = u_1 + u_2(u_1), we obtain

    u = u_1 + N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} (F + u_1)),
      = N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} F)
        + [I - N^{-1/2} (A M^{-1} N^{-1/2})^{+} A M^{-1}] u_1,
      = N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} F)    (2.12)
        + N^{-1/2} [I - (A M^{-1} N^{-1/2})^{+} (A M^{-1} N^{-1/2})] N^{1/2} u_1.

The task space is defined by N^{-1/2} (A M^{-1} N^{-1/2})^{+}, and the projection matrix given by N^{-1/2} [I - (A M^{-1} N^{-1/2})^{+} (A M^{-1} N^{-1/2})] ensures that the joint-space control law and the task space control law are N-orthogonal, i.e., the task accomplishment is independent of the joint-space stabilization.
Although the task is still achieved, the optimal control problem is affected by the restructuring of our control law. While we originally minimized J(t) = u^T N(t) u, we now have a modified cost function

    J̃(t) = u_2^T N(t) u_2 = (u - u_1)^T N(t) (u - u_1),    (2.13)

which is equivalent to stating that the complete control law u should be as close to the joint-space control law u_1 as possible under task achievement.
This reformulation can have significant advantages if used appropriately. For example, in a variety of applications – such as using the robot as a haptic interface – a compensation of the robot's gravitational, Coriolis and centripetal forces in joint space can be useful. Such a compensation can only be derived when making use of the modified control law. In this case, we set u_1 = -F = C + G, which allows us to obtain

    u_2 = N^{-1/2} (A M^{-1} N^{-1/2})^{+} b,    (2.14)

which does not contain these forces, and we would have a complete control law of u = C + G + N^{-1/2} (A M^{-1} N^{-1/2})^{+} b.
2.2.4 Hierarchical Extension

In complex high-dimensional systems, we can often have a large number of tasks A_1 q̈ = b_1, A_2 q̈ = b_2, ..., A_n q̈ = b_n that have to be accomplished in parallel. These tasks often partially conflict, e.g., when the number of tasks exceeds the number of degrees of freedom or some of these tasks cannot be achieved in combination with each other. Therefore, the combination of these tasks into a single large task is not always practical and, instead, the tasks need prioritization, e.g., the higher the number of the task, the higher its priority. Task-prioritized control solutions have been discussed in the literature (Nakamura, Hanafusa, & Yoshikawa, 1987; Hollerbach & Suh, 1987; Maciejewski & Klein, 1985; Hanafusa, Yoshikawa, & Nakamura, 1981; Yamane & Nakamura, 2003; Sentis & Khatib, 2004; Siciliano & Slotine, 1991; Khatib, Sentis, Park, & Warren, 2004; Sentis & Khatib, 2005). Most previous approaches were kinematic and discussed only a small, fixed number of tasks; to our knowledge, (Sentis & Khatib, 2004, 2005) were among the first to discuss arbitrary task numbers and dynamical decoupling, i.e., a different metric from our point of view. The proposed framework allows the generalization for arbitrary metrics and more general problems as will be shown in Proposition 2.4. The prioritized motor command is given by

    u = u_1 + u_2(u_1) + u_3(u_1 + u_2) + ... + u_n(u_1 + ... + u_{n-1}),    (2.15)
where u_n(u_1 + ... + u_{n-1}) is the highest-priority control law as a function of the lower level u_1, ..., u_{n-1} and cancels out all influences of u_1 + ... + u_{n-1} which prohibit the execution of its task. The motor commands for each degree of freedom can be given by the following proposition:

Proposition 2.4 A set of hierarchically prioritized constraints A_i q̈ = b_i, where the priority is represented by i = 1, 2, ..., n (here, a higher number i denotes a higher priority), can be controlled by

    u = u_1 + Σ_{i=2}^{n} u_i( Σ_{k=1}^{i-1} u_k ),    (2.16)

where u_i(u_Σ) = N^{-1/2} (A_i M^{-1} N^{-1/2})^{+} (b_i - A_i M^{-1} (F + u_Σ)). For any k < i, the lower-priority control law u_k acts in the null-space of the higher-priority control law u_i, and any higher-priority control law u_i cancels all parts of the lower-priority control law u_k which conflict with its task achievement.
Proof. We first simply create the control laws u_1 and u_2(u_1) as described before and then make use of Proposition 2.3, which proves that this approach is correct for n = 2. Let us assume now that it is true for n = m. In this case, we can consider ũ_1 = u_1 + u_2 + ... + u_m our joint-space control law and ũ_2 = u_{m+1} the task-space control law. If we now make use of Proposition 2.3 again, we realize that ũ_1 = u_1 + u_2 + ... + u_m acts in the null-space of ũ_2 = u_{m+1} and that all components of u_1, u_2, ..., u_m which conflict with u_{m+1} will be canceled out. Therefore, the proposition also remains true for n = m + 1. This proves the proposition by induction.
From the viewpoint of optimization, the control laws obtained in Proposition 2.4 have a straightforward interpretation like the combination of joint and task-space control laws: each subsequent control law is chosen so that the control effort deviates minimally from the effort created from the previous control laws.
Example 2.2 Robot locomotion is a straightforward example for such an approach. Traditionally, all tasks are often meshed into one big task (Pratt & Pratt, 1998). However, the most essential task is the balancing of the robot to prevent accidents; it can, for instance, be achieved by a balancing task A_3 q̈ = b_3 similar to a spring-damper system pulling the system to an upright position. Additionally, the center of the torso should follow a desired trajectory – unless the desired path would make the robot fall. This gait generating task would be given by A_2 q̈ = b_2. Additionally, we want to have joint-space stability as the unconstrained degrees of freedom such as the arms might otherwise move all the time. The joint-space stabilization can be expressed as a constraint A_1 q̈ = b_1 pulling the robot towards a rest posture. The combined motor command is now given by u = u_1 + u_2(u_1) + u_3(u_1 + u_2), with the single control laws obtained by u_i(u_Σ) = N^{-1/2} (A_i M^{-1} N^{-1/2})^{+} (b_i - A_i M^{-1} (F + u_Σ)) with i = 1, 2, 3.

Ideas similar to Example 2.2 have been explored in (Yamane & Nakamura, 2003; Sentis & Khatib, 2004; Khatib et al., 2004; Sentis & Khatib, 2005) and we are currently working on applying this framework to locomotion similar to Example 2.2.
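The prioritized combination of Proposition 2.4 and Example 2.2 can be sketched in a few lines of Python/NumPy. The sketch below is ours, fixes the metric to N = I purely so that the weighted pseudo-inverses reduce to ordinary ones, and assumes the task matrices A_i and attractor terms b_i have already been constructed.

    import numpy as np

    def prioritized_control(M, F, tasks):
        """Prioritized control of Proposition 2.4, sketched for the metric N = I.

        tasks : list of (A_i, b_i) ordered from lowest to highest priority;
                for Example 2.2 this would be [(A_1, b_1), (A_2, b_2), (A_3, b_3)],
                i.e., rest posture, gait generation, balancing.
        F     : additional forces with the sign convention M qdd = u + F.
        """
        Minv = np.linalg.inv(M)
        u = np.zeros(M.shape[0])
        for A_i, b_i in tasks:
            # u_i(u_sum) = (A_i M^{-1})^+ (b_i - A_i M^{-1} (F + u_sum)) for N = I
            u = u + np.linalg.pinv(A_i @ Minv) @ (b_i - A_i @ Minv @ (F + u))
        return u

Because each new component is computed with the sum of all lower-priority commands inside the parenthesis, it automatically cancels whatever part of them conflicts with its own task, exactly as stated in Proposition 2.4.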
2.3 Robot Control Laws

The previously described framework offers a variety of applications in robotics – we will only focus on some of the most important ones in this thesis chapter. Most of the controllers which we will derive are known from the literature, but often from very different, and sometimes convoluted, building principles. In this section, we show how a large variety of control laws for different situations can be derived in a simple and straightforward way by using the unifying framework that has been developed up to this point. We derive control laws for joint-space trajectory control for both fully actuated and overactuated "muscle-like" robot systems from our framework. We also discuss task-space tracking control systems, and show that most well-known inverse kinematics controllers are applications of the same principle. Additionally, we will discuss how the control of constrained manipulators through impedance and hybrid control can be easily handled within our framework.
2.3.1 Joint-Space Trajectory Control

The first control problem we address is joint-space trajectory control. We consider two different situations: (a) we control a fully actuated robot arm in joint-space, and (b) we control an overactuated arm. Case (b) could, for example, have agonist-antagonist muscles as actuators similar to a human arm.^2

^2 An open topic of interest is to handle underactuated control systems. This will be part of future work.
2.3.1.1 Fully Actuated Robot

The first case which we consider is the one of a robot arm which is actuated at every degree of freedom. We have the trajectory as constraint with h(q, t) = q(t) - q_d(t) = 0. We turn this constraint into an attractor constraint using the idea in Section 2.2.3.1, yielding

    (q̈ - q̈_d) + K_D (q̇ - q̇_d) + K_P (q - q_d) = 0,    (2.17)

where K_D are positive-definite damping gains, and K_P are positive-definite proportional gains. We can bring this constraint into the form A(q, q̇) q̈ = b(q, q̇) with

    A = I,    (2.18)
    b = q̈_d + K_D (q̇_d - q̇) + K_P (q_d - q).    (2.19)

Proposition 2.1 can be used to derive the controller. Using (M^{-1} N^{-1/2})^{+} = N^{1/2} M as both matrices are of full rank, we obtain

    u = u_1 + N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} (F + u_1)),
      = M (q̈_d + K_D (q̇_d - q̇) + K_P (q_d - q)) + C + G.    (2.20)

Note that – not surprisingly – all joint-space motor commands or virtual forces u_1 always disappear from the control law and that the chosen metric N is not relevant – the derived solution is unique and general. This equation is a well-known textbook control law, i.e., the Inverse Dynamics Control Law (Yoshikawa, 1990; De Wit et al., 1996).
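As a minimal sketch of Equation (2.20) (names and signature are ours; the rigid-body terms M, C, G are assumed to come from the robot model, and the gains are positive-definite matrices):

    import numpy as np

    def inverse_dynamics_control(M, C, G, q, qd, q_des, qd_des, qdd_des, KP, KD):
        """Joint-space trajectory controller of Equation (2.20):
        u = M (qdd_des + KD (qd_des - qd) + KP (q_des - q)) + C + G."""
        qdd_ref = qdd_des + KD @ (qd_des - qd) + KP @ (q_des - q)
        return M @ qdd_ref + C + G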
2.3.1.2 Overactuated Robots

Overactuated robots, as they can be found in biological systems, are inherently different from the previously discussed robots. For instance, these systems are actuated by several linear actuators, e.g., muscles that often act on the system in form of opposing pairs. The interactions of the actuators can be modeled using the dynamics equations of

    D u = M(q) q̈ + C(q, q̇) + G(q),    (2.21)

where D depends on the geometric arrangement of the actuators. In the simple model of a two degree-of-freedom robot with antagonistic muscle-like activation, it would be given by

    D = [ -l  +l   0   0
           0   0  -l  +l ],    (2.22)

where the size of the entries D_ij denotes the moment arm length l_i and the sign of D_ij denotes whether it is an agonist (D_ij > 0) or antagonist muscle (D_ij < 0). We can bring this equation into the standard form by multiplying it with D^{+}, which results in a modified system where M̃(q) = D^{+} M(q), and F̃(q, q̇) = -D^{+} C(q, q̇) - D^{+} G(q). If we express the desired trajectory as in the previous examples, we obtain the following controller

    u = M̃^{1/2} (A M̃^{-1/2})^{+} (b - A M̃^{-1} F̃),    (2.23)
      = D^{+} M (q̈_d + K_D (q̇_d - q̇) + K_P (q_d - q)) + D^{+} (C + G).    (2.24)

While immediately intuitive, it is noteworthy that this particular controller falls out of the presented framework in a natural way. It is straightforward to extend Proposition 2.1 to show that this is the constrained optimal solution to J(t) = u^T D^T N(t) D u at any instant of time.
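A sketch of Equation (2.24) follows (our names; the moment arm value in the example matrix is an arbitrary illustrative assumption, not a value from the text):

    import numpy as np

    def overactuated_control(D, M, C, G, q, qd, q_des, qd_des, qdd_des, KP, KD):
        """Sketch of Equation (2.24): actuator-space (muscle) command
        u = D^+ M (qdd_des + KD (qd_des - qd) + KP (q_des - q)) + D^+ (C + G)."""
        qdd_ref = qdd_des + KD @ (qd_des - qd) + KP @ (q_des - q)
        return np.linalg.pinv(D) @ (M @ qdd_ref + C + G)

    # Antagonistic pairing of Equation (2.22) with an assumed moment arm l.
    l = 0.02
    D = np.array([[-l, +l, 0.0, 0.0],
                  [0.0, 0.0, -l, +l]])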
2.3.2 End-effector Trajectory Control

While joint-space control of a trajectory q(t) is straightforward and the presented methodology appears to simply repeat earlier results from the literature – although derived from a different and unified perspective – the same cannot be said about end-effector control where the goal is to control the position x(t) of the end-effector. This problem is generically more difficult as the choice of the metric N determines the type of the resulting controller in an important way, and as the joint-space of the robot often has redundant degrees of freedom, resulting in problems as already presented in Example 2.1. In the following, we will show how to derive different approaches to end-effector control from the presented framework, which will yield both established as well as novel control laws.
The task description is given by the end-effector trajectory acting as a constraint with h(q, t) = f(q(t)) - x_d(t) = x(t) - x_d(t) = 0, where x = f(q) denotes the forward kinematics. We turn this constraint into an attractor constraint using the idea in Section 2.2.3.1, yielding

    (ẍ - ẍ_d) + K_D (ẋ - ẋ_d) + K_P (x - x_d) = 0,    (2.25)

where K_D are positive-definite damping gains, and K_P are positive-definite proportional gains. We make use of the differential forward kinematics, i.e.,

    ẋ = J(q) q̇,    (2.26)
    ẍ = J(q) q̈ + J̇(q) q̇.    (2.27)

These equations allow us to formulate the problem in form of constraints, i.e., we intend to fulfill

    ẍ_d + K_D (ẋ_d - ẋ) + K_P (x_d - x) = J q̈ + J̇ q̇,    (2.28)

and we can bring this equation into the form A(q, q̇) q̈ = b(q, q̇) with

    A(q, q̇) = J,    (2.29)
    b(q, q̇) = ẍ_d + K_D (ẋ_d - ẋ) + K_P (x_d - x) - J̇ q̇.    (2.30)

These equations determine our task constraints. As long as the robot is not redundant, J is invertible and, similar to joint-space control, we will have one unique control law. However, when J is not uniquely invertible, the resulting controller depends on the chosen metric and joint-space control law.
2.3.2.1 Separation of Kinematics and Dynamics Control

The choice of the metric N determines the nature of the controller. A metric of particular importance is N = M^{-2}, as this metric allows the decoupling of kinematics and dynamics control, as we will see in this section. Using this metric in Proposition 2.1, we obtain a control law

    u = u_1 + N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} (F + u_1)),
      = M J^{+} (ẍ_d + K_D (ẋ_d - ẋ) + K_P (x_d - x) - J̇ q̇)    (2.31)
        + M (I - J^{+} J) M^{-1} u_1 - M J^{+} J M^{-1} F.

If we choose the joint-space control law u_1 = u_0 - F, we obtain the control law

    u = M J^{+} (ẍ_d + K_D (ẋ_d - ẋ) + K_P (x_d - x) - J̇ q̇)    (2.32)
        + M (I - J^{+} J) M^{-1} u_0 + C + G.

This control law is the combination of a resolved-acceleration kinematic controller (Yoshikawa, 1990; Hsu et al., 1989) with a model-based controller and an additional null-space term. Often, M^{-1} u_0 is replaced by a desired acceleration term for the null-space stabilization. Similar controllers have been introduced in (Park, Chung, & Youm, 2002, 1995; Chung, Chung, & Y.Youm, 1993; K.C.Suh & Hollerbach, 1987). The null-space term can be eliminated by setting u_0 = 0; however, this can result in instabilities if there are redundant degrees of freedom. This controller will be evaluated in Section 2.4.
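A minimal sketch of Equation (2.32) (names ours; M, C, G, J and the product J̇q̇ are assumed to be provided by a rigid-body model of the robot):

    import numpy as np

    def resolved_acceleration_control(M, C, G, J, Jdot_qd, x, xdot,
                                      x_des, xd_des, xdd_des, u0, KP, KD):
        """Sketch of Equation (2.32), i.e., the metric N = M^{-2}:
        u = M J^+ (xdd_ref - Jdot qd) + M (I - J^+ J) M^{-1} u0 + C + G,
        with xdd_ref = xdd_des + KD (xd_des - xdot) + KP (x_des - x)."""
        xdd_ref = xdd_des + KD @ (xd_des - xdot) + KP @ (x_des - x)
        Jp = np.linalg.pinv(J)
        n = M.shape[0]
        return (M @ Jp @ (xdd_ref - Jdot_qd)
                + M @ (np.eye(n) - Jp @ J) @ np.linalg.solve(M, u0)
                + C + G)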
2.3.2.2 Dynamically Consistent Decoupling

As noted earlier, another important metric is N = M^{-1}, as it is consistent with the principle of d'Alembert, i.e., the resulting control force can be re-interpreted as mechanical structures (e.g., springs and dampers) attached to the end-effector; it is therefore called dynamically consistent. Again, we use Proposition 2.1, and by defining F̃ = F + u_1 we obtain the control law

    u = u_1 + N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} F̃),
      = u_1 + M^{1/2} (J M^{-1/2})^{T} (J M^{-1} J^{T})^{-1} (b - J M^{-1} F̃),
      = u_1 + J^{T} (J M^{-1} J^{T})^{-1} (b - J M^{-1} F̃),
      = J^{T} (J M^{-1} J^{T})^{-1} (ẍ_d + K_D (ẋ_d - ẋ) + K_P (x_d - x) - J̇(q) q̇
        + J M^{-1} (C + G)) + M (I - M^{-1} J^{T} (J M^{-1} J^{T})^{-1} J) M^{-1} u_1.

It turns out that this is another well-known control law suggested in (Khatib, 1987) with an additional null-space term. This control law is especially interesting as it has a clear physical interpretation (Udwadia & Kalaba, 1996; Bruyninckx & Khatib, 2000; Udwadia, 2003): the metric used is consistent with the principle of virtual work of d'Alembert. Similarly as before, we can compensate for Coriolis, centrifugal and gravitational forces in joint-space, i.e., setting u_1 = C + G + u_0. This yields a control law of

    u = J^{T} (J M^{-1} J^{T})^{-1} (ẍ_d + K_D (ẋ_d - ẋ) + K_P (x_d - x) - J̇(q) q̇)    (2.33)
        + C + G + M [I - M^{-1} J^{T} (J M^{-1} J^{T})^{-1} J] M^{-1} u_0.

The compensation of the forces C + G in joint-space is often desirable for this metric in order to have full control over the resolution of the redundancy, as gravity compensation purely in task space often results in postures that conflict with joint limits and other parts of the robot.
2.3.2.3 Further Metrics

Using constant parametric matrices as metric, e.g., the identity matrix N = I, allows us to create alternative, novel approaches. This metric could be of interest as it distributes the "load" created by the task evenly on the actuators. This metric results in a control law

    u = (J M^{-1})^{+} (ẍ_d + K_D (ẋ_d - ẋ) + K_P (x_d - x) - J̇(q) q̇    (2.34)
        + J M^{-1} (C + G)) + (I - (J M^{-1})^{+} J M^{-1}) u_1.

To our knowledge, this controller has not been presented in the literature.

Another practical idea would be to weight the joints depending on the maximal torques τ_max,i of each joint, e.g., using N = diag(τ_max,1^{-1}, ..., τ_max,n^{-1}).

These alternative metrics may be particularly interesting for practical applications where the user wants to have more control over the natural appearance of movement, and worry less about the exact theoretical properties – humanoid robotics, for instance, is one such application. In some cases, it also may not be possible to have accurate access to complex metrics like the inertia matrix, and simpler metrics will be more suitable.
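The three end-effector controllers differ only in the choice of N. The following sketch (ours; null-space commands set to zero for brevity, and model terms assumed given) makes that explicit by evaluating the closed forms of Equations (2.32)-(2.34) side by side:

    import numpy as np

    def end_effector_torques(M, C, G, J, Jdot_qd, xdd_ref, metric):
        """Task-space part of the controllers of Equations (2.32)-(2.34) with the
        null-space command set to zero, to expose how the metric N shapes the solution."""
        Minv = np.linalg.inv(M)
        b = xdd_ref - Jdot_qd
        if metric == "M^-2":   # resolved-acceleration kinematic controller, Eq. (2.32)
            return M @ np.linalg.pinv(J) @ b + C + G
        if metric == "M^-1":   # dynamically consistent controller, Eq. (2.33)
            return J.T @ np.linalg.solve(J @ Minv @ J.T, b) + C + G
        if metric == "I":      # identity metric controller, Eq. (2.34)
            return np.linalg.pinv(J @ Minv) @ (b + J @ Minv @ (C + G))
        raise ValueError("unknown metric: " + metric)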
2.3.3 Controlling Constrained Manipulators: Impedance & Hybrid Control

Contact with outside objects alters the robot's dynamics, i.e., a generalized contact force F_C ∈ R^6 acting on the end-effector changes the dynamics of the robot to

    u = M(q) q̈ + C(q, q̇) + G(q) + J^T F_C.    (2.35)

In this case, the interaction between the robot and the environment has to be controlled. This kind of control can be used both to make the interaction with the environment safe (e.g., in a manipulation task) and to use the robot to simulate a behavior (e.g., in a haptic display task). We will discuss impedance control and hybrid control as examples of the application of the proposed framework; however, further control ideas such as parallel control can be treated in this framework, too.
2.3.3.1 Impedance Control

In impedance control, we want the robot to simulate the behavior of a mechanical system such as

    M_d (ẍ_d - ẍ) + D_d (ẋ_d - ẋ) + P_d (x_d - x) = F_C,    (2.36)

where M_d ∈ R^{6×6} denotes the mass matrix of a desired simulated dynamical system, D_d ∈ R^{6×6} denotes the desired damping, P_d ∈ R^{6×6} denotes the gains towards the desired position, and F_C ∈ R^6 the forces that result from this particular dynamical behavior. Using Equation (2.27) from Section 2.3.2, we see that this approach can be brought into the standard form for tasks by

    M_d J q̈ = F_C - M_d ẍ_d - D_d (ẋ_d - J q̇) - P_d (x_d - f(q)) - M_d J̇ q̇.    (2.37)

Thus, we can infer the task description

    A = M_d J,    (2.38)
    b = F_C - M_d ẍ_d - D_d (J q̇ - ẋ_d) - P_d (f(q) - x_d) - M_d J̇ q̇,

and apply our framework for deriving the robot control law as shown before.
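A short sketch (ours) of assembling the impedance task of Equation (2.38); the resulting pair (A, b) can then be handed to a controller of the form given by Proposition 2.1, e.g., the task_space_control sketch shown earlier:

    import numpy as np

    def impedance_task(Md, Dd, Pd, J, Jdot_qd, x, xdot, x_des, xd_des, xdd_des, F_C):
        """Assemble the impedance task of Equation (2.38), A qdd = b, where x = f(q)
        is the current end-effector pose and F_C the measured contact force."""
        A = Md @ J
        b = (F_C - Md @ xdd_des - Dd @ (xdot - xd_des)
             - Pd @ (x - x_des) - Md @ Jdot_qd)
        return A, b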
Kinematic Separation of Simulated System and the Manipulator    As in end-effector tracking control, N = M^{-2} is a practical metric which basically separates the simulated dynamic system from the physical structure of the manipulator on a kinematic level. For simplicity, we make use of the joint-space control law u_1 = C + G + u_0 similar as before. This results in the control law

    u = u_1 + N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} (F + u_1)),
      = M (M_d J)^{+} (F_C - M_d ẍ_d - D_d (J q̇ - ẋ_d) - P_d (f(q) - x_d) - M_d J̇ q̇)
        + C + G + (I - M (M_d J)^{+} M_d J M^{-1}) u_0.    (2.39)
As (M_d J)^{+} = J^{T} M_d (M_d J J^{T} M_d)^{-1} = J^{+} M_d^{-1} since M_d is invertible, we can simplify this control law to become

    u = M J^{+} M_d^{-1} (F_C - M_d ẍ_d - D_d (J q̇ - ẋ_d) - P_d (f(q) - x_d))
        - M J^{+} J̇ q̇ + C + G + M (I - J^{+} J) M^{-1} u_0.    (2.40)
We note that ẍ_ref = M_d^{-1} (F_C - M_d ẍ_d - D_d (J q̇ - ẋ_d) - P_d (f(q) - x_d)) acts as a desired acceleration in task-space. This insight clarifies the previous remark about the separation of the simulated system and the actual physical system: we have one system which describes the interaction with the environment – and additionally we use a second, inverse-model type controller to execute the desired accelerations with our robot arm.
Dynamically Consistent Combination    As in end-effector control, N = M^{-1} is a practical metric which combines both the simulated and the physical dynamic systems employing Gauss' principle. For simplicity, we make use of the joint-space control law u_1 = C + G + u_0 as before. This approach results in the control law

    u = u_1 + N^{-1/2} (A M^{-1} N^{-1/2})^{+} (b - A M^{-1} (F + u_1)),
      = u_1 + J^{T} (J M^{-1} J^{T})^{-1} (b - A M^{-1} (F + u_1)),
      = M^{1/2} (M_d J M^{-1/2})^{+} (F_C - D_d (J q̇ - ẋ_d) - P_d (f(q) - x_d) - M_d J̇ q̇)
        + C + G + (I - M (M_d J)^{+} M_d J M^{-1}) u_0.    (2.41)
As (M_d J M^{-1/2})^{+} = M^{-1/2} J^{T} (J M^{-1} J^{T})^{-1} M_d^{-1} since M_d is invertible, we can simplify this control law into

    u = J^{T} (J M^{-1} J^{T})^{-1} M_d^{-1} (F_C - D_d (J q̇ - ẋ_d) - P_d (f(q) - x_d))
        - M J^{+} J̇ q̇ + C + G + (I - M J^{+} J M^{-1}) u_0.    (2.42)

We note that the main difference between this and the previous impedance control law is the location of the matrix M.
2.3.3.2 Hybrid Control

In hybrid control, we intend to control the desired position of the end-effector x_d and the desired contact force exerted by the end-effector F_d. Modern hybrid control approaches are essentially similar to our introduced framework (De Wit et al., 1996). Both are inspired by constrained motion and use this insight in order to achieve the desired task. In traditional hybrid control, a natural or artificial, idealized holonomic constraint φ(q, t) = 0 acts on our manipulator, and subsequently the direction of the forces is determined through the virtual work principle of d'Alembert. We can make significant contributions here as our framework is a generalization of Gauss' principle that allows us to handle even non-holonomic constraints φ(q, q̇, t) = 0 as long as they are given in the form

    A_φ(q, q̇) q̈ = b_φ(q, q̇).    (2.43)

A_φ, b_φ depend on the type of the constraint, e.g., for time-invariant, holonomic constraints φ(q) = 0, we would have A_φ(q, q̇) = J_φ and b_φ(q, q̇) = -J̇_φ q̇ with J_φ = ∂φ/∂q as in (De Wit et al., 1996). Additionally, we intend to exert the contact force F_d in the task; this can be achieved if we choose the joint-space control law

    u_1 = C + G + J_φ^T F_d.    (2.44)
From the previous discussion, this constraint is achieved by the control law

    u = u_1 + N^{-1/2} (A_φ M^{-1} N^{-1/2})^{+} (b_φ - A_φ M^{-1} (F + u_1)),    (2.45)
      = C + G + N^{-1/2} (A_φ M^{-1} N^{-1/2})^{+} b_φ    (2.46)
        + N^{-1/2} (I - (A_φ M^{-1} N^{-1/2})^{+} A_φ M^{-1} N^{-1/2}) N^{1/2} J_φ^T F_d.

Note that the exerted forces act in the null-space of the achieved tracking task; therefore both the constraint and the force can be set independently.
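A sketch of the hybrid controller of Equations (2.44)-(2.46), again fixing N = I for readability (names ours; A_φ = J_φ for a time-invariant holonomic constraint):

    import numpy as np

    def hybrid_control(M, C, G, J_phi, b_phi, F_d):
        """Sketch of Equations (2.44)-(2.46) for the metric N = I: follow the constraint
        J_phi qdd = b_phi while exerting the desired contact force F_d in its null-space."""
        A = J_phi @ np.linalg.inv(M)          # A_phi M^{-1}
        A_pinv = np.linalg.pinv(A)
        n = M.shape[0]
        u_track = A_pinv @ b_phi
        u_force = (np.eye(n) - A_pinv @ A) @ (J_phi.T @ F_d)
        return C + G + u_track + u_force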
2.4 Evaluations

The main contribution of this thesis chapter is the unifying methodology for deriving robot controllers. Each of the presented controllers is a well-founded control law which, from a theoretical point of view, would not require empirical evaluations, particularly as most of the control laws are already well-known from the literature and their stability properties have been explored before. Nevertheless, it is useful to highlight one component in the suggested framework, i.e., the impact of the metric N on the particular performance of a controller. For this purpose, we chose to evaluate the three end-effector controllers from Section 2.3.2: (i) the resolved-acceleration kinematic controller (with metric N = M^{-2}) in Equation (2.32), (ii) Khatib's operational space control law (N = M^{-1}) in Equation (2.33), and (iii) the identity metric control law (N = I) in Equation (2.34).
Figure 2.2: Sarcos Master Arm robot, as used for the evaluations in our experiments.
As an experimental platform, we used the Sarcos Dextrous Master Arm, a hydraulic manipulator with an anthropomorphic design shown in Figure 2.2. Its seven degrees of freedom mimic the major degrees of freedom of the human arm, i.e., there are three DOFs in the shoulder, one in the elbow and three in the wrist. The robot's end-effector was supposed to track a planar "figure-eight (8)" pattern in task space at two different speeds. In order to stabilize the null-space trajectories, we choose a PD control law in joint space which pulls the robot towards a fixed rest posture, q_rest, given by

    u_0 = M (K_P0 (q_rest - q) - K_D0 q̇).    (2.47)

Additionally we apply gravity, centrifugal and Coriolis force compensation, such that u_1 = u_0 + C + G. For consistency, all three controllers are assigned the same gains both for the task and joint space stabilization.
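A minimal sketch of this joint-space component (names ours; M, C, G assumed to come from the robot model):

    import numpy as np

    def null_space_stabilization(M, C, G, q, qdot, q_rest, KP0, KD0):
        """Joint-space component used in the evaluations: Equation (2.47) plus
        gravity/Coriolis compensation, u1 = u0 + C + G."""
        u0 = M @ (KP0 @ (q_rest - q) - KD0 @ qdot)
        return u0 + C + G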
Figure 2.3: This figure shows the three end-effector trajectory controllers tracking a "figure eight (8)" pattern at 8 seconds per cycle. On the left is the x-z plane with the y-z plane on the right. All units are in meters. (Legend: Desired, Res. Accel., Khatib, Identity.)
Figure 2.3 shows the end-point trajectories of the three controllers in a slow pattern of eight seconds per cycle "figure-eight (8)". Figure 2.4 shows a faster pace of four seconds per cycle. All three controllers have similar end-point trajectories and result in fairly accurate task achievement. Each one has an offset from the desired trajectory (thin black line), primarily due to the imperfect dynamics model of the robot. The root mean squared errors (RMS) between the actual and the desired trajectory in task-space for each of the controllers are shown in Table 2.1.

Figure 2.4: The same three controllers tracking the same "figure eight (8)" pattern at a faster pace of 4 seconds per cycle. The labels and units remain the same as in Figure 2.3.

While the performance of the three controllers is very similar in task space, we did notice that the resolved-acceleration kinematic controller (N = M^{-2}) had a slight advantage. The reason for this performance difference is most likely due to errors in the dynamics model which affect the three control laws differently, i.e., the effect of these errors is amplified by the inversion of the mass matrix in the control laws given in Equations (2.33, 2.34), while the decoupling of the dynamics and kinematics provided by the controller in Equation (2.32) can be favorable as the effect of modeling errors is not increased. More accurate model parameters of the manipulator's rigid body dynamics would result in a reduction of the difference between these control laws (observable in Figures 2.3 and 2.4), as we have confirmed in simulations.

Table 2.1: This table shows the root mean squared error results of the tracking achieved by the different control laws.

    Metric       Slow RMS error [m]    Fast RMS error [m]
    N = M^{-2}   0.0122                0.0130
    N = M^{-1}   0.0126                0.0136
    N = I        0.0130                0.0140

Figure 2.5: Joint space trajectories for the four major degrees of freedom, i.e., shoulder flexion-extension (SFE), shoulder adduction-abduction (SAA), humeral rotation (HR) and elbow flexion-extension (EBFE), are shown here. Joint angle units are in radians. The labels are identical to the ones in Figure 2.3.

Figure 2.5 illustrates how the joint space trajectories appear for the fast cycle. Although end-point trajectories were very similar, joint space trajectories differ significantly due to the different optimization criteria of each control law, which emphasizes the importance of the choice of the metric N.
2.5 Conclusion & Discussion

In order to conclude this thesis chapter, we give an overview of the contributions of the chapter and outline the most important open problems for future research.

2.5.1 Contributions of the Chapter

In this thesis chapter, we have presented an optimal control framework which allows the development of a unified approach for deriving a number of different robot control laws for rigid body dynamics systems. We demonstrated how we can make use of both the robot model and a task description in order to create control laws which are optimal with respect to the squared motor command under a particular metric while perfectly fulfilling the task at each instant of time. We have discussed how to realize stability both in task as well as in joint-space for this framework.

Building on this foundation, we demonstrated how a variety of control laws – which on first inspection appear rather unrelated to one another – can be derived using this straightforward framework. The covered types of tasks include joint-space trajectory control for both fully actuated and overactuated robots, end-effector trajectory control, impedance and hybrid control.

The implementation of three of the end-effector trajectory control laws resulting from our unified framework on a real-world Sarcos Master Arm robot was carried out as an empirical evaluation. As expected, the behavior in task space is very similar for all three control laws; yet, they result in very different joint-space behaviors due to the different cost functions resulting from the different metrics of each control law.

The major contribution of this thesis chapter is the unified framework that we have developed. It allows a derivation of a variety of previously known controllers, and promises easy development of a host of novel ones, in particular control laws with additional constraints. The particular controllers reported in this thesis chapter were selected primarily for illustrating the applicability of this framework and demonstrating its strength in unifying different control algorithms using a common building principle.

The work presented in this chapter up to this point has been presented in (Peters et al., 2005; Peters, Mistry, Udwadia, & Schaal, 2005) and the resulting journal paper is currently under review at Autonomous Robots.
2.5.2 Extension to Infinite Horizon Optimality

It is quite obvious that the point-wise minimization of the instantaneous motor commands u^T u does not necessarily result in a minimization of ∫_0^∞ u^T u dt if there is more than one solution u fulfilling the task A q̈ = b. For example, assume that there are two paths in joint-space at a point in time which both fulfill A q̈ = b and you have to choose between the two of them. In this case, we could end up choosing a path which requires the lower immediate torque at that instant of time but much higher torque for the rest of the trajectory, resulting in a higher average torque. Unfortunately, such tasks exist frequently for redundant robots and, thus, it is of high importance to understand what the infinite horizon implications of this method are. Following the work by Kazerounian & Wang (1988), we can compute the infinite horizon optimal controller for a different measure, i.e., the amount of kinetic energy of a holonomic task^3 plus the value of the storage function (which allows additional position-based punishments).
Proposition 2.5 The minimization of

    J_∞ = ∫_0^∞ [ q̇^T N_∞(q, t) q̇ + V(q) ] dt    (2.48)

subject to a holonomic task h(q, t) = 0 in form of A q̈ = b and a mechanical system u = M q̈ + F, yields the control law

    u = u_0(q) + M (A N_∞^{-1/2})^{+} [ b + A N_∞ (u_0(q) - F) ],    (2.49)

with a joint-space control law

    u_0(q) = Ṅ_∞ q̇ - B q̇ - V'(q) + F,    (2.50)

where B = ∂(q̇^T N_∞)/∂q.

^3 The same metric is commonly used as a base in differential geometric methods (Bullo & Lewis, 2004).

Proof. Using an instantaneous cost function

    c(q, q̇, t) = q̇^T N_∞^{-1}(q, t) q̇ + V(q) + μ^T h(q, t).    (2.51)
The resulting Euler equations (∂c/∂q) - d(∂c/∂q̇)/dt = 0 can be reformulated as

    2 N_∞ q̈ + V'(q) + 2 Ṅ_∞ q̇ - B q̇ - A^T μ = 0,    (2.52)

where B = ∂(q̇^T N_∞)/∂q. This equation yields the acceleration

    q̈ = 0.5 N_∞^{-1} ( B q̇ + A^T μ - 2 Ṅ_∞ q̇ - V'(q) ).    (2.53)

Substituting into A q̈ = b yields μ; resubstituting in Equation (2.53) and using u = M q̈ + F, we obtain Equation (2.49).

Note that this approach is significantly more limited. It requires the constraints to be holonomic in nature, the null-space control law cannot include differential terms, and it will be drastically more difficult to compute. However, from this infinite horizon optimal controller, we can infer an important statement on the infinite horizon optimality of some control laws.

Proposition 2.6 The point-wise optimal controller with cost function

    J_0 = (u - u_0)^T N_0(q, t) (u - u_0)    (2.54)
is equivalent to the one minimizing the infinite horizon cost in Equation (2.48) for N_0(q, t) = M^{-1}(q) N_∞(q, t) M^{-1}(q) and a joint-space control law

    u_0(q) = Ṅ_∞ q̇ - B q̇ - V'(q) + F,    (2.55)

where B = ∂(q̇^T N_∞)/∂q.

Proof. We compare the optimal control laws from Equations (2.49) and (2.6). The result directly follows.

For N_0(q, t) = M^{-1}(q) and G(q) = V'(q), this implies that no special joint-space control law is needed for infinite-horizon optimality. This artifact is a result of the special nature of M(q) being the metric found in mechanics and has been established in a different setting by Doty et al. (1993) and Udwadia (Udwadia & Kalaba, 1996). For N_0(q, t) = M^{-2}(q), and small q̇, we can ensure G(q) + C(q, q̇) ≈ V'(q) and thus have a similar equivalence. This explains why these metrics have been preferred in the theory of robot control. Nevertheless, there are strong arguments against using these metrics as the required motor commands can be infinite for inertia matrix related metrics, see (Udwadia & Kalaba, 1996).

The infinite horizon point of view allows us to make two very strong points, i.e., that adding a null-space control law which includes velocities alters the infinite-horizon metric for which the controller is optimal, and that the hierarchical extension cannot be optimal from an infinite horizon point of view.
2.5.2.1 Extension of the Infinite Horizon Optimality using Storage Functions

If we intend to have joint-space control laws as the ones suggested in this paper, e.g., a spring-damper system in joint-space (or even complex ones), then we need to replace V(q) by V(q, q̇), e.g., V(q, q̇) = K_D q̇^2 + K_P (q_d - q)^2 for a spring-damper system in joint-space. This will alter the metric of the system significantly.

Proposition 2.7 For velocity dependent V(q, q̇), the infinite horizon optimal control law corresponds to the point-wise optimal control law with the altered metric

    N_0 = M^{-1}(q) [ N_∞(q, t) + ∂²V/∂q̇² ] M^{-1}(q),    (2.56)

with a joint-space control law

    u_0(q) = Ṅ_∞ q̇ - B q̇ - ∂V/∂q + (∂²V/∂q̇∂q) q̇ + F.    (2.57)

Proof. We replace V(q) by V(q, q̇) in Proposition 2.5, which implies that V'(q) needs to be replaced by ∂V/∂q - (∂²V/∂q̇²) q̈ - (∂²V/∂q̇∂q) q̇. When following the steps of Proposition 2.5 from here on, the result follows naturally.

This point puts a strong doubt on the idea that an inertia metric is always that useful: the main advantage of the inertia metric is that it directly results in an energy optimal joint-space control law. However, a necessary condition for joint-space stability is that the joint-space control law has a velocity component, as shown in Example 2.1. Thus, this alteration of the metric will be necessary in all practical cases.
2.5.2.2 Re-evaluation of the Hierarchical Extension

The hierarchical extension opens up a variety of important questions not tackled in the literature. Already for the two-level hierarchy (i.e., a joint-space control law and a task space control law), the joint-space control law can best be interpreted as an additional storage function and will always act as a form of friction, thus increasing the amount of energy consumed by the system. Obviously, an n-level hierarchy has no physical interpretation for n > 2 and, to date, there is no way of showing an infinite horizon optimal control law. Even worse, when re-evaluating the derivations in Proposition 2.5 for hierarchical approaches, it is straightforward to show that a hierarchical approach cannot be developed in a similar fashion. Instead, a hierarchical control law which is optimal from an infinite horizon point of view will require higher-order derivatives such as q^(3) for n = 3, q^(3) and q^(4) for n = 4, and q^(3), ..., q^(n) for arbitrary n. The requirement of these higher order derivatives for infinite horizon optimality might cause serious trouble for the hierarchical approach.
Furthermore, the hierarchical extension might not even be desirable as all tasks with the exception of the top-level task can be degraded arbitrarily in tracking performance in the hierarchical approach. Thus, it might be smarter to consider a weighted approach where the tasks are combined in the form

    A = [w_1 A_1, w_2 A_2, ..., w_n A_n],    (2.58)
    b = [w_1 b_1, w_2 b_2, ..., w_n b_n],    (2.59)

where w_i denotes the weight of task i. In such approaches, a larger w_i ≫ w_j will ensure that in the overconstrained case, task j is largely overruled by task i. However, it will happen in a soft and not in a hard way; thus, if both can be fulfilled, they will be fulfilled without contradiction. For the hierarchical approach discussed, task fulfillment cannot always be guaranteed for lower level tasks even if it is possible, as the higher level task can simply overrule the lower level task even if an alternative which fulfills both the higher and the lower task is possible.
2.5.3 Future Work

Among the other interesting prospective topics resulting from this chapter are the application to more complex tasks and systems and the use of this method for planning.

2.5.3.1 Complex Systems

In this thesis chapter, we have only evaluated the more general systems and intend to point out the most important systems which should be evaluated with this approach in the future. One important topic is the control of overactuated systems, which have more actuators than degrees of freedom, i.e., redundancy on the torque generation level as it exists in biological systems where hundreds of muscles interact. While we have suggested a basic solution to this problem (Peters et al., 2005), a proper evaluation of overactuated control based upon the approach suggested in this paper could result in interesting insights both for biological motor control as well as robotics.
Controlling underactuated systems is probably the most important robot control problem as they occur frequently in nature, e.g., legged locomotion, and even simple toy systems such as cart-pole balancing fall into this category. From the viewpoint of the approach described in this chapter, an underactuated control problem can be treated as a modified task where we have

    [ A      ]        [ b   ]
    [ m_1^T  ]  q̈  =  [ F_1 ]
    [  ...   ]        [ ... ]
    [ m_k^T  ]        [ F_k ]    (2.60)

with the actual control task A q̈ = b and, for the unactuated degrees of freedom i ∈ {1, ..., k}, the additional constraints u_i = 0, or equivalently m_i^T q̈ = F_i. Here, m_i^T denotes the i-th row of the inertia matrix. It becomes clear from this formulation that underactuated systems can be treated in the same way as fully actuated ones as long as the task can be fulfilled. If the task cannot be fulfilled, i.e., if the problem is overconstrained, this becomes a pure planning problem where only the task planner can ensure that any control law can ever perform the task. The same holds true for non-holonomic systems, where the non-holonomic constraints can result in an overconstrained control task with the same difficulties as for underactuated systems.
2.5.3.2 Applications in Planning

One important aspect might arise out of the application of this method to overactuated systems, i.e., one might be able to realize a dimensionality reduction method which allows planning of controllable trajectories in a lower-dimensional domain. While this idea is similar to the goal of differential geometric control methods, it might arise naturally from decompositions of the matrix (A M^{-1} N^{-1/2})^{+}. Such approaches could be similar to the one suggested in Section 8.4.1 in (Udwadia & Kalaba, 1996) and, at the same time, would become the foundations of novel machine learning algorithms.

Similarly, the methods proposed in this chapter have another interesting implication for planning of tasks as they can be used for feasibility tests of trajectories in simulation. The resulting mechanism is quite simple: if the planned trajectory (or task) is controllable at all, the control law generated by this method will be able to control it in an idealized simulation. Thus, plans can be evaluated for their feasibility. While this might appear trivial, it can be very useful in practice.
Chapter 3

Learning Tracking Control in Operational Space

An approximate solution to the right problem is far better
than an exact answer to an approximate problem.
John Wilder Tukey (American statistician, 1915-2000)

One of the control problems which we have discussed in Chapter 2 is tracking control in operational space. While operational space control is of essential importance for robotics and well-understood from an analytical point of view, it can be prohibitively hard to achieve accurate control in the face of modeling errors, which are inevitable in complex robots, e.g., humanoid robots. In such cases, learning control methods can offer an interesting alternative to analytical control algorithms. However, the resulting learning problem is ill-defined as it requires learning an inverse mapping of a usually redundant system, which is well known to suffer from the property of non-convexity of the solution space, i.e., the learning system could generate motor commands that try to steer the robot into physically impossible configurations. A first important insight for this thesis chapter is that, nevertheless, a physically correct solution to the inverse problem does exist when learning of the inverse map is performed in a suitable piecewise linear way. The second crucial component for our work is based on an insight in Section 2.3.2 that many operational space controllers can be understood in terms of a constrained optimal control problem. The cost function associated with this optimal control problem allows us to formulate a learning algorithm that automatically synthesizes a globally consistent desired resolution of redundancy while learning the operational space controller. From the view of machine learning, the learning problem corresponds to a reinforcement learning problem that maximizes an immediate reward and that employs an expectation-maximization policy search algorithm. Evaluations on a three degrees of freedom robot arm are used to illustrate the suggested approach, and the application to a physically realistic simulator of the anthropomorphic SARCOS Master arm demonstrates feasibility for complex high degree-of-freedom robots.
3.1 Introduction

Operational space control is one of the most elegant approaches to task control due to its potential for dynamically consistent control, compliant control, force control, hierarchical control, and many other favorable properties, with applications from end-effector control of manipulators (Khatib, 1987; Hsu et al., 1989) up to balancing and gait execution for humanoid robots (Sentis & Khatib, 2005). If the robot model is accurately known, operational space control is well-understood, yielding a variety of different solution alternatives, including resolved-motion rate control, resolved-acceleration control, and force-based control (Nakanishi et al., 2005). However, particularly if compliant (i.e., low-gain) control is desired, as in many new robotic systems that are supposed to operate safely in human environments, operational space control becomes increasingly difficult in the presence of unmodeled nonlinearities, leading to reduced accuracy or even unpredictable and unstable null-space behavior in the robot system. As a potential solution to this problem, learning control methods seem to be promising. But learning methods do not easily provide the highly structured knowledge required in traditional operational space control laws, i.e., Jacobians, inertia matrices, and Coriolis/centripetal and gravity forces, as all these terms are not always instantly observable and are therefore not suitable for formulating supervised learning as traditionally used in learning control approaches (Nakanishi, Farrell, & Schaal, 2004).

In this thesis chapter, we will suggest a novel approach to learning operational space control that avoids extracting such structured knowledge, and rather aims at learning the operational space control law directly. To develop our approach, we will proceed as follows: firstly, we will review operational space control and discuss where learning can be beneficial. Secondly, we will pose operational space control as a learning problem and discuss why standard learning techniques cannot be applied straightforwardly. Using the alternative understanding of operational space control as an optimal control technique, we reformulate it as an immediate reward reinforcement learning or policy search problem and suggest novel algorithms for learning some of the most standard types of operational space control laws. These new techniques are evaluated on a simulated three degree-of-freedom robot arm and a simulated anthropomorphic seven degrees of freedom SARCOS robot arm.
3.1.1 Notation and Remarks

Throughout this thesis chapter, we assume the standard rigid body model for the description of the robot, i.e.,

    M(q) q̈ + C(q, q̇) + G(q) + ε(q, q̇) = u,    (3.1)

where q, q̇, q̈ ∈ R^n denote the joint coordinates, velocities and accelerations of the robot, respectively. The torques generated by the motors of the robot, also referred to as motor commands, are given by u ∈ R^n. Furthermore, M(q) denotes the inertia tensor or mass matrix, C(q, q̇) the Coriolis and centripetal forces, G(q) is gravity and ε(q, q̇) denotes unmodeled nonlinearities.

In operational space control, we intend to execute trajectories or forces^1 given in the coordinate system of the actual task. A well-studied example is a robot arm where position and orientation of the end-effector are controlled (Khatib, 1987; Hsu et al., 1989); however, a variety of further applications exist, such as the control of the center of gravity for balancing legged robots, which can also be thought of as operational space control (Sentis & Khatib, 2005). Position and orientation x ∈ R^m of the controlled element of the robot in task-space, e.g., the end-effector, is given by the forward kinematics x = f_Kinematics(q). The derivatives yield both velocity and acceleration in task space, i.e.,

    ẋ = J(q) q̇,    ẍ = J(q) q̈ + J̇(q) q̇,    (3.2)

where J(q) = df_Kinematics(q)/dq denotes the Jacobian. We assume that the robot is in general redundant, i.e., it has more degrees of freedom than required for the task or, equivalently, n > m.

^1 In the more general case, the hybrid creation of forces in task space while following a desired trajectory needs to be included. For simplicity, we will omit such kinds of tasks in this thesis chapter.
3.1.2 Operational Space Control as an Optimal Control Problem

Using the framework of trajectory tracking as an example, the general problem in operational space control^1 can be described as follows: generate a control law u = f_Control(q, q̇, x_d, ẋ_d, ẍ_d) which controls the robot along a joint space trajectory q(t), q̇(t), and q̈(t), such that the controlled element (e.g., the end-effector) follows a desired trajectory in task space x_d(t), ẋ_d(t), and ẍ_d(t). This problem has been thoroughly discussed since the late 1980s (e.g., (Khatib, 1987; Hsu et al., 1989)) and, among others, has resulted in a class of well-known control laws (Nakanishi et al., 2005). As an important new insight into operational space control, it was recently discovered (Peters et al., 2005) that many of the suggested controllers in the literature can be derived as the solution of a constrained optimization problem given by

    min_u C_0(u) = u^T N u    s.t.    J q̈ = ẍ_ref - J̇ q̇,    (3.3)

where N denotes a positive definite metric that weights the contribution of the motor commands to the cost function, and ẍ_ref = ẍ_d(t) + K_d (ẋ_d(t) - ẋ(t)) + K_p (x_d(t) - x(t)) denotes a reference attractor in task space with gain matrices K_d and K_p. The resulting control laws or solutions of this optimization problem obey the general form (Peters et al., 2005)

    u = N^{-1/2} (J M^{-1} N^{-1/2})^{+} (ẍ_ref - J̇ q̇ + J M^{-1} F),    (3.4)

with F(q, q̇) = C(q, q̇) + G(q) + ε(q, q̇), and the notation D^{+} defining the pseudo-inverse of a matrix such that D^{+} D = I, D D^{+} = I, and with the matrix root D^{1/2} defined as D^{1/2} D^{1/2} = D.
For example, the resolved-acceleration controller of Hsu et al. (Hsu et al., 1989) (without null-space optimization) is the result of using the metric N = M^{-2}, which yields u = M J^{+} (ẍ_ref - J̇ q̇) + F, and corresponds to a cascade of an inverse dynamics and an inverse kinematics control law. Another example is Khatib's formulation of operational space control (Khatib, 1987), determined by the metric N = M^{-1} and given by

    u = J^{T} (J M^{-1} J^{T})^{-1} (ẍ_ref - J̇ q̇ + J M^{-1} F).    (3.5)

Khatib's solution is special as the metric N = M^{-1} is the only metric which generates torques that correspond to the ones created by a physical constraint pulling the robot along the trajectory (Udwadia & Kalaba, 1996; Peters et al., 2005), i.e., it is the metric used by nature according to Gauss' principle (Bruyninckx & Khatib, 2000; Udwadia & Kalaba, 1996) and it is invariant under change of joint coordinates (Doty et al., 1993). Other metrics such as N = const can be used to distribute the required forces differently, e.g., such that stronger motors get a higher portion of the generated forces (Peters et al., 2005).
Even when achieving the task perfectly, the joint-space trajectories can result in unfavorable postures or even joint-space instability (see Example 3.1). For handling such cases, additional controls which do not affect the task performance but ensure a favorable joint-space behavior need to be included. From the point of view of the optimization framework, we would select a nominal control law u_0 (e.g., a force which pulls the robot towards a rest posture u_0 = -K_D q̇ - K_D (q - q_rest)), and then solve the constrained optimization problem

    min_u C_1(u) = (u - u_0)^T N (u - u_0)    s.t.    J q̈ = ẍ_ref - J̇ q̇,    (3.6)

where u_1 = u - u_0 is the task-space control component. The general solution is given by

    u = N^{-1/2} (J M^{-1} N^{-1/2})^{+} (ẍ_ref - J̇ q̇ + J M^{-1} F)    (3.7)
        + N^{-1/2} (I - (J M^{-1} N^{-1/2})^{+} (J M^{-1} N^{-1/2})) N^{1/2} u_0,

where the second summand fulfills the nominal control law u_0 in the null-space of the first term. When having more than two tasks, these can be nested in a similar fashion, leading to a general framework of hierarchical task control (Peters et al., 2005; Sentis & Khatib, 2005).

Figure 3.1: When applied to the prismatic robot from Example 3.1 shown in (a), the three control laws for the metrics N = I (dash-dot red lines), N = M^{-1} (solid green), N = M^{-2} (dashed blue) result in (b) the same task-space tracking but (c, d) very different joint-space behavior. See Example 3.1 for more information. (Panels: (a) prismatic 2-dof robot, (b) end-effector position (all trajectories coincide almost perfectly with the reference trajectory), (c) joint position q_1, (d) joint position q_2.)
Example 3.1 An illustrative example of operational space control is tracking the end-effector position x = q_1 + q_2 of a prismatic robot with two parallel links with joint positions q_1, q_2, see Figure 3.1. The mass matrix will be M = diag(m_1, 0) + m_2 1 with masses m_1 = m_2 = 1 and 1 denoting a matrix with all coefficients equal to one. The internal forces are F = 0, the Jacobian is J = [1, 1]^T and its derivative J̇ = 0. If no joint-space control law is selected, i.e., u_0 = 0, the control law in the form of Equation (3.4) for executing the task ẍ_ref = ẍ_d + K_d (ẋ_d - ẋ) + K_p (x_d - x) would result in unstable behavior for most metrics N. When adding a u_0 = -K_D q̇ - K_D q pulling the robot towards q_rest = 0, we obtain stable tracking with very different properties as can be observed in Figure 3.1: (i) metric N = I will result in the second link tracking the end-effector and the null-space component stabilizing the first link, (ii) metric N = M^{-1} will distribute the task on both links evenly and have the null-space component decouple the two links, while (iii) metric N = M^{-2} simply minimizes the squared acceleration.

We will use this simple robot example (Example 3.1) to illustrate various other issues below, as it allows easy analytical understanding and graphical visualizations.
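The effect of the metric on the redundancy resolution in Example 3.1 can be checked numerically with a few lines of Python/NumPy (our sketch; the null-space command u_0 is ignored for brevity, F = 0, and one unit of reference task acceleration is applied):

    import numpy as np

    # Prismatic robot of Example 3.1: x = q1 + q2, m1 = m2 = 1, F = 0.
    M = np.array([[2.0, 1.0],
                  [1.0, 1.0]])          # diag(m1, 0) + m2 * ones((2, 2))
    J = np.array([[1.0, 1.0]])          # task Jacobian (Jdot = 0)
    Minv = np.linalg.inv(M)
    b = np.array([1.0])                 # one unit of reference task acceleration

    u_res_acc = M @ np.linalg.pinv(J) @ b                  # N = M^-2: equal joint accelerations
    u_khatib  = J.T @ np.linalg.solve(J @ Minv @ J.T, b)   # N = M^-1: torque spread evenly
    u_ident   = np.linalg.pinv(J @ Minv) @ b               # N = I:   second joint does the work

    print(u_res_acc, u_khatib, u_ident)   # approx. [1.5, 1.0], [1.0, 1.0], [0.0, 1.0]

The three torque distributions reproduce observations (i)-(iii) of Example 3.1: identical task acceleration, but very different joint-space behavior depending on N.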
3.1.3 Why should we learn Operational Space Control?

When an accurate analytical model of the robot is available and its parameters can be well-estimated, operational space control laws can be highly successful (Khatib, 1987; Sentis & Khatib, 2005; Nakanishi et al., 2005). However, in many new complex robotic systems, e.g., humanoid robots, space robots, etc., accurate analytical models of the robot dynamics are not available due to significant departures from idealized theoretical models such as rigid body dynamics. For instance, in our experience with anthropomorphic robots, unmodeled nonlinear effects were caused by complex actuator dynamics, hydraulic hoses and cable bundles routed along the lightweight structure of the robot as well as complex friction effects. Trying to model such nonlinearities is of little use due to the lack of generality of such an approach, and the daunting task of deriving useful models for the unknown effects.

Example 3.2 In the prismatic robot from Example 3.1, already small unmodeled nonlinearities can have a drastic effect. If the estimated mass matrix of the robot M̃ = diag(m_1, 0) + m_2 1 (where 1 is a matrix with only ones as entries) just differs from the true M by M_12 - M̃_12 = M_21 - M̃_21 = 0.5 sin(q_1 + q_2), e.g., through unmodeled properties of cables, then the resulting control law will result in unstable and unpredictable null-space behavior despite that accurate task space tracking is theoretically still possible. On a real physical system, excessive null space behavior saturates the motors of the robot, such that also task space tracking degrades, and the entire control system goes unstable.

Example 3.2 demonstrates how a small modeling error decreases the performance of the operational control law and can result in joint-space instability even for simple robots. For light-weight robot arms or full-body humanoid robots, such problems become even more frequent and difficult to cope with. Traditionally, this problem is fixed by the engineer improving the approximation of the plant by hand; however, for operational space control of low-gain controlled light-weight robots which are hard to model, learning is a promising novel alternative and will be discussed in Section 3.2.
3.2 Learning Methods for Operational Space Control

Learning operational space control with redundant manipulators is largely an unexplored problem and the literature has only a few related examples. Among those, learning approaches to task level control focussed mostly on an inverse kinematics end-effector control (Guez & Ahmad, 1988; Jordan & Rumelhart, 1992; Bullock, Grossberg, & Guenther, 1993; Tevatia & Schaal, 2000; D'Souza, Vijayakumar, & Schaal, 2001), i.e., learning an inverse kinematics mapping, in order to create appropriate reference trajectories in joint-space, which were to be executed by a given joint-space control law, or were simply optimizing a certain trajectory (De Luca & Mataloni, 1991). The combination of a learned inverse kinematics and a learned inverse dynamics controller (Guez & Ahmad, 1988; Tevatia & Schaal, 2000; D'Souza et al., 2001) can only be found occasionally in the literature. To the best of our knowledge, full operational space control laws with redundancy have not been addressed by general learning approaches to date.
3.2.1 Can Operational Space Control be learned?

Learning operational space control is equivalent to obtaining a mapping (q, q̇, ẍ_ref) → u from sampled data using a function approximator. However, as the dimensionality of the task-space reference trajectory ẍ_ref is lower than the one of the motor command u, there are infinitely many solutions for u for most joint positions q and joint velocities q̇. For the illustrative linear case in Example 3.2 without a null-space component, this mapping corresponds to a line in the plane of possible control laws, as shown by the two lines in Figure 3.2 (a).

Figure 3.2: This figure illustrates how (a) different data sets result in different solutions if each data point is treated with equal importance (the blue dash-dot line corresponds to the blue diamonds and the red dashed line to the red circles). If these data points are (b) weighted down using the Gaussian cost function (here indicated with the metric N = M^{-1} as solid thin black lines), the solutions of different data sets will consistently approximate the optimal solutions shown in the solid cyan line. While for the linear prismatic robot one could live with any solution in (a), different local solutions have to create a consistent global solution for nonlinear robots. The horizontal faintly dotted lines in (a) and (b) indicate contour lines of equal task-space acceleration. (Panels: (a) unweighted data sets, (b) reward-weighted data sets; axes: motor command u_1 versus motor command u_2.)

A major problem arises in the case of a robot with rotary joints as the motor commands u achieving the same reference acceleration ẍ_ref no longer form a convex set, a problem first described in the context of learning inverse kinematics (D'Souza et al., 2001; Jordan & Rumelhart, 1992). Thus, when learning the inverse mapping (q, q̇, ẍ_ref) → u, the learning algorithm will average over unconnected sets of the solutions, which can result in invalid solutions to the learning problem. Therefore, the learning problem is ill-conditioned such that directly learning from samples with supervised learning techniques is not suitable.
Nevertheless, the convexity issues can be resolved by employing a spatially localized supervised learning system, which, in our case, needs to be spatially localized based on both joint space position and velocity – such an approach was first introduced in the context of inverse kinematics learning (Bullock et al., 1993; D'Souza et al., 2001). The feasibility of this idea can be demonstrated simply by averaging over the combination of Equations (3.2) and (3.1), which yields that by averaging over the exact same spatial position q and velocity q̇, we have

    ẍ = ⟨ẍ⟩ = ⟨ J M^{-1} (u - F) + J̇ q̇ ⟩ = J M^{-1} ⟨u - F⟩ + J̇ q̇ = J M^{-1} (u - F) + J̇ q̇,    (3.8)

i.e., in the vicinity of the same q, q̇, a particular ẍ will always correspond to exactly one particular u.^2 Therefore, locally linear controllers

    u_i = c_i^β(q, q̇, ẍ_ref) = [ẍ_ref^T, q̇^T, 1] β_i,    (3.9)

^2 Note that the localization in velocity q̇ can be dropped for a pure rigid body formulation as it is linear in the q̇_i q̇_j for all degrees of freedom i, j; this, however, is not necessarily desirable as it will add new inputs to the local regression problem, the number of which grows quadratically with the number of degrees of freedom.
can be used if they are only active in a region around q, q̇ (note that we added a constant input in Equation (3.9) to account for the intercept of a linear function). From a control engineering point of view, this argument corresponds to the insight that when we can linearize the plant in a certain region, we can find a local control law in that region by treating the plant as linear, and, in general, linear systems do not have the problem of non-convexity of the solution space when learning an inverse function.

Next we need to address how to find an appropriate piecewise linearization for the locally linear controllers. For this purpose, we learn a locally linear forward or predictor model

    ẍ_i = p_i^β̂(q, q̇, u) = [q̇^T, u^T, 1] β̂_i.    (3.10)

Learning this forward model is a standard supervised learning problem, as the mapping is guaranteed to be a proper function. A method of learning such a forward model that automatically also learns a local linearization is Locally Weighted Projection Regression (LWPR) (Schaal, Atkeson, & Vijayakumar, 2002), a fast online learning method which scales into high dimensions, has been used for inverse dynamics control of humanoid robots, and can automatically determine the number of local models that are needed to represent the function. The membership to a local model is determined by a weight generated from a Gaussian kernel:

    w_i(q, q̇) = exp( -1/2 ([q; q̇] - c_i)^T D_i ([q; q̇] - c_i) ),    (3.11)

centered at c_i in (q, q̇)-space, and shaped by a distance metric D_i. For a closer description of this statistical learning algorithm see (Schaal et al., 2002).
For each local forward model created by LWPR, we automatically create a local
controller. This approach of pair-wise combining predictors and controllers is related
by the MOSAIC architecture (Haruno, Wolpert, & Kawato, 1999) where the quality of
predictingataskisusedforselectingwhichlocalcontrollershouldbeusedforthetask.
3.2.2 CombiningtheLocalControllersandEnsuringConsistentResolution
ofRedundancy
In order to control a robot with these local control laws, they need to be combined into
aconsistentglobalcontrollaw. Thecombinationisgivenbyaweightedaverage(Schaal
etal.,2002):
u =
P
n
i=1
w
i
(q,˙ q)[¨ x
T
ref
,˙ q
T
,1]β
i
P
n
i=1
w
i
(q,˙ q)
, (3.12)
whereeachcontrollawc
i
β
(q,˙ q,¨ x
ref
)isjustvalidinitslocalregioncomputedbyw
i
(q,˙ q),
andβ
i
aretheparametersofeachlocaloperationalspacecontrollaw.
However, while the mappings (q,˙ q,¨ x
ref
)→ u can properly be learned locally in
the neighborhood of some q,˙ q, due to the redundancy in the robotic system, there is
no guarantee that across the local mappings the same type of solution is acquired. This
problemisduetothedependenceoftheinversesolutiononthetrainingdatadistribution
in each local model – i.e., different distributions will pick different solutions for the in-
verse mapping from the infinity of possible inverses. In Figure 3.2 (a), we demonstrate
74
this effect. While this problem is not devastating for the prismatic robot from Example
3.1,itisresultsinsevereproblemsforanynonlinearrobotrequiringmultiple,consistent
linear models. There are two different approaches to tackling such problems: (1) by bi-
asingthesystemtowardsusingapre-processeddatasetsuchthatitcanonlyproduceone
particularinversesolution(D’Souzaetal.,2001),and(2)byincorporatingacost/reward
functioninordertofavoracertainkindofsolution(anexamplewhichwillbediscussed
later and is shown Figure 3.2 (b)). The first approach lacks generality and can bias the
learning system such that the task is not properly accomplished anymore. The major
shortcoming of the second approach is that the choice of the cost/reward function is in
generalnon-trivialanddeterminesthelearningalgorithmaswellasthelearnedsolution.
The crucial component to finding a principled approach to this inconsistency prob-
lem is based on the discussion in Section 3.1.2 and previous work (Peters et al., 2005).
Operationalspacecontrolcanbeseenasaconstrainedoptimizationproblemwithacost
functiongiveninEquation(3.3). Thus,thecostfunctionbasedapproachforthecreation
of a consistent set of local controllers for operational space control can be based on this
insight. The cost function can be turned into a immediate reward r(u) by running it
throughanexponentialfunction:
r(u) =σexp
−0.5σ
2
C
1
(u)
=σexp
−σ
−2
u
T
1
Nu
1
, (3.13)
whereσ is a scaling factor and the task space commandu
1
=u−u
0
can be computed
using a desired null-space behavioru
0
(e.g., pulling towards a rest posture as discussed
75
in Section 3.1.2). The scaling factor σ does not affect the optimality of a solution u
as it acts as a monotonic transformation in this cost function. However, it can increase
the efficiency of the learning algorithm significantly when only sparse data is available
for learning (i.e., as for most interesting robots as the high-dimensional action spaces
of complex robots will hardly ever be filled densely with data)
3
. These local rewards
allowusthereformulationofourlearningproblemasanimmediaterewardreinforcement
learningproblem (Dayan&Hinton,1997),aswillbediscussedinSection3.3.
Wearenowinthepositiontoformulateasupervisedlearningalgorithmforthelocal
operational space controllers. The task constraint in Equation (3.3) as well as the rigid
bodydynamicsinEquation(3.1)areautomaticallyfulfilledbyalldatasampledfromthe
realrobotsimilartoaself-supervisedlearningproblem. Therefore,forlearningthelocal
operational space controllers, we have obtained a local linear regression problem where
we attempt to learn primarily from the observed motor commandsu
k
which also have a
high rewardr(u
k
) within each active local modelc
i
β
(q
k
,˙ q
k
,¨ x
k
ref
). An intuitive solution
istousereward-weightedregression,i.e.,findthesolutionwhichminimizes
N
X
k=1
r
u
k
w
i
q
k
,˙ q
k
u
k
−[¨ x
k,T
ref
,˙ q
k,T
,1]β
i
2
→ min, (3.14)
3
The reward has to be seen in the light of the relationship between the Gaussian distribution and
Gauss’ principle for constrained motion as suggested already by Carl-Friedrich Gauss in his original
work(Udwadia,2005).
76
foreachcontrolleri. Thesolutiontothisproblemisthewell-knownweightedregression
formula:
β =
Φ
T
WΦ
−1
Φ
T
WU, (3.15)
with rows in the matrices Φ and U : Φ
k
= [¨ x
k,T
ref
,˙ q
k,T
,1], U
k
= u
k,T
and W
i
=
r(u
i
)w(q
i
,˙ q
i
). Whenemployingthisreward-weightedregressionsolution,wewillcon-
vergetoagloballyconsistentsolutionacrossalllocalcontrollers. Thelearningalgorithm
isshowninTable3.1togetherwithanadditionalcomponentderivedinSection3.3. Note
that this step was only possible due to the essential cost function in Equation (3.6) from
ourpreviouswork.
3.3 ReformulationasReinforcementLearningProblem
Another way of looking at operational space control is to view it as an immediate re-
ward reinforcement learning problem (Kaebling, Littman, & Moore, 1996) with high-
dimensional, continuous statess = [q,˙ q,¨ x
ref
,u
0
]∈ R
n
and actionsu∈ R
m
. The goal
oflearningistoobtainanoptimalpolicy
u =μ(q,˙ q,¨ x
ref
,u
0
) =μ(s) (3.16)
such that the system follows the reference acceleration ¨ x
ref
while maximizing the im-
mediate reward r(u) =−(u−u
0
)
T
N(u−u
0
) for any given nominal behavior u
0
.
In order to incorporate exploration during learning, we need a stochastic control policy
77
Algorithm:LearningforOperationalSpaceControl
1 foreachnewdatapoint [¨ x
k
ref
,q,˙ q
k
,u
k
]
2 Add (q,˙ q,u)→¨ xtotheforwardmodelregression.
3 Determinethecurrentnumberofmodelsnand
localizationsoftheforwardmodelsw
i
(q,˙ q).
4 Computedesirednull-spacebehavioru
k
0
=f
q
k
,˙ q
k
.
5 ComputecostsC
k
1
=
u
k
1
T
N
q
k
u
k
1
withu
k
1
=u
k
−u
k
0
.
6 Foreachmodeli = 1,2,...,n
Updatemeancost:
7 σ
2
i
=
P
k
h=1
w
k
q
h
,˙ q
h
C
k
1
.
P
N
k=1
w
k
q
h
,˙ q
h
,
Computereward:
8 r(u) =σ
i
exp
−0.5σ
2
i
C
k
1
Adddatapointtoweightedregressionsothat:
9 Φ
i
= [q
i
,˙ q
i
,¨ x
i
ref
]
10 U
i
=u
i
11 W = diag(r(u
1
)w
1
,...,r(u
n
)w
n
)
Performpolicyupdatebyregression
12 β
k+1
=
Φ
T
WΦ
−1
Φ
T
WU,
13 end
14 end
Table3.1: ThistableshowsthecompletelearningalgorithmforOperationalSpaceCon-
trol. Seetextofdetailedexplanations.
78
u = μ
θ
(q,˙ q,¨ x
ref
) +ε, modeled as a probability distributionπ
θ
(u|s) = p(u|s,θ) with
parametervectorθ. Thegoalofthelearningsystemisthustofindthepolicyparameters
θ thatmaximize
J
r
(θ) =
Z
p(s)
Z
π
θ
(u|s)r(s,u)duds. (3.17)
p(s) denotes the distribution of states, which is treated as fixed in immediate reward
reinforcementlearningproblems(Kaeblingetal.,1996).
Originally,wederivedthisalgorithmfromaweightedregressionpointofview. How-
ever, this point of view is not completely satisfying as it still has the open parameter
σ
2
which determines the speed of convergence of the learning controllers. An alter-
native view point, i.e., in the framework of immediate reward reinforcement learning,
allowsderivingthepreviousalgorithmtogetherwithacomputationruleforσ
2
. Previous
workintheliteraturesuggestedavarietyofoptimizingmethodswhichcanbeappliedto
immediate reward reinforcement learning problems, e.g., gradient based methods (e.g.,
REINFORCE,CovariantREINFORCE,finitedifferencegradients,theKiefer-Wolfowitz
procedure,A
RP
algorithms, CRBP, etc.) and random search algorithms (e.g., simulated
annealing or genetic algorithms) (Dayan & Hinton, 1997; Kaebling et al., 1996; Spall,
2003). However,gradient-basedmethodstendtobetooslowfortheonlinelearningthat
we desire in our problem, while randomized search algorithms can create too arbitrary
solutions, often not suitable for execution on a robotic system. For learning operational
space control, we require a method that is computationally sufficiently efficient to deal
with high-dimensional robot systems and large amounts of data, that has a low sample
79
complexity, thatcomeswithconvergenceguarantees,andthatissuitableforsmoothon-
line improvement. For instance, linear regression techniques and/or methods employing
EM-stylealgorithmsarehighlydesirable.
Agoodstartingpointforourworkistheprobabilisticreinforcementlearningframe-
work by Dayan & Hinton (1997). As we will show in the following, a generalization of
this approach allows us to derive an EM-algorithm which essentially reduces the imme-
diaterewardlearningproblemtoareward-weightedregressionproblem.
3.3.1 RewardTransformation
Inordertomaximizetheexpectedreturn(Equation3.17)usingsamples,weapproximate
J
r
(θ)≈
X
n
i=1
π
θ
(u
i
|s
i
)r
i
(3.18)
wherer
i
=r(s
i
,u
i
). For application of the probabilistic reinforcement learning frame-
workofDayan&Hinton(Dayan&Hinton,1997),therewardneedstobestrictlypositive
suchthatitresemblesan(improper)probabilitydistribution. Whilethiscanbeachieved
by a linear rescaling for problems for bounded rewards, for unbounded rewards as dis-
cussed in this paper, a nonlinear transformation of the reward U
τ
(r) is required, with
the constraint that the optimal solution to the underlying problem remains unchanged.
80
Thus,werequirethatU
τ
(r)isstrictlymonotonicwithrespecttor,andadditionallythat
U
τ
(r)≥ 0and
R
∞
0
U
τ
(r)dr = const,resultinginthetransformedoptimizationproblem
J
u
(θ) =
X
n
i=1
π
θ
(u
i
|s
i
)U
τ
(r
i
). (3.19)
The reward transformation plays a more important role than initially meets the eye: as
already pointed out in (Dayan & Hinton, 1997), convergence speed can be greatly af-
fected by this transformation. MakingU
τ
(r) an adaptive part of the learning algorithm
by means of some internal parametersτ can greatly accelerate the learning speed and
help avoid local minima during learning. Figure 3.3 demonstrates this issue with a 1D
continuous state and 1D continuous action example, where the goal is to learn an opti-
mal linear policy. Using the algorithm that we will introduce below, an adaptive reward
transformation accelerated the convergence by a factor of 4, and actually significantly
helpedavoidinglocalminimaduringlearning.
3.3.2 EMReinforcementLearningwithRewardTransformation
To derive our learning algorithm, similar as in (Dayan & Hinton, 1997), we start by
establishingthelowerbound
logJ
u
(θ) = log
X
n
i=1
q(i)
π
θ
(u
i
|s
i
)U
τ
(r
i
)
q(i)
(3.20)
≥
X
n
i=1
q(i)log
π
θ
(u
i
|s
i
)U
τ
(r
i
)
q(i)
(3.21)
81
Step 1
-2 0 2
-1
0
1
Step 2
-2 0 2
-1
0
1
Actionu
State s
Actionu
State s
Step 0
-2 0 2
-1
0
1
Actionu
State s
Step 3
-2 0 2
-1
0
1
Actionu
State s
Step 20
-2 0 2
-1
0
1
Actionu
State s
Step 1
-2 0 2
-1
0
1
Step 2
-2 0 2
-1
0
1
Actionu
Actionu
Step 0
-2 0 2
-1
0
1
Actionu
State s State s State s
Step 3
-2 0 2
-1
0
1
Actionu
State s
Step 6
-2 0 2
-1
0
1
Actionu
State s
(b) Adaptive Reward Transformation
(a) Fixed Reward Transformation
Figure 3.3: A comparison of fixed and adaptive reward transformation for learning a
linear policyπ(u|s) =N(u|θ
1
s +θ
2
,σ
2
) under the transformed rewardu(r(s,u)) =
exp(−τ (q
1
u
2
+q
2
us+sq
2
3
)). The transformed reward is indicated by the dotted blue
ellipses, the variance of the action distribution is indicated by the red thick ellipse, and
the mean of the linear policy is shown by the red thick line. Withτ being adaptive, sig-
nificantlyfasterlearningoftheoptimalpolicyisachieved. Step0showstheinitialpolicy
and initial transformed reward, Step 1 shows the initial policy with adapted transformed
reward.
=
X
n
i=1
q(i)[logπ
θ
(u
i
|s
i
)+logU
τ
(r
i
)−logq(i)] (3.22)
=F (q,θ,τ), (3.23)
duetoJensensinequality. There-weightingdistributionq(i)obeystheconstraint
X
n
i=1
q(i)−1 = 0. (3.24)
TheresultingEMalgorithmisgivenbelow.
82
Algorithm3.1 AnEMalgorithmforoptimizingboththeexpectedrewardaswellasthe
reward-transformation isgiven byan E-Step
q
k+1
(j) =
π
θ
k
(u
j
|s
j
)U
τ
k
(r
j
)
P
n
i=1
π
θ
(u
i
|s
i
)U
τ
k
(r
i
)
, (3.25)
an M-Step forthepolicy parameterupdate given
θ
k+1
= argmax
θ
X
n
i=1
q
k+1
(i)logπ
θ
(u
i
|s
i
), (3.26)
and a M-Stepforthe adaptiverewardtransformation given by
τ
k+1
= argmax
τ
X
n
i=1
q
k+1
(i)logU
τ
(r
i
). (3.27)
Proof. TheE-Stepisgivenby
q
k+1
= argmax
q
F (q,θ,τ) (3.28)
whilefulfillingtheconstraint
0 =
X
n
i=1
q(i)−1. (3.29)
83
Thus,weobtainaconstrainedoptimizationproblemwithLagrangemultiplierλ:
L(λ,q) =
X
n
i=1
q(i)[logπ
θ
(u
i
|s
i
)+logU
τ
(r
i
)−logq(i)+λ]−λ. (3.30)
Optimizing L(λ,q) with respect to q and λ results in Equation (3.25). Optimizing
F (q
k+1
,θ,τ)withrespecttoθ andτ yieldsEquations(3.26,3.27).
3.3.3 ReinforcementLearningbyReward-WeightedRegression
Letusassumethespecificclassofnormallydistributedpolicies:
π
θ
(u|s) =N
u|μ
θ
(s),σ
2
I
(3.31)
with a nominal or mean behavior μ
θ
(s) = φ(s)
T
θ whereφ(s) denotes some fixed
preprocessing of the state by basis functions andσ
2
I determines the exploration
4
. Fur-
thermore,wechoosetherewardtransformation
U
τ
(r) =τ exp(−τr), (3.32)
which, for r > 0 fulfills all our requirements on a reward transformation (cited from
Sec.3.3.1). Algorithm3.1thusbecomes:
4
Notethatσ
2
Icouldbereplacedbyafullvariancematrixwithlittlechangesinthealgorithm. However,
this would result in a quadratic growth of parameters with the dimensionality of the state and is therefore
lessdesirable.
84
Algorithm3.2 Theupdateequations for thepolicyπ
θ
(u|s) =N (u|μ
θ
(s),σ
2
I)are:
θ
k+1
=
Φ
T
WΦ
−1
Φ
T
WY, (3.33)
σ
2
k+1
=
Y−θ
T
k+1
Φ
2
W
, (3.34)
where
W =
n
X
i=1
U
τ
(r
i
)
!
−1
diag(U
τ
(r
1
),U
τ
(r
2
),...,U
τ
(r
n
)), (3.35)
denotesadiagonal matrixwith transformedrewards,
Φ = [φ(s
1
),φ(s
2
),...,φ(s
n
)]
T
, (3.36)
and
Y = [u
1
,u
2
,...,u
n
]
T
(3.37)
themotor commands. TheupdateoftherewardtransformationU
τ
(r) =τ exp(−τr)is
τ
k+1
=
P
n
i=1
U
τ
k
(r
i
)
P
n
i=1
U
τ
k
(r
i
)r
i
. (3.38)
Proof. Whencomputingq
k+1
(j)fromsamplesinEquation(3.25),wehave
q
k+1
(j) =
U
τ
k
(r
j
)
P
n
i=1
U
τ
k
(r
i
)
(3.39)
85
astheprobabilitiesarereplacedbyrelativefrequencies. Weinsertthepolicy
π
θ
(u|s) =
2πσ
2
−
d
2
exp
−
(u−φ(s)
T
θ)
T
(u−φ(s)
T
θ)
2σ
2
!
, (3.40)
intoEquation(3.26). Bydifferentiatingwithrespecttoθ andequatingtheresulttozero,
weobtain
θ =
n
X
i=1
q
k+1
(i)φ(s
i
)φ(s
i
)
T
!
−1
n
X
i=1
q
k+1
(i)φ(s
i
)u
i
!
. (3.41)
In matrix vector form, this corresponds to Equation (3.33). Analogously, the reward
transformationisobtainedfromdifferentiationwithrespecttoτ as
n
X
i=1
q
k+1
(i)
∂
∂τ
logU
τ
(r
i
)=
n
X
i=1
q
k+1
(i)
τ
−1
−r
i
=0. (3.42)
whichresultsinEquation(3.38).
3.4 Evaluations
We evaluated our approach on two different simulated, physically realistic robots: (i) a
threedegree-of-freedom(DOF)planarrobotarmshowninFigure3.4(a)and(ii)aseven
DOF simulated SARCOS master robot arm – an implementation on the real, physical
SARCOSmasterrobotarm(Figure3.5(a))iscurrentlyinprogress.
86
(a) 3 DoF Robot Arm (b) Tracking Performance
0.44 0.48 0.52 0.56
0.04
0.06
0.08
0.1
0.12
0.14
0.16
Hand coordinate x
1
Hand coordinate x
2
desired learned
(c) Optimal vs Learned Motor Command
0 0.5 1 1.5 2
-10
0
10
20
30
40
50
60
Time t
Taskspace motor
commands u 1
u
1
1
u
1
2
u
1
3
optimal learned
Figure 3.4: (a) screen shot of the 3 DOF arm simulator, (b) near ideal tracking perfor-
mance for a planar figure-8 pattern for the 3 DOF arm, and (c) a comparison between
the analytically obtained optimal control commands to the learned ones for one figure-8
cycleofthe3DOFarmexhibitsthatanear-optimalpolicyisobtained.
Both experiments were conducted as follows: first, learning the forward models and
an initial control policy in each local model was obtained from random point-to-point
movements in joint space using a simple PD control law. This “motor babbling” ex-
ploration was necessary in order bootstrap learning with some initial data, as we would
otherwiseexperienceratherslowlearning,astypicallyobservedinsimilardirect-inverse
learning approaches (Jordan & Rumelhart, 1992). The measured end-effector accelera-
tions served as desired acceleration in Equation 3.9, and all other variables for learning
the local controllers were measurable as well. Subsequently, the learning controller was
usedon-policywiththenormallydistributedactuatornoiseservingasexploration.
Bothrobotslearnedtotrackdesiredtrajectorieswithhighaccuracy,asshowninFig-
ures 3.4 (b) and 3.5 (b). For the three DOF arm, we verified the quality of the learned
87
(a) SARCOS Master Robot Arm
(b) Tracking Performance
0.34 0.38
-0.1
-0.05
0
0.05
0.1
y-z plane
y
z
0.25 0.3 0.35 0.4 0.45 0.5 0.55
-0.1
-0.05
0
0.05
0.1
x-z plane
x
z
Figure 3.5: (a) Anthropomorphic Sarcos Master Arm, used as simulated system and
in progress of actual robot evaluations. (b) Tracking performance for a planar figure-8
patternforthesimulatedSarcosMasterarm.
controlcommandsincomparisontotheanalyticalsolution,giveninEquation(3.7): Fig-
ure 3.4 (c) demonstrates that the motor commands of the learned and analytically opti-
malcasearealmostidentical. LearningresultsofthesimulatedsevenDOFSarcosrobot
achieved almost the same end-effector tracking quality and is shown in Figure 3.4 (c).
Itexhibitsonlyslightlyincreasederrors,however,thejointcommandswerenotquiteas
close to the optimal ones as for the 3 DOF arm – the rather high dimensional learning
space of the 7 DOF arm most likely requires more extensive training and more careful
tuning of the LWPR learning algorithm to achieve local linearizations with very high
accuracy and with enough data to find the optimal solution. The 3 DOF required about
2 hours of real-time training, while setup was optimized for the 7 DOF arm where 60
88
minute run of real-time training was sufficient for achieving the quality exhibited on the
testtrajectoryinFigure3.5(b).
3.5 Conclusion&Discussion
ThecontributionsofthischapterareoutlinedinSection3.5.1. Theremainingopenissue
ofnon-constantmetricsisdiscussedinSection3.5.2.
3.5.1 ContributionsofthisChapter
In this thesis chapter, a general learning framework for operational space for redundant
robotshasbeenpresented,whichisprobablythefirstsuccessfulattemptoflearningsuch
control laws to date. We overcome the difficulties of having a non-convex data distribu-
tionbyonlylearninginthevincinityofalocalmodelanchoredbothinjointvelocityand
jointposition. Thelocalregionsareobtainedbylearningforwardmodels,whichpredict
the movement of the end-effector. The global consistency of the redundancy resolution
of the local model controllers is ensured through minimizing the cost function of oper-
ational space control. This cost function, derived in our previous work, is crucial to the
successofthisframeworkanditsabsencehasmostlikelybeenthereasonfortheabsence
oflearningoperationalspacecontrollerstodate. Theresultinglearningalgorithmforthe
localmodelscanbeunderstoodfromtwoperspective,i.e.,asaweightedregressionprob-
lemwhereweintendtomatchtherewardweightedmotorcommands(aftertransforming
89
thecostintoareward)orasareinforcementlearningproblemwhereweattempttomax-
imizeanimmediaterewardcriterion. Throughoutthisthesischapter,wehaveillustrated
theproblemsandadvantagesoflearningoperationalspacecontrolusingaprismatictwo
degrees of freedom robot arm as example. As application, we have shown a task-space
trajectory following on a three degrees of freedom rotary robot arm, where we could
exhibit near-perfect operational space tracking control. As robotics increasingly moves
away from the structured domains of industrial robotics towards complex robotic sys-
tems,whichbothareincreasinglyhigh-dimensionalandincreasinglyhardtomodel,such
as humanoid robots, the techniques and theory developed in this thesis chapter will be
beneficialindevelopingtrulyautonomousandself-tuningroboticsystems.
The work presented in this chapter up to this point has been presented in (Peters &
Schaal, 2006a, 2007a, 2007b) and the resulting journal paper is currently under review
at the International Journal of Robotics Research (IJRR) for the IJRR Special Issue on
RobotLearning.
3.5.2 FutureWork: UsingIntertia-basedMetricsinLearning
In Section 2.5.2, we have seen that the inertia-based metrics have a particular impor-
tance as they do not require special null-space laws in order to be infinite horizon op-
timal. Thus, in order to learn the resulting important control laws known from ana-
lytical robotics, i.e., Khatib-Gauss (Khatib, 1987) and Hsu-IDM Control Laws (Hsu
90
et al., 1989), our learning algorithm needs to be modified in order to be able to com-
putetheappropriaterewards. InSection3.3,wehaveassumedthattherewardr(u,q) =
exp
−u
T
N(q)u
canbecomputedwithoutdifficultywhichisthecase,e.g.,forN(q) =
const. However,thisisnotthecaseformetricsintheformN(q) =M
−n
(q)asthesere-
quire the exact determination of the expensive and error-prone inertia tensor. Therefore,
whentryingtolearnanoperationalspacecontrollerwiththiskindofametric,wewould
run into the same kind of difficulties as analytical approaches with modeling errors, or,
at least, learn a different control law, which does not fully realize the interesting proper-
tiesofthedesiredcontrollaw,e.g.,theKhatib-Gausscontrollaw. Nevertheless,through
a reformulation of the learning problem, we can compute the reward without explicitly
usingtheinertiatensorwhenemployingaforward-inversemodelingapproachsimilarto
(Harunoetal.,1999). Forthisreformulation,werealizefromEquation(3.1)that
M
−1
u
1
=¨ q−M
−1
(F+u
0
) =¨ q−g
β
(q,˙ q,u
0
)≡δ¨ q, (3.43)
where¨ q = g
β
(q,˙ q,u) denotes a learned forward model (or predictor) which predicts
acceleration ¨ q for a given motor command u = u
0
+u
1
at the joint positions q and
velocities ˙ q. Using this motor command induced acceleration differenceδ¨ q, we can
determinetherewardsforKhatib-Gauss andHsu-IDMcontrollawsby
r
K
(u) = exp(−u
T
1
M
−1
u
1
) = exp(−u
T
1
δ¨ q), (3.44)
r
H
(u) = exp(−u
T
1
M
−2
u
1
) = exp(−δ¨ q
T
δ¨ q), (3.45)
91
respectively. This approach has been tested successfully on the prismatic robot, but re-
quiresfurtherevaluationforinterestingrobotsystems.
92
Chapter4
PolicyGradientMethodsforMotorPrimitivesand
Robotics
Eachproblem thatIsolvedbecamearule
whichservedafterwardsto solveotherproblems.
ReneDescartes(Frenchphilosopher and scientist,1596-1650),
One of the major challenges in both action generation for robotics and in the under-
standing of human motor control is to learn the “building blocks of movement genera-
tion”,calledmotorprimitives. Motorprimitives,asusedinthisthesischapter,areparam-
eterized control policies such as splines or nonlinear differential equations with desired
attractorpropertiesandcorrespondtothedesiredbehavioursintheformA¨ x =bwhich
we have been using in the two preceeding chapters. While a lot of progress has been
93
made in teaching parameterized motor primitives using supervised or imitation learn-
ing, the self-improvement by interaction of the system with the environment remains a
challengingproblem.
Inthisthesischapter,weevaluatedifferentreinforcementlearningapproachesforim-
proving the performance of parameterized motor primitives. For pursuing this goal, we
highlight the difficulties with current reinforcement learning methods, and outline both
established and novel algorithms for the gradient-based improvement of parameterized
policies. We compare these algorithms in the context of motor primitive learning, and
show that our most modern algorithm, the Episodic Natural Actor-Critic outperforms
previous algorithms by at least an order of magnitude. We demonstrate the efficiency of
this reinforcement learning method in the application of learning to hit a baseball with
ananthropomorphicrobotarm.
4.1 Introduction
In order to ever leave the well-structured environments of factory floors and research
labs, future robots will require the ability to aquire novel behaviors and motor skills as
wellastoimproveexistingonesbasedonrewardsandcosts. Similarly,theunderstanding
of human motor control would benefit significantly if we can synthesize simulated hu-
man behavior and its underlying cost functions based on insight from machine learning
and biological inspirations. Reinforcement learning is probably the most general frame-
work in which such learning problems of computational motor control can be phrased.
94
However, in order to bring reinforcement learning into the domain of human movement
learning, two deciding components need to be added to the standard framework of re-
inforcement learning: first, we need a domain-specific policy representation for motor
skills, and, second, we need reinforcement learning algorithms which work efficiently
with this representation while scaling into the domain of high-dimensional mechanical
systemssuchashumanoidrobots.
Traditional representations of motor behaviors in robotics are mostly based on de-
sired trajectories generated from spline interpolations between points, i.e., spline nodes,
which are part of a longer sequence of intermediate target points on the way to a final
movement goal. While such a representation is easy to understand, the resulting control
policies, generated from a tracking controller of the spline trajectories, have a variety
of significant disadvantages, including that they are time-indexed and thus not robust
towards unforeseen disturbances, that they do not easily generalize to new bahavioral
sitations without complete recomputing of the spline, and that they cannot easily be co-
ordinated with other events in the environment, e.g., synchronized with other sensory
variableslikevisualperceptionduringcatchingaball. Intheliterature,avarietyofother
approaches for parameterizing motor primitives have been suggested to overcome these
problems,see(Ijspeertetal.,2002,2003)formoreinformation. Oneoftheseapproaches
proposed to use parameterized nonlinear dynamical systems as motor primitives, where
theattractorpropertiesofthesedynamicalsystemsdefinedthedesiredbehavior(Ijspeert
et al., 2002, 2003). The resulting framework was particularly well suited for supervised
95
imitationlearninginrobotics,exemplifiedbyexamplesfromhumanoidroboticswherea
full-body humanoid learned tennis swings or complex polyrhythmic drumming pattern.
One goal of this thesis chapter is the application of reinforcement learning to both tra-
ditional spline-based representations as well as the more novel dynamic system based
approach.
However, despite that reinforcement learning is the most general framework for dis-
cussing the learning of motor primitives for robotics, most of the methods proposed in
the reinforcement learning community are not applicable to high-dimensional systems
such as humanoid robots as these methods do not scale beyond systems with more than
three or four degrees of freedom and/or cannot deal with parameterized policies. Policy
gradient methods are a notable exception to this statement. Starting with the pioneering
work
1
of Gullapali and colleagues (Benbrahim & Franklin, 1997; Gullapalli, Franklin,
& Benbrahim, 1994) in the early 1990s, these methods have been applied to a variety of
robot learning problems ranging from simple control tasks (e.g., balancing a ball-on a
beam(Benbrahim,Doleac,Franklin,&Selfridge,1992),andpole-balancing(Kimura&
Kobayashi,1998))tocomplexlearningtasksinvolvingmanydegreesoffreedomsuchas
learningofcomplexmotorskills(Petersetal.,2005;Peters&Schaal,2006b;Miyamoto
etal.,1995,1996;Gullapallietal.,1994;Mitsunaga,Smith,Kanda,Ishiguro,&Hagita,
2005) and locomotion (Kimura & Kobayashi, 1997; Sato, Nakamura, & Ishii, 2002;
1
Note that there has been earlier work in the control community,see e.g., (Jacobson & Mayne, 1970;
Dyer&McReynolds,1970;Hasdorff,1976).,whichisbasedonexactanalyticalmodels. Extensionsbased
on learned, approximate models originated in the public policy literature, see (Werbos, 1979), and have
also been applied in control (Atkeson, 1994; Morimoto & Atkeson, 2003). In this thesis chapter, we limit
ourselvestomodel-freeapproachesinordertoavoidhavingtotreatmodel-relatedapproaches.
96
Kohl & Stone, 2004; Endo, Morimoto, Matsubara, Nakanishi, & Cheng, 2005; Tedrake,
Zhang, & Seung, 2005; Mori, Nakamura, Sato, & Ishii, 2004; Nakamura, Mori, & Ishii,
2004).
The advantages of policy gradient methods for parameterized motor primitives are
numerous. Among the most important ones are that the policy representation can be
chosen such that it is meaningful for the task, i.e., we can use a suitable motor primi-
tive representation, and that domain knowledge can be incorporated, which often leads
to fewer parameters in the learning process in comparison to traditional value-function
basedapproaches. Moreover,thereexistsavarietyofdifferentalgorithmsforpolicygra-
dient estimation in the literature, which have a rather strong theoretical underpinning.
Additionally, policy gradient methods can be used model-free and therefore also be ap-
pliedtoproblemswithoutanalyticallyknowntaskandrewardmodels.
Nevertheless,manyrecentpublicationsonapplicationsofpolicygradientmethodsin
robotics overlooked the newest developments in policy gradient theory and its original
roots in the literature. Thus, a large number of heuristic applications of policy gradients
can be found, where the success of the projects mainly relied on ingenious initializa-
tions and manual parameter tuning of algorithms. A closer inspection often reveals that
the chosen methods might be highly biased, or even generate infeasible policies under
less fortunate parameter settings, which could lead to unsafe operation of a robot. The
main goal of this thesis chapter is to review which policy gradient methods are applica-
bletoroboticsandwhichissuesmatter,whilealsointroducingsomenewpolicygradient
97
learning algorithms that seem to have superior performance over previously suggested
methods. The remainder of this thesis chapter will proceed as follows: firstly, we will
introduce the general assumptions of reinforcement learning, discuss motor primitives
in this framework and pose the problem statement of this thesis chapter. Secondly, we
will discuss the different approaches to policy gradient estimation and discuss their ap-
plicability to reinforcement learning of motor primitives. We focus on the most useful
methodsanddiscussseveralalgorithmsin-depth. Thepresentedalgorithmsinthisthesis
chapter are highly optimized versions of both novel and previous policy gradient algo-
rithms. Thirdly, we show how these methods can be applied to motor skill learning in
robotics and show learning results with a seven degrees of freedom, anthropomorphic
SARCOSMasterArm.
4.1.1 GeneralAssumptionsandProblemStatement
Mostroboticsdomainsrequirethestatespaceandtheactionspacestobecontinuousand
high-dimensional such that learning methods based on discretizations are not applicable
forhigherdimensionalsystems. However,asthepolicyisusuallyimplementedonadig-
italcomputer,weassumethatwecanmodelthecontrolsysteminadiscrete-timemanner
and we will denote the current time step byk. In order to take possible stochasticity of
theplantintoaccount,wedenoteitusingaprobabilitydistribution
x
k+1
∼p(x
k+1
|x
k
,u
k
) (4.1)
98
as model whereu
k
∈ R
M
denotes the current action, andx
k
, x
k+1
∈ R
N
denote the
current and next state, respectively. We furthermore assume that actions are generated
byapolicy
u
k
∼π
θ
(u
k
|x
k
) (4.2)
which is modeled as a probability distribution in order to incorporate exploratory ac-
tions; for some special problems, the optimal solution to a control problem is actually a
stochasticcontroller,seee.g.,(Sutton,McAllester,Singh,&Mansour,2000). Thepolicy
is parameterized by some policy parametersθ∈ R
K
and assumed to be continuously
differentiable with respect to its parametersθ. The sequence of states and actions forms
atrajectory(alsocalledhistoryorroll-out)denotedbyτ = [x
0:H
,u
0:H
]whereH denotes
thehorizonwhichcanbeinfinite. Ateachinstantoftime,thelearningsystemreceivesa
rewarddenotedbyr(x
k
,u
k
)∈R.
The general goal of policy optimization in reinforcement learning is to optimize the
policyparametersθ∈R
K
sothattheexpectedreturn
J(θ) =
1
a
Σ
E
n
X
H
k=0
a
k
r
k
o
(4.3)
is optimized where a
k
denote time-step dependent weighting factors and a
Σ
is a nor-
malization factor in order to make sure the weights sum up to one. We require that the
weightingfactorsfulfilla
i+j
=a
i
a
j
inordertobeabletoconnecttothepreviouspolicy
99
gradientliterature;examplesaretheweightsa
k
=γ
k
fordiscountedreinforcementlearn-
ing (whereγ is in [0,1]) wherea
Σ
= 1/(1−γ); alternatively, they are set toa
k
= 1 for
the average reward case wherea
Σ
=H. In these cases, we can rewrite the a normalized
expectedreturn intheform
J(θ) =
Z
X
d
π
(x)
Z
U
π(u|x)r(x,u)dxdu (4.4)
used by Sutton et al. (2000), where d
π
(x) = a
−1
Σ
P
∞
k=0
a
k
p(x
k
= x) is the weighted
statedistribution
2
.
In general, we assume that for each considered policy π
θ
, a state value function
V
π
(x,k),andthestate-actionvaluefunctionQ
π
(x,u,k)exist
3
andaregivenby
V
π
(x,k) =E
n
P
H
i=k
a
i
r
i
x
k
=x
o
, (4.5)
Q
π
(x,u,k) =E
n
P
H
i=k
a
i
r
i
x
k
=x,u
k
=u
o
. (4.6)
In the infinite horizon case, i.e., for H →∞, we write V
π
(x) and Q
π
(x,u) as these
functions have become time-invariant. Note, that we can define the expected return in
termsofthestatevaluefunctionby
J(θ) =
Z
X
p(x
0
)V
π
(x
0
,0)dx
0
, (4.7)
2
In most cases, e.g., for a
k
= γ
k
, this distribution is a multi-modal mixture distribution even if the
distributionp(x
k
= x)isaunimodal. Onlyfora
k
= 1,thestateweighteddistributiond
π
(x)willconverge
tothestationarydistribution.
3
Notethatlearningincaseswheresuchfunctionsdonotexistisusuallyprohibitivelydifficult.
100
wherep(x
0
) is the probability ofx
0
being the start-state. Whenever we make practical
use of the value function, we assume that we are given good basis functionsφ(x) so
that the state-value function can be approximated with linear function approximation
V
π
(x) =φ(x)
T
v.
4.1.2 MotorPrimitivePolicies
In this section, we first discuss how motor plans can be represented and then how we
can bring these into the standard reinforcement learning framework. For this purpose,
we consider two forms of motor plans, i.e., (1) spline-based trajectory plans and (2)
nonlinear dynamic motor primitives introduced in (Ijspeert et al., 2002). Spline-based
trajectory planning is well-known in the robotics literature, see e.g., (Sciavicco & Si-
ciliano, 2007; Miyamoto et al., 1996). A desired trajectory is represented by piecewise
connectedpolynomials,e.g.,wehave
y
i
(t) =θ
0i
+θ
1i
t+θ
2i
t
2
+θ
3i
t
3
(4.8)
int∈ [t
i
,t
i+1
]undertheconstraintsthatboth
y
i
(t
i+1
) =y
i+1
(t
i+1
) and ˙ y
i
(t
i+1
) = ˙ y
i+1
(t
i+1
). (4.9)
A given tracking controller, e.g., a PD control law or an inverse dynamics controller,
ensures that the trajectory is tracked well. For nonlinear dynamic motor primitives, we
101
use the approach developed in (Ijspeert et al., 2002) where movement plans (q
d
, ˙ q
d
) for
eachdegreeoffreedom(DOF)oftherobotarerepresentedintermsofthetimeevolution
ofthenonlineardynamicalsystems
¨ q
d,k
=h(q
d,k
,z
k
,g
k
,τ,θ
k
) (4.10)
where (q
d,k
, ˙ q
d,k
) denote the desired position and velocity of a joint,z
k
the internal state
ofthedynamicsystem,g
k
thegoal(orpointattractor)stateofeachDOF,τ themovement
durationsharedbyallDOFs,andθ
k
theopenparametersofthefunctionh. Theequations
used in order to create Equation (4.10) are given in Appendix A.5. The original work
in (Ijspeert et al., 2002) demonstrated how the parametersθ
k
can be learned to match a
templatetrajectorybymeansofsupervisedlearning–thisscenariois,forinstance,useful
as the first step of an imitation learning system. Here, we will add the ability of self-
improvement of the movement primitives in Equation (4.10) by means of reinforcement
learning, which is the crucial second step in imitation learning. The system in Equation
(4.10)isapoint-to-pointmovement,i.e.,thistaskisratherwellsuitedfortheintroduced
episodicreinforcementlearningmethods.
Inordertomakethereinforcementframeworkfeasibleforlearningmotorprimitives,
weneedtoaddexplorationtotherespectivemotorprimitiveframework,i.e.,weneedto
102
add a small perturbationǫ
d,k
∼N (0,σ
2
) so that the nominal target output ¨ q
d,k
becomes
theperturbedtargetoutput
¨
ˆ q
d,k
= ¨ q
d,k
+ǫ
d,k
. Bydoingso,weobtainastochasticpolicy
π(
¨
ˆ q
d,k
|q
d,k
,z
k
,g
k
,τ,θ
k
) =
1
√
2πσ
2
exp
−
(
¨
ˆ q
d,k
− ¨ q
d,k
)
2
2σ
2
!
. (4.11)
This policy will be used throughout the thesis chapter. It is particularly practical as the
explorationcanbeeasilycontrolledthroughonlyonevariableσ.
4.2 PolicyGradientApproachesforParameterizedMotor
Primitives
Thegeneralgoalofpolicyoptimizationinreinforcementlearningistooptimizethepol-
icyparametersθ∈R
K
sothattheexpectedreturnJ(θ)ismaximal. Formotorprimitive
learning in robotics, we require that any change to the policy parameterization has to be
smooth as drastic changes can be hazardous for the robot and its environnment. Also, it
wouldrenderinitializationsofthepolicybasedondomainknowledgeorimitationlearn-
ing useless, as these would otherwise vanish after a single update step. Additionally, we
Table4.1: Generalsetupforpolicygradientreinforcementlearning.
input:initialpolicyparameterizationθ
0
.
1 repeat
2 obtainpolicygradientg fromestimator(seeTables4.2-4.7)
3 updatepolicyθ
h+1
=θ
h
+α
h
g.
4 untilpolicyparameterizationθ
h
≈θ
h+1
converges
return: optimalpolicyparametersθ
∗
=θ
h+1
.
103
needtoguaranteethatthepolicyisimprovedintheupdatestepsatleastonaveragewhich
rulesoutgreedyvaluefunctionbasedmethods. Forthesereasons,policygradientmeth-
ods which follow the steepest descent on the expected return are the method of choice.
Thesemethodsupdatethepolicyparameterizationaccordingtothegradientupdaterule
θ
h+1
=θ
h
+α
h
∇
θ
J|
θ=θ
h
, (4.12)
whereα
h
∈R
+
denotesalearningrate. Ifthegradientestimateisunbiasedandlearning
ratesfulfill
∞
X
h=0
α
h
> 0and
∞
X
h=0
α
2
h
= 0, (4.13)
thelearningprocessisguaranteedtoconvergetoatleastalocalminimum.
The main problem in policy gradient methods is to obtain a good estimator of the
policy gradient∇
θ
J|
θ=θ
h
. Traditionally, people have used deterministic model-based
methods for obtaining the gradient (Jacobson & Mayne, 1970; Dyer & McReynolds,
1970; Hasdorff, 1976). However, in order to become autonomous we cannot expect to
be able to model every detail of the robot and environment appropriately. Therefore, we
need to estimate the policy gradient only from data generated during the execution of a
task,i.e.,withouttheneedforamodel. Inthissection,wewillstudydifferentapproaches
anddiscusswhichoftheseareusefulinrobotics.
Theliteratureonpolicygradientmethodshasyieldedavarietyofestimationmethods
overthelastyears. Themostprominentapproaches,whichhavebeenappliedtorobotics
104
are finite-difference and likelihood ratio methods, more well-known as REINFORCE
methodsinreinforcementlearning.
4.2.1 Finite-differenceMethods
Finite-difference methods are among the oldest policy gradient approaches dating back
to the 1950s; they originated from the stochastic simulation community and are quite
straightforwardtounderstand. Thepolicyparameterizationisvariedbysmallincrements
Δθ
i
and for each policy parameter variationθ
h
+Δθ
i
roll-outs are performed which
generateestimatesΔ
ˆ
J
j
≈J(θ
h
+Δθ
i
)−J
ref
oftheexpectedreturn. Therearedifferent
waysofchoosingthereferencevalueJ
ref
,e.g. forward-difference estimatorswithJ
ref
=
J(θ
h
) and central-difference estimators with J
ref
= J(θ
h
−Δθ
i
). The most general
way is to formulate the determination of the reference valueJ
ref
and the policy gradient
estimateg
FD
≈∇
θ
J|
θ=θ
h
asaregressionproblemwhichcanbesolvedby
g
T
FD
,J
ref
T
=
ΔΘ
T
ΔΘ
−1
ΔΘ
T
ˆ
J, (4.14)
where
ΔΘ =
Δθ
1
, ..., Δθ
I
1, ..., 1
T
, and (4.15)
ˆ
J = [
ˆ
J
1
,...,
ˆ
J
I
]
T
, (4.16)
105
Table4.2: Finitedifferencegradientestimator.
input:policyparameterizationθ.
1 repeat
2 generatepolicyvariationΔθ
i
.
3 estimateJ(θ +Δθ
i
)≈
ˆ
J
j
=
P
H
k=0
a
k
r
k
fromrollouts.
4 computegradient
g
T
FD
,J
ref
T
=
ΔΘ
T
ΔΘ
−1
ΔΘ
T
ˆ
J.
withΔΘ
T
=
Δθ
1
, ..., Δθ
I
1, ..., 1
,
and
ˆ
J
T
= [
ˆ
J
1
,...,
ˆ
J
I
].
5 untilgradientestimateg
FD
converged.
return: gradientestimateg
FD
.
denote the I samples. If single parameters are perturbed, this method is known as the
Kiefer-Wolfowitz procedure and if multiple parameters are perturbed simultaneously,
it is known as Simultaneuous Perturbation Stochastic gradient Approximation (SPSA),
see (Sadegh & Spall, 1997; Spall, 2003) for in-depth treatment. This approach can been
highlyefficientinsimulationoptimizationofdeterministicsystems(Spall,2003)orwhen
acommonhistoryofrandomnumbers(Glynn,1987;Kleinman,Spall,&Naiman,1999)
is being used (the later trick is known as the PEGASUS method in reinforcement learn-
ing, see (Ng & Jordan, 2000)), and can get close to a convergence rate of O(I
−1/2
)
(Glynn,1987). However,whenusedonarealsystem,theuncertainitiesdegradetheper-
formanceresultinginconvergenceratesrangingbetweenO(I
−1/4
)toO(I
−2/5
)depend-
ingonthechosenreferencevalue(Glynn,1987). Animplementationofthisalgorithmis
showninTable4.2.
Due to the simplicity of this approach, such methods have been successfully ap-
plied to robot motor skill learning in numerous applications (Miyamoto et al., 1995,
1996; Tedrake et al., 2005; Kohl & Stone, 2004; Mitsunaga et al., 2005). However, the
106
Table 4.3: General likelihood ratio policy gradient estimator “Episodic REINFORCE”
withanoptimalbaseline.
input:policyparameterizationθ.
1 repeat
2 performatrialandobtainx
0:H
,u
0:H
,r
0:H
3 foreachgradientelementg
h
4 estimateoptimalbaseline
b
h
=
D
(
P
H
k=0
∇
θ
h
logπ
θ
(u
k
|x
k
))
2P
H
l=0
a
l
r
l
E
D
(
P
H
k=0
∇
θ
h
logπ
θ
(u
k
|x
k
))
2
E
5 estimatethegradientelement
g
h
=
D
P
H
k=0
∇
θ
h
logπ
θ
(u
k
|x
k
)
P
H
l=0
a
l
r
l
−b
h
E
.
6 endfor.
7 untilgradientestimateg
RF
converged.
return: gradientestimateg
RF
.
straightforward application is not without peril as the generation of the Δθ
j
requires
properknowledgeonthesystem,asbadlychosenΔθ
j
candestabilizethepolicysothat
thesystembecomesinstableandthegradientestimationprocessispronetofail. Evenin
thefieldofsimulationoptimizationwherethedestabilizationofthesystemisnotsucha
dangerousissue,thecarefulgenerationoftheparameterperturbationisatopicofdebate
with strong requirements on the generating process (Sadegh & Spall, 1997). Practical
problemsoftenrequirethateachelementofthevectorΔθ
j
hasadifferentorderofmag-
nitude, making the generation particularly difficult. Therefore, this approach can only
appliedunderstricthumansupervision.
4.2.2 LikelihoodRatioMethodsandREINFORCE
Likelihood ratio methods are driven by an important different insight. Assume that tra-
jectoriesτ are generated from a system by roll-outs, i.e.,τ ∼ p
θ
(τ) = p(τ|θ) with
107
rewards r(τ) =
P
H
k=0
a
k
r
k
. In this case, the policy gradient can be estimated using
the likelihood ratio trick, see e.g. (Aleksandrov, Sysoyev, & Shemeneva, 1968; Glynn,
1987),orREINFORCEtrick(Williams,1992),i.e.,wecanrewritethegradientby
∇
θ
J (θ) =
Z
T
∇
θ
p
θ
(τ)r(τ)dτ =
Z
T
p
θ
(τ)∇
θ
logp
θ
(τ)r(τ)dτ, (4.17)
=E{∇
θ
logp
θ
(τ)r(τ)}.
Importantly, the derivative∇
θ
logp
θ
(τ) can be computed without knowleged of the
generatingdistributionp
θ
(τ)as
p
θ
(τ) =p(x
0
)
Y
H
k=0
p(x
k+1
|x
k
,u
k
)π
θ
(u
k
|x
k
) (4.18)
impliesthat
∇
θ
logp
θ
(τ) =
X
H
k=0
∇
θ
logπ
θ
(u
k
|x
k
), (4.19)
i.e.,thederivativesthroughthecontrolsystemdonothavetobecomputed
4
. As
Z
T
p
θ
(τ)∇
θ
logp
θ
(τ)dτ =
Z
T
∇
θ
p
θ
(τ)dτ =∇
θ
1 = 0, (4.20)
4
Thisresultmakesanimportantdifference: instochasticsystemoptimization,finitedifferenceestima-
tors are often prefered as the derivative through system is required but not known. In policy search, we
always know the derivative of the policy with respect to its parameters and therefore we can make use of
thetheoreticaladvantagesoflikelihoodratiogradientestimators.
108
aconstantbaselinecanbeinsertedresultingintothegradientestimator
∇
θ
J (θ) =E{∇
θ
logp
θ
(τ)(r(τ)−b)}, (4.21)
whereb∈Rcanbechosenarbitrarily(Williams,1992)butusuallywiththegoaltomin-
imize the variance of the gradient estimator. Note that the baseline was most likely first
suggestedbyWilliams(1992)andisuniquetoreinforcementlearningasitrequiresasep-
arationofthepolicyfromthestate-transitionprobabilitydensities. Therefore,thegeneral
path likelihood ratio estimator or episodic REINFORCE gradient estimator (Williams,
1992)isgivenby
g
RF
=
D
X
H
k=0
∇
θ
logπ
θ
(u
k
|x
k
)
X
H
l=0
a
l
r
l
−b
E
, (4.22)
wherehf (τ)i =
R
T
f(τ)dτ denotes the average over trajectories. This type of method
is guaranteed to converge to the true gradient at the fastest theoretically possible pace
ofO(I
−1/2
) whereI denotes the number of roll-outs (Glynn, 1987) even if the data is
generated from a highly stochastic system. An implementation of this algorithm will be
showninTable4.3togetherwiththeestimatorfortheoptimalbaseline.
Besides the theoretically faster convergence rate, likelihood ratio gradient methods
have a variety of advantages in comparison to finite difference methods. As the gen-
eration of policy parameter variations is no longer needed, the complicated control of
these variables can no longer endanger the gradient estimation process. Furthermore, in
109
practice, already a single roll-out can suffice for an unbiased gradient estimate (Baxter
& Bartlett, 2001; Spall, 2003) viable for a good policy update step, thus reducing the
amountofroll-outsneeded. Finally,thisapproachhasyieldedthemostreal-worldrobot
motorlearningresults(Nakamuraetal.,2004;Morietal.,2004;Endoetal.,2005;Ben-
brahim & Franklin, 1997; Kimura & Kobayashi, 1997; Peters et al., 2005; Gullapalli
et al., 1994). In the subsequent two sections, we will strive to explain and improve this
typeofgradientestimator.
4.3 ‘Vanilla’PolicyGradientApproaches
Despite the fast asymptotic convergence speed of the gradient estimate, the variance of
thelikelihood-ratiogradientestimatorcanbeproblematicintheoryaswellasinpractice.
Thiscanbeillustratedstraightforwardlywithanexample.
Example4.1 When using a REINFORCE estimator with a baselineb = 0 in a scenario
wherethereisonlyasinglerewardofalwaysthesamemagnitude,e.g.,r(x,u) =c∈R
for allx,u, then the variance of the gradient estimate will grow at least cubically with
thelength oftheplanning horizonH as
Var{g
RF
} =H
2
c
2
X
H
k=0
Var{∇
θ
logπ
θ
(u
k
|x
k
)}, (4.23)
if Var{∇
θ
logπ
θ
(u
k
|x
k
)} > 0 for all k. Furthemore, it will also grow quadratically
with themagnitude oftherewardc.
110
Forthisreason,weneedtoaddressthisissueandwewilldiscussseveraladvancesin
likelihoodratiopolicygradientoptimization,i.e.,thepolicygradienttheorem/GPOMDP,
optimalbaselinesandthecompatiblefunctionapproximation.
4.3.1 PolicygradienttheoremandG(PO)MDP
The trivial observation that future actions do not depend on past rewards (unless policy
changestakeplacecontinuouslyduringthetrajectory)canresultinasignificantreduction
ofthevarianceofthepolicygradientestimate. Thisinsightcanbeformalizedas
E{∇
θ
logπ
θ
(u
l
|x
l
)r
k
} =E{E{∇
θ
logπ
θ
(u
l
|x
l
)|x
k
}
| {z }
=0
r
k
} = 0 (4.24)
forl > k, which is straightforward to verify. This allows two variations of the previous
algorithmwhichareknownasthepolicygradienttheorem(Suttonetal.,2000)
g
PGT
=
D
X
H
k=0
a
k
∇
θ
logπ
θ
(u
k
|x
k
)
X
H
l=k
a
l−k
r
l
−b
k
E
, (4.25)
orG(PO)MD(Baxter&Bartlett,2001)
g
GMDP
=
D
X
H
l=0
X
l
k=0
∇
θ
logπ
θ
(u
k
|x
k
)
(a
l
r
l
−b
l
)
E
, (4.26)
111
While these algorithms look different, they are exactly equivalent in their gradient esti-
mate
5
,i.e.,
g
PGT
=g
GMPD
, (4.27)
which is a direct result from the summation theorem (Vachenauer, Rade, & Westergren,
2000) and from the factthat theycanboth derived from REINFORCE.The G(PO)MDP
formulationhaspreviouslybeenderivedinthesimulationoptimizationcommunity(Glynn,
1990). Animplementationofthisalgorithmisshowntogetherwiththeoptimalbaseline
inTable4.4.
These two forms originally puzzled the community as they were derived from two
separate points of view (Sutton et al., 2000; Baxter & Bartlett, 2001) and seemed to be
different on first inspection. While their equality is natural when taking the path-based
perspective, we will obtain the forms proposed in the original sources in a few steps.
First,letusnotethatinEquation(4.25)theterm
P
∞
l=k
a
l
r
l
inthepolicygradienttheorem
isequivalenttoamonte-carloestimateofthevaluefunctionQ
π
(x,u). Thus,weobtain
g
PGT
=
Z
X
d
π
(x)
Z
U
∇
θ
π
θ
(u|x)(Q
π
(x,u)−b(x))dudx, (4.28)
for normalized weightings with infinite horizons (e.g., using the discounted or the aver-
age reward case) and while employing the value function instead of the sum. This form
5
Note that (Baxter & Bartlett, 2001) additionally add an eligibility trick for reweighting trajectory
pieces. Thistrickcanbehighlydangerousinrobotics;itcanbedemonstratedthateveninlinear-quadratic
regulation, this trick can result in convergence to the worst possible policies for small planning horizons
(i.e.,smalleligibilityrates).
112
hasasignificantadvantageoverREINFORCE-likeexpressions,i.e.,itisobviousthatthe
variancedoesnotgrowwiththeplanninghorizonifagoodestimateofQ
π
(x,u)isgiven,
e.g., using traditional value function methods. Thus, the counterexample from Example
4.1doesnolongerapply. Similarly,theterm
P
l
k=0
∇
θ
logπ
θ
(u
k
|x
k
)becomesthelog-
derivativeofthedistributionofstatesμ
π
k
(x) =p(x =x
k
)atstepk inexpectation,i.e.,
∇
θ
logd
π
(x) =
X
H
l=0
a
l
∇
θ
logμ
π
k
(x) =
X
H
l=0
a
l
X
l
k=0
∇
θ
logπ
θ
(u
k
|x
k
),
(4.29)
whichthencanbeusedtorewritetheG(PO)MDPestimatorintostate-spaceform,i.e.,
g
GMDP
=
Z
X
Z
U
(π
θ
(u|x)∇
θ
d
π
(x)+d
π
(x)∇
θ
π
θ
(u|x))(r(x,u)−b)dudx.
(4.30)
Note that this form only allows a baseline which is independent of the state unlike the
policy gradient theorem. When either of the state-action value function or the state dis-
tribution derivative can be easily obtained by derivation or estimation, the variance of
the gradient can be reduced significantly.Without a formal derivation of it, the policy
gradient theorem has been applied in (Gullapalli, 1991; Kimura & Kobayashi, 1997)
using estimated value functionsQ
π
(x,u) instead of the term
P
H
l=k
a
l
r
l
and a baseline
b
k
= V
π
(x
k
,k). Note that the version introduced in (Kimura & Kobayashi, 1997) is
biased
6
anddoesnotcorrespondtothecorrectgradientunlike(Gullapalli,1991).
6
SeeAppendixA.3formoreinformation.
113
Table 4.4: Specialized likelihood ratio policy gradient estimator “G(PO)MDP”/Policy
Gradientwithanoptimalbaseline.
input:policyparameterizationθ.
1 repeat
2 performtrialsandobtainx
0:H
,u
0:H
,r
0:H
3 foreachgradientelementg
h
4 foreachtimestepk
estimatebaselinefortimestepk by
b
h
k
=
D
(
P
k
κ=0
∇
θ
h
logπ
θ
(uκ|xκ))
2
a
k
r
k
E
D
(
P
k
κ=0
∇
θ
h
logπ
θ
(uκ|xκ))
2
E
5 endfor.
6 estimatethegradientelement
g
h
=
D
P
H
l=0
P
l
k=0
∇
θ
h
logπ
θ
(u
k
|x
k
)
a
l
r
l
−b
h
l
E
.
7 endfor.
8 untilgradientestimateg
GMDP
converged.
return: gradientestimateg
GMDP
.
Note that the formulation over paths can be used in a more general fashion than
the state-action form, e.g., it allows derivations for non-stationary policies, rewards and
systems, than the state-action formulation in the paragraph above. However, for some
results, it is more convenient to use the state-action based formulation and there we will
makeuseofit.
4.3.2 OptimalBaselines
Above,wehavealreadyintroducedtheconceptofabaselinewhichcandecreasethevari-
anceofapolicygradientestimatebyordersofmagnitude. Thus,anoptimalselectionof
such a baseline is essential. An optimal baseline minimizes the varianceσ
2
h
= Var{g
h
}
ofeachelementg
h
ofthegradientg without biasingthegradientestimate,i.e.,violating
114
E{g} =∇
θ
J. This can be phrased as having a seperate baselineb
h
for every element
ofthegradient
7
,i.e.,wehave
min
b
h
σ
2
h
= Var{g
h
}, (4.31)
s.t.E{g
h
} =∇
θ
h
J. (4.32)
Duetotherequirementofunbiasednessofthegradientestimate,wehaveσ
2
h
=E{g
2
h
}−
(∇
θ
h
J)
2
,anddueto
min
b
h
σ
2
h
≥E
min
b
h
g
2
h
−(∇
θ
h
J)
2
, (4.33)
theoptimalbaselineforeachgradientelementg
h
canalwaysbegivenby
b
h
=
P
H
k=0
∇
θ
h
logπ
θ
(u
k
|x
k
)
2
P
H
l=0
a
l
r
l
P
H
k=0
∇
θ
h
logπ
θ
(u
k
|x
k
)
2
(4.34)
for the general likelihood ratio gradient estimator, i.e., Episodic REINFORCE. The al-
gorithmic form of the optimal baseline is shown in Table 4.3 in line 4. If the sums in
the baselines are modified appropriately, we can obtain the optimal baseline for the pol-
icygradient theorem or G(PO)MPD.We onlyshowG(PO)MDP in this thesis chapter in
Table4.4asthepolicygradienttheoremisnumericallyequivalent.
7
A single baseline for all parameters can also be obtained and is more common in the reinforcement
learning literature (Weaver & Tao, 2001a; Greensmith, Bartlett, & Baxter, 2004; Weaver & Tao, 2001b;
Williams, 1992; Lawrence, Cowan, & Russell, 2003; Greensmith, Bartlett, & Baxter, 2001). However,
suchabaselineisofcoursesuboptimal.
115
TheoptimalbaselinewhichdoesnotbiasthegradientinEpisodicREINFORCEcan
only be a single number for all trajectories and in G(PO)MPD it can also depend on the
time-step (Peters, 2005). However, in the policy gradient theorem it can depend on the
currentstateand,therefore,ifagoodparameterizationforthebaselineisknown,e.g.,in
ageneralizedlinearformb(x
k
) =φ(x
k
)
T
ω,thiscansignificantlyimprovethegradient
estimation process. However, the selection of the basis functionsφ(x
k
) can be difficult
and often impractical in practice. See (Weaver & Tao, 2001a; Greensmith et al., 2004;
Weaver & Tao, 2001b; Williams, 1992; Lawrence et al., 2003; Greensmith et al., 2001)
formoreinformationonthistopic.
4.3.3 CompatibleFunctionApproximation
Aswepreviouslydiscussed,thelargestsourceofvarianceintheformulationofEquation
(4.28) is the state-action value functionQ
π
(x,u), especially if the functionQ
π
(x,u) is
approximatedbyrolloutsasinthiscontext. Thenaturalalternativeofusingapproximate
value functions is problematic as these introduce bias in presence of imperfect basis
function. However, as demonstrated in (Sutton et al., 2000) and (Konda & Tsitsiklis,
2000)thetermQ
π
(x,u)−b
π
(x)canbereplacedbyacompatiblefunctionapproximation
f
π
w
(x,u) = (∇
θ
logπ(u|x))
T
w≡Q
π
(x,u)−b
π
(x), (4.35)
parameterized by the vector w, without affecting the unbiasedness of the gradient es-
timate and irrespective of the choice of the baseline b
π
(x). However, as mentioned in
116
(Sutton et al., 2000), the baseline may still be useful in order to reduce the variance of
the gradient estimate when Equation (4.28) is approximated from samples. Based on
Equations(4.28,4.35),wederiveanestimateofthepolicygradientas
∇
θ
J(θ) =
Z
X
d
π
(x)
Z
U
π(u|x)∇
θ
logπ(u|x)∇
θ
logπ(u|x)
T
dudxw =F
θ
w. (4.36)
as∇
θ
π(u|x) = π(u|x)∇
θ
logπ(u|x). Since π(u|x) is chosen by the user, even in
sampled data, the integralG(x) =
R
U
π(u|x)∇
θ
logπ(u|x)∇
θ
logπ(u|x)
T
du can be
evaluated analytically or empirically without actually executing all actions. It is also
noteworthy that the baseline does not appear in Equation (4.36) as it integrates out, thus
eliminating the need to find an optimal selection of this open parameter. Nevertheless,
theestimationofG =
R
X
d
π
(x)G(x)dxisstillexpensivesinced
π
(x)istnotknown.
An important observation is that the compatible function approximationf
π
w
(x,u) is
mean-zerow.r.t. theactiondistribution,i.e.,
Z
U
π(u|x)f
π
w
(x,u)du =w
T
Z
U
∇
θ
π(u|x)du = 0, (4.37)
sincefrom
R
U
π(u|x)du = 1,differentionw.r.t. toθresultsin
R
U
∇
θ
π(u|x)du =0. Thus,
f
π
w
(x,u)representsanadvantagefunctionA
π
(x,u) =Q
π
(x,u)−V
π
(x)ingeneral. The
advantage function cannot be learnedwith TD-like bootstrapping without knowledge of
the value function as the essence of TD is to compare the value V
π
(x) of the two ad-
jacent states – but this value has been subtracted out in A
π
(x,u). Hence, a TD-like
117
bootstrapping using exclusively the compatible function approximator is impossible. As an alternative, (Sutton et al., 2000; Konda & Tsitsiklis, 2000) suggested approximating f^π_w(x,u) from unbiased estimates Q̂^π(x,u) of the action value function, e.g., obtained from roll-outs, using least-squares minimization between f_w and Q̂^π. While possible in theory, one needs to realize that this approach implies a function approximation problem where the parameterization of the function approximator spans only a much smaller subspace than the training data – e.g., imagine approximating a quadratic function with a line. In practice, the result of such an approximation depends crucially on the training data distribution and thus has unacceptably high variance – e.g., fit a line to data only from the right branch of a parabola, only the left branch, or data from both branches. In the next section, we will see that there are smarter ways to estimate the compatible function approximation (Section 4.4.1) and that this compatible function approximation has a special meaning (Section 4.4.2).
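The naive least-squares fit of f^π_w to rollout estimates Q̂^π mentioned above can be written in a few lines; this sketch (my own naming, not the thesis's implementation) makes explicit that the solution depends entirely on which (x,u) pairs were visited, which is exactly the source of the variance problem discussed in the text.

import numpy as np

def fit_compatible_weights(psis, q_hats, reg=1e-8):
    """Least-squares fit of f_w(x,u) = psi(x,u)^T w to Monte-Carlo estimates Q_hat.

    psis:   score vectors psi(x,u) = grad_theta log pi(u|x), shape (N, n_params)
    q_hats: rollout estimates of Q^pi(x,u), shape (N,)
    """
    Psi = np.asarray(psis)
    q = np.asarray(q_hats)
    A = Psi.T @ Psi + reg * np.eye(Psi.shape[1])   # normal equations over the visited samples only
    return np.linalg.solve(A, Psi.T @ q)           # w such that psi^T w approximates Q_hat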
4.4 Natural Actor-Critic
Despite all the advances in variance reduction for policy gradient methods, these methods tend to perform surprisingly poorly. Even when applied to simple examples with rather few states, where the gradient can be determined very accurately, they turn out to be quite inefficient – thus, the underlying reason cannot be the variance in the gradient estimate but must rather be the large plateaus in the expected return landscape
where the gradients are small and often do not point directly towards the optimal solution. A simple example that demonstrates this behavior is given in Fig. 4.1. Similar to supervised learning, steepest ascent with respect to the Fisher information metric (Amari, 1998), called the 'natural' policy gradient, turns out to be significantly more efficient than normal gradients. Such an approach was first suggested for reinforcement learning as the 'average natural policy gradient' in (Kakade, 2002), and subsequently shown to be the true natural policy gradient (Peters et al., 2003a; Bagnell & Schneider, 2003). In this thesis chapter, we take this line of reasoning one step further by introducing the Natural Actor-Critic which inherits the convergence guarantees from gradient methods.
Several properties of the natural policy gradient are worth highlighting:
• Convergence to a local minimum is guaranteed, see (Amari, 1998).
• By choosing a more direct path to the optimal solution in parameter space, the natural gradient has, from empirical observations, faster convergence and avoids premature convergence of 'vanilla gradients' (see Figure 4.1).
• The natural policy gradient can be shown to be covariant, i.e., independent of the coordinate frame chosen for expressing the policy parameters (see Section 4.5.1).
• As the natural gradient analytically averages out the influence of the stochastic policy (including the baseline of the function approximator), it requires fewer data points for a good gradient estimate than 'vanilla gradients'.
4.4.1 Motivation
One of the main reasons for using policy gradient methods is that we intend to make only a small change Δθ to the policy π_θ while improving the policy. However, the meaning of 'small' is ambiguous. When using the Euclidean metric √(Δθ^T Δθ), the gradient is different for every parameterization θ of the policy π_θ, even if these parameterizations are related to each other by a linear transformation (Kakade, 2002), often resulting in unnaturally slow learning even when higher-order gradient methods are employed (Baxter, Bartlett, & Weaver, 2001; Berny, 2000; Kakade, 2001). This problem poses the question whether we can achieve a covariant gradient descent, i.e., gradient descent with respect to an invariant measure of the closeness between the current policy and the updated policy, based upon the distribution of the paths generated by each of these. In statistics, a variety of distance measures for the closeness of two distributions (e.g., p_θ(τ) and p_{θ+Δθ}(τ)) have been suggested, e.g., the Kullback-Leibler divergence⁸ d_KL(p_θ(τ) || p_{θ+Δθ}(τ)), the Hellinger distance d_HD and others (Su & Gibbs, 2002). Many of these distances (e.g., the previously mentioned ones) can be approximated by the same second-order Taylor expansion, i.e., by

d_KL(p_θ(τ) || p_{θ+Δθ}(τ)) ≈ (1/2) Δθ^T F_θ Δθ,   (4.38)
⁸ While being 'the natural way to think about closeness in probability distributions' (Balasubramanian, 1997), this measure is technically not a metric as it is not symmetric.
[Figure 4.1: Two panels, (a) Vanilla Policy Gradient and (b) Natural Policy Gradient, plotted over the controller gain θ_1 = k and the controller variance θ_2 = σ.]
Figure 4.1: The classical example of LQR can be used to illustrate why 'vanilla' policy gradients reduce the exploration to zero while natural policy gradients move towards the optimal solution. The blue circle in (a) indicates equal distance in the standard Euclidean metric, while the one in (b) shows equal distance with respect to the Fisher information metric. The natural policy gradient corresponds to searching the steepest descent on an infinitesimally small Fisher ellipse (b), while the vanilla policy gradient corresponds to steepest descent on a unit circle.
where

F_θ = ∫_T p_θ(τ) ∇log p_θ(τ) ∇log p_θ(τ)^T dτ = ⟨∇log p_θ(τ) ∇log p_θ(τ)^T⟩   (4.39)

is known as the Fisher information matrix. Let us assume that we fix the amount of change in our policy using the step-size ε. We then have a restricted step-size gradient descent problem (Fletcher & Fletcher, 2000). Thus, we have an optimization problem
max_{Δθ} J(θ + Δθ) ≈ J(θ) + Δθ^T ∇_θ J,   (4.40)
s.t. d_KL(p_θ(τ) || p_{θ+Δθ}(τ)) ≈ (1/2) Δθ^T F_θ Δθ = ε,   (4.41)
which is illustrated in Figure 4.1 and has the solution

Δθ = α_n F_θ^{-1} ∇_θ J   (4.42)

with α_n = [ε (∇J(θ)^T F_θ^{-1} ∇J(θ))^{-1}]^{1/2}, see Section A.1 for derivations⁹. The direction Δθ is called the natural gradient ∇̃_θ J(θ) = Δθ/α_n as introduced in (Amari, 1998). The learning rate α_n is not necessarily a desirable one and can be replaced by a constant learning rate without changing the gradient direction.
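A minimal sketch of the constrained step of Equations (4.40)–(4.42), assuming the gradient ∇_θ J and the Fisher matrix F_θ have already been estimated (function and argument names are my own):

import numpy as np

def natural_gradient_step(grad_J, fisher, epsilon=0.01, reg=1e-8):
    """Solve the KL-constrained step: Delta_theta = alpha_n F^{-1} grad_J
    with alpha_n = sqrt(epsilon / (grad_J^T F^{-1} grad_J))."""
    F = fisher + reg * np.eye(len(grad_J))      # small regularizer in case F is ill-conditioned
    nat_grad = np.linalg.solve(F, grad_J)       # F^{-1} grad_J
    denom = float(grad_J @ nat_grad)
    alpha_n = np.sqrt(epsilon / max(denom, 1e-12))
    return alpha_n * nat_grad

# Usage: theta_new = theta + natural_gradient_step(grad_J_estimate, fisher_estimate)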
This type of approach is known as Natural Policy Gradients and has its separate origin in supervised learning (Amari, 1998). It was first suggested in the context of reinforcement learning by Kakade (2002) and has been explored in greater depth in (Peters et al., 2003a; Bagnell & Schneider, 2003; Peters, 2005; Peters et al., 2005). The strongest theoretical advantage of this approach is that its performance no longer depends on the parameterization of the policy and it is therefore safe to use for arbitrary policies¹⁰. In practice, the learning process converges significantly faster in most cases.
⁹ The value d_KL(p_θ, p_{θ+Δθ}) can also be seen as the loss of information resulting from a policy change Δθ. Thus, we could alternatively formulate the problem as

max_{Δθ} J(θ + Δθ) − α d_KL(p_θ, p_{θ+Δθ}) ≈ J(θ) + Δθ^T ∇_θ J − α (1/2) Δθ^T F_θ Δθ,   (4.43)

which obviously has the same solution except for the freely selectable trade-off factor or forgetting rate α.
¹⁰ There are a variety of interesting properties of the natural policy gradient methods which are explored in (Peters et al., 2005).
4.4.2 Connection to the Compatible Function Approximation
Up to this point, we have left open the deciding question of how to determine the Fisher information matrix. In the first work on natural policy gradients (Kakade, 2002), it appeared that this question could not be answered straightforwardly; however, this question was largely answered in subsequent work simultaneously by Peters & Schaal and by Bagnell & Schneider, see (Peters et al., 2003a; Bagnell & Schneider, 2003; Peters, 2005; Peters et al., 2005). We repeat our results from (Peters et al., 2003a) and outline the derivation of the Fisher information of paths here. In (Moon & Stirling, 2000), we can find the well-known lemma that by differentiating ∫_T p(τ) dτ = 1 twice with respect to the parameters θ, we can obtain

∫_T p(τ) ∇²_θ log p(τ) dτ = − ∫_T p(τ) ∇_θ log p(τ) ∇_θ log p(τ)^T dτ   (4.44)

for any probability density function p(τ). Using Equations (4.18, 4.19), we can obtain by differentiation

∇²_θ log p(τ_{0:H}) = Σ_{k=1}^H ∇²_θ log π(u_k|x_k).   (4.45)
Using Equations (4.44, 4.45) and the definition of the Fisher information matrix (Amari, 1998), we can determine the Fisher information matrix for the average reward case in sample notation, i.e.,

F_θ = ⟨∇_θ log p(τ_{0:H}) ∇_θ log p(τ_{0:H})^T⟩ = −⟨∇²_θ log p(τ_{0:H})⟩
    = −⟨ Σ_{k=0}^H ∇²_θ log π(u_k|x_k) ⟩
    = −∫_X d^π_H(x) ∫_U π(u|x) ∇²_θ log π(u|x) du dx
    = ∫_X d^π_H(x) ∫_U π(u|x) ∇_θ log π(u|x) ∇_θ log π(u|x)^T du dx = G_θ,   (4.46)
where d^π_H(x) = Σ_{k=0}^H p(x_k = x) denotes the distribution of states along the trajectory. Similarly, if we replace p(τ_{0:H}) by a weighted path distribution p_γ(τ_{0:n}) = p(τ_{0:n}) Σ_{l=0}^H a_l I_{x_l,u_l}, we see that ∇²_θ log p(τ_{0:n}) = ∇²_θ log p_γ(τ_{0:n}). Thus, the proof above generalizes to reweighted path distributions, i.e., we have d^π_H(x) = Σ_{k=0}^H a_k p(x_k = x). Thus, we can estimate the Fisher information matrix with

F_θ = ⟨ Σ_{l=0}^H a_l ∇_θ log π(u_l|x_l) ∇_θ log π(u_l|x_l)^T ⟩ = G_θ,   (4.47)

as we have shown in (Peters et al., 2003a). These results imply the equality of the all-action matrix G_θ and the Fisher information F_θ of paths, i.e., we have

F_θ = G_θ.   (4.48)
Therefore, we have demonstrated that F_θ is indeed a true Fisher information matrix and does not have to be interpreted as the 'average' of the point Fisher information matrices. Equations (4.46) and (4.42) combined imply that the natural gradient can be computed as

∇̃_θ J(θ) = G_θ^{-1} F_θ w = w,   (4.49)

since F_θ = G_θ. Therefore, we only need to estimate w and not G_θ. The resulting policy improvement step is thus θ_{i+1} = θ_i + α w, where α denotes a learning rate.
4.4.3 Natural Actor-Critic Algorithms
The critic evaluates the current policy π in order to provide the basis for an actor improvement, i.e., the change Δθ of the policy parameters. As we are interested in natural policy gradient updates Δθ = αw, we wish to employ the compatible function approximation f^π_w(x,u) from Section 4.3.3 in this context. In Section 4.3.3, we realized that this function is hard to learn as it can only represent an impoverished version of the state-action value function. We will remedy this situation by deriving more useful estimators from two different points of view, i.e., the state-action based point of view and the episodic roll-out based point of view. Both rely on the assumption of additional basis functions.
Table 4.5: Natural Actor-Critic Algorithm with LSTD-Q(λ)

input: policy parameterization θ
1   if first gradient estimate
2     reset sufficient statistics: A_0 = 0, b_0 = z_0 = 0, k = 0.
3   else
4     forget statistics: z_{k+1} ← 0, A_{k+1} ← β A_{k+1}, b_{k+1} ← β b_{k+1}.
5   end.
6   repeat
7     Draw initial state x_0 ∼ p(x_0).
8     for t = 0, 1, 2, ..., H do
        Execute a step:
9       Draw action u_t ∼ π(u_t|x_t).
10      Observe next state x_{t+1} ∼ p(x_{t+1}|x_t, u_t).
11      Observe reward r_t = r(x_t, u_t).
        Update basis functions:
12      φ̃_t = [φ(x_{t+1})^T, 0^T]^T,
13      φ̂_t = [φ(x_t)^T, ∇_θ log π(u_t|x_t)^T]^T,
        Update sufficient statistics:
14      z_{k+1} = λ z_k + φ̂_t,
15      A_{k+1} = A_k + z_{k+1} (φ̂_t − γ φ̃_t)^T,
16      b_{k+1} = b_k + z_{k+1} r_t,
17      Update critic parameters: [w^T_{k+1}, v^T_{k+1}]^T = A^{-1}_{k+1} b_{k+1}.
18      Update time-step k ← k + 1.
19    end for.
20  until gradient estimate g_NAC = w_k converged.
return: gradient estimate g_NAC = w_k
4.4.3.1 Natural Actor-Critic with LSTD-Q(λ)
We observe that we can write the Bellman equations (e.g., see (Baird, 1993)) in terms of the advantage function and the state-value function

Q^π(x,u) = A^π(x,u) + V^π(x) = r(x,u) + γ ∫_X p(x′|x,u) V^π(x′) dx′.   (4.50)

Inserting A^π(x,u) = f^π_w(x,u) and an appropriate basis function representation of the value function as V^π(x) = φ(x)^T v, we can rewrite the Bellman equation, Equation (4.50), as a set of linear equations

∇_θ log π(u_t|x_t)^T w + φ(x_t)^T v = r(x_t,u_t) + γ φ(x_{t+1})^T v + ε(x_t,u_t,x_{t+1}),   (4.51)
where ε(x_t,u_t,x_{t+1}) denotes an error term which is mean-zero, as can be observed from Equation (4.50). Using Equation (4.51), a solution to Equation (4.50) can be obtained by adapting the LSTD(λ) policy evaluation algorithm (Boyan, 1999). For this purpose, we define

φ̂_t = [φ(x_t)^T, ∇_θ log π(u_t|x_t)^T]^T,   (4.52)
φ̃_t = [φ(x_{t+1})^T, 0^T]^T,

as new basis functions, where 0 is the zero vector. This definition of basis functions reduces bias and variance of the learning process in comparison to SARSA and previous
LSTD(λ) algorithms for state-action value functions (Boyan, 1999), as the basis functions φ̃_t do not depend on stochastic future actions u_{t+1}; i.e., the input variables to the LSTD regression are not noisy due to u_{t+1} (e.g., as in (Bradtke, Ydstie, & Barto, 1994)) – such input noise would violate the standard regression model which only takes noise in the regression targets into account. LSTD(λ) with the basis functions in Equation (4.52), called LSTD-Q(λ) from now on, is thus currently the theoretically cleanest way of applying LSTD to state-action value function estimation. It is exact for deterministic or weakly noisy state transitions and arbitrary stochastic policies. Like all previous LSTD suggestions, it loses accuracy with increasing noise in the state transitions since φ̃_t becomes a random variable. The complete LSTD-Q(λ) algorithm is given in the Critic Evaluation (lines 12-17) of Table 4.5.
Once LSTD-Q(λ) converges to an approximation of A^π(x_t,u_t) + V^π(x_t), we obtain two results: the value function parameters v, and the natural gradient w. The natural gradient w serves in updating the policy parameters Δθ_t = α w_t. After this update, the critic has to forget at least parts of its accumulated sufficient statistics using a forgetting factor β ∈ [0,1] (cf. Table 4.5). For β = 0, i.e., complete resetting, and appropriate basis functions φ(x), convergence to the true natural gradient can be guaranteed. The complete Natural Actor-Critic (NAC) algorithm is shown in Table 4.5.
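A compact sketch of the critic's sufficient-statistic updates (lines 12-17 of Table 4.5) is given below; the class and argument names are my own, and the caller supplies φ(x_t), φ(x_{t+1}), ∇_θ log π(u_t|x_t) and r_t for each observed transition.

import numpy as np

class LSTDQCritic:
    """Sufficient statistics of the LSTD-Q(lambda) critic (cf. Table 4.5, lines 12-17)."""

    def __init__(self, n_phi, n_theta, gamma=0.99, lam=0.9):
        n = n_phi + n_theta
        self.n_phi, self.gamma, self.lam = n_phi, gamma, lam
        self.A = np.zeros((n, n))
        self.b = np.zeros(n)
        self.z = np.zeros(n)

    def update(self, phi_x, phi_xnext, grad_log_pi, reward):
        phi_hat = np.concatenate([phi_x, grad_log_pi])                       # [phi(x_t); grad log pi]
        phi_tilde = np.concatenate([phi_xnext, np.zeros_like(grad_log_pi)])  # [phi(x_{t+1}); 0]
        self.z = self.lam * self.z + phi_hat                                 # eligibility trace
        self.A += np.outer(self.z, phi_hat - self.gamma * phi_tilde)
        self.b += self.z * reward

    def solve(self, reg=1e-6):
        sol = np.linalg.solve(self.A + reg * np.eye(len(self.b)), self.b)
        return sol[:self.n_phi], sol[self.n_phi:]   # value weights v, natural gradient w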
However, it becomes fairly obvious that the basis functions can have an influence on our gradient estimate. When using the counterexample in (Bartlett, 2002) with a typical Gibbs policy, we will realize that the gradient is affected for λ < 1; for λ = 0 the gradient
Table 4.6: Episodic Natural Actor-Critic

input: policy parameterization θ.
1   repeat
2     perform M trials and obtain x_{0:H}, u_{0:H}, r_{0:H} for each trial.
      Obtain the sufficient statistics:
3     Policy derivatives ψ_k = ∇_θ log π_θ(u_k|x_k).
4     Fisher matrix F_θ = ⟨ Σ_{k=0}^H ψ_k Σ_{l=0}^H ψ_l^T ⟩.
      Vanilla gradient g = ⟨ Σ_{k=0}^H ψ_k Σ_{l=0}^H a_l r_l ⟩.
5     Eligibility φ = ⟨ Σ_{k=0}^H ψ_k ⟩.
6     Average reward r̄ = ⟨ Σ_{l=0}^H a_l r_l ⟩.
      Obtain the natural gradient by computing:
7     Baseline b = Q (r̄ − φ^T F_θ^{-1} g)
        with Q = M^{-1} (1 + φ^T (M F_θ − φφ^T)^{-1} φ).
8     Natural gradient g_eNAC1 = F_θ^{-1} (g − φ b).
9   until gradient estimate g_eNAC1 converged.
return: gradient estimate g_eNAC1.
is flipped and would always worsen the policy. Thus, it would result in a biased gradient update for λ < 1; however, unlike in (Bartlett, 2002), we can at least guarantee that the gradient is unbiased for λ = 1.
4.4.3.2 Episodic Natural Actor-Critic
Given the problem that the additional basis functions φ(x) determine the quality of the gradient, we need methods which guarantee the unbiasedness of the natural gradient estimate. Such a method can be obtained by summing up Equation (4.51) along a sample path, which yields

Σ_{t=0}^H a_t A^π(x_t,u_t) = a_{H+1} V^π(x_{H+1}) + Σ_{t=0}^H a_t r(x_t,u_t) − V^π(x_0).   (4.53)
It is fairly obvious that the last term disappears for H → ∞ or for episodic tasks (where r(x_H,u_H) is the final reward); therefore each roll-out yields one equation. If we furthermore assume a single start-state, an additional scalar value function of φ(x) = 1 suffices. We therefore obtain a straightforward regression problem:

Σ_{t=0}^H a_t ∇log π(u_t|x_t)^T w + J = Σ_{t=0}^H a_t r(x_t,u_t)   (4.54)
with exactly dim θ + 1 unknowns. This means that for non-stochastic tasks we can obtain a natural gradient after dim θ + 1 rollouts using least-squares regression

[w^T, J]^T = (Ψ^T Ψ)^{-1} Ψ^T R,   (4.55)

with

Ψ_i = [ Σ_{t=0}^H a_t ∇log π(u_t|x_t)^T, 1 ],   (4.56)
R_i = Σ_{t=0}^H a_t r(x_t,u_t).   (4.57)

This regression problem can be transformed into the form shown in Table 4.6 using the matrix inversion lemmata; see Section A.4.1 for the derivation.
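The regression of Equations (4.54)-(4.57) can be implemented directly; the sketch below (my own, assuming a_t = γ^t and per-trial lists of score vectors and rewards) returns the natural gradient estimate w and the fitted offset J.

import numpy as np

def episodic_natural_gradient(rollouts, gamma=1.0, reg=1e-6):
    """Episodic natural gradient via the least-squares problem of Eqs. (4.54)-(4.57).

    rollouts: list of trials; each trial is a list of (grad_log_pi, reward) pairs.
    Returns (w, J): natural gradient estimate and value-function offset.
    """
    Psi, R = [], []
    for trial in rollouts:
        psi_sum, ret = 0.0, 0.0
        for t, (grad_log_pi, reward) in enumerate(trial):
            a_t = gamma ** t                                   # discount weights a_t
            psi_sum = psi_sum + a_t * np.asarray(grad_log_pi)
            ret += a_t * reward
        Psi.append(np.append(psi_sum, 1.0))                    # [sum_t a_t grad log pi ; 1]
        R.append(ret)
    Psi, R = np.array(Psi), np.array(R)
    sol = np.linalg.solve(Psi.T @ Psi + reg * np.eye(Psi.shape[1]), Psi.T @ R)
    return sol[:-1], sol[-1]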
This algorithm, originally derived in (Peters, 2005; Peters et al., 2003a, 2005), can be considered the 'natural' version of REINFORCE with a baseline optimal for this gradient estimator. However, for steepest descent with respect to a metric, the baseline also needs
to minimize the variance with respect to the same metric and, thus, the episodic natural actor-critic can be derived similarly to the normal likelihood ratio gradients before. In this case, we can minimize the whole covariance matrix of the natural gradient estimate Δθ̂, given by

Σ = Cov{Δθ̂}_{F_θ} = E[ (Δθ̂ − F_θ^{-1} g_RF(b))^T F_θ (Δθ̂ − F_θ^{-1} g_RF(b)) ],

with g_RF(b) = ⟨∇log p_θ(τ)(r(τ) − b)⟩ being the REINFORCE gradient with baseline b. As outlined in (Peters, 2005; Peters et al., 2003a, 2005), it can be shown that this yields the minimum-variance unbiased natural gradient estimator.
4.4.3.3 Episodic Natural Actor-Critic with a Time-Variant Baseline
The episodic natural actor-critic described in the previous section suffers from a drawback: it does not make use of intermediate data, just like REINFORCE. For policy gradients, the way out was G(PO)MDP, which left out terms that average out in expectation. In the same manner, we can make the argument for a time-dependent baseline, which then allows us to reformulate the episodic Natural Actor-Critic. This argument results in the algorithm shown in Table 4.7; the derivation is given in Section A.4.2. The advantage of this type of algorithm is two-fold: the variance of the gradient estimate is often lower and it can take time-variant rewards significantly better into account.
Table 4.7: Episodic Natural Actor-Critic with a Time-Variant Baseline

input: policy parameterization θ.
1   repeat
2     perform M trials and obtain x_{0:H}, u_{0:H}, r_{0:H} for each trial.
      Obtain the sufficient statistics:
3     Policy derivatives ψ_k = ∇_θ log π_θ(u_k|x_k).
4     Fisher matrix F_θ = ⟨ Σ_{k=0}^H Σ_{l=0}^k ψ_l ψ_k^T ⟩.
      Vanilla gradient g = ⟨ Σ_{k=0}^H Σ_{l=0}^k ψ_l a_k r_k ⟩.
5     Eligibility matrix Φ = [φ_1, φ_2, ..., φ_K]
        with φ_h = ⟨ Σ_{k=0}^h ψ_k ⟩.
6     Average reward vector r̄ = [r̄_1, r̄_2, ..., r̄_K]
        with r̄_h = ⟨ a_h r_h ⟩.
      Obtain the natural gradient by computing:
7     Baseline b = Q (r̄ − Φ^T F_θ^{-1} g)
        with Q = M^{-1} (I_K + Φ^T (M F_θ − ΦΦ^T)^{-1} Φ).
8     Natural gradient g_eNACn = F_θ^{-1} (g − Φ b).
9   until gradient estimate g_eNACn converged.
return: gradient estimate g_eNACn.
4.5 Properties of Natural Actor-Critic
In this section, we will emphasize certain properties of the natural actor-critic. In particular, we want to give a simple proof of the covariance of the natural policy gradient and discuss Kakade's (2002) observation that in his experimental settings the natural policy gradient was non-covariant. Furthermore, we will discuss another surprising aspect of the Natural Actor-Critic (NAC), namely its relation to previous algorithms. We briefly demonstrate that established algorithms like the classic Actor-Critic (Sutton & Barto, 1998) and Bradtke's Q-Learning (Bradtke et al., 1994) can be seen as special cases of NAC.
4.5.1 On the Covariance of Natural Policy Gradients
When (Kakade, 2002) originally suggested natural policy gradients, he came to the disappointing conclusion that they were not covariant. As a counterexample, he suggested that for two different linear Gaussian policies (one in the normal form, and the other in the information form), the probability distributions represented by the natural policy gradient would be affected differently, i.e., the natural policy gradient would be non-covariant. We intend to give a proof at this point showing that the natural policy gradient is in fact covariant under certain conditions, and to clarify why (Kakade, 2002) experienced these difficulties.
Theorem 4.1 Natural policy gradient updates are covariant for two policies π_θ parameterized by θ and π_h parameterized by h if (i) for all parameters θ_i there exists a function θ_i = f_i(h_1,...,h_k), (ii) the derivative ∇_h θ and its inverse ∇_h θ^{-1} exist, and (iii) there are no redundant parameters (i.e., the Fisher information matrix is non-singular).
For the proof see Appendix A.2. Practical experiments show that the problems occurring for Gaussian policies in (Kakade, 2002) are in fact due to the selection of the step-size α which determines the length of Δθ. As the linearization Δθ = ∇_h θ^T Δh does not hold for large Δθ, this can cause divergence between the algorithms even for analytically determined natural policy gradients, which can partially explain the difficulties encountered by Kakade (2002).
4.5.2 NAC's Relation to Previous Algorithms
Original Actor-Critic. Surprisingly, the original Actor-Critic algorithm (Sutton & Barto, 1998) is a form of the Natural Actor-Critic. Consider a Gibbs policy

π(u_t|x_t) = exp(θ_xu) / Σ_b exp(θ_xb),   (4.58)

with all parameters θ_xu lumped into the vector θ (denoted as θ = [θ_xu]), in a discrete setup with tabular representations of transition probabilities and rewards. A linear function approximation V^π(x) = φ(x)^T v with v = [v_x] and unit basis functions φ(x) = e_x is employed. Sutton et al.'s online update rule is given by
θ^{t+1}_xu = θ^t_xu + α_1 (r(x,u) + γ v_{x′} − v_x),
v^{t+1}_x = v^t_x + α_2 (r(x,u) + γ v_{x′} − v_x),

where α_1, α_2 denote learning rates. The update of the critic parameters v^t_x equals the one of the Natural Actor-Critic in expectation, as TD(0) critics converge to the same values as LSTD(0) and LSTD-Q(0) for discrete problems (Boyan, 1999). Since for the Gibbs policy we have
∂ log π(b|a) / ∂θ_xu =  1 − π(b|a)   if a = x and b = u,
                        −π(b|a)      if a = x and b ≠ u,
                        0            otherwise,   (4.59)
and as Σ_b π(b|x) A(x,b) = 0, we can evaluate the advantage function and derive

A(x,u) = A(x,u) − Σ_b π(b|x) A(x,b) = Σ_b [∂ log π(b|x)/∂θ_xu] A(x,b).

Since the compatible function approximation represents the advantage function, i.e., f^π_w(x,u) = A(x,u), we realize that the advantages equal the natural gradient, i.e., w = [A(x,u)]. Furthermore, the TD(0) error of a state-action pair (x,u) equals the advantage function in expectation, and therefore the natural gradient update

w_xu = A(x,u) = E_{x′}{ r(x,u) + γ V(x′) − V(x) | x,u },   (4.60)

corresponds to the average online update of the Actor-Critic. As both update rules of the Actor-Critic correspond to the ones of NAC, we can see both algorithms as equivalent.
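A minimal tabular sketch of this correspondence (my own illustration, not the thesis's implementation) shows that the same TD(0) error drives both the actor parameters θ_xu and the critic values v_x, exactly as in the update rules above.

import numpy as np

def gibbs_policy(theta_row):
    """pi(u|x) of Equation (4.58) for the parameter row theta[x, :] of one state."""
    e = np.exp(theta_row - theta_row.max())
    return e / e.sum()

def actor_critic_step(theta, v, x, u, r, x_next, alpha1=0.1, alpha2=0.1, gamma=0.95):
    """One tabular Actor-Critic update; the TD(0) error plays the role of w_xu = A(x,u)."""
    td_error = r + gamma * v[x_next] - v[x]
    theta[x, u] += alpha1 * td_error    # actor: update of theta_xu, a natural-gradient-like step
    v[x] += alpha2 * td_error           # critic: TD(0) value update
    return theta, v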
Bradtke’sQ-Learning. Bradtkeetal. (1994)proposedanalgorithmwithpolicyπ(u
t
|x
t
) =
N(u
t
|k
T
i
x
t
,σ
2
i
) and parametersθ
i
= [k
T
i
,σ
i
]
T
(whereσ
i
denotes the exploration, andi
the policy update time step) in a linear control task with linear state transitionsx
t+1
=
Ax
t
+bu
t
,andquadraticrewardsr(x
t
,u
t
) =x
T
t
Hx
t
+Ru
2
t
. TheyevaluatedQ
π
(x
t
,u
t
)
with LSTD(0) using a quadratic polynomial expansion as basis functions, and applied
greedyupdates:
k
Bradtke
i+1
= argmax
k
i+1
Q
π
(x
t
,u
t
=k
T
i+1
x
t
) =−(R+γb
T
P
i
b)
−1
γbP
i
A,
where P_i denotes policy-specific value function parameters related to the gain k_i; no update of the exploration σ_i was included. Similarly, we can obtain the natural policy gradient w = [w_k, w_σ]^T, as yielded by LSTD-Q(λ), analytically using the compatible function approximation and the same quadratic basis functions. As discussed in detail in (Peters et al., 2003a), this gives us

w_k = (γ A^T P_i b + (R + γ b^T P_i b) k)^T σ_i²,
w_σ = 0.5 (R + γ b^T P_i b) σ_i³.
Similarly, it can be derived that the expected return is J(θ_i) = −(R + γ b^T P_i b) σ_i² for this type of problem, see (Peters et al., 2003a). For a learning rate α_i = 1/‖J(θ_i)‖, we see

k_{i+1} = k_i + α_i w_k = k_i − (k_i + (R + γ b^T P_i b)^{-1} γ A^T P_i b) = k^Bradtke_{i+1},

which demonstrates that Bradtke's actor update is a special case of the Natural Actor-Critic. NAC extends Bradtke's result as it gives an update rule for the exploration – which was not possible in Bradtke's greedy framework.
[Figure 4.2: Four double-logarithmic plots of expected return over the number of rollouts: (a) minimum motor command with splines, (b) minimum motor command with motor primitives, (c) passing through a point with splines, (d) passing through a point with motor primitives. (e) Legend: Finite Difference Gradient; Vanilla Policy Gradient with constant baseline; Vanilla Policy Gradient with time-variant baseline; Episodic Natural Actor-Critic with single offset basis functions; Episodic Natural Actor-Critic with time-variant offset basis functions.]
Figure 4.2: This figure shows different experiments with motor task learning. In (a,b), we see how the learning system creates minimum-motor-command goal-achieving plans using both (a) splines and (b) motor primitives. For this problem, the natural actor-critic methods beat all other methods by several orders of magnitude. In (c,d), the plan has to achieve an intermediary goal. While the natural actor-critic methods still outperform previous methods, the gap is smaller as the learning problem is easier. Note that these are double-logarithmic plots.
4.6 Experiments & Results
In the previous section, we outlined the five first-order, model-free policy gradient algorithms which are most relevant for robotics (further ones exist but do not scale into high-dimensional robot domains). In this section, we will demonstrate how these different algorithms compare in practice in different areas relevant to robotics. For this purpose, we will show experiments on both simulated plants as well as on real robots, and we will compare the algorithms for the optimization of control laws and for the learning of motor skills.
4.6.1 Comparing Policy Gradient Methods on Motor Primitives
Initially, we compare the different policy gradient methods in motor primitive planning tasks using both spline-based and dynamical-system-based desired trajectories. In Figure 4.2 (a) and (b), we show a comparison of the presented algorithms for a simple, single-DOF task with a reward of

r_k(x_{0:N}, u_{0:N}) = Σ_{i=0}^N c_1 q̇²_{d,k,i} + c_2 (q_{d,k,N} − g_k)²   (4.61)
where c_1 = 1, c_2 = 1000 for both splines and dynamic motor primitives. In Figure 4.2 (c) and (d), we show the same with an additional punishment term for going through an intermediate point p_F at time F, i.e.,

r_k(x_{0:N}, u_{0:N}) = Σ_{i=0}^N c̃_1 q̇²_{d,k,i} + c̃_2 (q_{d,k,N} − g_k)² + c̃_2 (q_{d,k,F} − p_F)².   (4.62)
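For concreteness, the costs of Equations (4.61) and (4.62) can be written as a small function; this is my own sketch, and the sign convention (returning a negated cost) is an assumption based on the negative expected returns shown in Figure 4.2.

import numpy as np

def motor_primitive_reward(qd_dot, qd_final, goal, c1=1.0, c2=1000.0,
                           qd_at_F=None, p_F=None):
    """Reward of Eqs. (4.61)/(4.62) for a single DOF trajectory.

    qd_dot:   desired velocities along the trajectory
    qd_final: final desired position, goal: target position g_k
    qd_at_F, p_F: optional intermediate point constraint at time F (Eq. 4.62 only)
    """
    cost = c1 * np.sum(np.asarray(qd_dot) ** 2) + c2 * (qd_final - goal) ** 2
    if qd_at_F is not None and p_F is not None:
        cost += c2 * (qd_at_F - p_F) ** 2
    return -cost   # interpreted as a cost to be minimized, hence returned negated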
It is quite clear from the results that the natural actor-critic methods outperform both the vanilla policy gradient methods as well as the likelihood ratio methods. Finite difference gradient methods behave differently from the likelihood ratio methods as there is no stochasticity in the system, resulting in a cleaner gradient but also in local minima not present for likelihood ratio methods where the exploratory actions are stochastic. From this comparison, we can conclude that natural actor-critic methods are the best suited for motor primitive learning.
4.6.2 Robot Application: Motor Primitive Learning for Baseball
We also evaluated the same setup in a challenging robot task, i.e., the planning of these motor primitives for a seven-DOF robot task using our SARCOS Master Arm. The task of the robot is to hit the ball properly so that it flies as far as possible; this game is also known as T-Ball. The state of the robot is given by its joint angles and velocities while the actions are the joint accelerations. The reward is extracted using color segment tracking with a NewtonLabs vision system. Initially, we teach a rudimentary stroke by supervised learning, as can be seen in Figure 4.3 (b); however, it fails to reproduce the
[Figure 4.3: Panel (a) plots the performance J(θ) over episodes; panels (b)-(d) show teaching by imitation, the initial reproduced motion, and the improved reproduced motion.]
Figure 4.3: This figure shows (a) the performance of a baseball swing task when using the motor primitives for learning. In (b), the learning system is initialized by imitation learning, in (c) it initially fails at reproducing the motor behavior, and (d) after several hundred episodes it exhibits a nicely learned batting.
behavior as shown in (c); subsequently, we improve the performance using the episodic Natural Actor-Critic, which yields the performance shown in (a) and the behavior in (d). After approximately 200-300 trials, the ball can be hit properly by the robot.
4.7 Conclusion & Discussion
The contributions of this chapter are outlined in Section 4.7.1, and we discuss the relations between policy gradients and extensions of the EM-like approaches in Section 4.7.2. Furthermore, in Section 4.7.3, we discuss a new approach tailored for motor primitive learning, i.e., the motor primitive iteration.
4.7.1 Contributions of this Chapter
We have presented an extensive survey of policy gradient methods. While some developments needed to be omitted as they are only applicable to very low-dimensional
state-spaces, this thesis chapter represents the state of the art in policy gradient methods and can deliver a solid base for future applications of policy gradient methods in robotics. All three major ways of estimating first-order gradients, i.e., finite-difference gradients, vanilla policy gradients and natural policy gradients, are discussed in this thesis chapter and practical algorithms are given.
One of the presented classes of algorithms, the Natural Actor-Critic algorithms, was developed for this thesis. While developed at the beginning of the Ph.D. of the author, these algorithms have been widely accepted by now and have been applied in a variety of settings (Guenter, Hersch, Calinon, & Billard, 2007; Mori, Nakamura, & Ishii, 2005; Mori et al., 2004; Nakamura et al., 2004; Park, Kim, & Kang, 2005; Richter, Aberdeen, & Yu, 2007; Sato et al., 2002; Ueno et al., 2006). The Natural Actor-Critic is considered the "current method of choice" (Aberdeen, 2006) among the policy gradient methods in the reinforcement learning community. It also allows the derivation of several previous algorithms and has very useful properties.
The experiments presented here show that the time-variant episodic natural actor-critic is the preferred method among the presented methods when applicable; however, if a policy cannot be differentiated with respect to its parameters, the finite difference methods may be the only applicable ones. The example of motor primitive learning for baseball underlines the efficiency of natural gradient methods.
4.7.2 Relation to EM-based Approaches
Interestingly, the methods used for immediate reinforcement learning in Section 3.3 are related to policy gradient approaches and can be extended to trajectories. For this, let us discuss this topic without considering the adaptive reward transformation, i.e., the presented EM-like algorithm could be derived from

θ_{k+1} = argmax_{θ′} d_KL(p_{θ_k}(τ) r(τ) ‖ p_{θ′}(τ)),   (4.63)

which can be shown to maximize the lower bound on J(θ) = ∫ p_θ(τ) r(τ) dτ in the same fashion as shown in Section 3.3. While the EM solution yields a parameter-free one-step improvement, we can also look at the gradient of d_KL(p_{θ_k}(τ) r(τ) || p_{θ′}(τ)) which is used in the EM update. This analysis yields

d/dθ′ d_KL(p_{θ_k}(τ) r(τ) ‖ p_{θ′}(τ))   (4.64)
  = d/dθ′ ∫_T p_{θ_k}(τ) r(τ) log [ p_{θ′}(τ) / (p_{θ_k}(τ) r(τ)) ] dτ,   (4.65)
  = ∫_T p_{θ_k}(τ) r(τ) d/dθ′ log p_{θ′}(τ) dτ = ∇_θ J(θ),   (4.66)

i.e., it shows that the vanilla policy gradient is used in EM-like approaches. In practice, the EM-like algorithms usually choose very conservative updates, and simple evaluations on toy problems for episodic reinforcement learning show that the EM-like RL algorithms are very sensitive to the data distribution. Thus, when applied to trajectories, the natural
actor-critic outperforms vanilla EM-like methods significantly. However, it is obvious that extensions with appropriate constraints can be of significant advantage here, especially if the reduction to regression is used as in Section 3.3. Thus, this remains an important topic for future research.
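A minimal sketch of such an EM-like, reward-weighted update (my own illustration; it assumes non-negative, possibly transformed returns with at least one positive entry) makes the contrast with gradient-based updates explicit: the new parameters are simply a reward-weighted average of the sampled ones, without a step-size.

import numpy as np

def reward_weighted_update(thetas, rewards):
    """EM-like update: new mean parameters as a reward-weighted average of samples.

    thetas:  sampled policy parameters (or exploration offsets), shape (N, d)
    rewards: corresponding non-negative (possibly transformed) returns, shape (N,)
    """
    w = np.asarray(rewards, dtype=float)
    w = w / w.sum()                       # assumes sum of rewards is positive
    return w @ np.asarray(thetas)         # conservative, step-size-free one-step improvement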
4.7.3 Future Work: Motor Primitive Iteration
In this chapter, we have focused on the application of model-free, general policy gradient approaches. Alternatives arise for learning motor primitives in the framework suggested in (Ijspeert et al., 2002, 2003). In this framework, we have two second-order linear systems coupled through a locally weighted linear function approximator. For this reason, one could look at the nonlinear system

ẋ = A_x x + e_2 f(y) = A x + B u,   (4.67)

as a linear system with the coupling u = f(y) to the canonical system y as optimal controls. As this coupling is implemented through a locally linear weighted function approximator such as

f(y) = [ Σ_{i=1}^n φ_i(y) w_i^T y ] / [ Σ_{i=1}^n φ_i(y) ],
it is logical that the reward should also be approximated locally in a similar fashion, e.g., through approximations of the reward using the same local weightings such as

r(y,x,u) = [ Σ_{i=1}^n φ_i(y) ( R_i (u − u_i)² + Q_i (x − x_i)² ) ] / [ Σ_{i=1}^n φ_i(y) ].   (4.68)
In this case, there would be two natural steps, i.e., (i) a motor primitive policy evaluation step where several rollouts would allow the learning of the local parameters, e.g., R_i, Q_i, u_i, x_i, and (ii) a policy improvement step which would compute the optimal controls based on these parameters. The optimal controls would then serve as supervised learning targets in order to determine the motor primitive parameters. If necessary, Galerkin solution methods could allow an extension to cases beyond the capabilities of traditional optimal control methods (Beard & McLain, 1998). This method could serve as a promising new method specifically for motor primitives but would of course always be limited to motor primitive learning.
Chapter 5
Conclusion

Reasoning draws a conclusion, but does not make the conclusion certain, unless the mind discovers it by the path of experience.
Roger Bacon (English empiricist, 1214-1294)
This thesis has been about the greater goal of creating a general machine learning framework for acquiring and improving motor skills in robotics. Several important contributions in the areas of learning, motor skills and robotics have been made which bring us a step closer to that ultimate aim. In this chapter, we will first summarize the insights and results of this thesis. Subsequently, we discuss the different approaches as well as the important next steps needed in order to solve the general problem of motor skill learning.
5.1 Summary of the Contributions
In this thesis, we have presented several different contributions to three different but interrelated topics, i.e., machine learning (particularly reinforcement learning), motor skill representation and control, as well as robot applications. The main results of this thesis have been grouped into three chapters which are written in such a way that they can be read independently of each other. However, following the sequence of these chapters will help the understanding of the individual chapters tremendously.

In Chapter 1, we have given an introduction to the thesis. This chapter started with the general motivation of the chosen topic, i.e., Machine Learning for Motor Skills in Robotics. It included a summary of the most important novel developments which we have presented in the following chapters and an outline of the remainder of the thesis.
In Chapter 2, we started with the seminal work of Udwadia (1996, 2003) on the relation between analytical dynamics and control. The underlying key insight is that Nature enforces constraints just like a control engineer creates control laws, and that these controls can be derived from a point-wise optimal control problem. Using this understanding of motor control, we have developed a coherent framework for the derivation of control laws for a variety of different control problems ranging from task-space control to hierarchical control. We have shown that this framework can be used in order to unify a variety of previous control approaches. We discussed the necessary conditions for both task-space and joint-space stability in this framework. Evaluations
of the framework for an anthropomorphic robot arm in the area of operational space control are presented.
In Chapter 3, we have presented a framework for learning operational space control based upon the insights presented in Chapter 2. We first showed how small modeling errors, as they always occur in practice, can affect our ability to apply analytically derived operational space laws and, thus, that it is necessary to learn operational control laws. Our previous insight (i.e., that we have a point-wise optimal control problem) allowed us to reformulate the resulting learning problem as an immediate reward reinforcement learning problem. For this reason, we derived an immediate reward reinforcement learning algorithm based upon the expectation-maximization algorithm. The usage of normally distributed exploration allowed the derivation of a reward-weighted regression algorithm. The function approximation for representing the nominal behaviour of the policy was determined using a multiple paired inverse-forward models approach. This algorithm is applied to both a simulated three degrees of freedom robot arm and a simulated anthropomorphic SARCOS Master Arm. For the three degrees of freedom robot arm, we can show convergence to the optimal solution which can be determined analytically, and for the simulated anthropomorphic SARCOS Master Arm, we can show very good tracking performance with high rewards.
In Chapter 4, we have presented the most extensive review of policy gradient methods to date as well as novel approaches. We started out by introducing the two general approaches for policy gradients in the machine learning literature, i.e., finite-difference
gradients and likelihood ratio policy gradients, i.e., 'vanilla' policy gradients. We have derived both G(PO)MDP as well as the policy gradient theorem from the episodic point of view on likelihood ratio policy gradients. We have shown how to derive the optimal baselines for such estimators. Subsequently, we introduced the compatible function approximation which allows variance reduction but does not introduce bias into the gradient approximation. We have clarified the discussion of the compatible function approximation and shown that, in general, it can only represent the advantage function. The path-based derivation allows us to find the natural policy gradient, i.e., the steepest descent for the path-based Fisher information matrix. Using this path-based derivation, we have proven Kakade's intuitive assumption that the parameters of the compatible function approximation are in fact the natural policy gradient. Using additional basis functions, we made use of both the compatible function approximation and the natural policy gradient in order to present a new reinforcement learning architecture, the Natural Actor-Critic. We show three resulting algorithms, i.e., the Natural Actor-Critic with general basis functions which employs LSTD-Q(λ) for gradient estimation, and the episodic versions which do not require complex additional function approximation but only a constant or a time-variant parameter. We have shown that the natural policy gradient used in this thesis is in fact covariant. We have derived previous reinforcement learning methods such as the original Actor-Critic and Bradtke's LQR Q-Learning from the Natural Actor-Critic. Most presented algorithms are compared in the setting of motor primitive
optimization, where the episodic Natural Actor-Critic algorithms outperform previous approaches by orders of magnitude. The applicability to robotics is presented in a T-Ball hitting example where the SARCOS Master Arm has to hit a ball on a stand in a proper fashion.
In summary, in this thesis, we have presented
• A general framework for control which serves as theoretical foundation and allows the unification of many previous control approaches (Chapter 2).
• The first learning approach to operational space control to date (Chapter 3).
• The reward-weighted regression approach for immediate reinforcement learning (Chapter 3).
• A unified treatment of previous policy gradient reinforcement learning approaches from the path-based perspective and their application to motor primitives (Chapter 4).
• With the Natural Actor-Critic, we have introduced a novel reinforcement learning method which is currently considered the "method of choice" (Aberdeen, 2006) in policy gradient methods and works well in the context of motor skill improvement in a T-Ball setup (Chapter 4).
These contributions together form the basis for motor skill learning in robotics.
5.2 Discussion: The Next Steps for Skill Learning
While this thesis has made important contributions to the domain of motor skill learning, it mainly delivers the building blocks required in order to create a truly autonomous system, such as (i) a general framework as foundation, (ii) a learning framework for motor command generation for execution in task space, and (iii) a task learning framework based on policy gradients and motor primitives. At this point, we understand these three areas sufficiently well in order to discuss the required next steps for motor skill learning, which are the collection of skills into libraries, the learning of skill selection, as well as the sequencing and parallelization of motor primitives.
Skill Libraries. In Chapter 4, we have discussed how single behaviours in a certain task-space can be learned using parameterized motor primitives, and in Chapter 3 we have discussed how to learn the execution of a motor task in its appropriate operational space. It is quite clear that humans operate in multiple task spaces depending on the accomplished motor skill, e.g., in body-centric or retinal coordinates (Boussaoud & Bremmer, 1999). However, it appears that the motor primitives used by a human being for hand-writing or hand-zig-zagging are exactly the same as those used for performing the task with a toe (Wing, 2000). Thus, the motor primitive programs seem to be kinematic plans in task-space, largely independent of the motor command generation. Thus, future skill libraries will contain both motor primitives as well as task-space to motor command transformations in order to make many combinations of the two possible. However, the
learning of such skills will largely use the methods presented in this thesis. It will use the observed movements in order to learn "coordinate system to motor command transformations", even if the performed task was in a different coordinate system. Separately from the task primitive to motor command transformation, we will learn motor primitives. For this, observed tasks are compared to existing primitives. If the observed task is equivalent to an existing one, it will be used for refining the primitive; otherwise it will be added to the skill library. Subsequently, the skill library manager needs to decide whether to practice the skill using the reinforcement learning methods presented in this thesis.
Learning to Select Skills. The selection of skills shifts the focus away from pure motor control towards a perceptuo-motor perspective. In this case, a general task is given, e.g., grasp a specified object and pick it up, move through the room along a global trajectory, or hit a ball with a tennis racket. Here, perceptual variables allow us to choose the right motor primitives, e.g., whether to select a power grasp vs. a precision pinch for a particular object, which foot trajectories to use for moving from one foothold to another in the presence of obstacles, or whether to select a tennis fore- vs. backhand. Similarly, they need to be used in order to set the motor primitive goal parameters, e.g., the contact points where we intend to hold the object, the selected next foothold, or where to hit the ball at what time. Each of these tasks is associated with the appropriate effector. However, it is quite obvious that some of the tasks do transfer between end-effectors,
e.g., we could use two fingers or two hands for generating a precision pinch for grasping and lifting a particular object.
Clearly, the next higher system above the skill selection system needs some form of higher-level intelligence which determines the general task. This layer could close the gap between artificial intelligence systems and robotics.
Motor Primitive Sequencing and Parallelization. Another key issue for future research is the parallelization and sequencing of motor primitives. Such issues will automatically arise in tasks of higher complexity, e.g., assembling a modular system such as an IKEA shelf. Such a task requires a sequence of subtasks, such as first several peg-in-the-hole tasks (Gullapalli et al., 1994) and subsequently dropping a shelf on top of the four pegs. It will also require holding two sides of the shelf in parallel so that they do not fall before being assembled together. In order to learn such tasks, we require a hybrid control architecture consisting of the lower-level components such as motor primitives and task execution as well as a higher, discrete layer. The state of this discrete layer is given by the active primitives which together form a macro state or option. Such approaches will require a fusion of previous hybrid control approaches, hierarchical reinforcement learning and imitation learning, similar as discussed in (Peters, 2005). Working towards such complex task compositions is of essential importance for the future of motor skills in robotics.
References
Aberdeen, D. (2006). POMDPs and policy gradients. In Proceedings of the Machine Learning Summer School (MLSS). Canberra, Australia.
Aleksandrov, V., Sysoyev, V., & Shemeneva, V. (1968). Stochastic optimization. Engineering Cybernetics, 5, 11–16.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(251).
An, C. H., Atkeson, C. G., & Hollerbach, J. M. (1988). Model-based control of a robot manipulator. Cambridge, MA: MIT Press.
Arimoto, S. (1996). Control theory of nonlinear mechanical systems: A passivity-based and circuit-theoretic approach. Oxford, UK: Oxford University Press.
Atkeson, C. G. (1994). Using local trajectory optimizers to speed up global optimization in dynamic programming. In J. E. H. S. J. Moody & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 6 (pp. 503–521). Morgan Kaufmann.
Bagnell, J., & Schneider, J. (2003). Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico.
Baillieul, J., & Martin, D. P. (1990). Resolution of kinematic redundancy. In Proceedings of Symposia in Applied Mathematics (Vol. 41, pp. 49–89). San Diego, May 1990: Providence, RI: American Mathematical Society.
Baird, L. (1993). Advantage updating (Technical Report No. WL-TR-93-1146). Wright-Patterson Air Force Base, OH: Wright Laboratory.
Balasubramanian, V. (1997). Statistical inference, occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9(2), 349-368.
Bartlett, P. (2002). An introduction to reinforcement learning theory: Value function methods. In Proceedings of the Machine Learning Summer School (MLSS) (p. 184-202). Canberra, Australia.
Baxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319-350.
Baxter, J., Bartlett, P., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 351-381.
Beard, R., & McLain, T. (1998). Successive Galerkin approximation algorithms for nonlinear optimal and robust control. International Journal of Control: Special Issue on Breakthroughs in the Control of Nonlinear Systems, 71(5), 717–743.
Benbrahim, H., Doleac, J., Franklin, J., & Selfridge, O. (1992). Real-time learning: A ball on a beam. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). Baltimore, MD.
Benbrahim, H., & Franklin, J. (1997). Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22, 283–302.
Berny, A. (2000). Statistical machine learning and combinatorial optimization. In L. Kallel, B. Naudts, & A. Rogers (Eds.), Theoretical aspects of evolutionary computing (Vol. 1). Heidelberg, Germany: Springer-Verlag.
Boussaoud, D., & Bremmer, F. (1999). Gaze effects in the cerebral cortex: Reference frames for space coding and action. Exp Brain Res, 128, 170–180.
Boyan, J. (1999). Least-squares temporal difference learning. In Proceedings of the International Conference on Machine Learning (ICML) (pp. 49–56). Bled, Slovenia.
Bradtke, S., Ydstie, E., & Barto, A. (1994). Adaptive linear quadratic control using policy iteration (Technical report No. UM-CS-1994-049). Amherst, MA: University of Massachusetts.
Bruyninckx, H., & Khatib, O. (2000). Gauss' principle and the dynamics of redundant and constrained manipulators. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA) (pp. 2563–2569). San Francisco, CA.
Bullo, F., & Lewis, A. D. (2004). Geometric control of mechanical systems: modeling, analysis, and design for simple mechanical control systems. Heidelberg, Germany: Springer-Verlag.
Bullock, D., Grossberg, S., & Guenther, F. H. (1993). A self-organizing neural model of motor equivalent reaching and tool use by a multijoint arm. Journal of Cognitive Neuroscience, 5(4), 408–435.
Chung, W., Chung, W., & Youm, Y. (1993). Null torque based dynamic control for kinematically redundant manipulators. Journal of Robotic Systems, 10(6), 811–834.
Craig, J. (2005). Introduction to Robotics: Mechanics and Control. Upper Saddle River, NJ: Pearson Prentice Hall.
Dayan, P., & Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9(2), 271-278.
De Wit, C., Siciliano, B., & Bastin, G. (1996). Theory of robot control. Heidelberg, Germany: Springer-Verlag.
De Luca, A., & Mataloni, F. (1991). Learning control for redundant manipulators. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA). Sacramento, CA.
Doty, K., Melchiorri, C., & Bonivento, C. (1993). A theory of generalized inverses applied to robotics. International Journal of Robotics Research, 12, 1–19.
D'Souza, A., Vijayakumar, S., & Schaal, S. (2001). Learning inverse kinematics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Hawaii, USA.
Dyer, P., & McReynolds, S. R. (1970). The computation and theory of optimal control. New York, NY: Academic Press.
Endo, G., Morimoto, J., Matsubara, T., Nakanishi, J., & Cheng, G. (2005). Learning cpg sensory feedback with policy gradient for biped locomotion for a full-body humanoid. In Proceedings of the National Conference on Artificial Intelligence (AAAI). Pittsburgh, PA.
Fletcher, R., & Fletcher, R. (2000). Practical methods of optimization. New York, NY: John Wiley & Sons.
Glynn, P. (1987). Likelihood ratio gradient estimation: an overview. In Proceedings of the Winter Simulation Conference (WSC) (p. 366-375). Atlanta, GA.
Glynn, P. (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10), 75–84.
Greensmith, E., Bartlett, P., & Baxter, J. (2001). Variance reduction techniques for gradient estimates in reinforcement learning. Advances in Neural Information Processing Systems, 14(34).
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.
Guenter, F., Hersch, M., Calinon, S., & Billard, A. (2007). Reinforcement learning for imitating constrained reaching movements. Advanced Robotics, In Press.
Guez, A., & Ahmad, Z. (1988). Solution to the inverse kinematics problem in robotics by neural networks. In Proceedings of IEEE International Conference on Neural Networks (pp. 102–108). San Diego, CA.
Gullapalli, V. (1991). Associative reinforcement learning of real-value functions. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. Charlottesville, VA.
Gullapalli, V., Franklin, J., & Benbrahim, H. (1994). Acquiring robot skills via reinforcement learning. IEEE Control Systems Journal, Special Issue on Robotics: Capturing Natural Motion, 4(1), 13-24.
Hanafusa, H., Yoshikawa, T., & Nakamura, Y. (1981). Analysis and control of articulated robot with redundancy. In Proceedings of IFAC Symposium on Robot Control (Vol. 4, pp. 1927–1932). Gaithersburg, MD.
Haruno, M., Wolpert, D. M., & Kawato, M. (1999). Multiple paired forward-inverse models for human motor learning and control. In Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press.
Harville, D. A. (2000). Matrix algebra from a statistician's perspective. Heidelberg, Germany: Springer Verlag.
Hasdorff, L. (1976). Gradient optimization and nonlinear control. New York, NY: John Wiley & Sons.
Hirzinger, G., Sporer, N., Albu-Schäffer, A., Hähnle, M., Krenn, R., Pascucci, A., & Schedl, M. (2002). DLR's torque-controlled lightweight robot III - are we reaching the technological limits now? In Proceedings of IEEE International Conference on Robotics and Automation (ICRA) (p. 1710-1716). Washington DC.
Hollerbach, J. M., & Suh, K. C. (1987). Redundancy resolution of manipulators through torque optimization. International Journal of Robotics and Automation, 3(4), 308–316.
Hsu, P., Hauser, J., & Sastry, S. (1989). Dynamic control of redundant manipulators. Journal of Robotic Systems, 6(2), 133–148.
Ijspeert, A., Nakanishi, J., & Schaal, S. (2003). Learning attractor landscapes for learning motor primitives. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural Information Processing Systems (Vol. 15, pp. 1547–1554). Cambridge, MA: MIT Press.
Ijspeert, J. A., Nakanishi, J., & Schaal, S. (2002). Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA). Washington, DC.
Isidori, A. (1995). Nonlinear control systems. Heidelberg, Germany: Springer-Verlag.
Jacobson, D. H., & Mayne, D. Q. (1970). Differential dynamic programming. New York, NY: American Elsevier Publishing Company, Inc.
Jordan, I. M., & Rumelhart. (1992). Supervised learning with a distal teacher. Cognitive Science, 16, 307–354.
Kaebling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kakade, S. (2001). Optimizing average reward using discounted rewards. In Proceedings of the Conference on Computational Learning Theory (COLT). Amsterdam, Netherlands.
Kakade, S. A. (2002). Natural policy gradient. In Advances in Neural Information Processing Systems (Vol. 14). Vancouver, CA.
Kazerounian, K., & Wang, Z. (1988). Global versus local optimization in redundancy resolution of robotic manipulators. International Journal of Robotics Research, 7(5), 3-12.
K.C. Suh, & Hollerbach, J. M. (1987). Local versus global torque optimization of redundant manipulators. In Proceedings of the International Conference on Robotics and Automation (ICRA) (pp. 619–624). Raleigh, NC.
Khatib, O. (1987). A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal of Robotics and Automation, 3(1), 43–53.
Khatib, O., Sentis, L., Park, J., & Warren, J. (2004). Whole body dynamic behavior and control of human-like robots. International Journal of Humanoid Robotics, 1(1), 29–43.
Kimura, H., & Kobayashi, S. (1997). Reinforcement learning for locomotion of a two-linked robot arm. In Proceedings of the European Workshop on Learning Robots (EWLR) (pp. 144–153). Brighton, UK.
Kimura, H., & Kobayashi, S. (1998). Reinforcement learning for continuous action using stochastic gradient ascent. In Proceedings of the International Conference on Intelligent Autonomous Systems (IAS) (Vol. 5, pp. 288–295). Madison, Wisconsin.
Kleinman, N., Spall, J., & Naiman, D. (1999). Simulation-based optimization with stochastic approximation using common random numbers. Management Science, 45, 1570–1578.
Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). New Orleans, LA.
Konda, V., & Tsitsiklis, J. (2000). Actor-critic algorithms. Advances in Neural Information Processing Systems 12.
Lawrence, G., Cowan, N., & Russell, S. (2003). Efficient gradient estimation for motor control learning. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI). Acapulco, Mexico.
Maciejewski, A., & Klein, C. (1985). Obstacle avoidance for kinematically redundant manipulators in dynamically varying environments. International Journal of Robotics Research, 4(3), 109–117.
Minamide, N., & Nakamura, K. (1969). Minimum error control problem in Banach space (Research Report of Automatic Control Lab No. 16). Nagoya, Japan: Nagoya University.
Mitsunaga, N., Smith, C., Kanda, T., Ishiguro, H., & Hagita, N. (2005). Robot behavior adaptation for human-robot interaction based on policy gradient reinforcement learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (p. 1594-1601). Edmonton, Canada.
Miyamoto, H., Gandolfo, F., Gomi, H., Schaal, S., Koike, Y., Osu, R., Nakano, E., & Kawato, M. (1995). A kendama learning robot based on a dynamic optimization theory. In Proceedings of the IEEE International Workshop on Robot and Human Communication (ROMAN) (pp. 327–332). Tokyo, Japan.
Miyamoto, H., Gandolfo, F., Gomi, H., Schaal, S., Koike, Y., Rieka, O., Nakano, E., Wada, Y., & Kawato, M. (1996). A kendama learning robot based on a dynamic optimization principle. In Proceedings of the International Conference on Neural Information Processing (ICONIP) (pp. 938–942). Hong Kong.
Moon, T., & Stirling, W. (2000). Mathematical methods and algorithms for signal processing. Upper Saddle River, NJ: Prentice Hall.
Mori, T., Nakamura, Y., & Ishii, S. (2005). Efficient sample reuse by off-policy natural actor-critic learning. In Advances in Neural Information Processing Systems (NIPS '05 Workshop). Vancouver, Canada.
Mori, T., Nakamura, Y., Sato, M. aki, & Ishii, S. (2004). Reinforcement learning for cpg-driven biped robot. In Proceedings of the National Conference on Artificial Intelligence (AAAI) (p. 623-630). San Jose, CA.
Morimoto, J., & Atkeson, C. A. (2003). Minimax differential dynamic programming: an application to robust biped walking. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Nakamura, Y. (1991). Advanced robotics: Redundancy and optimization. Boston, MA: Addison-Wesley.
Nakamura, Y., Hanafusa, H., & Yoshikawa, T. (1987). Task-priority based control of robot manipulators. International Journal of Robotics Research, 6(2), 3–15.
Nakamura, Y., Mori, T., & Ishii, S. (2004). Natural policy gradient reinforcement learning for a cpg control of a biped robot. In Proceedings of the International Conference on Parallel Problem Solving from Nature (PPSN) (p. 972-981). Kyoto, Japan.
Nakanishi, J., Cory, R., Mistry, M., Peters, J., & Schaal, S. (2005). Comparative experiments on task space control with redundancy resolution. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Edmonton, Canada.
Nakanishi, J., Farrell, J. A., & Schaal, S. (2004). Learning composite adaptive control for a class of nonlinear systems. In Proceedings of the International Conference on Robotics and Automation (ICRA) (pp. 2647–2652). New Orleans, LA.
Ng, A. Y., & Jordan, M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI). Palo Alto, CA.
Park, J., Chung, W.-K., & Youm, Y. (1995). Specification and control of motion for kinematically redundant manipulators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Las Vegas, USA.
Park, J., Chung, W.-K., & Youm, Y. (2002). Characterization of instability of dynamic control for kinematically redundant manipulators. In Proceedings of the International Conference on Robotics and Automation (ICRA). Washington, DC.
Park, J., Kim, J., & Kang, D. (2005). An RLS-Based Natural Actor-Critic Algorithm for Locomotion of a Two-Linked Robot Arm. In Y. Hao, J. Liu, Y. Wang, Y.-m. Cheung, H. Yin, L. Jiao, J. Ma, & Y.-C. Jiao (Eds.), Proceedings of the International Conference on Computational Intelligence and Security (CIS) (Vol. 3801, pp. 65–72). Xi'an, China: Springer.
Peters, J. (2005). Machine learning of motor skills for robotics (Technical Report No. CS-05-867). Los Angeles, CA: University of Southern California.
159
Peters, J., Mistry, M., Udwadia, F., Cory, R., Nakanishi, J., & Schaal, S. (2005). A unifying methodology for the control of robotic systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Edmonton, Canada.
Peters, J., Mistry, M., Udwadia, F. E., & Schaal, S. (2005). A new methodology for robot control design. In ASME International Conference on Multibody Systems, Nonlinear Dynamics, and Control (MSNDC 2005). Long Beach, CA.
Peters, J., & Schaal, S. (2006a). Learning operational space control. In Proceedings of Robotics: Science and Systems (RSS). Philadelphia, PA.
Peters, J., & Schaal, S. (2006b). Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Beijing, China.
Peters, J., & Schaal, S. (2007a). Reinforcement learning for operational space control. In Proceedings of the International Conference on Robotics and Automation (ICRA). Rome, Italy.
Peters, J., & Schaal, S. (2007b). Using reward-weighted regression for reinforcement learning of task space control. In Proceedings of the IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL). Honolulu, HI.
Peters, J., Vijayakumar, S., & Schaal, S. (2003a). Reinforcement learning for humanoid robotics. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS). Karlsruhe, Germany.
Peters, J., Vijayakumar, S., & Schaal, S. (2003b). Scaling reinforcement learning paradigms for motor learning. In Proceedings of the 10th Joint Symposium on Neural Computation (JSNC). Irvine, CA.
Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. In Proceedings of the European Machine Learning Conference (ECML). Porto, Portugal.
Pratt, J., & Pratt, G. (1998). Intuitive control of a planar bipedal walking robot. In Proceedings of the International Conference on Robotics and Automation (ICRA) (pp. 1024–2021). Leuven, Belgium.
Richter, S., Aberdeen, D., & Yu, J. (2007). Natural actor-critic for road traffic optimisation. In B. Schoelkopf, J. Platt, & T. Hofmann (Eds.), Advances in Neural Information Processing Systems (Vol. 19). Cambridge, MA: MIT Press.
Sadegh, P., & Spall, J. (1997). Optimal random perturbations for stochastic approximation using a simultaneous perturbation gradient approximation. In Proceedings of the American Control Conference (ACC) (pp. 3582–3586). Albuquerque, NM.
Samson, C., Borgne, M. L., & Espiau, B. (1991). Robot control: The task function approach. Oxford, UK: Oxford University Press.
Sato, M., Nakamura, Y., & Ishii, S. (2002). Reinforcement learning for biped locomotion. In Proceedings of the International Conference on Artificial Neural Networks (ICANN) (pp. 777–782). Springer-Verlag.
Schaal, S., Atkeson, C. G., & Vijayakumar, S. (2002). Scalable techniques from nonparametric statistics for real-time robot learning. Applied Intelligence, 17(1), 49–60.
Schaal, S., Ijspeert, A., & Billard, A. (2004). Computational approaches to motor learning by imitation. In C. D. Frith & D. Wolpert (Eds.), The neuroscience of social interaction (pp. 199–218). Oxford, UK: Oxford University Press.
Sciavicco, L., & Siciliano, B. (2007). Modeling and control of robot manipulators. Heidelberg, Germany: McGraw-Hill.
Sentis, L., & Khatib, O. (2004). A prioritized multi-objective dynamic controller for robots in human environments. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS). Los Angeles, USA.
Sentis, L., & Khatib, O. (2005). Control of free-floating humanoid robots through task prioritization. In Proceedings of the International Conference on Robotics and Automation (ICRA). Barcelona, Spain.
Siciliano, B., & Slotine, J. (1991). A general framework for managing multiple tasks in highly redundant robotic systems. In Proceedings of the International Conference on Robotics and Automation (ICRA) (pp. 1211–1216). Pisa, Italy.
Spall, J. C. (2003). Introduction to stochastic search and optimization: Estimation, simulation, and control. Hoboken, NJ: Wiley.
Spong, M., Thorp, J., & Kleinwaks, J. (1984). On pointwise optimal control strategies for robot manipulators. In Proceedings of the Annual Conference on Information Sciences and Systems. Princeton, NJ.
Spong, M., Thorp, J., & Kleinwaks, J. (1986). The control of robot manipulators with bounded input. IEEE Transactions on Automatic Control, 31(6), 483–490.
Su, F., & Gibbs, A. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3), 419–435.
Sutton, R., & Barto, A. (1998). Reinforcement learning. Boston, MA: MIT Press.
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, & K.-R. Mueller (Eds.), Advances in Neural Information Processing Systems (NIPS). Denver, CO: MIT Press.
Tedrake, R., Zhang, T. W., & Seung, H. S. (2005). Learning to walk in 20 minutes. In Proceedings of the Yale Workshop on Adaptive and Learning Systems. New Haven, CT: Yale University.
Tevatia, G., & Schaal, S. (2000). Inverse kinematics for humanoid robots. In Proceedings of the International Conference on Robotics and Automation (ICRA). San Francisco, CA.
Tsai, L.-W. (1999). Robot analysis. New York, NY: Wiley.
Udwadia, F. E. (2003). A new perspective on tracking control of nonlinear structural and mechanical systems. Proceedings of the Royal Society of London, Series A, 2003(439), 1783–1800.
Udwadia, F. E. (2005). Discussions on C. F. Gauss, Gauss' principle, and its application to control. (Personal communication)
Udwadia, F. E., & Kalaba, R. E. (1996). Analytical dynamics: A new approach. Cambridge, UK: Cambridge University Press.
Ueno, T., Nakamura, Y., Takuma, T., Shibata, T., Hosoda, K., & Ishii, S. (2006). Fast and stable learning of quasi-passive dynamic walking by an unstable biped robot based on off-policy natural actor-critic. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Beijing, China.
Vachenauer, P., Rade, L., & Westergren, B. (2000). Springers Mathematische Formeln: Taschenbuch für Ingenieure, Naturwissenschaftler, Informatiker, Wirtschaftswissenschaftler. Heidelberg, Germany: Springer-Verlag.
Wahba, G., & Nashed, M. Z. (1973). The approximate solution of a class of constrained control problems. In Proceedings of the Sixth Hawaii International Conference on Systems Sciences. Hawaii, HI.
Weaver, L., & Tao, N. (2001a). The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI) (Vol. 17). Seattle, Washington.
Weaver, L., & Tao, N. (2001b). The variance minimizing constant reward baseline for gradient-based reinforcement learning (Technical Report No. 30). Australian National University (ANU).
Werbos, P. (1979). Changes in global policy analysis procedures suggested by new methods of optimization. Policy Analysis and Information Systems, 3(1).
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(23).
Wing, A. M. (2000). Motor control: Mechanisms of motor equivalence in handwriting. Current Biology, 10(6), 245–248.
Yamane, K., & Nakamura, Y. (2003). Natural motion animation through constraining and deconstraining at will. IEEE Transactions on Visualization and Computer Graphics, 9(3).
Yoshikawa, T. (1990). Foundations of robotics: Analysis and control. Boston, MA: MIT Press.
Appendix A
Additional Derivations and Discussion

A.1 Steepest Descent with Respect to a Metric N

In this section, we show how to determine the steepest descent with respect to a metric N and prove the results from Section 4.4.1. We have the optimization problem

max_{δθ} J(θ + δθ) = J(θ) + ∇J(θ)^T δθ,
s.t. d(θ + δθ, θ) = (1/2) δθ^T N δθ = ε.

Thus, we have the Lagrangian

L(δθ, λ) = J(θ) + ∇J(θ)^T δθ + λ (ε − (1/2) δθ^T N δθ),

which can be maximized with respect to δθ, yielding δθ = λ^{-1} N^{-1} ∇J(θ). Then we have the dual function

g(λ) = J(θ) + (1/2) λ^{-1} ∇J(θ)^T N^{-1} ∇J(θ) + λ ε,

which gives us the Lagrangian multiplier

λ = sqrt( ∇J(θ)^T N^{-1} ∇J(θ) / ε ).

Therefore, we have the steepest gradient descent θ_{n+1} = θ_n + α_n N^{-1} ∇J(θ) with a learning rate

α_n = λ^{-1} = sqrt( ε / ∇J(θ)^T N^{-1} ∇J(θ) ).

The learning rate α_n is not necessarily a desirable one and can be replaced by a constant learning rate without changing the gradient's direction.
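To make the update above concrete, the following minimal sketch (not part of the thesis) performs repeated gradient steps under a metric N with the adaptive learning rate α_n derived above; the quadratic objective and all numerical values are assumptions chosen purely for illustration.

import numpy as np

def metric_gradient_step(theta, grad_J, N, eps=1e-3):
    """One step theta + alpha_n * N^{-1} grad J with alpha_n = sqrt(eps / (grad^T N^{-1} grad))."""
    nat_grad = np.linalg.solve(N, grad_J)            # N^{-1} grad J(theta)
    alpha = np.sqrt(eps / float(grad_J @ nat_grad))  # alpha_n = lambda^{-1}
    return theta + alpha * nat_grad

# Assumed quadratic objective J(theta) = -0.5 theta^T A theta + b^T theta, for illustration only
A = np.array([[2.0, 0.3], [0.3, 1.0]])
b = np.array([1.0, -0.5])
N = np.diag([4.0, 1.0])                              # metric
theta = np.zeros(2)
for _ in range(200):
    grad = b - A @ theta                             # gradient of the assumed objective
    theta = metric_gradient_step(theta, grad, N)
print(theta, np.linalg.solve(A, b))                  # theta ends up close to the maximizer A^{-1} b

Because the step size only depends on ε and the local gradient, the iterate hovers near the optimum rather than converging exactly, which is why a decaying or constant learning rate is often preferred in practice.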
A.2 Proof of the Covariance Theorem

For small parameter changes Δh and Δθ, we have Δθ = ∇_h θ^T Δh. If the natural policy gradient is a covariant update rule, a change Δh along the gradient ∇_h J(h) would result in the same change Δθ along the gradient ∇_θ J(θ) for the same scalar step-size α. By differentiation, we can obtain ∇_h J(h) = ∇_h θ ∇_θ J(θ). It is straightforward to show that the Fisher information matrix includes the Jacobian ∇_h θ twice as a factor,

F(h) = ∫_X d^π(x) ∫_U π(u|x) ∇_h log π(u|x) ∇_h log π(u|x)^T du dx,
     = ∇_h θ ∫_X d^π(x) ∫_U π(u|x) ∇_θ log π(u|x) ∇_θ log π(u|x)^T du dx ∇_h θ^T,
     = ∇_h θ F(θ) ∇_h θ^T.

This shows that the natural gradient in the h parameterization is given by

∇̃_h J(h) = F^{-1}(h) ∇_h J(h) = (∇_h θ F(θ) ∇_h θ^T)^{-1} ∇_h θ ∇_θ J(θ).

This has a surprising implication as it makes it straightforward to see that the natural policy gradient is covariant since

Δθ = α ∇_h θ^T Δh = α ∇_h θ^T ∇̃_h J(h),
   = α ∇_h θ^T (∇_h θ F(θ) ∇_h θ^T)^{-1} ∇_h θ ∇_θ J(θ),
   = α F^{-1}(θ) ∇_θ J(θ) = α ∇̃_θ J(θ),

assuming that ∇_h θ is invertible. We conclude that the natural policy gradient is in fact a covariant gradient update rule.
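As a quick numerical illustration (not part of the thesis), the argument above can be checked for an invertible linear reparameterization θ = M h, for which the Jacobian in the convention used here is ∇_h θ = M^T; the matrices and Fisher information below are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 3
M = rng.normal(size=(n, n)) + 3.0 * np.eye(n)       # invertible reparameterization theta = M h
F_theta = rng.normal(size=(n, n))
F_theta = F_theta @ F_theta.T + np.eye(n)           # assumed Fisher information in theta
grad_theta = rng.normal(size=n)                     # assumed gradient of J w.r.t. theta

grad_h_theta = M.T                                  # Jacobian in the thesis' convention
F_h = grad_h_theta @ F_theta @ grad_h_theta.T       # F(h) = grad_h theta F(theta) grad_h theta^T
grad_h = grad_h_theta @ grad_theta                  # chain rule: grad_h J = grad_h theta grad_theta J

nat_grad_h = np.linalg.solve(F_h, grad_h)           # natural gradient in the h parameterization
delta_theta_from_h = grad_h_theta.T @ nat_grad_h    # map the h-update back into theta
nat_grad_theta = np.linalg.solve(F_theta, grad_theta)

print(np.allclose(delta_theta_from_h, nat_grad_theta))  # True: the update is covariant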
A.3 A Discussion of Kimura & Kobayashi's Algorithm

Kimura and Kobayashi (1998) introduced a non-episodic algorithm based on value function approximation and policy gradients. Here, we show that this algorithm is in fact at best a biased version of the policy gradient. For this, we first show which value-function based gradient estimators can be derived from both the policy gradient theorem and GPOMDP. As discussed in Chapter 4, the true gradient of the policy gradient theorem is

∇J = ∫ d^π(x) ∫ π(u|x) ∇log π(u|x) [Q^π(x,u) − b(x)] du dx.    (A.1)

In this equation, we can make use of the relation Q^π(x,u) − b(x) = E{r(x,u) + V(x') − V(x)}, and derive a resulting algorithm

Δθ = α ⟨∇log π(u|x) [Q^π(x,u) − b(x)]⟩,    (A.2)
   ≈ α ⟨∇log π(u|x) [r(x,u) + V(x') − V(x)]⟩.    (A.3)

This algorithm is called SRV, see (Gullapalli, 1991), and represents a true gradient. Similarly, we could try to derive Kimura's algorithm from the GPOMDP formulation, where each action gets credit for all future actions. There, we have

∇J = ∫ d^π(x) ∫ π(u|x) (∇log d^π(x) + ∇log π(u|x)) [r(x,u) − b(x)] du dx,    (A.4)

and using d^π(x') = ∫∫ d^π(x) π(u|x) du dx, we realize that

d^π(x') ∇log d^π(x') = ⟨∇log π(u|x) + ∇log d^π(x)⟩.    (A.5)

As an algorithm, this yields

D(x') = ∇log π(u|x) + D(x),    (A.6)
Δθ = α ⟨[∇log π(u|x) + D(x)] [r(x,u) − b(x)]⟩.    (A.7)

Again, this is Episodic REINFORCE and represents a true gradient, but not Kimura's algorithm! If we use a forgetting factor in D(x') = ∇log π(u|x) + βD(x), as suggested in (Baxter & Bartlett, 2001), it will introduce a large bias, i.e., the optimal solution is altered! In LQR problems with Q = 0.1 and A = B = R = 1, a β ≈ 0.98 will result in unstable LQR solutions even if computed analytically.

Kimura and Kobayashi (1998) propose an algorithm which basically fuses the two correct algorithms into a single, biased policy gradient estimator

D(x') = ∇log π(u|x) + βD(x),    (A.8)
Δθ = α ⟨[∇log π(u|x) + D(x)] [r(x,u) + V(x') − V(x)]⟩.    (A.9)

As can be seen, this does not give a true gradient as

∇J ≠ ∫ d^π(x) ∫ π(u|x) (∇log d^π(x) + ∇log π(u|x)) [Q^π(x,u) − b(x)] du dx.    (A.10)

Thus, we have a biased gradient. It will work in some cases, but it is nothing but a heuristic which only works accidentally.
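For concreteness, here is a small sketch (not from the thesis) of the two update rules discussed above; it is one plausible reading of the recursions in Equations (A.3) and (A.8, A.9). The functions grad_log_pi(theta, x, u) and V(x), as well as the transition data (assumed to be the consecutive steps of one trajectory), are placeholders supplied by the user.

import numpy as np

def srv_update(theta, transitions, grad_log_pi, V, alpha=0.01):
    """Unbiased SRV-style update: Delta theta = alpha < grad log pi(u|x) [r + V(x') - V(x)] >."""
    g = np.zeros_like(theta)
    for (x, u, r, x_next) in transitions:
        td_error = r + V(x_next) - V(x)
        g += grad_log_pi(theta, x, u) * td_error
    return theta + alpha * g / len(transitions)

def kimura_style_update(theta, transitions, grad_log_pi, V, alpha=0.01, beta=0.98):
    """Biased update: an eligibility trace with forgetting factor beta (Eq. A.8) is
    combined with the critic's TD error, fusing Eq. (A.7) with Eq. (A.3) as in Eq. (A.9)."""
    g = np.zeros_like(theta)
    D = np.zeros_like(theta)
    for (x, u, r, x_next) in transitions:
        td_error = r + V(x_next) - V(x)
        D = grad_log_pi(theta, x, u) + beta * D    # forgetting-factor recursion
        g += D * td_error
    return theta + alpha * g / len(transitions)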
A.4 Derivations of the Two Forms of the eNAC

In this section, we derive the two different forms of the Episodic Natural Actor-Critic. For this, we first discuss a theorem needed for the derivation of the estimators, which simplifies the estimation of the compatible function approximation, and subsequently apply it to the eNAC.

Theorem A.1 A regression problem of the form

β* = [β_1^T, β_2^T]^T = argmin_β (Y − Xβ)^T (Y − Xβ),    (A.11)

with basis functions X = [X_1, X_2] has the unique solution

β_1 = (X_1^T X_1)^{-1} X_1^T (Y − X_2 β_2),    (A.12)
β_2 = Q^{-1} (X_2^T Y − X_2^T X_1 (X_1^T X_1)^{-1} X_1^T Y),    (A.13)

with

Q^{-1} = (X_2^T X_2)^{-1} + (X_2^T X_2)^{-1} X_2^T X_1 (X_1^T X_1 − X_1^T X_2 (X_2^T X_2)^{-1} X_2^T X_1)^{-1} X_1^T X_2 (X_2^T X_2)^{-1}.    (A.14)

Proof. The solution of the regression problem in Equation (A.11) is given by

β* = (X^T X)^{-1} X^T Y,    (A.15)

see (Harville, 2000). By defining T = X_1^T X_1, U = X_1^T X_2, W = X_2^T X_2, and subsequently applying the Matrix Inversion Theorem (see (Harville, 2000), pages 98–101), we obtain

(X^T X)^{-1} = [T, U; U^T, W]^{-1} = [T^{-1} + T^{-1} U Q^{-1} U^T T^{-1}, −T^{-1} U Q^{-1}; −Q^{-1} U^T T^{-1}, Q^{-1}],    (A.16)

with Q = W − U^T T^{-1} U. We can simplify Q^{-1} using the Sherman-Morrison Theorem (see (Moon & Stirling, 2000), pages 258–259), which yields

Q^{-1} = W^{-1} + W^{-1} U^T (T − U W^{-1} U^T)^{-1} U W^{-1}.    (A.17)

When multiplying (X^T X)^{-1} by X^T Y = [(X_1^T Y)^T, (X_2^T Y)^T]^T, we obtain

β_1 = (T^{-1} + T^{-1} U Q^{-1} U^T T^{-1}) X_1^T Y − T^{-1} U Q^{-1} X_2^T Y,    (A.18)
    = T^{-1} (X_1^T Y − U β_2),    (A.19)
β_2 = Q^{-1} (X_2^T Y − U^T T^{-1} X_1^T Y).    (A.20)

After inserting the definitions of T, U, and W, we obtain Equations (A.12, A.13, A.14).
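As a sanity check (not part of the thesis), the partitioned solution of Theorem A.1 can be compared numerically against a direct least-squares solve; all sizes and data below are arbitrary assumptions used only for illustration.

import numpy as np

rng = np.random.default_rng(1)
m, n1, n2 = 50, 4, 3
X1, X2 = rng.normal(size=(m, n1)), rng.normal(size=(m, n2))
Y = rng.normal(size=m)

# Direct least-squares solution of (A.11)/(A.15)
beta_direct = np.linalg.lstsq(np.hstack([X1, X2]), Y, rcond=None)[0]

# Partitioned solution via (A.12)-(A.14)
T, U, W = X1.T @ X1, X1.T @ X2, X2.T @ X2
W_inv = np.linalg.inv(W)
Q_inv = W_inv + W_inv @ U.T @ np.linalg.inv(T - U @ W_inv @ U.T) @ U @ W_inv   # Eq. (A.17)
beta2 = Q_inv @ (X2.T @ Y - U.T @ np.linalg.solve(T, X1.T @ Y))                # Eq. (A.13)/(A.20)
beta1 = np.linalg.solve(T, X1.T @ Y - U @ beta2)                               # Eq. (A.12)/(A.19)

print(np.allclose(np.concatenate([beta1, beta2]), beta_direct))                # True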
A.4.1 Derivation of the eNAC1

As we have seen in Section 4.4.3.2 for the estimator g_1, the baseline b can only be a constant. The compatible function approximation is then given by

f_w(ξ_j) = ( Σ_{τ=0}^{N} ∇_θ log π(u^j_τ | x^j_{0:τ}) )^T w,    (A.21)

for history ξ_j, and the targets are the accumulated rewards along the trajectory, r(ξ_j). When we bring this into the standard regression form, the basis functions are given by

X^T = [ φ^1_{1:n}, φ^2_{1:n}, ..., φ^m_{1:n} ; 1, 1, ..., 1 ],    (A.22)

where φ^j_{1:n} = Σ_{τ=0}^{N} ∇_θ log π(u^j_τ | x^j_{0:τ}) denotes the gradient of the log-probability of the j-th roll-out, and the targets are given by

Y^T = [ r^1_{1:n}, r^2_{1:n}, ..., r^m_{1:n} ],    (A.23)

where r^j_{1:n} = r(ξ_j) denotes the sum of the rewards of the j-th roll-out. The solution β* = [w^T, b]^T can then be derived as shown in Theorem A.2.

Theorem A.2 The solution β* = [w_1^T, b_1]^T of the regression problem is given by

w_1 = F_1^{-1} g_1,    (A.24)
b_1 = m^{-1} (1 + φ̄^T (m F_1 − φ̄ φ̄^T)^{-1} φ̄) (r̄ − φ̄^T F_1^{-1} g),    (A.25)

with Fisher information F_1, average eligibility φ̄, average reward r̄, and policy gradient with baseline g_1 and without baseline g given by

F_1 = Σ_{j=1}^{m} φ^j_{1:n} (φ^j_{1:n})^T,    φ̄ = Σ_{j=1}^{m} φ^j_{1:n},    r̄ = Σ_{j=1}^{m} r^j_{1:n},    (A.26)
g_1 = Σ_{j=1}^{m} φ^j_{1:n} (r^j_{1:n} − b_1),    g = Σ_{j=1}^{m} φ^j_{1:n} r^j_{1:n}.    (A.27)

Proof. We make use of Theorem A.1 in the Appendix. We first obtain

X_1^T X_1 = F_1,  X_1^T X_2 = φ̄,  X_2^T X_2 = m,  X_1^T Y = g = g_1 + φ̄ b,  X_2^T Y = r̄.    (A.28)

We insert these into Equations (A.12, A.13, A.14) from Theorem A.1, and obtain

w = β_1 = F_1^{-1} (g − φ̄ b) = F_1^{-1} g_1,    (A.29)
b = β_2 = Q^{-1} (r̄ − φ̄^T F_1^{-1} g),    (A.30)

with Q^{-1} = m^{-1} (1 + φ̄^T (m F_1 − φ̄ φ̄^T)^{-1} φ̄).

Note that this gradient estimator is in fact using exactly the REINFORCE gradient with just one, constant baseline. It can alternatively be derived using Sutton's form by adding up the advantages along a path, and is also known as the episodic natural actor-critic (Peters et al., 2003a; Peters, Vijayakumar, & Schaal, 2003b). At this point, it might appear that we have thrown the baby out with the bathwater, as we need the reward sequence to approximate the gradient and, at the same time, need the gradient to approximate the reward sequence.
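The following small sketch (not from the thesis) computes the eNAC1 quantities of Theorem A.2 from per-rollout eligibilities and returns, and checks them against the plain regression of Equations (A.22, A.23); the random data are purely illustrative stand-ins for collected roll-outs.

import numpy as np

def enac1(phi, R):
    """phi: (m, d) per-rollout eligibilities, R: (m,) roll-out returns.
    Returns the natural gradient estimate w1 and the constant baseline b1 of Theorem A.2."""
    m = phi.shape[0]
    F1 = phi.T @ phi                       # Fisher estimate, Eq. (A.26)
    phi_bar = phi.sum(axis=0)              # summed eligibility
    r_bar = R.sum()                        # summed return
    g = phi.T @ R                          # gradient without baseline, Eq. (A.27)
    Q_inv = (1.0 + phi_bar @ np.linalg.solve(m * F1 - np.outer(phi_bar, phi_bar), phi_bar)) / m
    b1 = Q_inv * (r_bar - phi_bar @ np.linalg.solve(F1, g))     # Eq. (A.25)
    w1 = np.linalg.solve(F1, g - phi_bar * b1)                  # Eq. (A.24), using g1 = g - phi_bar * b1
    return w1, b1

# Consistency check against the regression [phi, 1] beta = R
rng = np.random.default_rng(2)
m, d = 20, 3
phi, R = rng.normal(size=(m, d)), rng.normal(size=m)
w1, b1 = enac1(phi, R)
beta = np.linalg.lstsq(np.hstack([phi, np.ones((m, 1))]), R, rcond=None)[0]
print(np.allclose(np.append(w1, b1), beta))  # True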
A.4.2 Derivation of the eNACn

As we have seen in Section 4.4.3.3 for the estimator g_2, the baseline b depends only on the time step (and the initial state). The compatible function approximation is then given by

f_w(ξ^j_{0:k}) = ( Σ_{τ=0}^{k} ∇_θ log π(u^j_τ | x^j_τ) )^T w,    (A.31)

and the targets are the actual rewards along the trajectory, i.e., r(ξ_{0:t}) = r_t. When we bring this into the standard regression form, the basis functions are given by

X = [ φ^1_{1:1}, φ^1_{1:2}, ..., φ^1_{1:n}, φ^2_{1:1}, ..., φ^m_{1:n} ; u_1, u_2, ..., u_n, u_1, ..., u_n ],    (A.32)

where φ^j_{1:i} = Σ_{t=1}^{i} ∇_θ log π(u^j_t | x^j_t) denotes the gradient of the log-probability of the j-th roll-out up to the i-th time step, and u_i denotes the i-th unit vector basis function of length n. The targets are given by

Y = [ r^1_1, r^1_2, ..., r^1_n, r^2_1, ..., r^m_n ],    (A.33)

where r^j_t denotes the reward at time t of the j-th roll-out. The solution β* = [w^T, b^T]^T can then be derived as shown in Theorem A.3.

Theorem A.3 The solution β*_2 = [w_2^T, b_2^T]^T of the regression problem is given by

w_2 = F_2^{-1} g_2,    (A.34)
b_2 = m^{-1} (I_n + Φ̄^T (m F_2 − Φ̄ Φ̄^T)^{-1} Φ̄) (r̄ − Φ̄^T F_2^{-1} g),    (A.35)

with Fisher information F_2, average eligibility matrix Φ̄, average reward r̄, and policy gradient with baseline g_2 and without baseline g given by

F_2 = Σ_{j=1}^{m} Σ_{i=1}^{n} φ^j_{1:i} (φ^j_{1:i})^T,    Φ̄ = Σ_{j=1}^{m} Σ_{i=1}^{n} φ^j_{1:i} e_i^T,    r̄ = Σ_{j=1}^{m} Σ_{i=1}^{n} r^j_i e_i,    (A.36)
g_2 = Σ_{j=1}^{m} Σ_{i=1}^{n} φ^j_{1:i} (r^j_i − b_i),    g = Σ_{j=1}^{m} Σ_{i=1}^{n} φ^j_{1:i} r^j_i.    (A.37)

Proof. We make use of Theorem A.1 in the Appendix. We first obtain

X_1^T X_1 = F_2,  X_1^T X_2 = Φ̄,  X_2^T X_2 = m I_n,  X_1^T Y = g = g_2 + Φ̄ b_2,  X_2^T Y = r̄.    (A.38)

We insert these into Equations (A.12, A.13, A.14) from Theorem A.1, and obtain

w = β_1 = F_2^{-1} (g − Φ̄ b) = F_2^{-1} g_2,    (A.39)
b = β_2 = Q^{-1} (r̄ − Φ̄^T F_2^{-1} g),    (A.40)

with Q^{-1} = m^{-1} (I_n + Φ̄^T (m F_2 − Φ̄ Φ̄^T)^{-1} Φ̄).

Note that this compatible gradient estimator is in fact using exactly the GPOMDP gradient with a time-variant scalar baseline b_k. However, just like in the last section, we note the same apparent problem.
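A corresponding sketch (not from the thesis) for the time-variant-baseline estimator of Theorem A.3 is given below; the per-step score functions and rewards are assumed to have already been collected, and the random data only serve as a consistency check against the flattened regression of Equations (A.32, A.33).

import numpy as np

def enacn(score, r):
    """score: (m, n, d) per-step score functions grad_theta log pi(u_t^j | x_t^j),
    r: (m, n) per-step rewards. Returns w2 and the time-variant baseline b2 of Theorem A.3."""
    m, n, d = score.shape
    phi = np.cumsum(score, axis=1)               # phi^j_{1:i}: scores summed up to time i
    F2 = np.einsum('jtd,jte->de', phi, phi)      # Fisher estimate, Eq. (A.36)
    Phi_bar = phi.sum(axis=0).T                  # (d, n): column i is sum_j phi^j_{1:i}
    r_bar = r.sum(axis=0)                        # (n,)
    g = np.einsum('jtd,jt->d', phi, r)           # gradient without baseline, Eq. (A.37)
    Q_inv = (np.eye(n) + Phi_bar.T @ np.linalg.solve(m * F2 - Phi_bar @ Phi_bar.T, Phi_bar)) / m
    b2 = Q_inv @ (r_bar - Phi_bar.T @ np.linalg.solve(F2, g))   # Eq. (A.35)
    w2 = np.linalg.solve(F2, g - Phi_bar @ b2)                  # Eq. (A.34)
    return w2, b2

# Consistency check against the flattened regression of Eqs. (A.32, A.33)
rng = np.random.default_rng(3)
m, n, d = 10, 5, 3
score, r = rng.normal(size=(m, n, d)), rng.normal(size=(m, n))
w2, b2 = enacn(score, r)
X1 = np.cumsum(score, axis=1).reshape(m * n, d)
X2 = np.tile(np.eye(n), (m, 1))
beta = np.linalg.lstsq(np.hstack([X1, X2]), r.reshape(m * n), rcond=None)[0]
print(np.allclose(np.concatenate([w2, b2]), beta))  # True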
A.5 Motor Primitive Equations

The motor primitives from (Ijspeert et al., 2002, 2003) in their most recent reformulation are given by a canonical system

τ^{-1} v̇ = α_v (β_v (g − x) − v),    (A.41)
τ^{-1} ẋ = v,    (A.42)

which represents the phase of the motor process. It has a goal g, a time constant τ, and parameters α_v, β_v which are chosen so that the system is stable. Additionally, we have a transformed system

τ^{-1} ż = α_z (β_z (s − x) − v) + f(x, v, g),    (A.43)
τ^{-1} ẏ = z,    (A.44)
τ^{-1} ṡ = α_s (g − s),    (A.45)

which has the same time constant τ as the canonical system, appropriately set parameters α_z, β_z, α_s, and a transformation function f(x, v, g). The transformation function transforms the output of the canonical system so that the transformed system can represent complex nonlinear patterns; it is given by

f(x, v, g) = ( Σ_{i=1}^{N} ψ_i(x) θ_i v ) / ( Σ_{i=1}^{N} ψ_i(x) ),    (A.46)

where θ_i are adjustable parameters, and it has localization weights defined by

ψ_i(x) = exp( −h_i ((x − x_0)/(g − x_0) − c_i)^2 ),    (A.47)

with offset x_0, centers c_i, and widths h_i.
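To illustrate how these equations can be integrated in practice, here is a minimal Euler-integration sketch (not from the thesis). It implements Equations (A.41)-(A.47) literally as written above; all parameter values, basis-function centers, and widths are assumptions chosen only for illustration and are not the values used in the thesis.

import numpy as np

def psi(x, x0, g, c, h):
    """Localization weights psi_i(x) of Eq. (A.47)."""
    return np.exp(-h * ((x - x0) / (g - x0) - c) ** 2)

def integrate_motor_primitive(theta, g=1.0, y0=0.0, x0=0.0, tau=1.0, dt=0.001, T=1.0,
                              alpha_v=8.0, beta_v=2.0, alpha_z=8.0, beta_z=2.0, alpha_s=2.0):
    N = len(theta)
    c = np.linspace(0.0, 1.0, N)      # basis-function centers (assumed)
    h = np.full(N, 50.0)              # basis-function widths (assumed)
    x, v = x0, 0.0                    # canonical system state
    y, z, s = y0, 0.0, y0             # transformed system state
    ys = []
    for _ in range(int(T / dt)):
        w = psi(x, x0, g, c, h)
        f = (w @ theta) * v / w.sum()                              # Eq. (A.46)
        v += dt * tau * alpha_v * (beta_v * (g - x) - v)           # Eq. (A.41)
        x += dt * tau * v                                          # Eq. (A.42)
        z += dt * tau * (alpha_z * (beta_z * (s - x) - v) + f)     # Eq. (A.43), as transcribed above
        y += dt * tau * z                                          # Eq. (A.44)
        s += dt * tau * alpha_s * (g - s)                          # Eq. (A.45)
        ys.append(y)
    return np.array(ys)

trajectory = integrate_motor_primitive(theta=0.1 * np.ones(10))    # y-trajectory shaped by theta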
Abstract
Autonomous robots have been a long standing vision of robotics, artificial intelligence, and cognitive sciences. Early approaches in the 1980s showed that approaches purely based on reasoning and human insights would not be applicable to all perceptuomotor tasks of future robots. Instead, new hope was put in the growing wake of machine learning. However, to date, learning techniques have yet to fulfill this promise as only few task-independent methods scale into manipulator or humanoid robotics. In order to overcome these difficulties, we investigate steps towards a structured, general approach to motor skill learning based on a theoretically well-founded novel framework for task control and appropriate learning algorithms.