ITERATIVE PATH INTEGRAL STOCHASTIC OPTIMAL CONTROL: THEORY AND APPLICATIONS TO MOTOR CONTROL

by Evangelos A. Theodorou

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2011

Copyright 2011 Evangelos A. Theodorou

Epigraph

VIRTUE, then, being of two kinds, intellectual and moral, intellectual virtue in the main owes both its birth and its growth to teaching (for which reason it requires experience and time), while moral virtue comes about as a result of habit, whence also its name ἠθική is one that is formed by a slight variation from the word ἔθος (habit). From this it is also plain that none of the moral virtues arises in us by nature; for nothing that exists by nature can form a habit contrary to its nature. For instance the stone which by nature moves downwards cannot be habituated to move upwards, not even if one tries to train it by throwing it up ten thousand times; nor can fire be habituated to move downwards, nor can anything else that by nature behaves in one way be trained to behave in another. Neither by nature nor contrary to nature do the virtues arise in us; rather we are adapted by nature to receive them, and are made perfect by habit.

The Nicomachean Ethics, Aristotle, 384-322 B.C. (Text taken from the book 'Aristotle: The Nicomachean Ethics', translated by David Ross (Ross 2009).)

Dedication

To my first teacher in physics, my mother Anastasia. To my brother Zacharias. To my guardian angel Choanna.

Acknowledgements

In this journey towards my Ph.D. there have been three very important people, Anastasia, Zacharias and Choanna, who deeply understood my intellectual goals and encouraged me in every difficult moment. My mother Anastasia was my first teacher in physics, my first intellectual mentor, who taught me fairness and morality. She has been giving me the love, the courage and the strength to make my visions real. My brother Zacharias was always there to remind me that I had to stand up and that life is like a marathon. My guardian angel Choanna has always been on my side, teaching me how to enjoy every moment of life, giving me love and positive energy and inspiring me intellectually and mentally. Without the support and the unconditional love of these people I would not have been able to create, fight for and reach my dream.

I would like to thank my colleagues in the Computational Learning and Motor Control lab and in the Brain Body Dynamics lab. Special thanks go to Mike Mistry who, besides being my colleague, was my roommate for the first two years in LA and a very good friend. Special thanks go also to Heiko Hoffman. I thank him for his kindness, generosity and friendship all these years. During the last year of my Ph.D. I met Daniel Braun as a roommate and a colleague. I am thankful to Daniel for all of our analytical conversations regarding philosophy, epistemology and life.

Inspiration for the work in this thesis comes from the work on path integrals and stochastic optimal control by Prof. Bert Kappen. I cannot forget my enthusiasm when I first read his papers. I would like to thank him for his work and the interactions that we had. I am deeply grateful to Prof. Stefan Schaal, my main advisor, for trusting me and giving me the opportunity to study at USC. Stefan gave me the support and the freedom to work on a topic of my choice. I also thank Prof. Francisco J.
Valero Cuevas for giving me the opportunity to work on applications of control theory to biomechanics. I am thankful to Prof. Emo Todorov for accepting my request to work in his lab as a visiting student for a summer and for his feedback. Finally, I would like to thank Prof. Gaurav Sukhatme and Prof. Nicholas Schweighofer for being members of my committee and for their feedback.

Table of Contents

Epigraph
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Motivation
1.2 Stochastic optimal control theory
1.3 Reinforcement learning: The machine learning view of optimal control theory
1.4 Dissertation outline

Chapter 2: Optimal Control Theory
2.1 Dynamic programming and the Bellman principle of optimality: The continuous case
2.2 Pontryagin maximum principle
2.3 Iterative optimal control algorithms
2.3.1 Stochastic differential dynamic programming
2.3.1.1 Value function second order approximation
2.3.1.2 Optimal controls
2.3.2 Differential dynamic programming
2.4 Risk sensitivity and differential game theory
2.4.1 Stochastic differential games
2.4.2 Risk sensitive optimal control
2.5 Information theoretic interpretations of optimal control
2.6 Discussion

Chapter 3: Path Integrals, Feynman Kac Lemmas and their connection to PDEs
3.1 Path integrals and quantum mechanics
3.1.1 The principle of least action in classical mechanics and the quantum mechanical amplitude
3.1.2 The Schrödinger equation
3.2 Fokker Planck equation and SDEs
3.2.1 Fokker Planck equation in Itô calculus
3.2.2 Fokker Planck equation in Stratonovich calculus
3.3 Path integrals and SDEs
3.3.1 Path integral in Stratonovich calculus
3.3.2 Path integral in Itô calculus
3.4 Path integrals and multi-dimensional SDEs
3.5 Cauchy problem and the generalized Feynman Kac representation
3.6 Special cases of the Feynman Kac lemma
3.7 Backward and forward Kolmogorov PDE and their fundamental solutions
3.8 Connection of backward and forward Kolmogorov PDE via the Feynman Kac lemma
3.9 Forward and backward Kolmogorov PDEs in estimation and control
3.10 Conclusions
3.11 Appendix

Chapter 4: Path Integral Stochastic Optimal Control
4.1 Path integral stochastic optimal control
4.2 Generalized path integral formalism
4.3 Path integral optimal controls
4.4 Path integral control for special classes of dynamical systems
4.5 Itô versus Stratonovich path integral stochastic optimal control
4.6 Iterative path integral stochastic optimal control
4.6.1 Iterative path integral control with equal boundary conditions
4.6.2 Iterative path integral control with not equal boundary conditions
4.7 Risk sensitive path integral control
4.8 Appendix

Chapter 5: Policy Gradient Methods
5.1 Finite difference
5.2 Episodic reinforce
5.3 GPOMDP and policy gradient theorem
5.4 Episodic natural actor critic
5.5 Discussion

Chapter 6: Applications to Robotic Control
6.1 Learnable nonlinear attractor systems
6.1.1 Nonlinear point attractors with adjustable landscape
6.1.2 Nonlinear limit cycle attractors with adjustable landscape
6.2 Robotic optimal control and planning with nonlinear attractors
6.3 Policy improvements with path integrals: The (PI2) algorithm
6.4 Evaluations of (PI2) for optimal planning
6.4.1 Learning optimal performance of a 1 DOF reaching task
6.4.2 Learning optimal performance of a 1 DOF via-point task
6.4.3 Learning optimal performance of a multi-DOF via-point task
6.4.4 Application to robot learning
6.5 Evaluations of (PI2) on planning and gain scheduling
6.6 Way-point experiments
6.6.1 Phantom robot, passing through a waypoint in joint space
6.6.2 Kuka robot, passing through a waypoint in task space
6.7 Manipulation task
6.7.1 Task 2: Pushing open a door with the CBi humanoid
6.7.2 Task 3: Learning tasks on the PR2
6.8 Discussion
6.8.1 Simplifications of PI2
6.8.2 The assumption λR⁻¹ = Σ
6.8.3 Model-based, Hybrid, and Model-free Learning
6.8.4 Rules of cost function design
6.8.5 Dealing with hidden state
6.8.6 Arbitrary states in the cost function

Chapter 7: Neuromuscular Control
7.1 Tendon driven versus torque driven actuation
7.2 Skeletal mechanics
7.3 Dimensionality and redundancy
7.4 Musculotendon routing
7.5 Discussion

Chapter 8: Control of the index finger
8.1 Index finger biomechanics
8.2 Iterative stochastic optimal control
8.3 Multi-body dynamics
8.4 Effect of the moment arm matrices in the control of the index finger
8.4.1 Flexing movement
8.4.2 Tapping movement
8.5 Discussion

Chapter 9: Conclusions and future work
9.1 Path integral control and applications to learning and control in robotics
9.2 Future work on path integral optimal control
9.2.1 Path integral control for systems with control multiplicative noise
9.2.2 Path integral control for Markov jump diffusion processes
9.2.3 Path integral control for generalized cost functions
9.3 Future work on stochastic dynamic programming
9.4 Future work on neuromuscular control

Bibliography

List of Tables

2.1 Optimal control algorithms according to First Order Expansion (FOE) or Second Order Expansion (SOE) of dynamics and cost function and the existence of noise.
4.1 Summary of optimal controls derived from the path integral formalism.
6.1 Pseudocode of the PI2 algorithm for a 1D parameterized policy (note that the discrete time step dt was absorbed as a constant multiplier in the cost terms).
8.1 Pseudocode of the iLQG algorithm.

List of Figures

6.1 Comparison of reinforcement learning of an optimized movement with motor primitives. a) Position trajectories of the initial trajectory (before learning) and the results of all algorithms after learning – the different algorithms are essentially indistinguishable. b) The same as a), just using the velocity trajectories. c) Average learning curves for the different algorithms with 1 std error bars from averaging 10 runs for each of the algorithms. d) Learning curves for the different algorithms when only two roll-outs are used per update (note that the eNAC cannot work in this case and is omitted).
6.2 Comparison of reinforcement learning of an optimized movement with motor primitives for passing through an intermediate target G. a) Position trajectories of the initial trajectory (before learning) and the results of all algorithms after learning. b) Average learning curves for the different algorithms with 1 std error bars from averaging 10 runs for each of the algorithms.
6.3 Comparison of learning multi-DOF movements (2, 10, and 50 DOFs) with planar robot arms passing through a via-point G. a,c,e) illustrate the learning curves for different RL algorithms, while b,d,f) illustrate the end-effector movement after learning for all algorithms. Additionally, b,d,f) also show the initial end-effector movement, before learning to pass through G, and a "stroboscopic" visualization of the arm movement for the final result of PI2 (the movements proceed in time starting at the very right and ending by (almost) touching the y axis).
6.4 Reinforcement learning of optimizing to jump over a gap with a robot dog. The improvement in cost corresponds to about 15 cm improvement in jump distance, which changed the robot's behavior from an initial barely successful jump to a jump that completely traversed the gap with the entire body. This learned behavior allowed the robot to traverse a gap at much higher speed in a competition on learning locomotion.
6.5 Sequence of images from the simulated robot dog jumping over a 14 cm gap. Top: before learning. Bottom: after learning. While the two sequences look quite similar at first glance, it is apparent that in the 4th frame the robot's body is significantly higher in the air, such that after landing, the body of the dog made about 15 cm more forward progress than before. In particular, the entire robot's body comes to rest on the other side of the gap, which allows for an easy transition to walking.
6.6 3-DOF Phantom simulation in SL.
6.7 Learning curves for the phantom robot.
6.8 Initial (red, dashed) and final (blue, solid) joint trajectories and gain scheduling for each of the three joints of the phantom robot. Yellow circles indicate intermediate subgoals.
6.9 Learning curves for the Kuka robot.
6.10 Initial (red, dotted), intermediate (green, dashed), and final (blue, solid) end-effector trajectories of the Kuka robot.
6.11 Initial (red, dotted), intermediate (green, dashed), and final (blue, solid) joint gain schedules for each of the six joints of the Kuka robot.
6.12 Left: Task scenario. Right: Learning curve for the door task. The costs specific to the gains are plotted separately.
6.13 Learned joint angle trajectories (center) and gain schedules (right) of the CBi arm after 0/6/100 updates.
6.14 Relevant states for learning how to play billiard.
6.15 Initial and final policies for rolling the box.
8.1 Flexing Movement: Sequence of postures generated when the first model of moment arm matrix is used and the iLQG is applied.
8.2 Flexing Movement: Tendon excursions for the right index finger during the flexing movement when the first model of moment arm matrix is used.
8.3 Flexing Movement: Tension profiles applied to the right index finger when the first model of moment arm matrix is used.
8.4 Flexing Movement: Extensor tension profiles applied to the right index finger when the first model of moment arm matrix is used.
8.5 Flexing Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the first model of moment arm matrix is used.
8.6 Flexing Movement: Sequence of postures generated when the second model of moment arm matrix is used and the iLQG is applied.
8.7 Flexing Movement: Tendon excursions for the right index finger during the flexing movement when the second model of moment arm matrix is used.
8.8 Flexing Movement: Tension profiles applied to the right index finger when the second model of moment arm matrix is used.
8.9 Flexing Movement: Extensor tension profiles applied to the right index finger when the second model of moment arm matrix is used.
8.10 Flexing Movement: Flexor tension profiles applied to the right index finger when the second model of moment arm matrix is used.
8.11 Flexing Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the second model of moment arm matrix is used.
8.12 Tapping Movement: Sequence of postures generated when the first model of moment arm matrix is used and the iLQG is applied.
8.13 Tapping Movement: Tendon excursions for the right index finger during the flexing movement when the first model of moment arm matrix is used.
8.14 Tapping Movement: Tension profiles applied to the right index finger when the first model of moment arm matrix is used.
8.15 Tapping Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the first model of moment arm matrix is used.
8.16 Tapping Movement: Sequence of postures generated when the second model of moment arm matrix is used and the iLQG is applied.
8.17 Tapping Movement: Tendon excursions for the right index finger during the flexing movement when the second model of moment arm matrix is used.
8.18 Tapping Movement: Tension profiles applied to the right index finger when the second model of moment arm matrix is used.
8.19 Tapping Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the second model of moment arm matrix is used.

Abstract

Motivated by the limitations of current optimal control and reinforcement learning methods in terms of their efficiency and scalability, this thesis proposes an iterative stochastic optimal control approach based on the generalized path integral formalism. More precisely, we suggest the use of the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equation, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model-free, depending on how the learning problem is structured. The new algorithm, Policy Improvement with Path Integrals (PI2), demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well.
Applications to high dimensional robotic systems are presented for a variety of tasks that require optimal planning and gain scheduling. In addition to the work on generalized path integral stochastic optimal control, in this thesis we extend model based iterative optimal control algorithms to the stochastic setting. More precisely, we derive the Differential Dynamic Programming algorithm for stochastic systems with state and control multiplicative noise. Finally, in the last part of this thesis, model based iterative optimal control methods are applied to bio-mechanical models of the index finger with the goal of finding the underlying tendon forces applied during tapping and flexing movements.

Chapter 1: Introduction

1.1 Motivation

Given the technological breakthroughs of the last three decades in the areas of computer science and engineering, the speedup in processing power has reached the point where computationally expensive algorithms can nowadays be implemented and executed in an efficient and fast way. At the same time, advancements in memory technology offered the capability for fast and reliable storage of huge amounts of information. All this progress in computer science has benefitted robotics, since computationally heavy control, estimation and machine learning algorithms can now be executed online and in real time. The breakthroughs in computational speed and increasing memory size created new visions in robotics. In particular, future robots will not only perform in industrial environments but will also safely co-exist with humans in environments that are less structured and more dynamic and stochastic than the environment of a factory.

Despite all this evolution, learning for a robot how to autonomously perform human-like motor control tasks such as object manipulation, walking, running etc. remains an open problem. There is a combination of characteristics in humanoid robots which is unique and does not often exist in other classes of dynamical systems. These systems are usually high dimensional: depending on how many degrees of freedom are considered, their dimensionality can easily exceed 100 states. Moreover, their dynamical model is usually unknown and hard to estimate. In cases where a model is available, it is an approximation of the real dynamics, especially if one considers contact phenomena with the environment as well as the various sources of stochasticity such as sensor and actuation noise. Therefore, there is a level of uncertainty in humanoid robotic systems which is structural and parametric, because it results from the lack of accurate dynamical models, as well as stochastic, due to noisy and imperfect sensors.

All these characteristics of humanoid robots raise the question of how humans resolve the same issues, since they also perform motor control tasks in stochastic environments and deal with contact phenomena and sensor noise. As for the characteristic of dimensionality, this is present in an even more pronounced way in bio-mechanical systems: just for the control of the hand there are up to 30 actuated tendons.

Motivated by all these issues and difficulties, this thesis proposes a new stochastic optimal control formalism based on the framework of path integral control, which extends to the areas of robot learning and reinforcement learning. The path integral control framework and its extensions to iterative optimal control are the central topics of this thesis.
Moreover, inspired by the mystery of the bio-mechanical motor control of the index finger, this thesis investigates the underlying control strategies and studies their sensitivity with respect to model changes.

Since reinforcement learning and stochastic optimal control are the main frameworks of this thesis, a complete presentation should incorporate views coming from different communities of science and engineering. For this reason, in the next two sections we discuss the optimal control and reinforcement learning frameworks from the control theoretic and machine learning points of view. In the last section of this introductory chapter we provide an outline of this work, with a short description of the structure and the contents of each chapter.

1.2 Stochastic optimal control theory

Among the areas of control theory, optimal control is one of the most significant, with a plethora of applications from the very early development of aerospace engineering to robotics, traffic control, biology and computational motor control. With respect to other control theoretic frameworks, optimal control was the first to introduce optimization as a method to find controls. In fact, optimal control can be thought of as a constrained optimization problem with the characteristic that the constraints are not static, in the sense of algebraic equations, but correspond to dynamical systems and are therefore represented by differential equations. The addition of differential equations as constraints in the optimization problem leads to the property that in optimal control theory the minimum is not represented by one point x* in state space but by a trajectory τ* = (x*_1, x*_2, ..., x*_N), which is the optimal trajectory.

There are two fundamental principles that establish the theoretical basis of optimal control theory in its early developments. These principles are the Pontryagin Maximum Principle and the Bellman Principle of Optimality, or Dynamic Programming. The maximum principle was introduced by Lev Semenovich Pontryagin, a Russian mathematician, in his work The Mathematical Theory of Optimal Processes, which was first published in Russian in 1961 and then translated into English in 1962 (Pontryagin, Boltyanskii, Gamkrelidze & Mishchenko 1962). The Dynamic Programming framework was introduced in 1953 by Richard Ernest Bellman, an applied mathematician at the University of Southern California.

In the history of optimal control theory there has been criticism due to the fact that most of the design and analysis of optimal control takes place in the time domain. Therefore there was no definite answer regarding the stability margins of optimal controllers and their sensitivity with respect to unknown model parameters and uncertainty. Rudolf Kalman, in his paper "When is a linear control system optimal?", published in 1964 (Kalman 1964), studied the stability margins of optimal controllers for a special class of disturbances. Almost one decade later, early research on robust control theory (Safonov & Athans 1976), (Doyle 1978) investigated the stability margins of stochastic optimal controllers and showed that stochastic optimal controllers have poor stability margins. Most of the analysis and design in robust control theory takes place in the frequency domain. As a result, many of the applications of robust control theory dealt with the cases of infinite horizon optimal control problems and time invariant dynamical systems.
In these cases the analysis in the frequency domain is straightforward, since the closed-loop system is time invariant and the application of the Fourier or Laplace transform does not result in a convolution. The risk sensitive optimal control framework and its connection to differential game theory and to H∞ control provided a method to perform robust control for time varying systems and finite horizon optimal control problems, provided that the disturbances are bounded.

1.3 Reinforcement learning: The machine learning view of optimal control theory

In the theory of machine learning (Bishop 2006) there are three learning paradigms: supervised, unsupervised and reinforcement learning. Starting with the domain of supervised learning, the goal is to find high level mathematical representations of the kind y_i = f(x_i) between data sets x_i, y_i. Thus, in most cases the data set x_i, y_i is given and the question is whether or not the function f(x) can be found. Classical applications of supervised learning are problems like function approximation, regression and classification. In unsupervised learning the goal is to discover structure in data sets. Applications of unsupervised learning techniques are found in the problems of image segmentation, compression and dimensionality reduction.

The most recent branch of machine learning is the area of reinforcement learning. In a typical reinforcement learning scenario an agent explores the environment in order to find the set of optimal actions which will move it to a desired state. The desirability of reaching a goal state is encoded by the reward function. The reward is state dependent and, therefore, it has high values in the states close to the goal. An additional characteristic of reinforcement learning is that the reward is the only feedback provided to the agent. From this feedback the agent has to find the policy u(x,t) that maximizes its reward, i.e., the optimal policy. The policy u(x,t) can be a function of state and/or time, depending on how the learning problem is formulated. Essentially the optimal policy provides the actions, at a given state and/or time, that the agent needs to take in order to maximize its reward.

Reinforcement learning can also be thought of as a generalization of Markov Decision Processes (MDPs) (Russell & Norvig 2003), (Sutton & Barto 1998). The essential components of an MDP are an initial state x_0, a transition model T(x_{i+1}, u_i, x_i) and the reward function R(x). In MDPs the goal for the agent is to maximize the total expected reward. However, in the case of reinforcement learning, the transition model and the reward function may be unknown and subject to being learned by the agent as it explores the environment. When the agent is a dynamical system, reinforcement learning can be thought of as an optimal control problem: in both cases, the task is to optimize a cost function or total expected reward subject to constraints imposed by the dynamics of the system under consideration.

1.4 Dissertation outline

There are 9 chapters in this thesis, including the introductory chapter. Chapter 2 is a review of the theory of stochastic optimal control. More precisely, we start chapter 2 with the definition of a stochastic optimal control problem and the Dynamic Programming framework. Our discussion continues with the Pontryagin maximum principle and its connection to dynamic programming. Next, the iterative optimal control methods are presented, starting with stochastic differential dynamic programming and showing how it is related to Differential Dynamic Programming. Having in mind the criticism of optimal control theory related to robustness, we discuss the risk sensitive optimal control framework and its connection to differential game theory. We close chapter 2 with the entropy formulation of stochastic optimal control.

Chapter 3 contains important mathematical background on forward and backward partial differential equations (PDEs), stochastic differential equations (SDEs) and path integrals. Essentially the goal in this chapter is to highlight the connection between these three mathematical structures, which are commonly used in mathematical physics and control theory. More precisely, we start with the history of path integrals and the way they were introduced in quantum mechanics. We continue by investigating the connection between forward PDEs, SDEs and path integrals. The Feynman-Kac lemma is presented and its role in bridging the gap between backward PDEs and SDEs is discussed. After presenting the connection between PDEs, SDEs and path integrals, we focus our discussion on PDEs and we investigate the relation between backward and forward PDEs on 3 different levels.

Chapter 4 contains the main theory of the path integral control formalism and its application to stochastic optimal control. In particular, the stochastic optimal control problem is transformed into an approximation of a path integral through the application of the Feynman-Kac lemma. The presentation continues with the derivation of the path integral optimal control for the case of stochastic dynamical systems with state dependent control transition matrices. Variations of the path integral formalism based on the Itô and Stratonovich stochastic calculus, and on special classes of dynamical systems, are presented. In this chapter we go one step further with the iterative version of path integral control and its risk sensitive version.

Chapter 5 is a review of model-free reinforcement learning algorithms with an emphasis on policy gradient methods. Starting with the vanilla policy gradient method and the REINFORCE algorithm, we show the main derivations of the estimated corresponding gradient for each one of these algorithms. Next, the concept of the natural gradient is introduced, and the Episodic Natural Actor Critic is discussed.

Chapter 6 is dedicated to applications of path integral stochastic optimal control to learning robotic control problems. More precisely, we start with an introduction to dynamic movement primitives (DMPs) and their mathematical representation as nonlinear dynamical systems with adjustable attractor landscapes. Next, we explain how DMPs are used for optimal planning and control of robotic systems. We continue with the application of iterative path integral control to DMPs and the presentation of the resulting algorithm, called Policy Improvement with Path Integrals (PI2). In the remainder of chapter 6, applications of PI2 to robotic optimal control and planning problems are discussed. These applications include planning and variable stiffness control on simulated as well as real robots.

In chapters 7 and 8, the optimal control framework is applied to bio-mechanical models of the index finger with the goal of understanding the underlying control strategies and studying their sensitivity.
In particular, chapter 7 is a review of current methodologies in modeling bio-mechanical systems based on the characteristics of skeletal mechanics, muscle redundancy and tendon routing. Moreover, the differences between tendon driven and torque driven systems are discussed, and a review of previous work on optimal control and its application to bio-mechanical and psychophysical systems is given. In chapter 8 the basic physiology of the index finger is presented. Furthermore, the iterative optimal control algorithm is used on two bio-mechanical models of the index finger. The underlying control strategies are computed for a flexing and a tapping movement and their sensitivity with respect to model change is discussed. In the last chapter 9, we conclude and discuss future research.

Chapter 2: Optimal Control Theory

In this chapter we review the theory of optimal control starting from the Pontryagin maximum principle and the Dynamic Programming framework. In particular, in section 2.1 we discuss Dynamic Programming and we explain the concept of a value function or cost-to-go. Moreover, we derive the Hamilton-Jacobi-Bellman (HJB) equation, a fundamental Partial Differential Equation (PDE) in control. Solving the HJB equation results in finding the value function and defining the optimal control policy. In section 2.2, we review the Pontryagin maximum principle and we derive the Euler-Lagrange equations. Furthermore, we provide the connection between the Pontryagin Maximum Principle and the Hamiltonian approach in mechanics.

The application of Dynamic Programming to infinite and/or finite horizon nonlinear optimal control problems in continuous state-action spaces yields a family of iterative algorithms for optimal control. We start our presentation of iterative algorithms in section 2.3 with the derivation of Stochastic Differential Dynamic Programming (SDDP) for state dependent, control dependent and additive noise, and we illustrate how SDDP is a generalization of its deterministic version, i.e., Differential Dynamic Programming (DDP).

In section 2.4, the connection between risk sensitive optimal control and differential game theory is discussed. In particular, our presentation includes the derivation of the HJB equation for the case of risk sensitive cost functions and the case of differential game theoretic optimal control. Finally, in section 2.5 we present the entropy formulation of stochastic control theory, and in the last section we conclude our discussion of optimal control theory.

2.1 Dynamic programming and the Bellman principle of optimality: The continuous case

The goal in stochastic optimal control is to control a dynamical system while minimizing a performance criterion. Thus, the stochastic optimal control problem is a constrained optimization problem where the constraints are not only algebraic equations or inequalities but also differential equations which constitute the model of the dynamical system. In mathematical terms, the stochastic optimal control problem (Stengel 1994), (Basar & Berhard 1995), (Fleming & Soner 2006), (Bellman & Kalaba 1964) for a nonlinear dynamical system is expressed as follows:

min_u J(u,x) = min_u E[ φ(x_{t_N}) + ∫_{t_0}^{t_N} L(x,u) dt ]   (2.1)

with L(x,u) = q(x) + ½ u^T R u, subject to the constraints:

dx = ( f(x) + G(x)u ) dt + B(x) dw   (2.2)

or in a more compact form:

dx = F(x,u) dt + B(x) dw   (2.3)

with F(x,u) = f(x) + G(x)u, x ∈ R^n the state and u ∈ R^m the control vector. The immediate cost L(x,u) includes the state dependent cost q(x) and the control dependent cost u^T R u, while φ(x_{t_N}) is the terminal cost. The main idea in optimal control is to find the control or policy u = u(x,t) for which the cost function J(u,x) is minimized. The minimum of the cost function, the so-called value function or cost-to-go V(x), is defined as V(x) = min_u J(x,u). The value function is a function only of the state, since the optimal control policy u* = u*(x,t) is a function of the state. Therefore we can write:

min_u J(x,u) = J(x,u*) = J(x,u*(x,t)) = V(x)   (2.4)

From the equations above it is clear that the value function depends only on the state. The concept of the value function is essential for the Bellman principle of optimality and the development of the Dynamic Programming framework. More precisely, the Bellman principle (Dorato, Cerone & Abdallah 2000) states that:

Bellman Principle of Optimality: If u*(x,τ) is optimal over the interval [t, t_N], starting at state x(t), then u*(x,τ) is necessarily optimal over the subinterval [t, t+Δt] for any Δt such that t_N − t ≥ Δt ≥ 0.

Proof by contradiction: Let us assume that there exists a policy u**(x,t) that yields a smaller value for the cost

E[ φ(x_{t_N}) + ∫_{t+Δt}^{t_N} L(x,u) dτ ]   (2.5)

than u*(x,t) over the subinterval [t+Δt, t_N]. It makes sense to create the new control law

u(τ) = u*(τ) for t ≤ τ ≤ t+Δt,   u(τ) = u**(τ) for t+Δt ≤ τ ≤ t_N   (2.6)

Then over the interval [t, t_N] we have

E[ ∫_t^{t+Δt} L(x*,u*) dτ + ∫_{t+Δt}^{t_N} L(x**,u**) dτ + φ(x**_{t_N}) ]
 = E[ ∫_t^{t+Δt} L(x*,u*) dτ ] + E[ ∫_{t+Δt}^{t_N} L(x**,u**) dτ + φ(x**_{t_N}) ]
 < E[ ∫_t^{t+Δt} L(x*,u*) dτ ] + E[ ∫_{t+Δt}^{t_N} L(x*,u*) dτ + φ(x*_{t_N}) ]   (2.7)

Since u* is optimal, by assumption, over the interval [t, t_N], and the inequality above implies that u results in a smaller value of the cost function than the optimal one, we reach a contradiction.

The principle of optimality for the continuous case is formulated as follows:

V(x,t) = min_{u[x,(t,t+Δt)]} E[ ∫_t^{t+Δt} L(x,u,τ) dτ + V(x, t+Δt) ] = ∫_t^{t+Δt} L(x*,u*,τ) dτ + V(x, t+Δt)   (2.8)

In the last step we assume, for the analysis, that the optimal trajectory and control x* and u* are known and thus the expectation drops. The total derivative of the value function V(x,t) with respect to time is then expressed as follows:

dV(x,t)/dt = −L(x*,u*,τ)   (2.9)

Since the value function V is a function of the state, which is a random variable, we can apply the Itô differentiation rule and obtain:

dV = ( ∂V/∂t + (∇_x V)^T F(x,u,t) ) dt + ½ tr( (∇_xx V) B(x)B(x)^T ) dt   (2.10)

By equating the two equations above we arrive at:

∂V/∂t + L(x*,u*,τ) + (∇_x V)^T F(x*,u*,t) + ½ tr( (∇_xx V) B(x*)B(x*)^T ) = 0   (2.11)

The equation above can also be written in the form:

inf_u [ ∂V/∂t + L(x,u,τ) + (∇_x V)^T F(x,u,t) + ½ tr( (∇_xx V) B(x)B(x)^T ) ] = 0   (2.12)

Since the term ∂V/∂t does not depend on u, the equation can be re-arranged as:

−∂V/∂t = inf_u [ L(x,u,τ) + (∇_x V)^T F(x,u,t) + ½ tr( (∇_xx V) B(x)B(x)^T ) ]   (2.13)

The equation above is the so-called Hamilton-Jacobi-Bellman PDE, derived here for the case of stochastic dynamical systems.
Since F(x,u) = f(x) + G(x)u and L(x,u) = q(x) + ½ u^T R u, the right hand side of the equation above is convex with respect to the controls u, and therefore its minimization results in the following equation:

u*(x,t) = −R^{-1} G(x)^T ∇_x V   (2.14)

The optimal control policy u*(x,t) will move the system towards the direction of the minimum of the value function, since it is proportional to the negative direction of the gradient of the value function, projected onto the state space x by the multiplication with G(x) and weighted by the inverse of the control cost matrix R. Substitution of the optimal controls into the HJB equation yields the following PDE:

−∂V/∂t = q(x,t) + (∇_x V_t)^T f(x,t) − ½ (∇_x V_t)^T G(x) R^{-1} G(x)^T (∇_x V_t) + ½ tr( (∇_xx V_t) B(x)B(x)^T )   (2.15)

with the terminal boundary condition V(x(t_N)) = φ(x(t_N)). The equation above is a backward PDE, second order and nonlinear. Its solution is required in order to find the value function V(x,t); the gradient of the value function is then computed to determine the optimal control policy. There are a few special cases of the PDE in (2.15), depending on the cost function. More precisely, if the stochastic optimal control problem is infinite horizon, with the cost function:

min_u J(u,x) = min_u E[ ∫_{t_0}^{∞} L(x,u) dt ]   (2.16)

then the value function V(x) is not a function of time, and the resulting PDE is expressed as:

0 = q(x,t) + (∇_x V)^T f(x,t) − ½ (∇_x V)^T G(x) R^{-1} G(x)^T (∇_x V) + ½ tr( (∇_xx V) B(x)B(x)^T )

For the case of the discounted cost function of the form:

min_u J(u,x) = min_u E[ ∫_{t_0}^{∞} e^{−βt} L(x,u) dt ]   (2.17)

the partial time derivative of the value function equals ∂V/∂t = βV, and thus the PDE in (2.15) is formulated as follows:

−βV = q(x,t) + (∇_x V_t)^T f(x,t) − ½ (∇_x V_t)^T G(x) R^{-1} G(x)^T (∇_x V_t) + ½ tr( (∇_xx V_t) B(x)B(x)^T )   (2.18)

In all of the cases above, the solution of the corresponding PDE is challenging, especially in high dimensional state spaces, and this is what makes the optimal control problem difficult, in general, when applied to high dimensional and nonlinear dynamical systems. For linear systems, the PDE above collapses to the so-called Riccati equations (Stengel 1994), (Dorato et al. 2000), the solution of which provides a control policy that is linear in the state x, of the form u(x,t) = −K(t)x, where the matrix K(t) ∈ R^{m×n} is the control gain.
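To make the linear special case concrete, the following minimal sketch (not from the thesis; the discrete-time double-integrator system and cost weights are hypothetical choices for illustration) runs the standard finite-horizon Riccati recursion and applies the resulting time-varying linear policy u = −K(t)x.

```python
import numpy as np

# Finite-horizon discrete-time LQR sketch (hypothetical double integrator):
# x_{k+1} = A x_k + B u_k, cost sum_k (x_k^T Q x_k + u_k^T R u_k) + x_N^T Qf x_N.
dt = 0.01
A  = np.array([[1.0, dt], [0.0, 1.0]])
B  = np.array([[0.0], [dt]])
Q  = np.diag([1.0, 0.1])
R  = np.array([[0.01]])
Qf = np.diag([10.0, 1.0])
N  = 500

# Backward Riccati recursion: value function matrices P_k and gains K_k.
P = Qf
K = [None] * N
for k in reversed(range(N)):
    K[k] = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K[k])

# Forward simulation with the time-varying linear policy u_k = -K_k x_k.
x = np.array([[1.0], [0.0]])
for k in range(N):
    x = A @ x + B @ (-K[k] @ x)
print("final state:", x.ravel())
```

The gains are computed backward in time, mirroring the backward character of the HJB and Riccati equations, and are then used in a forward pass.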
2.2 Pontryagin maximum principle

In this section we discuss Pontryagin's Maximum Principle (Pontryagin et al. 1962), (Stengel 1994), one of the most important principles in the history of optimal control theory. In our presentation of the Pontryagin minimum principle we are dealing with deterministic systems and, therefore, with the deterministic optimal control problem:

J(u,x) = φ(x_{t_N}) + ∫_{t_0}^{t_N} L(x,u) dt   (2.19)

subject to the dynamics dx/dt = F(x,u) = f(x) + G(x)u. The constraint is pushed into the cost function with a Lagrange multiplier (Stengel 1994). More precisely, the augmented cost function is expressed by the equation:

J_A(u,x) = φ(x_{t_N}) + ∫_{t_0}^{t_N} L(x,u) dt − ∫_{t_0}^{t_N} λ^T ( dx/dt − F(x,u) ) dt

or

J_A(u,x) = φ(x_{t_N}) + ∫_{t_0}^{t_N} ( L(x,u) − λ^T ( dx/dt − F(x,u) ) ) dt

By defining the Hamiltonian as H(x,u) = L(x,u) + λ^T F(x,u), the augmented cost function can be rewritten in the form:

J_A(u,x) = φ(x_{t_N}) + ∫_{t_0}^{t_N} ( H(x,u) − λ^T dx/dt ) dt

Integration by parts results in:

J_A(u,x) = φ(x_{t_N}) + ∫_{t_0}^{t_N} H(x,u) dt + ( λ(t_0)^T x(t_0) − λ(t_N)^T x(t_N) ) + ∫_{t_0}^{t_N} λ̇^T x dt   (2.20)

We now compute the variation of the augmented cost, δJ_A, which is expressed by the equation that follows:

δJ_A = ∇_x J_A^T δx + ∇_u J_A^T δu

Thus we will have that:

δJ_A = ∇_x φ^T δx |_{t=t_N} + ∇_x ( λ(t_0)^T x(t_0) − λ(t_N)^T x(t_N) )^T δx + ∫_{t_0}^{t_N} [ ∇_x H^T δx + ∇_x ( λ̇^T x(t) )^T δx + ∇_u H^T δu ] dt

By rearranging the terms, the equation above is formulated as follows:

δJ_A = ( ∇_x φ^T − λ(t_N)^T ) δx |_{t=t_N} + λ(t_0)^T δx(t_0) + ∫_{t_0}^{t_N} [ ( ∇_x H^T + λ̇^T ) δx + ∇_u H^T δu ] dt

or

δJ_A = δJ_A(t_0) + δJ_A(t_N) + δJ_A(t_0 → t_N)

For δJ_A = 0 we require that δJ_A(t_N) = δJ_A(t_0) = δJ_A(t_0 → t_N) = 0, and thus we will have that:

∇_u H = 0   (2.21)

and λ̇(t) = −∇_x H, which, since H(x,u) = L(x,u) + λ^T F(x,u), is formulated as follows:

λ̇(t) = −( ∇_x L(x,u) + (∇_x F(x,u))^T λ )   (2.22)

with the terminal boundary condition:

λ(t_N) = ∇_x φ(x) |_{t=t_N}   (2.23)

Equations (2.21), (2.22) and (2.23) are the so-called Euler-Lagrange equations. There are a few important observations based on the structure of the Euler-Lagrange equations. The Lagrange multiplier, or adjoint vector, λ represents the cost sensitivity to dynamic effects. This sensitivity is specified at the final time t_N by providing a boundary condition for the solution of λ. Another way of interpreting the role of the adjoint vector λ is that it quantifies the sensitivity of the cost to state perturbations on the optimal trajectory, beginning from the terminal state x(t_N) and propagating backward towards x(t_0). The idea of this backward propagation scheme is that, in order to decide which way to go, it helps to know the effects of future variations from the resulting path in state space. This knowledge is encoded in the adjoint vector λ: the optimal strategy results from tracing paths back from the destination and thereby looking into the future outcomes of possible variations.

The necessary and sufficient conditions for optimality of x*(t) and u*(t) in the interval [t_0, t_N], provided the dynamical system under consideration is normal and the optimal path contains no conjugate points, are expressed by the equations:

∇_u H( x*, u*, λ*, t ) = 0   (2.24)

∇_uu H( x*, u*, λ*, t ) ≥ 0   (2.25)

The Pontryagin minimum principle of optimality states that if the variables x*(t), λ*(t) are kept fixed, then for any admissible neighboring non-optimal control history u(t) in [t_0, t_N] we have that:

H* = H( x*(t), u*(t), λ*(t), t ) ≤ H( x*(t), u(t), λ*(t), t )   (2.26)

If H is stationary and convex, the minimum principle is satisfied; if it is the one but not the other, then the minimum principle is not satisfied. A stronger condition for the minimum principle is formulated as follows:

J(u* + δu) − J(u*) = ∫_{t_0}^{t_N} [ H( x*(t), u*(t)+δu, λ*(t), t ) − H( x*(t), u*(t), λ*(t), t ) ] dt ≥ 0   (2.27)-(2.28)
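The Euler-Lagrange equations (2.21)-(2.23) suggest a simple numerical scheme: integrate the state forward, integrate the adjoint λ backward from its terminal condition, and descend along ∇_u H. The sketch below is a minimal, assumed scalar linear-quadratic example (system, weights and step size are invented for illustration, not taken from the thesis) of such a forward-backward gradient iteration.

```python
import numpy as np

# Hypothetical scalar problem: dx/dt = a*x + b*u,
# cost = 0.5*qf*x(T)^2 + integral of 0.5*(q*x^2 + r*u^2) dt.
a, b, q, r, qf = -1.0, 1.0, 1.0, 0.1, 5.0
dt, N = 0.01, 300
x0, alpha = 2.0, 0.2          # initial state, gradient step size
u = np.zeros(N)               # initial control history

for it in range(300):
    # Forward pass: integrate the state equation with the current controls.
    x = np.zeros(N + 1); x[0] = x0
    for k in range(N):
        x[k + 1] = x[k] + dt * (a * x[k] + b * u[k])
    # Backward pass: integrate the adjoint equation lambda_dot = -dH/dx = -(q*x + a*lam),
    # from the terminal condition lambda(T) = d(phi)/dx = qf*x(T), cf. (2.22)-(2.23).
    lam = np.zeros(N + 1); lam[N] = qf * x[N]
    for k in reversed(range(N)):
        lam[k] = lam[k + 1] + dt * (q * x[k] + a * lam[k + 1])
    # Gradient step on the controls: dH/du = r*u + b*lambda is driven towards zero, cf. (2.21).
    u -= alpha * (r * u + b * lam[:N])

cost = 0.5 * qf * x[N]**2 + dt * np.sum(0.5 * (q * x[:N]**2 + r * u**2))
print("cost:", cost, "max |dH/du|:", np.abs(r * u + b * lam[:N]).max())
```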
2.3 Iterative optimal control algorithms

There is a variety of optimal control algorithms, depending on 1) the order of the expansion of the dynamics, 2) the order of the expansion of the cost function and 3) the existence of noise. More precisely, if the dynamics under consideration are linear in the state and the controls, deterministic, and the cost function is quadratic with respect to states and controls, we can use one of the most established tools in control theory: the Linear Quadratic Regulator (LQR) (Stengel 1994). For this type of optimal control problem the dynamics are formulated as f(x,u) = Ax + Bu, F(x,u) = 0, and the immediate cost is l(τ, x(τ), u(τ,x(τ))) = x^T Q x + u^T R u. Under the presence of stochastic dynamics, F(x,u) ≠ 0, the resulting algorithm is the Linear Quadratic Gaussian regulator (LQG).

For nonlinear deterministic dynamical systems, an expansion of the dynamics is performed and the optimal control problem is solved in an iterative fashion. Under a first order expansion of the dynamics and a second order expansion of the immediate cost function l(τ, x(τ), u(τ,x(τ))), the derived algorithm is called the Iterative Linear Quadratic Regulator (iLQR) (Li & Todorov 2004). A better approximation of the dynamics, up to second order, results in one of the most well known optimal control algorithms, especially in the area of robotics: Differential Dynamic Programming (DDP) (Jacobson & Mayne 1970). Both iLQR and DDP are iterative algorithms that start with an initial trajectory in states and controls (x̄, ū) and result in an optimal trajectory x*, an optimal open loop control command u*, and a sequence of control gains L which are activated whenever deviations from the optimal trajectory x* are observed. The difference between iLQR and DDP is that DDP provides a better approximation of the dynamics, but with the additional computational cost necessary to find the second order derivatives.

In cases where noise is present in the dynamics, either as multiplicative in the controls or the state, or both, we have the stochastic versions of iLQR and DDP: the Iterative Linear Quadratic Gaussian regulator (iLQG) (Todorov 2005) and Stochastic Differential Dynamic Programming (SDDP) (Theodorou 2010). Essentially, SDDP contains all the previous algorithms iLQR, iLQG and DDP as special cases, since it requires a second order expansion of the cost and the dynamics and it takes into account control and state dependent noise. This is computationally costly because second order derivatives have to be calculated. An important aspect of stochastic optimal control theory is that, in cases of additive noise, the optimal control u* and the optimal control gains L are both independent of the noise and, therefore, the same as in the corresponding deterministic solution. In cases where the noise is control or state dependent, the resulting solutions of iLQG and SDDP differ from the solutions of the deterministic versions iLQR and DDP. In Table 2.1 we provide the classification of the optimal control algorithms based on the expansion of dynamics and cost function as well as the existence of noise.

                    LQR   LQG   iLQR   iLQG   DDP   SDDP
Linear dynamics      x     x     -      -      -     -
Quadratic cost       x     x     -      -      -     -
FOE of dynamics      -     -     x      x      -     -
SOE of cost          -     -     x      x      x     x
SOE of dynamics      -     -     -      -      x     x
Deterministic        x     -     x      -      x     -
Stochastic           -     x     -      x      -     x

Table 2.1: Optimal control algorithms according to First Order Expansion (FOE) or Second Order Expansion (SOE) of dynamics and cost function and the existence of noise.
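All of the iterative algorithms in Table 2.1 share the same computational skeleton: roll out a nominal trajectory, expand the dynamics and cost around it, perform a backward pass on the resulting local model, and update the controls in a forward pass. The following minimal sketch illustrates that skeleton for a scalar system with a first-order (iLQR-style) dynamics expansion; the cubic dynamics and cost weights are hypothetical and chosen only for illustration, and this is not an implementation of the thesis's algorithms.

```python
import numpy as np

# Hypothetical scalar system x_{k+1} = x_k + dt*(-x_k^3 + u_k),
# cost sum_k (q*x_k^2 + r*u_k^2) + qf*x_N^2.
dt, N = 0.01, 200
q, r, qf = 1.0, 0.1, 10.0

def step(x, u):
    return x + dt * (-x**3 + u)

def rollout(x0, U):
    X = [x0]
    for u in U:
        X.append(step(X[-1], u))
    return np.array(X)

def ilqr_iteration(X, U):
    # Backward pass on first-order dynamics and quadratic cost expansions.
    Vx, Vxx = 2 * qf * X[-1], 2 * qf
    k_ff, K_fb = np.zeros(N), np.zeros(N)
    for i in reversed(range(N)):
        A = 1.0 + dt * (-3.0 * X[i]**2)      # d(step)/dx along the nominal trajectory
        B = dt                               # d(step)/du
        Qx, Qu = 2*q*X[i] + A*Vx, 2*r*U[i] + B*Vx
        Qxx, Quu, Qux = 2*q + A*Vxx*A, 2*r + B*Vxx*B, B*Vxx*A
        k_ff[i], K_fb[i] = -Qu / Quu, -Qux / Quu
        Vx  = Qx + K_fb[i]*Quu*k_ff[i] + K_fb[i]*Qu + Qux*k_ff[i]
        Vxx = Qxx + K_fb[i]*Quu*K_fb[i] + 2*K_fb[i]*Qux
    # Forward pass: apply the feedforward/feedback update around the nominal trajectory.
    Xn, Un = [X[0]], np.zeros(N)
    for i in range(N):
        Un[i] = U[i] + k_ff[i] + K_fb[i] * (Xn[-1] - X[i])
        Xn.append(step(Xn[-1], Un[i]))
    return np.array(Xn), Un

U = np.zeros(N)
X = rollout(1.0, U)
for _ in range(20):
    X, U = ilqr_iteration(X, U)
print("final state:", X[-1])
```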
2.3.1 Stochastic differential dynamic programming

We consider the class of nonlinear stochastic optimal control problems with cost

v^π(x,t) = E[ h(x(T)) + ∫_{t_0}^{T} l(τ, x(τ), π(τ, x(τ))) dτ ]   (2.29)

subject to stochastic dynamics of the form:

dx = f(x,u) dt + F(x,u) dw   (2.30)

where x ∈ R^{n×1} is the state, u ∈ R^{m×1} is the control and dw ∈ R^{p×1} is Brownian noise. The term h(x(T)) in the cost function (2.29) is the terminal cost, while l(τ, x(τ), π(τ, x(τ))) is the instantaneous cost rate, which is a function of the state x and the control policy π(τ, x(τ)). The cost-to-go v^π(x,t) is defined as the expected cost accumulated over the time horizon (t_0, ..., T), starting from the initial state x_t and ending at the final state x(T).

To enhance the readability of our derivations we write the dynamics as a function Φ ∈ R^{n×1} of the state, the control and the instantiation of the noise:

Φ(x,u,dw) ≡ f(x,u) dt + F(x,u) dw   (2.31)

It will sometimes be convenient to write the matrix F(x,u) ∈ R^{n×p} in terms of its rows or its columns:

F(x,u) = [ F^1_r(x,u); ... ; F^n_r(x,u) ] = [ F^1_c(x,u), ..., F^p_c(x,u) ]

Every element of the vector Φ(x,u,dw) ∈ R^{n×1} can now be expressed as:

Φ_j(x,u,dw) = f_j(x,u) dt + F^j_r(x,u) dw

Given a nominal trajectory of states and controls (x̄, ū), we expand the dynamics around this trajectory to second order:

Φ(x̄+δx, ū+δu, dw) = Φ(x̄, ū, dw) + ∇_x Φ · δx + ∇_u Φ · δu + O(δx, δu, dw)

where O(δx, δu, dw) ∈ R^{n×1} contains all the second order terms in the deviations in states, controls and noise (not to be confused with "big-O"). Writing this term element-wise, O(δx,δu,dw) = ( O^{(1)}(δx,δu,dw), ..., O^{(n)}(δx,δu,dw) )^T, we can express the elements O^{(j)}(δx,δu,dw) ∈ R as:

O^{(j)}(δx,δu,dw) = ½ (δx; δu)^T [ ∇_xx Φ^j  ∇_xu Φ^j ; ∇_ux Φ^j  ∇_uu Φ^j ] (δx; δu)   (2.32)

We would now like to express the derivatives of Φ in terms of the given quantities. Beginning with the first-order terms, we find that:

∇_x Φ = ∇_x f(x,u) δt + ∇_x ( Σ_{i=1}^{m} F^i_c dw^{(i)}_t )
∇_u Φ = ∇_u f(x,u) δt + ∇_u ( Σ_{i=1}^{m} F^i_c dw^{(i)}_t )

Next we find the second order derivatives:

∇_xx Φ^{(j)} = ∇_xx f^{(j)}(x,u) δt + ∇_xx ( F^{(j)}_r(x,u) dw_t )
∇_uu Φ^{(j)} = ∇_uu f^{(j)}(x,u) δt + ∇_uu ( F^{(j)}_r(x,u) dw_t )
∇_ux Φ^{(j)} = ∇_ux f^{(j)}(x,u) δt + ∇_ux ( F^{(j)}_r(x,u) dw_t )
∇_xu Φ^{(j)} = ( ∇_ux Φ^{(j)} )^T

After expanding the dynamics up to second order we can transition from continuous to discrete time. More precisely, the discrete-time dynamics are formulated as:

δx_{t+δt} = ( I_{n×n} + ∇_x f(x,u) δt + ∇_x ( Σ_{i=1}^{m} F^{(i)}_c ξ^{(i)}_t √δt ) ) δx_t + ( ∇_u f(x,u) δt + ∇_u ( Σ_{i=1}^{m} F^{(i)}_c ξ^{(i)}_t √δt ) ) δu_t + F(x,u) √δt ξ_t + O_d(δx, δu, ξ, δt)

with δt = t_{k+1} − t_k corresponding to a small discretization interval. Note that the term O_d is the equivalent of O but in discrete time, and therefore it is now a function of δt. In fact, since O_d contains all the second order expansion terms of the dynamics, it contains second order derivatives with respect to state and control, expressed as follows:

∇_xx Φ^{(j)} = ∇_xx f^{(j)}(x,u) δt + ∇_xx ( F^{(j)}_r(x,u) ξ_t ) √δt
∇_uu Φ^{(j)} = ∇_uu f^{(j)}(x,u) δt + ∇_uu ( F^{(j)}_r(x,u) ξ_t ) √δt
∇_ux Φ^{(j)} = ∇_ux f^{(j)}(x,u) δt + ∇_ux ( F^{(j)}_r(x,u) ξ_t ) √δt
∇_xu Φ^{(j)} = ( ∇_ux Φ^{(j)} )^T

The random variable ξ ∈ R^{p×1} is zero mean and Gaussian distributed with covariance Σ_ω = σ² I_{m×m}. The discretized dynamics can be written in a more compact form by grouping the state, control and noise dependent terms and leaving the second order term separate:

δx_{t+δt} = A_t δx_t + B_t δu_t + Γ_t ξ_t + O_d(δx, δu, ξ, δt)   (2.33)

where the matrices A_t ∈ R^{n×n}, B_t ∈ R^{n×m} and Γ_t ∈ R^{n×p} are defined as

A_t = I_{n×n} + ∇_x f(x,u) δt
B_t = ∇_u f(x,u) δt
Γ_t = [ Γ^{(1)}, Γ^{(2)}, ..., Γ^{(m)} ]

with Γ^{(i)} ∈ R^{n×1} defined as Γ^{(i)} = ∇_u F^{(i)}_c δu_t + ∇_x F^{(i)}_c δx_t + F^{(i)}_c. For the derivation of the optimal controls it is useful to express Γ_t as the sum of terms that depend on the variations in states and controls and terms that are independent of such variations. More precisely, we will have that:

Γ_t = Δ_t(δx, δu) + F(x,u)   (2.34)

where each column vector of Δ_t is defined as Δ^{(i)}_t(δx, δu) = ∇_u F^{(i)}_c δu_t + ∇_x F^{(i)}_c δx_t.
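A minimal sketch of how the discretization above is used in practice: simulate the stochastic dynamics with an Euler-Maruyama step, in which the noise enters with a √δt factor as in (2.33), and obtain the first-order terms A_t and B_t by differentiating the drift along a nominal trajectory. The scalar drift and diffusion functions below are hypothetical examples, not models from the thesis.

```python
import numpy as np

# Hypothetical scalar drift and state dependent diffusion for dx = f(x,u) dt + F(x,u) dw.
def f(x, u):
    return -x + u                 # drift

def F(x, u):
    return 0.1 * (1.0 + x**2)     # state dependent diffusion

dt, N, sigma = 0.01, 500, 1.0
rng = np.random.default_rng(0)

x = np.zeros(N + 1); x[0] = 1.0
u = np.zeros(N)                   # nominal open-loop controls
for k in range(N):
    xi = rng.normal(0.0, sigma)   # xi_k ~ N(0, sigma^2), one draw per time step
    x[k + 1] = x[k] + f(x[k], u[k]) * dt + F(x[k], u[k]) * np.sqrt(dt) * xi

# First-order terms of the expansion around the nominal trajectory, via finite differences:
# A_k = I + d(f)/dx * dt and B_k = d(f)/du * dt, evaluated at (x_k, u_k).
eps, k = 1e-5, 10
A_k = 1.0 + (f(x[k] + eps, u[k]) - f(x[k] - eps, u[k])) / (2 * eps) * dt
B_k = (f(x[k], u[k] + eps) - f(x[k], u[k] - eps)) / (2 * eps) * dt
print("A_k:", A_k, "B_k:", B_k)
```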
As we will prove in 30 the next section the expectation of δx T t+δt ∇ xx V T δx t+δt is also a quadratic function of variations in states and controls δx,δu. Expectation of the 2nd order term of the value function. In this section we compute all the terms that appear due to the expectation of the second approximation of the value function ! δx T t+δt ∇ xx Vδx t+δt # . The term δx t+δt is given by the stochastic dynamics in (2.38). Substitution of (2.38) results in 16 terms. To make our analysis clear we classify these 16 terms terms above into five classes. More precisely we will have that: ! δx T t+δt V T xx δx t+δt # =E 1 +E 2 +E 3 +E 4 +E 5 (2.47) where the termsE 1 ,E 2 ,E 3 ,E 4 andE 5 are defined as follows: E 1 = ! δx T t A T t V xx A t δx t # + ! δu T t B T t V xx B t δu t # + ! δx T t A T t V xx B t δu t # + ! δu T t B T t V xx A t δx t # E 2 = ! ξ T t Γ T t V xx A t δx # + ! ξ T t Γ T t V xx B t δu # + ! δx T A T t V xx Γ t ξ t # +E ! δu T B T t V xx Γ t ξ t # + ! ξ T t Γ T t V xx Γ t ξ t # E 3 = ! O T d V xx Γ t ξ t # + ! ξ T t Γ T t V xx O d # 31 E 4 = ! δx T t A T t V xx O d # + ! δu T t B T t V xx O d # + ! O T d V xx B t δu t # + ! O T d V xx A t δx t # E 5 = ! O T d V xx O d # In the first category we have all these terms that depend neither on ξ t and nor on O d (δx,δu,ξ t ,δt). These are the terms that defineE 1 . The second categoryE 2 includes terms that depend on ξ t but not on O d (δx,δu,ξ t ,δt). In the third classE 3 , there are terms that depends both on O d (δx,δu,ξ t ,δt) and ξ t . In the fourth classE 4 , we have terms that depend on O d (δx,δu,ξ t ,δt). Finally in the fifth classE 5 , we have all these terms that depend on O d (δx,δu,ξ t ,δt) quadratically. The expectation operator will cancel all the terms that include noise up the first order. Moreover, the mean operator for terms that depend on the noise quadratically will result in covariance. We compute the expectations of all the terms in theE 1 class. More precisely we will have that: ! δx T t A T t V xx A t δx t # = δx T t A T t V xx A t δx t ! δu T t B T t V xx B t δu t # = δu T t B T t V xx B t δu t ! δx T t A T t V xx B t δu t # = δx T t A T t V xx B t δu t ! δu T t B T t V xx A t δx t # = δu T t B T t V xx A t δx t (2.48) We continue our analysis by calculating all the terms in the classE 2 . More presicely we will have: 32 ! ξ T t Γ T t V xx A t δx # =0 ! ξ T t Γ T t V xx B t δu # =0 ! ξ T t Γ T t V xx A t δx # T =0 ! ξ T t Γ T t V xx B t δu # T =0 (2.49) The terms above are equal to zero since the brownian noise is zero mean. The ex- pectation of the term that does not depend on O d (δx,δu,ξ t ,δt) and it is quadratic with respect to the noise is given as follows: ! ξ T t Γ T t V xx Γ t ξ t # =tr A Γ T t V xx Γ t Σ ω B (2.50) Since matrix Γ depends on variations in states and controls δx,δu we can further massage the expressions above so that it can be expressed as quadratic functions in δx,δu . tr A Γ T t V xx Γ t Σ ω B = σ 2 dω δttr Γ (1)T ... ... Γ (m)T V xx $ Γ (1) ... ... 
Γ (m) % (2.51) 33 = σ 2 dω δt m > i=1 Γ (i)T V xx Γ (i) (2.52) The last equation is written in the form: tr A Γ T t V xx Γ t Σ ω B = δx T ˜ Fδx+2δx T ˜ Lδu+δu T ˜ Zδu +2δu T ˜ U+2δx T ˜ S +γ (2.53) Where the terms ˜ F∈" n×m , ˜ L∈" n×m , ˜ Z∈" m×m , ˜ U∈" m×1 , ˜ S∈" n×1 and γ∈" are defined as follows: ˜ F = σ 2 δt m > i=1 ∇ x F (i) c T V xx ∇ x F (i) c (2.54) ˜ L = σ 2 δt m > i=1 ∇ x F (i) c T V xx ∇ u F (i) c (2.55) ˜ Z = σ 2 δt m > i=1 ∇ u F (i) c T V xx ∇ u F (i) c (2.56) ˜ U = σ 2 δt m > i=1 ∇ u F (i) c T V xx F (i) c (2.57) ˜ S = σ 2 δt m > i=1 ∇ x F (i) c T V xx F (i) c (2.58) γ = σ 2 δt m > i=1 F (i) c T V xx F (i) c (2.59) For those terms that depend both on O d (δx,δu,ξ t ,δt) and on the noise classE 3 we will have: ! O T d V xx Γ t ξ t # = ! tr A V xx Γ t ξ t O T d B # =tr A V xx Γ t E A ξ t O T d BB (2.60) 34 By writing the term O d (δx,δu,ξ t ,δt) in a matrix form and putting the noise vector insight the this matrix we have: ! O T d V xx Γ t ξ t # =tr $ V xx Γ t E . ξ t O (1) ... ξ t O (n) /% (2.61) Calculation of the expectation above requires to find the terms ! √ δtξ t O (j) # more precisely we will have: ! √ δtξ t O (j) # = 1 2 ! √ δtξ t δx T Φ (i) xx δx # + 1 2 ! √ δtξ t δu T Φ (i) uu δu # + ! √ δtξ t δu T Φ (i) ux δx # (2.62) We first calculate the term: ! √ δtξ t δx T ∇ xx Φ (i) δx # = ! √ δtξ t δx T ? ∇ xx f (i) δt+∇ xx F (i) r ξ t √ δt @ δx # (2.63) = ! √ δtξ t δx T ? ∇ xx f (i) δt @ δx # + ! √ δtξ t δx T ? ∇ xx F (i) r ξ t √ δt @ δx # The term ! √ δtξ t δx T A ∇ xx f (i) δt B δx # = 0 since it depends linearly on the noise and ! ξ t # = 0. The second term ! √ δtξ t δx T ? ∇ xx F (i) r ξ t √ δt @ δx # depends quadratically in the noise and thus the expectation operator will result in the variance on the noise. We follow the analysis: ! √ δtξ t δx T ∇ xx Φ (i) δx # = ! √ δtξ t δx T ∇ xx ? F (i) r ξ t √ δt @ δx # (2.64) Since the ξ t = A ξ (1) ,...,ξ (m) B T and F (i) r = A F (i1) ,....,F (im) B we will have that: 35 ! √ δtξ t δx T ∇ xx Φ (i) δx # = ! δtξ t δx T ∇ xx m > j=1 F (ij) ξ (j) δx # (2.65) = ! δtξ t δx T m > j=1 ∇ xx ? F (ij) ξ (j) @ δx # = ! δtξ t δx T m > j=1 ξ (j) ∇ xx ? F (ij) @ δx # By writing ξ t in vector form we have that: ! √ δtξ t δx T ∇ xx Φ (i) δx # = ! δt ξ (1) ... ... ξ (m) δx T m > j=1 ξ (j) ∇ xx ? F (ij) @ δx # The term δx T ? C m j=1 ξ (j) ∇ xx A F (ij) B @ δx is scalar and it can multiply each one of the elements of the noise vector. δt ! ξ (1) δx T ? C m j=1 ξ (j) ∇ xx A F (ij) B @ δx # ... ... δt ! ξ (m) δx T ? C m j=1 ξ (j) ∇ xx A F (ij) B @ δx # (2.66) 36 Since ! ξ (i) ξ (i) # = σ 2 and ! ξ (i) ξ (j) # = 0 we can show that: ! √ δtξ t δx T ∇ xx Φ (i) δx # = σ 2 δt δx T ∇ xx F (i1) r δx ... ... δx T ∇ xx F (im) r δx (2.67) In a similar way we can show that: ! √ δtξ t δx T ∇ uu Φ (i) δx # = σ 2 δt δu T ∇ uu F (i1) r δu ... ... δu T ∇ uu F (im) r δu (2.68) and ! √ δtξ t δu T ∇ xu Φ (i) δx # = σ 2 δt δu T ∇ ux F (i1) r δx ... ... δu T ∇ ux F (im) r δx (2.69) Since we have calculated all the terms of expression (2.62) we can proceed with the computation of (2.60). According to the analysis above the term ! O T d V xx Γ t ξ t # can be written as follows: ! O T d V xx Γ t ξ t # =tr(V xx Γ t (M+N +G)) (2.70) 37 Where the matricesM∈" m×n ,N∈" m×n andG∈" m×n are defined as follows: M = σ 2 δt δx T ∇ xx F (11) r δx ... δx T ∇ xx F (1n) r δx ... ... ... δx T ∇ xx F (m1) r δx ... δx T ∇ xx F (mn) r δx (2.71) Similarly N = σ 2 δt δx T ∇ xu F (1,1) r δu ... δx T ∇ xu F (1,n) r δu ... ... ... δx T ∇ xu F (m,1) r δu ... 
δx T ∇ xu F (m,n) r δu (2.72) and G = σ 2 δt δu T ∇ uu F (1,1) r δu ... δu T ∇ uu F (1,n) r δu ... ... ... δu T ∇ uu F (m,1) r δu ... δu T ∇ uu F (m,n) r δu (2.73) Based on (2.34) the term Γ t depends on Δ which is a function of the variations in states and control up to the 1th order. In addition the matricesM,N andG are also functions of the deviations in state and controls up to the 2th order. The product ofΔ with each one of the matricesM,N andG will result into 3th order terms that can be neglected. By neglecting these terms we can show that: ! O T d V xx Γ t ξ t # =tr(V xx (Δ+F)(M+N +G)) (2.74) 38 =tr(V xx F (M+N +G)) Eachelement(i,j)oftheproductC =V xx F canbeexpressedasC (i,j) = C n r=1 V (i,r) xx F (r,j) whereC∈" n×p . Furthermore the element (µ,ν) of the productH =CM is formulated H (µ,ν) = C n k=1 C (µ,k) M (k,ν) withH∈" n×n . Thus, the term tr(V xx FM) can be now expressed as: tr(V xx FM)= n > '=1 H (',') (2.75) = n > '=1 m > k=1 C (',k) M (k,') = n > '=1 m > k=1 , n > r=1 V (k,r) xx F (r,') - M (k,') SinceM (k,') = δtσ 2 dω 1 δx T ∇ xx F (k,') δx the vectors δtσ 2 dω 1 δx T and δx do not depend on k,+,r and they can be taken outside the sum. Thus we can show that: tr(V xx FM)= n > '=1 m > k=1 ,, n > r=1 V (k,r) xx F (r,') - σ 2 δtδx T ∇ xx F (k,') δx - (2.76) = δx T σ 2 δt n > '=1 m > k=1 ,, n > r=1 V (k,r) xx F (r,') - ∇ xx F (k,') - δx = δx T ˜ Mδx 39 where ˜ M is a matrix of dimensionality ˜ M∈" n×n and it is defined as: ˜ M = σ 2 δt n > '=1 m > k=1 ,, n > r=1 V (k,r) xx F (r,') - ∇ xx F (k,') - (2.77) By following the same algebraic steps it can been shown that: tr(V xx FN)= δx T ˜ Nδu (2.78) with ˜ N matrix of dimensionality ˜ N∈" n×m defined as: ˜ N = σ 2 δt n > '=1 m > k=1 ,, n > r=1 V (k,r) xx F (r,') - ∇ xu F (k,') - (2.79) and tr(V xx FG)= δu T ˜ Gδu (2.80) with ˜ G matrix of dimensionality ˜ N∈" m×m defined as: ˜ G = σ 2 δt n > '=1 m > k=1 ,, n > r=1 V (k,r) xx F (r,') - ∇ uu F (k,') - (2.81) Thus the term ! O T d V xx Γ t ξ t # is formulated as: 40 ! O T d V xx Γ t ξ t # = 1 2 δx T ˜ Mδx+ 1 2 δu T ˜ Gδu+δx T ˜ Nδu (2.82) Similarly we can show that: ! ξ T t Γ T t V xx O d # = 1 2 δx T ˜ Mδx+ 1 2 δu T ˜ Gδu+δx T ˜ Nδu (2.83) Next we will find the expectation for all terms that depend onO d (δx,δu,dω,δt) and not on the noise. Consequently, we will have that: ! δx T t A T t V xx O d # = δx T t A T t V xx ˜ O d =0 ! δu T t B T t V xx O d # = δu T t B T t V xx ˜ O d =0 ! O T d V xx A t δx t # = ˜ O d T V xx A t δx t =0 ! O T d V xx B t δu t # = ˜ O d T V xx B t δu t =0 (2.84) where the quantity ˜ O d has been defined in (2.42). All the 4 terms above are equal to zero since they have variations in state and control of the order higher than 2 and therefore they can be neglected. Finally we compute the terms of the 5th class and therefore we have the expression E 5 = ! O T d V xx O d # = ! tr A V xx O d O T d B # =tr $ V xx ! O d O T d #% (2.85) 41 = V xx ! O (1) ... O (n) O (1) ... O (n) T # The product O (i) O (j) is a function of variation in state and control of order 4 since each term O (i) is a function of variation in states and control of order 2. Consequently, the termE 5 =E A O T d V xx O d B is equal to zero. With the computation of the expectation of term that is quadraticwrt O d we have calculated all the terms of the second order expansion of the cost to go function. In the next section we derive the optimal controls and we present the SDDP algorithm. 
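Before moving to the optimal controls, the following Python fragment makes the bookkeeping above concrete by assembling the noise-induced correction terms F̃, L̃, Z̃, Ũ, S̃ and γ of (2.54)-(2.59) from the column-wise diffusion Jacobians. It is only a minimal numerical sketch under the assumption Σω = σ²I used in this section; the function and variable names are illustrative and are not part of the derivation or of the algorithms presented later.

import numpy as np

def noise_correction_terms(Fc, Fc_x, Fc_u, Vxx, sigma, dt):
    # Fc   : list of p diffusion columns F_c^(i), each of shape (n,)
    # Fc_x : list of p Jacobians dF_c^(i)/dx, each of shape (n, n)
    # Fc_u : list of p Jacobians dF_c^(i)/du, each of shape (n, m)
    # Vxx  : Hessian of the value function at the next time step, shape (n, n)
    n = Vxx.shape[0]
    m = Fc_u[0].shape[1]
    F_t, L_t, Z_t = np.zeros((n, n)), np.zeros((n, m)), np.zeros((m, m))
    U_t, S_t, gamma = np.zeros(m), np.zeros(n), 0.0
    c = sigma ** 2 * dt
    for fc, fx, fu in zip(Fc, Fc_x, Fc_u):
        F_t += c * fx.T @ Vxx @ fx       # (2.54)
        L_t += c * fx.T @ Vxx @ fu       # (2.55)
        Z_t += c * fu.T @ Vxx @ fu       # (2.56)
        U_t += c * fu.T @ Vxx @ fc       # (2.57)
        S_t += c * fx.T @ Vxx @ fc       # (2.58)
        gamma += c * fc @ Vxx @ fc       # (2.59)
    return F_t, L_t, Z_t, U_t, S_t, gamma

For additive noise the Jacobians Fc_x and Fc_u are zero and all terms except γ vanish, which is consistent with the discussion of the special cases that follows.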
Furthermore we show how SDDP recover the deterministic solution as well as the cases of only control multiplicative, only state multiplicative and only additive noise. 2.3.1.2 Optimal controls In this section we provide the form of the optimal controls and we show how previous results are special cases of our generalized stochastic DDP formulation. Furthermore after we computed all the terms of expansion of the cost to go function V (x t ) at statex t 42 we show that its form remains quadraticwrt variations in state δx t under the constraint of the nonlinear stochastic dynamics in (2.30). More precisely we have that: V(¯ x t+δt +δx t+δt )=V(¯ x t+δt )+∇ x V T A t δx t +∇ x V T B t δu t + 1 2 δx T Fδx+ 1 2 δu T Zδu+δu T Lδx+ 1 2 δx T t A T t V xx A t δx t + 1 2 δu T t B T t V xx B t δu t + 1 2 δx T t A T t V xx B t δu t + 1 2 δu T t B T t V xx A t δx t + 1 2 δx T ˜ Fδx+δx T ˜ Lδu+ 1 2 δu T ˜ Zδu +δu T ˜ U +δx T ˜ S + 1 2 γ + 1 2 δx T ˜ Mδx+ 1 2 δu T ˜ Gδu+δx T ˜ Nδu (2.86) The unmaximized state, action value function is defined as follows: Q(x k ,u k )= +(x k ,u k )+V(x k+1 ) (2.87) Given a trajectory in states and controls ¯ x,¯ u we can approximate the state action value function as follows: Q(¯ x+δx,¯ u+δu)=Q 0 +δu T Q u +δx T Q x + 1 2 . δx T δu T / Q xx Q xu Q ux Q uu δx δu (2.88) 43 By equating the coefficients with similar powers between the state action value func- tion Q(x k ,u k ) and the immediate reward and cost to go +(x k ,u k ) and V(x k+1 ) respec- tively we can show that: Q x = + x +A t V x + ˜ S Q u = + u +A t V x + ˜ U Q xx = + xx +A T t V xx A t +F + ˜ F + ˜ M Q xu = + xu +A T t V xu B t +L+ ˜ L+ ˜ N Q uu = + uu +B T t V uu B t +Z + ˜ Z + ˜ G (2.89) where we have assumed a local quadratic approximation of the immediate reward +(x k ,u k ) according to the equation: +(¯ x+δx,¯ u+δu)= + 0 +δu T + u +δx T + x + 1 2 . δx T δu T / + xx + xu + ux + uu δx δu (2.90) with + x = ∂' ∂x , + u = ∂' ∂u , + xx = ∂ 2 ' ∂x 2 , + uu = ∂ 2 ' ∂u 2 and + ux = ∂ 2 ' ∂u∂x . The local variations in control δu ∗ that maximize the state action value function are expressed by the equation that follows: δu ∗ = argmax u Q(¯ x+δx,¯ u+δu)=−Q −1 uu (Q u +Q ux δx) (2.91) The optimal control variations have the form δu ∗ = l+Lδx where l =−Q −1 uu Q u is the open loop control or feedforward command and L =−Q −1 uu Q ux is the closed loop - feedback gain. All the terms in (2.92) are functions of the gradient of the value function 44 V x and the Hessian V x,x . This quantities are backward propagated from the terminal boundary conditions as follows: Ifthenoiseisonlycontroldependedthen ˜ M, ˜ N, ˜ L, ˜ F, ˜ S willbezerosince∇ xx F(u)= 0,∇ xu F(u) = 0 and∇ x F (i) c (x) = 0 while if it is state dependent then ˜ N, ˜ G, ˜ Z, ˜ L, ˜ U will be zero since∇ xu F(x)=0,∇ uu F(x) = 0 and∇ u F (i) c (x) = 0. In the next two sub-sections we show that differential dynamic programming (DDP) and iterative linear quadratic regulators are special cases of the stochastic differential dynamic programming. 2.3.2 Differential dynamic programming There are two cases in which we can recover the DDP equations. In particular, for the special case where the stochastic dynamics have only additive noiseF(u,x)=F then the terms ˜ M, ˜ N, ˜ G, ˜ F, ˜ L, ˜ Z, ˜ U, ˜ S will be zero since they are functions of ∇ xx F and ∇ xu F and ∇ uu F and it holds that ∇ xx F = 0, ∇ xu F = 0 and ∇ uu F = 0. In systems with additive noire the control does not depend on the statistical characteristics of the noise. 
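To summarize the backward recursion of this subsection in computational form, the sketch below evaluates the Q-coefficients of (2.89) for a single time step and extracts the feedforward and feedback terms of the local control law (2.91). It is a minimal Python illustration, not the implementation used in this thesis; the dictionary of correction terms, the regularization added to Quu and all variable names are assumptions made for the example. Setting every correction term to zero reproduces the classical DDP coefficients discussed next.

import numpy as np

def sddp_backward_step(lx, lu, lxx, luu, lux, A, B, Vx, Vxx, corr=None, reg=1e-6):
    # lx, lu, lxx, luu, lux : derivatives of the (discretized) immediate cost
    # A, B                  : linearized discrete-time dynamics matrices
    # Vx, Vxx               : value function gradient / Hessian at the next step
    # corr                  : optional dict with the second-order and noise terms
    #                         {'F','Z','L','Ftil','Ltil','Ztil','Util','Stil',
    #                          'Mtil','Ntil','Gtil'}; all zero recovers DDP
    n, m = B.shape
    c = corr or {}
    g = lambda key, *shape: c.get(key, np.zeros(shape))

    Qx  = lx + A.T @ Vx + g('Stil', n)
    Qu  = lu + B.T @ Vx + g('Util', m)
    Qxx = lxx + A.T @ Vxx @ A + g('F', n, n) + g('Ftil', n, n) + g('Mtil', n, n)
    Qux = lux + B.T @ Vxx @ A + g('L', m, n) + g('Ltil', n, m).T + g('Ntil', n, m).T
    Quu = luu + B.T @ Vxx @ B + g('Z', m, m) + g('Ztil', m, m) + g('Gtil', m, m)

    Quu_reg = Quu + reg * np.eye(m)          # keep Quu invertible
    k_ff = -np.linalg.solve(Quu_reg, Qu)     # open-loop (feedforward) term l
    K_fb = -np.linalg.solve(Quu_reg, Qux)    # closed-loop feedback gain L
    return (Qx, Qu, Qxx, Qux, Quu), k_ff, K_fb

In a full iteration, the gradient and Hessian of the value function would then be propagated backwards from the terminal condition and the nominal trajectory updated with the new local control law, as described in the text.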
In addition, for the case of deterministic systems the terms ˜ M, ˜ N, ˜ G, ˜ F, ˜ L, ˜ Z, ˜ U, ˜ S will be zero because these terms depend on the variance of the noise σ dω i =0, ∀i = 1,...,m. Clearly in both of the cases above the resulting algorithm corresponds to DDP in which the equation are formulated as follows: 45 Q x = + x +A t V x Q u = + u +A t V x Q xx = + xx +A T t V xx A t Q xu = + xu +A T t V xu B t Q uu = + uu +B T t V uu B t (2.92) 2.4 Risk sensitivity and differential game theory The relation of risk sensitivity and differential game theory was first studied for the case on linear dynamics and quadratic cost function in (Jacobson 1973) When the case of im- perfect state measurement isconsidered, therelation between thetwoframeworks was in- vestigatedin(Whittle1991),(Whittle1990),(Basar1991)and(Runolfsson1994). Another researchdirectiononrisksensitivityanddifferentialgametheoryconsidersthecaseofnon- linear stochastic dynamics and markov processes (James, Baras & Elliot 1994),(Fleming & Soner 2006). In addition to the theoretical developments, applications of risk sensi- tivity and differential game theoretic approaches to reinforcement learning showed the robustness of the resulting control policies against disturbances and uncertainty in highly nonlinearsystems(Morimoto&Atkeson2002),(Morimoto&Doya2005). Oneofthemain issueswiththeserisksensitiveRLapproachesistheirpoorscalabilitytohighdimensional dynamical systems. 46 2.4.1 Stochastic differential games The ultimate goal in stochastic optimal control framework (Stengel 1994), (Basar & Berhard 1995), (Fleming & Soner 2006) is to control a dynamical system while minimiz- ing a performance criterion. For the case on nonlinear stochastic systems the stochastic optimal control problem is expressed as the minimization of a cost function. However when disturbances are present then the stochastic optimal control problem can be for- mulated as a differential game with two opponents that is formulated as: min u max v J(x,u) = min u max v ! φ(x t N )+ " t N t 0 L(x,u,v)dt # (2.93) withL(x,u,v)=q(x)+ 1 2 u T Ru−γv T vandunderthestochasticnonlineardynamical constrains: dx=(f(x)+G(x)u)+ D $ γ C(x)L(vdt+dw) (2.94) or dx =F(x,u,v)dt+C(x)Ldw (2.95) with F(x,u,v)= f(x)+G(x)u+ E ( γ C(x)Lv and L is a state independent matrix defined as LL T =Σ ! . Essentially there are two controllers, u∈" m×1 the stabilizing controller and v∈" p×1 the destabilizing one while x∈" n×1 is the state and dw is brownian noise. The parameters $,γ are positive. The stabilizing controller minimizes the cost function while the stabilizing one intents to maximize it. The value function is 47 defined as the optimal of the cost function J(x,u,v) and therefore is a function only of the state: V(x) = min u max v J(x,u,v) = min u max v J(x,u ∗ ,v ∗ ) (2.96) The stochastic Isaacs HJB equation associated with this stochastic optimal control problem is expressed as follows: −∂ t V = min u max v $ L+(∇ x V) T F+ 1 2 tr A (∇ xx V)CΣ ! C T B % (2.97) SincethethelefthandsideoftheHJBisconvexwithrespecttocontroluandconcave with respect to v the min and max are exact lead to the optimal controls: u ∗ (x)=−R −1 G(x) T (∇ x V) (2.98) and optimal destabilizing controller: v ∗ (x)= 1 γ C(x) T (∇ x V) (2.99) Substitution of the optimal stabilizing and destabilizing control to the HJB results into the nonlinear second order PDE: −∂ t V =q+(∇ x V) T f t − 1 2 (∇ x V) T GR −1 G T (∇ x V)− 1 2γ (∇ x V) T BB T (∇ x V) (2.100) + $ 2γ tr A (∇ xx V)CΣ ! 
C T B 48 The PDE above can be written in the form: −∂ t V =q+(∇ x V) T f− 1 2 (∇ x V) T H(x)(∇ x V)+ $ 2γ tr A (∇ xx V)CΣ ! C T B (2.101) where the introduced termH(x) is defined as H(x)=G(x)R −1 G(x) T − 1 γ C(x)Σ ! C(x) T (2.102) From (2.101) and (2.102) we see that for γ→∞ the Isaacs HJB is reduced to the HJB. In the next section we show under which conditions the nonlinear and second order PDE can be transformed to a linear PDE. Linear PDEs are easier to be solve via the application of the Feynman- Kac lemmas which provides a probabilistic representations of the solution of these PDEs. In the next session after transforming the PDE into a linear, we provide the Feynman Kac lemma. 2.4.2 Risk sensitive optimal control We consider the optimal control problem (Basar & Berhard 1995) where the state dy- namics are described by the Ito stochastic differential differential equation: dx=(f(x)+G(x)u)dt+ D $ γ ˜ C(x)Ldw (2.103) 49 where w(t),t> 0 is an n-dimensional Wiener Precess, γ is a positive parameter and u t ∈ U is the control and Σ ! = LL T . The objective is to find the control law that minimizes the performance criterion: J(x,u)= $log ! exp 1 $ $ φ(t N )+ " t N t 0 L(x,u)dt %# (2.104) For our analysis we need the following conditions: i) Functionsf(x),G(x)andL(x,u)arecontinuouslydifferentiablein(t,x,u)∈ [0,t f ]× " n ×U, φ is twice differentiable in x∈" n and φ andL are nonnegative. ii) C(x) is continuously differentiable in (t,x)∈ [0,t f ]×" n ×U and C(x)C(x) T > 0. iii) F(x,u),∇ x F,L(x,u),L x ,φ(x(t f )),∇ x φ are bounded on [0,t f ]×" n ×U. iv) U is closed and bounded subset of" m Let us assume that: V(x,t) = inf u J(x,u)= $logΦ(x,t) (2.105) whereΦ(t,x) is the value function that corresponds to the cost function: Φ(t,x) = inf u ! exp 1 $ $ φ(t N )+ " t N t 0 L(x,u)dt %# (2.106) or Φ(t,x)= ! exp 1 $ $ φ(t N )+ " t N t 0 L(x ∗ ,u ∗ )dt %# (2.107) 50 where x ∗ ,u ∗ is the optimal state and control trajectory. The total derivative∀t =t 0 is given by: dΦ dt =− 1 $ L(x ∗ ,u ∗ )Φ(t,x) (2.108) and thus we have that: dΦ =− 1 $ L(x ∗ ,u ∗ )Φ(t,x)dt (2.109) By using the Ito differentiation rule we will have that: dΦ = A ∂ t Φ+(∇ x Φ) T F B dt+ $ 2γ tr ? ∇Φ xx ˜ CΣ ! ˜ C T @ (2.110) By equating the two last equation above we will the resulting PDE expressed as follows: −∂ t Φ = (∇ x Φ) T F+ $ 2γ tr ? ∇ xx Φ ˜ CΣ ! ˜ C T @ + 1 $ LΦ (2.111) In this PDE F =F(x ∗ ,u ∗ ) andL =L(x ∗ ,u ∗ ). The PDE above can be also written as follows: 0 = inf u∈U , ∂ t Φ+(∇ x Φ) T F+ $ 2γ tr ? ∇ xx Φ ˜ CΣ ! ˜ C T @ + 1 $ LΨ - or in form: ∂ t Φ t = inf u∈U , (∇ x Φ) T F+ $ 2γ tr ? ∇ xx Φ ˜ CΣ ! ˜ C T @ + 1 $ LΦ - 51 with the boundary condition Φ(x,t) = exp A 1 ( φ(x(t N )) B and F = F(x ∗ ,u) andL = L(x ∗ ,u). This is the Hamilton Jacobi Bellman equation for the case of the risk sensitive stochastic optimal control problem. Since CC T > 0 the PDE above is a uniformly parabolic PDE. Moreover under the conditions 1,2,3,4 the second order PDE has unique boundedpositivesolution. ThevaluefunctionV(x,t)isrelatedtoΦ(x,t)through(2.105) and therefore V(x,t) is smooth and satisfies the uniformly parabolic PDE: ∂ t V t = inf u∈U , (∇ x V) T F+L+ $ 2γ (∇ x V) T ˜ C ˜ C T (∇ x V)+ $ 2γ tr ? ∇ xx Ψ ˜ CΣ ! ˜ C T @ - (2.112) with the boundary condition V(x(t N )= φ(x(t N )). 
To obtain the equation above we make use of the equalities: 1 $ Φ (∂ t V)= ∂ t Φ (2.113) 1 $ Φ (∇ x V)=∇ x Φ (2.114) ∇ xx Φ = 1 $ (∇ x V)(∇ x V) T + 1 $ (∇ xx V)Φ (2.115) The optimal control law can be found explicitly and thus is given by: u =−R −1 G(x)(∇ x V) (2.116) Substitution of the optimal control back to the parabolic PDE results in Hamilton Jacobi Bellman equation. 52 −∂ t V t =q+(∇ x V) T f− 1 2 (∇ x V) T GR −1 G T (∇ x V) + 1 2γ (∇ x V) T ˜ CΣ ! ˜ C T (∇ x V)+ $ 2γ tr ? ∇ xx Ψ ˜ CΣ ! ˜ C T @ We write the equation above in a more compact form: −∂ t V t =q+(∇ x V) T f− 1 2 (∇ x V) T M(x)(∇ x V)+ $ 2γ tr ? ∇ xx Ψ ˜ CΣ ! ˜ C T @ (2.117) where the termM(x) is defined as: M(x)=G(x)R −1 G(x) T − 1 γ ˜ C(x)Σ ! ˜ C(x) T (2.118) The PDE above is equivalent the stochastic Isaacs HJB in (2.101) if ˜ C(x)Σ ! ˜ C(x) T = C(x)Σ ! C(x) T . Thus the following theorem as stated in (Basar & Berhard 1995) with someslightmodifications 2 establishestheequivalencebetweenstochasticdifferentialgames and Risk sensitivity: Theorem: The stochastic differential game expressed by (2.120) and (2.94) is equivalent under the conditions 1,2,3, and 4 with the risk sensitive stochastic optimal control problem defined by (2.103) and (2.104) in the sense that the former admits a game value function with continuously differentiable in t and twice differentiable in x if 2 In(Basar&Berhard1995)pp183thecorrespondingtheoremisstatedforthecasewhereC(x)= ˜ C(x) while in the present form there is the assumption ˜ C(x)Σ! ˜ C(x) T =C(x)Σ!C(x) T . 53 and only of the later admits an optimum value function with same features. Furthermore the optimal control and value functions are identical and they are specified by: u ∗ (x)=−R −1 G T ∇ x V (2.119) where −∂ t V t =q+(∇ x V) T f− 1 2 (∇ x V) T GR −1 G T (∇ x V) + 1 2γ (∇ x V) ˜ CΣ ! ˜ C T (∇ x V)+ $ 2γ tr ? ∇ xx Ψ ˜ CΣ ! ˜ C T @ with boundary condition V t N = φ t N , iff the following conditions holds ˜ C(x)Σ ! ˜ C(x) T = C(x)Σ ! C(x) T . The parameters γ,λ> 0 and Σ ! defined as Σ ! =LL T . Theuseofthetwoparameters $and γ intheanalysisabovemayseemabitconfusing. As a first observation, $ and γ are tuning parameters in the cost function and therefore it does not make sense to multiply the process noise in the stochastic dynamics since in most control applications the stochastic dynamics are given and their uncertainty is not a matter of manipulation and user tuning. To resolve the confusion we consider the special case where $ = γ which is the most studied in the control literature. In this case the parameters $,γ drop from the stochastic dynamics and they appear only in the cost functions. When $)= γ this is a generalization sincewecannowaskanadditionalquestion: Given the cost functions in risk sensitive and differential game optimal control problems and the difference between the risk parameter $ and the distrurbance weight γ what is the form of the stochastic dynamics for which 54 these two problems are equivalent. Clearly for the dynamics (2.94) and (2.103) the two stochastic optimal control problems are equivalent. Due to this generalization we keep $,γ for path integral stochastic differential games and path integral risk sensitivity. Inthenextsectionwederivetherisksensitivepathintegralcontrolandweshowunder which conditions it is equivalent with the path integral stochastic differential games. 2.5 Informationtheoreticinterpretationsofoptimalcontrol One of the first information theoretic interpretations of stochastic optimal control is the work by (Saridis 1996). 
In this work, an alternative formulation of stochastic optimal control is proposed which relates the minimization of the performance index in optimal control with the concept of Shannon Differential Entropy. Moreover, the entropy formu- lation is not only applied to provide interpretations of the optimal control problem but it is also used to generated alternative views for the frameworks of stochastic estimation and adaptive control in a unified way. In this section, we are going to restrict our analysis to the case of optimal control problem and its entropy interpretation. Moreprecisely, westartouranalysiswiththe”traditional”formulationoftheoptimal control problem which consists of a cost function under minimization of the form: J(u,x)= φ(x t N )+ " t N t 0 L(x,u)dt (2.120) 55 subject to the stochastic dynamics: dx=(f(x)+G(x)u)dt+B(x)dω. We define the differential entropy: H $ u(x,t),p(u),x(t 0 ) % =− " Ωx 0 " Ωx p $ u,x(t 0 ) % logp $ u,x(t 0 ) % dx dx 0 (2.121) where p $ u,x(t 0 ) % is the probability of selecting u while x 0 is the initial state and Ω x ,Ω x 0 the spaces of the states and the initial conditions. Next, we are looking for the probability distribution which best represents the random variableu. The answer to this requestisgivenbyJayne’smaximumentropyprinciplewhichstatesthatthebestdistribu- tion is the one that maximizes the entropy formulation above. This maximization proce- dureissubjectedtotheconstrainsthatE $ J(u,x) % =K andalso F p $ u,x(t 0 ) % dx 0 = 1. As stated in (Saridis 1996), this problem is more general than the optimal control since the parameter K is fixed and unknown and it depends on the selection of the controls u(x,t). The unconstrained maximization problem is now formulated as follows: Υ =β H $ u(x,t),p(u),x(t 0 ) % −γ $ E $ J(u,x) % −K % −α $" p $ u,x(t 0 ) % dx 0 −1 % ∝− " $ β p $ u,x(t 0 ) % logp $ u,x(t 0 ) % +γ p $ u,x(t 0 ) % J(u,x) % dx −α $" p(u,x(t 0 ))dx 0 −1 % (2.122) The objective function above is concave with respect to the probability distribution since the second derivative ∂Υ ∂p =−β 1 p < 0. Thus to find the maximum we take the first 56 derivative of the objective function with respect to the distribution p(u) and equal to zero. More precisely we have: −βlogp(u)−β−γJ(u,x)−α = 0 (2.123) The worst case distribution and therefore the one which maximizes the differential entropy H $ u(x,t),p(u),x(t 0 ) % is expressed as follows: p(u)= exp $ − γ β J(u,x) % exp $ − β β+α % (2.124) by assuming that 1 λ = γ β and exp $ − β β+α % = F exp $ − 1 λ J(u,x) % dx we will have the final result: p(u)= exp $ − 1 λ J(u(x,t),x) % F exp $ − 1 λ J(u(x,t),x) % dx (2.125) Substitution of the worst distribution results in the maximum differential entropy expressed by the equation: H $ u(x,t),p(u),x(t 0 ) % = ζ + 1 λ E $ J(u(x,t),x) % (2.126) where ζ = β+α β . Giventheformoftheprobabilityp(u)totaltimederivativeexpressed as follows: 57 dp(u) dt = d dt exp $ − 1 λ J(u,x) % exp $ − β β+α % =− 1 λ L(u,x) p(u) (2.127) At the same time we know that dp(u) dt = ∂p(u) ∂t + ∂p ∂x T ˙ x. By equating the two equations we will have: ∂p(u) ∂t +∇ x p T ˙ x+ 1 λ L(u,x) p(u) = 0 (2.128) We now consider the following properties: ∇ x p = 1 λ ∇ x J(x,u) p(u), and ∂p ∂t = 1 λ ∂J ∂t p(u) (2.129) Substitution of the equation above results in the following PDE: $ ∂J(u) ∂t +∇ x J(x,u) T f(x,u,t)+L(u,x) % 1 λ p(u) = 0 (2.130) which is the generalized HJB equation. 
By assuming that ∀p(u) > 0 the equation above yields: ∂J(u) ∂t +∇ x J(x,u) T f(x,u,t)+L(u,x) = 0 (2.131) The minimization of the equation yields the optimal control that minimizes the dif- ferential entropy H $ u(x,t),p(u),x(t 0 ) % . More precisely we will have that − ∂J(u) ∂t = min u $ ∇ x J(x,u) T f(x,u,t)+L(u,x) % (2.132) 58 2.6 Discussion In this chapter we have presented basic concepts and principles in the theory of optimal control starting from the Bellman principle of optimality and the Pontryagin maximum principle. We discussed a class of model based iterative optimal control approaches by deriving the Stochastic Differential Dynamic programming (SDDP) algorithm for non- linear systems with state and control multiplicative noise. The connection between risk sensitive and differential game theoretic optimal control problems was illustrated. In the previous section we presented information theoretic interpretations of optimal control problem. The next chapter introduces fundamental mathematical concepts in physics and con- troltheory. ThesemathematicalconceptsincludePDEs, andSDEsandthepathintegral. Besides the presentation of each one of these concepts, emphasis is also given their con- nection. 59 Chapter 3 Path Integrals, Feynman Kac Lemmas and their connection to PDEs The goal in this chapter is to introduce important, for the mathematical developments of this work, concepts in the area of PDEs and their connection to path integral formalisms and SDEs. Essentially we can think about the 3 mathematical formalisms of PDEs, Path integrals and SDEs as different mathematical representations of the same underlying physical processes. But, why are there more that one mathematical representation of the same phenomenon? The reason is because these mathematical structures offer represen- tations on a macroscopic or microscopic level. In fact, PDEs provide a macroscopic view while SDEs and path integrals formalisms offer a more microscopic view of an underlying physical process. Among other sciences and engineering fields, the aforementioned mathematical tools are also used in the physics and control theoretic communities. These communities are dealingwithdifferentproblems. Asanexample,whileinphysicsitisimportanttopredict the position of a particle under a magnetic field, in control theory the question is how to construct a magnetic filed such that the particle has a desired behavior. Clearly in both 60 cases one could use PDEs, on the one hand, to predict the outcome of the force field, on the other hand, to find the control policy which when applied meets the desired behavior. So, both communities are using PDEs but for different purposes. This fact results also in different terminology. What is called a ”force field”, for physics, it can be renamed as ”control policy” in control theory. The observations above are not necessarily objective, but they are very much related to our experiences as we were trying to understand and bring together concepts from physics and control theory. With this background in our mind, in this chapter our goal is to bring together concepts from physics and control theory with emphasis on the connection between PDEs, SDEs and Path Integrals. More precisely, section 3.1 is a short journey in the world of quantum mechanics and the work on Path Integrals by one of the most brilliant intellectuals in the history of sciences, Dr. Richard Feynman. By no means, this section is not a review his work. 
This section just aims to show that the core concepts of this thesis, which is the Path Integral, has its historical origins in the work by Richard Feynman. Insections3.2and3.4wehighlighttheconvectionbetweentheforwardFokkerPlanck PDEs and the underlying SDE for both the Itˆ o and the Stratonovich calculus. With the goal to establish the connection between the path integral formalism and SDEs, in section 3.3, we derive the path integral for the general stochastic integration scheme for 1-dimensional SDE and then we specialize for the cases of Itˆ o and the Stratonovich calculus. In section 3.4 the derivation of the Itˆ o path integral for multi-dimensional SDEs is presented. The last two sections are aiming to show how forwards PDEs are connected to Path Integrals and SDEs. 61 Forward PDEs such as the Fokker Planck equation or the Chapman- Kolmogorov PDE in its forward form, are typically used in estimation problems. In particular in the case where stochasticity is considered only for the state space dynamics, the Fokker Planck PDE is the appropriate mathematical description, while in cases in which there is also measurement uncertainty (partial observability), the forward Chapman-Kolmogorov equation is the corresponding PDE. However, in an optimal control setting, the PDEs are usually backward and since they are related to the concept of the value function and the Bellman principle of optimality. Is there any connection between these backward PDEs, SDEs and the corresponding Path Integrals formalisms? The answer is that this connection exists and it is established via the Feynman- Kac lemma in section (3.5). The Feynman Kac lemma is of great importance because is provides a way to probabilistically representsolutionofPDEs. Insection(3.5)weprovidethefullproofoftheFeynman-Kac lemma, which in its complete form, is rarely found in the literature. In section 3.6 we discuss special cases of the Feynman-Kac lemma. Besides the connection between the forward and backward PDEs, SDE and Path in- tegrals we also discuss how the forward and backward Chapman-Kolmogorov equations are connected in 3 different levels which are: i) through the mathematical concept of fun- damental solutions of PDEs, ii) via a slightly modified version the proof of the Feynman -Kaclemmaandiii)throughtheGeneralizedDualitybetweentheoptimalestimationand control problems. All these issues are addressed in sections 3.7, 3.8 and 3.9. In the last section we conclude and prepare the discussion for the next chapter. 62 3.1 Path integrals and quantum mechanics Since the mathematical construction of the path integral plays a central role in this work, it would have been a severe gap if this work did not include an introduction to path integrals and their use for the mathematical representation of quantum phenomena in physics. Therefore, in the next two subsections, we discuss the concept of least action in classical mechanics and its generalization to quantum mechanics via the use of the path integral. Moreover, we provide the connection between the path integral and the Schr˝ odinger equation, one of the most important equations in quantum physics. The Schr˝ odinger equation was discovered in 1925 by the physicist and theoretical biologist, Erwin Rudolf Josef Alexander Schr˝ odinger (Nobel-Lectures 1965). 
The initial idea of the path integral goes back Paul Adrien Morice Dirac, a theoretical physicist who together with Schr˝ odinger was awarded the Nobel Prize in Physics in 1933 for their work on discovery of the new productive forms of atomic theory. Richard Phillips Feynman (Nobel-Lectures 1972), also a theoretical physicist and Nobel price winner in 1965 for his work on quantum electrodynamics, completed the theory of path integral in 1948. 3.1.1 The principle of least action in classical mechanics and the quantum mechanical amplitude. Let us assume the case where a dynamical system moves from an initial state x A to a terminal statex B . The principle of least action (Feynman & Hibbs 2005) states that the system will follow the trajectory x ∗ A ,x ∗ 1 ,...,x ∗ N−1 ,x ∗ B that is the extremum of the cost function: 63 S = " t B t A L(x, ˙ x,t)dt (3.1) whereL(x, ˙ x,t) is the Langragian of the system defined asL = E kin −U with E kin being the total kinetic energy and U being the potential energy of the system and S is the so called action. For a particle of mass m moving in a potentialV(x), the Lagrangian isL(x, ˙ x)= 1 2 m˙ x 2 −V(x). By using the Calculus of variations, the optimal path can be determined. We start by taking the Taylor series expansion of S(x+δx) and we have: S(x+δx)= " t B t A L(x+δx, ˙ x+δ˙ x,t)dt = " t B t A $ L(x, ˙ x,t)+δx T ∇ x L+δ˙ x T ∇ ˙ x L % dt =S(x)+ " t B t A $ δx T ∇ x L+δ˙ x T ∇ ˙ x L % dt After integrating by parts we will have ΔS = . δx T ∇ ˙ x L / T t 0 − " T t 0 δx T . d dt $ ∇ ˙ x L % −∇ x L / dt Given the condition δx(t 0 ) = 0 and δx T = 0 we will have: ΔS =− " t B t A δx T . d dt $ ∇ ˙ x L % −∇ x L / dt To find the optimal trajectory we set ΔS = 0. Therefore the condition that the optimal trajectory satisfies is expressed as: 64 d dt $ ∇ ˙ x L % −∇ x L=0 TheequationaboveisthesocalledEuler-Lagrangeequation. Inquantummechanics, for a motion of a particle fromx 0 tox T there is the concept of amplitudeK(x 0 ,x T ) that is associated with it. This amplitude is defined as the sum of contributions φ(x) of all the trajectories that start fromx 0 and end inx T . The contributions φ(x) are defined as: φ(x A →x B )=const×exp $ j h S(x A →x B ) % where S(x) is the action, and const is a normalization factor. Based on the definition of contributions of individual paths (Feynman & Hibbs 2005), the amplitude is defined as: K(x A ,x B )=K(x A →x B )= > φ(x A →x B )= > const×exp $ j h S(x A →x B ) % (3.2) The probability of going from x A to x B is defined as the square of the amplitude K(x A ,x B ) and thus it is expressed as p(x A → x B )= |K(x A ,x B )| 2 (Feynman & Hibbs 2005) . Clearly, the mathematical term of the contribution of each individual path φ(x A → x B ) is represented by a complex number. This is because light can be thought not only as moving particles with tiny mass but also as waves traveling via different paths towards the same destination. Moreover, although the concept of amplitude K(x A → x B ) is 65 associated with the probability p(x A → x B ), it remains somehow an abstract concept. One can further understand it, by looking into how laws of classical mechanics arise from the quantum mechanical law and how the path integral formulation provides a mathematical representation for this relationship. Toinvestigatetherelationbetweenclassicalandquantummechanicallawsitisimpor- tant to realize that the term ¯ h = h 2π =1.055×10 −27 erg·sec whereh is Planck’s constant, is a very small number. 
In addition, in classical mechanics the action S is much larger than ¯ h due to the scale of masses and time horizons of bodies and motions. Thus the fact that 1 ¯ h is a very large number increases the sensitivity of the phase variable θ = S(x+δx) ¯ h of a path with respect to the changes of the actionS(x) of the corresponding path. Small deviation of the action S(x+δx) create enormous changes in the phase variable θ of the path. As a consequence, neighbored paths of the classical extremum, will have very high phases with opposite signs which will cancel out their corresponding contributions. Only paths in the vicinity of the extremum path will be in-phase and they will contribute and create the extremum path which satisfies the Euler-Langrange equation. Thus, clearly in the classical mechanics there is only one path from x 0 to x T . In the quantum world, the scale of masses and time horizons of bodies and motions are such that the action S(x) is comparable to the term 1 ¯ h . In this case, deviations of the action S(x + δx) do not create enormous changes and thus all paths will interfere by contributing to the total amplitude and the total probability of the motion of the corresponding particle fromx 0 tox T . We realize that the path integral in 3.2 provides a simpleandintuitivewaytounderstandhowclassicalmechanicalandquantummechanical 66 phenomena are related by just changing the scales of body masses and time horizons. Es- sentially the path integral provides a generalization of the concept of action from classical mechanics to the quantum mechanical word. Before we close this subsection, we present some alternative mathematical represen- tations of the path integral in equation 3.2 . More precisely in a compact form, the path integral is written as: K(x A ,x B )= " x B x A exp $ i h S(x) % D(x(t)) (3.3) The path from the statex A tox B can be split into two pieces by incorporating a new state x C . Therefore, the equation above can be written as: K(x A ,x B )= = " x B x A exp $ i h S(x A →x C )+ i h S(x C →x B ) % D(x(t)) = "" x B x A exp $ i h S(x A →x C )+ i h S(x C →x B ) % D(x A ↓x C )D(x C ↓x B )dx C = "" x C x A exp $ i h S(x A →x C ) % D(x A ↓x C ) " x B x C exp $ i h S(x C →x B ) % D(x C ↓x B )dx C Giventhatthepathx A →x B isdefinedasx A →x B = {x A ,x 1 ,...,x C−1 ,x C ,x C+1 ,...,x B } the term D(x A ↓ x C )= dx 1 ×...×dx C−1 and D(x C ↓ x B )= dx C+1 ×...×dx B . The equation above can be written as: K(x A ,x B )= " ∞ −∞ K(x A ,x C )K(x B ,x C )dx C 67 By continuing this process of splitting the paths from x A to x B into subpaths, the path integral takes the form: K(x A ,x B ) = lim dt→0 " ... "" N G i=1 K $ x i+1 ,x i % dx 1 dx 2 ...dx N (3.4) where the kernel K(x i+1 ,x i ) is now defined as: K(x i+1 ,x i )= 1 A exp . i ¯ h δtL $ x i+1 −x i δt , x i+1 +x i 2 , t i+1 +t i 2 %/ (3.5) and A = $ 2πi ¯ hdt m % 1/2 . The equations (3.4) and (3.5) above realize the path integral formulation in discrete time. The path integral formulation is an alternative view of Quantum Mechanics in the next section we discuss the Schr˝ odinger equation and its connection to path integral. 3.1.2 The Schr˝ odinger equation In this section, we show how one of the most central equations in quantum mechanics, the Schr˝ odinger equation, is derived from the mathematical concept of path integrals. 
The connection between the two descriptions is of critical importance since it provides a more complete view of quantum mechanics (Feynman & Hibbs 2005), but it is also an example of mathematical connection between path integrals and PDEs. The derivation starts with the waive function ψ(x,t) which can be though as an amplitude with the slight difference that the associate probability P(x,t)= |ψ(x,t)| 2 is the probability of being at statex at timet without looking into the past. Since the wave function is an amplitude function it satisfies the integral equation: 68 ψ(x N ,t N )= " ∞ −∞ K(x N ,t N ;x N−1 ,t N−1 )ψ(x N−1 ,t N−1 )dx N−1 (3.6) Substitution of the kernel K(x N ,t N ;x N−1 ,t N−1 ) yields: ψ $ x(t+δt),t+δt % = = " ∞ −∞ exp . i ¯ h δtL $ x(t+δt)−x(t) δt , x(t+δt)+x(t) 2 %/ ψ(x,t)dx(t) For simplifying the notation we make the substitutions, x(t+δt)=x and x(t)=y. ψ $ x,t+δt % = " ∞ −∞ exp . i ¯ h δtL $ x−y δt , x+y 2 %/ ψ(y,t)dy(t) Substitution of the Langragian results in: ψ $ x,t+δt % = = 1 A " ∞ −∞ exp . im 2 ¯ h $ x−y % T $ x−y % δt / exp . − i ¯ h δt V $ x+y 2 %/ ψ(y,t)dy(t) = 1 A " ∞ −∞ exp . imv T v 2 ¯ hδt / exp . − i ¯ h δt V $ x+ 1 2 v %/ ψ(y,t)dy(t) 69 where v =x−y. Next the second exponential function is expanded, while ψ(y,t) is expended around x. More precisely we will have: ψ $ x,t+δt % = = 1 A " ∞ −∞ exp . imv T v 2 ¯ hδt /. 1− i ¯ h δtV(x,t) /. ψ(x,t)−∇ψ T v+ 1 2 v T ∇ xx ψv / dy(t) By using the following equalities 1 A F +∞ −∞ vexp ? im 2 ¯ hδt v T v @ dv = 0 as well as the equa- tion 1 A F +∞ −∞ vv T exp ? jm 2 ¯ hδt v T v @ dv = i ¯ hδt m I n×n the wave function is formulated as: ψ $ x,t+δt % = ψ(x,t)− i ¯ h δtV(x,t)ψ(x,t)+ i ¯ hδt 2m tr $ ∇ xx ψ % The last step is to take the Taylor series expansion of ψ $ x,t+δt % = ψ(x,t)+δt∂ t ψ: ψ(x,t)+δt∂ t ψ = ψ(x,t)− iδt ¯ h V(x,t)ψ(x,t)+ i ¯ hδt 2m tr $ ∇ xx ψ % The final version of the Schr˝ odinger equation takes the form: ∂ t ψ =− i ¯ h . − ¯ h 2 2m tr $ ∇ xx ψ % +V(x,t)ψ / (3.7) By introducing the operator H =− ¯ h 2 2m tr $ ∇ xx % +V(x,t) the Schr˝ odinger equation is formulated as: ∂ t ψ =− i ¯ h Hψ (3.8) 70 With the derivation of the Schr˝ odinger equation, we close our introduction to path integrals in quantum mechanics. 3.2 Fokker Planck equation and SDEs TheFokkerplanckPDEisofgreatimportanceinstatisticalmechanicsasithasbeenused to describe the evolution of the probability of particles as a function of space and time. It can be thought as the equivalent of the Schr˝ odinger equation in quantum mechanics. In the next two sections, we will derive the Fokker Planck PDE starting from the underlying Itˆ o and Stratonovich stochastic differential equation (Chirikjian 2009). 3.2.1 Fokker Planck equation in Itˆ o calculus We start with the following stochastic differential equation: dx =f(x,t)dt+B(x,t)dw (3.9) in which x∈" n×1 is the state, and dw = w(t)−w(t +dt) with w(t)∈" p×1 a Wiener process (or Brownian motion process). The equation above is an Itˆ o stochastic differential equation if its solution: x(t)−x(0) = " t 0 f(x,t)dτ + " t 0 B(x,t)dw(τ) (3.10) can be interpreted in the sense: 71 lim µ→∞ E $ H " t 0 B(x(τ),τ)dτ− µ > k=1 B(x(t k−1 ),t k−1 )[w(t k )−w(t k−1 )] I 2% = 0 (3.11) where t 0 =0<t 1 <t 2 < ... < t N = t. The drift part of the SDE in 3.9 is also interpreted as: lim µ→∞ E $ H " t 0 f(x(τ),τ)dτ− 1 µ µ > k=1 f(x(t k−1 ),t k−1 ) I 2% = 0 (3.12) for the cases where the function f(x,t) is not pathological, then the limit can be pushed inside the expectation. 
Consequently, the equation above is true due to the fact that lim µ→∞ " t 0 f(x(τ),τ)dτ = 1 µ µ > k=1 f(x(t k−1 ),t k−1 ) (3.13) For the derivation of the corresponding Fokker Planck equation we will make use of the expectation of terms of the form E(Zdx) and E A dxM dx T B whereZ∈" n×1 and M∈" n×n . Thus we will have: E(Zdx)=E $ Z f(x,t)dt % =Z f(x,t)dt (3.14) E A dxMdx T B =E $ dw T B(x,t) T MB(x,t)dw % =tr $ B(x,t) T MB(x,t)dt % (3.15) 72 where we have used the properties of the Wiener process E(dw) = 0, E A dw dw T B = dtI m×m . NowwearereadytoderivetheFokkerPlanckPDEandwestartwiththepartial derivativeoftheprobabilityfunctionp(x(t)|y,t)wherey =x(0). Toavoidanyconfusion, it is important to understand the notation p(x(t)|y,δt). In particular p(x(t)|y,δt) is interpreted as the probability of being at state x at time t 1 given that the state at time t 2 <t 1 is y(t 2 ) and t 1 −t 2 = δt. Consequently, in case where t 1 = t and t 2 = 0, the transitionprobabilityp(x|y,t)isabsolutelymeaningful. Thepartialderivativeofp(x|y,t) with respect to time is expressed as follows: ∂p(x|y,t) ∂t = lim δt→0 p(x|y,t+δt)−p(x|y,t) δt (3.16) theprobabilityp(x|y,t+δt)canbealsowrittenviatheChapmanKolmogorovequation as p(x|y,t+δt)= F p(x|z,t)p(z|y,δt)dz. Therefore we will have that: ∂p(x|y,t) ∂t = lim δt→0 F p(x|z,t)p(z|y,δt)dz−p(x|y,t) δt (3.17) Lets define the function ψ(x,t)∈" that is compactly supported and it isC 2 . We project ∂p(x|y,t) ∂t on ψ(x,t) in Hilbert space and we have that: " ∂p(x|y,t) ∂t ψ(x)dx = lim δt→0 1 δt " $" p(x|z,t)p(z|y,δt)dz−p(x|y,t) % ψ(x,t)dx (3.18) by exchanging the order of integration we will have that: 73 " ∂p(x|y,t) ∂t ψ(x,t)dx = lim δt→0 1 δt $"" p(x|z,t)p(z|y,δt)ψ(x)dxdz− " p(x|y,t)ψ(x)dx % (3.19) The Taylor series expansion of ψ(x)= ψ(z+dx) where dx = x−z is expressed as follows: ψ(x)= ψ(z)+∇ z ψ(z)(x−z)+ 1 2 (x−z) T ∇ zz ψ(z)(x−z) (3.20) We substitute the expanded term ψ(x) in the first term of the left side of (3.19) and we have: "" p(x|z,t)p(z|y,δt) $ ψ(z)+∇ z ψ(z) dx+ 1 2 dx T ∇ zz ψ(z) dx % dxdz = " p(z|y,t)ψ(z)dz+ "" p(x|z,t)p(z|y,δt) $ ∇ z ψ(z) dx+ 1 2 dx T ∇ zz ψ(z) dx % dxdz (3.21) Theterms F p(x|z,δt)∇ z ψ(z) T dxdy and F p(x|z,δt)dx T ∇ zz ψ(z)dxdy canbewrit- ten in the formE A ∇ z ψ(z) T dx B andE A dx T ∇ zz ψ(z) dx B which, according to (3.14) and (3.15)areequalto∇ z ψ(z) T f(z,t)dtandtr $ B(z,t) T ∇ zz ψ(z)B(z,t)dt % . Bysubstituting (3.21) in to (3.19) it is easy to show that: 74 " ∂p(x|y,t) ∂t ψ(x)dx = " p(z|y,t) $ ∇ z ψ(z) T f(z,t)+ 1 2 tr $ ∇ zz ψ(z) B(z,t)B(z,t) T %% dz (3.22) wheretheterms F p(z|y,t)ψ(z)dzin(3.21)and− F p(x|y,t)ψ(x)dxin(3.19)areequal and therefore they have been cancelled out. In the final step we integrate by part the ride side of the equation above and therefore we will have that: " ∂p(x|y,t) ∂t ψ(x)dx (3.23) = " −∇ z ·(f(z,t)p(z|y,t))+ 1 2 tr $ ∇ zz $ B(z,t)B(z,t) T p(z|y,t) %% ψ(z)dz Since x,z∈" n and the integrals are calculated in the entire" n the equation above is written in the following form: " , ∂p(x|y,t) ∂t +∇ x ·(f(x,t)p(x|y,t))− 1 2 tr $ ∇ xx $ B(x,t)B(x,t) T p(x|y,t) %% - ψ(x)dx (3.24) 75 We now apply the fundamental theorem of calculus of variations (Leitmann 1981) according to which: let f(x) ∈C k , if F b a f(x)h(x)dx=0, ∀ h(x) ∈C k and h(a)= h(b)=0 then f(x) = 0. 
Consequently we will have that: ∂p(x|y,t) ∂t +∇ x ·(f(x,t)p(x|y,t))− 1 2 tr $ ∇ xx $ B(x,t)B(x,t) T p(x|y,t) %% = 0 (3.25) or in the form: ∂p(x|y,t) ∂t =− n > i=0 ∂ ∂x i $ f(x,t)p(x|y,t) % + 1 2 m > k=1 n > i,j=1 $ B i,k (x,t)B k,j (x,t) T p(x|y,t) % (3.26) The PDE above is the so called Fokker Planck Equation which is a forward, second order and linear PDE. From the derivation it is clear that the Fokker Planck equation describes the evolution of the transition probability of a stochastic dynamical system of the form (3.9) over time. In fact, lets consider a number of trajectories as realizations of the same stochastic dynamics then the 2nd term, which corresponds to drift, in (3.37) controlsthedirectionofthetrajectorieswhilethe3rdterm, thatcorrespondstodiffusion, quantifies how much these trajectories spread due to noise in the dynamics (3.9). As we will show in the next section the FKP PDE differs from the forward Kolmogorov PDE only in one term but we will leave this discussion for the future section. 76 3.2.2 Fokker Planck equation in Stratonovich calculus In this section we derive the Fokker Planck PDE in case where the underlying stochastic differential equation is integrated in the Stratonovich sense. We start our analysis with the stochastic differential equation in (3.9) if its solution is interpreted as the integral: x(t)−x(0) = " t 0 f (S) (x,τ)dτ + " t 0 B (S) (x,τ)⊕dw(τ) (3.27) where the superscript S is used to distinguish that the function f(x,t) and B(x,t) are evaluated in the Stratonovich convention and therefore they are different from the corresponding functions in the Itˆ o calculus. More precisely the Stratonovich integration 1 is defined as: " t 0 t f(τ)⊕w(τ) = lim t→∞ n > i=1 f $ t i +t i−1 2 % (w(t i )−w(t i−1 )) (3.28) where the equal sign above is understood in the mean square sense. Clearly, the drift part of solution (3.27) of the stochastic differential equation (3.9) can be interpreted as: " t 0 f (S) (x,τ)dτ = 1 n lim n→∞ n > i=1 f $ x(t i )+x(t i−1 ) 2 , t i −t i−1 2 % (3.29) while the diffusion part is interpreted as: " t 0 B (S) (x,τ)⊕dw(τ)= 1 n lim n→∞ n > i=1 B $ x(t i )+x(t i−1 ) 2 , t i −t i−1 2 % (w(t i )−w(t i−1 )) (3.30) 1 The symbol⊕ is used to represent the Stratonovich integral. 77 The equalities (3.29) and (3.30) are understood in the mean square sense. In order to find the Fokker Planck equation for the case of the Stratonovich integration of (3.9) we will first find the connection between Itˆ o and Stratonovich integrals. Through this connection we will be able to find the Stratonovich Fokker Planck PDE without explicitly deriving it. More precisely, we rewrite (3.9) in scalar form expressed by the equation: dx i =f S i (x,t)dt+ m > j=1 B S i,j (x,t)⊕dw j (3.31) where the terms f S i (x,t) and B S i,j (x,t) are given below: f S i (x,t)dt =f i $ x $ t k +t k−1 2 % ,t % dt, B S i,j (x,t)=B i,j $ x $ t k +t k−1 2 % ,t % We will take the Taylor series expansion of the term above since x ? t k +t k−1 2 @ = x(t k−1 )+ 1 2 dx. More precisely we have that: f i $ x(t k−1 )+ 1 2 dx % dt (3.32) =f i $ x(t k−1 ) % dt+ 1 2 $ ∇ x(t k−1 ) f i (x(t)) % T dx dt =f i $ x(t k−1 ) % dt+ 1 2 $ ∇ x(t k−1 ) f i (x(t)) % T $ f(x,t)dt+B(x,t)dw % dt =f i $ x(t k−1 ) % dt+ 1 2 $ ∇ x(t k−1 ) f i (x(t)) % T f(x,t)dt 2 + 1 2 $ ∇ x(t k−1 ) f i (x(t)) % T B(x,t)dwdt 78 Since, dt 2 → 0 and dwdt→ 0, the 2nd and 3rd term in the equation above drop and this we have the result: f i $ x(t k−1 )+ 1 2 dx % dt =f i $ x(t k−1 ) % dt (3.33) We continue with the term B i,j $ x ? 
t k +t k−1 2 @ ,t % and we will have that: B i,j $ x(t k−1 )+ 1 2 dx % dw j (3.34) =B i,j $ x(t k−1 ) % dw j + 1 2 ∇ x(t k−1 ) B i,j (x(t)) T dx dw j =B i,j $ x(t k−1 ) % dω j + 1 2 ∇ x(t k−1 ) B i,j (x(t)) T $ f(x t k−1 )dt+B(x t k−1 )dw % dw j =B i,j $ x(t k−1 ) % dw j + 1 2 ∇ x(t k−1 ) B i,j (x(t)) T B(x t k−1 ) dw dw j =B i,j $ x(t k−1 ) % dw j + 1 2 n > l=1 ∂B i,j (x t k−1 ) ∂x l B i,l (x t k−1 ) dt where we have used the fact that dw dw j =dt and dwdt→ 0. By substituting back into (3.31) we will have: dx i =f i (x,t)dt+ m > j=1 B i,j (x)dw j + m > j=1 n > l=1 ∂B i,j (x) ∂x l B i,l (x) dt (3.35) The stochastic differential equation above is expressed in Itˆ o calculus and it is equiv- alent to its Stratonovich version equation (3.31). In other words the Stratonovich inter- pretation of the solution of (3.31) is equivalent to the Itˆ o interpretation of the solution of the equation (3.35). Now that we found the equivalent of the Stratonovich stochastic 79 differential equation in Itˆ o calculus, the Stratonovich Fokker Planck equation is nothing else than the Itˆ o Fokker Planck equation of the stochastic differential equation: dx = $ f(x,t)−C(x) % dt+B(x)dw, 1 2 C i (x)= m > j=1 n > l=1 ∂B i,j (x) ∂x l B i,l (x) (3.36) Thus, the Stratonovich Fokker Planck equation has the form: ∂p(x|y,t) ∂t =−∇ y · $$ f(y)−C(y) % p(x|y,t) % + 1 2 tr $ ∇ yy $ B(y,t)B(y,t) T p(x|y,t) %% (3.37) The difference between the Stratonovich and Itˆ o Fokker Planck PDEs is in the extra term C(x). In the question, which calculus to use, the answer depends on the appli- cation and the goal of the underlying derivation. It is generally accepted (Chirikjian 2009),(Øksendal 2003) that the Itˆ o calculus is used for the cases where expectation op- erations have to be evaluated while the Stratonovich calculus has similar properties with the usual calculus. In this section we have derived the connection between the two cal- culi and therefore, one could take advantage of both by transforming the Stratonovich interpreted solution of a stochastic differential equation in to its Itˆ o version and then apply Itˆ o calculus. Besides these conceptual differences between the two calculi, there are additional characteristics of Itˆ o integration which we do not find in the Stratonovich calculus and vice versa. More detailed discussion on the properties of the Itˆ o integration can been found in (Øksendal 2003), (Karatzas & Shreve 1991), (Chirikjian 2009) and (Gardiner 2004). 80 Both Itˆ o and Stratonovich stochastic integrations are special cases of a more general stochastic integration rule in which functions (3.32) are evaluated∀α∈ [0,1] as follows: f α i (x,t)=f i $ x(αt k +(1−α)t k−1 ),t % , B α i,j (x,t)=B i,j $ x(αt k +(1−α)t k−1 ),t % Similarly as before since x(αt k +(1−α)t k−1 )= x(t k )+ αdx we take the Taylor series expansions of the terms above and therefore we will have that: f i $ x(t k )+αdx % =f i $ x(αt k +(1−α)t k−1 ),t % dt =f i $ x(t k ),t % dt (3.38) and B i,j $ x(t k−1 )+αdx % dω j =B i,j $ x(t k−1 ) % dω j +α n > l=1 ∂B i,j (x t k−1 ) ∂x l B i,l (x t k−1 ) dt (3.39) Consequently, the termC(x)∈" n×1 in (3.36) is now defined as follows: C i (x)= α m > j=1 n > l=1 ∂B i,j (x) ∂x l B i,l (x) (3.40) Clearly, for α = 0 we have the Itˆ o calculus while for α = 1 2 we have the Stratonovich. 81 3.3 Path integrals and SDEs The path integral formalism can been thought as an alternative, to Fokker Planck and Langevin equations, mathematical description of nonlinear stochastic dynamics (Lau & Lubensky 2007). 
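Before deriving the path integral itself, a brief simulation sketch may help make the difference between the two stochastic conventions of the previous section tangible. The Python fragment below is a minimal illustration (the choice of drift, diffusion, step size and seed is arbitrary): it integrates a scalar SDE with the Euler-Maruyama scheme and, when the Stratonovich interpretation is requested, adds the standard scalar drift correction (1/2) B ∂xB so that the same Itô integrator can be reused.

import numpy as np

def euler_maruyama(f, B, dBdx, x0, T, dt, stratonovich=False, seed=0):
    # Simulate the scalar SDE dx = f(x) dt + B(x) dw with Euler-Maruyama.
    # If stratonovich=True, the diffusion is interpreted in the Stratonovich
    # sense and the equivalent Ito drift f(x) + 0.5 * B(x) * dB/dx(x) is used.
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    x = np.empty(steps + 1)
    x[0] = x0
    for k in range(steps):
        drift = f(x[k])
        if stratonovich:
            drift += 0.5 * B(x[k]) * dBdx(x[k])   # Ito-equivalent drift correction
        x[k + 1] = x[k] + drift * dt + B(x[k]) * np.sqrt(dt) * rng.standard_normal()
    return x

# Example with linear drift and multiplicative noise (illustrative values only)
path_ito   = euler_maruyama(lambda x: -x, lambda x: 0.5 * x, lambda x: 0.5,
                            x0=1.0, T=1.0, dt=1e-3)
path_strat = euler_maruyama(lambda x: -x, lambda x: 0.5 * x, lambda x: 0.5,
                            x0=1.0, T=1.0, dt=1e-3, stratonovich=True)

Averaging many such sample paths approximates the transition probability whose evolution is governed by the corresponding Fokker-Planck equation, and it is this transition probability that the path integral derived next represents as an integral over all intermediate states.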
In this section we will start with the derivation of the path integral formalism for the one dimensional stochastic differential equation: dx =f(x,t)dt+B(x,t)dw (3.41) Westartwiththe1dimensionalcasesbecauseitiseasiertounderstandthederivation andtherationalbehindthepathintegralconcept. Afterthe1dimensionalcaseextensions to multidimensional cases are strait-forward for the case of Itˆ o integration while the Stratonovich involves additional analysis. We will discretize the 1-dimensional stochastic differential equation as follows: x(t+δt)−x(t)= " t+δt t f $ βx(t)+(1−β)x(t+δt) % dτ + " t+δt t B $ αx(t)+(1−α)x(t+δt) % n(τ)dτ where the constants α,β∈ [0,1]. Since the drift and the diffusion term are evaluated at βx(t)+(1− β)x(t+δt) and αx(t)+(1− α)x(t+δt) they can be taken outside the integral and thus the equations above is expressed as: 82 x(t+δt)−x(t)=f $ βx(t)+(1−β)x(t+δt) % δt+B $ αx(t)+(1−α)x(t+δt) %" t+δt t n(τ)dτ (3.42) The path integral derivation is based on the statistics of the state path x(t 0 → t N ). We discretize the state path to the segments x(t i ) with t 0 <t 1 <t 2 < ... < t N and we define δt =t i −t i−1 . The probability of the path now is defined as follows: P $ x N ,t N ;x N−1 ,t N−1 ;...;x 1 ,t 1 |x 0 ,t 0 % = ! δ[x N −φ(t N ;x 0 ,t 0 ]...δ[x 1 −φ(t 1 ;x 0 ,t 0 ] # The function φ(t i ;x i−1 ,t i−1 ) is the solution of the stochastic differential equation (3.41)forx(t i )giventhatx(t i−1 )=x i−1 . Duetothefactthatthenoiseisdeltacorrelated, in different time intervals the noise is uncorrelated and therefore we will have that: P $ x N ,t N ;x N−1 ,t N−1 ;...;x 1 ,t 1 |x 0 ,t 0 % = N G i=1 ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # where the function ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # =P $ x i ,t i |x i−1 ,t i−1 % and thus it corre- spondstoconditionalprobabilitythattherandomvariablex(t)isstatex i attimet i given that att i−1 we have thatx(t i−1 )=x i−1 . We can use the transition probabilities to calcu- late the probability P(x N ,t N |x 0 ,t 0 ) that is the probability of being at state x(t N )=x N given that the initial state is x(t 0 )=x 0 at time t 0 . More precisely we have that: 83 P $ x N ,t N |x 0 ,t 0 % = " dx N−1 ... " dx 1 N G i=1 ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # (3.43) To find the path integral we need to evaluated the function δ[x i −φ(t i ;x i−1 ,t i−1 ] and then substitute to the equation above. The analysis for the evaluation of the function δ[x i −φ(t i ;x i−1 ,t i−1 )]requiresthediscretizedversionofthestochasticdifferentialequation (3.41) expressed in (3.42). We can rewrite discrete version in the form: x i =x i−1 +δt f i +B i " t i t i−1 n(τ)dτ where f i = f (βx i +(1−β)x i−1 ) and B i = B(αx i +(1−α)x i−1 ) and we introduce the function h(x i ,x i−1 ) defined as follows: h(x i ,x i−1 )= x i −x i−1 −f i δt B i − " t i t i−1 n(τ)dτ for which the condition h[φ(t i ;x i−1 ,t i−1 ),x i−1 ] = 0. By using the property of the delta function δ(g(x)) = δ(x−x 0 ) |g " (x 0 | we will have that: δ[h(x i ,x i−1 )] = J J J J ∂h(x i ,x i−1 ) ∂x i J J J J −1 x i =φ(t i ) δ[x i −φ(t i ;x i−1 ,t i−1 )] The transition probability P(x i ,t i |x i−1 ,t i−1 ) is now written as: 84 P(x i ,t i |x i−1 ,t i−1 )= ! δ[x i −φ(t i ;x i−1 ,t i−1 )] # = J J J J ∂h(x i ,x i−1 ) ∂x i J J J J x i =φ(t i ) ! δ[h(x i ,x i−1 )] # The term ∂h(x i ,x i−1 ) ∂x i is expressed by the equation: ∂h(x i ,x i−1 ) ∂x i = $ 1−β δt (∂ x f i ) % B i − $ x i −x i−1 −f i δt % (∂ x B i ) B 2 i = 1 B i . 
1−β δt (∂ x f i )−α (∂ x B i ) B i $ x i −x i−1 −f i δt %/ From the property of the inverse Fourier transform of the delta function δ(t)= F +∞ −∞ dωexp(jωt) 1 2π with j 2 =−1 we will have that: δ[h(x i ,x i−1 )] = " dω 2π exp $ jωh(x i ,x i−1 ) % = " +∞ −∞ dω 2π exp $ jω $ x i −x i−1 −f i δt B i − " t i t i−1 n(τ)dτ %% The expectation operator over the noise results in: ! δ[h(x i ,x i−1 )] # = " dω 2π exp $ jω x i −x i−1 −f i δt B i %! exp $ jω " t i t i−1 n(τ)dτ %# 85 By considering the Taylor series expansion of the exponential function and applying the expectation, it can be shown that: ! exp $ jω " t i t i−1 n(τ)dτ %# = = ! 1+ jω F t i t i−1 n(τ)dτ 1! − ω 2 ( F t i t i−1 n(τ)dτ) 2 2! − jω 3 ( F t i t i−1 n(τ)dτ) 3 3! .... # = 1+ ω 2 δt 2 = exp−( 1 2 ω 2 δt) Therefore ! δ[h(x i ,x i−1 )] # = F +∞ −∞ dω 2π exp $ jω x i −x i−1 −f i δt B i − 1 2 ω 2 δt) % . By putting ev- erything together the transition probability P(x i ,t i |x i−1 ,t i−1 ) will take the form: P(x i ,t i |x i−1 ,t i−1 )= " dω 2πB i exp $ jω x i −x i−1 −f i δt B i + 1 2 ω 2 δt % × . 1−β δt (∂ x f i )−α (∂ x B i ) B i $ x i −x i−1 −f i δt %/ From the expression above we work with the term: − α(∂ x B i ) 2πB i " dωexp $ jω x i −x i−1 −f i δt B i + 1 2 ω 2 δt %$ x i −x i−1 −f i δt B i % =− α(∂ x B i ) 2πB i " dω j exp $ − 1 2 ω 2 δt % ∂ ω $ exp $ jω x i −x i−1 −f i δt B i %% =− α(∂ x B i ) 2πB i " dω $ jωδt % exp $ jω x i −x i−1 −f i δt B i − 1 2 ω 2 δt % The transition probability is expressed as follows: 86 P(x i ,t i |x i−1 ,t i−1 )= " dω 2πB i exp $ jω x i −x i−1 −f i δt B i − 1 2 ω 2 δt % × . 1−β δt (∂ x f i )−jωα(∂ x B i )δt / The term . 1−β δt (∂ x f i )−jωα(∂ x B i )δt / 2 exp . 1−β δt (∂ x f i )−jωα(∂ x B i )δt / as δt→ 0. Thus we will have: P(x i ,t i |x i−1 ,t i−1 ) = " dω 2πB i exp $ jω x i −x i−1 −f i δt+α(∂ x B i )δtB i B i − 1 2 ω 2 δt−β δt (∂ x f i ) % = " dω 2πB i exp $. j ω B i $ δx δt −f i +α(∂ x B i )B i % − 1 2 ω 2 δt−β (∂ x f i ) / δt % = exp[−β(∂ x f i )δt] √ 2πδtB i " dω √ 2πδtB i exp $. j ω B i $ δx δt −f i +α(∂ x B i )B i % − 1 2 ω 2 / δt % wedefinethequantity η = j B i $ δx δt −f i +α(∂ x B i )B i % . Wecannowwritethetransition probability as follows: P(x i ,t i |x i−1 ,t i−1 ) = exp[−β(∂ x f i )δt] √ 2πδtB i " dω √ 2πδtB i exp $. ωη− 1 2 ω 2 / δt % = exp[−β(∂ x f i )δt] √ 2πδtB i " dω √ 2πδtB i exp $. ωη− 1 2 ω 2 − 1 2 η 2 / δt % exp $ 1 2 η 2 δt % 87 = exp[−β(∂ x f i )δt] √ 2πδtB i exp $ 1 2 η 2 δt %" dω √ 2πδtB i exp $. ωη− 1 2 ω 2 − 1 2 η 2 / δt % = exp[−β(∂ x f i )δt] √ 2πδtB i exp $ 1 2 η 2 δt % = 1 √ 2πδtB i exp $ − . 1 2B 2 i $ δx δt −f i −α(∂ x B i )B i % 2 +β (∂ x f i ) / δt % The last line is valid up to the first order in δt. Clearly the path integral is given by the product of all the transition probabilitiesalong the pathx 0 ,x 1 ,...,x N . More precisely we will have that: P $ x N ,t N |x 0 ,t 0 % = " dx 1 √ 2πδtB i ... " dx N−1 √ 2πδtB N−1 " dx N √ 2πδtB N ×exp $ − N > i=1 . 1 2B 2 i $ δx δt −f i +α(∂ x B i )B i % 2 +β (∂ x f i ) / δt % = " x N x 0 D(x) e −S(x) whereD(x)= K N i=1 dx i √ 2πδtB i andS(x)= C N i=1 . 1 2B 2 i $ δx δt −f i +α(∂ x B i )B i % 2 +β (∂ x f i ) / δt is the action defined on the path x 1 ,...,x N . 3.3.1 Path integral in Stratonovich calculus The path integral defined above is general since for different values of the parameters α and β it can recover the Stratonovich and the Itˆ o path integral mathematical forms. 
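As a purely numerical illustration, and under a drift and diffusion chosen only for the example, the sketch below evaluates the discretized action S(x) of a single sampled path for arbitrary values of α and β; the unnormalized weight of the path is then exp(−S(x)). This is not part of the derivation.

```python
import numpy as np

# Minimal numerical sketch (not part of the derivation): the discretized action
# S(x) of one sampled path, parameterized by alpha (where the diffusion is
# evaluated) and beta (where the drift is evaluated).  f and B are illustrative.
def f(x):   return -x                        # drift f(x)
def df(x):  return -np.ones_like(x)          # df/dx
def B(x):   return 0.5 + 0.1 * np.sin(x)     # diffusion coefficient B(x)
def dB(x):  return 0.1 * np.cos(x)           # dB/dx

def path_action(x, dt, alpha, beta):
    """S(x) = sum_i [ (dx/dt - f_i + alpha*(dB_i)*B_i)^2/(2 B_i^2) + beta*df_i ] dt."""
    xa = alpha * x[1:] + (1.0 - alpha) * x[:-1]   # evaluation point for B
    xb = beta * x[1:] + (1.0 - beta) * x[:-1]     # evaluation point for f
    dx_dt = np.diff(x) / dt
    Bi, dBi = B(xa), dB(xa)
    fi, dfi = f(xb), df(xb)
    lagrangian = (dx_dt - fi + alpha * dBi * Bi) ** 2 / (2.0 * Bi ** 2) + beta * dfi
    return np.sum(lagrangian) * dt

# Sample one path with an Euler discretization and compare the Ito
# (alpha = beta = 0) and Stratonovich (alpha = beta = 1/2) forms of the action.
rng = np.random.default_rng(0)
dt, N = 0.01, 200
x = np.empty(N + 1)
x[0] = 1.0
for k in range(N):
    x[k + 1] = x[k] + f(x[k]) * dt + B(x[k]) * np.sqrt(dt) * rng.standard_normal()

print(path_action(x, dt, alpha=0.0, beta=0.0))   # Ito form of the action
print(path_action(x, dt, alpha=0.5, beta=0.5))   # Stratonovich form of the action
# the unnormalized weight of the path is exp(-S)
```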
In particular, for the values α = β = 1 2 the path integral above corresponds to the 88 Stratonovich calculus while for the cases of α = β = 0 the resulting path integral cor- responds to the Itˆ o calculus. Thus the Stratonovich path integral for the 1 dimensional case is expressed as P $ x N ,t N |x 0 ,t 0 % = " dx 1 √ 2πδtB 1 ... " dx N−2 √ 2πδtB N−2 " dx N−1 √ 2πδtB N−1 ×exp $ − N > i=1 . 1 2B 2 i $ δx δt −f i + 1 2 (∂ x B i )B i % 2 − 1 2 (∂ x f i ) / δt % or in a more compact form: P $ x N ,t N |x 0 ,t 0 % = " N−1 G i=1 dx i exp $ − C N i=1 .$ δx δt −f i + 1 2 (∂xB i )B i √ 2B i % 2 − 1 2 (∂ x f i ) / δt % √ 2πδtB i where f i =f (0.5x i +0.5x i−1 ) and B i =B(0.5x i +0.5x i−1 ). 3.3.2 Path integral in Itˆ o calculus Similarly the Itˆ o path integral for the scalar case is expressed as: P $ x N ,t N |x 0 ,t 0 % = " dx 1 √ 2πδtB 1 ... " dx N−2 √ 2πδtB N−2 " dx N−1 √ 2πδtB N−1 ×exp $ N > i=1 . 1 2B 2 i $ δx δt −f i % 2 / δt % 89 or in a more compact form: P $ x N ,t N |x 0 ,t 0 % = " N−1 G i=1 dx i √ 2πδtB i ×exp $ − N > i=1 . 1 2B 2 i $ δx δt −f i % 2 / δt % where f i =f (x i−1 ) and B i =B(x i−1 ). 3.4 Path integrals and multi-dimensional SDEs In this section we derive the path integral for the multidimensional SDE (Schulz 2006). More precisely we consider the multidimensional SDE: dx =f(x,t)dt+B(x,t)dw (3.44) in which x∈" n×1 , f(x):" n×1 →" n×1 and B(x):" n×1 →" n×p . We will consider the Itˆ o representation of the SDE: x(t i )=x(t i−1 )+f(x,t)dτ + " t i t i−1 B(x,t)dw(τ) (3.45) Similarly to the 1D case we define φ(x t i−1 ,t i−1 ) as the solution of the SDE above, as follows: φ(x t i−1 ,t i−1 )=x(t i−1 )+f(x,t)dτ + " t i t i−1 B(x,t)dw(τ) (3.46) Moreover we define the term h(x t i ,x t i−1 ) as: 90 h(x t i ,x t i−1 )=x(t i )−x(t i−1 )−f(x,t)dτ− " t i t i−1 B(x,t)dw(τ) (3.47) The probability of hitting state x N at t N starting from x 0 at t 0 is formulated as follows: P $ x N ,t N |x 0 ,t 0 % = " dx N−1 ... " dx 1 N G i=1 ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # (3.48) To calculate the probability above the delta function ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # has to be found. The Fourier representation of delta function yields: δ[x i −φ(t i ;x i−1 ,t i−1 ]=det $ J xt i h(x t i ,x t i−1 ) % δ[h(x t i ,x t i−1 )] =det $ J xt i h(x t i ,x t i−1 ) %" dw (2π) n exp $ jw T h(x t i ,x t i−1 ) % where J xt i is the Jacobian. For the Itˆ o SDE above the jacobian J xt i = I n×n and therefore det $ J xt i h(x t i ,x t i−1 ) % = 1. Thus we will have that: δ[x i −φ(t i ;x i−1 ,t i−1 ]= δ[h(x t i ,x t i−1 )] ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # = !" dw (2π) n exp $ jw T h(x t i ,x t i−1 ) %# 91 Substitution of the term h(x t i ,x t i−1 ) yields the equation: ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # = " dw (2π) n exp $ jw T $ x(t i )−x(t i−1 )−f(x,t)dτ %% × ! exp $ jw T B(x t i−1 ,t i−1 )dw(t i−1 ) %# We will further analyze the term: ! exp $ jw T B(x t i−1 ,t i−1 )dw(t i−1 ) %# = ! I + 1 1! jw T B(x,t)dw(t) # − ! 1 2! w T B(x,t)dw(t)dw(t) T B(x,t) T w # + ! O(dw i (t)dw j (t)dw k (t)) # Since dw(t) is Wiener noise we have that ! dw(t)dw(t) T # = I n×n dt. In addition the termO(dw i (t)dw j (t)dw k (t)) has terms of order higher than quadratic in dw i . The expectation of this term will result zero. More precisely, for these terms that are of order ν > 2, where ν is an even number the expectation result in terms of orderµ> 1 in dt and therefore all these terms are zero. For the remaining terms, of order orderν > 2, where ν is an odd number, the expectation will result in zero since ! dw(t) # = 0. Thus, since lim dt→0 ! 
O(dw i (t)dw j (t)dw k (t)) # = 0 we will have: ! exp $ jw T B(x t i−1 ,t i−1 )dw(t i−1 ) %# =I− 1 2! w T B(x,t)B(x,t) T wdt = exp $ − 1 2 w T B(x,t)B(x,t) T wdt % 92 By substituting back we will have: ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # = " dw (2π) n exp $ jw T $ x(t i )−x(t i−1 )−f(x,t)dτ %% ×exp $ − 1 2 w T B(x,t)B(x,t) T wdt % = " dw (2π) n exp $ jw T A % exp $ − 1 2 w T Bwdt % where A = x(t i )− x(t i−1 )− f(x,t)dτ and B = B(x,t)B(x,t) T . The transition probability therefore is formulated as follows: ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # = " dw (2π) n exp $ jw T A− 1 2 w T Bwdt % (3.49) This form of the transition probability is very common in the physics community. In engineering fields, the transition probability is derived according to the distribution of the state space noise that is considered to be Gaussian distributed. Therefore it seems that the transition above is different than the one that would have been derived if we were considering the Gaussian distribution of the state space noise. However as we will show in the rest of our analysis, (3.49) takes the form of Gaussian distribution. More precisely we will have that: ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # = " dw (2π) n exp $ jw T Bdt(Bdt) −1 A− 1 2 w T Bwdt % exp $ − j 2 2 A T (Bdt) −T Bdt(Bdt) −1 A % 93 ×exp $ j 2 2 A T (Bdt) −T Bdt(Bdt) −1 A % = " dw (2π) n exp . − $ jw+(Bdt) −1 A % T (Bdt) $ jω+(Bdt) −1 A % 2 / ×exp $ − 1 2 A T (Bdt) −T Bdt(Bdt) −1 A % = L det(Bdt) L (2π) n × " dwexp .$ jw+(Bdt) −1 A % T (Bdt) $ jω+(Bdt) −1 A %/ × 1 L (2π) n det(Bdt) exp $ −A T (Bdt) −1 A % since for term: L det(Bdt) L (2π) n " dwexp .$ jw+(Bdt) −1 A % T (Bdt) $ jω+(Bdt) −1 A %/ = 1 (3.50) Finally we will have that: ! δ[x i −φ(t i ;x i−1 ,t i−1 ] # = 1 L (2π) n det(Bdt) exp $ − 1 2 A T (Bdt) −1 A % Clearly the transition probability above has the form of a Gaussian distribution. Substitution of the transition probabilities in (3.48) yields the final result: P $ x N ,t N |x 0 ,t 0 % = " N G i=1 dx i (2πδt) m/2 |BB T | 1/2 ×exp $ − 1 2 N > i=1 M M M M δx δt −f(x) M M M M 2 BB T δt % 94 wheretheterm M M δx δt −f(x) M M 2 BB T = A δx δt −f(x) B T BB T A δx δt −f(x) B . Withthissection we have derived the path integral from the stochastic differential equation and therefore we have completed the presentation of the connection between the three different ways of mathematicallyexpressingnonlinearstochasticdynamics. These3differentmathematical representations are the stochastic differential equations, the corresponding Fokker Planck PDE and the path integral formalism. In the next section we focus on forward and backward PDEs, the so called forward and backward Chapman Kolmogorov PDEs, and we discuss the probabilistic representation of the their solutions. 3.5 Cauchy problem and the generalized Feynman Kac representation TheFeynman-Kaclemmaprovidesaconnectionbetweenstochasticdifferentialequations and PDEs and therefore its use is twofold. On one side it can be used to find probabilistic solutions of PDEs based on forward sampling of diffusions while on the other side it can be used to find solution of SDEs based on deterministic methods that numerically solve PDEs. There are many cases in stochastic optimal control and estimation in which PDEs appear. In fact as we have seen in chapter 2, on the control side there is the so called Hamilton-Jacobi-Bellman PDEs which describes how the value function V(x,t) of a stochastic optimal control problem varies as a function of time t and state x. 
In this work, we compute the solution of the linear version of the HJB above with the useoftheFeynman-Kaclemma(Øksendal2003),andthus, inthissectionweprovidethe generalized version of the Feynman-Kac Lemma based on the theoretical development in 95 (Karatzas & Shreve 1991) and (Friedman 1975). This lemma is of a great significance for our analysis and application of path integral control and therefore we believe that it is essential to provide the proof of the lemma. Let us assume an arbitrary but fixed timeT> 0 and the constantL> 0 and λ≥ 0. Furthermore we consider the functions Ψ(x,t) : [0,T]×" n×1 →",F(x,t) : [0,T]×∈ " n×1 →" and q(x,t) : [0,T]×" n×1 → [0,∞] to be continuous and satisfying the conditions: (i) |Ψ(x,t)|≤L ? 1+||x|| 2λ @ or (ii) Ψ(x,t)≥ 0; ∀x∈" n×1 (3.51) (iii) |F(x,t)|≤L ? 1+||x|| 2λ @ or (iv) F(x,t)≥ 0; 0≤t≤T, x∈" n×1 (3.52) Feynman - Kac Theorem: Suppose that the coefficients f i (x) and B i,j (x) satisfy the linear growth condition ||f i (x)|| 2 +||B i,j (x)|| 2 ≤K 2 (1+||x|| 2 ) where K is a positive constant. LetΨ(x,t) : [0,T]×" n×1 →" is continuous andΨ(x,t)∈C 1,2 and it satisfies the Cauchy problem: −∂ t Ψ t =− 1 λ q t Ψ t +f T t (∇ x Ψ t )+ 1 2 tr A (∇ xx Ψ t )BB T B +F(x,t) (3.53) in [0,T)×" n×1 with the boundary condition: Ψ(x,T)= ξ(x); x∈"(n×1) (3.54) 96 as well as the polynomial growth condition: max 0≤t≤T |Ψ(x,t)|≤M A 1+||x|| 2µ B ;x∈" n×1 (3.55) For someM> 0,µ≥ 1 thenΨ(x,t) admits the stochastic representation Ψ(x,t)= ! ξ(x T )exp $ − 1 λ " T t q(x s ,s)ds % + " T t F(x θ ,θ)exp $ − 1 λ " T t q(x,s)ds % dθ # (3.56) on [0,T]×" n×1 ; in particular, such a solution is unique. Proof: Let us consider G(x,t 0) ,t) = Ψ(x,t) Z(t 0 ,t) where the term Z(t 0 ,t) is defined as follows: Z(t 0 ,t) = exp $ − 1 λ " t t 0 Q(x)dτ % (3.57) We apply the multidimensional version of the Itˆ o lemma: dG(x,t 0 ,t)=dΨ(x,t)Z(t 0 ,t)+Ψ(x,t) dZ(t 0 ,t)+dΨ(x,t) dZ(t 0 ,t) (3.58) Since dΨ(x,t) dZ(t 0 ,t) = 0 we will have that: dG(x,t 0 ,t)= dΨ(x,t) Z(t,t N )+ Ψ(x,t) dZ(t 0 ,t). We calculate the differentials dΨ(x,t),dZ(t,t N ) according to the Itˆ o differentiation rule. More precisely for the term dZ(t 0 ,t N ) we will have that: 97 dZ(t 0 ,t)=− 1 λ Q(x)Z(t 0 ,t) dt (3.59) while the term dΨ(x,t) is equal to: dΨ(x,t)= ∂ t Ψ dt+(∇ x Ψ) T dx+ 1 2 dx T (∇ xx Ψ) dx = ∂ t Ψ dt+(∇ x Ψ) T , f(x,t)dt+B(x)dw - (3.60) + 1 2 , f(x,t)dt+B(x)dw - T (∇ xx Ψ) , f(x,t)dt+B(x)dw - Since the following properties (Øksendal 2003) hold, dw T dw → 0, dwdt → 0, the equation above is further simplified into: dΨ(x,t)= ∂ t Ψ dt+(∇ x Ψ) T f(x,t) dt+ 1 2 $ B(x)dw % T (∇ xx Ψ) $ B(x)Ldw % +(∇ x Ψ) T B(x)dw By considering dwdw T →Idt we will have the equation that follows: dΨ(x,t)= ∂ t Ψ dt+(∇ x Ψ) T f(x,t)dt+ 1 2 tr $ (∇ xx Ψ)BB T % dt+(∇ x Ψ) T B(x)dw (3.61) 98 Since we have found the total differential inΨ(x,t) we can substitute back to (3.59) and we get the following expression: dG(x,t 0 ,t)=Z(t,t N ) dt , − 1 λ Q(x)Ψ+∂ t Ψ+(∇ x Ψ) T f(x,t)+ 1 2 tr $ (∇ xx Ψ)BB T % - +Z(t 0 ,t)(∇ x Ψ) T B(x)Ldw According to the backward Kolmogorov PDE (3.53) the term inside the parenthesis equalsF(x,t) and therefore the equation above is formulated as: dG(x,t 0 ,t)=Z(t 0 ,t) ? −F(x,t)dt+(∇ x Ψ) T B(x)Ldw @ With the definition of τ p as τ p ! 
{s≤t;||x||>p} we integrate the equation above in the time interval t∈ [t 0 ,t N ∧τ n ] and we will have then following expression: " t N ∧τp t 0 dG(x,t 0 ,t)=− " t N ∧τp t 0 F(x,t)Z(t 0 ,t)dt+ " t N ∧τp t 0 Z(t 0 ,t)(∇ x Ψ) T C(x)Ldw (3.62) Expectation of the equation above is taken over the sampled paths τ =(x 0 ,...,x t N ) starting from the state x 0 . The resulting equation is expressed as follows: !" t N ∧τp t 0 dG(x,t 0 ,t) # = = ! − " t N ∧τp t 0 F(x,t)Z(t 0 ,t)dt+ " t N ∧τp t 0 Z(t 0 ,t)(∇ x Ψ) T C(x)Ldw # 99 We change the order to the time integration and expectation and due to the fact that E(dw) = 0 the last term of the right side of the equation above drops. Consequently we will have: !" t N ∧τp t 0 dG(x,t 0 ,t) # =− !" t N ∧τp t 0 F(x,t)Z(t,t N )dt # (3.63) The left hand side of the equation above is further written as: ! G(x,t 0 ,t N )1 τp>t N +G(x,t 0 ,τ p )1 τp<t N −G(x,t 0 ,t 0 ) # =− !" t N ∧τp t 0 F(x,t)Z(t 0 ,t)dt # (3.64) or ! G(x,t 0 ,t 0 ) # = !" t N ∧τp t 0 F(x,t)Z(t,t N )dt+G(x,t 0 ,t N )1 τp>t N +G(x,t 0 ,τ p )1 τp<t N # (3.65) SinceG(x,t,t 0 ) = Ψ(x,t)Z(t 0 ,t) andZ(t 0 ,t) = exp ? − 1 λ F t t 0 Q(x)dτ @ all the terms G(x,t 0 ,t N ),G(x,t 0 ,t 0 ) andG(x,τ p ,t N ) are further specified by the equations that follow: G(x,t 0 ,t 0 ) = Ψ(x,t 0 ) G(x,t 0 ,t N ) = Ψ(x,t)exp $ − 1 λ " t N t 0 Q(x)dτ % G(x,t 0 ,τ n ) = Ψ(x,τ p )exp $ − 1 λ " τp t 0 Q(x)dτ % Substituting the equations above to (3.65) results in: 100 Ψ(x,t 0 )=G(x,t 0 ,t 0 )= !" t N ∧τp t 0 F(x,t)exp $ − 1 λ " t N t 0 Q(x)dτ %# + ! Ψ(x,τ p )exp $ − 1 λ " τp t 0 Q(x)dτ % 1 τp≤t N # + ! Ψ(x,t N )exp $ − 1 λ " t N t 0 Q(x)dτ % 1 τp>t N # (3.66) The next step in the derivation is to find the limit of the right hand side of the equationaboveasp→∞. Morepreciselyeitherbyusing(iii)in(3.51)andthedominated convergence theorem or by considering the monotone convergence theorem (see section 3.11) and (iv) in (3.51) the limit of the first term in (3.66) equals: lim p→∞ !" t N ∧τp t 0 F(x,t)exp $ − 1 λ " t N t 0 Q(x)dτ %# = !" t N t 0 F(x,t)exp $ − 1 λ " t N t 0 Q(x)dτ %# The second term in (3.66) is bounded as: ! |Ψ(x,t)|1 τp≤T # ≤M A 1+p 2µ B P (τ p ≤T) where the probability P (τ p ≤T) is expressed as follows: P (τ p ≤T)=P $ max t≤θ≤T ||x θ ||≥p % ≤ p 2m ! max t≤θ≤T ||x θ || 2m # ≤ Cp −2m A 1+||x|| 2m B where the first inequality results from the chebyshev inequality and the second in- equality comes from the property ! max t≤θ≤s ||x θ || 2m # ≤ C A 1+||x|| 2m B e C(T−s) where 101 t≤ s≤ T. Clearly as p→∞ we have ! |Ψ(x,t)|1 τp≤T # ≤ M A 1+p 2µ B P (τ n ≤T)→ 0. Thus: lim p→∞ ! Ψ(x,τ p )exp $ − 1 λ " τp t 0 Q(x)dτ % 1 τp≤t N # =0 Finally the third term converges to ! Ψ(x,t N )exp $ − 1 λ " t N t 0 Q(x)dτ %# The final result of the Feynman Kac lemma is given by the equation that follows: Ψ(x,t)= ! ξ(x T )exp $ − 1 λ " T t q(x s ,s)ds % + " T t F(x θ ,θ)exp $ − 1 λ " T t q(x,s)ds % dθ # withΨ(x,t N )= ξ(x T ). This is the end of the proof of the Feynman-Kac lemma. Since the Feynman-Kac lemma requires the condition of the linear growth of the elements of the drift term f(x,t) and the diffusion matrix B(x,t) in (3.9) one could ask what kind of dynamical systems fulfill these conditions. But before we discuss the generality of the applicability of the Feynman-Kac lemma to a variety of dynamical systems in control and planning application it is critical to identify the conditions under which a solution to the Cauchy problems exist. 
The conditions that guarantee the existence of the solutions as they are reported in (Karatzas & Shreve 1991) and proven in (Friedman 1975) are given bellow: 102 i) Uniform Ellipticity: There exist as positive constant δ such that: n > i=1 n > j=1 α i,j (x,t)ξ i ξ j ≥ δ||ξ|| 2 (3.67) holds for every ξ∈" n×1 and (t,x)∈ [0,∞)×" n×1 . ii) Boundness: The functions f(x,t),q(x,t),α(x,t) are bounded in [0,T]×" n×1 . iii) Holder Continuity: The functions f(x,t),q(x,t),α(x,t) andF(x,t) are uniformly Holder continuous in [0,T]×" n×1 . iv) Polynomial Growth: the functionsΨ(x(t N )) = ξ(x(t N )) andF(x,t) satisfy the (i) and (iii) in (3.51) Conditions (i),(ii) and (iii) can be relaxed by assuming that they are locally true. Essentially, the Feynman- Kac lemma provides solution of the PDE (3.53) in a prob- abilistic manner, if that solution exists, and it also tells us that this solution is unique. The conditions above are sufficient conditions for the existence of the solution of (3.53). WiththegoaltoapplytheFeynman-Kaclemmatolearningapproximatesolutionfor optimal planning and control problems, it is important to understand how the conditions ofthislemmaarerelatedtopropertiesandcharacteristicsofthedynamicalsystemsunder consideration. • The condition of linear growth, for the case of control, is translated as the require- ment to deal with dynamical systems in which their vector field as a function of state is bounded either by a linear function or by number. Therefore the Feyn- man Kac lemma can be applied either to linear dynamical systems of the form 103 ˙ x = Ax +Bu or nonlinear dynamical systems of the form ˙ x = f(x)+G(x)u in whichf i (x)<M||x||. Examplesofnonlinearfunctionsthatsatisfythelineargrowth condition are functions such as cos(x),sin(x). The dynamical systems under con- sideration can be stable or unstable, for as long as their vector field satisfies the linear growth condition then they ”qualify” for the application of the Feynman-Kac lemma. • But what happens for the case of dynamical systems in which the vector field f(x) cannot be bounded ∀x∈" n such as for example the function f(x)= x 2 ? The answer to this question is related to the locality condition. In particular if we know that the dynamical system under consideration operates in a pre-specified region of the state space then an upper bound for the vector field can almost always be found. Consequentlytheconditionsofboundednessincombinationwiththerelaxed condition of locality are important for the application of Feynman-Kac lemma to a rather general class of systems. • Finally, ourviewinapplyingtheFeynmanKaclemmaandthepathintegralcontrol formalism is for systems in which an initial set of state space trajectories is given or generated after an initial control policy has been applied. Thus these systems are initially controlled and clearly their vector field cannot be unbounded as a function of the state. We will continue this discussion of the application of Feynman - Lemma for optimal controlandplanningforthechapterofpathintegralcontrolformalism. Inthenextsection we will try to identify the most important special case of the Feynman Kac lemma. 104 3.6 Special cases of the Feynman Kac lemma. There are many special cases of the Feynman Kac lemma in the literature (Øksendal 2003),(Friedman1975),(Karatzas&Shreve1991),(Fleming&Soner2006)which,atafirst glance, might look confusing and different. Nevertheless, under the generalized version of the Feynman Kac lemma it is easy to recover and recognize all these special cases. 
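Before turning to these special cases, the following Monte Carlo sketch illustrates how the general representation (3.56) can be evaluated by forward sampling of the diffusion. The drift, diffusion, cost and boundary functions are illustrative placeholders, and the discount inside the forcing integral is accumulated up to the intermediate time θ, as in the standard statement of the lemma.

```python
import numpy as np

# Monte Carlo sketch of the Feynman-Kac representation (3.56): forward-sample
# dx = f dt + B dw and average the discounted terminal and forcing terms.
# All model and cost functions below are illustrative choices only.
def f(x):   return -x                  # drift (satisfies the linear growth condition)
def B(x):   return 0.5                 # constant diffusion coefficient
def q(x):   return x ** 2              # state-dependent "discount" q(x) >= 0
def F(x):   return np.cos(x)           # forcing term F(x,t)
def xi(x):  return np.exp(-x ** 2)     # terminal condition xi(x) = Psi(x,T)

def feynman_kac(x0, t, T, lam=1.0, dt=1e-2, n_paths=5000, seed=0):
    rng = np.random.default_rng(seed)
    n_steps = int(round((T - t) / dt))
    x = np.full(n_paths, float(x0))
    discount = np.ones(n_paths)        # exp(-(1/lam) int_t^theta q(x_s) ds)
    running = np.zeros(n_paths)        # accumulated forcing contribution
    for _ in range(n_steps):
        running += F(x) * discount * dt
        discount *= np.exp(-q(x) * dt / lam)
        x += f(x) * dt + B(x) * np.sqrt(dt) * rng.standard_normal(n_paths)
    return np.mean(xi(x) * discount + running)

print(feynman_kac(x0=0.5, t=0.0, T=1.0))   # Monte Carlo estimate of Psi(0.5, 0)
```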
We start with the case where there is no discount cost which is equivalent to q(x) = 0. The backward Kolmogorov PDE, then is formulated as: −∂ t Ψ t =f T t (∇ x Ψ t )+ 1 2 tr $ (∇ xx Ψ t )BB T % +F(x,t); in [0,T)×" n×1 (3.68) with the Feynman - Kac representation: Ψ(x,t)= ! ξ(x T )+ " T t F(x θ ,θ)dθ # (3.69) and Ψ(x,t N )= ξ(x T ). If the forcing term F(x,t)=0 ∀x∈" n×1 ,t ∈ [0,T) then it drops from (3.68) while the Feynman- Kac representation of the resulting PDE is given byΨ(x,t)= E(ξ(x T )). Moreover if the drift term f(x,t) = 0∀x∈" n×1 ,t∈ [0,T) the backward Kolmogorov PDE (3.68) collapses to: −∂ t Ψ t = 1 2 tr $ (∇ xx Ψ t )BB T % ; in [0,T)×" n×1 (3.70) For the case whereB(x,t)=B the difference between the backward Kolmogorov and forward Kolmogorov PDEs, for this special case of (3.70), is only the sign of the partial 105 derivativeofΨwithrespecttotime. Toseethatwejustneedtoapplythetransformation Ψ(x,t) =Φ(x,T−t) =Φ(x,τ) and thus we will have that ∂ t Ψ t =−∂ τ Φ τ . The backward kolmogorov PDE is now transformed to a the forward PDE: ∂ τ Φ τ = 1 2 tr $ (∇ xx Ψ t )BB T % ; in [0,T)×" n×1 (3.71) The PDE above is the forward Kolmogorov PDE which corresponds to SDEs without the drift term and only diffusion term. In the most general cases, the transformation Ψ(x,t) =Φ(x,T −t) =Φ(x,τ) of the backward Kolmogorov PDE results in a forward PDE which does nor always correspond to the forward Kolmogorov PDE. In fact, this is true only in the case F(x,t)=0,q(x) = 0 and f(x,t)=0 ∀x∈" n×1 ,t ∈ [0,T) and constant diffusion matrixB. For the most general case the transformationΨ(x,t)= Φ(x,T−t) =Φ(x,τ) results in the PDEs given by the equation that follows: ∂ τ Φ (i) τ =− 1 λ q(x,T−τ)Φ (i) τ +f (i) τ T (∇ x Φ (i) τ )+ 1 2 tr ? (∇ xx Φ (i) τ )BB T @ +F(x,T−τ) (3.72) withtheinitialconditionΦ(x,0) = exp(− 1 λ φ(t N )). Bysubstituting ˜ q(x,τ)=q(x,T− τ) and ˜ F(x,τ)=F(x,T−τ) the Feynman Kac representation takes the form: Φ(x,τ)= ! ξ(x 0 )exp $ − 1 λ " τ 0 ˜ q(x,s)ds % + " τ 0 ˜ F(x,θ)exp $ − 1 λ " τ t ˜ q(x,s)ds % dθ # 106 The forward PDE in (3.72) and its probabilistic solution above is another form of the Feynman- Kac lemma. In the next section we show how the backward and forward PDEs are related for the most general case. 3.7 Backward and forward Kolmogorov PDE and their fundamental solutions After discussing the solution to the Cauchy problem and presenting special cases of the Feynman-Kac lemma, in this section we investigate the connection between the forward andbackwardKolmogorovPDEs. ThebackwardKolmogorovPDE,aswewillshowinthe nextchapter, appearsundercertainconditionsinageneraloptimalcontrolproblemwhile the forward Kolmogorov PDE is of great importance in nonlinear stochastic estimation. It is therefore of great importance to understand their connection in a mathematical as well as an intuitive level. Towards this goal, we will start our analysis with the definition of the fundamental solution (Karatzas & Shreve 1991) of a second order PDE. Definition: Let consider the nonnegative function D(y,t;x,τ) with 0 < t < τ, x,y∈" n and ξ ∈ C and τ ∈ [0,T]. The function D(y,t;x,τ) is a fundamental solution of the PDE: −∂ t Ψ t =− 1 λ q t Ψ t +f(x t ) T (∇ x Ψ t )+ 1 2 tr $ (∇ xx Ψ t )B(x,t)B(x,t) T % (3.73) 107 if the functionΨ(y,t)= F * n D(y,t;x,τ)ξ(x)dx, satisfies the PDE above and lim t→τ −Ψ(y,t)= ξ(y,τ). 
Before we proceed with the theorem which establishes the connection between the forward and backward Kolmogorov PDEs through the concept of the fundamental solution, let us understand the "physical" meaning of the function D(y,t;x,τ) for 0 < t < τ, x,y ∈ R^n. Assume for the moment that q(x) = 0; then, through the Feynman-Kac lemma, the solution of the PDE is represented as Ψ(x,t) = E(ξ(x_T)). Inspection of this representation together with Ψ(y,t) = ∫_{R^n} D(y,t;x,τ) ξ(x) dx tells us that any fundamental solution of the backward Kolmogorov PDE can be thought of as a transition probability of the stochastic process x which evolves according to the stochastic differential equation (3.9). Consequently, we can write D(y,t;x,τ) = p(x,τ|y,t). Another property of the fundamental solution of a second order PDE comes from the fact that:

lim_{t→τ^-} Ψ(y,t) = lim_{t→τ^-} ∫ D(y,t;x,τ) ξ(x) dx = ξ(y,τ)     (3.74)

From the equation above it is easy to see that the fundamental solution has the property lim_{t→τ^-} D(y,t;x,τ) = δ(y−x), where δ(·) is the Dirac delta function. Since D(y,t;x,τ) has a probabilistic interpretation, the transition probability p(x,τ|y,t) inherits the same property, and therefore p(x,τ|y,t) = δ(y−x) for t = τ. We now present the theorem (Karatzas & Shreve 1991), (Friedman 1975) that establishes the connection between the forward and backward Kolmogorov PDEs through the concept of the fundamental solution.

Theorem: Under the conditions of uniform ellipticity of α(x,t) and of Hölder continuity and boundedness of f(x,t), F(x,t), α(x,t), a fundamental solution of (3.73) exists. Furthermore, for any fixed x, τ the function ψ(y,t) = D(y,t;x,τ) satisfies the backward Kolmogorov PDE. In addition, if the functions (∂/∂x_i) f_i(x,t), (∂/∂x_i) α_{ik}(x,t) and (∂²/∂x_i ∂x_k) α_{ik}(x,t) are bounded and Hölder continuous, then for fixed y, t the function ψ(x,τ) = D(y,t;x,τ) satisfies the forward Kolmogorov equation:

∂_τ ψ(x,τ) = − ∑_{i=1}^{n} ∂/∂x_i ( f_i(x,τ) ψ(x,τ) ) + (1/2) ∑_{i,j=1}^{n} ∂²/(∂x_i ∂x_j) ( α_{i,j}(x,τ) ψ(x,τ) ) − q(x,τ) ψ(x,τ)     (3.75)

The proof of the theorem can be found in (Friedman 1975). Clearly, the fundamental solution D(y,t;x,τ) establishes a connection between the forward and backward Kolmogorov PDEs: considered as a function of the backward variables y,t it satisfies the backward PDE, while considered as a function of the forward variables x,τ it satisfies the forward PDE. To better understand this connection, let us study the example of the diffusion dx = µ dt + σ dω. From the analysis above we know that the transition probability of this diffusion is a fundamental solution; let us verify this statement. The diffusion above can be written in the form x(t+dt) − x(t) = µ dt + σ dω. Substituting x(t+dt) = x(τ) with τ = t + dt and x(t) = y(t), we obtain x(τ) − y(t) = µ(τ−t) + σ dω.
With this new form, the transition probability is expressed by the equation that follows: 109 p(x,τ|y,t)=D(y,t;x,τ)= 1 L 2πσ 2 (t−τ) exp $ − (x−y−µ (τ−t)) 2 2σ 2 (t−τ) % (3.76) ThebackwardandforwardKolmogorovPDEsforthestochasticdiffusionsdx =µdt+ σdω are formulated as follows: − ∂ ∂t p(x,τ|y,t)=µ ∂ ∂y p(x,τ|y,t)+ 1 2 σ 2 ∂ 2 ∂y 2 p(x,τ|y,t) ∂ ∂τ p(x,τ|y,t)=−µ ∂ ∂x p(x,τ|y,t)+ 1 2 σ 2 ∂ 2 ∂x 2 p(x,τ|y,t) The verify the theorem of the fundamental solution we compute the following terms: ∂ ∂y p(x,τ|y,t)= $ x−y−µ (τ−t) σ 2 (τ−t) % p(x,τ|y,t) ∂ 2 ∂y 2 p(x,τ|y,t)= −1 σ 2 (τ−t) $ 1+ x−y−µ (τ−t) σ 2 (τ−t) % p(x,τ|y,t) ∂ ∂x p(x,τ|y,t)=− ∂ ∂y p(x,τ|y,t) ∂ 2 ∂y 2 p(x,τ|y,t)= 1 σ 2 (τ−t) $ −1+ x−y−µ (τ−t) σ 2 (τ−t) % p(x,τ|y,t) In addition to the partial derivative with respect to the state x and y, the time derivatives of the transition probability are formulated as follows: ∂ ∂τ p(x,τ|y,t)= , − 1 τ−t + x−y−µ(τ−t) σ 2 (τ−t) + (x−y−µ(τ−t)) 2 σ 2 (τ−t) 2 - p(x,τ|y,t) 110 ∂ ∂t p(x,τ|y,t)= , 1 τ−t − x−y−µ(τ−t) σ 2 (τ−t) − (x−y−µ(τ−t)) 2 σ 2 (τ−t) 2 - p(x,τ|y,t) BycomputingthetermsintheleftsidesofthePDEsandthetimederivatives ∂ ∂t p(x,τ|y,t) and ∂ ∂τ p(x,τ|y,t) it is easy to show that, indeed, p(x,τ|y,t) satisfies the backward Kol- mogorov in y,τ and the forward Kolmogorov in x,t. 3.8 ConnectionofbackwardandforwardKolmogorovPDE via the Feynman Kac lemma The connection between the backward and the forward Kolmogorov PDEs can be also seen in the derivation of the Feynman Kac lemma. Towards an understanding in depth of the connection between the two PDEs, our goal in this section is to show that in the derivation of the Feynman Kac lemma both PDEs are involved. In particular, we are assuming that the backward Kolmogorov PDE holds, and while we are trying to find its solution,theforwardPDEappearsfromourmathematicalmanipulations. Moreprecisely, we will start our derivation from equation (3.61) in the Feynman Kac lemma but we will assume for simplicity that q(x,t) = 0. More precisely we will have that: dG(x,t 0 ,t)=dt , ∂ t Ψ+(∇ x Ψ) T f(x,t)+ 1 2 tr A ∇ xx ΨBB T B - +(∇ x Ψ) T B(x)Ldw Byintegratingandtakingtheexpectationoftheequationaboveandsince ! dw # = 0: 111 ! dG(x,t 0 ,t) # = !" , ∂ t Ψ+(∇ x Ψ) T f(x,t)+ 1 2 tr A ∇ xx ΨBB T B - dt # The expectation above is taken with respect to the transition probabilityp(x,t|x 0 ,t 0 ) defined based on the Itˆ o diffusion (3.9). Consequently we will have: ! dG(x,t 0 ,t) # = "" p(x,t|x,t 0 ) , ∂ t Ψ+(∇ x Ψ) T f(x,t)+ 1 2 tr A ∇ xx ΨBB T B - dtdx We are skipping few of the steps that we followed in the Feynman- Kac lemma, however it is easy to show that the equation above can be written in the form: ! Ψ(x(t N ) # −Ψ(x(t 0 )) = = "" p(x,t|x 0 ,t 0 ) , ∂ t Ψ+(∇ x Ψ) T f(x,t)+ 1 2 tr A ∇ xx ΨBB T B - dtdx we integrate by parts and therefore: " * n p(x,t N |x 0 ,t 0 )Ψ(x(t N ))dx−Ψ(x(t 0 )) = " * n " Ψ , −∂ t p(x,t|x 0 ,t 0 )+∇ x (f(x,t)p(x,t|x 0 ,t 0 )) - + " * n " Ψ , 1 2 tr A ∇ xx p(x,t|x 0 ,t 0 )BB T B - dtdx + " * n p(x,t|x 0 ,t 0 )Ψ(x,t)dx| t=t N t=t 0 112 The last term F * n p(x,t|x 0 ,t 0 )Ψ(x,t)dx| t=t N t=t 0 is further written as: " * n p(x,t N |x 0 ,t 0 )Ψ(x,t)dx− " * n p(x,t 0 |x 0 ,t 0 )Ψ(x,t)dx From the equation above we conclude the following: Ψ(x(t 0 )) = " * n p(x,t 0 |x 0 ,t 0 )Ψ(x,t 0 )dx (3.77) and also −∂ t p(x,t|x 0 ,t 0 )+∇ x (f(x,t)p(x,t|x 0 ,t 0 ))+ 1 2 tr A ∇ xx p(x,t|x 0 ,t 0 )BB T B = 0 (3.78) The first equation tells us that the transition probabilityp(x,t|x 0 ,t 0 ) acts as a Dirac function since lim t→t + 0 p(x,t|x 0 ,t 0 )= δ(x−x 0 ). 
We have arrived at the same conclusion, regarding the transition probability, with the result in the previous section where we showed that, in fact this is a general property of the fundamental solutions of PDEs and therefore since the transition lim t→t + 0 p(x,t|x 0 ,t 0 ) is a fundamental solutions, it inherits the same property. Clearly in this section we do not use the theory of fundamental solutions but we find the same result by slightly changing the derivation of the Feynman- Kaclemma. ThesecondequationisnothingelsethantheforwardKolmogorovPDEandit tells us that that the transition probabilityp(x,t|x 0 ,t 0 ) satisfies the forward Kolmogorov PDE in x and t . Essentially the derivation of the Feynman- Kac lemma can be used to 1) find the probabilistic interpretation of the solution of the backward kolmogorov equationandthusprovideasolutiontotheCauchyproblem. 2)Shownthatthetransition 113 probabilityactsasadiracfunctionthusitsharesthesamepropertywiththefundamental solution of the PDEs and 3) Prove that the forward kolmogorov can be thought as an outcome of the Feynman-Kac lemma and thus to offer another perceptual view of the connection between the forward and backward Kolmogorov PDEs. The discussion so far seems a bit abstract. So one could ask why all these? What do really this PDEs represent? Where do we find them in engineering? 3.9 ForwardandbackwardKolmogorovPDEsinestimation and control We will close this chapter on the connection between the forward and backward Kol- mogorov PDEs with the discussion on how these PDEs appear in nonlinear control and estimation problems. We start our analysis with the Zakai equation which is found in nonlinear estimation theory. More precisely we consider the nonlinear filtering problem in which the stochastic dynamics are expressed by the equation: dx =f(x,t)dt+B(x,t)dw while the observations are given by the diffusion: dy =h(x,t)dt+dv 114 The goal is to estimate the state of the stochastic dynamics which is equivalent of finding the probability density p(x,t) of the state x at time t. This probability density satisfies the Zakai equation: ∂p =− n > i=0 ∂ ∂x i $ f(x,t)p % dt+ 1 2 m > k=1 n > i,j=1 $ B i,k (x,t)B k,j (x,t) T p % dt+ph(x,t) T dy The PDE above is linear, second order and stochastic. The stochasticity is incorpo- rated due the the last term which is a function of the observations dy. Substitution of the observation model to the PDE above, results in the following linear stochastic PDE: ∂p =− n > i=0 ∂ ∂x i $ f(x,t)p % dt+ 1 2 m > k=1 n > i,j=1 $ B i,k (x,t)B k,j (x,t) T p % dt+ph(x,t) T h(x,t)dt +ph(x,t) T dv From the equation above we can see that for h(x,t) = 0 the forward Zakai equation collapses to a forward Chapman- Kolmogorov PDE. As it will be shown in the next chapter, the backward chapman Kolmogorov PDE appears in optimal control and it has the form: −∂ t Ψ t =− 1 λ q t Ψ t +f T t (∇ x Ψ t )+ 1 2 tr A (∇ xx Ψ t )BB T B With respect to the forward Zakai PDE, the backward Chapman Kolmogorov is also linear but deterministic. This last difference is one of the main reasons why the duality between optimal linear filtering and linear control was not generalized for the nonlinear 115 case. Recently (Todorov 2008) , the generalized duality was exploited when the backward Zakai equation in nonlinear smoothing is considered. 
Essentially the backward Zakai equation can be turned into a deterministic PDE and then a direct mapping between the two PDEs can be made in the same way how it is made between the backward and forward Riccati equations in linear control and filtering problems. 3.10 Conclusions In this chapter we investigated the connection between SDEs, linear PDEs and Path Integrals. My goal was to give an introduction to these mathematical concepts and their connections by keeping a balance between a pedagogical and intuitive presentation and a presentation that is characterized by rigor and mathematical precision. In the next chapter, the path integral formalism is applied to stochastic optimal control and reinforcement learning and the generalized path integral control is derived. More precisely, the backward Chapman Kolmogorov is formulated and the Feynman-Kac lemma is applied. Finally the path integral control is derived. Extensions of path integral control to iterative and risk sensitive control are presented. 3.11 Appendix We assume the stochastic differential equation dx = f(x,t)dt + B(x,t)dw . If the drift f(x,t) and diffusion term B(x,t) satisfy the condition:||f(y,t)|| 2 + ||B(y,t)|| 2 < K $ 1+max||y(s)|| 2 % then ! max 0<s<t ||x s || 2m # ≤C $ 1+ ! ||x o || 2m #% e Ct , ∀0≤t≤T. 116 H¨ older Continuity Definition: A function f(x):" n →" is H¨ older continuous is there isα> 0 such that |f(x)−f(x)|≤ |x−y| α . MonotoneConvergenceTheorem: iff n is a sequence of measurable function with 0≤f n ≤f n+1 , ∀n then lim n→∞ F f n dµ = F lim n→∞ f n dµ. DominatedConvergenceTheorem: Letf n the sequence of real value and measur- able functions. If the sequence convergence pointwise to the functionf and it is dominated by some integrable function g then lim n→∞ F f n dµ = F fdµ. A function is dominated by g if |f n (x)|<g(x). 117 Chapter 4 Path Integral Stochastic Optimal Control AfterdiscussingtheconnectionbetweenPDEs, SDEsandthePathIntegralsinthischap- ter we present the application of path integral formalism to stochastic optimal control and reinforcement learning problems. While reinforcement learning (RL) is among the mostgeneralframeworksoflearningcontroltocreatetrulyautonomouslearningsystems, its scalability to high-dimensional continuous state-action system, e.g., humanoid robots, remains problematic. Classical value-function based methods with function approxima- tion offer one possible approach, but function approximation under the non-stationary iterative learning process of the value-function remains difficult when one exceeds about 5-10 dimensions. Alternatively, direct policy learning from trajectory roll-outs has re- cently made significant progress (Peters 2007), but can still become numerically brit- tle and full of open tuning parameters in complex learning problems. In new develop- ments, RL researchers have started to combine the well-developed methods from statis- tical learning and empirical inference with classical RL approaches in order to minimize tuning parameters and numerical problems, such that ultimately more efficient algo- rithms can be developed that scale to significantly more complex learning system (Dayan 118 & Hinton 1997, Kober & Peters 2009, Peters & Schaal 2008a, Toussaint & Storkey 2006) and (Ghavamzadeh & Yaakov 2007, Deisenroth, Rasmussen & Peters 2009, Vlassis, Tou- ssaint, Kontes & S. 2009, Jetchev & Toussaint 2009). 
In the spirit of these latter ideas, in this chapter we derive the necessary mathematical background for the development of a new method of probabilistic reinforcement learning based on the framework of stochastic optimal control and path integrals. We start our analysis motivated by the original work of (Kappen 2007, Broek, Wiegerinck & Kappen 2008) and we extend the path integral control framework in new directions which include 1) stochastic dynamics with state-dependent control and diffusion matrices, 2) the iterative version of the proposed framework and 3) different integration schemes of stochastic calculus, which include but are not limited to the Itô and Stratonovich calculus. The present chapter is organized as follows: in section 4.1, we go through the first steps of path integral control, starting with the presentation of a general stochastic optimal control problem and the corresponding HJB equation. We continue with the transformation of the HJB equation into a linear and second order PDE, the so called backward Chapman Kolmogorov PDE. This transformation allows us to use the Feynman-Kac lemma from chapter 3 and to represent the solution of the backward Chapman Kolmogorov PDE as the expectation of the exponentiated state-dependent part of the cost function over all possible trajectories. In section 4.2 we derive the path integral formalism for stochastic dynamical systems in which the state is partitioned into directly actuated and non-directly actuated parts. There is a plethora of dynamical systems that have this property, such as rigid body and multi-body dynamics as well as the Dynamic Movement Primitives (DMPs). DMPs are nonlinear attractors with adjustable landscapes and they can be used to represent state space trajectories as well as control policies. We continue the discussion of DMPs and their application to robotic optimal control and planning in chapter 6. The derivation of the path integral for this type of system is based on the Itô calculus and it is presented step by step. In section 4.3 the generalized path integral control for the case of systems with state-dependent control transition and diffusion matrices is derived. The derivation is presented in detail in the appendix of the present chapter and it consists of 2 lemmas and 1 theorem. All the analysis in sections 4.2 and 4.3 follows the Itô calculus. To complete the presentation of the generalized path integral control, we present the derivation of the optimal controls in the Stratonovich calculus. Furthermore, we discuss the case in which the Stratonovich and the Itô calculus lead to the same result in terms of the final formula that provides the path integral optimal controls. With the goal of applying path integral control to high dimensional robotic control and planning problems, in section 4.6 we present the iterative version of path integral control and we discuss the convergence analysis of the proposed algorithm (PI^2). When the iterative path integral control approach is applied to DMPs, the resulting algorithm is the so called Policy Improvement with Path Integrals (PI^2). This algorithm is presented in great detail in chapter 6. Finally, in section 4.7 we discuss the risk sensitive version of path integral control. More precisely, we derive the condition under which the path integral control formalism can be applied to stochastic optimal control problems with risk sensitive cost functions. In the last section we discuss the main points of this chapter and conclude.
4.1 Path integral stochastic optimal control The goal in stochastic optimal control is to control a stochastic dynamical system while minimizingaperformancecriterion. Therefore,inmathematicaltermastochasticoptimal control problem can be formulated as follows: V(x) = min u J(x,u) = min u ! φ(x t N )+ " t N to L(x,u,t)dt # (4.1) subject to the stochastic dynamical constrains: dx=(f(x,t)+G(x,t)u)dt+B(x,t)Ldw (4.2) with x t ∈" n×1 denoting the state of the system, G t = G(x,t)∈" n×p the control matrix, B t = B(x,t)∈" n×p is the diffusions matrix f t = f(x,t)∈" n×1 the passive dynamics, u t ∈" p×1 the control vector and dw∈" p×1 brownian noise. L∈" p×p is a state independent matrix withΣ w =LL T . As immediate reward we consider L t =L(x t ,u t ,t)=q t + 1 2 u T t Ru t (4.3) where q t = q(x t ,t) is an arbitrary state-dependent cost function, and R is the positive definiteweightmatrixofthequadraticcontrolcost. ThestochasticHJBequation(Stengel 121 1994, Fleming & Soner 2006) associated with this stochastic optimal control problem is expressed as follows: −∂ t V t = min u $ L t +(∇ x V t ) T F t + 1 2 tr A (∇ xx V t )B t Σ w B T t B % (4.4) To find the minimum, the reward function (4.3) is inserted into (4.4) and the gradient of the expression inside the parenthesis is taken with respect to controls u and set to zero. The corresponding optimal control is given by the equation: u(x,t)=u t =−R −1 G(x) T (∇ x V(x,t)) (4.5) Substitution of the optimal control into the stochastic HJB (4.4) results in the follow- ing nonlinear and second order PDE: −∂ t V t =q t +(∇ x V t ) T f t − 1 2 (∇ x V t ) T G t R −1 G T t (∇ x V t )+ 1 2 tr A (∇ xx V t )B t Σ w B T t B (4.6) To transform the PDE above into a linear one, we use a exponential transformation of the value function V t =−λlogΨ t . Given this logarithmic transformation, the partial derivatives of the value function with respect to time and state are expressed as follows: ∂ t V t =−λ 1 Ψ t ∂ t Ψ t (4.7) ∇ x V t =−λ 1 Ψ t ∇ x Ψ t (4.8) ∇ xx V t = λ 1 Ψ 2 t ∇ x Ψ t ∇ x Ψ T t −λ 1 Ψ t ∇ xx Ψ t (4.9) 122 Inserting the logarithmic transformation and the derivatives of the value function we obtain: λ Ψ t ∂ t Ψ t =q t − λ Ψ t (∇ x Ψ t ) T f t − λ 2 2Ψ 2 t (∇ x Ψ t ) T G t R −1 G T t (∇ x Ψ t ) (4.10) + 1 2 tr(Γ) (4.11) where the termΓ is expressed as: Γ = $ λ 1 Ψ 2 t ∇ x Ψ t ∇ x Ψ T t −λ 1 Ψ t ∇ xx Ψ t % B t Σ w B T t (4.12) The tr ofΓ is therefore: Γ = λ 1 Ψ 2 tr A ∇ x Ψ T t B t Σ w B t ∇ x Ψ t B −λ 1 Ψ t tr A ∇ xx Ψ t B t Σ w B T t B (4.13) Comparing the underlined terms in (4.11) and (4.55), one can recognize that these terms will cancel under the assumption λG(x)R −1 G(x) T = B(x)Σ w B(x) T =Σ(x t )= Σ t . The resulting PDE is formulated as follows: −∂ t Ψ t =− 1 λ q t Ψ t +f T t (∇ x Ψ t )+ 1 2 tr((∇ xx Ψ t )Σ t ) (4.14) with boundary condition: Ψ t N =Ψ(x,t N ) = exp A − 1 λ φ(x t N ) B . The partial differential equation (PDE) in (4.14) corresponds to the so called Chapman Kolmogorov PDE, which is of second order and linear. Analytical solutions of even linear PDEs are plausible only in very special cases which correspond to systems with trivial low dimensional dynamics. 123 InthisworkwecomputethesolutionofthelinearPDEabovewiththeuseoftheFeynman - Kac lemma (Øksendal 2003). The Feynman- Kac lemma provides a connection between stochastic differential equations and PDEs and therefore its use is twofold. 
On one side it canbeusedtofindprobabilisticsolutionsofPDEsbasedonforwardsamplingofdiffusions whileontheothersideitcanbeusedfindsolutionofSDEsbasedondeterministicmethods that numerically solve PDEs. The solution of the PDE above can be found by evaluating the expectation: Ψ(x,t i )= ! e − ! t N t i 1 λ q(x)dt Ψ(x t N ) # τ i (4.15) on sample paths τ i =(x i ,...,x t N ) generated with the forward sampling of the diffusion equation dx = f(x,t)dt+B(x,t)Ldw. Under the use of the Feynman Kac lemma the stochastic optimal control problem has been transformed into an approximation problem of a path integral. With a view towards a discrete time approximation, which will be needed for numerical implementations, the solution (4.15) can be formulated as: Ψ(x,t i ) = lim dt→0 " p(τ i |x i )exp − 1 λ φ(x(t N ))+ N−1 > j=i q(x,t j )dt dτ i (4.16) where τ i =(x t i ,.....,x t N ) is a sample path (or trajectory piece) starting at state x t i and the term p(τ i |x i ) is the probability of sample path τ i conditioned on the start statex t i . Since equation (4.16) provides the exponential cost to goΨ t i in statex t i , the integration above is taken with respect to sample paths τ i = A x t i ,x t i+1 ,.....,x t N B . The differential term dτ i is defined as dτ i =(dx t i ,.....,dx t N ). After the exponentiated value function 124 Ψ(x,t) has been approximated, the optimal control are found according to the equation that follows: u(x,t)= λR −1 G(x) T ∇ x Ψ(x,t) Ψ(x,t) (4.17) Clearly optimal controls in the equation above act such that the stochastic dynamical system visits regions of the state space with high exponentiated values function Ψ(x,t) while in the optimal control formulation (4.5) controls will move the system towards part ofthestatespacewithminimumcost-to-goV(x,t). Thisobservationisincompleteagree- ment with the exponentiation of value function Ψ(x,t) = exp A − 1 λ V(x,t) B . Essentially, the resulting value functionΨ(x,t) can be thought as a probability of the state and thus states with high cost to go V(x,t) will be less probable(= smallΨ(x,t)) while state with small cost to go will be most probable. In that sense the stochastic optimal control has been transformed from a minimization to maximization optimization problem. Finally the intuition behind the condition λG(x,t)R −1 G(x,t) T = B(x,t)Σ w B(x,t) T is that, since the weight control matrix R is inverse proportional to the variance of the noise, a high variance control input implies cheap control cost, while small variance control inputs have high control cost. From a control theoretic stand point such a relationship makes sense due to the fact that under a large disturbance (= high variance) significant control authority is required to bring the system back to a desirable state. This control authority can be achieved with corresponding low control cost in R. With the goal to find theΨ(x,t) in equation (4.16), in the next section we derive the distribution p(τ i |x i ) based on the passive dynamics. This is a generalization of results in (Kappen 2007, Broek et al. 2008). 125 4.2 Generalized path integral formalism To develop our algorithms, we will need to consider a more general development of the pathintegralapproachtostochasticoptimalcontrolthanpresentedin(Kappen2007)and (Broek et al. 2008). In particular, we have to address that in many stochastic dynamical systems, the control transition matrixG t is state dependent and its structure depends on the partition of the state in directly and non-directly actuated parts. 
Since only some of the states are directly controlled, the state vector is partitioned into x=[x (m) T x (c) T ] T with x (m) ∈" k×1 the non-directly actuated part and x (c) ∈" l×1 the directly actuated part. Subsequently, the passive dynamics term and the control transition matrix can be partitioned as f t =[f (m) t T f (c) t T ] T with f m ∈" k×1 , f c ∈" l×1 and G t =[0 k×p G (c) t T ] T with G (c) t ∈" l×p . The discretized state space representation of such systems is given as: x t i+1 =x t i +f t i dt+G t i u t i dt+B t i dw t i , or, in partitioned vector form: x (m) t i+1 x (c) t i+1 = x (m) t i x (c) t i + f (m) t i f (c) t i dt+ 0 k×p G (c) t i u t i dt+ 0 k×p B (c) t i dw t i . (4.18) Essentially the stochastic dynamics are partitioned into controlled equations in which thestatex (c) t i+1 isdirectlyactuatedandtheuncontrolledequationsinwhichthestatex (m) t i+1 126 is not directly actuated. Since stochasticity is only added in the directly actuated terms (c) of (4.18), we can develop p(τ i |x i ) as follows. p(τ i |x t i )= p(τ i+1 |x t i ) = p A x t N ,.....,x t i+1 |x t i B = Π N−1 j=i p A x t j+1 |x t j B , where we exploited the fact that the start state x t i of a trajectory is given and does not contributetoitsprobability. Forsystemswherethecontrolhaslowerdimensionalitythan the state (4.18), the transition probabilities p A x t j+1 |x t j B are factorized as follows: p A x t j+1 |x t j B = p ? x (m) t j+1 |x t j @ p ? x (c) t j+1 |x t j @ = p ? x (m) t j+1 |x (m) t j ,x (c) t j @ p ? x (c) t j+1 |x (m) t j ,x (c) t j @ ∝ p ? x (c) t j+1 |x t j @ , (4.19) where we have used the fact that p ? x (m) t i+1 |x (m) t i ,x (c) t i @ is the Dirac delta function, since x (m) t j+1 can be computed deterministically from x (m) t j ,x (c) t j . For all practical purposes, 1 the transition probability of the stochastic dynamics is reduced to the transition probability of the directly actuated part of the state: p(τ i |x t i )=Π N−1 j=i p A x t j+1 |x t j B ∝Π N−1 j=i p ? x (c) t j+1 |x t j @ . (4.20) 1 The delta functions will all integrate to 1 in the path integral. 127 Sinceweassumethatthenoise!iszeromeanGaussiandistributedwithvarianceΣ w , whereΣ w = LL T ∈" l×l , the transition probability of the directly actuated part of the state is defined as: 2 p ? x (c) t j+1 |x t j @ = 1 A (2π) l ·|Σ t j | B 1/2 exp , − 1 2 M M Mx (c) t j+1 −x (c) t j −f (c) t j dt M M M 2 Σ −1 t j - , (4.21) where the covarianceΣ t j ∈" l×l is expressed asΣ t j =B (c) t j Σ w B (c) t j T dt. Combining (4.21) and (4.20) results in the probability of a path expressed as: p(τ i |x t i )∝ 1 Π N−1 j=i A (2π) l 4Σ t j | B 1/2 exp − 1 2 N−1 > j=1 M M Mx (c) t j+1 −x (c) t j −f (c) t j dt M M M 2 Σ −1 t j . Finally, we incorporate the assumption (4.56) about the relation between the control cost and the variance of the noise, which needs to be adjusted to the controlled space as Σ t j = B (c) t j Σ w B (c) t j T dt = λG (c) t j R −1 G (c) t j T dt = λH t j dt with H t j = G (c) t j R −1 G (c) t j T . Thus, we obtain: p(τ i |x t i )∝ 1 Π N−1 j=i A (2π) l |Σ t j | B 1/2 exp − 1 2λ N−1 > j=i M M M M M M x (c) t j+1 −x (c) t j dt −f (c) t j M M M M M M 2 H −1 t j dt . 2 For notational simplicity, we write weighted square norms (or Mahalanobis distances) as v T Mv = "v" 2 M . 
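The exponent of this path probability is easy to evaluate on sampled rollouts. The sketch below does so for an illustrative point mass with state (position, velocity) in which only the velocity is directly actuated; the model, the weight R and λ are placeholder choices, and the noise on the actuated state is sampled with covariance λ G^(c) R^{-1} G^(c)T dt so that the rollout is consistent with the assumption relating the control cost and the noise covariance.

```python
import numpy as np

# Illustrative sketch: exponent of the path probability p(tau_i | x_ti) for a
# sampled rollout of partitioned dynamics.  The model is a point mass with
# state (position, velocity); only the velocity is directly actuated.
dt, lam, N = 0.01, 0.1, 100
R = np.array([[1.0]])                        # control cost weight (p = 1)
def f_c(x):                                  # directly actuated drift f^(c)(x)
    return np.array([-0.5 * x[1]])           # viscous damping on the velocity
def G_c(x):                                  # control transition G^(c)(x), l x p
    return np.array([[1.0]])

def path_exponent(X):
    """(1/(2 lam)) sum_j ||(x^(c)_{j+1} - x^(c)_j)/dt - f^(c)_j||^2_{H^{-1}} dt."""
    total = 0.0
    for j in range(X.shape[0] - 1):
        x = X[j]
        H = G_c(x) @ np.linalg.inv(R) @ G_c(x).T          # H_{t_j}
        resid = (X[j + 1, 1:] - X[j, 1:]) / dt - f_c(x)
        total += 0.5 / lam * float(resid @ np.linalg.inv(H) @ resid) * dt
    return total

# Roll out the passive (u = 0) dynamics once.  The noise on the actuated state
# has covariance lam * G^(c) R^{-1} G^(c)T dt.
rng = np.random.default_rng(1)
X = np.zeros((N + 1, 2))
X[0] = [0.0, 1.0]
for j in range(N):
    pos, vel = X[j]
    dw = np.sqrt(lam * dt) * rng.standard_normal(1)
    X[j + 1, 0] = pos + vel * dt                                  # x^(m): deterministic row
    X[j + 1, 1] = vel + f_c(X[j])[0] * dt + (G_c(X[j]) @ dw)[0]   # x^(c): stochastic row
print(path_exponent(X))
```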
128 With this formulation of the probability of a trajectory, we can rewrite the the path integral (4.16) as: Ψ t i = = lim dt→0 " exp − 1 λ φ t N + C N−1 j=i q t j dt+ 1 2 C N−1 j=i M M M M M x (c) t j+1 −x (c) t j dt −f (c) t j M M M M M 2 H −1 t j dt Π N−1 j=i A (2π) l/2 |Σ t j | 1/2 B dτ (c) i (4.22) Or in a more compact form: Ψ t i = lim dt→0 " 1 D(τ i ) exp $ − 1 λ S(τ i ) % dτ (c) i , (4.23) where, we defined S(τ i )= φ t N + N−1 > j=i q t j dt+ 1 2 N−1 > j=i M M M M M M x (c) t j+1 −x (c) t j dt −f (c) t j M M M M M M 2 H −1 t j dt, and D(τ i )= N−1 G j=i ? (2π) l/2 |Σ t j | 1/2 @ . 129 Note that the integration is over dτ (c) i = ? dx (c) t i ,.....,dx (c) t N @ , as the non-directly ac- tuated states can be integrated out due to the fact that the state transition of the non- directly actuated states is deterministic, and just added Dirac delta functions in the integral (cf. Equation (4.19)). Equation (4.23) is written in a more compact form as: Ψ t i = lim dt→0 " exp $ − 1 λ S(τ i )−logD(τ i ) % dτ (c) i = lim dt→0 " exp $ − 1 λ Z(τ i ) % dτ (c) i , (4.24) where Z(τ i )=S(τ i )+λlogD(τ i ). It can be shown (see appendix) that this term is factorized in path dependent and path independent terms of the form: Z(τ i )= ˜ S(τ i )+ λ(N−i)l 2 log(2πdtλ), where ˜ S(τ i )=S(τ i )+ λ 2 N−1 > j=i log|B t j | (4.25) with B = B (c) t j B (c) t j T . This formula is a required step for the derivation of optimal controls in the next section. The constant term λ(N−i)l 2 log(2πdtλ) can be the source of numerical instabilities especially in cases where fine discretization dt of stochastic dynamics is required. However, in the next section, and in a great detail in Appendix A, lemma 1, we show how this term drops out of the equations. 130 4.3 Path integral optimal controls For every moment of time, the optimal controls are given as u t i = −R −1 G T t i (∇ xt i V t i ). Due to the exponential transformation of the value function, the equation of the optimal controls can be written as u t i = λR −1 G t i ∇ xt i Ψ t i Ψ t i . After substitutingΨ t i with (4.24) and canceling the state independent terms of the cost we have: u t i = lim dt→0 λR −1 G T t i ∇ x (c) t i ? F e − 1 λ ˜ S(τ i ) dτ (c) i @ F e − 1 λ ˜ S(τ i ) dτ (c) i , Furtheranalysisoftheequationaboveleadstoasimplifiedversionfortheoptimalcontrols as u t i dt = F P (τ i )u L (τ i )dτ (c) i , (4.26) with the probability P (τ i ) and local controls u L (τ i ) defined as P (τ i )= e − 1 λ ˜ S(τ i ) ! e − 1 λ ˜ S(τ i ) dτ i (4.27) The local control can now be expressed as: u L (τ i )=R −1 G (c) t i T H −1 t i G (c) t i dw t i , 131 By substituting H t i = G (c) t i R −1 G (c) t i T in the equation above, we get our main result for the local controls of the sampled path for the generalized path integral formulation: u L (τ i )=R −1 G (c) t i T ? G (c) t i R −1 G (c) t i T @ −1 G (c) t i dw t i . (4.28) Given the local control above the optimal control in (4.26) are now expressed by the equation that follows: u t i dt = F P (τ i )R −1 G (c) t i T ? G (c) t i R −1 G (c) t i T @ −1 G (c) t i dw t i dτ (c) i , (4.29) The equations in boxes (4.29) and (4.27) the solution for the generalized path integral stochastic optimal control problem. The numerical evaluation of the integral above is expressed by the equation u(τ i )dt = #Paths > k=1 ˜ p (k) (τ i )R −1 G (c) t i T ? G (c) t i R −1 G (c) t i T @ −1 ? G (c) t i dw (k) t i @ (4.30) The equation above can also be written in the form: u(τ i )dt =R −1 G (c) t i T ? 
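The following sketch illustrates (4.25) for a scalar directly actuated state with placeholder cost and dynamics: it accumulates the state cost, the quadratic path term and the (λ/2) log|B_tj| correction over a batch of sampled rollouts, and then forms the normalized weights exp(−˜S(τ_i)/λ) that reappear in the next section. Subtracting the minimum cost before exponentiating keeps these weights numerically well behaved; the path-independent constant term cancels in the normalization in any case.

```python
import numpy as np

# Minimal sketch of the generalized trajectory cost (4.25) and the resulting
# path weights for K sampled rollouts of a scalar directly actuated state.
# Cost terms, drift, control transition and all constants are placeholders.
dt, lam, K, N = 0.01, 0.1, 64, 100
R_inv = 1.0                                   # R^{-1} (scalar control, p = 1)
q    = lambda x: 0.5 * x ** 2                 # state cost q(x)
phi  = lambda x: 10.0 * x ** 2                # terminal cost phi(x_tN)
f_c  = lambda x: -x                           # directly actuated drift
g_c  = lambda x: 1.0 + 0.1 * np.cos(x)        # control transition (l = p = 1)

rng = np.random.default_rng(2)
S_tilde = np.zeros(K)
for k in range(K):
    x, S = 0.5, 0.0
    for _ in range(N):
        h  = g_c(x) * R_inv * g_c(x)                    # H_{t_j}
        dw = np.sqrt(lam * dt) * rng.standard_normal()  # noise with variance lam*dt
        x_next = x + f_c(x) * dt + g_c(x) * dw
        S += q(x) * dt                                           # state cost
        S += 0.5 * ((x_next - x) / dt - f_c(x)) ** 2 / h * dt    # quadratic path term
        S += 0.5 * lam * np.log(g_c(x) ** 2)                     # (lam/2) log|B_tj|, B = g_c g_c^T
        x  = x_next
    S_tilde[k] = S + phi(x)

# exp(-S_tilde/lam), normalized over rollouts; the minimum cost is subtracted
# before exponentiating to keep the weights well scaled.
w = np.exp(-(S_tilde - S_tilde.min()) / lam)
P = w / w.sum()
print(P.max(), P.min())
```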
G (c) t i R −1 G (c) t i T @ −1 #Paths > k=1 ˜ p (k) (τ i ) ? G (c) t i dw (k) t i @ (4.31) 132 • Given: – The system dynamics x t i+1 =x t i +(f t i +G t u t )dt+B t i dw t i (cf. 4.2) – The immediate costL t =q t + 1 2 u T t Ru t (cf. 4.3) – A terminal cost term φ t N – Trajectory starting at t i and ending at t N : τ i =(x t i ,.....,x t N ) – Apartitioningofthesystemdynamicsinto(c)controlledand(m)uncontrolled equations, where n = c+m is the dimensionality of the state x t (cf. Section 4.2) • Optimal Controls: – Optimal controls at every time step t i : u t i dt = F P (τ i )u L (τ i )dτ (c) i – Probability of a trajectory: P (τ i )= e − 1 λ ˜ S(τ i ) ! e − 1 λ ˜ S(τ i ) dτ i – Generalized trajectory cost: ˜ S(τ i )=S(τ i )+ λ 2 C N−1 j=i log|B t j | where ∗ S(τ i )= φ t N + C N−1 j=i q t j dt+ 1 2 C N−1 j=i M M M M M x (c) t j+1 −x (c) t j dt −f (c) t j M M M M M 2 H −1 t j dt ∗ H t j =G (c) t j R −1 G (c) t j T andB =B (c) t j B (c) t j T – Local Controls: u L (τ i )=R −1 G (c) t i T ? G (c) t i R −1 G (c) t i T @ −1 ? G (c) t i dw t i @ . Table 4.1: Summary of optimal control derived from the path integral formalizm. Given that this result is of general value and constitutes the foundation to derive our reinforcement learning algorithm in the next section, but also since many other special cases can be derived from it, we summarized all relevant equations in Table 4.1. The Given components of Table 4.1 include a model of the system dynamics, the cost function, knowledge of the system’s noise process, and a mechanism to generate trajectories τ i . It is important to realize that this is a model-based approach, as the computations of the optimal controls requires knowledge of! i . ! i can be obtained in two ways. First, the trajectories τ i can be generated purely in simulation, where the noise 133 is generated from a random number generator. Second, trajectories could be generated by a real system, and the noise ! i would be computed from the difference between the actual and the predicted system behavior, that is,G (c) t i ! i = ˙ x t i − ˆ ˙ x t i = ˙ x t i −(f t i +G t i u t i ). Computing the prediction ˆ ˙ x t i also requires a model of the system dynamics. In the next section we show how our generalized formulation is specialized to different classesofstochasticdynamicalsystemsandweprovidethecorrespondingformulaoflocal controls for each class. 4.4 Path integral control for special classes of dynamical systems The purpose of this section is twofold. First, it demonstrates how to apply the path integral approach to specialized forms of dynamical systems, and how the local controls in (4.28) simplify for these cases. Second, this section prepares the special case which we will need for our reinforcement learning algorithm in presented in the next chapter. The generalized formulation of stochastic optimal control with path integrals in Table 4.1 can be applied to a variety of stochastic dynamical systems with different types of control transition matrices. One case of particular interest is where the dimensionality of the directly actuated part of the state is 1D, while the dimensionality of the control vector is 1D or higher dimensional. As will be seen below, this situation arises when the controls are generated by a linearly parameterized function approximator. The control 134 transition matrix thus becomes a row vector G (c) t i = g (c)T t i ∈" 1×p . 
According to (4.28), the local controls for such systems are expressed as follows: u L (τ i )= R −1 g (c) t i g (c)T t i R −1 g (c) t i ? g (c)T t i dw t i @ . Since the directly actuated part of the state is 1D, the vector x (c) t i collapses into the scalar x (c) t i which appears in the partial differentiation above. In the case that g (c) t i does not depend on x (c) t i , the differentiation with respect to x (c) t i results to zero and the the local controls simplify to: u L (τ i )= R −1 g (c) t i g (c)T t i g (c)T t i R −1 g (c) t i dw t i The generalized formula of the local controls (4.28) was derived for the case where the control transition matrix is state dependent and its dimensionality is G (c) t ∈" l×p with l < n and p the dimensionality of the control. There are many special cases of stochastic dynamical systems in optimal control and robotic applications that belong into this general class. More precisely, for systems having a state dependent control transition matrix that is square (G (c) t i ∈" l×l with l = p) the local controls based on (4.28) are reformulated as: u L (τ i )=dw t i . (4.32) Interestingly, a rather general class of mechanical systems such as rigid-body and multi-body dynamics falls into this category. When these mechanical systems are ex- pressed in state space formulation, the control transition matrix is equal to rigid body 135 inertia matrix G (c) t i =M(θ t i ) (Sciavicco & Siciliano 2000). Future work will address this special topic of path integral control for multi-body dynamics. Another special case of systems with partially actuated state is when the control transition matrix is state independent and has dimensionality G (c) t = G (c) ∈" l×p . The local controls, according to (4.28), become: u L (τ i )=R −1 G (c) T ? G (c) R −1 G (c) T @ −1 G (c) dw t i . (4.33) If G (c) t i is square and state independent, G (c) t i =G (c) ∈" l×l , we will have: u L (τ i )=dw t i . (4.34) This special case was explored in (Kappen 2005a), (Kappen 2007), (Kappen 2005b) and (Broek et al. 2008). Our generalized formulation allows a broader application of path integral control in areas like robotics and other control systems, where the control transition matrix is typically partitioned into directly and non-directly actuated states, and typically also state dependent. 4.5 Itˆ oversusStratonovichpathintegralstochasticoptimal control The derivation of the Path Integral for the systems with partitioned state into directly and no directly actuated parts was performed based on the Itˆ o stochastic calculus. In 136 this section we derive the path integral control for the case of Stratonovich stochastic calculus. We consider the dynamics: dx =f(x,t)dt+g(x)(udt+dw) (4.35) We follow the same argument required to apply the path integral control framework and we come up with the path integral formulation expressed according to Stratonovich caclulus. For a general integration scheme we have shown that the path integral takes the form: P $ x N ,t N |x 0 ,t 0 % = " N−1 G i=1 $ dx i √ 2πδtB i % ×exp $ − N > i=1 . 1 2B 2 i $ δx δt −f i +α(∂ x B i )B i % 2 +β (∂ x f i ) / δt % For the Stratonovich calculus we can chose α = 1 2 and β = 1 2 and we will have the path integral: P $ x N ,t N |x 0 ,t 0 % = " N−1 G i=1 $ dx i √ 2πδtB i % exp $ − N > i=1 . 1 2B 2 i $ δx δt −f i + 1 2 (∂ x B i )B i % 2 / δt % ×exp $ − 1 2 N > i=1 (∂ x f i )δt % we can write the equation above in the form: 137 P $ x N ,t N |x 0 ,t 0 % = " N−1 G i=1 $ dx i √ 2πδtB i % exp $ − N > i=1 . 
1 2B 2 i $ δx δt − ˜ f i % 2 / δt % ×exp $ − 1 2 N > i=1 (∂ x f i )δt % where ˜ f i = f i − 1 2 (∂ x B i )B i . The derivation of the optimal control for the scalar case follows the same steps as in appendix but with the difference of using ˜ f i instead of f i and the additional term C N i=1 (∂ x f i )δt . It can be shown that the optimal control is now formulated as : u(x t i )= " p(τ i )u L dτ i = " p(τ i ) ? ˙ x− ˜ f(x) @ dτ i In the next section we discuss the iterative version of path integral control. 4.6 Iterative path integral stochastic optimal control Inthissection,weshowhowPathIntegralControlistransformedintoaniterativeprocess, which has several advantages for use on a real robot. The analysis that follows holds for any stochastic dynamical systems that belongs to the class of systems expressed by (4.2). WhentheiterativepathintegralcontrolisappliedtoDynamicMovementPrimitivesthen the resulting algorithm is the so called Policy Improvement with Path Integrals (PI 2 ). 138 However, we will leave the discussion for PI 2 for the next chapter and in this section we present the general version of iterative path integral control. In particular, we start by looking into the expectation (4.15) in the Feynman Kac Lemma that is evaluated over the trajectories τ i = A x t i ,x t i+1 ,.....,x t N B sampled with the forward propagation of uncontrolled diffusion dx = f(x,t)dt +B(x,t)Ldw. This samplingapproach is inefficient since it is very likely that partsof thestate space relevant to the optimal control task may not be reached by the sampled trajectories at once. In addition, it has poor scalability properties when applied to high dimensional robotic optimal control problems. Besides the reason of poor sampling, it is very common in robotics applications to have an initial controller-policy which is manually tuned and found based on experience. In such cases, the goal is to improve this initial policy by performing an iterative process. At every iteration (i) the policy δu (i−1) is applied to the dynamical system to generate state space trajectories which are going to be used for improving the current policy. The policy improvement results from the evaluation of the expectation (4.16) of the Feynman - Kac Lemma on the sampled trajectories and the use of the path integral control formalism to find δu (i) . The old policy δu (i−1) is updated according to δu (i−1) + δu (i) and the process repeats again with the generation of the new state space trajectories according to the updated policy. In mathematical terms the iterative version of Path Integral Control is expressed as follows: V (i) (x) = min δu (i) J(x,u) = min δu (i) !" t N to ? q(x,t)+δu (i)T R δu (i) @ dt # (4.36) 139 subject to the stochastic dynamical constrains: dx = ? f (i) (x)+G(x)δu (i) @ dt+B(x)Ldw (4.37) wheref (i) (x t )=f (i−1) (x t )+G(x)δu (i−1) where δu (i−1) is the control correction found in the previous iteration. The linear HJB equation is now formulated as: −∂ t Ψ (i) t =− 1 λ q t Ψ (i) t +f (i) t T (∇ x Ψ (i) t )+ 1 2 tr ? (∇ xx Ψ (i) t )Σ @ (4.38) The solution of PDE above is given by Ψ (i) (x t )= ! e − ! t N t i 1 λ q(x)dt Ψ(x t N ) # τ (i) (4.39) where τ (i) =(x t ,.....,x t N ) are sampled trajectories generated by the diffusion: dx = f(x,t)dt+B(x,t)Ldw. The optimal control at iteration (i) is expressed as: δu (i) = λR −1 G(x) T ∇ x Ψ (i) (x,t) Ψ (i) (x,t) (4.40) and it is applied to the dynamics f (i) (x t ). 
The application of the new control results in updating the previous control δu (i−1) and creating the new dynamics f (i+1) (x)= f (i) (x)+G(x)δu (i) =f (i−1) (x)+G(x) A δu (i) +δu (i−1) B . At the next iteration (i+1) of 140 the iterative path integral control, the corresponding exponentiated value functionΨ (i+1) is given by the following PDE: −∂ t Ψ (i+1) t =− 1 λ q t Ψ (i+1) t +f (i+1) t T (∇ x Ψ (i+1) t + 1 2 tr ? (∇ xx Ψ (i+1) t )Σ @ (4.41) The solution of the PDE is now expressed as: Ψ (i+1) (x t )= ! e − ! t N t i 1 λ q(x)dt Ψ(x t N ) # τ (i+1) (4.42) where τ (i+1) =(x t ,.....,x t N ) are sampled trajectories generated by the diffusion: dx = f (i+1) (x t )dt+B(x)dω. Our ultimate goal for the iterative path integral control is to find the sufficient con- ditions so that at every iteration the value function improves V (i+1) (x,t)<V (i) (x,t) < .... < V (0) (x,t). Since in the path integral control formalism we make use of the trans- formation Ψ(x,t) = exp A − 1 λ V(x,t) B it suffices to show that Ψ (i+1) (x,t) > Ψ (i) (x,t) > .... > Ψ (0) (x,t). If the last condition is true then at every (i) iteration the stochastic dynamical system visits to regions of state space with more and more probable states( = stateswithhighΨ(x,t)). ThesestatescorrespondtosmallvaluefunctionV(x,t). Tofind the condition under which the above is true, we proceed with the analysis that follows. Since we know that f (i+1) (x)= f (i) (x)+G(x)δu (i) we substitute in (4.41) and we will have that: 141 −∂ t Ψ (i+1) t =− 1 λ q t Ψ (i+1) t +f (i) t T (∇ x Ψ (i+1) t )+ 1 2 tr ? (∇ xx Ψ (i+1) t )Σ @ (4.43) +δu (i)T G T (∇ x Ψ (i+1) t ) substitution of δu results in: −∂ t Ψ (i+1) t = − 1 λ q t Ψ (i+1) t +f (i) t T (∇ x Ψ (i+1) t )+ 1 2 tr ? (∇ xx Ψ (i+1) t )Σ @ (4.44) + λ Ψ (i) t (∇ x Ψ (i) t ) T GRG T (∇ x Ψ (i+1) t ) or in a more compact form: −∂ t Ψ (i+1) t = − 1 λ q t Ψ (i+1) t +f (i) t T (∇ x Ψ (i+1) t )+ 1 2 tr ? (∇ xx Ψ (i+1) t )Σ @ +F(x,t) where F(x,t)= λ Ψ (i) (x,t) ∇ x Ψ (i) (x,t) T GRG T ∇ x Ψ (i+1) (x,t) (4.45) correspond to a force term which is the inner product of the gradients of the value functions at iterations (i) and (i + 1) under the metricM = λ Ψ (i) (x,t) GRG T . Clearly M > 0 since the matrix product GRG T > 0 is positive definite andλ > 0,Ψ(x,t) > 0. Comparing the two PDEs at iteration (i) and (i+1) and by using the differential linear operatorA (i) =− 1 λ q t +f (i) t T ∇ x + 1 2 tr(Σ∇ xx ) we have: 142 −∂ t Ψ (i+1) t =A (i) Ψ (i+1) t +F(x,t) −∂ t Ψ (i) t =A (i) Ψ (i) t (4.46) under the terminal condition Ψ (i) t N = exp A − 1 λ φ(x t N ) B and Ψ (i+1) t N = exp A − 1 λ φ(x t N ) B . In the next two subsection we study the two PDEs above, with the goal to find the connection betweenΨ (i) andΨ (i+1) . 4.6.1 Iterative path integral Control with equal boundary conditions In this section we will simplify our analysis and we will assume that over the itera- tions i the boundary conditions of the corresponding PDEs are the same thus Ψ (i) t N = exp A − 1 λ φ(x t N ) B , ∀i. Our analysis is fairly intuitive. We claim that Ψ (i+1) < Ψ (i) if F(x,t)> 0∀x,t. To see this result we rewrite equation (4.60) in the following form: −∂ t Ψ (i+1) t = − 1 λ q t Ψ (i+1) t +f (i) t T (∇ x Ψ (i+1) t )+ 1 2 tr ? (∇ xx Ψ (i+1) t )Σ @ + 1 λ δu (i)T R δu (i+1)T Ψ (i+1) t (4.47) where we used the fact that δu (i+1) = λR −1 G(x) T ∇xΨ (i) (x,t) Ψ (i+1) (x,t) or in a more compact form: −∂ t Ψ (i+1) t =− 1 λ ˜ q t Ψ (i+1) t +f (i) t T (∇ x Ψ (i+1) t )+ 1 2 tr ? 
(∇ xx Ψ (i+1) t )Σ @ (4.48) 143 wheretheterm ˜ q=˜ q(x,t,δu (i) ,δu (i+1) )isdefinedas ˜ q =q(x,t)− 1 λ δu (i)T R δu (i+1)T . To find the relation betweenΨ (i) (x,t) andΨ (i+1) (x,t) we first transform the PDE above into a forward PDE and then we follow some intuitive arguments. More precisely we assume the transformationΨ(x,t) =Φ(x,T−t) =Φ(τ). Thus we will have that: ∂ t Ψ t =−∂ τ Φ τ (4.49) The PDE at iteration (i) takes now the form: ∂ τ Φ (i) τ =− 1 λ q(x,T−τ)Φ (i) τ +f (i) τ T (∇ x Φ (i) τ )+ 1 2 tr ? (∇ xx Φ (i) τ )Σ @ (4.50) with the initial conditionΦ(x,0) = exp(− 1 λ φ(t N )). At iteration (i+1) we will have: ∂ τ Φ (i+1) τ =− 1 λ ˜ q(x,T−τ)Φ (i+1) τ +f (i) τ T (∇ x Φ (i+1) τ )+ 1 2 tr ? (∇ xx Φ (i+1) τ )Σ @ (4.51) under the initial conditionΦ(x,0) = exp(− 1 λ φ(t N )). Clearly there are 3 cases depending on the sign of F(x,t) and therefore the sign of 1 λ δu (i)T R δu (i+1)T . More precisely we will have that: • IfF(x,T−t)> 0⇒ δu (i)T R δu (i+1)T > 0. Bycomparing(4.50)with(4.51)wesee that state cost ˜ q subtracted fromΦ (i+1) is smaller than the state cost q subtracted fromΦ (i) andthereforeΦ (i+1) (x,T−t)>Φ (i) (x,T−t)=⇒Ψ (i+1) (x,t)>Ψ (i) (x,t). 144 • IfF(x,T −t)=0⇒ δu (i)T R δu (i+1)T = 0 the two PDEs (4.50) and (4.51) are identical. Therefore under the same boundary condition Φ (i+1) (x,0) = Φ (i) (x,0) we will have thatΦ (i+1) (x,T−t)=Φ (i) (x,T−t)=⇒Ψ (i+1) (x,t) =Ψ (i) (x,t). • IfF(x,T−t)< 0⇒ δu (i)T R δu (i+1)T > 0. Bycomparing(4.50)with(4.51)wesee that state cost ˜ q subtracted fromΦ (i+1) is smaller than the state cost q subtracted fromΦ (i) andthereforeΦ (i+1) (x,T−t)<Φ (i) (x,T−t)=⇒Ψ (i+1) (x,t)<Ψ (i) (x,t). 4.6.2 Iterativepathintegralcontrolwithnotequalboundaryconditions In this section we deal with the more general case in which the boundary conditions for the PDEs in (4.46) are not necessarily equal. To study the relation betweenΨ (i+1) and Ψ (i) we define the functionΔΨ (i+1,i) =Ψ (i+1) −Ψ (i) . Since the two PDEs in (4.46) are linear we can subtract the PDE inΨ i from the PDE inΨ i+1 and we will have: −∂ t ΔΨ (i+1,i) t =A (i) ΔΨ (i+1,i) +F(x,t) (4.52) Now we apply the generalized version of the Feynman-Kac lemma and we represent the solution of the PDE above in a probabilistic manner. More precisely we will have: ΔΨ (i+1,i) (x,t)= ! ΔΨ (i+1,i) (x,t N )exp $ − 1 λ " T t q(x s ,s)ds %# + !" T t F(x θ ,θ)exp $ − 1 λ " T t q(x,s)ds % dθ # We identify 3 cases: 145 • Clearly in case ΔΨ (i+1,i) (x,t N ) = 0 then, if F(x θ ,θ) > 0 ⇒ ΔΨ (i+1,i) (x,t) > 0⇒Ψ (i+1) (x,t)>Ψ (i) (x,t). This case was discussed in the previous subsection in which we came to the same conclusion thatF(x θ ,θ) > 0 by using more intuitive arguments. • IfΔΨ (i+1,i) (x,t N ) < 0 then the conditions, forΨ (i+1) (x,t) >Ψ (i) (x,t) to be true, are given as follows: − ! ΔΨ (i+1,i) (x,t N )exp $ − 1 λ " T t q(x s ,s)ds %# < !" T t F(x θ ,θ)exp $ − 1 λ " T t q(x,s)ds % dθ # Theconditionaboveresultsin: ! F T t F(x θ ,θ)exp ? − 1 λ F T t q(x,s)ds @ dθ # > 0which is a necessary but not sufficient condition. • IfΔΨ (i+1,i) (x,t N )> 0thenthecondition ! F T t F(x θ ,θ)exp ? − 1 λ F T t q(x,s)ds @ dθ # > 0 becomes the sufficient condition such thatΨ (i+1) (x,t)>Ψ (i) (x,t). 4.7 Risk sensitive path integral control To arrive in the Path integral control formalism for the risk sensitive setting we make use of (2.117) and (2.118) and the transformation V(x,t)= −λlogΨ(x,t). More precisely we will have that: 146 λ Ψ t ∂ t Ψ =q− λ Ψ (∇ x Ψ) T f− λ 2 2Ψ 2 (∇ x Ψ) T M(x)(∇ x Ψ)+ $ 2γ tr ? 
˜ Γ @ (4.53) where the term ˜ Γ is expressed as: ˜ Γ(x)= λ 1 Ψ 2 ∇ x Ψ∇ x Ψ T ˜ CΣ ! ˜ C T −λ 1 Ψ ∇ xx Ψ ˜ CΣ ! ˜ C T (4.54) The tr of ˜ Γ is therefore: ˜ Γ(x)= λ 1 Ψ 2 tr ? ∇ x Ψ T ˜ CΣ ! ˜ C∇ x Ψ @ −λ 1 Ψ tr ? ∇ xx Ψ ˜ CΣ ! ˜ C T @ (4.55) Comparing the underlined terms in (4.11) and (4.55), one can recognize that these terms will cancel under the assumption: λM(x)= $ γ ˜ C(x)Σ ! ˜ C(x) T =Σ(x t )=Σ t (4.56) which results in: λ $ G(x)R −1 G(x) T − 1 γ ˜ C(x) ˜ C(x) T % = $ γ ˜ C(x)Σ ! ˜ C(x) T (4.57) Again since ( γ ˜ C(x)Σ ! ˜ C(x) T is positive definite∀$,γ > 0, we will have that: λ $ G(x)R −1 G(x) T − 1 γ ˜ C(x) ˜ C(x) T % > 0 (4.58) 147 The previous equation can be written in the form: λG(x)R −1 G(x) T = λ+$ γ ˜ C(x)Σ ! ˜ C(x) T (4.59) With this simplification, (4.11) reduces to the following form: −∂ t Ψ =− 1 λ qΨ t +f T (∇ x Ψ)+ $ 2γ tr ? (∇ xx Ψ) ˜ CΣ ! ˜ C T @ (4.60) withboundarycondition: Ψ t N = exp A − 1 λ φ t N B . Theanalysissofarresultsinthefollowing theorem. Theorem: The exponentiated value function Ψ(x,t) = exp A − 1 λ V(x,t) B of the risk sensitive stochastic optimal control problem defined by (2.103),(2.104) is given by the linear and second order PDE: −∂ t Ψ =− 1 λ qΨ t +f T (∇ x Ψ)+ $ 2γ tr ? (∇ xx Ψ) ˜ CΣ ! ˜ C T @ with terminal condition Ψ t N = exp A − 1 λ φ t N B iff the following assumption holds λG(x)R −1 G(x) T = λ+$ γ ˜ C(x)Σ ! ˜ C(x) T where the parameters $,γ,λ> 0 and Σ ! =LL T . Clearly, a quick inspection of (4.41) and (4.60) leads to the conclusion that the PDEs are identical if ˜ C(x)Σ ! ˜ C(x) T = C(x)Σ ! C(x) T . Given the last assumption and (4.59) the stochastic differential game formulated by (2.120), (2.94) and stochastic risk sensitive 148 optimal control problem given by (2.103),(2.104) are equivalent. Essentially, condition (4.59) guarantees that the equivalence between differential games and risk sensitivity in optimal control carries over inside the path integral control formalism. The theorem that follows is the synopsis of our analysis and it is central in this work since it establishes the connection between risk sensitive control and differential game theory under the path integral control formalism. More precisely: Theorem: Consider the stochastic differential game expressed by (2.120) and (2.94) and the risk sensitive stochastic optimal control problem defined by (2.103) and (2.104). These optimal control problems are equivalent under the path integral formalism. Their common optimal control solution is expressed by: u ∗ (x)= λR −1 G T ∇ x Ψ Ψ (4.61) where −∂ t Ψ =− 1 λ qΨ t +f T (∇ x Ψ)+ $ 2γ tr ? (∇ xx Ψ) ˜ CΣ ! ˜ C T @ with boundary condition Ψ t N = exp A − 1 λ φ t N B , iff the following conditions hold i) λG(x)R −1 G(x) T = λ+( γ ˜ C(x)Σ ! ˜ C(x) T and ii) ˜ C(x)Σ ! ˜ C(x) T =C(x)Σ ! C(x) T with the parameters γ,λ> 0 and Σ ! defined as Σ ! =LL T . 149 4.8 Appendix This section contains the derivation for the factorization of the cost function Z(τ i ), into pathdependentandpathindependentterms,thelemmasL1andL2andonetheoremT1. The theorem provides the main result of the generalized path integral control formalism expressed by (4.26), (4.27), (4.28). Its proof is based on results proven in the lemmasL1 and L2. Derivation of the factorization of Z(τ i ). We start our derivation from the equation 4.24. Our goal is to factorize the following quantity into path dependent and path independent terms. More precisely we have: Z(τ i )=S(τ i )+λlogD(τ i ) (4.62) D(τ i )= N−1 G j=i ? (2π) l/2 |Σ t j | 1/2 @ . 
SinceΣ t j =B (c) t j Σ w B (c) t j T dt = λG (c) t j R −1 G (c) t j T dt = λH t j dtwithH t j =G (c) t j R −1 G (c) t j T Z(τ i )=S(τ i )+λlog N−1 G j=i (2π) n/2 |Σ(x t i )| 1/2 =S(τ i )+λ N−1 > j=i log $ (2π) n/2 |B(x,t j )Σ w B(x,t j ) T dt| 1/2 % =S(τ i )+λ N−1 > j=i log $ (2π) n/2 |B(x,t j )Σ w B(x,t j ) T dt| 1/2 % 150 =S(τ i )+λ N−1 > j=i log $ (2π) n/2 |B(x,t j )Σ w B(x,t j ) T dt| 1/2 % =S(τ i )+λ N−1 > j=i log $ |2πB(x,t j )Σ w B(x,t j ) T dt| 1/2 % =S(τ i )+ λ 2 N−1 > j=i trlog $ 2πB(x,t j )Σ w B(x,t j ) T dt % Here we will assume just for simplicity thatΣ w = σ 2 w I n×n . Z(τ i )=S(τ i )+ λ 2 N−1 > j=i tr . log $ 2πσ 2 w I n×n dt % +log $ B(x,t j )B(x,t j ) T %/ =S(τ i )+ λ 2 N−1 > j=i tr . log $ 2πσ 2 w I n×n dt % +log $ B(x,t j )B(x,t j ) T %/ =S(τ i )+ λ 2 N−1 > j=i . nlog A 2πσ 2 w dt B +trlog $ B(x,t j )B(x,t j ) T %/ =S(τ i )+ λNn 2 log A n2πσ 2 w dt B + λ 2 N−1 > j=i trlog $ B(x,t j )B(x,t j ) T % =S(τ i )+ λ 2 N−1 > j=i log|B(x,t j )B(x,t j ) T |+ λ(N−i)n 2 log A 2πσ 2 w dt B Finally the full cost to go is: Z(τ i )= ˜ S(τ i )+ λNn 2 log A 2πσ 2 w dt B where 151 ˜ S(τ i )=S(τ i )+ λ 2 N−1 > i=0 log|B(x,t j )| whereB(x,t j )=B(x,t j )B(x,t j ) T and S(τ i )= φ t N + N−1 > j=i , q t j + 1 2 M M M M M M x (c) t j+1 −x (c) t j dt −f (c) t j M M M M M M 2 Ht j - dt In cases whereΣ w )= σ 2 w I n×n the results are the same with the equations besides the termB(x,t j ) that is now defined asB(x,t j )=B(x,t j )Σ w B(x,t j ) T . Lemma 1 The optimal control solution to the stochastic optimal control problem ex- pressed by (4.1),(4.2) and (4.3) is formulated as: u t i = lim dt→0 . −R −1 G (c) t i T " ˜ p(τ i )∇ x (c) t i ˜ S(τ i )dτ i / where ˜ p(τ i )= exp(− 1 λ ˜ S(τ i )) ! exp(− 1 λ ˜ S(τ i ))dτ i is a path dependent probability distribution. The term ˜ S(τ i ) is a path function defined as ˜ S(τ i )=S(τ i )+ λ 2 C N−1 j=i log|B(x,t j )| that satisfies the following condition lim dt→0 F exp ? − 1 λ ˜ S(τ i ) @ dτ i ∈C (1) for any sampled trajectory starting from state x t i . Moreover the term H t j is given by H t j =G (c) t j R −1 G (c) t j T while the term S(τ i ) is defined according to S(τ i )= φ t N + N−1 > j=i q t j dt+ 1 2 N−1 > j=i M M M M M M x (c) t j+1 −x (c) t j dt −f (c) t j M M M M M M 2 Ht j dt. 152 Proof: Theoptimalcontrolsatthestatex t i isexpressedbytheequationu t i =−R −1 G t i ∇ xt i V t i . DuetotheexponentialtransformationofthevaluefunctionΨ t i =−λlogV t i theequation of the optimal controls is written as: u t i = λR −1 G t i ∇ xt i Ψ t i Ψ t i . In discrete time the optimal control is expressed as follows: u t i = lim dt→0 , λR −1 G T t i ∇ xt i Ψ (dt) t i Ψ (dt) t i - . By using equation (4.24) and substitutingΨ (dt) (x t i ,t) we have: u t i = lim dt→0 , λR −1 G T t i ∇ xt i F exp A − 1 λ Z(τ i ) B dτ i F exp A − 1 λ Z(τ i ) B dτ i - . Substitution of the term Z(τ i ) results in the equation: u t i = lim dt→0 λR −1 G T t i ∇ xt i F exp ? − 1 λ ˜ S(τ i )− λ(N−i)l 2 log(2πdtλ) @ dτ i F exp ? − 1 λ ˜ S(τ i )− λ(N−i)l 2 log(2πdtλ) @ dτ i . Next we are using standard properties of the exponential function that lead to: u t i = lim dt→0 λR −1 G T t i ∇ xt i 6 F exp ? − 1 λ ˜ S(τ i ) @ exp ? − λ(N−i)l 2 log(2πdtλ) @ dτ i 7 F exp ? − 1 λ ˜ S(τ i ) @ exp ? − λ(N−i)l 2 log(2πdtλ) @ dτ i . 153 The term exp A − λNl 2 log(2πdtλ) B does not depend on the trajectory τ i , therefore it can be taken outside the integral as well as outside the gradient. Thus we will have that: u t i = lim dt→0 λR −1 G T t i exp ? − λ(N−i)l 2 log(2πdtλ) @ ∇ xt i 6 F exp ? 
− 1 λ ˜ S(τ i ) @ dτ i 7 exp ? − λ(N−i)l 2 log(2πdtλ) @ F exp ? − 1 λ ˜ S(τ i ) @ dτ i . Theconstanttermdropsfromthenominatoranddenominatorandthuswecanwrite: u t i = lim dt→0 λR −1 G T t i ∇ xt i F exp ? − 1 λ ˜ S(τ i ) @ dτ i F exp ? − 1 λ ˜ S(τ i ) @ dτ i . Under the assumption that term exp ? − 1 λ ˜ S(τ i ) @ dτ i is continuously differentiable in x t i and dt we can change order of the integral with the differentiation operations. In general for ∇ x F f(x,y)dy = F ∇ x f(x,y)dy to be true, f(x,t) should be continuous in y and differentiable in x. Under this assumption, the optimal controls can be further formulated as: u t i = lim dt→0 λR −1 G T t i F ∇ xt i exp ? − 1 λ ˜ S(τ i ) @ dτ i F exp ? − 1 λ ˜ S(τ i ) @ dτ i . Application of the differentiation rule of the exponent results in: u t i = lim dt→0 λR −1 G T t i F exp ? − 1 λ ˜ S(τ i ) @ ∇ xt i ? − 1 λ ˜ S(τ i ) @ dτ i F exp ? − 1 λ ˜ S(τ i ) @ dτ i . Thedenominatorisafunctionofx t i thecurrentstateandthusitcanbepushedinside the integral of the nominator: 154 u t i = lim dt→0 λR −1 G T t i " exp ? − 1 λ ˜ S(τ i ) @ F exp ? − 1 λ ˜ S(τ i ) @ dτ i ∇ xt i $ − 1 λ ˜ S(τ i ) % dτ i . By defining the probability ˜ p(τ i )= exp(− 1 λ ˜ S(τ i )) ! exp(− 1 λ ˜ S(τ i ))dτ i the expression above can be written as: u t i = lim dt→0 . λR −1 G T t i " ˜ p(τ i )∇ xt i $ − 1 λ ˜ S(τ i ) % dτ i / . Further simplification will result in: u t i = lim dt→0 . −R −1 G T t i " ˜ p(τ i )∇ xt i ˜ S(τ i )dτ i / . We know that the control transition matrix has the form G(x t i ) T = [0 T G c (x xt i ) T ]. In addition the partial derivative∇ xt i ˜ S(τ i ) can be written as: ∇ xt i ˜ S(τ i ) T =[∇ x (m) t i ˜ S(τ i ) T ∇ x (c) t i ˜ S(τ i ) T ]. By using these equations we will have that: u t i = lim dt→0 −R −1 [0 T G (c) t i T ] " ˜ p(τ o ) ∇ x (m) t i ˜ S(τ i ) ∇ x (c) t i ˜ S(τ i ) dτ i . The equation above can be written in the form: u t i = lim dt→0 −[0 T R −1 G (c) t i T ] " ˜ p(τ i ) ∇ x (m) t i ˜ S(τ i ) ∇ x (c) t i ˜ S(τ i ) dτ i . 155 or u t i = lim dt→0 −[0 T R −1 G (c) t i T ] F ˜ p(τ i )·∇ x (m) t i ˜ S(τ i )dτ i F ˜ p(τ i )·∇ x (c) t i ˜ S(τ i )dτ i . Therefore we will have the result u t i = lim dt→0 . −R −1 G (c) t i T " ˜ p(τ i )∇ x (c) t i ˜ S(τ i )dτ i / . Lemma 2 Given the stochastic dynamics and the cost in(4.1),(4.2) and(4.3) the gradi- ent of the path function ˜ S(τ i ) in (4.25), with respect to the directly actuated part of the state x (c) t i is formulated as: ∇ x (c) t i ˜ S(τ i )= 1 2dt α T t i $ ∇ x (c) t i H −1 t i % α t i −H −1 t i $ ∇ x (c) t i f (c) t i % α t i − 1 dt H −1 t i α t i + λ 2 ∇ x (c) t i log|B t i | where H t i =G (c) t i R −1 G (c) t i T and α t j = ? x (c) t i+1 −x (c) t i −f (c) t i dt @ . Proof: We are calculating the term∇ x (c) to ˜ S(τ o ) . More precisely we have shown that ˜ S(τ i )= φ t N + N−1 > j=i q t j dt+ 1 2 N−1 > j=i 4 x (c) t j+1 −x (c) t j dt −f (c) t j 4 2 Ht j dt+ λ 2 N−1 > j=i log|B t j |. 156 To limit the length of our derivation we introduce the notation γ t j =α T t j h −1 t j α t j and α t j = ? x (c) t j+1 −x (c) t j −f (c) t j dt @ and it is easy to show that4 x (c) t j+1 −x (c) t j dt −f (c) t j 4 2 Ht j dt = 1 dt γ t j and therefore we will have: ˜ S(τ i )= φ t N + 1 2dt N−1 > j=i γ t j + t N > to Q t j dt+ λ 2 N−1 > j=i log|B t j |. In the analysis that follows we provide the derivative of the 1th, 2th and 4th term of the cost function. We assume that the cost of the state during the time horizon Q t i = 0. 
Incasesthatthisisnottruethenthederivative∇ x (c) t i C t N t i Q t i dtneedstobefoundaswell. By calculating the term∇ x (c) to ˜ S(τ o ) we can find the local controls u(τ i ). It is important to mention that the derivative of the path cost S(τ i ) is taken only with respect to the current state x to . The first term is: ∇ x (c) t i (φ t N )=0. (4.63) Derivative of the 2th Term ∇ x (c) t i 6 1 2dt C N−1 i=1 γ t i 7 of the cost S(τ i ). The second term can be found as follows: ∇ x (c) t i 1 2dt N−1 > j=i γ t j . The operator∇ x (c) to is linear and it can massaged inside the sum: 1 2dt N−1 > j=i ∇ x (c) t j A γ t j B . 157 Terms that do not depend on x (c) t i drop and thus we will have: 1 2dt ∇ x (c) t i γ t i . Substitution of the parameter γ t i =α T t i H −1 t i α t i will result in: 1 2dt ∇ x (c) t i N α T t i H −1 t i α t i O . By making the substitution β t i =H −1 t i α t i and applying the rule∇ A u(x) T v(x) B = ∇(u(x))v(x)+∇(v(x))u(x) we will have that: 1 2dt . ∇ x (c) t i α t i β t i +∇ x (c) t i β t i α t i / . (4.64) Next we find the derivative of α to : ∇ x (c) t i α t i =∇ x (c) t i 6 x (c) t i+1 −x (c) t i −f c (x t i )dt 7 . and the result is ∇ x (c) t i α t i =−I l×l −∇ x (c) t i f (c) t i dt. We substitute back to (4.64) and we will have: 1 2dt . − $ I l×l +∇ x (c) t i f (c) t i dt % β t i +∇ x (c) t i β t i α t i / . 158 − 1 2dt $ I l×l +∇ x (c) t i f (c) t i dt % β t i + 1 2dt ∇ x (c) t i β t i α t i . After some algebra the result of∇ x (c) t i ? 1 2dt C N−1 i=1 γ t i @ is expressed as: − 1 2dt β t i − 1 2 ∇ x (c) t i f (c) t i β t i + 1 2dt ∇ x (c) t i β t i α t i . We continue with further analysis of each one of the terms in the expression above. More precisely we will have: First Subterm: − 1 2dt β t i $ − 1 2dt β t i % =− $ 1 2dt H −1 t i α t i % =− 1 2 H −1 t i α t i =− 1 2 H −1 t i $ (x (c) t i+1 −x (c) t i ) 1 dt −f (c) t i % . Second Subterm: − 1 2 ∇ x (c) t i f (c) t i β t i $ 1 2 ∇ x (c) t i f (c) t i β t i % =− 1 2 ∇ x (c) t i f c (x t i ) β t i =− 1 2 ∇ x (c) t i f (c) t i A H −1 t i α t i B =− 1 2 ∇ x (c) t i f c (x t i ) H −1 t i α t i Third Subterm: 1 2dt ∇ x (c) t i β t i α t i $ 1 2dt ∇ x (c) t i β t i α t i % =∇ x (c) t i β t i $ 1 2dt α t i % =∇ x (c) t i β t i 1 2 $ (x (c) t i+1 −x (c) t i ) 1 dt −f (c) t i % . 159 We substitute β t i =H −1 t i α t i and write the matrix H −1 t i in row form: =∇ x (c) t i A H −1 t i α t i B 1 2dt α t i = = ∇ x (c) t i H (1) −T t i H (2) −T t i . . . H (l) −T t i α t i 1 2dt α t i =∇ x (c) t i H (1) −T t i α t i H (2) −T t i α t i . . . H (l) −T t i α t i 1 2dt α t i . We can push the operator∇ x (c) t i insight the matrix and apply it to each element. = ∇ T x (c) t i ? H (1) −T t i α t i @ ∇ T x (c) t i ? H (2) −T t i α t i @ . . . ∇ T x (c) t i ? H (l) −T t i α t i @ 1 2dt α t i . We again use the rule∇ A u(x) T v(x) B =∇(u(x))v(x)+∇(v(x))u(x) and thus we will have: 160 = $ ∇ x (c) t i H (1) −T t i α t i +∇ x (c) t i α t i H (1) −T t i % T $ ∇ x (c) t i H (2) −T t i α t i +∇ x (c) t i α t i H (2) −T t i % T . . . $ ∇ x (c) t i H (l) −T t i α t i +∇ x (c) t i α t i H (l) −T t i % T 1 2dt α t i . We can split the matrix above into two terms and then we pull out the termsα t i and ∇ x (c) t i α t i respectively : = α T t i ∇ x (c) t i H (1) −T t i ∇ x (c) t i H (2) −T t i . . . ∇ x (c) t i H (l) −T t i + H (1) −T t i H (2) −T t i . . . H (l) −T t i ∇ x (c) t i α T t i 1 2dt α t i = $ α T t i ∇ x (c) t i H −1 t i +H −1 t i $ ∇ x (c) t i α T t i %% 1 2dt α t i . 
= 1 2dt $ α T t i ∇ x (c) t i H −1 t i α t i +H −1 t i $ ∇ x (c) t i α T t i % α t i . % 161 Since $ ∇ x (c) t i α T t i % =−I l×l −∇ x (c) t i f (c) t i dt. the final result is expressed as follows 1 2dt ∇ x (c) t i β t i α t i = 1 2dt . α T t i $ ∇ x (c) t i H −1 t i % α t i −H −1 t i $ ∇ x (c) t i f (c) t i % dtα t i −H −1 t i α t i / After we have calculated the 3 sub-terms, the 2th term of the of the derivative of path cost S(τ i ) can be expressed in the following form: ∇ x (c) t i 1 2dt N−1 > j=i γ t j = 1 2dt α T t i $ ∇ x (c) t i H −1 t i % α t i −H −1 t i $ ∇ x (c) t i f (c) t i % α t i − 1 dt H −1 t i α t i (4.65) Next we will find the derivative of the term∇ x (c) t i ? λ 2 C N−1 j=i log|B t j | @ . Derivative of the Fourth Term∇ x (c) t i ? λ 2 C N−1 j=i log|B t j | @ of the cost S(τ i ). The analysis for the 4th term is given below: ∇ x (c) t i λ 2 N−1 > j=i log|B t j | = λ 2 ∇ x (c) t i log|B t i |. (4.66) After having calculated all the derivatives of ˜ S(τ i ) the final result under (4.63),(4.65) and (4.66) takes the form: ∇ x (c) t i ˜ S(τ i )= 1 2dt α T t i $ ∇ x (c) t i H −1 t i % α t i −H −1 t i $ ∇ x (c) t i f (c) t i % α t i − 1 dt H −1 t i α t i + λ 2 ∇ x (c) t i log|B t i |. 162 Theorem The optimal control solution to the stochastic optimal control problem ex- pressed by (4.1),(4.2),(4.3) is formulated by the equation that follows: u t i dt = " ˜ p(τ i ) u L (τ i ) dτ i , where ˜ p(τ i )= exp(− 1 λ ˜ S(τ i )) ! exp(− 1 λ ˜ S(τ i ))dτ i is a path depended probability distribution and the term u(τ i ) defined as u L (τ i )=R −1 G (c) t i T ? G (c) t i R −1 G (c) t i T @ −1 G (c) t i dw t i , are the local controls of each sampled trajectory starting from state x t i . The term is defined as H t i =G (c) t i R −1 G (c) t i T . To prove the theorem we make use of the lemma L2 and we substitute∇ x (c) t i ˜ S(τ i ) in the main result of lemma L1. More precisely from lemma L1 we have that: u t i dt =−R −1 G (c) t i T dt " ˜ p(τ i ) $ ∇ x (c) t i ˜ S(τ i ) % dτ i . u t i dt =−R −1 G (c) t i T dt " ˜ p(τ i ) $ ∇ x (c) t i ˜ S(τ i ) % dτ i (4.67) =R −1 G (c) t i T * ∇ x (c) t i ˜ S(τ i )dt + ˜ p(τ i ) Now we will find the term * ∇ x (c) t i ˜ S(τ i )dt + ˜ p(τ i ) . 
More precisely we will have that: 163 * ∇ x (c) t i ˜ S(τ i )dt + ˜ p(τ i ) = * 1 2 α T t i $ ∇ x (c) t i H −1 t i % α t i + ˜ p(τ i ) − * H −1 t i $ ∇ x (c) t i f (c) t i % α t i dt + ˜ p(τ i ) − * H −1 t i α t i + ˜ p(τ i ) + * λ 2 ∇ x (c) t i log|B t i |dt + ˜ p(τ i ) The first term of the expectation above is calculated as follows: * 1 2dt α T t i $ ∇ x (c) t i H −1 t i % α t i + ˜ p(τ i ) = * 1 2 $ ∇ x (c) t i H −1 t i % α t i α T t i + ˜ p(τ i ) = 1 2dt * tr $$ ∇ x (c) t i H −1 t i % α t i α T t i % + ˜ p(τ i ) = 1 2dt tr H $ ∇ x (c) t i H −1 t i % * α t i α T t i + ˜ p(τ i ) I By taking into account the fact that * α t i α T t i + ˜ p(τ i ) =B (c) t i Σ w B (c) t i T dt * 1 2dt α T t i $ ∇ x (c) t i H −1 t i % α t i + ˜ p(τ i ) = 1 2 tr $$ ∇ x (c) t i H −1 t i % B (c) t i Σ w B (c) t i T % = dt 2 tr $$ ∇ x (c) t i H −1 t i % B (c) t i Σ w B (c) t i T % 164 ByusingthefactthatthenoiseandthecontrolsarerelatedviaΣ t j =B (c) t j Σ w B (c) t j T dt = λG (c) t j R −1 G (c) t j T dt = λH t j dt with H t j =G (c) t j R −1 G (c) t j T we will have: * 1 2 α T t i $ ∇ x (c) t i H −1 t i % α t i + ˜ p(τ i ) = 1 2 tr $$ ∇ x (c) t i H −1 t i % B (c) t i Σ w B (c) t i T % = λ 2 tr $$ ∇ x (c) t i B(x t i ) −1 % B(x t i ) % = λ 2 ∇ x (c) t i log|B(x t i )| −1 =− λ 2 ∇ x (c) t i log|B(x t i )| The second term * H −1 t i $ ∇ x (c) t i f (c) t i % α t i dt + ˜ p(τ i ) = 0 since dtα t i = dtG (c) t i dw → 0. We the equation above we will have that: * ∇ x (c) t i ˜ S(τ i )dt + ˜ p(τ i ) =− * H −1 t i $ ∇ x (c) t i f (c) t i % α t i dt + ˜ p(τ i ) − * H −1 t i α t i + ˜ p(τ i ) =− * H −1 t i α t i + ˜ p(τ i ) =− * H −1 t i G (c) t i dw t i + ˜ p(τ i ) Substituting back to the optimal control we will have that: u t i dt = " ˜ p(τ i )R −1 G (c) t i T H −1 t i G (c) t i dw t i dτ i , (4.68) or in a more compact form: u t i dt = " ˜ p(τ i ) u (dt) L (τ i ) dτ i , (4.69) 165 where the local controls u (dt) L (τ i ) are given as follows: u (dt) L (τ i )=R −1 G (c) t i T H −1 t i G (c) t i dw t i The local control above can be written in the form: u L =R −1 G (c) t i T ? G (c) t i R −1 G (c) t i T @ −1 G c dw t i . Therefore the optimal control can now be expressed in the form: u(τ i )dt =R −1 G (c) t i T ? G (c) t i R −1 G (c) t i T @ −1 K > k=1 ˜ p (k) (τ i )G (c) t i dw (k) t i (4.70) 166 Chapter 5 Policy Gradient Methods In this chapter we are discussing the Policy Gradient (PG) methods which are classified as part of model free reinforcement learning. Our goal is to provide a quick introduction to PGs and review the main assumptions and mathematical tricks and their derivations. Our discussion starts in section 5.1 with the presentation of one of the most simple and widely used PG methods, the so called finite difference algorithm. We continue in section 5.2 with the derivation of the Episodic Reinforce PG method. Our derivation consist of the computation of the PG, the computation of the optimal baseline necessary for reducing the variance of the estimate gradient. In section 5.3 the policy gradient theorem is presented with the derivation of the corresponding gradient and the time optimal baseline. In section 5.4 the concept of the Natural Gradient is presented and its application to reinforcement learning problems is discussed. The resulting algorithm Natural Actor Critic is derived. In the last section we conclude with observations and comments regarding the performance of PG methods. 167 5.1 Finite difference In the Finite Difference(FD) method the goal is to optimize a cost function w.r.t. 
a parameter vector θ∈" p×1 . In reinforcement learning scenarios this parameter vector is used to parametrized the policy. The optimization problem is stated as follows: min θ J(θ) As in all policy gradient algorithms, in FD methods the gradient is estimated∇ θ J and the parameter estimates are updated according to the rule θ k+1 = θ k +∇ θ J. To find the gradient, a number of perturbations of the parameters δθ are performed and the Taylor series expansions of the cost function is computed. More precisely we will have: J i (θ)=J(θ)+δθ i )=J(θ)+∇ θ J T δθ i +O(δθ 2 i,j ) ∀ i=1,2,...,M By putting all these equations above for i=1,2,...,M together we will have that: ΔJ 1 (θ) ΔJ M (θ) = δθ T 1 δθ T M ∇ θ J The equation can be solved with respect to∇ θ J. More precisely we will have: ∇ θ J = $ ΔΘ T ΔΘ % −1 ΔΘ T ΔJ 168 where ΔΘ T =(δθ 1 ,...,δθ M )∈" N×M and ΔJ T = (ΔJ 1 (θ),...,ΔJ 1 (θ))∈" 1×M . The estimation of the gradient vector∇ θ J requires that the matrixΔΘ T ΔΘ is full rank and therefore invertible. 5.2 Episodic reinforce (Williams 1992) introduced the episodic REINFORCE algorithm, which is derived from taking the derivative of a cost with respect to the policy parameters. This algorithm has rather slow convergence due to a very noisy estimate of the policy gradient. It is also very sensitive to a reward baseline parameter b k (see below). Recent work derived the optimal baseline for REINFORCE (cf. (Peters & Schaal 2008c)), which improved the performance significantly. We derive of episodic REINFORCE algorithm by mathematically expressing the cost function under optimization as follows: J(x,u)= " p(τ)R(τ)dτ (5.1) wherep(τ) is the probability of the trajectoryτ =(x 0 ,u 0 ...x N−1 ,u N−1 ,x N ) of states and controls with x∈" n×1 and u∈" p×1 and R(τ)= C N t=1 r(x t ,u t ) is the cost accu- mulated over the horizon T =Ndt . Due to the Markov property and the dependence of the policy to the state and parameter x,θ we will have the following expression for the probability of the trajectory p(τ): 169 p(τ)=p(x 0 ) N−1 G i=1 p(x i+1 |x i ,u i )p(u i |x i ;θ) (5.2) The probability of the trajectory is expressed as the product of the transition prob- abilities in p(x i+1 |x i ,u i ) and the parametrized policy p(u i |x i ;θ) where θ∈" q×1 is the parameter under learning. We would like to find the gradient ofJ(x,u) w.r.t the parameter θ. More precisely we will have that: ∇ θ J(x,u)=∇ θ , " p(τ)R(τ)dτ - = " ∇ θ p(τ) R(τ)dτ = " p(τ)∇ θ logp(τ) R(τ)dτ = * ∇ θ logp(τ) R(τ) + ˜ p(τ i ) where the !# p(τ) is the expectation under the probability metric p(τ). The next step is to calculate the term∇ θ logp(τ). ∇ θ logp(τ)=∇ θ , logp(x 0 )+ N−1 > i=1 logp(x i+1 |x i ,u i )+ N−1 > i=1 logp(u i |x i ;θ) - =∇ θ , N−1 > i=1 logp(u i |x i ;θ) - = N−1 > i=1 ∇ θ logp(u i |x i ;θ) 170 Therefore the policy gradient is expressed as: ∇ θ J(x,u)= * R(τ) N−1 > i=1 ∇ θ logp(u i |x i ;θ) + ˜ p(τ i ) (5.3) The equation above provide us with an estimate of the true gradient∇ θ J(x,u). In order to reduce the variance of this estimate we will incorporate the baselineb k such that the following expression is minimized: b k = argmin !$ (R(τ)−b k ) N−1 > i=1 ∂ θ k logp(u i |x i ;θ)−µ k % 2 + where µ k = * R(τ) C N−1 i=1 ∂ θ k logp(u i |x i ;θ) + . More precisely we will have that: ∂ b k $! (R(τ)−b k ) 2 , N−1 > i=1 ∂ θ k logp(u i |x i ;θ) - 2 +µ 2 k −2µ k (R(τ)−b k ) N−1 > i=1 ∂ θ k logp(u i |x i ;θ) #% = ∂ b k $! 
(R(τ)−b k ) 2 , N−1 > i=1 ∂ θ k logp(u i |x i ;θ) - 2 +µ 2 k −2µ k R(τ) N−1 > i=1 ∂ θ k logp(u i |x i ;θ) #% = ∂ b k $! (R(τ)−b k ) 2 , N−1 > i=1 ∂ θ k logp(u i |x i ;θ) - 2 #% =0 where we have used the fact that: " p(τ)dτ =1⇒∇ θ " p(τ)dτ =0⇒ * ∇ θ logp(τ) + p(τ) =0 171 The optimal baseline is defined as: b k = * $ R(τ) C N−1 i=1 ∂ θ k logp(u i |x i ;θ) % 2 + p(τ) * C N−1 i=1 ∂ θ k logp(u i |x i ;θ) + p(τ) (5.4) The final expression for the gradient is: ∇ θ J(x,u)= * diag(R(τ)−b) N−1 > i=1 ∇ θ logp(u i |x i ;θ) + p(τ) (5.5) where diag(R(τ)−b) is defined as: diag(R(τ)−b)= R(τ)−b 1 ... 0 00 0 ... R(τ)−b n (5.6) Without loss of generality, the policy could be parametrized as follows: u(x,θ)dt =Φ(x)θdt+B(x)dw (5.7) Under this parameterization we will have that: p(u i |x i ;θ)= 1 (2π) m/2 |B(x)B(x) T | exp $ − 1 2 (u−Φθ) T A B(x)B(x) T B −1 (u−Φθ) % By taking the logarithm of the probability above we will have that: 172 logp(u i |x i ;θ)=−log(2π) m/2 |BB T |− $ 1 2 (u−Φ(x)θ) T A BB T B −1 (u−Φ(x)θ) % =−log(2π) m/2 |B(x)B(x) T |− 1 2 θ T Φ T A BB T B −1 Φθ+θ T Φ T A BB T B −1 u − 1 2 u T A BB T B −1 u Thus ∇ θ logp(u i |x i ;θ)= −Φ T A BB T B −1 Φθ +Φ T BB T u =Φ T A BB T B −1 B! i and the policy gradient will take the form: ∇ θ J(x,u)= * diag(R(τ)−b) N−1 > i=1 Φ T A BB T B −1 B! i + p(τ) (5.8) The result above can take different formulations depending on the parameterization of the policy. Therefore, if B =Φ then we will have that: ∇ θ J(x,u)= * diag(R(τ)−b) N−1 > i=1 B T A BB T B −1 B! i + p(τ) (5.9) Before we move to the derivation of the policy gradient theorem it is important to realize that the expectations above are taken with respect to the state space trajectories, These trajectories can be generated by the application of the current policy (policy at every iteration) on the real physical system. In addition one may ask in which cases the expectationsaboveresultinzerogradientvectorandthereforenofurtherupdateof. The expectations compute the correlation of the perturbations of the policy parameters with the observed changes in the cost function. Therefore the gradient estimate will approach zero either when no change in the cost function is observed or there is no correlation 173 between the cost function and the parameter perturbations. In both cases, cost function tuning is of critical importance. 5.3 GPOMDP and policy gradient theorem In their GPOMDP algorithm, (Baxter & Bartlett 2001) introduced several improvements overREINFORCEthatmadethegradientestimatesmoreefficient. GPOMDPcanalsobe derived from the policy gradient theorem (Sutton, McAllester, Singh & Mansour 2000, Peters & Schaal 2008c), and an optimal reward baseline can be added (cf. (Peters & Schaal 2008c)) Under the observation that past rewards do not affect future controls, the reinforce policy gradient can be reformulated as follows: ∇ θ J(x,u)= * N−1 > i=1 $ diag(R i (τ)−b i ) $ ∇ θ logp(u i |x i ;θ) %% + p(τ) (5.10) where R i (τ)= 1 N−i C N j=i r(x j ,u j ). Given the parameterization of the policy the results above takes the form: ∇ θ J(x,u)= * N−1 > i=1 $ diag(R i (τ)−b i ) $ Φ T A BB T B −1 B! i %% + p(τ) (5.11) The term b k,i is the optimal baseline that minimizes the variance of the estimated gradient. 174 diag(R i (τ)−b i )= R(τ)−b 1,i ... 0 00 0 ... 
R(τ)−b n,i (5.12) The variance of the estimated gradient is expressed as follows: *, N−1 > i=1 $ ∂ θ k logp(u i |x i ;θ)(R i (τ)−b k,i ) % −µ k - 2 + p(τ) = = * $ N−1 > i=1 $ ∂ θ k logp(u i |x i ;θ)(R i (τ)−b k,i ) % 2 # p(τ) − * 2µ k N−1 > i=1 $ ∂ θ k logp(u i |x i ;θ)(R i (τ)−b k,i ) % +µ 2 k + We take the derivative of the expectation above with respect to b k and set it to zero. More precisely we will have that: ∂ b k,m *, N−1 > i=1 $ ∂ θ k logp(u i |x i ;θ)(R i (τ)−b k,i ) % −µ k - 2 + p(τ) =0 ∂ b k,m * $ N−1 > i=1 $ ∂ θ k logp(u i |x i ;θ)(R i (τ)−b k,i ) %% 2 + p(τ) −∂ b k,m * 2µ k N−1 > i=1 $ ∂ θ k logp(u i |x i ;θ)(R i (τ)−b k,i ) % + p(τ) =0 Since * ∇ θ logp(τ) + p(τ) = 0 the expression above takes the form: 175 * b k,m ∂ θ k logp(u m |x m ;θ) + p(τ) = * R m (τ) $ ∂ θ k logp(u m |x m ;θ) % 2 + p(τ) Thus, the optimal baseline is defined as: b k,m = * R m (τ) $ ∂ θ k logp(u m |x m ;θ) % 2 + p(τ) * ∂ θ k logp(u m |x m ;θ) + p(τ) (5.13) 5.4 Episodic natural actor critic Vanilla policy gradients which follow the gradient of the expected cost function J(x,u) very often stuck intolocal minimum. As, it has been demonstrated in supervised learning (Amari 1999) natural gradients are less sensitive in getting trapped to local minima. Methods based on natural gradients do not follow the steepest direction in the parameter space but the steepest direction with respect to Fisher information metric. Oneofthemostefficientpolicygradientalgorithmwasintroducedin(Peters&Schaal 2008b), called the Episodic Natural Actor Critic. In essence, the method uses the Fisher Information Matrix to project the REINFORCE gradient onto a more effective update direction, which is motivated by the theory of natural gradients by (Amari 1999). The gradient for the eNAC algorithm takes the form of: ˜ ∇J =F(θ) −1 ∇ θ J (5.14) 176 whereF(θ) is the Fisher information matrix. To derive the natural actor critic we start from the policy gradient theorem and we will have that: ∇ θ J(x,u)= * N−1 > i=1 $ ∇ θ logp(u i |x i ;θ)diag(R i (τ)−b k,i ) % + p(τ) The equation above can be also written in the form: ∇ θ J(x,u)= " p(x " |x,u) " p(u|x;θ) $ ∇ θ logp(u|x;θ)(R(τ)−b k ) % dudx At this point the term R(τ)−b k is approximated with logp(u|x;θ) T w. Thus sub- stitution of R(τ)−b k = logp(u|x;θ) T w results in: ∇ θ J(x,u)= " p(x " |x,u) " p(u|x;θ) $ ∇ θ logp(u|x;θ)∇ θ logp(u|x;θ) T w % dudx = " p(x " |x,u)F(x,θ)dxw =F(θ)w where F(x,θ)= F p(u|x;θ) $ ∇ θ logp(u|x;θ)∇ θ logp(u|x;θ) T % du. By substituting the result above to the parameter update law we will have that: θ k+1 =θ k +F(θ) −1 ∇ θ J =θ k +w (5.15) 177 Aswecanseetheupdatelawisfurthersimplifiedtojustonlyupdatingtheparameters θ withw. Thusitisimportanttocomputew. Todoso,weconsidertheBellmanequation in terms of an advantage function A(x,u) and state value function V(x). More precisely we will have: Q(x,u)=A(x,u)+V(x)=r(x,u)+ " p(x , |x,u)V(x , )dx By evaluating the equation above on the trajectory ? x (j) 0 ,u (j) 0 ...x (j) N−1 ,u (j) N−1 ,x (j) N @ we will have that: N−1 > i=1 A(x (j) i ,u (j) i )+V(x (j) 0 )= N−1 > i=1 r(x (j) i ,u (j) i )+V(x (j) N ) N−1 > i=1 ∇ θ p ? u (j) i |x (j) i ;θ @ T w+V(x (j) 0 )−V(x (j) N )= N−1 > i=1 r(x (j) i ,u (j) i ) N−1 > i=1 ∇ θ p ? u (j) i |x (j) i ;θ @ T w+ΔV = N−1 > i=1 r(x (j) i ,u (j) i ) By combining the equations above for j=1,2,...,M we will have that: ∇ θ p ? u (1) i |x (1) i ;θ @ T , 1 ... ... ∇ θ p ? u (M) i |x (M) i ;θ @ T , 1 w ΔV = C N−1 i=1 r(x (1) i ,u (1) i ) ... 
C N−1 i=1 r(x (M) i ,u (M) i ) We regress the equation above and get the final result for w and obtain: 178 w = A X T X B −1 X T Y (5.16) where the matrix X and the vector Y are defined as follows: X = ∇ θ p ? u (1) i |x (1) i ;θ @ T , 1 ... ... ∇ θ p ? u (M) i |x (M) i ;θ @ T , 1 and Y = C N−1 i=1 r(x (1) i ,u (1) i ) ... C N−1 i=1 r(x (M) i ,u (M) i ) To find the parameter vectorw∈" n×1 , there must be M >n number of trajectories rolloutssuchthatthematrixX T Xisfullrankandthereforeinvertible. Withtheepisodic NaturalActorCriticwewillconcludeourpresentationofPGmethods. Inthenextsection we discuss the application and comparison of PGs on a LQR optimal control problem. 5.5 Discussion In this chapter we have reviewed the PG methods with the derivation of estimated corre- sponding gradients. The work on PG methods for reinforcement learning was an impor- tant advancement since it offered an alternative approach to optimal control problems in which either no model is available, or if there is a model, it is a bad approximation. Besides the their advantages, PG methods are in general, not easy to tune since they are very sensitive to exploration noise as well as the cost function design. In the next chapter we compare the PG methods with iterative path integral optimal control in via point tasks. 179 Chapter 6 Applications to Robotic Control In this chapter we present the application of iterative path integral stochastic optimal control for the applications of planning and gain scheduling. We start our presentation in section 6.1 with the discussion on Dynamic Movement Primitives (DMPs) which cor- responds to nonlinear point or limit cycle attractors with adjustable land scape. The DMPs play an essential role in the application of path integral control to learning robotic tasks. We discuss this role in section 6.2 where the ways in which DMPs are used for representing desired trajectories and for gain scheduling are presented. When the itera- tive path integral control framework is applied to DMPs the resulting algorithm is the so called the Policy Improvement with Path Integrals (PI 2 ). In section 6.3 we provide all the main equations of (PI 2 ) and we discuss all the small adjustments required to robotic tasks with the use of the DMPs. In section 6.4 PI 2 is applied for learning optimal state space trajectories. The eval- uations take place on simulated planar manipulators of different DOF and the little dog robot for the task or passing through a target and jumping over a gap respectively. In section 6.5 PI 2 is applied for optimal planning and gain scheduling. The robotic tasks 180 include via point task with manipulators of various DOFs as well as the task of pushing a door to open with the simulated CBi humanoid robot. In the last section 6.8 we discuss the performance of PI 2 in the aforementioned tasl and we conclude. 6.1 Learnable nonlinear attractor systems 6.1.1 Nonlinear point attractors with adjustable land-scape The nonlinear point attractor consists of two sets of differential equations, the canonical and transformation system which are coupled through a nonlinearity (Ijspeert, Nakan- ishi, Pastor, Hoffmann & Schaal submitted),(Ijspeert, Nakanishi & Schaal 2003). The canonical system is formulated as 1 τ ˙ x t =−αx t . That is a first - order linear dynamical system for which, starting from some arbitrarily chosen initial state x 0 , e.g., x 0 = 1, the state x converges monotonically to zero. 
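As a purely illustrative aside, not part of the original text, the following short Python sketch integrates this canonical system with an Euler scheme; the constants alpha, tau and dt are hypothetical and chosen only to show the monotonic decay of x from x_0 = 1 towards zero.

```python
import numpy as np

# Hypothetical constants for the canonical system (1/tau) * x_dot = -alpha * x.
alpha, tau, dt = 8.0, 1.0, 0.01
steps = int(1.0 / dt)

x = 1.0                           # arbitrarily chosen initial state, x_0 = 1
phase = np.empty(steps)
for k in range(steps):
    phase[k] = x
    x += dt * (-alpha * x / tau)  # Euler step; with tau = 1 the placement of tau is immaterial

# x decays monotonically towards zero and can therefore serve as a phase variable.
print(phase[0], phase[steps // 2], phase[-1])
```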
x can be conceived of as a phase variable, where x = 1 would indicate the start of the time evolution, and x close to zero means that the goal g (see below) has essentially been achieved. The transformation system consist of the following two differential equations: τ ˙ z =α z β z $$ g+ f α z β z % −y % −α z z (6.1) τ ˙ y =z Essentially, these 3 differential equations code a learnable point attractor for a move- ment from y t 0 to the goal g, where θ determines the shape of the attractor. y t , ˙ y t denote 181 the position and velocity of the trajectory, while z t ,x t are internal states. α z ,β z ,τ are time constants. The nonlinear coupling or forcing term f is defined as: f(x)= C N i=1 K(x t ,c i )θ i x t C N i=1 K(x t ,c i ) (g−y 0 )=Φ P (x) T θ (6.2) The basis functions K(x t ,c i ) are defined as: K(x t ,c i )=w i = exp A −0.5h i (x t −c i ) 2 B (6.3) with bandwidth h i and center c i of the Gaussian kernels – for more details see (Ijspeert et al. 2003). The full dynamics have the form of dx =F(x)dt+G(x)udt where the state x is specified as x=(x,y,z) while the controls are specified as u =θ=(θ 1 ,...,θ p ) T .The representation above is advantageous as it guarantees attractor properties towards the goal while remaining linear in the parametersθ of the function approximator. By varying the parameter θ the shape of the trajectory changes while the goal state g and initial state y t 0 remain fixed. These properties facilitate learning (Peters & Schaal 2008c). 6.1.2 Nonlinear limit cycle attractors with adjustable land-scape Thecanonicalsystemforthecaseoflimitcycleattractorsconsistthedifferentialequation τ ˙ φ = 1 where the term φ∈ [0,2π] correspond to the phase angle of the oscillator in polar coordinates. The amplitude of the oscillation is assumed to ber. This oscillator produces a stable limit cycle when projected into Cartesian coordinated with v 1 = rcos(φ) and v 2 =rsin(φ). In fact, it corresponds to form of the (Hopf-like) oscillator equations 182 τ ˙ v 1 =−µ L v 2 1 +v 2 2 −r L v 2 1 +v 2 2 v 1 −v 2 (6.4) τ ˙ v 2 =−µ L v 2 1 +v 2 2 −r L v 2 1 +v 2 2 v 2 +v 1 (6.5) where µ is a positive time constant. The system above evolve to the limit cycle v 1 =rcos(t/τ +c) and v 2 =rsin(t/τ +c) with c a constant, given any initial conditions except [v 1 ,v 2 ] = [0,0] which is an unstable fixed point. Therefore the canonical system provides the amplitude signal (r) and a phase signal (φ) to the forcing term: f(φ,r)= C N i=1 K(φ,c i )θ i C N i=1 K(φ,c i ) r =Φ R (φ) T θ (6.6) where the basis function K(φ,c i ) are defined as K(φ,c i ) = exp(h i (cos(φ−c i )−1)). The forcing term is incorporated into the transformation system which is expressed by the equations (6.1). The full dynamics of the rhythmic movement primitives have the form ofdx =F(x)dt+G(x)udt where the statex is specified asx=(φ,v 1 ,v 2 ,z,y) while the controls are specified as u =θ=(θ 1 ,...,θ p ) T . 
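To make the structure of the point attractor in (6.1)–(6.3) concrete, here is a small illustrative Python sketch that is not part of the thesis: the gains, time constants, number of kernels and their placement are all assumed values, and the shape parameters θ are simply set to zero, which yields the plain point attractor (any nonzero θ would reshape the path while the goal g stays fixed).

```python
import numpy as np

# Hypothetical DMP constants for the point attractor of Section 6.1.1.
alpha, alpha_z, beta_z, tau, dt = 8.0, 25.0, 25.0 / 4.0, 1.0, 0.001
N = int(1.0 / dt)                              # number of integration steps
p = 10                                         # number of basis functions

# Gaussian kernels of eq. (6.3): centers spread along the phase trajectory,
# with heuristically chosen bandwidths.
c = np.exp(-alpha * np.linspace(0.0, 1.0, p))
h = 1.0 / np.diff(c, append=c[-1] * 0.5) ** 2

theta = np.zeros(p)                            # shape parameters (zero = plain point attractor)
y0, g = 0.0, 1.0                               # start position and goal
x, z, y = 1.0, 0.0, y0
Y = np.empty(N)

for k in range(N):
    # Forcing term of eq. (6.2): normalized kernels weighted by theta, multiplied
    # by the phase x and the movement amplitude (g - y0).
    K = np.exp(-0.5 * h * (x - c) ** 2)
    f = (K @ theta) * x / np.sum(K) * (g - y0)

    # Transformation system (6.1) and canonical system, Euler-integrated.
    z += dt * (alpha_z * (beta_z * (g - y) - z) + f) / tau
    y += dt * z / tau
    x += dt * (-alpha * x) / tau
    Y[k] = y

print(Y[0], Y[N // 2], Y[-1])                  # y moves from y0 towards the goal g
```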
The term g for the case of limit cycle attractorsisinterpretedasanchorpoint(orsetpoint)fortheoscillatorytrajectory, which can be changed to accommodate any desired baseline of the oscillation The complexity of attractors is restricted only by the abilities of the function approximator used to gen- erate the forcing term, which essentially allows for almost arbitrarily complex (smooth) attractors with modern function approximators 183 6.2 Robotic optimal control and planning with nonlinear attractors In this section we show how the Path integral optimal control formalism in combination with the point and limit cycle attractors can be used for optimal planning (Theodorou, Buchli & Schaal 2010) and control (Buchli, Theodorou, Stulp & Schaal 2010) of robotic systems in high dimensions. As an example, consider a robotic system with rigid body dynamics (RBD) equations (Sciavicco & Siciliano 2000) using a parameterized policy: ¨ q = M(q) −1 (−C(q, ˙ q)−v(q))+M(q) −1 u (6.7) u = K P (q d −q)+K D (˙ q d − ˙ q) (6.8) where M is the RBD inertia matrix, C are Coriolis and centripetal forces, and v de- notes gravity forces. The state of the robot is described by the joint angles q and joint velocities ˙ q. The proportional-Derivative (PD) controller with positive definite gain matrices K P and K D have the form K P = diag ? K (1) p ,K (2) p ,...,K (N) p @ and K D = diag ? K (1) d ,K (2) d ,...,K (N) d @ where K (i) p ,K (i) d are the proportional and derivative gains for every DOF i. These gains converts a desired trajectory q d , ˙ q d into a motor command u. The gains are parameterized as follows: dK (i) p = α K ? Φ (i) P T ? θ (i) dt+dω (i) @ −K (i) p dt @ (6.9) This equation models the time course of the position gains which are are represented by a basis functionΦ (i) P T θ (i) linear with respect to the learning parameterθ (i) , and these 184 parameter can be learned with the (PI 2 ). We will assume that the time constant α K is so large, that for all practical purposes we can assume that K (i) P =Φ (i) p T ? θ (i) +! (i) t @ holds at all time where ! (i) t = dω (i) dt . In our experiments K D gains are specified as K (i) d = ξ E K (i) p where ξ is user determined. Alternatively, for the case of optimal planing we could create another form of control structure in which we add for the RBD system (6.7) the following equation: ¨ q d = G(q d , ˙ q d )(θ+! t ) (6.10) whereG(q d , ˙ q d ) is represented with a point or limit cycle attractor. The control or learning parameter for this case is the parameter θ in (6.10). 6.3 Policy improvements with path integrals: The (PI 2 ) algorithm. After having introduced the nonlinear stable attractors with learnable landscapes which from now on we will call them as Dynamic Movement Primitives(DMPs), in this section we discuss the application of iterative path integral control to DMPs. The resulting algorithm is the so calledPolicyImprovement withPathIntegralsPI 2 . As can be easily recognized, the DMP equations are of the form of our control system (4.2), with only one controlled equation and a one dimensional actuated state. This case has been treated in Section 4.4. The motor commands are replaced with the parametersθ – the issue of time 185 dependent vs. constant parameters will be addressed below. More precisely, the DMP equations can be written as: ˙ x t ˙ z t ˙ y t = −αx t y t α z (β z (g−y t )−z t ) + 0 1×p 0 1×p g (c) t T (θ t +! t ) (6.11) ThestateoftheDMPispartitionedintothecontrolledpartx (c) t =y t anduncontrolled part x (m) t =(x t z t ) T . 
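As a concrete illustration of the gain parameterization in (6.8) and (6.9), not taken from the thesis, the following Python sketch computes the motor command of a single DOF from an assumed desired trajectory. The number of basis functions, the kernel widths, the initial parameters θ, the noise level and the ratio ξ are all hypothetical; the sketch uses the steady-state form K_p = Φ_P^T(θ + ε) that the text assumes for large α_K, and interprets the relation between the gains as K_d = ξ√K_p as stated above.

```python
import numpy as np

rng = np.random.default_rng(1)

# One DOF and a hypothetical desired trajectory over one second.
dt, N = 0.01, 100
t = np.arange(N) * dt
q_d = 0.5 * (1.0 - np.cos(np.pi * t))       # desired position (simple reaching movement)
qd_d = np.gradient(q_d, dt)                 # desired velocity

# Basis functions Phi_P for the time course of the proportional gain, cf. eq. (6.9).
p = 5
centers = np.linspace(0.0, t[-1], p)
Phi = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.1) ** 2)
Phi /= Phi.sum(axis=1, keepdims=True)       # normalized kernels, one row per time step

theta = np.full(p, 50.0)                    # gain parameters under learning (assumed values)
eps = rng.standard_normal((N, p)) * 2.0     # exploration noise epsilon added to theta
xi = 0.1                                    # user-determined ratio in K_d = xi * sqrt(K_p)

# Simulated measured state (here simply the desired trajectory plus small errors).
q = q_d + 0.01 * rng.standard_normal(N)
qd = qd_d + 0.05 * rng.standard_normal(N)

# K_p = Phi_P^T (theta + eps) at every time step, clipped to stay non-negative.
K_p = np.maximum(Phi @ theta + np.einsum('np,np->n', Phi, eps), 0.0)
K_d = xi * np.sqrt(K_p)
u = K_p * (q_d - q) + K_d * (qd_d - qd)     # PD command of eq. (6.8), single DOF
print(u[:3])
```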
The control transition matrix depends on the state, however, it depends only on one of the state variables of the uncontrolled part of the state, i.e., x t . The path cost for the stochastic dynamics of the DMPs is given by: ˜ S(τ i )= φ t N + N−1 > j=i q t j dt+ 1 2 N−1 > j=i M M M M M M x (c) t j+1 −x (c) t j dt −f (c) t j M M M M M M 2 H −1 t j dt+ λ 2 N−1 > j=i log|H t j | ∝ φ t N + N−1 > j=i q t j + 1 2 N−1 > j=i M M Mg (c)T t j (θ t j +! t j ) M M M 2 H −1 t j = φ t N + N−1 > j=i q t j + 1 2 N−1 > j=i 1 2 (θ t j +! t j ) T g (c) t j H −1 t j g (c)T t j (θ t j +! t j ) = φ t N + N−1 > j=i q t j + 1 2 N−1 > j=i 1 2 (θ t j +! t j ) T g (c) t j g (c)T t j g (c)T t R −1 g (c) t (θ t j +! t j ) = φ t N + N−1 > j=i q t j + 1 2 N−1 > j=i 1 2 (θ t j +! t j ) T M T t j RM t j (θ t j +! t j ) (6.12) with M t j = R −1 gt j g T t j g T t j R −1 gt j . H t becomes a scalar given by H t =g (c)T t R −1 g (c) t . Interest- ingly, the term λ 2 C N−1 j=i log|H t j | for the case of DMPs depends only on x t , which is a 186 deterministic variable and therefore can be ignored since it is the same for all sampled paths. We also absorbed, without loss of generality, the time step dt in cost terms. Con- sequently, the fundamental result of the path integral stochastic optimal problem for the case of DMPs is expressed as: u t i = F P (τ i )u(τ i )dτ (c) i (6.13) where the probability P (τ i ) and local controls u(τ i ) are defined as P (τ i )= e − 1 λ ˜ S(τ i ) F e − 1 λ ˜ S(τ i ) dτ i , u(τ i )= R −1 g (c) t i g (c)T t i g (c)T t i R −1 g (c) t i ! t i (6.14) and the path cost given as ˜ S(τ i )= φ t N + N−1 > j=i q t j + 1 2 N−1 > j=i ! T t j M T t j RM t j ! t j (6.15) Note that θ = 0 in these equations, i.e., the parameters are initialized to zero. These equations correspond to the case where the stochastic optimal control problem is solved withoneevaluation oftheoptimalcontrols(6.13) usingdensesamplingof thewholestate space under the “passive dynamics” (i.e., θ = 0), which requires a significant amount of exploration noise. Such an approach was pursued in the original work by (Kappen 2007, Broek et al. 2008), where a potentially large number of sample trajectories was needed to achieve good results. Extending this sampling approach to high dimensional spaces, however, is daunting, as with very high probablity, we would sample primarily rather 187 useless trajectories. Thus, biasing sampling towards good initial conditions seems to be mandatory for high dimensional applications. Thus, we consider only local sampling and an iterative update procedure. Given a current guess of θ, we generate sample roll-outs using stochastic parameters θ +! t at every time step. To see how the generalized path integral formulation is modified for the case of iterative updating, we start with the equations of the update of the parameter vector θ, which can be written as: θ (new) t i = " P (τ i ) R −1 g t i g t i T (θ+! t i ) g t i T R −1 g t i dτ i (6.16) = " P (τ i ) R −1 g t i g t i T ! t i g t i T R −1 g t i dτ i + R −1 g t i g t i T θ g t i T R −1 g t i (6.17) = δθ t i + R −1 g t i g t i T tr(R −1 g t i g t i T ) θ (6.18) = δθ t i +M t i θ (6.19) The correction parameter verctor δθ t i is defined as δθ t i = F P (τ i ) R −1 gt i gt i T !t i gt i T R −1 gt i dτ i . It is important to note that θ (new) t i is now time dependent, i.e., for every time step t i ,a different optimal parameter vector is computed. 
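Because the quantities in (6.13)–(6.19) are in practice estimated from a finite number of roll-outs, a brief illustrative Python sketch may help; it is not from the thesis, and the dimensions, costs and noise are made-up placeholders. For a single time step t_i it forms the roll-out probabilities of (6.14), the projection M_ti, and the time-dependent correction δθ_ti of (6.16)–(6.19).

```python
import numpy as np

rng = np.random.default_rng(2)

p, K = 10, 20                                # number of parameters and of sampled roll-outs
lam = 1.0
R = np.eye(p)
Rinv = np.linalg.inv(R)

g = rng.random(p)                            # basis vector g_{t_i} of the DMP at this time step
eps = rng.standard_normal((K, p)) * 0.3      # exploration noise of each roll-out at t_i
S = rng.uniform(0.0, 10.0, size=K)           # path cost S~(tau_i) of each roll-out from t_i on

# Projection matrix M_{t_i} = R^{-1} g g^T / (g^T R^{-1} g), cf. (6.12)-(6.14).
M = (Rinv @ np.outer(g, g)) / (g @ Rinv @ g)

# Probability of each roll-out at this time step, eq. (6.14).
w = np.exp(-S / lam)
P = w / np.sum(w)

# Probability-weighted, time-dependent correction delta_theta_{t_i}, cf. (6.16)-(6.19).
delta_theta_ti = P @ (eps @ M.T)             # = sum_k P_k * (M eps_k)
print(delta_theta_ti)
```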
In order to return to one single time independent parameter vector θ (new) , the vectors θ (new) t i need to be averaged over time t i . We start with a first tentative suggestion of averaging over time, and then explain why it is inappropriate, and what the correct way of time averaging has to look like. The tentative and most intuitive time average is: θ (new) = 1 N N−1 > i=0 θ (new) t i = 1 N N−1 > i=0 δθ t i + 1 N N−1 > i=0 M t i θ 188 Thus, we would update θ based on two terms. The first term is the average of δθ t i , which is reasonable as it reflects the knowledge we gained from the exploration noise. However, there would be a second update term due to the average over projected mean parameters θ from every time step – it should be noted that M t i is a projection matrix onto the range space of g t i under the metric R −1 , such that a multiplication with M t i can only shrink the norm of θ. From the viewpoint of having optimal parameters for every time step, this update component is reasonable as it trivially eliminates the part of the parameter vector that lies in the null space of g t i and which contributes to the command cost of a trajectory in a useless way. From the view point of a parameter vector that is constant and time independent and that is updated iteratively, this second update is undesirable, as the multiplication of the parameter vector θ with M t i in (6.19) and the averaging operation over the time horizon reduces the L 2 norm of the parameters at every iteration, potentially in an uncontrolled way 1 . What we rather want is to achieve convergence when the average of δθ t i becomes zero, and we do not want to continue updating due to the second term. The problem is avoided by eliminating the projection matrix in the second term of averaging, such that it become: θ (new) = 1 N N−1 > i=0 δθ t i + 1 N N−1 > i=0 θ = 1 N N−1 > i=0 δθ t i +θ The meaning of this reduced update is simply that we keep a component in θ that is irrelevant and contributes to our trajectory cost in a useless way. However, this irrelevant 1 To be precise, θ would be projected and continue shrinking until it lies in the intersection of all null spaces of thegt i basis function – this null space can easily be of measure zero. 189 component will not prevent us from reaching the optimal effective solution, i.e., the solutionthatliesintherangespaceofg t i . Giventhismodifiedupdate, itis, however, also necessary to derive a compatible cost function. As mentioned before, in the unmodified scenario, the last term of (6.12) is: 1 2 N−1 > j=i (θ+! t j ) T M T t j RM t j (θ+! t j ) (6.20) To avoid a projection of θ, we modify this cost term to be: 1 2 N−1 > j=i (θ+M t j ! t j ) T R(θ+M t j ! t j ) (6.21) With this modified cost term, the path integral formalism results in the desired θ (new) t i without the M t i projection of θ. The main equations of the iterative version of the generalized path integral formula- tion, called Policy Improvement with Path Integrals (PI 2 ), can be summarized as: P (τ i )= e − 1 λ S(τ i ) F e − 1 λ S(τ i ) dτ i (6.22) S(τ i )= φ t N + N−1 > j=i q t j dt+ 1 2 N−1 > j=i (θ+M t j ! t j ) T R(θ+M t j ! t j )dt (6.23) δθ t i = " P (τ i )M t i ! t i dτ i (6.24) [δθ] j = C N−1 i=0 (N−i) w j,t i [δθ t i ] j C N−1 i=0 w j,t i (N−i) (6.25) θ (new) = θ (old) +δθ (6.26) 190 Essentially, (6.22) computes a discrete probability at time t i of each trajectory roll-out with the help of the cost (6.23). 
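A compact NumPy sketch of one parameter update according to (6.22)-(6.26) is given below. It assumes that the costs-to-go S(τ_i), the projected exploration noise M_{t_i}ε_{t_i}, and the basis-function activations have already been collected from K roll-outs; the explicit λ used here can be replaced by the cost rescaling discussed below.

import numpy as np

# One PI^2 parameter update, following (6.22)-(6.26).  Assumed inputs:
#   S[k, i]      cost-to-go S(tau_i) of roll-out k from time step i       (6.23)
#   M_eps[k, i]  projected exploration noise M_{t_i} eps_{t_i}, p-dim     (6.24)
#   w[i, j]      activation of basis function j at time step i (stand-in for (6.3))

def pi2_update(theta, S, M_eps, w, lam=0.1):
    K, N, p = M_eps.shape
    P = np.exp(-S / lam)                            # (6.22): exponentiated cost ...
    P /= P.sum(axis=0, keepdims=True)               # ... normalized over the K roll-outs
    dtheta_t = np.einsum('ki,kip->ip', P, M_eps)    # (6.24): update for every time step
    weights = (N - np.arange(N))[:, None] * w       # (6.25): weight (N - i) times kernel activation
    dtheta = (weights * dtheta_t).sum(0) / (weights.sum(0) + 1e-10)   # time averaging (6.25)
    return theta + dtheta                           # (6.26)

# Placeholder data, only to show the expected shapes:
K, N, p = 10, 100, 10
theta_new = pi2_update(np.zeros(p), np.random.rand(K, N),
                       np.random.randn(K, N, p), np.random.rand(N, p))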
For every time step of the trajectory, a parameter update is computed in (6.24) based on a probability weighted average over trajectories. The parameter updates at every time step are finally averaged in (6.25). Note that we chose a weighted average by giving every parameter update a weight 2 according to the time steps left in the trajectory and the activation of the kernel in (6.3). This average can be interpreted as using a function approximator with only a constant (offset) parameter vector to approximate the time dependent parameters. Giving early points in the trajectory a higher weight is useful since their parameters affect a large time horizon and thus higher trajectory costs. Other function approximation (or averaging) schemes could be used to arrive at a final parameter update – we preferred this simple approach as it gave very good learning results. The final parameter update is θ (new) =θ (old) +δθ. The parameter λ regulates the sensitivity of the exponentiated cost and can auto- matically be optimized for every time step i to maximally discriminate between the experienced trajectories. More precisely, a constant term can be subtracted from (6.23) as long as all S(τ i ) remain positive – this constant term 3 cancels in (6.22). Thus, for a given number of roll-outs, we compute the exponential term in (6.22) as exp $ − 1 λ S(τ i ) % = exp $ −h S(τ i )−minS(τ i ) maxS(τ i )−minS(τ i ) % (6.27) 2 The use of the kernel weights in the basis functions (6.3) for the purpose of time averaging has shown better performance with respect to other weighting approaches, across all of our experiments. Therefore this is the weighting that we suggest. Users may develop other weighting schemes as more suitable to their needs. 3 In fact, the term inside the exponent results by adding hminS(τ i ) maxS(τ i )−minS(τ i ) , which cancels in (6.22), to the term− hS(τ i ) maxS(τ i )−minS(τ i ) which is equal to− 1 λ S(τi). 191 with h set to a constant, which we chose to be h = 10 in all our evaluations. The max and min operators are over all sample roll-outs. This procedure eliminates λ and leaves the variance of the exploration noise ! as the only open algorithmic parameter for PI 2 . It should be noted that the equations for PI 2 have no numerical pitfalls: no matrix inversions and no learning rates 4 , rendering PI 2 to be very easy to use in practice. The pseudocode for the finalPI 2 algorithm for a one dimensional control system with function approximation is given in Table 6.1. A tutorial Matlab example of applyingPI 2 can be found at http://www-clmc.usc.edu/software . 6.4 Evaluations of (PI 2 ) for optimal planning WeevaluatedPI 2 inseveralsyntheticexamplesincomparisonwithREINFORCE,GPOMDP, eNAC, and, when possible, PoWER. Except for PoWER, all algorithms are suitable for optimizing immediate reward functions of the kind r t = q t + u t Ru t . As mentioned above, PoWER requires that the immediate reward behaves like an improper probability. This property is incompatible with r t = q t +u t Ru t and requires some special nonlinear transformations, which usually change the nature of the optimization problem, such that PoWER optimizes a different cost function. Thus, only one of the examples below has a compatible a cost function for all algorithms, including PoWER. In all examples below, exploration noise and, when applicable, learning rates, were tuned for every individual al- gorithms to achieve the best possible numerically stable performance. 
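The λ-free exponentiation of (6.27) can be written as a small helper; h = 10 follows the text above, while the roll-out costs used in the example call are arbitrary placeholders.

import numpy as np

# Cost normalization of (6.27): the costs-to-go of the K roll-outs are rescaled
# to [0, 1] before exponentiation, which removes lambda as a free parameter.

def rollout_probabilities(S, h=10.0):
    # S: shape (K,), costs-to-go S(tau_i) of the K roll-outs at one time step
    S_scaled = (S - S.min()) / (S.max() - S.min() + 1e-10)
    e = np.exp(-h * S_scaled)
    return e / e.sum()                   # discrete probabilities P(tau_i) of (6.22)

print(rollout_probabilities(np.array([120.0, 95.0, 300.0, 101.0])))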
Exploration noise was only added to the maximally activated basis function in a motor primitive, and 4 R is a user design parameter and usually chosen to be diagonal and invertible. 192 Table 6.1: Pseudocode of the PI 2 algorithm for a 1D Parameterized Policy (Note that the discrete time step dt was absobed as a constant multiplier in the cost terms). • Given: – An immediate cost function r t =q t +θ T t Rθ t – A terminal cost term φ t N (cf. 4.25) – A stochastic parameterized policy a t =g T t (θ+! t ) – The basis function g t i from the system dynamics (cf. 4.2) – The varianceΣ ! of the mean-zero noise ! t – The initial parameter vector θ • Repeat until convergence of the trajectory cost R: – Create K roll-outs of the system from the same start state x 0 using stochstic parameters θ+! t at every time step – For k=1...K, compute: ∗ P (τ i,k )= e − 1 λ S(τ i,k ) " K k=1 [e − 1 λ S(τ i,k ) ] ∗ S(τ i,k )= φ t N ,k + C N−1 j=i q t j ,k + 1 2 C N−1 j=i+1 (θ+M t j ,k ! t j ,k ) T R(θ+M t j ,k ! t j ,k ) ∗ M t j ,k = R −1 g t j ,k g T t j ,k g T t j ,k R −1 g t j ,k – For i=1...(N−1), compute: ∗ δθ t i = C K k=1 [P (τ i,k )M t i ,k ! t i ,k ] – Compute [δθ] j = " N−1 i=0 (N−i) w j,t i [δθt i ] j " N−1 i=0 w j,t i (N−i) – Update θ←θ+δθ – Create one noiseless roll-out to check the trajectory cost R = φ t N + C N−1 i=0 r t i . In case the noise cannot be turned off, i.e., a stochastic system, multiple roll- outs need be averaged. 193 the noise was kept constant for the entire time that this basis function had the highest activation – empirically, this trick helped to improve the learning speed of all algorithms. 6.4.1 Learning Optimal Performance of a 1 DOF Reaching Task The first evaluation considers learning optimal parameters for a 1 DOF DMP (cf. Equa- tion 6.11). The immediate cost and terminal cost are, respectively: r t =0.5f 2 t +5000 θ T θ φ t N = 10000(˙ y 2 t N +10(g−y t N ) 2 ) (6.28) with y t 0 = 0 and g = 1 – we use radians as units motivated by our interest in robotics application, but we could also avoid units entirely. The interpretation of this cost is that we would like to reach the goalg with high accuracy while minimizing the acceleration of the movement and while keeping the parameter vector short. Each algorithm was run for 15 trials to compute a parameter update, and a total of 1000 updates were performed. Note that 15 trials per update were chosen as the DMP had 10 basis functions, and the eNAC requires at least 11 trials to perform a numerically stable update due to its matrix inversion. The motor primitives were initialized to approximate a 5-th order polynomial as point-to-point movement (cf. Figure 6.1a,b), called a minimum-jerk trajectory in the motor control literature; the movement duration was 0.5 seconds, which is similar to normal human reaching movements. Gaussian noise of N(0,0.1) was added to the initial parameters of the movement primitives in order to have different initial conditions for every run of the algorithms. The results are given in Figure 6.1. 
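For concreteness, the cost of this reaching task (6.28) and the minimum-jerk initialization can be sketched as follows. Reading f_t as the acceleration of the movement and approximating it by finite differences are implementation assumptions; the fifth-order polynomial is the standard minimum-jerk profile.

import numpy as np

# Sketch of the immediate and terminal costs (6.28) for the 1 DOF reaching task,
# together with the minimum-jerk trajectory used to initialize the DMP.

def min_jerk(y0, g, duration, dt):
    s = np.arange(0.0, duration, dt) / duration
    return y0 + (g - y0) * (10 * s**3 - 15 * s**4 + 6 * s**5)   # 5th-order polynomial

def reaching_cost(y, theta, g, dt):
    yd = np.gradient(y, dt)                        # velocity (finite differences)
    ydd = np.gradient(yd, dt)                      # acceleration (finite differences)
    r_t = 0.5 * ydd**2 + 5000.0 * (theta @ theta)  # immediate cost of (6.28)
    phi = 10000.0 * (yd[-1]**2 + 10.0 * (g - y[-1])**2)   # terminal cost of (6.28)
    return r_t.sum() * dt + phi

y = min_jerk(0.0, 1.0, duration=0.5, dt=0.002)
print(reaching_cost(y, np.zeros(10), g=1.0, dt=0.002))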
Figure 6.1a,b show the initial (before learning) trajectory generated by the DMP together with the 194 0 500000 1000000 1500000 2000000 2500000 3000000 1 10 100 1000 10000 15000 Cost Number of Roll-Outs -0.2 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 Position [rad] Time [s] Initial PI^2 REINFORCE PG NAC -1 0 1 2 3 4 5 6 0 0.1 0.2 0.3 0.4 0.5 Velocity [rad/s] Time [s] 0 500000 1000000 1500000 2000000 2500000 3000000 1 10 100 1000 2000 Cost Number of Roll-Outs a) b) c) d) Figure 6.1: Comparison of reinforcement learning of an optimized movement with mo- tor primitives. a) Position trajectories of the initial trajectory (before learning) and the results of all algorithms after learning – the different algorithms are essentially indistu- ighishable. b) The same as a), just using the velocity trajectories. c) Average learning curves for the different algorithms with 1 std error bars from averaging 10 runs for each of the algorithms. d) Learning curves for the different algorithms when only two roll-outs are used per update (note that the eNAC cannot work in this case and is omitted). learning results of the four different algorithms after learning – essentially, all algorithms achieve the same result such that all trajectories lie on top of each other. In Figure 6.1c, however, it can be seen that PI 2 outperforms the gradient algorithms by an order of magnitude. Figure 6.1d illustrates learning curves for the same task as in Figure 6.1c, just that parameter updates are computed already after two roll-outs – the eNAC was excluded from this evaluation as it would be too heuristic to stablize its ill-conditioned matrix inversion that results from such few roll-outs. PI 2 continues to converge much faster than the other algorithms even in this special scenario. However, there are some 195 noticable fluctuation after convergence. This noise around the convergence baseline is caused by using only two noisy roll-outs to continue updating the parameters, which causes continuous parameter fluctuations around the optimal parameters. Annealing the exploration noise, or just adding the optimal trajectory from the previous parameter update as one of the roll-outs for the next parameter update can alleviate this issue – we do not illustrate such little “tricks” in this paper as they really only affect fine tuning of the algorithm. 6.4.2 Learning optimal performance of a 1 DOF via-point task The second evaluation was identical to the first evaluation, just that the cost function now forced the movement to pass through an intermediate via-point at t = 300ms. This evaluation is an abstract approximation of hitting a target, e.g., as in playing tennis, and requires a significant change in how the movement is performed relative to the initial trajectory (Figure 6.2a). The cost function was r 300ms = 100000000(G−y t 300ms ) 2 φ t N = 0 (6.29) with G=0.25. Only this single reward was given. For this cost function, the PoWER algorithm can be applied, too, with cost function ˜ r 300ms = exp(−1/λ r 300ms ) and ˜ r t i = 0 otherwise. This transformed cost function has the same optimum as r 300ms . The resulting learning curves are given in Figure 6.2 and resemble the previous evaluation: PI 2 outperforms the gradient algorithms by roughly an order of magnitude, while all the gradient algorithms have almost identical learning curves. 
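The single via-point cost (6.29) and the exponential reward transformation needed by PoWER can be sketched as below; treating the delta-like cost as charged only at the discrete step closest to 300 ms, and the value of λ_r, are illustrative assumptions.

import numpy as np

# Sketch of the via-point cost (6.29) and of the transformed, pseudo-probability
# reward used so that PoWER can be applied to the same task.

def via_point_cost(y, G=0.25, t_via=0.3, dt=0.002):
    idx = int(round(t_via / dt))        # discrete time step closest to t = 300 ms
    return 1e8 * (G - y[idx]) ** 2      # this is the only reward that is given

def power_reward(r, lambda_r=1e6):
    return np.exp(-r / lambda_r)        # same optimum, but behaves like an improper probability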
As was expected from the 196 0 5000000 10000000 15000000 20000000 25000000 1 10 100 1000 10000 15000 Cost Number of Roll-Outs -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 1.2 0 0.1 0.2 0.3 0.4 0.5 Position [rad] Time [s] Initial PI^2 REINF. PG NAC PoWER a) b) G Figure 6.2: Comparison of reinforcement learning of an optimized movement with motor primitives for passing through an intermediate target G. a) Position trajectories of the initial trajectory (before learning) and the results of all algorithms after learning. b) Average learning curves for the different algorithms with 1 std error bars from averaging 10 runs for each of the algorithms. similarity of the update equations, PoWER and PI 2 have in this special case the same performance and are hardly distinguisable in Figure 6.2. Figure 6.2a demonstrates that all algorithms pass through the desired targetG, but that there are remaining differences between the algorithms in how they approach the targetG – these difference have a small numerical effect in the final cost (wherePI 2 and PoWER have the lowest cost), but these difference are hardly task relevant. 6.4.3 Learning optimal performance of a multi-DOF via-point task A third evaluation examined the scalability of our algorithms to a high-dimensional and highly redundant learning problem. Again, the learning task was to pass through an intermediate target G, just that a d=2,10, or 50 dimensional motor primitive was employed. We assume that the multi-DOF systems model planar robot arms, where d links of equal length l=1/d are connected in an open chain with revolute joints. Essentially, these robots look like a multi-segment snake in a plane, where the tail of the 197 snake is fixed at the origin of the 2D coordinate system, and the head of the snake can be moved in the 2D plane by changing the joint angles between all the links. Figure 6.3b,d,f illustrate the movement over time of these robots: the initial position of the robots is when all joint angles are zero and the robot arm completely coincides with the x-axis of the coordinate frame. The goal states of the motor primitives command each DOF to move to a joint angle, such that the entire robot configuration afterwards looks like a semi-circle where the most distal link of the robot (the end-effector) touches the y- axis. The higher priority task, however, is to move the end-effector through a via-point G = (0.5,0.5). To formalize this task as a reinforcement learning problem, we denote the joint angles of the robots as ξ i , with i=1,2,...,d, such that the first line of (6.11) reads now as ¨ ξ i,t = f i,t +g T i,t (θ i +! i,t ) – this small change of notation is to avoid a clash of variables with the (x,y) task space of the robot. The end-effector position is computed as: x t = 1 d d > i=1 cos( i > j=1 ξ j,t ),y t = 1 d d > i=1 sin( i > j=1 ξ j,t ) (6.30) The immediate reward function for this problem is defined as r t = C d i=1 (d+1−i) ? 0.1f 2 i,t +0.5 θ T i θ i @ C d i=1 (d+1−i) (6.31) Δr 300ms = 100000000 A (0.5−x t 300ms ) 2 +(0.5−y t 300ms ) 2 B (6.32) φ t N = 0 (6.33) whereΔr 300ms is added to r t at time t = 300ms, i.e., we would like to pass through the via-point at this time. The individual DOFs of the motor primitive were initialized as in 198 the 1 DOF examples above. The cost term in (6.31) penalizes each DOF for using high accelerationsandlargeparametervectors, whichisacriticalcomponenttoachieveagood resolution of redundancy in the arm. 
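The end-effector mapping (6.30) and the weighted cost (6.31) of the planar d-link arm translate directly into code. The sketch below uses link lengths 1/d as in the text; treating the DOF-wise accelerations and parameter vectors as given quantities is purely for illustration of how the terms combine.

import numpy as np

# Forward kinematics (6.30) of the planar d-link arm (equal link lengths 1/d)
# and the DOF-weighted immediate cost (6.31).

def end_effector(xi):
    # xi: joint angles xi_1 ... xi_d at one instant
    cumulative = np.cumsum(xi)                    # sum_{j<=i} xi_j for every link
    return np.cos(cumulative).sum() / len(xi), np.sin(cumulative).sum() / len(xi)

def immediate_cost(xi_dd, thetas):
    # xi_dd: accelerations of the d DOFs, thetas: list of d parameter vectors
    d = len(xi_dd)
    w = np.arange(d, 0, -1)                       # weights d+1-i: proximal DOFs cost more
    per_dof = 0.1 * xi_dd**2 + 0.5 * np.array([th @ th for th in thetas])
    return (w * per_dof).sum() / w.sum()

print(end_effector(np.zeros(10)))                 # arm along the x-axis: (1.0, 0.0)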
Equation (6.31) also has a weighting term d+1−i that penalizes DOFs proximal to the orgin more than those that are distal to the origin — intuitively, applied to human arm movements, this would mean that wrist movements are cheaper than shoulder movements, which is motivated by the fact that the wrist has much lower mass and inertia and is thus energetically more efficient to move. The results of this experiment are summarized in Figure 6.3. The learning curves in the left column demonstrate again that PI 2 has an order of magnitude faster learn- ing performance than the other algorithms, irrespective of the dimensionality. PI 2 also converges to the lowest cost in all examples: Algorithm 2-DOFs 10-DOFs 50-DOFs PI 2 98000±5000 15700±1300 2800±150 REINFORCE 125000±2000 22000±700 19500±24000 PG 128000±2000 28000±23000 27000±40000 NAC 113000±10000 48000±8000 22000±2000 Figure 6.3 also illustrates the path taken by the end-effector before and after learning. All algorithms manage to pass through the via-point G appropriately, although the path particularly before reaching the via-point can be quite different across the algorithms. Given that PI 2 reached the lowest cost with low variance in all examples, it appears to have found the best solution. We also added a “stroboscopic” sketch of the robot arm for the PI 2 solution, which proceeds from the very right to the left as a function of time. It 199 0 1000000 2000000 3000000 4000000 5000000 6000000 7000000 8000000 9000000 10000000 1 10 100 1000 10000 15000 Cost Number of Roll-Outs 0 1000000 2000000 3000000 4000000 5000000 6000000 7000000 8000000 9000000 10000000 1 10 100 1000 10000 15000 Cost Number of Roll-Outs a) b) c) d) 0 1000000 2000000 3000000 4000000 5000000 6000000 7000000 8000000 9000000 10000000 1 10 100 1000 10000 15000 Cost Number of Roll-Outs e) f) -0.4 0.1 0.6 0 0.5 1 y [m] x [m] Initial PI2 REINFORCE PG NAC G -0.4 0.1 0.6 0 0.5 1 y [m] x [m] G -0.4 0.1 0.6 0 0.5 1 y [m] x [m] G 2 DOF 10 DOF 50 DOF Figure 6.3: Comparison of learning multi-DOF movements (2,10, and 50 DOFs) with planar robot arms passing through a via-point G. a,c,e) illustrate the learning curves for different RL algorithms, while b,d,f) illustrate the end-effector movement after learning forallalgorithms. Additionally, b,d,f)alsoshowtheinitialend-effectormovement, before learning to pass through G, and a “stroboscopic” visualization of the arm movement for the final result of PI 2 (the movements proceed in time starting at the very right and ending by (almost) touching the y axis). 200 should be emphasized that there were absolutely no parameter tuning needed to achieve the PI 2 results, while all gradient algorithms required readjusting of learning rates for every example to achieve best performance. 6.4.4 Application to robot learning Figure 6.4 illustrates our application to a robot learning problem. The robot dog is to jump across as gap. The jump should make forward progress as much as possible, as it is amaneuverinaleggedlocomotioncompetitionwhichscoresthespeedoftherobot–note thatweonlyusedaphysicalsimulatoroftherobotforthisexperiment,astheactualrobot was not available. The robot has three DOFs per leg, and thus a total of d = 12 DOFs. Each DOF was represented as a DMP with 50 basis functions. 
An initial seed behavior (Figure 6.5-top) was taught by learning from demonstration, which allowed the robot barely to reach the other side of the gap without falling into the gap – the demonstration wasgenerated fromamanual adjustment of spline nodes in aspline-based trajectory plan for each leg. PI 2 learning used primarily the forward progress as a reward, and slightly penalized the squared acceleration of each DOF, and the length of the parameter vector. Addi- tionally, a penalty was incurred if the yaw or the roll exceeded a threshold value – these penalties encouraged the robot to jump straight forward and not to the side, and not to fall over. The exact cost function is: r t = r roll +r yaw + d > i=1 A a 1 f 2 i,t +0.5a 2 θ T i θ B (a 1 =1.e−6,a 2 =1.e−8) (6.34) 201 (a) Real & Simulated Robot Dog 0 100 200 300 400 500 600 1 10 100 Cost Number of Roll-Outs (b) Learning curve for Dog Jump with PI 2 ±1std Figure 6.4: Reinforcement learning of optimizing to jump over a gap with a robot dog. The improvement in cost corresponds to about 15 cm improvement in jump distance, which changed the robot’s behavior from an initial barely successful jump to jump that completely traversed the gap with entire body. This learned behavior allowed the robot to traverse a gap at much higher speed in a competition on learning locomotion. 202 r roll = 100∗(|roll t |−0.3) 2 , if (|roll t |> 0.3) 0, otherwise (6.35) r yaw = 100∗(|yaw t |−0.1) 2 , if (|yaw t |> 0.1) 0, otherwise (6.36) φ t N = 50000(goal−x nose ) 2 (6.37) where roll,yaw are the roll and yaw angles of the robot’s body, and x nose is the position of the front tip (the “nose”) of the robot in the forward direction, which is the direc- tion towards the goal. The multipliers for each reward component were tuned to have a balanced influence of all terms. Ten learning trials were performed initially for the first parameter update. The best 5 trials were kept, and five additional new trials were performed for the second and all subsequent updates. Essentially, this method performs importance sampling, as the rewards for the 5 trials in memory were re-computed with the latest parameter vectors. A total of 100 trials was performed per run, and ten runs were collected for computing mean and standard deviations of learning curves. (i.e., 5 updates), the performance of the robot was converged and significantly im- proved, such that after the jump, almost the entire body was lying on the other side of the gap. Figure 6.4 captures the temporal performance in a sequence of snapshots of the robot. It should be noted that applyingPI 2 was algorithmically very simple, and manual tuning only focused on generated a good cost function, which is a different research topic beyond the scope of this paper. 203 Figure 6.5: Sequence of images from the simulated robot dog jumping over a 14cm gap. Top: beforelearning. Bottom: Afterlearning. Whilethetwosequenceslookquitesimilar at the first glance, it is apparent that in the 4th frame, the robot’s body is significantly heigher in the air, such that after landing, the body of the dog made about 15cm more forward progress as before. In particular, the entire robot’s body comes to rest on the other side of the gap, which allows for an easy transition to walking. 6.5 Evaluations of (PI 2 ) on planning and gain scheduling In the next sections we evaluate the PI 2 on the problems of optimal planning and gain scheduling. 
In a typical planning scenario the goal is to find or to learn trajectories which minimize some performance criterion. As we have seen in the previous sections at every iteration of the learning algorithm, new trajectories are generated based on which the new planing policy is computed. The new planning policy is used at the next iteration to generated new trajectories which are again used to compute the improved planning policy. The process continues until the convergence criterion is meet. In this learning process main assumption is the existence of a control policy that is adequate to steer the system such that it follows the trajectories generated at every it- eration of the learning procedure. In this section we go one step further and apply PI 2 not only to find the optimal desired trajectories but also to learn control policies that minimize a performance criterion. This performance criterion is a function of kinematic 204 variables of the underlying dynamics and the strength of control gains that are incorpo- rated in the control policy. Essentially, the goal for the robot is to be able to perform the task with as lower gains as possible. 6.6 Way-point experiments We start our evaluations with way -point experiments in two simulated robots, the 3DOF Phantom robot and the 6DOF Kuka robot. For both robots, the immediate reward at time step t is given as: r(t)=w gain > i K i P,t +w acc ||¨ x||+w subgoal C(t) (6.38) Here, C i K i P,t is the sum over the proportional gains over all joints. The reasoning behind penalizing the gains is that low gains lead to several desirable properties of the system such as compliant behavior (safety and/or robustness (Buchli, Kalakrishnan, Mis- try, Pastor & Schaal 2009)), lowered energy consumption, and less wear and tear. The term ||¨ x|| is magnitude of the accelerations of the end-effector. This quantity is penalized to avoid high-jerk end-effector motion. This penalty is low in comparison to the gain penalty. The robot’s primary task is to pass through an intermediate goal, either in joint space or end-effector space – such scenarios occur in tasks like playing tennis or table tennis. The component of the cost function C(t) that represents this primary task will be described individually for each robot in the next sections. Gains and accelerations are penalized at each time step, but C(t) only leads to a cost at specific time steps along 205 the trajectory. Finally for both robots, the cost weights are w int = 2000, w gain =1/N, w acc =1/N. Dividing the weights by the number of time steps N is convenient, as it makes the weights independent of the duration of a movement. 6.6.1 Phantom robot, passing through waypoint in joint space ThePhantomPremium1.5Robotisa3DOF,twolinkarm. Ithastworotationaldegrees of freedom at the base and one in the arm. We use a physically realistic simulation of this robot generated in SL (Schaal 2009), as depicted in Figure 6.6. Figure 6.6: 3-DOF Phantom simulation in SL. The task for this robot is intentionally simple and aimed at demonstrating the ability to tune task relevant gains in joint space with straightforward and easy to interpret data. The duration of the movement is 2.0s, which corresponds to 1000 time steps at 500Hz servo rate. The intermediate goals for this robot are set as follows: 206 C(t)= δ(t−0.4) | q SR (t)+0.2 |+δ(t−0.8) | q SFE (t)−0.4 |+δ(t−1.2) | q EB (t)−1.5 | This penalizes joint SR for not having an angle q SR =−0.2 at time t=0.4s. 
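A sketch of the immediate reward (6.38) and of the joint-space intermediate-goal cost C(t) for the Phantom task follows. Reading each δ(t − t_goal) as a cost charged only at the nearest discrete time step is the natural implementation choice; the joint names are used as dictionary keys purely for illustration.

import numpy as np

# Immediate reward (6.38) for the way-point experiments, and the Phantom robot's
# joint-space subgoal cost C(t) with the delta functions discretized in time.

def immediate_reward(K_p, xdd, C_t, N, w_gain=1.0, w_acc=1.0, w_subgoal=2000.0):
    # K_p: proportional gains of all joints, xdd: end-effector acceleration vector
    return (w_gain / N) * K_p.sum() + (w_acc / N) * np.linalg.norm(xdd) + w_subgoal * C_t

def phantom_subgoal_cost(q, t, dt):
    # q: current joint angles keyed by joint name; targets and times follow the text
    targets = {'SR': (-0.2, 0.4), 'SFE': (0.4, 0.8), 'EB': (1.5, 1.2)}
    c = 0.0
    for joint, (angle, t_goal) in targets.items():
        if abs(t - t_goal) < 0.5 * dt:            # discretized delta(t - t_goal)
            c += abs(q[joint] - angle)
    return c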
Joints SFE and EB are also required to go through (different) intermediate angles at times 0.8s and 1.2s respectively. The initial parameters θ i for the reference trajectory are determined by training the DMPswithaminimumjerktrajectory(Zefran, Kumar&Croke1998)injointspacefrom q t=0.0 = [0.00.32.0] T to q t=2.0 =[−0.60.81.4] T . The function approximator for the proportional gains of the 3 joints is initialized to return a constant gain of 6.0Nm/rad. The initial trajectories are depicted as red, dashed plots in Figure 6.8, where the angles and gains of the three joints are plotted against time. Since the task ofPI 2 is to optimize both trajectories and gains with respect to the cost function, this leads to a 6-D RL problem. The robot executes 100 parameter updates, with 4 noisy exploration trials per update. After each update, we perform one noise-less test trial for evaluation purposes. Figure 6.7 depicts the learning curve for the phantom robot (left), which is the overall cost of the noise-less test trial after each parameter update. The joint space trajectory and gain scheduling after 100 updates are depicted as blue, solid lines in Figure 6.8. From these graphs, we draw the following conclusions: • PI 2 has adapted the initial minimum jerk trajectories such that they fulfill the task and pass through the desired joint angles at the specified times. These intermediate goals are represented by the circles on the graphs. 207 Figure 6.7: Learning curves for the phantom robot. • Because the magnitude of gains is penalized in general, they are low when the task allows it. After t=1.6s, all gains drop to the minimum value 5 , because accurate tracking is no longer required to fulfill the goal. Once the task is completed, the robot becomes maximally compliant, as one would wish it to be. • Whentherobotisrequiredtopassthroughtheintermediatetargets, itneedsbetter tracking, and therefore higher gains. Therefore, the peaks of the gains correspond roughly to the times where the joint is required to pass through an intermediate point. • Due to nonlinear effects, e.g., Coriolis and centripedal forces, the gain schedule shows more complex temporal behavior as one would initially assume from specify- ing three different joint space targets at three different times. 5 We bounded the gains between pre-specified maximum and minimum values. Too high gains would generate oscillations and can lead to instabilities of the robot, and too low gains lead to poor tracking such that the robot frequently runs into the joint limits. 208 Figure 6.8: Initial (red, dashed) and final (blue, solid) joint trajectories and gain schedul- ing for each of the three joints of the phantom robot. Yellow circles indicate intermediate subgoals. In summary, we achieved the objective of variable impedance control: the robot is compliant when possible, but has a higher impedance when the task demands it. 6.6.2 Kuka robot, passing through a waypoint in task space Nextweshowasimilartaskona6DOFanthropomorphicarm,aKukaLight-WeightArm. Thisexampleillustratesthatourapproachscaleswelltohigher-dimensionalsystems, and also that appropriate gains schedules are learned when intermediate targets are chosen in end-effector space instead of joint space. The duration of the movement is 1.0s, which corresponds to 500 time steps. This time, the intermediate goal is for the end-effectorx to pass through [ 0.70.30.1] T at time t=0.5s: 209 Figure 6.9: Learning curves for the Kuka robot. 
C(t)= δ(t−0.5)(x−[0.70.30.1] T ) (6.39) Thesixjointtrajectoriesareagaininitializedasminimumjerktrajectories. Asbefore, the resulting initial trajectory is plotted as red, dashed line in Figure 6.10. The initial gains are set to a constant [60,60,60,60,25,6] T . Given these initial conditions, finding the parameter vectors for DMPs and gains that minimizes the cost function leads to a 12-D RL problem. We again perform 100 parameter updates, with 4 exploration trials per update. 210 The learning curve for this problem is depicted in Figure 6.9. The trajectory of the end-effector after 30 and 100 updates is depicted in Figure 6.10. The intermediate goal at t=0.5 is visualized by circles. Finally, Figure 6.11 shows the gain schedules after 30 and 100 updates for the 6 joints of the Kuka robot. Figure 6.10: Initial (red, dotted), intermediate (green, dashed), and final (blue, solid) end-effector trajectories of the Kuka robot. From these graphs, we draw the following conclusions: • PI 2 has adapted joint trajectories such that the end-effector passes through the intermediate subgoal at the right time. It learns to do so after only 30 updates (Figure 6.7). • After 100 updates the peaks of most gains occur just before the end-effector passes through the intermediate goal (Figure 6.11), and in many cases decrease to the minimum gain directly afterwards. As with the phantom robot we observe high impedance when the task requires accuracy, and more compliance when the task is relatively unconstrained. 211 Figure 6.11: Initial (red, dotted), intermediate (green, dashed), and final (blue, solid) joint gain schedules for each of the six joints of the Kuka robot. • The second joint (GA2) has the most work to perform, as it must support the weight of all the more distal links. Its gains are by far the highest, especially at the intermediate goal, as any error in this DOF will lead to a large end-effector error. • The learning has two phases. In the first phase (plotted as dashed, green), the robot is learning to make the end-effector pass through the intermediate goal. At 212 thispoint,thebasicshapeofthegainschedulinghasbeendetermined. Inthesecond phase, PI 2 fine tunes the gains, and lowers them as much as the task permits. 6.7 Manipulation task 6.7.1 Task 2: Pushing open a door with the CBi humanoid In this task, the simulated CBi humanoid robot (Cheng, Hyon, Morimoto, Ude, Hale, Colvin, Scroggin & Jacobsen 2007) is required to open a door. This robot is accurately simulated with the SL software (Schaal 2009). For this task, we not only learn the gain schedules, but also improve the planned joint trajectories withPI 2 simultaneously. Regarding the initial trajectory in this task, we fix the base of the robot, and consider only the 7 degrees of freedom in the left arm. The initial trajectory before learning is a minimum jerk trajectory in joint space. In the initial state, the upper arm is kept parallel to the body, and the lower arm is pointing forward. The target state is depicted inFigure6.12. Withthistask, wedemonstratethatourapproachcannotonlybeapplied to imitation of observed behavior, but also to manually specify trajectories, which are fine-tuned along with the gain schedules. The gains of the 7 joints are initialized to 1/10th of their default values. This leads to extremely compliant behavior, whereby the robot is not able to exert enough force to overcome the static friction of the door, and thus cannot move it. The minimum gain for all joints was set to 5. 
Optimizing both joint trajectories and gains leads to a 14-dimensional learning problem. 213 The terminal cost is the degree to which the door was opened, i.e. φ t N = 10 4 ·(ψ max − ψ N ),wherethemaximumdooropeningangleψ max is0.3rad(itisoutofreachotherwise). The immediate cost for the gains is again q t = 1 N C 3 i=1 K i P . The sum of the gains of all joints is divided by the number of time steps of the trajectory N, to be independent of trajectory duration. The cost for the gains expresses our preference for low gain control. The variance of the exploration noise for the gains is again 10 −4 γ n , and for the joint trajectories10γ n , bothwithdecayparameter λ=0.99andnthenumberofupdates. The relativelyhighexplorationnoiseforthejointtrajectoriesdoesnotexpresslessexploration per se, but is rather due to numerical differences in using the function approximator to model the gains directly rather than as the non-linear component of a DMP. The number of executed and reused ‘elite’ roll-outs is both 5, so the number of roll-outs on which the update is performed is K = 10. Figure 6.12 (right) depicts the total cost of the noise-less test trial after each update. The costs for the gains are plotted separately. When all of the costs are due to gains, i.e. the door is opened completely to ψ max and the task is achieved, the graphs of the total cost and that of the gains coincide. Here, it can be clearly seen that the robot switches to high-gain control in the first 6 updates (costs of gains go up) to achieve the task (cost of not opening the door goes down). Then, the gains are lowered at each update, until they are lower than the initial values. The joint trajectories and gain schedules after 0, 6 and 100 updates are depicted in Figure 6.13. 214 Figure 6.12: Left: Task scenario. Right: Learning curve for the door task. The costs specific to the gains are plotted separately. 6.7.2 Task 3: Learning tasks on the PR2 In (Pastor, Kalakrishnan, Chitta, Theodorou & Schaal 2011)PI 2 was used on PR2 robot for learning how to perform two manipulations tasks: learning billiard and rolling a box withchopsticks. Inthissection,weleavethedetailsoftheapplicationofPI 2 andwefocus on the design of the cost function for these two tasks. A more thorough and in detailed discussion on the application of PI 2 on the PR2 can be found in (Pastor et al. 2011). For the first task of learning to play billiard the critical issue is to find the states which are relevant. These state are illustrated in 6.14 and they consist of the cue roll, pitch, yaw, the elbow posture and the cue tip offset. The cost function used is minimizes large cue displacements, time to the target and distance to the target. Thus: 215 Figure 6.13: Learned joint angle trajectories (center) and gain schedules (right) of the CBi arm after 0/6/100 updates. q(x)=w 1 ×(Displacement)+w 2 ×(Time to Target)+w 3 ×(Distance to Target) For the second tasks, the goal for the robot is to learn to flip the box by using chopsticks. The state dependent cost function for this particular task penalizes high box accelerations measured by an IMU insight the box, high forces measured in the tactile sensors of PR2 robot and high arm accelerations measured by an accelerometer at each gripper. The terminal cost penalizes deviation from the desired state which is the one with the boxed flipped. 
Thus the cost function is expressed as:

q(x) = w_1 × (Box acceleration) + w_2 × (Force) + w_3 × (Gripper acceleration)

φ(x_{t_N}) = w_4 × (Terminal state error)

In Figure 6.15 the initial and the final policies are illustrated. As we can see, the robot learns how to successfully flip the box.

Figure 6.14: Relevant states for learning how to play billiard (cue roll, pitch, yaw, elbow posture, and cue tip offset).

Figure 6.15: Initial and final policies for rolling the box.

6.8 Discussion

We have applied the PI^2 algorithm, which is a modified version of iterative path integral control, to the problems of optimal planning and control for robotic tasks. The DMPs, which correspond to nonlinear dynamical systems with an adjustable landscape, play an essential role in the representation of kinematic trajectories and control gains.

The key results of the path integral control formalism, which were presented in Table 4.1 and Section 4.4, consider how to compute the optimal controls for a general class of stochastic control systems with state-dependent control transition matrix. One important class of these systems can be interpreted in the framework of reinforcement learning with parameterized policies. For this class, we derived Policy Improvement with Path Integrals (PI^2) as a novel algorithm for learning a parameterized policy. PI^2 inherits its sound foundation in first-order principles of stochastic optimal control from the path integral formalism. It is a probabilistic learning method without open algorithmic tuning parameters, except for the exploration noise. In our evaluations, PI^2 outperformed gradient algorithms significantly. It is also numerically simpler and allows easier cost function design than previous probabilistic RL methods, which require immediate rewards to behave like pseudo-probabilities. The similarity of PI^2 to algorithms based on probability matching indicates that the principle of probability matching seems to approximate a stochastic optimal control framework. Our evaluations demonstrated that PI^2 can scale to high-dimensional control systems, unlike many other reinforcement learning methods.

The mathematical structure of the PI^2 algorithm makes it suitable for optimizing reference trajectories and gain schedules simultaneously. This is similar to classical DDP. We evaluated our approach on two simulated robot systems, which posed up to 14-dimensional learning problems in continuous state-action spaces. The goal was to learn compliant control while fulfilling kinematic task constraints, like passing through an intermediate target. The evaluations demonstrated that the algorithm behaves as expected: it increases gains when needed, but tries to maintain low-gain control otherwise. The optimal reference trajectory always fulfilled the task goal. Learning speed was rather fast, i.e., within at most a few hundred trials the task objective was accomplished. From a machine learning point of view, this performance of a reinforcement learning algorithm is very fast. The PI^2 algorithm inherits the properties of all trajectory-based learning algorithms in that it only finds locally optimal solutions. For high-dimensional robotic systems, this is unfortunately all one can hope for, as exploring the entire state-action space in search of a globally optimal solution is impossible.

We continue our discussion in the next subsections with some issues that deserve more detailed treatment.

6.8.1 Simplifications of PI^2

In this section we discuss simplifications of PI^2.
The discussion starts with research directions that may allow us to remove the assumed coupling between the control weight matrix and the variance of the noise. Moreover, we show how PI^2 can be used in a model-based, semi-model-based, or model-free way. Finally, we discuss some rules for cost function design, as well as how PI^2 handles hidden states in the state vector and arbitrary states in the cost function.

6.8.2 The assumption λR^{-1} = Σ_ε

In order to obtain linear 2nd-order differential equations for the exponentially transformed HJB equations, the simplification λR^{-1} = Σ_ε was applied. Essentially, this assumption couples the control cost to the stochasticity of the system dynamics, i.e., a control with high variance will have relatively small cost, while a control with low variance will have relatively high cost. This assumption makes intuitive sense, as it would be mostly unreasonable to attribute a lot of cost to an unreliable control component. Algorithmically, this assumption transforms the Gaussian probability for state transitions into a quadratic command cost, which is exactly what our immediate reward function postulated. Future work may allow removing this simplification by applying nonlinear versions of the Feynman-Kac lemma.

6.8.3 Model-based, Hybrid, and Model-free Learning

Stochastic optimal control with path integrals makes a strong link to the dynamic system to be optimized – indeed, originally it was derived solely as a model-based method. As this paper demonstrated, however, this view can be relaxed. The roll-outs needed for computing the optimal controls can be generated either by simulating a model or by gathering experience from an actual system. In the latter case, only the control transition matrix of the model needs to be known, such that we obtain a hybrid model-based/model-free method. In this work, we went even further and interpreted the stochastic dynamic system as a parameterized control policy, such that no knowledge of the model of the control system was needed anymore – i.e., we entered a model-free learning domain. It seems that there is a rich variety of ways in which the path integral formalism can be used in different applications.

Further simplifications of PI^2 can be considered if one substitutes the optimal controls into the stochastic dynamics. More precisely, the optimal controls are expressed as:

u(\tau_i)\,dt = R^{-1} G_{t_i}^{(c)T} \left( G_{t_i}^{(c)} R^{-1} G_{t_i}^{(c)T} \right)^{-1} \sum_{k=1}^{K} \tilde{p}^{(k)}(\tau_i)\, G_{t_i}^{(c)}\, dw_{t_i}^{(k)}    (6.40)

When the controls above are applied to the stochastic dynamics, they are multiplied by the matrix G_{t_i}^{(c)}. This multiplication results in:

G_{t_i}^{(c)} u(\tau_i)\,dt = \sum_{k=1}^{K} \tilde{p}^{(k)}(\tau_i)\, G_{t_i}^{(c)}\, dw_{t_i}^{(k)}    (6.41)

The equation above suggests simplifications of PI^2 which will be explored in future work. As the evaluations in this chapter show, PI^2, in its current form, has remarkably robust performance in a variety of robotic learning control tasks.

6.8.4 Rules of cost function design

The cost functions allowed in our formulation can have arbitrary state cost, but need quadratic command cost. This is somewhat restrictive, although the user can be flexible in what is defined as a command. For instance, the dynamic movement primitives (6.11) used in this paper can be written in two alternative ways:

\frac{1}{\tau}\dot{z}_t = f_t + g_t^T(\theta + \epsilon_t)    (6.42)

or

\frac{1}{\tau}\dot{z}_t = \left( g_t^T \;\; f_t \right) \left( \begin{pmatrix} \theta \\ 1 \end{pmatrix} + \tilde{\epsilon}_t \right)    (6.43)

where the new noise vector \tilde{\epsilon}_t has one additional coefficient. The second equation treats f_t as another basis function whose parameter is constant and is thus simply not updated.
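A minimal sketch of the reformulation (6.43): f_t is appended to the basis vector with a fixed parameter of one, so that the spring-damper term is accounted for on the command side. The dimensions and the placement of the noise on the augmented parameter vector follow the reading given above and are meant as illustration only.

import numpy as np

# Rewriting the transformation system as in (6.43): f_t becomes an extra basis
# function whose parameter is pinned to 1 and is never updated by learning.

def zdot_augmented(g_t, f_t, theta, eps_tilde):
    g_aug = np.concatenate([g_t, [f_t]])          # augmented basis vector (g_t^T, f_t)
    theta_aug = np.concatenate([theta, [1.0]])    # augmented parameters (theta^T, 1)^T
    return g_aug @ (theta_aug + eps_tilde)        # reduces to f_t + g_t^T theta for zero noise

p = 10
g_t = np.random.rand(p)
print(np.isclose(zdot_augmented(g_t, 2.5, np.zeros(p), np.zeros(p + 1)), 2.5))   # True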
Thus, we added f_t to the command cost instead of treating it as a state cost. We also numerically experimented with violations of the clean distinction between state and command cost. Equation (6.23) could be replaced by a cost term which is an arbitrary function of state and command. In the end, this cost term is just used to differentiate the different roll-outs in a reward-weighted average, similarly as in (Peters & Schaal 2008a, Kober & Peters 2009). We noticed in several instances that PI^2 continued to work just fine with this improper cost formulation. Again, it appears that the path integral formalism and the PI^2 algorithm allow the user to exploit creativity in designing cost functions, without an absolute need to adhere perfectly to the theoretical framework.

6.8.5 Dealing with hidden state

Finally, it is interesting to consider how far PI^2 would be affected by hidden state. Hidden state can be of either stochastic or deterministic nature, and we consider hidden state as adding additional equations to the system dynamics (4.2).

Section 4.2 already derived that deterministic hidden states drop out of the PI^2 update equations – these components of the system dynamics were termed "uncontrolled" equations.

More interesting are hidden state variables that have stochastic differential equations, i.e., these equations are uncontrolled but do have a noise term and a non-zero corresponding coefficient in G_t in equation (4.2), and these equations are coupled to the other equations through their passive dynamics. The noise terms of these equations would, in theory, contribute terms to Equation (6.23), but given that neither the noise nor the state of these equations is observable, we will not have the knowledge to add these terms. However, as long as the magnitude of these terms is small relative to the other terms in Equation (6.23), PI^2 will continue to work fine, just a bit sub-optimally. This issue would affect other reinforcement learning methods for parameterized policies in the same way, and is not specific to PI^2.

6.8.6 Arbitrary states in the cost function

As a last point, we would like to consider which variables can actually enter the cost functions for PI^2. The path integral approach prescribes that the cost function needs to be a function of the state and command variables of the system equations (4.2). It should be emphasized that the state cost q_t can be any deterministic function of the state, i.e., anything that is predictable from knowing the state, even if we do not know the predictive function. There is a lot of flexibility in this formulation, but it is also more restrictive than other approaches, e.g., policy gradients or the PoWER algorithm, where arbitrary variables can be used in the cost, no matter whether they are states or not.

We can think of any variable that we would like to use in the cost as having a corresponding differential equation in the system dynamics (4.2), i.e., we simply add these variables as state variables, just that we do not know the analytical form of these equations. As in the previous section, it is useful to distinguish whether these states have deterministic or stochastic differential equations.
223 If the differential equation is deterministic, we can cover the case with the derivations from Section 4.2, i.e., we consider such an equation as uncontrolled deterministic differ- ential equation in the system dynamics, and we already know that we can use its state in the cost without any problems as it does not contribute to the probability of a roll-out. If the differential equation is stochastic, the same argument as in the previous section applies, i.e., the (unknown) contribution of the noise term of this equation to the expo- nentiated cost (6.23) needs to be small enough for PI 2 to work effectively. Future work and empirical evaluations will have to demonstrate when these issues really matter – so far, we have not encountered problems in this regard. 224 Chapter 7 Neuromuscular Control Neuromuscular control or control of bio-mechanical models is one of the areas, in which the optimal control theory has been applied with significant contributions. These con- tributions are related to a better understanding of bio-mechanical and neuromuscular structure in terms of its functionality and design. In this chapter we are discussing the main characteristic of bio-mechanical systems and we investigate the main challenges in modeling such systems. More precisely, in section 7.1 we present the main differences between torque driven and tendon driven systems and we discuss the alternative use of control theory. In section 7.2 we have a literature review on the skeletal mechanics mod- eling approaches. In section 7.3 we discuss various modeling choices regarding the high dimensionality and redundancy of neuromuscular systems. We continue with the section 7.4 which reviews previous work on musculotendon routing models. In section 7.5 the application of optimal control to psychophysical and bio-mechanical models is discussed. In the last section 8.5 we conclude with the main points of this chapter. 225 7.1 Tendon driven versus torque driven actuation The gap in the functionality and robustness between robotic and human hands has its origins in our lack of understanding of design principles based on control theoretic ideas applicable to complex biomechanical structures such as the hand. From the control theoretic standpoint, the control of a highly dimensional and nonlinear stochastic plant ofthecomplexityofaroboticorbiomechanicalhandisnotaneasytask–whichalsomakes it difficult to understand the neuromuscular control of the hand. To appreciate the high dimensionality, it is enough to consider that more than 35 tendons must be controlled by the nervous system (Freivalds 2000). Some critical questions that remain open are: • Whatstrategiesdoesthenervoussystemuseformovingthefingergiventhegeomet- rical and mechanical characteristics of the muscular-tendon-bone structure? How sensitive these strategies are with respect to variations in the underlying dynamics and moment arm geometry? There are few important differences between torque driven and tendons driven bio- mechanical structures. In particular, in tendon driven systems, the number of control variables is usually higher than the number of corresponding controls in torque driven systems. Forexample,forthecaseoftheindexfinger,thereare7actuatingtendonswhich produce the required torque around the 3 joins, while in torque actuated mechanical fin- gerssystems, 3torquebasedcontrolvariablesaresufficienttoproduceplanarmovements. 
An additional component is that, the tendon actuation is constrained since tendons can only pull and not push while in most robotic systems that are torque driven, the control variables can take negative and positive values to generate negative or positive torques 226 around joins. The limits or control constrains for the case of torque driven control sys- temsareduetotorquesaturation. Clearlytheactuationmechanismisdifferentintendon driven and torque driven dynamical systems. A step towards understanding the role of each tendon for the production of a movement is to control a bio-mechanical model and discover the underlying control strategies. In order to apply a control theoretic approach, a model of the underlying neuromus- cular dynamics is required. This model is usually built based on the knowledge of the physiology and anatomy of the bio-mechanical system under investigation, and it is, with no doubt, an approximation of the true dynamics. Given this “ acceptable “ model a control theoretic approach is used to generated the desired behavior. The main goal in this form of scientific reasoning is to generate with the use of control the same dynamic behaviorwiththeoneobservedexperimentally. Providedthatboththeexperimenterand the theoretician trust the bio-mechanical model the claim is that the underlying control strategies matches the one that was used to generated the desired behavior in simulation and thus these control strategies is what the nervous systems may implement. Clearly this is one way of making use of control theory which relies on the assump- tion that the model captures the main characteristics of the bio-mechanics and it is an acceptable approximation. Nevertheless, there are examples and cases of bio-mechanical systems for which there is no such a good model or if there is, then it is very sensitive with respect to parameter variations. In such cases, the use of control theory could be twofold. On one side it can be used as a verification tool of every proposed model as a candidate while one the other side it can be used to explore the sensitivity of the model with respect to critical parameters. 227 Wewillleavethisdiscussionforthenextchapterandinthenextsectionswearefocus- ing on previous effort of bio-mechanical modeling based on the characteristics of skeletal mechanics and the redundancy and high dimensionality of neuromuscular systems. 7.2 Skeletal Mmechanics In neuromuscular function studies, skeletal segments are generally modeled as rigid links connected to one another by mechanical pin joints with orthogonal axes of rotation. These assumptions are tenable in most cases, but their validity may depend on the pur- pose of the model. Some joints like the thumb carpometacarpal joint, the ankle and shoulder joints are complex and their rotational axes are not necessarily perpendicular [46][48], or necessarily consistent across subjects (Hollister, Buford, Myers, Giurintano & Novick 1992), (Santos & Valero-Cuevas 2006), (Cerveri, De Momi, Marchente, Lopomo, Baud-Bovy, Barros & Ferrigno 2008). Assuming simplified models may fail to capture the real kinematics of these systems (Valero-Cuevas, Johanson & Towles 2003). While passivemomentsduetoligamentsandothersofttissuesofthejointareoftenneglected,at times they are modeled as exponential functions of joint angles (Yoon & Mansour 1982), (Hatze 1997) at the extremes of range of motion to passively prevent hyper-rotation. 
In other cases, passive moments well within the range of motion could be particularly im- portant in the case of systems like the fingers (Esteki & Mansour 1996), (Sancho-Bru, Prez-Gonzlez, Vergara-Monedero & Giurintano 2001) where skin, fat and hydrostatic pressure tend to resist flexion. Modeling of contact mechanics could be important for joints like the knee and the ankle where there is significant loading on the articulating 228 surfaces of the bones, and where muscle force predictions could be affected by contact pressure. Joint mechanics are also of interest for the design of prostheses, where the knee or hip could be simulated as contact surfaces rolling and sliding with respect to each other (Rawlinson & Bartel 2002),(Rawlinson, Furman, Li, Wright & Bartel 2006). Several studies estimate contact pressures using quasi-static models with deformable con- tact theory (e.g., (Wismans, Veldpaus, Janssen, Huson & Struben 1980),(Blankevoort, Kuiper, Huiskes & Grootenboer n.d.)). But these models fail to predict muscle forces during dynamic loading. Multibody dynamic models with rigid contact fail to predict contact pressures (Piazza & Delp 2001). 7.3 Dimensionality and redundancy The first decision to be made when assembling a musculoskeletal model is to define dimensionalityofthemusculoskeletalmodel(i.e.,numberofkinematicdegrees-of-freedom andthenumberofmusclesactingonthem). Ifthenumberofmusclesexceedstheminimal number required to control a set of kinematic DOF, the musculoskeletal model will be redundant for some sub-maximal tasks. The validity and utility of the model to the research question will be affected by the approach taken to address muscle redundancy. Most musculoskeletal models have a lower dimensionality than the actual system they are simulating because it simplifies the mathematical implementation and analysis, or becausealow-dimensionalmodelisthoughtsufficienttosimulatethetaskbeinganalyzed. Kinematicdimensionalityisoftenreducedtolimitmotiontoaplanewhensimulatingarm motion at the level of the shoulder(Abend, Bizzi & Morasso 1982),(Mussa-Ivaldi, Hogan 229 & Bizzi 1982), when simulating fingers flexing and extending (Dennerlein, Diao, Mote & Rempel 1998) or when simulating leg movements during gait (Olney, Griffin, Monga & McBride 1991). Similarly, the number of independently controlled muscles is often reduced (An, Chiao, Cooney & Linscheid 1985) for simplicity, or even made equal to the number of kinematic degrees-of-freedom to avoid muscle redundancy (Harding, Brandt & Hillberry 1993). While reducing the dimensionality of a model can be valid in many occasions, one needs to be careful to ensure it is capable of replicating the function being studied. Forexample,aninappropriatekinematicmodelcanleadtoerroneouspredictions (Valero-Cuevas, Towles & Hentz 2000), (Jinha, Ait-Haddou, Binding & Herzog 2006), or reducing a set of muscles too severely may not be sufficiently realistic for clinical purposes. A subtle but equally important risk is that of assembling a kinematic model with a given number of degrees of freedom, but then not considering the full kinematic output. For example, a three-joint planar linkage system to simulate a leg or a finger has three kinematic DOF at the input, and also three kinematic degrees of freedom at the output: the x and y location of the endpoint plus the orientation of the third link. 
As a rule, the number of rotational degrees-of-freedom (i.e., joint angles) maps into as many kinematic degrees-of-freedom at the endpoint (Murray, Li & Sastry 1994). Thus, for example, studying muscle coordination for endpoint location without considering the orientation of the terminal link can lead to variable results. As we have described in the literature (Valero-Cuevas, Zajac & Burgar 1998), (Valero-Cuevas 2009), the geometric model and Jacobian of the linkage system need to account for all input and output kinematic degrees-of-freedom to properly represent the mapping from muscle actions to limb kinematics and kinetics.

7.4 Musculotendon routing

Next, we need to select the routing of the musculotendon unit consisting of a muscle and its tendon in series (Zajac 1989), (Zajac 1992). The reason we speak in general about musculotendons (and not simply tendons) is that in many cases it is the belly of the muscle that wraps around the joint (e.g., gluteus maximus over the hip, medial deltoid over the shoulder). In other cases, however, it is only the tendon that crosses any joints, as in the case of the patellar tendon of the knee or the flexors of the wrist. In addition, the properties of long tendons affect the overall behavior of the muscle, for example by stretching out the force-length curve of the muscle fibers (Zajac 1989). Most studies assume, correctly, that musculotendons insert into bones at single points or at multiple discrete points (if the actual muscle attaches over a long or broad area of bone). Musculotendon routing defines the direction of travel of the force exerted by a muscle when it contracts. This defines the moment arm $r$ of a muscle about a particular joint, and determines both the excursion $\delta s$ the musculotendon will undergo as the joint rotates by an angle $\delta\theta$, given by $\delta s = r\,\delta\theta$, and the joint torque due to the muscle force $f_m$ transmitted by the tendon, $\tau = r\, f_m$, where $r$ is the minimal perpendicular distance of the musculotendon from the joint center in the planar (scalar) case (Zajac 1992). For the three dimensional case, the torque is calculated by the cross product of the moment arm with the vector of muscle force, $\tau = r \times f_m$. In today's models, musculotendon paths are modeled and visualized either as straight lines joining the points of attachment of the muscle; as straight lines connecting via points attached to specific locations on the bone, which are added or removed depending on joint configuration (Garner & Pandy 2000); or as cubic splines with sliding and surface constraints (Blemker & Delp 2005). Several advances also allow representing muscles as volumetric entities with data extracted from imaging studies (Blemker & Delp 2005), (S. S. Blemker & Delp 2007), and defining tendon paths as wrapping in a piecewise linear way around ellipses defining joint locations (R. Davoodi & Loeb 2003), (Delp & Loan 2007). The path of the musculotendon in these cases is defined based on knowledge of the anatomy. Sometimes it may not be necessary to model the musculotendon paths, and obtaining a mathematical expression for the moment arm $r$ may suffice. The moment arm is often a function of joint angle and can be obtained by recording incremental tendon excursions ($\delta s$) and corresponding joint angle changes ($\delta\theta$) in cadaveric specimens.

7.5 Discussion

The use of stochastic optimal control theory as a conceptual tool towards understanding neuromuscular behavior was proposed in, for example, (He, Levine & Loeb 1991), (Harris & Wolpert 1998), (Todorov 2004).
In that work, a stochastic optimal control framework for systems with linear dynamics and control-dependent noise was used to understand the variability profiles of reaching movements. The influential work by (Todorov 2004) established the minimal intervention principle in the context of optimal control. The minimal intervention principle was developed based on the characteristics of stochastic optimal controllers for systems with multiplicative noise in the control signals.

The LQR and LQG optimal control methods have mostly been tested on linear dynamical systems for modeling sensorimotor behavior; e.g., in reaching tasks, linear models were used to describe the kinematics of the hand trajectory (Harris & Wolpert 1998), (Todorov & Jordan 2002). In neuromuscular modeling, however, linear models cannot capture the nonlinear behavior of muscles and multi-body limbs. In (Li & Todorov 2004), an Iterative Linear Quadratic Regulator (ILQR) was first introduced for the optimal control of nonlinear neuromuscular models. The proposed method is based on linearization of the dynamics. An interesting component of this work, which played an influential role in subsequent studies on optimal control methods for neuromuscular models, was that there was no need for a pre-specified desired trajectory in state space. By contrast, most approaches for neuromuscular optimization that use classical control theory (see Section VI) require target time histories of limb kinematics, kinetics and/or muscle activity. In (Todorov 2005) the ILQR method was extended to the case of nonlinear stochastic systems with state and control dependent noise. The proposed algorithm is the Iterative Linear Quadratic Gaussian regulator (iLQG). This extension allows the use of stochastic nonlinear models of muscle force as a function of fiber length and fiber velocity. Figure 6 illustrates the application of LQG to our arm model (Section II). Further theoretical developments in (Li & Todorov 2006) and (Todorov 2007) allowed the use of an Extended Kalman Filter (EKF) for the case of sensory feedback noise. The EKF is an extension of the Kalman filter for nonlinear systems.

There have been only a few studies in the area of the biomechanics of the index finger that try to identify the underlying control signals for movement and force production, whether these signals correspond to neural commands or to tensions applied on the tendons. More precisely, on the experimental side, the work in (Venkadesan & Valero-Cuevas 2008b) investigated the neural control of the contact transition between motion and force during tapping. On the theoretical side, the study in (Venkadesan & Valero-Cuevas 2008a) found that such transitions from motion to well-directed contact force are a fundamental part of dexterous manipulation, and that such tasks are likely controlled optimally. Moreover, one of the main assumptions in (Venkadesan & Valero-Cuevas 2008a) is that the underlying control strategy of the finger is open loop. In addition, the model used is a torque driven model, while the neuromuscular delays are modeled as activation-contraction dynamics at the level of the torques driving the 3 joints of the index finger. Even though the optimality principles of the motion-to-force transition in tapping were investigated with this simple model, an open loop control strategy would fail in tasks such as object manipulation, where feedback control is a critical requirement for successfully performing the manipulation task.
Furthermore, since only 3 sets of differential equations modeling the activation-contraction dynamics are considered, the full structure and redundancy of the index finger is not explored, and the system under investigation remains, in nature, torque driven.

In this chapter we have reviewed previous work on bio-mechanical modeling by touching on the critical issues of skeletal mechanics, muscle redundancy and musculotendon routing, as well as on the application of optimal control theory to psychophysical and neuromuscular models. We have described the main differences between torque driven and tendon driven systems. We have discussed the role of control theory in bio-mechanical models, not only as a tool that provides insights regarding the underlying control strategies, but also as a way to verify bio-mechanical models through a sensitivity analysis. Following this line of reasoning, in the next chapter we apply optimal control theory to two tendon driven models of the index finger.

Chapter 8
Control of the index finger

In this chapter we apply the iterative optimal control algorithm to two bio-mechanical models of the index finger and compare the resulting behavior. The bio-mechanical models share the same multi-body dynamics but differ in the tendon geometry, since they incorporate the different moment arm matrices found in (Valero-Cuevas et al. 1998) and (An, Ueba, Chao, Cooney & Linscheid 1983). As illustrated below, the different moment arm matrices play an important role in the actuation capabilities of each model of the index finger, which becomes obvious as we compare the underlying tension profiles for a flexing and a tapping movement.

The remainder of this chapter is organized as follows: in section 8.1 we provide a short introduction to the biomechanics of the index finger, while in section 8.2 we discuss the iterative linear quadratic regulator, which is the optimal control algorithm used for our simulations. In section 8.3 we provide the multi-body dynamics, and in section 8.4 we compare our results on the optimal control of the index finger between the two models of the moment arm matrices. The moment arm models and the optimal control algorithm are tested on the tasks of flexing and tapping with the index finger.

8.1 Index finger biomechanics

The skeleton of the human index finger consists of 3 joints connecting 3 rigid links. Two of the joints, the proximal interphalangeal (PIP) and the distal interphalangeal (DIP), are described as hinge joints that generate flexion-extension. The metacarpophalangeal joint (MCP) is a saddle joint and can generate flexion-extension as well as abduction-adduction. Fingers have at least 6 muscles, and the index finger is controlled by 7. Starting with the flexors, the index finger has the Flexor Digitorum Superficialis (FDS) and the Flexor Digitorum Profundus (FDP). The Radial Interosseous (RI) acts on the MCP joint. Lastly, the extensor mechanism acts on all three joints. It is an interconnected network of tendons driven by two extensors, the Extensor Communis (EC) and the Extensor Indicis (EI), together with the Ulnar Interosseous (UI) and the Lumbrical (LU). There are also 4 passive tendon elements that complete this network. These passive tendons are the Terminal Extensor (TE), the Radial Band (RB), the Ulnar Band (UB) and the Extensor Slip (ES). Active tendons are connected to muscles and therefore they directly actuate the finger.
Passive tendons are connected to other (active) tendons and ligaments, and their role in the index finger is to transmit the applied tensions to the distal joints. In our work we consider only the active tendons.

8.2 Iterative stochastic optimal control

We consider the nonlinear dynamical system described by the stochastic differential equation:

$$dx = f(x,u)\,dt + F(x,u)\,d\omega$$

where $x \in \mathbb{R}^{n\times 1}$ is the state, $u \in \mathbb{R}^{m\times 1}$ is the control and $\omega \in \mathbb{R}^{p\times 1}$ is Brownian motion noise with variance $\sigma^2 I_{p\times p}$. The stochastic differential equation above corresponds to a rather general class of dynamical systems found in robotics and biomechanics. The term $h(x(T))$ is the terminal cost in the cost function, while $\ell(\tau, x(\tau), \pi(\tau, x(\tau)))$ is the instantaneous cost rate, a function of the state $x$ and the control policy $\pi(\tau, x(\tau))$. The cost-to-go $v^{\pi}(x,t)$ is defined as the expected cost accumulated over the time horizon $(t_0, \dots, T)$, starting from the initial state $x_t$ and ending at the final state $x(T)$:

$$v^{\pi}(x,t) = E\left[ h(x(T)) + \int_{t_0}^{T} \ell\big(\tau, x(\tau), \pi(\tau, x(\tau))\big)\, d\tau \right]$$

The expectation above is taken over the noise $\omega$. We next discretize the deterministic dynamics, which gives $\bar{x}_{t_{k+1}} = \bar{x}_{t_k} + \Delta t\, f(\bar{x}_{t_k}, \bar{u}_{t_k})$. Furthermore, the deterministic dynamics are linearized around $\bar{x}_{t_k}$ according to the equation:

$$\delta x_{t_{k+1}} + \bar{x}_{t_{k+1}} = \bar{x}_{t_k} + \delta x_{t_k} + \Delta t\, f(\bar{x}_{t_k} + \delta x_{t_k},\, \bar{u}_{t_k} + \delta u_{t_k})$$

The first order approximation of the nonlinear dynamics leads to the linearized dynamics:

$$\delta x_{t_{k+1}} = A_k\, \delta x_{t_k} + B_k\, \delta u_{t_k} + \Gamma_k(\delta u_{t_k})\, \xi_{t_k}$$

where $\Gamma_k$ is the control-dependent noise transition matrix, defined as:

$$\Gamma_k(\delta u_{t_k}) = \big[\, c_{1,k} + C_{1,k}\,\delta u_{t_k} \;\; \cdots \;\; c_{p,k} + C_{p,k}\,\delta u_{t_k} \,\big]$$

with $c_{i,k} = \sqrt{dt}\, F^{(i)}$ and $C_{i,k} = \sqrt{dt}\, \partial F^{(i)}/\partial \delta u$. The state and control transition matrices are expressed as $A_k = I + dt\, \partial f/\partial x$ and $B_k = dt\, \partial f/\partial u$. The quadratic approximation of the cost function is given as:

$$Cost_k = q_k + \delta x_{t_k}^T q_k + \frac{1}{2}\delta x_{t_k}^T Q_k\, \delta x_{t_k} + \delta u_{t_k}^T r_k + \frac{1}{2}\delta u_{t_k}^T R_k\, \delta u_{t_k} + \delta x_{t_k}^T P_k\, \delta u_{t_k} \qquad (8.1)$$

where the scalar $q_k$ and the terms $q_k \in \mathbb{R}^{n\times 1}$, $Q_k \in \mathbb{R}^{n\times n}$, $r_k \in \mathbb{R}^{m\times 1}$, $R_k \in \mathbb{R}^{m\times m}$, $P_k \in \mathbb{R}^{n\times m}$ are defined as:

$$q_k = dt\, \ell; \qquad q_k = dt\, \partial \ell/\partial x \qquad (8.2)$$
$$Q_k = dt\, \partial^2 \ell/\partial x\,\partial x; \qquad P_k = dt\, \partial^2 \ell/\partial u\,\partial x \qquad (8.3)$$
$$r_k = dt\, \partial \ell/\partial u; \qquad R_k = dt\, \partial^2 \ell/\partial u\,\partial u \qquad (8.4)$$

The cost-to-go $v_k(\delta x)$ is quadratic in the state and therefore has the form:

$$v_k(\delta x) = s_k + s_k^T \delta x + \frac{1}{2}\delta x^T S_k\, \delta x \qquad (8.5)$$

where the scalar $s_k$, the vector $s_k$ and the matrix $S_k$ are backward propagated from the terminal (goal) state to the initial state. More precisely, starting with the terminal conditions $s_T = q_T$, $s_T = q_T$ and $S_T = Q_T$, for $k = T-1$ down to the initial time we compute the following terms:

$$g = r_k + B_k^T s_{k+1} + \sigma^2 \sum_i C_{i,k}^T S_{k+1} c_{i,k}$$
$$G = P_k + B_k^T S_{k+1} A_k \qquad (8.6)$$
$$H = \sigma^2 \sum_i C_{i,k}^T S_{k+1} C_{i,k} + B_k^T S_{k+1} B_k + R_k$$

Using the terms above, the correction in the control policy is formulated as $\delta u_{t_k} = -H^{-1}(g + G\, \delta x_{t_k})$, or in more compact form $\delta u_{t_k} = l_k + L_k\, \delta x_{t_k}$, where $l_k = -H^{-1} g$ and $L_k = -H^{-1} G$. As we can see, the correction in the control policy consists of an open loop gain $l_k$ and a closed loop gain $L_k$, which guarantees local stability around the point of linearization of the nonlinear dynamics. Once the open and closed loop gains $l_k$ and $L_k$ have been specified, the next step is the backward propagation of the terms $s_k$ and $S_k$.
This backward propagation is expressed by the equations that follow:

$$S_k = Q_k + A_k^T S_{k+1} A_k + L_k^T H L_k + L_k^T G + G^T L_k$$
$$s_k = q_k + A_k^T s_{k+1} + L_k^T H l_k + L_k^T g + G^T l_k \qquad (8.7)$$
$$s_k = q_k + s_{k+1} + \frac{1}{2}\sigma^2 \sum_i c_{i,k}^T S_{k+1} c_{i,k} + \frac{1}{2} l_k^T H l_k + l_k^T g$$

The control policy at the next iteration is given by adding the correction $\delta u^{(i)}_{t,\dots,T}$ to the control policy of the current iteration. Therefore $u^{(i+1)}_{t,\dots,T} = u^{(i)}_{t,\dots,T} + \gamma\,\delta u^{(i)}_{t,\dots,T}$, where $\gamma$ is the step size. Using the updated control policy $u^{(i+1)}_{t,\dots,T}$ and propagating the nonlinear dynamics, a new trajectory is generated in state space. The linear and quadratic approximations of the dynamics and the cost are recomputed, and the algorithm is repeated until convergence. The control law $\delta u_{t_k} = -H^{-1}(g + G\,\delta x_{t_k})$ is optimal as long as the matrix $H$ is positive definite. The cost-to-go function $v^{\pi}(\delta x)$ depends on the control law $\delta u_k = \pi_k(\delta x)$ through the term $\alpha(\delta x, \delta u) = \delta u^T(g + G\,\delta x) + \frac{1}{2}\delta u^T H\, \delta u$. Therefore minimization of the cost-to-go function is equivalent to minimization of the quadratic function $\alpha(\delta x, \delta u)$, which is convex iff its Hessian satisfies $H > 0$. In high dimensional dynamical systems, $H$ might lose its positive definiteness. In such cases we follow an approach similar to Levenberg-Marquardt: (1) compute the eigenvalue decomposition of $H$, $[V, D] = \mathrm{eig}(H)$; (2) replace all negative elements of the diagonal matrix $D$ with 0; (3) add a small positive number $\lambda$ to the diagonal of $D$; (4) set $H = V D V^T$ using the modified diagonal matrix $D$ from steps (2) and (3). For our simulations we also need to constrain the controls $u$, since the control variable of our index finger model corresponds to neural activation, which is always positive. To avoid violating the control constraints, the step size $\gamma$ is reduced until the constraint is no longer violated. The iLQG algorithm is given in pseudocode form in Table 8.1.

Table 8.1: Pseudocode of the iLQG algorithm
• Given:
  – An immediate cost function $\ell(x,u)$
  – A terminal cost term $\phi_{t_N}$
  – The stochastic dynamics $dx = f(x,u)\,dt + F(x,u)\,d\omega$
• Repeat until convergence:
  – Given a trajectory in states and controls $\bar{x}, \bar{u}$, find the approximations $A_t, B_t, \Gamma_t$ and $\ell_0, \ell_x, \ell_{xx}, \ell_{uu}, \ell_{ux}$ around these trajectories.
  – Compute the terms $H$, $G$ and $g$ according to equations (8.6).
  – Back-propagate the quadratic approximation of the value function based on equations (8.7).
  – Compute $\delta u_{t_k} = -H^{-1}(g + G\,\delta x_{t_k})$.
  – Update the controls: $u^*_{new} = u^*_{old} + \gamma\cdot\delta u^*$.
  – If $u^*_{new} < u_c$, reduce $\gamma$ to $\gamma_c$ so that the constraint is not violated, and set $u^*_{new} = u^*_{old} + \gamma_c\cdot\delta u^*$.
  – Get the new optimal trajectory $x^*$ by propagating the nonlinear dynamics $dx = f(x,u^*)\,dt + F(x,u^*)\,d\omega$.
  – Set $\bar{x} = x^*$ and $\bar{u} = u^*_{new}$ and repeat.

8.3 Multi-body dynamics

The full model of the index finger is given by the equations that follow:

$$\ddot{\theta} = -I(\theta)^{-1}\left(C(\theta,\dot{\theta}) + B\,\dot{\theta}\right) + I(\theta)^{-1}\, T \qquad (8.8)$$
$$T = M(\theta)\, F \qquad (8.9)$$
$$\dot{F} = -\frac{1}{\tau}(F - u) \qquad (8.10)$$

where $I \in \mathbb{R}^{3\times 3}$ is the inertia matrix, $C(\theta,\dot{\theta}) \in \mathbb{R}^{3\times 1}$ is the vector of Coriolis and centripetal forces, and $B \in \mathbb{R}^{3\times 3}$ is the damping matrix. The matrix $M \in \mathbb{R}^{3\times 7}$ is the moment-arm matrix, $T \in \mathbb{R}^{3\times 1}$ is the torque vector, $F \in \mathbb{R}^{7\times 1}$ is the vector of force-tensions on the tendons, and $u$ is the control vector. Equation (8.10) is used to model delays in the generation of tensions on the tendons.
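As a minimal numerical sketch of how equations (8.8)-(8.10) can be integrated forward in time, the code below performs a single Euler step of the tendon-driven dynamics. The inertia, Coriolis, damping and moment-arm functions are placeholders (the actual expressions appear in the text and in the cited references), the damping term is applied with a dissipative sign, and the activation time constant is an assumed value.

```python
import numpy as np

n_joints, n_tendons = 3, 7
tau_act = 0.04                           # assumed activation time constant (s)
B_damp  = 0.05 * np.eye(n_joints)        # assumed damping matrix

def inertia(theta):                      # placeholder for I(theta), 3x3
    return 1e-4 * np.eye(n_joints)

def coriolis(theta, theta_dot):          # placeholder for C(theta, theta_dot), 3x1
    return np.zeros(n_joints)

def moment_arm(theta):                   # placeholder for M(theta), 3x7
    return 0.005 * np.ones((n_joints, n_tendons))

def step(theta, theta_dot, F, u, dt=0.001):
    """One Euler step of the tendon-driven finger dynamics, Eqs. (8.8)-(8.10)."""
    T = moment_arm(theta) @ F                                   # joint torques, Eq. (8.9)
    rhs = -coriolis(theta, theta_dot) - B_damp @ theta_dot + T  # damping taken as dissipative
    theta_ddot = np.linalg.solve(inertia(theta), rhs)           # Eq. (8.8)
    F_dot = -(F - u) / tau_act                                  # activation dynamics, Eq. (8.10)
    return (theta + dt * theta_dot,
            theta_dot + dt * theta_ddot,
            np.clip(F + dt * F_dot, 0.0, None))                 # tensions stay non-negative

# one illustrative step from rest with a small constant activation
theta, theta_dot = np.zeros(n_joints), np.zeros(n_joints)
F, u = np.zeros(n_tendons), 0.1 * np.ones(n_tendons)
theta, theta_dot, F = step(theta, theta_dot, F, u)
```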
For our simulations we have excluded the abduction-adduction movement at the MCP joint; we examine planar movements and investigate the necessary length and velocity profiles of the tendons for producing such movements. Therefore, the state space formulation of our model has dimensionality 13, corresponding to 6 states related to joint space kinematics (angles and velocities) and 7 states for the tensions applied on the 7 active tendons. The quantities $\theta$ and $\dot{\theta}$ are vectors of dimensionality $\theta \in \mathbb{R}^{3\times 1}$, $\dot{\theta} \in \mathbb{R}^{3\times 1}$, defined as $\theta = (\theta_1, \theta_2, \theta_3)$ and $\dot{\theta} = (\dot{\theta}_1, \dot{\theta}_2, \dot{\theta}_3)$. The terms of the inertia matrix $I(\theta)$ in the forward dynamics are given as follows:

$$I_{11} = I_{31} + \mu_1 + \mu_2 + 2\mu_4 \cos\theta_2$$
$$I_{21} = I_{22} + \mu_4 \cos\theta_2 + \mu_6 \cos(\theta_2 + \theta_3)$$
$$I_{22} = I_{33} + \mu_2 + 2\mu_5 \cos\theta_3$$
$$I_{31} = I_{32} + \mu_6 \cos(\theta_2 + \theta_3)$$
$$I_{33} = \mu_3$$

while the vector of Coriolis and centripetal forces $C(\theta,\dot{\theta})$ is formulated as follows:

$$C_1 = \mu_4 \sin\theta_2\left[-\dot{\theta}_2\big(2\dot{\theta}_1 + \dot{\theta}_2\big)\right] + \mu_5 \sin\theta_3\left[-\dot{\theta}_3\big(2\dot{\theta}_1 + 2\dot{\theta}_2 + \dot{\theta}_3\big)\right] - \mu_6 \sin(\theta_2+\theta_3)\big(\dot{\theta}_2+\dot{\theta}_3\big)\big(2\dot{\theta}_1+\dot{\theta}_2+\dot{\theta}_3\big)$$
$$C_2 = \mu_5 \sin\theta_2\, \dot{\theta}_1^2 - \mu_5 \sin\theta_3\left[\dot{\theta}_3\big(2\dot{\theta}_1 + \dot{\theta}_2 + \dot{\theta}_3\big)\right] + \mu_6 \sin(\theta_2+\theta_3)\, \dot{\theta}_1^2$$
$$C_3 = \mu_5 \sin\theta_3\big(\dot{\theta}_1 + \dot{\theta}_2\big)^2 + \mu_6 \sin(\theta_2+\theta_3)\, \dot{\theta}_1^2$$

The terms $\mu_1, \dots, \mu_6$ are functions of the masses $(m_1, m_2, m_3) = (0.05, 0.04, 0.03)$ kg and the lengths $(l_1, l_2, l_3) = (0.0508, 0.0254, 0.01905)$ m of the 3 bones of the index finger. They are specified as $\mu_1 = (m_1+m_2+m_3)\,l_1^2$, $\mu_2 = (m_2+m_3)\,l_2^2$, $\mu_3 = m_3 l_3^2$, $\mu_4 = (m_2+m_3)\,l_1 l_2$, $\mu_5 = m_3 l_2 l_3$ and $\mu_6 = m_3 l_1 l_3$.

8.4 Effect of the moment arm matrices in the control of the index finger

In this section we apply the optimal control framework to a bio-mechanical model of the index finger and we test the effect of different moment arm matrices on the control of the index finger. In our analysis we used the moment arm matrices suggested in (An et al. 1983) and (Valero-Cuevas et al. 1998). We apply iLQG optimal control to generate the two movements and we compare the behaviors of the two models.

8.4.1 Flexing movement

The first movement is a flexion movement around the PIP and DIP joints, while the MCP joint remains almost constant. The initial posture is $\theta_0 = (0, 0, \pi/10)$ and the terminal posture is $\theta_N = (0, \pi/2, \pi/12)$, while the time horizon of the movement is $T_N = 400$ ms. The cost function is tuned such that it penalizes only terminal errors with respect to the target posture and control cost. Therefore, we do not pre-specify any desired trajectory, which would have imposed extra state dependent terms in the cost function.

The flexion and tapping movements correspond to control problems where the goal is to bring the dynamics from an initial state to a target state. The iterative optimal control algorithm provides us with the optimal control sequence $u$, a set of locally optimal closed loop gains $L$, and the locally optimal state space trajectory. This trajectory is treated as a desired trajectory that is followed by the dynamics with the use of the open loop control $u$ and the feedback policies $L$. Essentially, we leave the optimization procedure to come up with its own desired trajectory. An alternative to this approach would be to record joint kinematic trajectories and then use these trajectories in the cost function. In that scenario we would have to impose extra terms in the cost function that penalize any deviation from the desired trajectory.
In both scenarios, the iterative optimal controller is a tracking controller; the difference between the two cases is whether the desired trajectory is pre-specified or is the outcome of the optimization procedure. In the figures that follow, the postures, the kinetics of the tendons and the underlying tension profiles are illustrated.

Figure 8.1: Flexing Movement: Sequence of postures generated when the first model of the moment arm matrix is used and iLQG is applied.
Figure 8.2: Flexing Movement: Tendon excursions for the right index finger during the flexing movement when the first model of the moment arm matrix is used.
Figure 8.3: Flexing Movement: Tension profiles applied to the right index finger when the first model of the moment arm matrix is used.
Figure 8.4: Flexing Movement: Extensor tension profiles applied to the right index finger when the first model of the moment arm matrix is used.
Figure 8.5: Flexing Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the first model of the moment arm matrix is used.
Figure 8.6: Flexing Movement: Sequence of postures generated when the second model of the moment arm matrix is used and iLQG is applied.
Figure 8.7: Flexing Movement: Tendon excursions for the right index finger during the flexing movement when the second model of the moment arm matrix is used.
Figure 8.8: Flexing Movement: Tension profiles applied to the right index finger when the second model of the moment arm matrix is used.
Figure 8.9: Flexing Movement: Extensor tension profiles applied to the right index finger when the second model of the moment arm matrix is used.
Figure 8.10: Flexing Movement: Flexor tension profiles applied to the right index finger when the second model of the moment arm matrix is used.
Figure 8.11: Flexing Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the second model of the moment arm matrix is used.

There are a few important observations regarding the kinematic behaviors and the underlying tension profiles when the two different moment arm matrices are used.
More precisely:
• Figures 8.1 and 8.6 illustrate the sequence of postures for the two moment arms. In both cases iLQG succeeds in bringing the finger to the desired posture. When the moment arm by (Valero-Cuevas et al. 1998) is used, there is a small rotation at the MCP joint which is not observed when the moment arm matrix by (An et al. 1983) is used.
• In Figures 8.2 and 8.7 the tendon excursions are illustrated. In both cases the flexor tendons FDP and FDS move inwards, and therefore the corresponding tendons are flexing as expected. Correspondingly, the tendon excursions of EC and EI move outwards and thus these tendons operate as expected. Moreover, the tendons LUM, RI and UI move outwards, as illustrated in the two figures.
• In Figures 8.3, 8.4 and 8.8, 8.9 the tensions applied on the 7 tendons to generate the flexing movement are shown. Clearly, for the case of the first moment arm there is a synchronized burst of activity, since all the tensions reach their maximum values during the time window between 0 and 0.2 s. For the case of the second moment arm, the results in Figure 8.8 do not show a burst of activity but rather suggest a different mechanism, characterized by a higher tension in the FDP tendon with respect to the rest of the tendons, and a delay in the activation of the FDS and the EI, EC tendons, as shown in Figure 8.9.
• The torque profiles are illustrated in Figures 8.5 and 8.11. The torque profiles are very similar, since in both cases the highest torque is generated around the MCP joint and the smallest around the DIP joint. The torques applied at the MCP and DIP joints for the first moment arm reach a smaller peak than the corresponding peak reached by the MCP and DIP torques for the second moment arm matrix. Furthermore, the torques for the first moment arm (Figure 8.5) change over time in a smoother fashion than the torques in Figure 8.11.

In the next subsection we continue our sensitivity analysis for the case of the tapping movement, again testing the two moment arm matrices.

8.4.2 Tapping Movement

The second movement corresponds to tapping with the index finger. The initial posture is $\theta_0 = (5\pi/6, \pi/2, \pi/10)$ and the terminal posture is $\theta_N = (7\pi/6, \pi/4, \pi/12)$, while the time horizon of the movement is 300 ms. The cost function is tuned such that it penalizes only terminal errors with respect to the target posture and control cost. In the figures that follow, the postures, the kinetics of the tendons and the underlying tension profiles are illustrated.

Figure 8.12: Tapping Movement: Sequence of postures generated when the first model of the moment arm matrix is used and iLQG is applied.
Figure 8.13: Tapping Movement: Tendon excursions for the right index finger during the tapping movement when the first model of the moment arm matrix is used.
Figure 8.14: Tapping Movement: Tension profiles applied to the right index finger when the first model of the moment arm matrix is used.
Figure 8.15: Tapping Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the first model of the moment arm matrix is used.
Figure 8.16: Tapping Movement: Sequence of postures generated when the second model of the moment arm matrix is used and iLQG is applied.
Figure 8.17: Tapping Movement: Tendon excursions for the right index finger during the tapping movement when the second model of the moment arm matrix is used.
Figure 8.18: Tapping Movement: Tension profiles applied to the right index finger when the second model of the moment arm matrix is used.
Figure 8.19: Tapping Movement: Generated torques at the MCP, PIP and DIP joints of the right index finger when the second model of the moment arm matrix is used.

There are a few important observations regarding the kinematic behaviors and the underlying tension profiles when the two different moment arm matrices are used. More precisely:
• In Figures 8.12 and 8.16 the sequence of postures for the tapping movement for the two moment arm matrices is illustrated. In both cases the finger reaches the desired posture with some small error. It is important to mention that there is no desired trajectory encoded in the cost function; there is only a penalty at the terminal state, which is the desired terminal posture. In addition, the dynamical systems are tendon driven and therefore the tensions and activation variables must be positive. Even though these hard constraints on the controls challenge the feasibility of the optimization problem, iLQG succeeds in bringing the system close to the desired state.
• The tendon excursions are shown in Figures 8.13 and 8.17. Clearly the flexor tendons FDP and FDS flex, since they move inwards, and the extensor tendons EC and EI extend, because they move outwards. The LU, RI and UI tendons move inwards and therefore act as flexors for this specific tapping movement.
• In Figures 8.14 and 8.18 the tension profiles are shown. The difference between the two moment arm matrices is more apparent in the comparison of the underlying tensions. In both cases the times at which the tensions reach their maximum values are synchronized. However, for the case of the first moment arm the tension at the FDS is very small. In addition, for the case of the second moment arm there is a peak of activation 30 ms before the end of the movement.
• The torque profiles are shown in Figures 8.15 and 8.19. In both cases the highest positive torque is applied at the MCP joint and the smallest negative torque at the DIP and PIP joints. Furthermore, for the case of the second moment arm there is a peak of the MCP torque just 30 ms before the end of the movement.
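The role of the moment arm matrix in redistributing tendon tensions can also be illustrated outside the iLQG framework used in this chapter. The sketch below solves the static mapping $T = MF$ for non-negative tendon tensions with a non-negative least squares solver, for two hypothetical 3x7 moment arm matrices and a fixed torque demand; all numbers are placeholders, and this static calculation is only an illustration, not the procedure used for the movements above.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
M1 = rng.uniform(-0.01, 0.01, size=(3, 7))   # hypothetical moment arm matrix 1 (m)
M2 = rng.uniform(-0.01, 0.01, size=(3, 7))   # hypothetical moment arm matrix 2 (m)
T_des = np.array([0.5, 0.2, 0.05])           # placeholder torque demand at MCP, PIP, DIP (Nm)

F1, res1 = nnls(M1, T_des)                   # tendon tensions constrained to be >= 0
F2, res2 = nnls(M2, T_des)
print("tensions under M1:", np.round(F1, 2), " residual:", round(res1, 4))
print("tensions under M2:", np.round(F2, 2), " residual:", round(res2, 4))
```

Even for the same torque demand, the two matrices generally produce different tension distributions, which mirrors the sensitivity observed in the iLQG results above.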
8.5 Discussion

The results above suggest that the application of optimal control to the index finger provides different results for the two moment arm matrices. This is not a surprising result since, as we have mentioned, optimal control is a constrained optimization problem whose constraints correspond to dynamical systems. When different moment arm matrices are used, the underlying dynamics differ and thus the constraints of the constrained optimization problem change. These changes result in different locally optimal solutions. An interesting observation is that the differences between the two cases are most pronounced in the underlying tension profiles.

The underlying optimization in the iterative optimal control method used in this chapter was formulated without a desired trajectory; only a terminal desired state was used as the goal state in both movements. The outcome of the application of optimal control is a desired optimal state trajectory $x^*_1, \dots, x^*_T$, a feedforward optimal command $u^*_1, \dots, u^*_{T-1}$ and locally optimal gains $L_1, \dots, L_{T-1}$. Thus, even though no desired trajectory was initially used in the design of the cost function, the resulting policy is a feedback policy whose desired trajectory is the one provided by the optimization, namely the optimal $x^*_1, \dots, x^*_T$. Consequently, for nonlinear systems, even though no initial trajectory is specified as a desired one, the resulting controller is a tracking controller.

In this chapter, by applying the optimal control framework to the two bio-mechanical models of the index finger, we have observed the sensitivity of the predictions with respect to model changes. This sensitivity suggests the need for verification and model checking of the bio-mechanical models under consideration.

Chapter 9
Conclusions and future work

In this thesis a new method for learning control in high dimensional state spaces has been proposed based on the framework of path integral control. On the bio-mechanical side, models of the index finger were tested on two tasks and the results were compared. In the next section we give the outline of this thesis based on the aforementioned projects, and we discuss future research and extensions of the current work.

9.1 Path integral control and applications to learning and control in robotics

One of the main contributions of this thesis is the derivation of path integral control for the class of nonlinear dynamical systems affine in control and noise. Furthermore, this thesis suggests the iterative version of the path integral stochastic optimal control framework. The outcome of this version is a new formalism, the so-called Policy Improvement with Path Integrals (PI²), capable of scaling to high dimensional learning control problems. The advantages and characteristics of PI² can be summarized as follows:

• With respect to other gradient based methods, in PI² and in path integral control the gradient is calculated based on the weighted average of the local controls or local changes in the policy. These weights are given by the exponentiation of the variable $-S(x)$, where $S(x)$ is proportional to the cost of each path. Thus, paths with high cost will have very low probability and therefore low weight, while paths with low cost will have high probability. Consequently, the gradient or optimal change in policy is given by the convex combination of the local controls or local changes in the policy. This calculation has obvious robustness against exploration noise.
• Since the gradient is calculated based on the convex combination of local policies, the optimality is with respect to these sampled local policies. Therefore, the question is how PI² explores the state space. Exploration comes as an outcome of the iterative version of path integral stochastic optimal control. Essentially, with the iterative version and the update of the parameterized policy, the local policies at every iteration yield trajectories with lower cost than the local policies at the previous iteration.
• An essential characteristic of path integral control is that the solution of the backward Chapman-Kolmogorov equation is found with forward sampling of the corresponding SDE. This characteristic comes from the direct application of the Feynman-Kac lemma. Moreover, it allows us to perform sampling by executing trials on the real physical system, with forward propagation of its dynamics and accumulation of the observed cost.
• Finally, in the path integral control framework, the optimal control is transformed from a minimization to a maximization problem. The exponentiation of the value function results in a new value function $\Psi(x)$ which has a probabilistic meaning. This probabilistic nature appears again in the final form of the optimal control as the expectation over the local controls evaluated under the probability metric $p = \frac{e^{-S(x)}}{\int e^{-S(x)}\,dx}$.
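A minimal sketch of this probability-weighted averaging is given below: each sampled rollout contributes its local control variation with a weight proportional to the exponentiated negative path cost, and the update is the resulting convex combination. The temperature-like parameter lambda and the synthetic rollout data are assumptions made for the example only.

```python
import numpy as np

def pi2_update(S, dU, lam=1.0):
    """S: (K,) rollout costs; dU: (K, m) sampled control variations; returns the weighted update."""
    S = S - np.min(S)                  # shift costs for numerical stability of the exponentials
    w = np.exp(-S / lam)               # low-cost paths receive high weight
    w = w / np.sum(w)                  # normalize: w_k = e^{-S_k} / sum_j e^{-S_j}
    return w @ dU                      # convex combination of the local control variations

# synthetic example: 10 rollouts with a 7-dimensional control variation each
rng = np.random.default_rng(0)
S  = rng.uniform(0.0, 5.0, size=10)
dU = rng.normal(scale=0.1, size=(10, 7))
delta_u = pi2_update(S, dU)
print(np.round(delta_u, 4))
```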
9.2 Future work on path integral optimal control

The extensions of path integral control are related to different noise distributions as well as to more general classes of stochastic systems. Examples are the cases where the stochastic dynamics are not affine in the controls but are affine only in the noise term. In addition, stochastic dynamics with Wiener and Poisson noise terms are also of interest. In the next three subsections we discuss these extensions of path integral control.

9.2.1 Path integral control for systems with control multiplicative noise

So far, path integral stochastic optimal control has been applied to stochastic dynamical systems with state multiplicative noise. If one considers control multiplicative noise, then the underlying HJB equation can also be derived. In this case, however, the resulting HJB equation cannot be transformed into a linear PDE and therefore the application of the Feynman-Kac lemma may not be possible. To avoid this obstacle, one could formulate the stochastic optimal control problem as follows:

$$\min_u J(u,x) = \min_u E\left[\exp\left(-\int_{t_0}^{t_N} L(x,u)\,dt\right)\right]$$

subject to the stochastic dynamics with state and control multiplicative noise:

$$dx = F(x,u)\,dt + B(x,u)\,d\omega$$

The transition probability for the derivation of the path integral is now given as:

$$\left\langle \delta\big[x_i - \phi(t_i; x_{i-1}, t_{i-1})\big] \right\rangle = \int \frac{d\omega}{(2\pi)^n} \exp\left(j\,\omega^T A\right)\exp\left(-\frac{1}{2}\,\omega^T \mathcal{B}\,\omega\, dt\right)$$

where $A = x(t_i) - x(t_{i-1}) - F(x,u)\,dt$ and $\mathcal{B} = B(x,u)B(x,u)^T$. With respect to path integral control, there is no need for the derivation of the HJB equation. Thus, one could derive the path integral for the stochastic dynamics and then find the gradient of the cost function.

9.2.2 Path integral control for Markov jump diffusion processes

Markov jump diffusion processes are important in applications of stochastic optimal control in financial engineering and economics, as well as in systems biology. Many phenomena in these fields can be modeled as jump diffusion processes due to sudden changes, or jumps, observed in markets and in the dynamic behavior of micro-organisms such as cells. In addition, in robotics, Markov jump diffusions could model contact phenomena of walking robots with the ground. Thus, extending the path integral control framework to Markov jump diffusion processes is of interest. A Markov jump diffusion is expressed by the equation:

$$dx = F(x,u)\,dt + B(x,u)\,d\omega + h(x,t)\,dP(t)$$

where $F(x,u) \in \mathbb{R}^{n\times 1}$ is the drift term, $B(x,u) \in \mathbb{R}^{n\times m}$ is the diffusion term and $h(x,t) \in \mathbb{R}^{n\times l}$ is the Poisson process coefficient. The HJB equation for the case of Markov jump diffusion processes is a PDE with an additional integral term that corresponds to the Poisson distributed stochastic term $dP$. It is an open question whether or not the path integral control framework can be derived for the case of Markov jump diffusion processes, and it is certainly a topic of current and future research.

9.2.3 Path integral control for generalized cost functions

In this work, the cost functions under optimization have no cross terms between control and state dependent terms. However, one may consider a more general case of cost function in which, besides the state dependent and control dependent terms, there is an additional term that is the projection of the controls onto the space of the state. These cost functions have the form:

$$L_t = L(x_t, u_t, t) = q_0(x,t) + q_1(x,t)^T u_t + \frac{1}{2}\, u_t^T R\, u_t$$

For these cost functions one can show that the optimal controls are expressed as follows:

$$u(x,t) = -R^{-1}\left(q_1(x,t) + G(x)^T \nabla_x V(x,t)\right)$$

The linear HJB equation for this case is expressed as:

$$-\partial_t \Psi_t = -\frac{1}{\lambda}\,\tilde{q}_0\, \Psi_t + \tilde{f}_t^T (\nabla_x \Psi_t) + \frac{1}{2}\,\mathrm{tr}\big((\nabla_{xx}\Psi_t)\,\Sigma_t\big)$$

where the terms $\tilde{q}_0$ and $\tilde{f}$ are given as follows:

$$\tilde{q}_0(x,t) = q_0(x,t) - \frac{1}{2}\, q_1(x,t)^T R^{-1} q_1(x,t), \qquad \tilde{f}(x,t) = f(x,t) - G(x,t)\, R^{-1} q_1(x,t)$$

Under the logarithmic transformation, the optimal controls are defined by the equation:

$$u(x,t) = -R^{-1}\left(q_1(x,t) - \lambda\, G(x)^T \frac{\nabla_x \Psi(x,t)}{\Psi(x,t)}\right)$$

It is a topic of future research to investigate the differences in the resulting optimal policies when this type of cost function is used.

9.3 Future work on stochastic dynamic programming

Another contribution of this thesis is the derivation of the stochastic version of Differential Dynamic Programming (SDDP) for the case of stochastic dynamical systems with state and control multiplicative noise. There are many possible extensions and research topics for SDDP, which are summarized as follows:
• Further applications of SDDP to stochastic dynamical systems, and extension to the case of constraints on states and controls.
• So far we have derived SDDP by using the Itô calculus. It is of interest to investigate how different discretization schemes, based on the Stratonovich or other stochastic calculi, could affect the convergence of SDDP. This is important for systems where the noise is control and state multiplicative.
• Extensions of SDDP to the case of partial observability, with the addition of an extended, second order truncated Kalman filter. The resulting algorithm can be thought of as a version of nonlinear LQG design in which the state space dynamics are expanded up to second order for both estimation and control.

9.4 Future work on neuromuscular control

The application of optimal control methods to identify the underlying tension profiles for the index finger reveals that the results depend on the model. We have used two different moment arm models that distribute the forces applied on the index finger in different ways. The question of which moment arm model is the most appropriate one is open and difficult to answer, since it requires experiments in which access to the tendon tensions is possible.

A topic for future research is to develop methods which could be used to verify bio-mechanical models before optimal control is applied. A possible way to verify models of the index finger biomechanics would be to record trajectories of finger movements in humans, and then test whether or not the candidate models satisfy the local controllability condition when they are linearized along the recorded trajectories. But even if the controllability condition is satisfied, that does not mean that the tested model is a good candidate, because the controls are constrained. More precisely, if neural activity is treated as the control variable then it is bounded between 0 and 1, while in cases where the forces produced by the muscles are treated as controls, these control variables have to be positive. Thus, the controllability condition is a necessary but not a sufficient condition for the case of constrained controls.

Future research will investigate the application of alternative methods of optimal control such as Pseudospectral methods. In Pseudospectral methods, the optimal trajectory and control are represented as polynomial functions of time. These methods can handle hard constraints on controls and states; however, they provide open loop optimal policies and not feedback policies. Moreover, they are mostly applicable to deterministic and not stochastic systems. It remains an open question how Pseudospectral methods compare to iterative methods and how they could be applied to bio-mechanical models.

Bibliography

Abend, W., Bizzi, E. & Morasso, P. (1982), 'Human arm trajectory formation', Brain 105(Pt 2), 331–348.
Amari, S. (1999), 'Natural gradient learning for over- and under-complete bases in ICA', Neural Computation 11(8), 1875–83.
An, K. N., Chiao, E. Y., Cooney, W. P. & Linscheid, R. L. (1985), 'Forces in the normal and abnormal hand', Journal of Orthopaedic Research 3, 202–211.
An, K., Ueba, Y., Chao, E., Cooney, W. & Linscheid, R. (1983), 'Tendon excursion and moment arm of index finger muscles', Journal of Biomechanics 16(6), 419–425.
Basar, T. (1991), Time Consistency and Robustness of Equilibria in Noncooperative Dynamic Games, Springer Verlag, North Holland.
Basar, T. & Berhard, P. (1995), H-infinity Optimal Control and Related Minimax Design, Birkhauser, Boston.
Baxter, J. & Bartlett, P. L. (2001), 'Infinite-horizon policy-gradient estimation', Journal of Artificial Intelligence Research 15, 319–350.
Bellman, R. & Kalaba, R. (1964), Selected Papers on Mathematical Trends in Control Theory, Dover Publications.
Bishop, C. M. (2006), Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Blankevoort, L., Kuiper, J., Huiskes, R. & Grootenboer, H. (n.d.), 'Articular contact in a three-dimensional model of the knee', Journal of Biomechanics.
Blemker, S. S. & Delp, S. L. (2005), 'Three-dimensional representation of complex muscle architectures and geometries', Annals of Biomedical Engineering 33(5), 661–773.
Broek, B. V. D., Wiegerinck, W. & Kappen, H. J. (2008), 'Graphical model inference in optimal control of stochastic multi-agent systems', Journal of Artificial Intelligence Research 32(1), 95–122.
Buchli, J., Kalakrishnan, M., Mistry, M., Pastor, P. & Schaal, S. (2009), Compliant quadruped locomotion over rough terrain, in 'Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on'.
URL: http://www-clmc.usc.edu/publications/B/buchli-IROS2009.pdf 268 Buchli, J., Theodorou, E., Stulp, F. & Schaal, S. (2010), Variable impedance control - a reinforcement learning approach, in ‘Robotics: Science and Systems Conference (RSS)’. Cerveri, P., De Momi, E., Marchente, M., Lopomo, N., Baud-Bovy, G., Barros, R. M. L. & Ferrigno, G. (2008), ‘In vivo validation of a realistic kinematic model for the trapezio-metacarpal joint using an optoelectronic system’, ANNALS OF BIOMED- ICAL ENGINEERING 36(7), 1268–1280. Cheng, G., Hyon, S., Morimoto, J., Ude, A., Hale, J., Colvin, G., Scroggin, W. & Jacob- sen, S. C. (2007), ‘Cb: A humanoid research platform for exploring neuroscience’, Journal of Advanced Robotics 21(10), 1097–1114. Chirikjian, S. G. (2009), Stochastic Models, Information Theory, and Lie Groups., Vol. I, Birkh˝ auser. Dayan, P. & Hinton, G. (1997), ‘Using em for reinforcement learning’, Neural Computa- tion 9. Deisenroth, M. P., Rasmussen, C. E. & Peters, J. (2009), ‘Gaussian process dynamic programming’, Neurocomputing 72(7–9), 1508–1524. Delp, S. L. & Loan, J. P. (2007), ‘A graphics-based software system to develop and analyze models of musculoskeletal structures,’, Computers in Biology and Medicine 25(1), 21 – 34. Dennerlein, J. T., Diao, E., Mote, C. D. & Rempel, D. M. (1998), ‘Tensions of the flexor digitorum superficialis are higher than a current model predicts’, Journal of Biomechanics 31(4), 295 – 301. Dorato, P., Cerone, V. & Abdallah, C. (2000), Linear Quadratic Control: An Introduc- tion, Krieger Publishing Co., Inc., Melbourne, FL, USA. Doyle, J. (1978), ‘Guaranteed margins for lqg regulators’, Automatic Control, IEEE Transactions on 23(4), 756 – 757. Esteki, A. & Mansour, J. M. (1996), ‘An experimentally based nonlinear viscoelastic model of joint passive moment’, Journal of Biomechanics 29(4), 443 – 450. Feynman, P. R. & Hibbs, A. (2005), Quantum Mechanics and Path Integrals, Dover - (Emended Edition). Fleming, W. H. & Soner, H. M. (2006), Controlled Markov Processes and Viscosity Solu- tions, Applications of aathematics, 2nd edn, Springer, New York. Freivalds, A. (2000), Biomechanics of the upper limbs: mechanics, modeling, and Muscu- loskeletal injures, 1rd edn, CRC Press. Friedman,A.(1975),StochasticDifferentialEquationsAndApplications,AcademicPress. 269 Gardiner, C. (2004), Handbook of Stochastic Methods: for Physics, Chemistry and the Natural Sciences, Spinger. Garner, B. A. & Pandy, M. G. (2000), ‘The obstacle-set method for representing muscle paths in musculoskeletal models,’, Computer methods in biomechanics and biomedi- cal engineering 3(1), 1 – 30. Ghavamzadeh, M. & Yaakov, E. (2007), Bayesian actor-critic algorithms, in ‘ICML ’07: Proceedings of The 24th International Conference on Machine Learning’, pp. 297– 304. Harding,D.,Brandt,K.&Hillberry,B.(1993),‘Fingerjointforceminimizationinpianists using optimization techniques’, Journal of Biomechanics 26(12), 1403 – 1412. Harris, C. M. & Wolpert, D. M. (1998), ‘Signal-dependent noise determines motor plan- ning’, Nature 394, 780–784. Hatze, H.(1997), ‘Athree-dimensionalmultivariatemodelofpassivehumanjointtorques and articular boundaries’, Clinical Biomechanics 12(2), 128 – 135. He, J., Levine, W. & Loeb, G. (1991), ‘Feedback gains for correcting small perturbations to standing posture’, Automatic Control, IEEE Transactions on . Hollister, A., Buford, W. L., Myers, L. M., Giurintano, D. J. & Novick, A. 
(1992), ‘The axes of rotation of the thumb carpometacarpal joint.’, Journal of Orthopaedic Research 10(3), 454–460. Ijspeert, A., Nakanishi, J., Pastor, P., Hoffmann, H. & Schaal, S. (submitted), ‘learning nonlinear dynamical systems models’. URL: http://www-clmc.usc.edu/publications/I/ijspeert-submitted.pdf Ijspeert, A., Nakanishi, J. & Schaal, S. (2003), Learning attractor landscapes for learning motor primitives, in S. Becker, S. Thrun & K. Obermayer, eds, ‘Advances in Neural Information Processing Systems 15’, Cambridge, MA: MIT Press, pp. 1547–1554. Jacobson, D. H. (1973), ‘Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games’, IEEE Transactions of Automatic Control AC - 18, 124–131. Jacobson, D. H. & Mayne, D. Q. (1970), Differential dynamic programming, American Elsevier Pub. Co., New York,. James, M. R., Baras, J. & Elliot, R. (1994), ‘Risk sensitive control of dynamic games for partially observed discrete - time nonlinear systems’, IEEE Transactions of Auto- matic Control AC - 39(4), 780–792. Jetchev, N. & Toussaint, M. (2009), Trajectory prediction: learning to map situations to robot trajectories, in ‘ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning’, pp. 449–456. 270 Jinha, A., Ait-Haddou, R., Binding, P. & Herzog, W. (2006), ‘Antagonistic activity of one-joint muscles in three-dimensions using non-linear optimisation’, Mathematical Biosciences 202(1), 57 – 70. Kalman, R. (1964), ‘When is a linear control system optimal?’, ASME Transactions, Journal of Basic Engineering 86, 51–60. Kappen, H. J. (2005a), ‘Linear theory for control of nonlinear stochastic systems’, Phys. Rev. Lett. 95, 200201. Kappen,H.J.(2005b),‘Pathintegralsandsymmetrybreakingforoptimalcontroltheory’, Journal of Statistical Mechanics: Theory and Experiment (11), P11011. Kappen, H. J. (2007), An introduction to stochastic control theory, path integrals and reinforcement learning, in J. Marro, P. L. Garrido & J. J. Torres, eds, ‘Cooperative Behavior in Neural Systems’, Vol. 887 of American Institute of Physics Conference Series, pp. 149–181. Karatzas, I. & Shreve, S. E. (1991), Brownian Motion and Stochastic Calculus (Graduate Texts in Mathematics), 2nd edn, Springer. Kober, J. & Peters, J. (2009), Learning motor primitives in robotics, in D. Schuurmans, J. Benigio & D. Koller, eds, ‘Advances in Neural Information Processing Systems 21’, Cambridge, MA: MIT Press, Vancouver, BC, Dec. 8-11. Lau, A. W. C. & Lubensky, T. C. (2007), ‘State-dependent diffusion: thermodynamic consistency and its path integral formulation’. URL: http://arxiv.org/abs/0707.2234 Leitmann, G. (1981), The Calculus Of Variations and Optimal Control, Plenum Press, New York. Li, W. & Todorov, E. (2004), Iterative linear quadratic regulator design for nonlinear biological movement systems, in ‘ICINCO (1)’, pp. 222–229. Li, W. & Todorov, E. (2006), An iterative optimal control and estimation design for nonlinear stochastic system, in ‘Decision and Control, 2006 45th IEEE Conference on’, pp. 3242 –3247. Morimoto, J. & Atkeson, C. (2002), Minimax differential dynamic programming: An ap- plication to robust biped walking, in ‘In Advances in Neural Information Processing Systems 15’, MIT Press, Cambridge, MA. Morimoto, J. & Doya, K. (2005), ‘Robust reinforcement learning’, Neural Comput.17(2). Murray, R. M., Li, Z. & Sastry, S. S. (1994), A Mathematical Introduction to Robotic Manipulation, 1 edn, CRC. Mussa-Ivaldi, A., Hogan, N. & Bizzi, E. 
Abstract
Motivated by the limitations of current optimal control and reinforcement learning (RL) methods in terms of efficiency and scalability, this thesis proposes an iterative stochastic optimal control approach based on the generalized path integral formalism. More precisely, we suggest using the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equation, policy improvements can be transformed into the problem of approximating a path integral, which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model-free, depending on how the learning problem is structured. The new algorithm, Policy Improvement with Path Integrals (PI2), exhibits interesting similarities with previous RL research in the framework of probability matching and provides intuition as to why the somewhat heuristically motivated probability matching approach can actually perform well. Applications to high-dimensional robotic systems are presented for a variety of tasks that require optimal planning and gain scheduling.
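The policy improvement summarized above rests on weighting noisy policy rollouts by the exponential of their trajectory cost and averaging the exploration noise under those weights. The following Python sketch illustrates only that general idea; it is a minimal, illustrative example and not the algorithm as derived in the thesis. The function name pi2_style_update, the temperature-like constant lam, the min-max normalization of the rollout costs, and the toy quadratic cost in the usage example are all assumptions made for this sketch.

import numpy as np

def pi2_style_update(theta, epsilons, costs, lam=1.0):
    # theta    : (d,) current policy parameter vector
    # epsilons : (K, d) exploration noise added to theta in each of K rollouts
    # costs    : (K,) trajectory cost accumulated by each rollout
    # lam      : temperature-like constant (an assumption of this sketch)
    costs = np.asarray(costs, dtype=float)
    # Normalize costs to [0, 1] so the exponentiation is numerically well behaved.
    s = (costs - costs.min()) / (costs.max() - costs.min() + 1e-10)
    # Exponentiate the negative cost: low-cost rollouts receive large weights.
    weights = np.exp(-s / lam)
    weights /= weights.sum()
    # Probability-weighted average of the explored perturbations.
    delta_theta = weights @ np.asarray(epsilons)
    return np.asarray(theta) + delta_theta

# Toy usage: 10 rollouts of a 3-dimensional parameter vector with a stand-in cost.
rng = np.random.default_rng(0)
theta = np.zeros(3)
eps = rng.normal(scale=0.1, size=(10, 3))
costs = np.sum((theta + eps - 1.0) ** 2, axis=1)
print(pi2_style_update(theta, eps, costs))

In practice such an update would be applied per time step and per basis function of a parameterized policy, with fresh exploration noise drawn at every iteration; the sketch collapses this to a single parameter vector for brevity.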
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Learning objective functions for autonomous motion generation
Modeling motor memory to enhance multiple task learning
Machine learning of motor skills for robotics
Optimization-based whole-body control and reactive planning for a torque controlled humanoid robot
Data-driven autonomous manipulation
The representation, learning, and control of dexterous motor skills in humans and humanoid robots
Dynamic routing and rate control in stochastic network optimization: from theory to practice
Discrete geometric motion control of autonomous vehicles
Characterizing and improving robot learning: a control-theoretic perspective
Learning and control in decentralized stochastic systems
Sample-efficient and robust neurosymbolic learning from demonstrations
Investigating the role of muscle physiology and spinal circuitry in sensorimotor control
Model-based approaches to objective inference during steady-state and adaptive locomotor control
Discounted robust stochastic games with applications to homeland security and flow control
Learning affordances through interactive perception and manipulation
Computational principles in human motor adaptation: sources, memories, and variability
Computational validation of stochastic programming models and applications
Data-driven acquisition of closed-loop robotic skills
Empirical methods in control and optimization
Leveraging cross-task transfer in sequential decision problems
Asset Metadata
Creator
Theodorou, Evangelos A. (author)
Core Title
Iterative path integral stochastic optimal control: theory and applications to motor control
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
04/29/2011
Defense Date
01/11/2011
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
OAI-PMH Harvest, reinforcement learning, robotics, stochastic optimal control
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Schaal, Stefan (committee chair), Schweighofer, Nicolas (committee member), Sukhatme, Gaurav S. (committee member), Todorov, Emo (committee member), Valero-Cuevas, Francisco (committee member)
Creator Email
etheodor@usc.edu, theo0027@umn.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3804
Unique identifier
UC154543
Identifier
etd-Theodorou-4581 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-468575 (legacy record id), usctheses-m3804 (legacy record id)
Legacy Identifier
etd-Theodorou-4581.pdf
Dmrecord
468575
Document Type
Dissertation
Rights
Theodorou, Evangelos A.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
reinforcement learning
robotics
stochastic optimal control