CHARACTERIZING AND IMPROVING ROBOT LEARNING: A CONTROL-THEORETIC PERSPECTIVE

by

James Alan Preiss

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2022

Copyright 2022 James Alan Preiss

Acknowledgments

My path to this dissertation has not been direct. I wish to thank the many people whose support, encouragement, teaching, and inspiration have brought me here.

I am grateful for the opportunity to work with my advisor, Gaurav S. Sukhatme. His guidance has been thoughtful and kind. His interdisciplinary approach to robotics gave me freedom to explore different topics, knowing that he would help isolate the essence of a problem and see how it fits into the big picture. Gaurav often says that the output of a Ph.D. is not the research; it is the researcher. This lesson helped me appreciate the growth and learning from each project, no matter the results.

Many thanks to the USC academic community. I appreciate the insightful questions and suggestions from the members of my qualifying, thesis proposal, and defense committees: Nora Ayanian, Heather Culbertson, Ashutosh Nayyar, and Stefanos Nikolaidis. Nora was also a close collaborator on research projects that do not appear in this dissertation. Shaddin Dughmi and Haipeng Luo taught courses that shaped my research interests. Lizsl De Leon was a boundless source of cheer and wisdom in navigating the Department, School, and University. My studies were generously supported by the USC Viterbi-Graduate School Ph.D. Fellowship.

I was fortunate to collaborate with many fellow researchers: Sébastien M.R. Arnold, Jiajun Bi, Matt Buckley, Tao Chen, Zhenghao Dai, Nicole Fronda, Karol Hausman, Wolfgang Hönig, Marius Kloft, Alexander S. Koumis, T.K. Satish Kumar, Amlesh Sivanantham, Michael Leahy, David Millard, Artem Molchanov, Ragesh K. Ramachandran, Christian Wagner, Chen-Yu Wei, Stephan Weiss, Tao Yao, and Lifeng Zhou. It was a privilege to learn from people with such diverse backgrounds and interests. I especially thank Karol and Wolfgang for setting an example during the first few papers, Ragesh for introducing me to many new mathematical ideas, Séb, Marius, and Chen-Yu for influencing me to ask more theoretical questions, and David and Tao for bringing me into the world of deformable manipulation. To the labmates with whom I did not collaborate directly: thank you for the camaraderie and interesting conversations.

Thanks to Hanna Mazzawi and Eugen Hotaj for hosting my internship at Google Research NYC and broadening my perspective with a new problem. From my time in industry before graduate school, I thank my supervisors Thomas Jansen and Xan Gregg. Their mentorship in software development practice made the implementation side of my work much less intimidating. From The Evergreen State College, I am grateful to Clyde Barlow, Dawn Rorvik, and Richard Weiss for giving me a wonderful first academic research experience. We worked with numerical algorithms and hardware, as I still do today. Thanks also to David McAvity and Brian Walter for teaching engaging mathematics courses.

My parents Tony and Leah Preiss have always given me unconditional love and encouragement. My brother Sandy is the best friend I could ask for. I thank them for everything. My grandfather Richard Palmer and uncle Dev Palmer have been role models in technical fields since my childhood.
I am lucky to have an extended family filled with nice and interesting people. I cannot imagine doing this without my partner Alana. She has given emotional support, shared her knowledge of computer science theory, and brightened many days. She reminds me to celebrate good things and take care of myself. Finally, I dedicate this dissertation to my grandfather Jack Preiss, who pushed me to apply to graduate school when I needed to be pushed.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Structure of dissertation
Chapter 2: Foundation: Mathematics and Themes
2.1 Notation
2.2 Fundamentals
2.2.1 Metrics
2.2.2 Lipschitz and smooth functions
2.2.3 Orthogonal and Euclidean groups
2.3 Optimization
2.4 Convexity
2.4.1 Convex functions
2.4.1.1 Subdifferentials
2.4.1.2 Convex optimization problems
2.4.2 Strong convexity
2.4.3 Quasiconvex functions
2.4.4 Convex optimization algorithms
2.5 Markov decision processes
2.5.1 Partially observable Markov decision processes
2.5.2 Trajectories
2.5.3 Infinite-horizon MDPs
2.5.3.1 Bellman equations and operators
2.5.4 Finite-horizon objective
2.6 Families of MDPs
2.6.1 Dynamics variations
2.6.2 Reward variations
2.7 Reinforcement learning
2.7.1 On and off-policy algorithms
2.7.2 Policy gradient methods
2.7.2.1 Log-derivative trick
2.7.2.2 Policy gradient algorithm
2.8 Control theory paradigms
2.8.1 System identification
2.8.1.1 Persistence of excitation
2.8.2 Control with known model
2.8.3 Robust control
2.8.4 Gain scheduling
2.8.5 Adaptive control
2.8.6 Model-predictive control
2.8.6.1 Receding horizon
2.8.6.2 Linear MPC
2.9 Linear dynamical systems and control
2.9.1 Discrete time
2.9.1.1 Autonomous system
2.9.1.2 Stability
2.9.1.3 Linear control systems
2.9.1.4 Controllability
2.9.1.5 Stabilizing controllers
2.9.1.6 Linear quadratic regulator (LQR)
2.9.1.7 Outputs and State Estimation
2.9.1.8 Observability
2.9.1.9 Luenberger observer
2.9.1.10 Kalman filter
2.9.2 Continuous time
2.9.2.1 Autonomous system
2.9.2.2 Stability
2.9.2.3 Linear control systems
2.9.2.4 Controllability
2.9.2.5 Linear-quadratic regulator
2.9.3 Canonical forms
2.9.4 Pole placement
2.10 Statistical learning
2.10.1 General statistical learning problem
2.10.2 Supervised learning
2.10.3 Gradient-based optimization
2.11 Neural networks
2.11.1 Neural network architectures
2.11.1.1 Nonlinearities
2.11.2 Fully connected neural network
2.11.3 1D convolutional neural network
2.11.4 Recurrent neural network
2.11.4.1 Long short-term memory
Chapter 3: Reinforcement Learning for Universal Policies
3.1 Related work
3.2 Problem statement
3.3 Method
3.3.1 Learning algorithms
3.3.2 Implementation details
3.4 Experiments
3.4.1 Point-Mass Environment
3.4.2 Half-Cheetah environment
3.5 Discussion
3.6 Simplified experiment: Universal policy versus experts
Chapter 4: Deformable Manipulation using Learned Models
4.1 Introduction and Related Work
4.2 Problem Setting and Preliminaries
4.3 Methods
4.3.1 Data collection
4.3.2 RNN dynamics model
4.3.3 Model-predictive control with reduced-order model
4.3.4 Estimating the RNN state
4.3.5 Implementation
4.4 Experiments
4.4.1 Model frequency response
4.4.2 MPC tracking
4.5 Conclusion
Chapter 5: Variance of Policy Gradient for LQR Problems
5.1 Introduction
5.2 Related work
5.3 Problem setting
5.4 Main result: Variance bounds on the REINFORCE estimator
5.5 Experiments
5.5.1 RL policy optimality for varying Σ_u
5.6 Proof of Theorem 5.4.1
5.6.1 Bounding ∥x_t∥
5.6.2 Bounding Term 1
5.6.3 Bounding Term 2
5.6.4 Combining bounds
5.7 Proof of Theorem 5.4.2
5.7.1 Lower bounding E∥Σ_{t=1}^H δu_t x_t∥²
5.7.2 Lower bounding E(Σ_{t=1}^H r_t)²
5.7.3
5.7.4 Combining
5.8 Discussion
Chapter 6: Suboptimal Coverings
6.1 Introduction
6.2 Problem setting
6.3 Related work
6.4 Theoretical results
6.4.1 Scalar upper bound
6.4.2 Scalar lower bound
6.5 Empirical results
6.5.1 Geometric grid construction for upper bounds
6.5.1.1 Empirical upper bound on N_α^cov(Φ)
6.5.1.2 Efficiency of geometric grid partition
6.5.1.3 Efficiency of GCC synthesis
6.5.2 Suboptimal neighborhood visualizations
6.6 Proof of Lemma 6.4.8
6.7 Efforts towards matrix case
6.7.1 Easy case: Scalar multiples of B
6.7.2 Role of α's lower bound
6.7.3 Form of Riccati perturbation for geometric grid recursion
6.7.3.1 Multiplicative change in P
6.7.3.2 Additive change in P
6.7.4 How we would use bounds on cost change due to B perturbations
6.7.5 Existing Riccati solution and perturbation bounds
6.7.6 Lower bound candidates
6.7.6.1 Lower bound for A = I
6.7.6.2 Lower bound for A = (1/n)1
6.7.7 Packing-based strategies for lower bounds
6.7.8 Reparameterization
6.7.9 Suboptimal neighborhoods for variations in A
6.7.9.1 Cart-pole system
6.7.9.2 Two real eigenvalues
6.7.9.3 Pair of conjugate eigenvalues
6.7.9.4 Spring-mass-damper
6.7.9.5 Discussion
6.8 Conclusion and future work
Chapter 7: Conclusions
7.1 Summary of contributions
7.2 Future work
Bibliography

List of Tables

4.1 Values of user-chosen constants in deformable manipulation experiments.
4.2 MPC tracking errors for our deformable manipulation method.

List of Figures

1.1 Illustration of an early electromechanical autopilot.
2.1 The function −e^{−x²} is quasiconvex but not convex.
2.2 Typical nonlinearities used in neural networks.
3.1 Diagram of method for adaptive universal policies using a system identification embedding.
3.2 Learned system identification embedding for point-mass system.
3.3 Actual vs. estimated gain and embedding values for point-mass system.
3.4 Visualization of "non-dimensionalizing" learned embedding for point-mass system with redundant parameters.
3.5 Variations of Half-Cheetah environment produced by randomization of kinematic and dynamic properties.
3.6 Training and test rewards for our method and baselines.
3.7 Learning curves for multi-system "universal policy" and single-system "expert" policies for nine random linearized planar quadrotor systems.
4.1 Real-robot test setup for deformable manipulation.
4.2 Architectural diagram of our method for deformable manipulation, illustrating role of RNN model, EKF, and MPC.
4.3 Schematic diagram of pool noodle experimental setup.
4.4 Frequency-domain gain and phase response (Bode plots) for real pool noodle and LSTM model.
4.5 Two-dimensional projections of paths traced by pool noodle free end in MPC tracking experiments.
4.6 Traces of rotation angle inputs and horizontal and vertical components of pool noodle free end for MPC tracking of circle trajectory.
5.1 Comparison between our upper bounds and the empirically measured variance of REINFORCE as they relate to matrix parameters of the LQR problem.
5.2 Comparison between our upper bounds and the empirically measured variance of REINFORCE as they relate to state dimensionality and time horizon of the LQR problem.
5.3 Learning curves of REINFORCE for a random LQR problem with varying scales of action noise and environment noise.
5.4 Suboptimality ratios of the policy after 1000 iterations of REINFORCE for a random LQR problem with varying scales of action noise and environment noise.
6.1 Diagram of quadrotor helicopter translation and attitude states.
6.2 Illustration of geometric grid partition (Definition 6.5.1).
6.3 Empirical upper bound on grid pitch k needed to construct geometric grid covering of linearized quadrotor using GCC synthesis.
6.4 Suboptimality ratios for corner cells in geometric grid covering of linearized quadrotor.
6.5 α-suboptimal neighborhoods for geometric grid partition in 2D systems with minimum coupling (A = I) and maximum coupling (A = (1/n)1) dynamics.
6.6 Topological phases of α-suboptimal neighborhood for one controller in 3D system with minimum coupling (A = I).
6.7 "Approximation error" accounted for by the α > (2a+1)/(2a) assumption in the scalar upper bound proof.
6.8 Looseness introduced by the inequality (6.12) for random LQR problems.
6.9 Comparison of actual cost and lower bound based on the strong convexity constant derived by Mohammadi et al. (2019) for scalar LQR problem.
6.10 Cart-pole system.
6.11 α-suboptimal neighborhoods for cart-pole system.
6.12 α-suboptimal neighborhoods for system in controllable canonical form (CCF) with A having two positive real eigenvalues.
6.13 α-suboptimal neighborhoods for system in controllable canonical form (CCF) with A having two complex conjugate eigenvalues.
6.14 Spring-mass-damper system. There is no gravity.
6.15 α-suboptimal neighborhoods for spring-mass-damper system with variations in stiffness and damping constants.
6.16 Example of a poor match between a grid partition of Φ and true suboptimal neighborhoods of LQR-optimal controllers for Φ in the cart-pole system.

Abstract

The interface between machine learning and control has enabled robots to move outside the laboratory into challenging real-world settings. Deep reinforcement learning can scale empirically to very complex systems, but we do not yet understand precisely when and why it succeeds. Control theory focuses on simpler systems, but delivers interpretability, mathematical understanding, and guarantees. We present projects that combine these strengths. In empirical work, we propose a framework for tasks with complex dynamics but known reward functions.
We restrict the use of learning to the dynamics modeling stage, and act based on this model using traditional state-space control. We apply this framework to robotic manipulation of deformable objects. In theoretical work, we deploy the well-understood linear quadratic regulator (LQR) problem as a test case to "look inside" algorithms and problem structure. First, we investigate how reinforcement learning algorithms depend on properties of the dynamical system by bounding the variance of the REINFORCE policy gradient estimator as a function of the LQR system matrices. Second, we introduce the framework of suboptimal covering numbers to quantify how much a good multi-system policy must change with respect to the dynamics parameters, and bound the covering number for a simple class of LQR systems. xii Chapter1 Introduction A machine learning system makes decisions based on a data set of observations and improves its perfor- mance as the amount of data grows. Machine learning is useful for interacting with systems that are too complicated to model based on first principles. Within the field of robotics, two especially important ap- plications are vision input and complex dynamics. This dissertation focuses on the latter. Some examples of complex dynamics are: • The contact friction between a foot and the surface upon which it walks. • A deformable object bending under the influence of boundary conditions. • The aerodynamics of a helicopter when it is close to the ground. The dynamics of these phenomena obey the laws of physics, but that does not mean they are easy to predict: • The distribution of pressure and friction across the foot changes throughout the stride. Each walking surface adds its own complexity, for example sand or ice. • The deformable object is governed by continuum mechanics, so its true state is infinite-dimensional. • The helicopter’s airflow can be well-approximated in static hover, but is harder to predict when the helicopter is accelerating. 1 Even when an accurate physics-based model is available, using it in a robot’s control loop may be compu- tationally infeasible. Control theory provides (usually) mathematically principled techniques for realizing a desired behavior of a dynamical system via inputs. However, those techniques often depend on assumptions that are not satisfied by complex systems, such as linearity, smoothness, convexity, simple probability distributions, and so on. Often the techniques struggle with high-dimensional systems. Control theory has developed its own learning methods within the subfields of system identification and adaptive control. Reinforcement learning (RL) is the branch of machine learning that deals with acting optimally in an unknown dynamical system. Early RL research was mainly confined to finite state and action spaces in discrete time, whereas control theorists focused on continuous spaces in both discrete and continuous time. On the other hand, RL researchers have always been interested in “intelligent” behavior involving long- term planning, whereas control theory has focused on simpler behaviors like stabilizing at an equilibrium or tracking a reference trajectory. A key goal of RL theory, and of machine learning theory in general, is understanding the sample complexity of learning problems: how much data is required to guarantee that a certain performance metric is satisfied? 
Sample complexity upper bounds are most often derived by proposing an algorithm, while lower bounds are most often derived by carefully constructing worst-case problem instances. Sample complexity analysis is closely related in spirit to computational complexity analysis in computer science. In the past few years, the theoretical sides of each community have increased their overlap. Especially notable has been the drive for learning-theory-style sample complexity guarantees in control-theory-style dynamical systems, especially linear systems, where many previous results only provided asymptotic guar- antees. Mania et al. (2019), Simchowitz and Foster (2020), and others obtained finer-grained characteriza- tions of existing methods in control theory by applying the sample complexity perspective. Conversely, researchers have also used those classic settings as mathematically tractable test cases to gain insight into 2 existing reinforcement learning algorithms (Fazel et al., 2018; Mohammadi et al., 2019). New lines of in- quiry into these systems has also provoked new questions more fundamental than sample complexity (Bu et al., 2019b). A more extensive review of RL theory for continuous systems is given in §5.2. Empiricalsuccesses The overlap of the empirical sides of learning and control have produced spectac- ular results in recent years. The success of deep neural networks for nonlinear function approximation with high-dimensional data, such as in image classification (Krizhevsky et al., 2012), motivated researchers to experiment with applying them to control tasks. In a landmark result, Mnih et al. (2013) showed that a single reinforcement learning algorithm and deep neural network architecture could reach human-level performance on many different games from the Atari 2600 console. This combination is known as deep reinforcement learning (deep RL). Their method used the Q-learning principle, meaning it could only be applied to settings where computing the maximum in the Bellman optimality operator (2.9) is feasible. Practically, this meant environments with finite (and not combinatorially large) action spaces. Subsequently, a flurry of deep RL algorithms were proposed targeting similar results for continuous action spaces. The OpenAI Gym (Brockman et al., 2016), especially the locomotion-related environments, emerged as a benchmark. Among these algorithms, those of Schulman et al. (2017) and Haarnoja et al. (2018) were especially successful and became widely used for RL in continuous spaces. However, most of these results were confined to simulation due to high sample complexity, safety issues, or both. For real physical systems, model-based RL is usually preferred for its sample efficiency. Deisenroth and Rasmussen (2011) used Gaussian process regression to handle model uncertainty in a principled way, but the method is computationally infeasible for high-dimensional systems. The guided policy search family of algorithms uses localized traditional trajectory optimization as a “teacher” for a global neural network policy, and has been applied to tasks with vision input (Levine et al., 2016) and discontinuous dynamics (Chebotar et al., 2017). Kalashnikov et al. (2018) used RL to learn a policy for robotic grasping of a wide range of objects, sidestepping the data efficiency issue by using many robots at once and designing 3 a “self-resetting” environment. OpenAI et al. (2019) achieved simulation-to-reality transfer in dexterous manipulation of a Rubik’s cube, while Hwangbo et al. 
(2017) and Molchanov et al. (2019) demonstrated it for quadrotor control. We emphasize that these are only a few examples of successful applications of RL to robotics. RL also became an essential tool in game-playing AI. Silver et al. (2018) achieved superhuman per- formance on the board game Go using a thoughtful combination of deep reinforcement learning and tree search. (An earlier version reached this milestone but depended on supervised training data from human players.) Later, Schrittwieser et al. (2020) replaced the assumption of known game rules with a learned dynamics model, making the resulting algorithm (MuZero) applicable to a much larger set of RL problems. New challenges At the same time, researchers were discovering downsides of deep RL. Henderson et al. (2018) pointed out that the policy optimality achieved by deep RL algorithms is unusually sensitive to the initial state (seed) of the algorithm’s pseudorandom number generator, as well as hyperparameters and implementation details. In contrast, deep supervised learning accuracy is not sensitive to random seed (Bhojanapalli et al., 2021). Engstrom et al. (2020) and Andrychowicz et al. (2020) showed that some of the performance gains attributed to core algorithmic differences could in fact be explained by minor implementation differences. Agarwal et al. (2021) pointed out that the high computational cost of deep RL experiments leads to small sample sizes, which are often not treated with enough statistical care. Anecdotally, researchers applying deep RL to a novel task significantly different from the popular benchmarks generally cannot expect it to “just work”. Reward shaping, hyperparameter tuning, and com- paring several algorithms are the norm. Henderson et al. (2018) reported that the performance ranking of different algorithms is not consistent across different environments. Non-RLcombinations oflearning andcontrol Machine learning has useful applications in control beyond RL. For example, Shi et al. (2019, 2021) integrate learned models with first-principles models in 4 Figure 1.1: Illustration of a subassembly in an early electromechanical autopilot (Sperry, 1921, U.S. Patent 1368226A). Electrical components are used for actuation and as an energy source, but the core control policy is a physical mechanism. a restricted manner to account for complex dynamics (of quadrotor ground effect and multi-quadrotor downwash, respectively) while preserving a stability guarantee. Ideas from control theory can be used to improve learning: Terzi et al. (2021) derive sufficient conditions that can be enforced upon a recurrent neural network dynamics model to admit guarantees for a state observer, controller, and the model itself. Singh et al. (2021) enforce a stabilizability condition on a learned dynamics model and show that it acts as a regularizer, improving model accuracy over generic regression when the data set is small. Amos et al. (2018) derive analytic derivatives of the outputs of a model-predictive control optimization problem and use them for end-to-end imitation learning and system identification. Historicalremarks The fields of control theory and reinforcement learning are closely related. Histor- ically, control theory developed mainly within the context of mechanical and electrical engineering and 5 was applied to electrical or physical systems. 
Control theory predates digital computers: continuous-time controllers such as autopilots were implemented with analog electronics or even mechanisms with moving parts (Sperry, 1921), as shown in the patent illustration of Figure 1.1. Reinforcement learning developed mainly within the artificial intelligence research community. In addition to practical applications, research was also tied to efforts to understand mechanisms of learning in humans and animals. Sutton and Barto (2018) give a retrospective on the history of reinforcement learning, including its relationship to optimal control. 1.1 Structureofdissertation The remainder of this dissertation is organized into five chapters. Chapter 2 provides the setting for our work. This chapter mainly focuses on mathematical foundations, but also discusses some of the key motivating ideas. In particular, §2.6 discusses families of Markov decision processes and families of optimal control problems, including several examples. We use this formalism in Chapters 3 and 6. Chapter 3 presents our work on a deep reinforcement learning architecture for policies that can adapt online to systems with widely varying dynamics. Similar to adaptive methods in control theory, our frame- work is built on an online system identification process. We replace online estimation and optimization with pre-trained neural networks to handle complex settings with low computational cost. We also experi- ment with learned mapping from dynamics parameters to an embedding space. Our experiments highlight some fundamental questions about the multi-system setting and about reinforcement learning itself, which provide motivation for subsequent chapters. Chapter 4 proposes a more structured alternative to reinforcement learning for systems where the dy- namics are complex but the reward is simple and known. We learn a recurrent neural network dynamics model from input-output trajectory data and apply traditional state-space estimation and control to the 6 abstract internal state of the RNN model. We apply this framework to robotic manipulation of deformable objects. An ablation study demonstrates the benefit of using this closed-loop approach compared to feed- forward planning using the model. Chapter 5 begins the theoretical portion of this dissertation. We analyze the variance of the REINFORCE policy gradient estimator for linear-quadratic regulator (LQR) systems. We provide upper and lower bounds on the variance as a function of the system parameters. With respect to the dynamics and cost matrices, our bounds are tight and closely match the behavior of the empirical variance. How- ever, we also show that the variance is not directly correlated to the optimality of the policy produced by the REINFORCE algorithm. This challenges the folklore that high-variance gradient estimates are the dominant challenge in policy gradient methods. Chapter 6 introduces the concept of suboptimal covering numbers as a way to quantify how much an optimal policy for an infinite family of control problems must alter its behavior with respect to the problem parameters. Suboptimal covering numbers are intuitive and have desirable mathematical properties such as parameterization independence. We show matching logarithmic covering number bounds for single- input fully-observable LQR problem families with actuator strength variations. 
For multi-input problems, we present empirical work testing a conjectured upper bound and use visualizations to inspect the behavior of candidate systems for a lower bound. We discuss work in progress and intermediate results towards proving the matrix-case conjecture, and initial experiments regarding further expansion of the scope of problem families. 7 Chapter2 Foundation: MathematicsandThemes Although the topics of each chapter in this dissertation are diverse, their mathematical foundations have many ideas in common. This chapter defines notation and reviews important definitions and theorems we will use. We assume the reader is familiar with some fundamental definitions in linear algebra, analysis, probability, and differential equations. We also take the opportunity in this chapter to introduce some common themes that appear repeatedly in our work. In particular, we introduce notion of a structured family of control problems or Markov decision processes, and review concepts such as robustness and adaptivity that are broadly applicable in both traditional and learning-based control. 2.1 Notation Sets For a setX,2 X denotes the power set{Y :Y ⊆ X}. Linearalgebra The notationA≻ B (resp. A⪰ B) indicates that the matrixA− B is positive definite (resp. positive semidefinite). The sets of n× n positive definite and positive semidefinite matrices are denoted byS n ++ andS n + respectively. For a matrix A ∈ R n× n , the set of its eigenvalues are denoted by Λ( A), its spectral radius is denoted by ρ (A) = max{|λ | : λ ∈ Λ( A)}, and the largest real part of its eigenvalues is denoted byρ + (A) = max{Reλ : λ ∈ Λ( A)}. Matrices or vectors of zeros and ones, with 8 dimension implied by context, are denoted by 0 and 1. The notation diag(x) (resp. diag(x 1 ,...,x n )) refers to a diagonal matrix with the entries of the vectorx (resp. the scalars or blocksx 1 ,...,x n ) on its diagonal. Functionsandimages The notationY X indicates the set of all functions fromX toY . Iff : X 7→ Y andX ′ ⊆ X, we overload notation and denote the image ofX ′ underf by f(X ′ )={f(x):x∈X ′ }. Similarly, we denote the inverse image ofY ′ ⊆ Y by f − 1 (Y ′ )={x∈X :f(x)∈Y ′ }. Sequences The notationX N indicates the set of allX-valued sequences. The notationX <N indicates the set of all finite X-valued sequences, i.e. X <N = ∪ ∞ n=0 X n . When a sequence x 0 ,x 1 ,... has been defined, the notation x m:n refers to the tuple(x m ,x m+1 ,...,x n ). Normsandinnerproducts ForP ⪰ 0, we use the notation ⟨x,y⟩ P =⟨x,Py⟩ for its weighted inner product. For its induced norm, we use ∥x∥ P = p ⟨x,x⟩ P . 9 Otherwise, the notation∥·∥ p forp≥ 1 refers to the usualp-norm, i.e. ∥x∥ p = n X i=1 x p i !1 p , in which x i are the individual components of x. For a linear function F : X 7→ Y whereX andY are vector spaces, the notation∥F∥ p,q refers to thep-q operator norm: ∥F∥ p,q =max{∥Fx∥ q :x∈X,∥x∥ p ≤ 1}. We also apply the operator norm notation to matrices based on the natural isomorphism to linear operators. We denote the Frobenius norm of the matrix A by∥A∥ F = √ trA T A. We only use the norm notation without subscripts when the particular norm is explicitly stated or clear from the context. Probability If(X,Σ) is a measurable space, then we use the notation∆( X,Σ) to denote the set of all probability measures on(X,Σ) . In cases where the sigma-algebra has already been stated or is common (e.g. the power set whenX is finite; the Borel sigma-algebra when X is a topological space), we often use the notation∆( X). 
For a function f : X 7→ ∆( Y), if the density function (Radon-Nikodym derivative) exists for all measures inf(X), we use the notationf(·|x) to denote the density function off(x). Indicatorfunctions LetX denote some set andP a logical predicate onX. We use the notation I [P] (x)= 1:P(x) 0:¬P(x). For a subsetS⊆ X and the predicateP(x)=x∈S, we use the shorthandI {S} instead ofI [x∈S] . 10 Topology A topological space(X,T) is defined by a set X and a collection of “open” setsT ⊆ 2 X that is closed under arbitrary unions and finite intersections. For arbitrary Y ⊆ X, the interior ofY is denoted byintY =∪{S∈T :S⊆ Y}. 2.2 Fundamentals 2.2.1 Metrics Recall that a metric on a setX is a functiond:X× X 7→R satisfying the properties for allx,y,z∈X: • Identity of indiscernibles: d(x,y)=0 ⇐⇒ x=y. • Symmetry: d(x,y)=d(y,x). • Triangle inequality: d(x,z)≤ d(x,y)+d(y,z). From these properties, nonnegativityd(x,y)≥ 0 can be derived. Asemimetric onX satisfies nonnegativ- ity, symmetry, and the identity of indiscernibles, but does not satisfy the triangle inequality. Notably, the squared Euclidean distance∥x− y∥ 2 2 is a semimetric. 2.2.2 Lipschitzandsmoothfunctions Suppose (X,d X ) and (Y,d Y ) are metric spaces We say f : X 7→ Y is L-Lipschitz for L > 0 if, for all x,x ′ ∈domf, d Y (f(x),f(x ′ ))≤ Ld X (x,x ′ ). Most commonly, we study Lipschitz functions in normed vector spaces whered X andd Y are norm-induced metrics. If the functionf :R n 7→R is differentiable and its gradient is β -Lipschitz, that is, ∇f(x)−∇ f(x ′ ) ≤ β x− x ′ , 11 we sayf isβ -smooth. 2.2.3 OrthogonalandEuclideangroups The general linear group, denoted GL(n), is the group of invertible linear operators on n-dimensional Euclidean space over the field of real numbers, with composition as the group operation. The subset ofGL(n) that is distance-preserving forms a subgroup, the orthogonal groupO(n). Under the Euclidean topology, the groupO(n) has two connected components. The component that contains the identity is a subgroup, denotedSO(n), and consists of those elements ofO(n) that preserve orientation. These are the familiar rotations whenn=2 or3. The notation so(n) refers to the Lie algebra associated withSO(n), which corresponds to the tangent space of SO(n) at the identity. It is beyond the scope of this dissertation introduce Lie groups and Lie algebras with satisfactory rigor. For our purposes, we can think of each element of so(n) as a “rotational velocity”. The Euclidean group, denotedE(n), are the operators that are distance-preserving but not necessarily linear. ThespecialEuclideangroupSE(n) is the orientation-preserving subgroup ofE(n). The groupSE(3) describes the full “pose” of a rigid object in three-dimensional space. SE(3) is isomorphic toSO(3)× R 3 . The groups SO(3) and SE(3) appear often in robotics. The associated Lie algebra se(3) is isomorphic to so(3)× R 3 , and also appears often in robotics to describe the full “velocity state” of a rigid object in three-dimensional space, sometimes called the twist. There are several ways to represent elements ofSO(3) for computation, including the3× 3 matrices themselves, the unit quaternions, pairs of a unit vector axis and a rotation angle, triples of rotation angles around fixed axes (Euler angles), and skew-symmetric matrices for elements of so(3). 
Although the latter three representations are parameterized byR 3 , the incompatible topologies ofR 3 and SO(3) imply that 12 any mapping fromR 3 toSO(3) must suffer from at least one of the following problems: incompleteness, multiple-covering, or discontinuity. 2.3 Optimization Mathematical optimization is a generic framework that permeates many aspects of robotics, including planning, control, machine learning, and multi-robot coordination. An optimization problem has the form minimize f(x) subjectto x∈X, (2.1) whereX is some set and f : X 7→ R is some function. In this dissertation we are mainly concerned withcontinuousoptimization, whereX is an uncountable set endowed with notions like topology, metric, inner product, etc. (By contrast, in combinatorial optimization the setX is finite, although possibly very large.) IfX =R n for somen, we say the problem isunconstrained. On the other hand, ifX ⊂ R n , we say the problem isconstrained. For constrained problems we usually define membership in X by equality and inequality constraints, resulting in the form minimize f(x) subjectto g(x)=0 h(x)≤ 0, (2.2) whereg andh are arbitrary vector-valued functions. Localandglobaloptima Ifx∈X is a true solution to the generic optimization problem (2.1), we say it is aglobaloptimum. On the other hand, we sayx is alocaloptimum if there exists an open setB such that x∈B andf(x)≤ f(x ′ ) for allx ′ ∈B∩X . Depending on the structure of the optimization problem, it may 13 not be computationally feasible to seek a global optimum. An important class of optimization problems for which we can find a global optimum with polynomial query complexity are the convex optimization problems, discussed in §2.4. 2.4 Convexity A setX ⊆ R n is convex if, for anyx,x ′ ∈X andθ ∈[0,1], it holds that (1− θ )x+θx ′ ∈X. Example2.4.1. Some commonly encountered convex sets are: • R n itself. • Norm balls{x∈R n :∥x∥ p ≤ r} of any radiusr≥ 0 for1≤ p≤∞ . Note that thep-norm ball for p<1 is not convex. • Polytopes / polyhedra of the form {x∈V :Ax≤ b}, whereA is a linear map fromV to some other vector spaceW ,b∈W , and≤ denotes simultaneous elementwise satisfaction. • The set ofn× n positive semidefinite matrices S n + . • Degenerate cases: The empty set, the singleton set{x} forx∈R n . Theorem2.4.2 (Properties of convex sets). IfS andT are convex sets, then ... • Intersections:S∩T is convex. 14 • Minkowskisums:{s+t:s∈S,t∈T} is convex. • Cartesianproducts: Ifwedefinevectorspaceoperationsfor S×T inthenaturalway,thenS×T is convex. • Affineimages: Iff is an affine function, then the image f(S) is convex. 2.4.1 Convexfunctions The epigraph of a functionf :X 7→R, whereX ⊆ R n , is the set epif ={(x,r):x∈X,r≥ f(x)}⊆ R n+1 . If the epigraph of f is convex, we say f is a convex function. Equivalently, f is convex if and only if it satisfies Jensen’s inequality: f((1− θ )x+θx ′ )≤ (1− θ )f(x)+θf (x ′ ) for all x,x ′ ∈ X and θ ∈ [0,1]. Jensen’s inequality can be generalized to a probabilistic form: if X is a random variable, then f(E[X])≤ E[f(X)]. We say a functionf is concave if− f is convex. Convex functions satisfy the following properties: • If f and g are convex, then f + g is convex. More generally, any nonnegative weighted sum of convex functions is convex. • f(Ax+b) is convex. (Affine composition) 15 • IfF is a (potentially infinite) set of convex functions on domain X , then the pointwise supremum functionsup f∈F f(x) is convex inx. 2.4.1.1 Subdifferentials SupposeX ⊆ R n is a convex set andf :X 7→R is a function. 
We say thatg∈R n is asubgradientoff at x 0 ∈X if, for allx∈X , we have f(x 0 )+⟨g, x− x 0 ⟩≤ f(x). Thesubdifferentialof f atx 0 , denoted by∂f(x 0 ), is the set of all subgradients off atx 0 . The subdifferential is a convex set. If∂f(x)̸=∅ for allx∈intX , thenf is convex. Conversely, iff is convex, then∂f(x)̸=∅ for all x ∈ X . If f is differentiable at x , then ∂f(x) is a singleton set. If g : R n 7→ R is convex, then x∈R n is a local optimum ofg if and only if0∈∂g(x). These properties are proved by Bubeck (2015). 2.4.1.2 Convexoptimizationproblems A convex optimization problem is a problem of the form minimize f(x) subjectto x∈X, (2.3) whereX is a convex set andf is a convex function. There is much to be said about convex optimization problems (Nesterov, 2003; Boyd and Vandenberghe, 2004; Bubeck, 2015), but the most important facts are the following: • Ifx is a local optimum of a convex optimization problem, then it is also a global optimum. 16 • There exist algorithms to find an approximately optimal solution for convex optimization problems that are polynomial-time in the relevant properties off andX , even under very weak “membership oracle” access toX (Abernethy and Hazan, 2016). 2.4.2 Strongconvexity A function f is µ -strongly convex with respect to some norm∥·∥ if there exists µ > 0 such that, for all x,y∈domf and allg∈∂f(x), f(y)≥ f(x)+⟨g, y− x⟩+ µ 2 ∥y− x∥ 2 . (2.4) Informally, strongly convex functions are “at least as convex” as a quadratic function. Strong convexity leads to improved convergence rates of first-order optimization algorithms (Bubeck, 2015). 2.4.3 Quasiconvexfunctions Definition 2.4.3. A functionf : D 7→ R on the convex domain D ⊆ R n is quasiconvex if its sublevel setsD α ={x∈D :f(x)≤ α } are convex for allα ∈R. All convex functions are quasiconvex, but not all quasiconvex functions are convex: for example, the function− e − x 2 as shown in Figure 2.1. − 2 − 1 0 1 2 − 1 − 0.5 0 Figure 2.1: The function− e − x 2 is quasiconvex but not convex. 17 Lemma2.4.4 (Boyd and Vandenberghe (2004), §3.4). Thefollowingfactsholdforquasiconvexfunctionson a convexD⊆ R (equivalently,D is an interval). (a) Iff :R7→Riscontinuous,thenf isquasiconvexifandonlyifatleastoneofthefollowingconditions holds onD: 1. f is nondecreasing. 2. f is nonincreasing. 3. Thereexistsc∈D suchthatforallt∈D,ift<cthenf isnonincreasing,andift≥ cthenf is nondecreasing. (b) Iff :R7→R is twice differentiable and d 2 f dx 2 > 0 for allx∈D where df dx = 0, thenf is quasiconvex onD. (c) If f(x) = p(x) q(x) , where p : R 7→ R is convex with p(x) ≥ 0 onD and q : R 7→ R is concave with q(x)>0 onD, thenf is quasiconvex onD. 2.4.4 Convexoptimizationalgorithms Supposef :R n 7→R is a convex function and that the optimum x ⋆ =argmin x∈R n f(x) exists. In this section we sketch the main results for optimization of a convex function under the first- order oracle model, in which the algorithm can query∇f(x) for arbitraryx but has no further structural knowledge off(x). The first-order oracle falls in the broader class of black-boxoptimization models, which also includes zeroth-order oracles where onlyf(x) can be queried. 18 Many black-box optimization algorithms require an initial guessx 0 and generate an infinite sequence of iteratesx 1 ,x 2 ,... A central object of study for such algorithms is their convergence rate: a bound of the form f(x k )− f(x ⋆ )≤ b(k), where the functionb(k):N7→R ≥ 0 depends on properties off and the distanceR=∥x 0 − x ⋆ ∥. 
Gradientdescent The simplest procedure to optimizef is (sub)gradient descent: starting with an initial guessx 0 ∈R n , we follow the recurrence x k+1 =x k − η ∇f(x k ), whereη > 0 is a step size parameter. Whenf is not differentiable, we can understand ∇f(x) to denote an arbitrary deterministic selection from∂f(x). Whenf belongs to the following function classes, (sub)gradient descent with the appropriate choice ofη leads to the following convergence rates (Bubeck, 2015): • L-Lipschitz: O RL √ k . • β -smooth: O R 2 β k . • β -smooth andµ -strongly convex: O R 2 exp − k β/µ . 19 Acceleratedgradientdescent A computationally inexpensive modification of gradient descent intro- duced by Nesterov (1983) turns out to yield improved convergence rates. Starting with the initial guess y 0 =x 0 ∈R n , we follow the recurrence x k+1 =y k − 1 β ∇f(y k ), y k+1 =x k+1 + √ β − √ µ √ β + √ µ (x k+1 − x k ). Accelerated gradient descent leads to the following convergence rates (Bubeck, 2015): • β -smooth: O R 2 β k 2 . • β -smooth andµ -strongly convex: O R 2 exp − k p β/µ !! . Note that the ratio β µ ≥ 1 according to the definitions of smoothness and strong convexity. Lowerbounds The convergence rates attained by gradient descent and accelerated gradient descent are optimal. For any algorithm in the first-order oracle model that satisfies x k+1 ∈span(g 1 ,...,g k ), whereg k is the (sub)gradient queried at stepk, there exists aL-Lipschitz convex function, aβ -smooth con- vex function, and aβ -smooth andµ -strongly convex function for whichf(x k )− f(x ⋆ ) is lower-bounded byΩ(1 / √ k),Ω(1 /k 2 ), andΩ(exp( − k)) respectively (Bubeck, 2015). 2.5 Markovdecisionprocesses The formalism of Markov decision processes captures a wide range of problems in which an agent interacts with a dynamical system. 20 Definition 2.5.1. A discrete-time Markov decision process (MDP) is defined by the tuple (X,U,P,r,µ ) where: • X is the state space, • U is the action space, • P :X ×U 7→ ∆( X) is the state transition map, • r :X ×U 7→ R is the reward function, and • µ ∈∆( X) is a distribution over initial state. The agent interacts with the MDP in the following manner: • The initial state is sampled: x 0 ∼ µ . • At each time stept∈N, the agent observes the current statex t ∈X and takes an actionu t ∈U. • The agent receives and observes the rewardr t =r(x t ,u t ). • The next state is sampled according tox t+1 ∼ P(x t ,u t ). The property that P(x t+1 |x t ,u t )=P(x t+1 |x 1 ,u 1 ,...,x t ,u t ) is called the Markovian property. Any method for choosing actions in a MDP is called a policy. A fully general definition of policies includes all functionsX × (X ×U× R) <N 7→ ∆( U). That is, in addition to the current state, the policy could depend on the full sequence of states, actions, and rewards from previous timesteps. However, the Markovian property implies that history-dependent policies are unnecessary for optimality. A stationary policy is a functionX 7→ ∆( U) that does not depend on the history or time index. A policy is called 21 deterministic if it only outputs point distributions. When discussing deterministic policies we state the policy class codomain asU instead of∆( U). If either ofX andU are uncountable sets, then many mathematical statements about an MDP require substantial technical delicateness and additional conditions onP ,r, andµ to ensure that derived objects of interest exist, are unique, and satisfy properties such as measurability. 
A thorough treatment of these issues is given by Bertsekas and Shreve (1978). To keep this section simple, we limit our discussion to finite MDPs. However, we emphasize that many of the core ideas, such as Bellman optimality, still hold in uncountable spaces under mild assumptions. 2.5.1 PartiallyobservableMarkovdecisionprocesses In the standard MDP, the agent observes the full statex t . This is often unrealistic. To model incomplete state observations, we introduce the notion of partial observability: Definition2.5.2. A discrete-time partially observable Markov decision process (POMDP) is defined by the tuple(X,U,Y,P,r,h,µ ) where: • X, U, P, r, µ are as in Definition 2.5.1, • Y is a measurable observation space, • h:X ×U 7→ ∆( Y) is the observation function. The agent interacts with the MDP in the following manner: • The initial state is sampled: x 0 ∼ µ . • At each time step t ∈ N, the agent observes a single sample y t ∼ h(x t ,u t ) and takes an action u t ∈U. The agent does not observex t . • The agent receives and observes the rewardr t =r(x t ,u t ). 22 • The next state is sampled according tox t+1 ∼ P(x t ,u t ). In the case whereX,U,Y are all finite, computing an optimal finite-horizon policy when P,r,h,µ are known is PSPACE-complete (Papadimitriou and Tsitsiklis, 1987). This implies that a vast universe of computational problems can be reduced to policy optimization for a POMDP. From the perspective of control theory, partially observable settings are the norm. It is often expensive or impossible to equip a physical system with enough sensors to measure all states. The task of estimating state from a history of inputs and outputs is known as state estimation. In linear dynamical systems, state estimation becomes mathematically and computationally tractable, as we will discuss in §2.9.1.7. In nonlinear continuous-state systems it is generally difficult to provide performance guarantees for state estimation, but straightforward extensions of the linear methods often work well in practice when the system dynamics are smooth and the sensors are not too noisy. Remark 2.5.3. The MDP model is more expressive than it may initially appear. For example, we can model a system where the dynamics change over time by adding a time index variable to the state space. 2.5.2 Trajectories We refer to a record of interaction with an MDP as a trajectory and use the notation τ =(x 0 ,u 0 ,r 0 ),(x 1 ,u 1 ,r 1 ),... We refer to the space of all possible trajectories, with an “ambient” MDP implied by context, asT. Note thatT may be finite-dimensional or a sequence space. A policy π :X 7→∆( U) induces a distribution over T governed by x 0 ∼ µ, u t ∼ π (x t ), r t =r(x t ,u t ), x t+1 ∼ P(x t ,u t ). (2.5) 23 We will denote this distribution byτ π . When we use the notationτ ∼ τ π , it should be understood as a shorthand for the statement (2.5). 2.5.3 Infinite-horizonMDPs Aninfinite-horizon MDP is one where time goes on forever,t→∞. For infinite-horizon MDPs, we define the state-value function, or simply value function,V π :X 7→R of the policy as the discounted sum V π (x)=E " ∞ X t=0 γ t r(x t ,u t ) π,x 0 =x # (2.6) forγ ∈ (0,1], where the expectation is taken over the randomness of the dynamicsP and of the policy π . If the rewardr is bounded, then a discount factorγ < 1 ensures thatV π is bounded. In some special cases (such as linear-quadratic regulators, see §2.9.1.6), it can be shown thatV π is bounded for all policies of interest even thoughr is unbounded, in which caseγ =1 is safe to use. 
Similarly, we define the action-value function or Q-function ofπ as Q π (x,u)=E " ∞ X t=0 γ t r(x t ,u t ) π,x 0 =x,u 0 =u # . MDPobjective The optimization goal for infinite-horizon MDPs is to solve maximize π ∈∆( U) X E s∼ µ [V π (s)], (2.7) 24 2.5.3.1 Bellmanequationsandoperators From the infinite-horizon value function definition (2.6), we immediately see its recursive nature: V π (x)=E " ∞ X t=0 γ t r(x t ,u t ) π,x 0 =x # = E u∼ π, x ′ ∼ P(x,u) " r(x,u)+ ∞ X t=1 γ t r(x t ,u t ) π,x ′ # = E u∼ π, x ′ ∼ P(x,u) r(x,u)+γV π (x ′ ) , (2.8) where the first step uses linearity of expectation and the last step uses the Markov property. Alternatively, we can define the Bellman operator forπ , denoted asT π :R X 7→R X , by (T π V)(x)= E u∼ π (x), x ′ ∼ P(·|x,u) r(x,u)+γV (x ′ ) . Intuitively,T π takes an arbitrary value functionV and returns a value function “more like” theV π . We can now rewrite the conclusion of eq. (2.8) as T π V π =V π . In other words,V π is a fixed point of T π . This gives us a way to characterize and compute V π for a given π , but it does not give us a way to characterize or compute an optimalπ . For this, we define optimalvaluefunction andoptimalQ-function as V ⋆ (x)=sup π V π (x), Q ⋆ (x,u)=sup π Q π (x,u). 25 We define the Bellman optimality operator, denoted byT ⋆ :R X×U 7→R X×U , by (T ⋆ Q)(x,u)=r(x,u)+γ E x ′ ∼ P(x,u) max u ′ ∈U Q(x ′ ,u ′ ) . (2.9) We say a functionQ∈R X×U satisfies the Bellman optimality equations if T ⋆ Q=Q. (2.10) It can be shown that (Agarwal et al., 2022): • Q=Q ⋆ if and only ifQ is a fixed point of T ⋆ (that is, if it satisfies the Bellman optimality equations). • For any infinite-horizon MDP, an optimal stationary and deterministic policy exists. • T ⋆ is aγ -contraction mapping with respect to the∞-norm. Taken together, these propositions imply that the recurrence Q (0) =0, Q (k+1) =T ⋆ Q (k) will converge toQ ⋆ , and that any policy satisfyingπ (x)∈argmax u∈U Q ⋆ (x,u) is an optimal policy. The algorithm defined by computing this recurrence is known as Q-valueiteration. Note thatQ-value iteration requires full knowledge ofP andr. 2.5.4 Finite-horizonobjective In an episodic or finite-horizon MDP, the agent interacts with the environment for a fixed number of time steps, denoted byH ∈N, and is then reset back to an initial statex 0 ∼ µ . For the finite-horizon case, the value andQ functions are different for each time step. This implies that optimal policies for finite-horizon 26 MDPs are time-dependent in general: we must consider the policy class of functionsX × N 7→ ∆( U). However, in this dissertation we are only concerned with optimizing stationary policies, even in finite- horizon settings. Therefore, for the purposes of this dissertation, for eachh∈{0,...,H− 1} we define V π h (x)=E " H− 1 X t=h r(x t ,u t ) π,x h =x # , and Q π h (x,u)=E " H− 1 X t=h r(x t ,u t ) π,x h =x,u h =u # , where again the expectations are over the randomness inP andπ . For the purposes of this dissertation, the core optimization problem in finite-horizon MDPs is to solve maximize π ∈∆( U) X E x∼ µ [V π 0 (x)]. (2.11) Time-dependent equivalents of the Bellman optimality results described for infinite-horizon MDPs in §2.5.4 exist for finite-horizon MDPs. Stating these results requires more notational bookkeeping and is not used subsequently in this dissertation, so we omit them here. The proofs follow the same spirit, using the additivity of the value function and the Markov property of the dynamics. Remark 2.5.4. 
Some classes of stochastic bandit problems (Lattimore and Szepesvári, 2020) can be inter- preted as a degenerate case of episodic MDPs withH = 1. For contextual bandits, the context becomes the state, while for pure bandits the state space becomes a singleton set. 27 2.6 FamiliesofMDPs In this dissertation we will often be interested in discussing a family of MDPs withX,U in common but differences in one or more of P,r,µ . When this arises, we use the notationΦ to refer to this family, and ϕ ∈Φ for a particular MDP in the family. When working with a family of MDPs, we use the notation P(·|x,u;ϕ ), r(x,u;ϕ ), µ (·|ϕ ) to denote the “master” dynamics, reward, and initial state distributions that also depend on ϕ . We use the subscripted terms (P ϕ ,r ϕ ,µ ϕ ) to identify the transition dynamics, reward function, and initial state distribution for a particularϕ ∈Φ . We will also use this notation to refer to families of optimal control problems that do not necessarily fit our discrete-time finite MDP formalism precisely. For example, we will use it to describe families of continuous-time linear dynamical systems in Chapter 6. We are most interested in cases whereΦ is highly structured. We now give a list of such examples. 2.6.1 Dynamicsvariations In each of the following examples, only the dynamicsP change. Kinematic-chain robots The family Φ represents a set of robots with a common kinematic topology but with variations in mass, geometry, actuator strength, coefficients of friction, and so on. The state spaceX is described by the position inSE(3) of the kinematic root, the combined rotational and angular velocities in se(3) of the kinematic root, and the angles and velocities of the joints, which applies to all robots. The action spaceU is torque commands for the joint actuators. 28 Aircraft The familyΦ represents a set of aircraft. For practical purposes, the state is fully described as a rigid body. For a family of quadrotors, the action spaceU is four rotor thrusts. For a family of twin-engine airplanes, the action spaceU is two engine thrusts and the positions of the ailerons, elevators, and rudder. ∗ Φ represents variations in mass, moments of inertia, thruster configuration, and aerodynamic properties that lead to different dynamics. Manipulatedrigidobjects The familyΦ represents different objects that may be held in the gripper of a robot. The objects are rigid, so their state is fully described by an element ofSE(3)× se(3). However, the objects have different shapes, masses, and moments of inertia, resulting in different dynamics for the complete system of a force/torque-controlled robot grasping the object. Deformable objects materials The family Φ represents variations in material properties of a de- formable object that may be held in the gripper of a robot. Each variation shares the same shape, so its state is fully described by the same infinite-dimensional continuum state. However, the objects have dif- ferent material properties leading to varying levels of springiness, compressibility, etc. Again, this results in different dynamics for the coupled robot-object system. Game opponents The family Φ represents one player’s side of a two-player competitive game with varying opponent strategies. The reward r is zero until the game reaches a completed state. Different strategies of the opponent lead to different dynamics. 2.6.2 Rewardvariations In each of the following examples, only the reward functionr changes. 
∗ These models are simplifications that are only reasonable under the assumption of smooth and Lipschitz control inputs. 29 Navigation goals The state spaceX ⊆ R 3 is some free space in the physical world. The action space U = R 3 is the desired robot velocity. The dynamics are simple integration combined with collision dy- namics. The reward isd(x,g) whereg∈X is a goal state andd is a semimetric onX . Objectarrangement A robot manipulates a set of objects on a desk. The state spaceX is the positions of the objects, alongside the robot state. The action spaceU includes the ability to move, close, and release the robot’s gripper. The reward encodes the user’s preferences for how the objects are arranged. For example, the user might want long objects to always be parallel to one of the desk edges. Alternatively, the user might want all objects packed as tightly as possible on one side of the desk. Drivingstyles In an autonomous car, the rewardr might encode the passenger’s preferences regarding speed, aggressiveness, smoothness of motion, avoiding freeways, and so on. Performance-efficiency tradeoffs Many robotics applications confront the system designer with a tradeoff between performance (time to complete a task, precision of tracking a trajectory, . . . ) and the amount of energy or other resources used. Changing the balance results in different reward functions. 2.7 Reinforcementlearning Reinforcement learning (RL) refers to the task of learning an optimal policy for an MDP—that is, solving the optimization problem (2.7) or (2.11)—without prior knowledge of the dynamicsP , the rewardr, or the initial state distribution µ . In the standard formulation of RL, we can only learn information about the MDP by interacting with it through the agent “interface” of Definition 2.5.1. That is, we are placed in the statex 0 , we take actions, observe the reward and next state, and the environment eventually resets in the finite horizon case. 30 On the other hand, computing an optimal policy when P , r, and µ are known is called solving the MDP in reference to the Bellman optimality equationT ⋆ Q = Q (§ 2.5.3), which can be solved directly using linear programming or value iteration in the finite-state case. In between there is a spectrum of interfaces with the MDP. For example, we may have an “oracle” or “generative model” to sampleP from any state and action instead of just the current state. In robotics the rewardr is often known and designed by the robotics engineer to achieve some practical goal. We may have the ability to reset the MDP to x 0 ∼ µ as desired. There are a great many algorithms for reinforcement learning. In this dissertation, we will only explore one family of algorithms in depth: the policy gradient family, as defined in § 2.7.2. We analyze a policy gradient method in Chapter 5. 2.7.1 Onandoff-policyalgorithms Reinforcement learning algorithms can be classified as either on-policy oroff-policy . On-policy algorithms can only make use of data that was generated by the current iterate of the policy being optimized. Off- policy algorithms can use data that was generated by a different behavior policy. Off-policy algorithms tend to be more complicated. The off-policy data is often, but not always, from an earlier iterate of the policy being optimized. Initializing the store of off-policy data with human (or other “expert”) demonstrations is a simple and powerful way to guide RL towards an optimal policy without the need for oracle access to the “expert” policy. 
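As a small illustration of seeding an off-policy learner with demonstrations (the class and variable names here are hypothetical stand-ins, not part of this dissertation's code), a replay buffer can simply be pre-filled with demonstration transitions before any agent-generated data arrives:

import random
from collections import deque

class ReplayBuffer:
    """Stores (x, u, r, x_next) transitions for an off-policy algorithm."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)

# Seed the buffer with expert demonstrations before training begins; transitions
# generated later by the current policy iterate are added to the same buffer.
buffer = ReplayBuffer()
expert_transitions = []   # in practice, loaded from recorded demonstrations
for transition in expert_transitions:
    buffer.add(transition)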
The off-policy data may also be generated by a strategic exploration method, whereas in on-policy algorithms the exploratory actions must be part of the policy. 31 2.7.2 Policygradientmethods Policy gradient methods are an important class of reinforcement learning algorithms built upon generic techniques for stochastic optimization. We first introduce the generic technique, and then discuss its properties when instantiated for the RL problem. 2.7.2.1 Log-derivativetrick LetX denote some measurable set and consider a parameterized family of probability distributions overX . We have a parameter spaceΘ . Eachθ ∈ Θ induces a distributionp θ ∈ ∆( X). Now additionally consider a measurable functionf :X 7→R and the optimization problem minimize θ ∈Θ E x∼ p θ f(x). It is natural to attempt to solve this optimization problem with gradient descent (introduced in § 2.4.4) over θ . However, without strong restrictions on the form of f and the p θ , it is generally impossible to obtain analytic derivatives for this objective. More importantly, there are many situations where we do not know the full description off, but only have a zeroth-order oracle access. This means we can query the value off(x) for anyx∈X , but we cannot compute the gradient∇f(x). (For example, evaluatingf involves interacting with the real world.) In such situations, we can apply the following trick. Theorem 2.7.1. If p θ is a parameterized family of probability distributions where the probability density exists and is differentiable with respect to the parameter θ , and other sufficient technical conditions are met (L’ecuyer, 1990, and references therein), then ∇ θ E x∼ p θ f(x) = E x∼ p θ [∇ θ logp θ (x)f(x)]. (2.12) 32 Proof. Assuming sufficient conditions hold to apply Leibniz’s rule, we have ∇ θ E x∼ p θ f(x) =∇ θ Z x∈X p θ (x)f(x)dx = Z x∈X ∇ θ p θ (x)f(x)dx = Z x∈X p θ (x) p θ (x) ∇ θ p θ (x)f(x)dx = Z x∈X p θ (x)∇ θ logp θ (x)f(x)dx = E x∼ p θ [∇ θ logp θ (x)f(x)]. (2.13) After applying Theorem 2.7.1, the gradient may then be approximated by a finite sample as ∇ θ E x∼ p θ f(x)≈ 1 N N X i=1 ∇ θ logp θ (x i )f(x i ), (2.14) where eachx i ∼ p θ . To compute this expression, we require the following: • Ability to sample fromp θ . • Ability to compute the log-density gradient∇ θ logp θ (x). • Zeroth-order oracle forf. and wedonot require the following: • Ability to sample from any other distribution overX . • First-order oracle for∇ x f. 33 2.7.2.2 Policygradientalgorithm Now we instantiate the log-derivative trick for the RL problem. Suppose our policy class Π is parame- terized: we have a parameter spaceΘ ⊆ R d and a joint parameter/state policyϖ :Θ ×X 7→ ∆( U). We further assume that for all ofΘ ×X the output ofϖ admits a Radon-Nikodym derivative (density function), denoted byϖ(·|x,θ ) : U 7→ R ≥ 0 , and thatϖ(·|x,θ ) is differentiable with respect to θ . Using trajectory notation (§2.5.2), let R(τ )= H− 1 X t=h r(x t ,u t ). Letπ θ denote the partial application ofϖ for the parameterθ . We can then state the RL objective as maximize θ E τ ∼ τ π θ [R(τ )]. (2.15) We may perform gradient descent on the objective (2.15) by applying Theorem 2.7.1, yielding ∇ θ E τ ∼ τ π θ [R(τ )]= E τ ∼ τ π θ [∇ θ logτ π θ (τ )R(τ )]. (2.16) TheR(τ ) term is obtained as a direct effect of sampling τ from the MDP. The first term expands to ∇ θ logτ π θ (τ )=∇ θ log µ (x 0 ) H− 1 Y t=0 π θ (u t |x t )P(x t+1 |x t ,u t ) ! =∇ θ logµ (x 0 )+ H− 1 X t=0 logπ θ (u t |x t )+logP(x t+1 |x t ,u t ) ! 
=∇ θ logµ (x 0 )+ H− 1 X t=0 ∇ θ logπ θ (u t |x t )+∇ θ logP(x t+1 |x t ,u t ) = H− 1 X t=0 ∇ θ logπ θ (u t |x t ). (2.17) Critically, the initial state distribution µ and the transition dynamics P do not appear in the final line. This means it is possible to approximate the gradient of the RL objective (2.15) only by interacting with 34 the MDP usingπ θ and evaluating∇ θ logπ θ (·|· ) on the sampled actions. Putting it all together, we see the typical form known as REINFORCE (Williams, 1992): E[ˆ g]=∇ θ E τ ∼ τ π θ [R(τ )], ˆ g = H− 1 X t=0 r t ! H− 1 X t=0 ∇ θ logπ θ (u t |x t ) ! . (2.18) This can be approximated by a finite sample of interactions with the MDP and used to perform gradient descent. In general the RL objective is not convex, so the convergence rate guarantees for gradient descent presented in §2.4.4 do not apply. As we can observe from the sampling distribution in (2.18), policy gradient methods are on-policy algorithms (§2.7.1). The gradient estimatorˆ g is unbiased, but in practice it can have very high variance. If a sampleˆ g that is far fromE[ˆ g] is used in gradient descent, it can potentially cause a drastic and undesirable change inπ θ , and thus also inτ π θ . Various tricks have been proposed to reduce variance, for example subtracting an action-independent “baseline”, also known as a control variate (Greensmith et al., 2004), and using a trust region (Schulman et al., 2015). Another important limitation of policy gradient algorithms is that they depend on the stochasticity of the policyπ θ to explore the MDP. It is critical thatθ be initialized to a policy with high action entropy and that the action distribution does not collapse. One popular method to (heuristically) achieve the latter is entropy regularization, where a bonus is added to the reward that encourages high-entropy conditional action distributions and/or state visitation distributions. Entropy regularization can also yield policies more robust to changes in the MDP dynamics (Eysenbach and Levine, 2021). † By contrast, off-policy RL algorithms and can use exploration procedures that are independent of the current policy iterate. Exploration based on the principle of optimism in the face of uncertainty is especially powerful in off-policy algorithms, where it can deliver sample complexity bounds unavailable under naive † The discussion of Eysenbach and Levine (2021) also provides a good survey of other motivations and results for entropy regularization. 35 (ϵ -greedy) exploration (Jin et al., 2018). Recently, optimism in the face of uncertainty has been adapted to policy gradient methods (Agarwal et al., 2020). 2.8 Controltheoryparadigms Traditionally, the field of control theory is concerned with MDP-like settings in which the state space X is (some subset of)R n and a reasonable model of the transition dynamics P is available. The model is usually either derived from physics or estimated from interacting with the real system. In the case of physics-derived models there are often parameters that must be supplied, for example spring constants, masses, and so on. However, these parameters can often be identified using scientific experiments “outside” the MDP. For example, we may remove a spring from an assembly and measure its spring constant directly with a weight scale and a ruler. 2.8.1 Systemidentification The process of estimating a model or parameters from interacting with the real system (i.e. through the MDP “interface”) is known as system identification . 
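As a minimal concrete instance of this task, the following sketch assumes (purely for illustration; the discussion that follows does not require it) that the system is a noisy linear one, x_{t+1} = Ax_t + Bu_t + w_t, and recovers the unknown matrices from a single recorded state-action sequence by least squares. The random excitation inputs keep the regression well conditioned.

import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 2, 200

# Ground-truth system, hidden from the estimator.
A_true = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B_true = rng.standard_normal((n, m))

# Roll out with random (exciting) inputs and record the state-action trajectory.
X = np.zeros((T + 1, n)); U = rng.standard_normal((T, m))
for t in range(T):
    X[t + 1] = A_true @ X[t] + B_true @ U[t] + 0.01 * rng.standard_normal(n)

# Least-squares fit: stack [x_t, u_t] and regress x_{t+1} onto it.
Z = np.hstack([X[:-1], U])                       # shape (T, n + m)
Theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = Theta[:n].T, Theta[n:].T
print(np.linalg.norm(A_hat - A_true), np.linalg.norm(B_hat - B_true))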
There are two components to system identification: some way to estimate ϕ from a state-action sequence x 0:T ,u 0:T− 1 , and some scheme for choosing the inputsu 0:T− 1 to make the estimation problem easy (or possible at all). One way to formalize the system identification problem is as maximum-likelihood estimation (MLE) problem, where our goal is to select theϕ under which the observed sequence was most probable: maximize ϕ ∈Φ Pr(x 1:T |u 0:T− 1 ;x 0 ,ϕ ). (2.19) 36 This is equivalent to maximizing the logarithm of the likelihood. Taking the logarithm and using the Markov assumption, we have logPr(x 0:T |u 0:T− 1 ;ϕ )=logµ (x 0 |ϕ )+ T X t=1 logP(x t |x t− 1 ,u t− 1 ;ϕ ). Assuming the whole-family dynamics densityP(·|x,u;ϕ ) is known and is differentiable with respect to ϕ , the maximum-likelihood objective (2.19) is also differentiable with respect to ϕ , so the MLE problem is amenable to numerical optimization. On the other hand, ifP(·|x,u;ϕ ) is unknown or is only accessible via black-box sampling (i.e. a non-differentiable simulator), the problem (2.19) is no longer easy to optimize. An alternative is to simply learn a function that maps(x 0:T ,u 0:T− 1 ) directly to an estimate ofϕ . The field of deep learning (§2.11) provides powerful function classes such as recurrent and convolutional neural networks that are well-suited to representing such a function. Our work in Chapter 3 implements this idea. 2.8.1.1 Persistenceofexcitation Persistence of excitation refers to a sufficient condition on input signals to ensure that the solution of the system identification problem (2.19) converges to the true ϕ in the limit of infinite data. For linear systems, it is possible that inputs from a stabilizing linear feedback controller can fail to be persistently exciting. In other words, the goals of 1) minimizing some optimality criterion and 2) correctly identifyingϕ may be in conflict with each other. In this dissertation we are not concerned with the precise definition of persistent excitation for linear systems, but the overall concept is a key motivation for the methods we propose in Chapter 3. 2.8.2 Controlwithknownmodel Many published results in control theory provide guarantees under the assumption that an error-free dynamics model is available. In general, there is no reason to believe that these guarantees will be preserved 37 if the controller is deployed on a system different from the model. In certain cases it is safe, such as the fully observable LQR problem (§2.9.1.6) with uncertainty in the input matrix (Safonov and Athans, 1977). On the other hand, introducing partial observability to this same setting can be catastrophic: the famous note of Doyle (1978) gave an example of a system where an arbitrarily small model error renders a nominally optimal (controller, observer) pair unable to even stabilize the system. Recently, this fragility has appeared in complex nonlinear settings as the so-called “sim-to-real” problem in reinforcement learning, which we discuss in Chapter 3. 2.8.3 Robustcontrol The robust control paradigm mainly concerns scenarios in which we know that the true MDP ϕ belongs to some small set Φ , for example some small neighborhood of uncertainty about a nominal system. We seek a policy π : X 7→ U that is in some sense adequate for all ϕ ∈ Φ simultaneously. Traditionally the sense of “adequate” is based on stabilization or system norms. 
Certain models of uncertainty in the dynamics parameters are also equivalent to robustness against disturbance input signals (Dullerud and Paganini, 2000). In the case of linear dynamical systems, robust control is a mathematically deep topic with connections to functional analysis and convex optimization. Robust control is also highly focused on partially observable settings. Robust methods often consider the full closed loop of state estimator and controller, rather than studying each in isolation. Historically, robust control developed in response to issues of the type observed by Doyle (1978), as mentioned previously. 2.8.4 Gainscheduling Gain scheduling describes a technique in which some auxiliary information to identify ϕ is available. A classic example is an airplane autopilot, where the altitude and airspeed strongly affect the ability of the aerodynamic control surfaces to exert moments about the airplane’s rotational axes. The altitude and 38 airspeed can be directly measured with sensors. Therefore, a controller is designed in which the attitude feedback gains depend on the altitude and airspeed sensor inputs (Åström and Wittenmark, 2013). The term “gain” should be understood to mean “control policy” rather than anything more specific like gains for a PID controller. 2.8.5 Adaptivecontrol In its broadest interpretation, the termadaptivecontroller refers to any controller that changes its behavior based on the (possibly unknown) value ofϕ while it is running. However, in common usage it is restricted to cases whereϕ cannot be directly measured, thus excluding the gain-scheduling controllers. In particular, the so-called self-tuning regulator directly attempts to estimate ϕ using system identification techniques (§2.8.1), and then uses the estimate ofϕ to synthesize an optimal controller. The self-tuning regulator paradigm is quite broad, but in practice it most commonly refers to cases where: • Structural knowledge of Φ can be exploited in the process of identifyingϕ . For example, in linear dynamical systems, recursive least-squares estimation is especially efficient. • Synthesizing an optimal policy and/or selecting an optimal action conditioned on the estimate ofϕ can be done in real time. In Chapter 3 of this dissertation, we present a method for synthesizing adaptive controllers in systems that violate both of these assumptions. 2.8.6 Model-predictivecontrol Model-predictive control (MPC) is a class of controllers based on solving optimization problems quickly in a real-time loop, rather than storing the control policy as some sort of closed-form function. MPC is 39 not a subset or disjoint of any of the categories listed above. Known-model, robust, gain-scheduled, and adaptive versions of MPC all exist. Suppose a deterministic discrete-time dynamics modelx ′ = f(x,u) is known and the current system state isx t . Model-predictive control poses the optimization problem minimize ut,...,u t+K− 1 K− 1 X τ =0 ℓ τ (x t+τ ,u t+τ )+ℓ R (u t:t+K− 1 ) (2.20) subjectto x t+τ +1 =f(x t+τ ,u t+τ )∀τ ∈{0,...,K− 1} (2.21) u t+τ ∈U ∀τ ∈{0,...,K− 1}, (2.22) where eachℓ τ is a task-determined loss (for example, tracking a goal trajectory inx), the termℓ R imposes regularization (for example, penalizing large changes in consecutive values ofu), and the horizonK is a user-chosen hyperparameter. This is a simple MPC formulation. It is also possible to add state constraints and more complex constraints on the set of admissible input sequences. 
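For concreteness, the following sketch instantiates problem (2.20)–(2.22) for known linear dynamics x' = Ax + Bu, quadratic tracking losses, a quadratic input-difference regularizer, and box input constraints, using the cvxpy modeling library. The particular matrices, horizon, and bounds are illustrative assumptions, not values from this dissertation.

import cvxpy as cp
import numpy as np

# Illustrative double-integrator model and MPC settings.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K, u_max = 20, 1.0
Q, R = np.diag([1.0, 0.1]), 0.01
x_t = np.array([2.0, 0.0])                  # current state

x = cp.Variable((K + 1, 2))                 # predicted states x_t, ..., x_{t+K}
u = cp.Variable((K, 1))                     # inputs u_t, ..., u_{t+K-1}

cost = sum(cp.quad_form(x[tau], Q) + R * cp.sum_squares(u[tau]) for tau in range(K))
cost += cp.sum_squares(cp.diff(u, axis=0))  # regularizer l_R: penalize input changes
constraints = [x[0] == x_t]
constraints += [x[tau + 1] == A @ x[tau] + B @ u[tau] for tau in range(K)]
constraints += [cp.abs(u) <= u_max]

cp.Problem(cp.Minimize(cost), constraints).solve()
print("first input of the optimal sequence:", u.value[0])

Because the dynamics constraints are affine and the losses convex, this is a convex program, as discussed for the linear-MPC special case below.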
2.8.6.1 Recedinghorizon Solving the optimization problem (2.20) yields a sequence of inputs u ⋆ t ,...,u ⋆ t+K− 1 . Typically, only the first input u ⋆ t is actually supplied to the system. Then the sensors and estimation system produce the actual value ofx t+1 , which may not be equal tof(x t ,u ⋆ t ) due to modeling error and/or disturbances. We then immediately solve another MPC problem starting from the true x t+1 , which is used to decide the input u t+1 . This pattern gives MPC a favorable structure for iterative optimization methods. If the optimization algorithm requires an initial guess, then the guess u ⋆ t+1 ,...,u ⋆ t+K− 1 ,u ⋆ t+K− 1 40 constructed by shifting the previous solution and duplicating the last input is often a nearly optimal for the new optimization problem. Therefore, the optimization algorithm may require far fewer iterations than it would require if starting from scratch, for example with allu i =0. 2.8.6.2 LinearMPC An important special case of MPC is linear dynamics (§2.9) with convex lossesℓ t and convex regularization ℓ R where eachU τ is a convex set. The linear dynamics imply thatx t+τ is an affine function of u t:t+τ − 1 . Therefore, due to properties of convex functions (§ 2.4.1), the problem (2.20) is a convex optimization problem. If theℓ t andℓ R are quadratic, then the problem is similar to a linear-quadratic regulator (§2.9.1.6), but MPC can easily handle convex constraints on both actions and states, which cannot fit into the standard LQR framework. 2.9 Lineardynamicalsystemsandcontrol In this section we collect the background material on linear control theory needed to state our results. We only considertime-invariant linear systems, where the coefficient matrices of the recurrences or differential equations are constant with respect to time. For brevity, we will take the term “linear system” and the acronym “LTI system” to mean “linear time-invariant system”. Linear dynamical systems are defined for both discrete and continuous time. We give a brief tour of discrete-time linear control first. We will follow with an abbreviated discussion of continuous-time linear control. 41 2.9.1 Discretetime 2.9.1.1 Autonomoussystem A discrete-time deterministic linear dynamical system follows the recurrence x t+1 =Ax t (2.23) forx t ∈R n , A∈R n× n . 2.9.1.2 Stability Perhaps the most important concept for linear dynamical systems is that of stability. For the purposes of this overview, a discrete-time linear dynamical system is stable if lim t→∞ ∥x t ∥=0 for all initial states x 0 under some norm∥·∥ . Far more general definitions of stability exist for broader classes of systems (Hinrichsen and Pritchard, 2005). The state at timet is given by x t =A t x 0 . Recalling that ρ (A) = max{|λ | : λ ∈ Λ( A)}, the spectrum of A controls how quickly the states decay towards zero: Theorem 2.9.1. (Hinrichsen and Pritchard, 2005, Lemma 3.3.19) If ρ (A) < e ω , then there exists M > 0, depending onω, such that A t 2,2 ≤ Me ωt , t∈N. 42 Ifρ (A)<1 thenω <0, so we see A t 2,2 goes to zero. This condition is necessary as well as sufficient: ifλ,ν is an eigenvalue/eigenvector pair ofA and|λ |≥ 1, then A t ν = λ t ν =λ t ∥ν ∦=0. Therefore, the discrete-time linear dynamical system (2.23) is stable if and only ifρ (A)<1. A matrix withρ (A)<1 is called Schur (not to be confused with the well-known Schur complement of a block matrix). 2.9.1.3 Linearcontrolsystems A linear control system has inputs in addition to state. 
A discrete-time linear control system follows the recurrence x t+1 =Ax t +Bu t +w t , (2.24) whereA∈R n× n andB ∈R n× m are arbitrary matrices andw t ∈R n is a disturbance from the environ- ment. In the deterministic case, we havew t =0 for allt. In the stochastic case, we assumew t is sampled i.i.d. in time according to some probability distribution. Often the distribution is Gaussian:w t ∼N (0,Σ x ) for someΣ x ⪰ 0. Ifw t is allowed to be adversarial, sophisticated analysis beyond the scope of this disser- tation is required. 2.9.1.4 Controllability It is clear from the trivial example B = 0 that not all linear control systems can be driven to arbitrary states. This is formalized by the notion of controllability. Definition 2.9.2. The discrete-time linear control system (2.24) is controllable if, for any initial state x 0 ∈R n and goal state x g ∈ R n , there exists H ∈ N and an input sequence (u 0 ,...,u H− 1 ) ∈ (R m ) H such thatx H =x g . 43 Controllability rank conditions Controllability of discrete-time systems can be reduced to simple linear-algebraic questions about the matricesA andB. To see this, we now expand the expression forx H and introduce the notation x H =A H x 0 +A H− 1 Bu 0 +··· +Bu H− 1 ≜A H x 0 + h A H− 1 B A H− 2 B ··· AB B i | {z } L c (H) h u ⊤ 0 u ⊤ 1 ... u ⊤ H− 2 u H− 1 i ⊤ | {z } u 0:H− 1 . From this, we can see that a system can be driven to an arbitrary goalx g inH timesteps if and only if the system of linear equations L c (H)u 0:H− 1 =x g − A H x 0 has a solution in the variables u 0:H− 1 ∈ R mH . Since x g is arbitrary, this is true for all x g and x 0 if and only ifrank(L c (H))=n. It can be shown that ifrank(L c (n))<n, thenrank(L c (H))<n for allH >n, sorank(L c (n))=n is a necessary and sufficient condition for controllability. The matrix L c (n) is known as the controllability matrix. For simplicity, we use the notationL c =L c (n). Since the controllability matrixL c is “wide” in general, it is sometimes convenient to instead consider the matrix W c =L c L ⊤ c = H− 1 X i=0 A i BB ⊤ (A ⊤ ) i , known as thecontrollabilityGramian. Becauserank(W c )=rank(L c ), the conditionrank(W c )=n is also necessary and sufficient for controllability. Note that these rank conditions are often numerically unstable to compute in practice, and more appropriate alternative tests exist. 44 2.9.1.5 Stabilizingcontrollers If we use a linear control policyu t =Kx t for someK∈R m× n , then the system dynamics become x t+1 =Ax t +BKx t +w t . In the case wherew t =0, the closed-loop system has the linear dynamics x t+1 =(A+BK)x t . Therefore, our results on stability of linear systems apply: If ρ + (A + BK) < 1, then our control pol- icy stabilizes the system. It can be shown that such a stabilizing K exists if and only if the system is controllable. 2.9.1.6 Linearquadraticregulator(LQR) If we impose a quadratic cost on a linear dynamical system, we obtain the heavily-studiedlinearquadratic regulator (LQR) problem setup. More specifically, we specify the cost J = H X t=0 c(x t ,u t )≜ H X t=0 x ⊤ t Qx t +u ⊤ t Ru t , (2.25) whereQ ⪰ 0 andR ≻ 0. We sometimes use the notationc t = c(x t ,u t ). The horizonH may be either finite or infinite. It is most natural to think of (2.25) as a penalty on being far from the zero state (first term) and expending control energy (second term). To adapt LQR problems to the standard reward-based MDP formulation, we setr(x,u)=− c(x,u). ‡ ‡ Many results in reinforcement learning theory depend on boundedness of the reward. 
These results cannot be directly applied to the LQR setting.

Optimal controller for LQR. It can be shown (Lancaster and Rodman, 1995) that, for the infinite-horizon case, the policy that minimizes Equation (2.25) is linear feedback of the form u_t = K x_t, which leads to quadratic total cost, i.e. J = x_0^⊤ P x_0 for some P ⪰ 0, where P is the unique maximal positive semidefinite solution to the discrete-time algebraic Riccati equation (DARE):

P = A^⊤ P A − A^⊤ P B (R + B^⊤ P B)^{−1} B^⊤ P A + Q.

The optimal controller is then

K = −(R + B^⊤ P B)^{−1} B^⊤ P A.

We emphasize that this controller is optimal over the class of all feedback controllers, not only the linear feedback controllers.

2.9.1.7 Outputs and State Estimation

In many real-world dynamical systems it is not possible to measure every dimension of the state x. For example, many small unmanned aircraft contain an inertial measurement unit (IMU) capable of measuring acceleration and gravity forces and a global positioning system (GPS) receiver capable of measuring position, but no instrument capable of measuring velocity. For linear systems, we consider a sensor or set of sensors that measure some linear mapping of the state

y = C x

for C ∈ R^{p×n}.

2.9.1.8 Observability

Analogous to the notion of controllability, an LTI system must satisfy certain conditions to ensure that it is possible to estimate x from a history of inputs u and outputs y. For a generic (possibly nonlinear) discrete-time system, we define observability by the condition that there exists H ∈ N and an input sequence (u_0, ..., u_{H−1}) such that it is possible to determine the initial state x_0 from the known inputs and the observations y_0, ..., y_{H−1}. For an LTI system, expanding the expressions for the observations yields

\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{H-1} \end{bmatrix}
=
\begin{bmatrix}
C & & & \\
CA & CB & & \\
CA^2 & CAB & CB & \\
\vdots & \vdots & & \ddots \\
CA^{H-1} & CA^{H-2}B & \cdots & CB
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \\ u_1 \\ \vdots \\ u_{H-2} \end{bmatrix}
=
\underbrace{\begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{H-1} \end{bmatrix}}_{L_o(H)} x_0 + \text{const.}
\tag{2.26}
\]

Therefore, determining x_0 reduces to solving a linear system. We see that for LTI systems, the feasibility of determining x_0 does not depend on the chosen inputs at all. (This is not true in general for nonlinear systems.) Instead, it depends only on the coefficient matrix L_o(H). Although this system will generally be overdetermined, in a perfectly modeled and noiseless system one can show that it will always have an exact solution if and only if rank(L_o(H)) = n for sufficiently large H. As in the controllability case, we can show that this must hold for some H ≤ n, so the matrix L_o ≜ L_o(n) is called the observability matrix.

2.9.1.9 Luenberger observer

Suppose we wish to design an online recursive estimator of the current state x_t. Let x̂_t denote the current estimate of x_t. A controller (outside our influence) supplies an input u_t, and we observe y_t = C x_t. It is reasonable to impose that our update for x̂_{t+1} takes the form

x̂_{t+1} = A x̂_t + B u_t + f(y_t, x̂_t).

Now we additionally impose that f(y_t, x̂_t) = L(C x̂_t − y_t); that is, we update the estimate with a linear function of the "residual" C x̂_t − y_t, the difference between the measurement predicted by our estimate and the true measurement. We then examine the dynamics of the estimate error e_t = x̂_t − x_t:

\[
\begin{aligned}
e_{t+1} &= \hat{x}_{t+1} - x_{t+1} \\
&= A\hat{x}_t + Bu_t + L(C\hat{x}_t - y_t) - (Ax_t + Bu_t) \\
&= A(\hat{x}_t - x_t) + L(C\hat{x}_t - Cx_t) \\
&= (A + LC)\,e_t.
\end{aligned}
\tag{2.27}
\]

The error dynamics are linear, so the stability condition for linear systems implies that, if ρ_+(A + LC) < 1, then e_t → 0 as t → ∞.
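As a numerical companion to the DARE solution of §2.9.1.6 above (the system matrices below are illustrative, and scipy is assumed to be available), the optimal gain can be computed directly and the closed loop checked for stability via its spectral radius:

import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative open-loop unstable system with quadratic costs.
A = np.array([[1.1, 0.2], [0.0, 0.98]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[0.1]])

# P solves the DARE; K is the optimal LQR feedback u_t = K x_t.
P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

closed_loop = A + B @ K
print("spectral radius of A + BK:", max(abs(np.linalg.eigvals(closed_loop))))  # expected < 1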
We note thatρ + (A+LC)=ρ + (A ⊤ +C ⊤ L ⊤ ), matching exactly the formA+BK we encounter when describing stabilizing controllers (§ 2.9.1.5). Therefore, linear estimator synthesis is mathematically equivalent to linear control synthesis, with(A ⊤ ,C ⊤ ) playing the same role as(A,B). 2.9.1.10 Kalmanfilter For control synthesis, we discussed both arbitrary stabilizing controllers and optimal controllers according to the LQR cost criterion. We might ask if a similar notion of an optimal observer exists for state estimation. 48 The LQR quadratic cost expressed a tradeoff between regulating the state to zero and consuming energy with control inputs. For state estimation, a quadratic cost would also capture the goal of driving the estimate error to zero, but imposing a cost on the magnitude of the updateL(y t − Cˆ x t ) does not have a clear meaning. It turns out that a useful “cost” will arise if we consider the stochastic case, where the dynamics are given by x t+1 =Ax t +Bu t +w t , y t =Cx t +v t , where the dynamics noise w t and sensor noise v t are both zero-mean Gaussian random variables with covariancesΣ x andΣ y respectively. Whereas the Luenberger observer only maintained an estimateˆ x of the state, in the stochastic case we will maintain a Gaussian belief distribution: x t ∼N (ˆ x t ,P t ). From the properties of Gaussian distributions, after supplying the inputu t , the belief distribution should change according to ˆ x ′ t =Aˆ x t +Bu t , P ′ t =AP t A ⊤ +Σ x . This is known as the propagation step. The key question is how to update the belief distribution after observing the outputy t . This can be treated as a Bayesian inference problem where the current belief is the prior, the measurementy t is the evidence, and the updated belief distribution is the posterior. Instead of the common maximum a priori (MAP) update, we will seek an update that minimizes the trace of the updated belief covariance. Deriving the update from probabilistic principles without making the assumption that 49 we updateµ with a linear function of the measurement residual—as we did for the Luenberger observer—is complex. It is beyond the scope of this dissertation to derive the update, but it takes the form e t =y t − Cˆ x ′ t− 1 S t =CP ′ t− 1 C ⊤ +Σ y K t =P ′ t− 1 C ⊤ S − 1 t ˆ x t = ˆ x ′ t− 1 +K t e t P t =(I− K t C)P ′ t− 1 , wheree t is the measurement residual,S t is the covariance ofe t according to the current belief distribution, K t is the so-called Kalman gain, and(ˆ x t ,P t ) are the updated belief distribution parameters. This is called the update step. Although we have coupled the propagation and update steps notationally, they need not be coupled. For instance, it is common in practice to have several propagation steps per update step. 2.9.2 Continuoustime All of the discrete-time definitions and theorems stated above have analogues for continuous-time systems. We quickly review these analogues here. In the remainder of this dissertation we will never need to refer to a discrete-time and continuous-time linear control system simultaneously, so we overload all notation and rely on context to disambiguate. 2.9.2.1 Autonomoussystem A continuous-time deterministic linear dynamical system follows the ordinary differential equation (ODE) ˙ x(t)=Ax(t) (2.28) 50 fort∈[0,∞), x(t)∈R n , A∈R n× n . 2.9.2.2 Stability Again stability is defined by lim t→∞ ∥x(t)∥=0 for all initial statesx(0). The state at timet is given by x t =e tA x 0 , where the matrix exponential is defined by the usual power series. 
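A quick numerical check of this formula (a sketch only; the matrix below is illustrative, and scipy.linalg.expm is assumed to be available for the matrix exponential):

import numpy as np
from scipy.linalg import expm

# A has eigenvalues -1 and -0.5, so x(t) should decay to zero.
A = np.array([[-1.0, 2.0], [0.0, -0.5]])
x0 = np.array([1.0, 1.0])

for t in [0.0, 1.0, 5.0, 20.0]:
    x_t = expm(t * A) @ x0           # x(t) = e^{tA} x(0)
    print(t, np.linalg.norm(x_t))    # the norm shrinks as t grows

print("eigenvalue real parts:", np.linalg.eigvals(A).real)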
For continuous-time LTI systems ˙ x=Ax, we have that Theorem 2.9.3. (Hinrichsen and Pritchard, 2005, Lemma 3.3.19) If ρ + (A) < ω, then there exists M > 0 such that lim t→∞ e tA 2,2 ≤ Me ωt , t∈R ≥ 0 . By a similar argument as the discrete case using an eigenvalue-eigenvector pair, the system ˙ x = Ax cannot be stable ifρ + (A)≥ 0. Therefore, the continuous-time linear dynamical system (2.28) is stable if and only ifρ + (A)<0. A matrix withρ + (A)<0 is called Hurwitz. 2.9.2.3 Linearcontrolsystems A continuous-time deterministic linear control system has the state and input spacesX =R n , U =R m , and follows the ODE ˙ x(t)=Ax(t)+Bu(t), (2.29) 51 whereA∈R n× n andB∈R n× m are arbitrary matrices. (We will not consider continuous-time stochastic systems in this dissertation.) 2.9.2.4 Controllability The continuous-time linear control system (2.29) is controllable if, for any initial statex 0 ∈R n , goal state x g ∈ R n , and time horizonT > 0, there exists a continuous functionu(t) : [0,T)7→ R m for which the system reaches statex g at timeT . It turns out that the exact same definition and rank condition of the controllability matrix as in the discrete-time case (§2.9.1.4) imply controllability in the continuous case. We will not prove this here. The controllability Gramian is defined differently, but it also has full-rank controllability condition. 2.9.2.5 Linear-quadraticregulator The infinite-horizon continuous-time linear quadratic regulator (LQR) problem is specified by the cost from a particular starting statex(0)∈R n by J x(0) = Z ∞ 0 h x(t) ⊤ Qx(t)+u(t) ⊤ Ru(t) i dt, (2.30) whereQ⪰ 0 andR≻ 0. Again, it can be shown (Lancaster and Rodman, 1995) that the optimal controller is linear feedback of the form u(t) = Kx(t), which leads to the optimal total cost taking the quadratic form J x(0) = x(0) ⊤ Px(0), where P ⪰ 0 is the unique maximal positive semidefinite solution to the continuous-time algebraic Riccati equation (CARE): A ⊤ P +PA− PBR − 1 B ⊤ P +Q=0. (2.31) 52 The optimal controller is then K ⋆ =− R − 1 B ⊤ P. (2.32) 2.9.3 Canonicalforms The Laplace transform, its discrete-time counterpart the z-transform, and transfer functions are an im- portant tool for frequency-domain analysis of LTI systems. It is beyond the scope of this dissertation to introduce them. In this section we briefly summarize one tool from frequency-domain analysis that we will use to synthesize a state-space system with specified open-loop eigenvalues. From the perspective of frequency-domain analysis, a single-input, single-output (SISO) control sys- tem is completely characterized by the poles and zeros of its transfer function. However, the mapping from state-space systems to transfer functions is many-to-one. Given a transfer function, one of several particularly useful state-space realizations is the controllable canonical form or reachable canonical form (Åström and Murray, 2010). If the transfer function is given by b 0 s n +b 1 s n− 1 +··· +b n− 1 s+b n s n +a 1 s n− 1 +··· +a n− 1 s+a n , then the controllable canonical form (CCF) realization is given by A= 0 1 . . . . . . 0 1 − a n ··· − a 2 − a 1 , B = 0 . . . 0 1 , C = h b n − a n b 0 ,b n− 1 − a n− 1 b 0 ,...,b 1 − a 1 b 0 i . To realize a state-space system for a particular set of open-loop eigenvaluesλ 1 ,...,λ n , we compute the denominator coefficients a 1:n by expanding the characteristic polynomial(λ 1 − x)(λ 2 − x)··· (λ n − x). 53 2.9.4 Poleplacement Pole placement is a non-optimal control method. 
A pole placement algorithmP takes the dynamics matrices A and B and a set of desired closed-loop eigenvalues λ 1 ,...,λ n ∈ C, and returns a matrix K such that Λ( A +BK) = {λ 1 ,...,λ n }. If the system (A,B) is controllable, then such a K always exists (Sontag, 2013). It is beyond the scope of this dissertation to discuss algorithms for computing pole placement, but many exist. 2.10 Statisticallearning Statistical learning is an umbrella term for machine learning settings in which the learner observes a complete dataset in one instant, as opposed to in some online or interactive protocol. 2.10.1 Generalstatisticallearningproblem In the most general statistical learning setting we have some abstract data spaceZ, an abstract function classF, a loss function ℓ : F × Z 7→ R, and an unknown distributionD ∈ ∆( Z). Our ideal goal is to solve the optimization problem minimize f∈F L(f)≜ E z∼D [ℓ(f,z)]. (2.33) However, we are only given a data set of N items D = (z 1 ,...,z N ) ∼ D N , that is, each z i is sampled i.i.d. fromD. A natural learning algorithm to consider is solving the empirical risk minimization (ERM) problem: minimize f∈F N X i=1 ℓ(f,z i ). (2.34) 54 The mathematically rich field of statistical learning theory tells us that for certain forms of ℓ, such as binary classification and least-squares regression, a function class F is learnable if and only if it is learnable by ERM. However, this is not true for the fully general setting (Shalev-Shwartz et al., 2010). In benign settings the learning rate guarantee usually takes the form sup D∈∆( Z) E D∼D N [L(f ERM (D))]− inf f∈F L(f) ≤ O 1 √ N In other words, as the sample size goes to infinity, there is no distribution D that can “trick” the ERM algorithm. The rate1/ √ N is typical. 2.10.2 Supervisedlearning A particularly important class of statistical learning problems is supervised learning, in which we have an input spaceX, an output spaceY ; we haveZ =X× Y ; our function class is belongs toF ⊆ Y X ; and our loss function takes the form ℓ(f,(x,y))=ℓ s (f(x),y), in which ℓ s : Y × Y 7→ R is some kind of “comparison” function, often satisfying the properties of a semimetric (§2.2.1). One common example is least-squares regression, whereY =R d and we have ℓ(f,(x,y))=∥f(x)− y∥ 2 2 . (2.35) Another common example classification, where Y ={− 1,+1} (or any set of two distinct elements) and we have ℓ(f,(x,y))=I [f(x)=y] . (2.36) 55 A fundamental difference between the least-squares loss (2.35) and the classification loss (2.36) is that only the former is differentiable. The non-differentiability of the classification loss requires extra care in both algorithm design and analysis. 2.10.3 Gradient-basedoptimization In this work, we are only concerned with regression problems. The differentiability of the least-squares loss (2.35) lends favorable structure to the problem. Now suppose our function classF is parameterized: we have a parameter spaceΘ ⊆ R n and a functionF : Θ × X 7→ Y . We denote the partial application with respect toθ by f θ (x)=F(θ,x ) and letF ={f θ : θ ∈ Θ }. Ifℓ s is differentiable with respect to its first argument and F is differentiable with respect toθ , then the resulting empirical risk minimization objective L ERM (θ )= X (x,y)∈D ℓ s (f θ (x),y) (2.37) is differentiable with respect to θ . This allows us to attack the supervised learning ERM problem with the tools of continuous optimization. 
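To make this concrete, the following sketch (an illustration; the linear model and synthetic dataset are assumptions for the example) instantiates the ERM objective (2.37) with the least-squares loss (2.35) for a linear predictor f_θ(x) = θ^⊤x, and verifies that the gradient vanishes at the ERM solution:

import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5

# Synthetic regression dataset D = {(x_i, y_i)} with a hidden linear ground truth.
X = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(N)

def L_erm(theta):
    """Empirical risk (2.37) with the least-squares loss (2.35)."""
    residual = X @ theta - y
    return np.sum(residual ** 2)

def grad_L_erm(theta):
    """Gradient of the empirical risk with respect to theta."""
    return 2.0 * X.T @ (X @ theta - y)

# At the ERM minimizer (here computable in closed form), the gradient is ~ 0;
# for general parameterized models one would instead run (stochastic) gradient descent.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(L_erm(theta_star), np.linalg.norm(grad_L_erm(theta_star)))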
In particular, this problem has the following characteristics: • When the parameter spaceΘ is high-dimensional, second order optimization algorithms are com- putationally difficult (Nocedal and Wright, 2006). Second-order algorithms are those that require solving a linear system of the formHw = b forw, whereH =∇ θθ L ERM is the Hessian ofL ERM at some point in Θ . The canonical example is applying Newton’s method to the system of equations ∇ θ L ERM =0. • When the data set sizeN is large, computing the gradient∇ θ L ERM or even simply evaluatingL ERM can be computationally expensive. 56 • The parameter space is often unconstrained, i.e. Θ= R n . These characteristics favor the optimization method of stochastic gradient descent. In its simplest form, stochastic gradient descent is based on the observation that E (x,y)∼ Uniform(D) [∇ θ ℓ s (f θ (x),y)]=∇ θ L ERM (θ ), which follows from linearity of expectation. This suggests we follow Algorithm 1. Algorithm1 Stochastic gradient descent for statistical learning Require: DatasetD, learning rateη > 0, batch size1≤ K≪ N, initial guessθ 1: repeat 2: Sample(x 1 ,y 1 ),...,(x K ,y K )∼ Uniform(D) K . 3: Compute ˆ ∇ θ L ERM = 1 K K X i=1 ∇ θ ℓ s (f θ (x i ),y i ). 4: Updateθ ← θ − η ˆ ∇ θ L ERM . 5: until stopping criteria met. In Algorithm 1, the batch sizeK balances the variance of the gradient estimate ˆ ∇ θ L ERM against compu- tational cost. In practice, a sizeK >1 is almost always used to take advantage of parallel processing. The uniform sample in line 2 of Algorithm 1 is sometimes replaced with another minibatch selection method. For example, we might randomly permute the dataset and then sample each length-K chunk of the per- mutation in order to improve memory locality. (One complete pass through the dataset in this manner is sometimes called anepoch.) It is also common to apply the accelerated gradient descent methods discussed in §2.4.4 to stochastic gradient descent. 2.11 Neuralnetworks The nameNeuralnetwork is a broad descriptor for function approximation classes built up from recursive composition of linear maps alternating with simple, usually elementwise, nonlinear maps. The latter is often called anonlinearity for short. The pair of a linear map followed by a nonlinearity is referred to as a 57 layer. This structure is loosely inspired by structures observed in animal brains. Usually the only learnable parameters are those of the linear maps in each layer. Neural networks have a long history in machine learning (Rosenblatt, 1958), but have recently grown in prominence. Their growth has been influenced by several factors coming together: • Graphics processing units (GPUs) growing from fixed-function devices into general-purpose mas- sively parallel computers. • Software libraries for reverse-mode automatic differentiation, also known as backpropagation, en- abling loss gradient evaluation for arbitrarily complex functions. • Special-purpose neural network architectures designed to exploit structural properties of high- dimensional data such as images and sequences. • Huge datasets of user-generated content from the Internet e.g. ImageNet (Deng et al., 2009). Neural networks have become the de facto parametric nonlinear function class for many applications. 
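Before turning to specific architectures, a minimal numpy instantiation of Algorithm 1 (§2.10.3) for the least-squares regression setting may be useful; the dataset, batch size, and learning rate below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
N, d, K, eta = 10_000, 20, 64, 1e-2

X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

theta = np.zeros(d)
for step in range(2_000):
    idx = rng.integers(0, N, size=K)                  # line 2: sample a minibatch uniformly
    Xb, yb = X[idx], y[idx]
    grad_hat = (2.0 / K) * Xb.T @ (Xb @ theta - yb)   # line 3: minibatch gradient estimate
    theta -= eta * grad_hat                           # line 4: gradient step
print(np.mean((X @ theta - y) ** 2))                  # training error after SGD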
2.11.1 Neural network architectures

2.11.1.1 Nonlinearities

Common nonlinearities used in neural networks are:
• Sigmoid: σ(x) = 1 / (1 + e^{−x})
• Hyperbolic tangent: tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
• Rectified linear unit (ReLU): relu(x) = max{x, 0}

[Figure 2.2: Typical nonlinearities used in neural networks. The figure plots σ(x), tanh(x), and relu(x) over the interval [−4, 4].]

In the remainder of this chapter, we will occasionally overload the notation σ(x) to denote an arbitrary elementwise nonlinearity instead of the sigmoid function specifically. This notational convention is common in the literature on neural networks. Often the final layer lacks the nonlinearity. To avoid complicating definitions, we therefore allow these generic nonlinearities to include the identity map as well.

2.11.2 Fully connected neural network

A fully connected neural network with n layers is a function of the form

f(x) = f_n(f_{n−1}(··· f_1(x) ···)),

where each layer f_i takes the form

f_i(y) = σ_i(W_i y + b_i),

in which σ_i is a nonlinearity or an identity function, W_i is a "weight" matrix of appropriate size, and b_i is a "bias" vector of appropriate size. Using identity for the final output σ_n is common.

2.11.3 1D convolutional neural network

One-dimensional convolutional neural networks are used to process sequence data. Consider an input space X = R^{n×d}, where n is the sequence length and d is the per-step dimensionality. A 1D convolution of width k maps an input (x_1, ..., x_n) ∈ X to a new sequence (y_1, ..., y_{n−k+1}) according to

y_i = W_0 x_i + W_1 x_{i+1} + ··· + W_{k−1} x_{i+k−1} + b,

in which d′ is the output per-step dimensionality, each W_i is a d′ × d matrix, and b ∈ R^{d′} is a bias vector. Now consider a single scalar entry of a single vector in the output sequence, for example the first entry of the vector y_1. It is a linear combination of all kd scalar values of the input subsequence x_1, ..., x_k. Many common and important signal processing operations can be expressed as 1D convolutions, including smoothing, derivative estimation, and finite impulse response filters. Note that a convolution of width k reduces the output sequence length by k − 1.

A 1D convolutional neural network consists of more than one 1D convolution applied sequentially with nonlinearities in between. After each layer, the "receptive field" of each entry in the sequence grows. We then optimize all of the matrices W and the biases b simultaneously. We have left out many other details such as strides and pooling, which can be found in any tutorial material on convolutional neural networks.

2.11.4 Recurrent neural network

The term recurrent neural network (RNN) is used to describe a broad variety of differentiable function classes suitable for representing sequence-to-sequence mappings. In contrast to 1D convolutional networks (§2.11.3), RNNs can represent sequence-to-sequence mappings with infinite impulse response. A generic RNN is a discrete-time dynamical system of the form

x_{t+1} = f(x_t, u_t),    ŷ_t = g(x_t, u_t),

where x is the internal state, u is the input, ŷ is the output, and f and g are the dynamics and output functions respectively. The RNN model is parameterized by some real vector θ. Both f and g are differentiable with respect to their arguments and the parameter θ. The particular form of the functions f and g must be carefully chosen to maximize expressiveness while preserving desirable properties for optimization. It is important to emphasize that the RNN internal state x is an abstract quantity.
For example, if we optimize a RNN to evaluate arithmetic expressions, values of x in the learned RNN might contain similar information to the stack in a parser. If we optimize a RNN to imitate the input-output mapping of a Markovian physical system,x might end up being an approximately invertible function of the true system state. However, any such structure would arise implicitly via the optimization objective and would not be enforced. Depending on the application, the optimization objective for a RNN may be a function of the full output sequence y 0:T or only the final output y T . The RNN parameter θ is typically optimized with stochastic gradient descent (SGD) using backpropagation, so the gradient of the output loss is allowed to flow through the recursive applications off. This allows the RNN to learn dynamics where the effect of an input does not appear until many time steps later. RNNs have achieved state-of-the-art results on many sequence 61 modeling tasks, even though the objective is nonconvex and SGD may converge to suboptimal local minima (Lipton, 2015). 2.11.4.1 Longshort-termmemory The long short-term memory (LSTM) network is a particular kind of recurrent neural network—that is, a particular form of the functions f and g—designed to have favorable properties for optimization with gradient-based methods (Hochreiter and Schmidhuber, 1997). The LSTM is the de facto standard RNN architecture due to these properties (Lipton, 2015). The LSTM partitions the internal state x into two vectorsh∈R n andc∈R n . The functional form is most clearly expressed using the intermediate values i,f,q,p∈R n as i f q p = σ σ σ tanh ◦ W " h t u t # +b ! , c t+1 =f⊙ c t +i⊙ p, h t+1 =q⊙ tanh(c t+1 ), (2.38) where σ denotes the sigmoid nonlinearity (§ 2.11.1.1), the symbol ⊙ denotes elementwise multipli- cation, and ◦ denotes elementwise function composition. The learnable parameters are the matrix W ∈R 4n× (n+m) and the vectorb∈R n+m , wherem denotes the dimensionality ofu. In some applications of LSTMs the stateh is used directly as the output, i.e. g(h,c,u) = h. In other settings, it is appropriate forg itself to be a learned mapping. For example, if the output space is very low- dimensional, it may be useful to select a higher dimensionality ofh and learn a projectiony =Ph, where P is a “wide” matrix. The input can also be pre-processed, replacingu t in (2.38) with a learned function of u t . For example, if the input space is very high-dimensional (e.g. a one-hot encoding of English words), it may be useful to use a projection here too. It is also possible to form a multi-layer LSTM, in which subsequent layers take the internal statesh of previous layers as inputs. 62 Chapter3 ReinforcementLearningforUniversalPolicies In this chapter we work in the MDP family framework described in § 2.6. We propose a method to give reinforcement-learned policies the ability to adapt to unknown dynamics at test time. Our primary mo- tivation is the simulation-reality gap in robotics, where a policy optimized in simulation performs poorly in the real system that the simulator approximates. Adaptivity is also useful for deploying a pre-trained policy into a wide set of real-world scenarios, for example if designing an autonomous driving system that works on both sports cars and passenger vans. Our method merges ideas from the self-tuning regulator paradigm in adaptive control (§ 2.8.5) with the generality and representation learning ability of deep reinforcement learning. 
As discussed in §2.8.5, traditional techniques for adaptive control generally exploit structure in the MDP familyΦ to render the processes of estimatingϕ and adapting the policy computationally efficient. Unfortunately, many systems of interest in robotics do not admit analytical solutions for the system identification and policy synthesis problems. In particular, with regard to policy synthesis there are numerous problems in robotics and AI for which the only known satisfactory methods require hours of computation, such as some of the RL successes listed in Chapter 1. The work presented in this chapter was originally published in Preiss et al. (2018). The presentation has been substantially revised for clarity, but the results have not. 63 For system identification, if the whole-family system dynamics P(·|x,u;ϕ ) is only accessible through a non-differentiable simulator, then the only computationally feasible method for online system identifi- cation may be the function approximation approach (§2.8.1). By using deep neural networks for system identification, we can take advantage of their representation learning ability. We learn an embedding function that maps the system identification parameters in a family of MDPs Φ (see §2.6) into an abstract embedding space. The embedding retains enough information to support policy specialization to particu- lar MDPs withinΦ while potentially being easier to identify from state-action trajectories. Our framework also includes an observability-promoting reward that encourages the policy to balance the task goal with behavior that aids system identification. Our simulation experiments demonstrate desirable properties of the learned embedding in a toy ex- ample, but show only a modest improvement in ultimate optimality of the multi-system policy in a more complex example. The experimental results raise fundamental questions about reinforcement learning and multi-system control that motivate our theoretical work in subsequent chapters. 3.1 Relatedwork Although policies trained with reinforcement learning (RL) can achieve state-of-the-art performance on some tasks, they are often brittle and fail to generalize beyond the training environment, even when the differences are small (Zhang et al., 2018). An important instance of this problem is sim-to-real transfer for robotics. RL algorithms can often learn policies that exploit bugs in physics simulators, or vision-based policies that work with synthetic rendered inputs but not with real images. A natural first step to deal with such brittleness is to add randomization to the simulator. In the the domain randomization approach, we optimize the objective maximize π ∈Π E ϕ ∼ ζ J ϕ (π ), (3.1) 64 where Π ⊆ ∆( U) X is a policy class, ζ is some distribution over Φ , and J ϕ is the RL objective for ϕ (encapsulating the horizon and discount). Domain randomization during training can improve robustness (Antonova et al., 2017; Zhu et al., 2018; van Baar et al., 2019; Sadeghi and Levine, 2017; Tobin et al., 2017), but it is limited by its assumption that a single policy can perform adequately in all possible test domains. In this sense, domain randomization is similar in spirit to robust control. Other deep learning techniques in the robustness spirit include merging ensembles of policies (Parisotto et al., 2016; Teh et al., 2017), adversarial perturbations of state or observations (Pinto et al., 2017; Huang et al., 2017), and learning robust feature spaces (Higgins et al., 2017; Bousmalis et al., 2016). 
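As a sketch of how the objective (3.1) is typically estimated in practice, training alternates between sampling dynamics parameters and collecting returns. The helper functions make_env and rollout_return below are hypothetical stand-ins for a simulator factory and an RL rollout routine; they are not part of this dissertation's code, and the parameter ranges are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sample_phi():
    """Draw phi ~ zeta; here, a box-uniform distribution over two parameters."""
    return {"mass": rng.uniform(0.5, 2.0), "friction": rng.uniform(0.1, 1.0)}

def make_env(phi):
    """Hypothetical stand-in for constructing a simulator with dynamics P_phi."""
    return phi

def rollout_return(policy, env):
    """Hypothetical stand-in for running one episode and summing rewards."""
    return -abs(env["mass"] - 1.0)

def estimate_objective(policy, n_domains=32):
    """Monte Carlo estimate of E_{phi ~ zeta}[ J_phi(policy) ] in (3.1)."""
    returns = [rollout_return(policy, make_env(sample_phi())) for _ in range(n_domains)]
    return np.mean(returns)

print(estimate_objective(policy=None))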
Fine-tuning (Rusu et al., 2017) and (some) meta-learning approaches (Finn et al., 2017) assume there will be an opportunity to collect data from the test domain and update the policy. In this work, we seek a policy that specializes to a novel test environment rapidly, without using significant data or computational effort at test time. Recurrent neural network policies are also capable of fast adaptation to unobserved quantities but require more complex reinforcement learning algorithms (e.g. Wierstra et al., 2007). Duan et al. (2016b) specifically evaluate RNN policies as a tool for adapting to different tasks, as opposed to more generic POMDP settings. Another possibility is augmenting the MDP states with “memory” and adding actions so that the policy can write to the memory states (Peshkin et al., 1999). In our setting, we assume that dynamics parameters are known during training but unknown under test. Under the same assumptions, the method most similar to our is that of Yu et al. (2017), who train a policy in simulation that observesϕ and a neural network to estimateϕ from a state-action trajectory. Our method builds upon this by adding two contributions: 1) a learned embedding space that represents the system dynamics parameters in a form that is both useful and easy to identify, and 2) an observability reward that encourages the agent to maximize identification accuracy. 65 3.2 Problemstatement We consider reinforcement learning in a family of Markov decision processes using the notation defined in § 2.6, with variations only in the transition dynamics P . We assume that the system space Φ can be parameterized by a real vector and identify Φ with the parameter, i.e. we assume Φ ⊆ R d ϕ . We present our results for the case of finite-horizon MDPs with horizon H, but our method is also applicable to infinite- horizon MDPs. Our learning protocol is separated into training and testing phases. During training, we have access to a simulator for eachϕ ∈ Φ . During testing, the environment selects a particularϕ ∈ Φ , but does not reveal its choice to us. Our goal is to take near-optimal actions in the MDP ϕ despite not knowing the value ofϕ . In cases where Φ is large and unstructured, for example the set of all linear dynamical systems or all finite MDPs, the testing phase of our learning protocol has no meaningful difference from the basic RL problem (§ 2.7). At the other extreme, if Φ represents a small neighborhood of uncertainty around a single nominal MDP, then we are in the robust control regime (§2.8.3) and it is reasonable to seek a blind policyπ :X 7→U that is nearly optimal for allϕ ∈ Φ , for example by using the domain randomization approach (3.1). We consider the cases in between, where Φ is large but still highly structured. In such settings, noblind policy can perform adequately over the entirety ofΦ . However, it is still possible to gain much sample efficiency at test time compared a generic RL algorithm by doing some kind of meta-learning or other preparation during the training phase. 3.3 Method A natural approach in this protocol is to learn a policyπ :X× Φ 7→∆( U) alongside a system identification functionid ϕ :X <N ×U <N 7→ Φ that maps the history of past states and actions to an estimate ofϕ . 
Figure 3.1: Overview of our method. At training time, correct dynamics parameters are available from the simulator. A mapping e from parameters to an abstract embedding space is learned, along with a module id_ε to identify the embedding value from a state-action trajectory τ. The policy is rewarded for behavior that improves system identification accuracy. At testing time, the true dynamics parameters are no longer known, and the estimated embeddings are input directly to the policy.

At test time, we act with the policy u_t ∼ π(x_t, id_ϕ(x_{0:t}, u_{0:t−1})). This method was explored by Yu et al. (2017), in which the authors refer to π as a universal policy, and refer to their method as UP-OSI (universal policy with online system identification). Our method addresses several hypothetical failure modes of UP-OSI. First, UP-OSI requires estimating every dimension of ϕ, even though some may be redundant, difficult to estimate, or unneeded to maximize reward. Second, behavior that maximizes reward in training may be suboptimal for system identification at test time. It is preferable to learn a behavior that balances the primary reward with a secondary goal of making the system identification task as easy as possible. For example, some adaptive control methods for linear systems require a persistently exciting input, but inputs from a stabilizing linear feedback controller may fail to be persistently exciting (§2.8.1.1).

We address the first concern by introducing a learned abstract representation E ⊆ R^{d_E} of the dynamics parameters. (The dimensionality d_E is a user-chosen hyperparameter.) During training, we learn an embedding function e : Φ ↦ E and a universal policy π_ε : X × E ↦ ∆(U) conditioned on an embedding value instead of the environment parameter ϕ. We simultaneously learn an identification function id_ε : X^{<N} × U^{<N} ↦ E to estimate the embedding value from the past states and actions. Then, at test time, we act with the policy u_t ∼ π_ε(x_t, id_ε(x_{0:t}, u_{0:t−1})). We address the second concern by augmenting the main RL reward with a term penalizing the estimation error of id_ε. This rewards behavior that makes estimating ε easier. Our method is illustrated in Figure 3.1.

3.3.1 Learning algorithms

Simple case: policy and identification decoupled. We learn the embedding function e and the universal policy π_ε in an end-to-end fashion using a standard reinforcement learning algorithm. We reduce the universal policy optimization problem to a standard RL problem. First, we choose a distribution over environment parameters ζ ∈ ∆(Φ). For example, if Φ is bounded, we might choose ζ to be uniform. We then create an augmented MDP with the state space X′ = X × Φ, the dynamics P′((x, ϕ), u) = P_ϕ(x, u), and the initial state following the probability density function ρ′((x, ϕ)) = ρ(x) ζ(ϕ). Then, performing RL on our augmented MDP is equivalent to optimizing the objective

\[
\underset{\pi_\varepsilon,\, e}{\text{maximize}} \quad J_R(\pi_\varepsilon, e) \triangleq \mathbb{E}_{\phi \sim \zeta}\!\left[ \mathbb{E} \sum_{t=0}^{H-1} r(x_t, u_t) \right],
\tag{3.2}
\]

where the inner expectation is with respect to x_0 ∼ ρ, u_t ∼ π_ε(x_t, e(ϕ)), and x_{t+1} ∼ P_ϕ(x_t, u_t).
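The augmented-MDP construction above is mechanical enough to state in code. The following is a minimal sketch written as a classic OpenAI Gym wrapper; the `set_dynamics_parameter` hook on the wrapped environment and the `sample_phi` callable implementing ζ are hypothetical placeholders, not part of any particular simulator.

```python
import numpy as np
import gym

class AugmentedMDP(gym.Wrapper):
    """Augment the observation with dynamics parameters phi ~ zeta (Section 3.3.1).

    Assumes the wrapped env exposes a hypothetical `set_dynamics_parameter(phi)`
    hook and uses the classic Gym API (reset -> obs; step -> obs, r, done, info).
    """
    def __init__(self, env, sample_phi):
        super().__init__(env)
        self.sample_phi = sample_phi   # callable implementing the distribution zeta
        self.phi = None

    def reset(self, **kwargs):
        self.phi = np.asarray(self.sample_phi())     # draw phi ~ zeta at episode start
        self.env.set_dynamics_parameter(self.phi)    # hypothetical simulator hook
        x0 = self.env.reset(**kwargs)
        return np.concatenate([x0, self.phi])        # augmented state (x, phi)

    def step(self, u):
        x, r, done, info = self.env.step(u)
        return np.concatenate([x, self.phi]), r, done, info
```

Running any standard RL algorithm on this wrapper (with the policy consuming e(ϕ) rather than ϕ itself, as described above) then optimizes the objective (3.2).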
To optimize the RL objective (3.2), we can use any "off-the-shelf" reinforcement learning algorithm that supports continuous state and action spaces and can optimize the composition of e and π_ε directly.

After RL is complete, we optimize the system identification function id_ε according to the objective

\[
\underset{\mathrm{id}_\varepsilon}{\text{minimize}} \quad J_{\mathrm{id}}(\mathrm{id}_\varepsilon; \pi_\varepsilon, e) \triangleq \mathbb{E}_{\phi \sim \zeta}\!\left[ \mathbb{E} \sum_{t=0}^{H-1} \big\| \mathrm{id}_\varepsilon(x_{0:t}, u_{0:t-1}) - e(\phi) \big\|_2^2 \right],
\tag{3.3}
\]

where the inner expectation is as in (3.2) with respect to the optimal policy obtained from RL. Because this expectation depends on the action distributions of π_ε, the optimal system identification function id_ε can be different for different policies.

Complex case: observability reward. As discussed in §3.3, the optimal policy for the objective (3.2) does not necessarily take actions that make system identification feasible. We can address this by adding a term to the RL objective that penalizes system identification error. Specifically, for a fixed system identification function id_ε, we modify the RL objective to

\[
\underset{\pi_\varepsilon,\, e}{\text{maximize}} \quad J_O(\pi_\varepsilon, e; \mathrm{id}_\varepsilon) \triangleq \mathbb{E}_{\phi \sim \zeta}\!\left[ \mathbb{E} \sum_{t=0}^{H-1} r(x_t, u_t) - \alpha \big\| \mathrm{id}_\varepsilon(x_{0:t}, u_{0:t-1}) - e(\phi) \big\| \right],
\tag{3.4}
\]

where α > 0 is a user-chosen weight. We refer to the added term as the observability reward. Note that the value of the observability reward at time t is completely determined by the actions taken before time t. This is no different from the "credit assignment problem" already present in reinforcement learning, and is accounted for by the RL algorithm.

Our algorithm alternates between optimizing the policy and the system identification estimator. In an idealized setting, we would follow Algorithm 2. Note that on line 1 of Algorithm 2, we initialize the policy and embedding function without the observability reward. It would also be possible to initialize the system identification function first using some default policy, for example a policy that takes purely random actions independent of state and embedding value. However, we did not explore this possibility in our work.

Algorithm 2 Idealized algorithm
1: π_ε^(0), e^(0) ← argmax J_R(π_ε, e)                        ▷ initialize policy for task reward only
2: for i ∈ 1, ..., N do
3:     id_ε^(i) ← argmin J_id(id_ε; π_ε^(i−1), e^(i−1))        ▷ update sysID estimator for current policy
4:     π_ε^(i), e^(i) ← argmax J_O(π_ε, e; id_ε^(i))            ▷ update policy for current sysID estimator
5: end for

We note a possible failure mode of this approach: if the observability reward weighting α is chosen to be too large, its contribution could dominate the RL reward. As α → ∞, one optimal solution would be for e and id_ε to both be constant functions. The hyperparameter α must be selected to ensure this does not happen.

In practice, it is computationally expensive to solve the optimization problems within each iteration of Algorithm 2. In our experiments, we instead alternate between one iteration of RL for (π_ε, e) followed by one iteration of gradient descent for supervised learning of id_ε, as shown in Algorithm 3.

Algorithm 3 Practical algorithm
Initialize π_ε, e, id_ε randomly
1: for i ∈ 1, ..., N do
2:     sample ϕ_1, ..., ϕ_B i.i.d. from ζ
3:     collect trajectories τ_1, ..., τ_B from π_ε for each ϕ_i
4:     perform one iteration of RL to optimize π_ε, e for J_O
5:     perform one iteration of SGD to optimize id_ε for J_id on the dataset {τ_i}_{i=1}^B
6: end for

3.3.2 Implementation details

Function classes. In our experiments, the universal policy π_ε is parameterized as a fully-connected neural network with 2 hidden layers of 128 units each, using ReLU nonlinearities (§2.11.1.1).
The embedding functione is also a fully-connected neural network, containing one hidden layer of 128 units. We employ a one-dimensional convolutional architecture for the identifier function id ε , composed of three 1D-convolutional layers, each with 64 filters of width 3 and ReLU activation, followed by a single fully connected layer with 128 units, and a linear output layer. (See § 2.11.3 for more information on 70 1D-convolutional neural networks.) Convolutions in time match the intuition that differentiation of the state and action trajectories is often required to identify the underlying dynamics parameters in real-world physical systems. If the discrete-time dynamics are derived from integrating continuous-time dynamics, then the finite-difference operations used to approximately recover derivatives are naturally represented as convolutions in the time dimension. For example, if x 1 ,x 2 ,... is a sampling of a continuous signal with one unit of time per step, then convolution in time by the kernel [− 1,1] approximates the signal’s derivative, and the kernel[1,− 2,1] approximates its second derivative. Although we defined id ε in § 3.3 as a function of all states and actions since the episode beginning, the composition of convolutions with a fully-connected layer (as opposed to e.g. convolution followed by taking the mean over time) implies a fixed-length window of state-action inputs. Therefore, we choose a window length K and supply onlyx t− K:t ,u t− K:t− 1 to id ε . For steps whent < K, we define the initial states and actions in the window as zero vectors. In our experimentsK =16. RL algorithm For all experiments, we use the entropy-regularized Soft Actor-Critic (SAC) reinforce- ment learning algorithm (Haarnoja et al., 2018). SAC is an off-policy algorithm where a stochastic policy is trained only with TD-learned value function estimates. We observed significantly higher rewards using SAC compared to the on-policy, Monte Carlo policy gradient algorithm PPO (Schulman et al., 2017). We conjecture that the use of TD-learning and a replay buffer is especially helpful in our scenario compared to single-environment training, since the replay buffer helps prevent “forgetting” about areas of the dynamics distributionp Φ that have not been sampled recently. The replay buffer is used for RL but not to learn the system identification function id ϕ , to ensure that the (nonstationary) training distribution of state-action trajectories forid ϕ reflects the same policy behavior that will be observed at test time. The value function estimators used by SAC are parameterized by fully-connected networks of identical structure to the policy. Forblind policies, the value function networks are conditioned only on the observed statex∈X to emulate domain randomization approaches. Forplain policies, they are conditioned on both 71 x andϕ ∈ Φ to emulate the setup of Yu et al. (2017). For ours policies, they are conditioned onx,ϕ , and ε ∈ E. Although the embedding function e could theoretically be optimized via the least-squares value learning loss of SAC, we optimize it only via the policy gradient. 3.4 Experiments In this section, we show desirable properties of our learned embedding spaceE using a toy example and compare the performance of our architecture against several baselines on a more complicated benchmark problem in robotic locomotion. 3.4.1 Point-MassEnvironment This low-dimensional system allows us to visualize learned embeddings. 
A 2D point mass with position p ∈ R² follows the dynamics p̈ = gu − d ṗ, where the action u ∈ [−1, 1]² is a bounded force input, the parameter g ∈ R_{≠0} is a gain factor, and the parameter d ∈ R_{≥0} is a damping factor. The goal is to push the point towards the origin, specified by the reward r = −∥p∥₂.

To convert these second-order continuous-time dynamics into a discrete-time dynamical system suitable for standard reinforcement learning algorithms, we apply the state space transformation

\[
x(t) = \begin{bmatrix} p(t) \\ v(t) \end{bmatrix},
\qquad
\dot{x}(t) = \begin{bmatrix} v(t) \\ g\, u(t) - d\, v(t) \end{bmatrix},
\]

where v ∈ R² is the velocity, and discretize with simple forward Euler integration over the time interval ∆t to yield the discrete-time system

\[
x_t = \begin{bmatrix} p_t \\ v_t \end{bmatrix},
\qquad
x_{t+1} = x_t + \begin{bmatrix} \Delta_t\, v_t \\ \Delta_t\, (g u_t - d v_t) \end{bmatrix}.
\]

The dynamics parameter ϕ is the gain factor g ∈ ±[0.175, 1.75]. Due to the unknown sign of g, a policy that ignores the value of ϕ cannot possibly perform well in this environment.
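The discrete-time point-mass system above is simple enough to state directly in code. The sketch below is an assumed implementation: the time step, episode length, and initial-state distribution are illustrative choices not specified in the text.

```python
import numpy as np

class PointMassEnv:
    """Forward-Euler point mass from Section 3.4.1; reward is -||p||_2.

    dt, horizon, and the initial-state distribution are illustrative
    assumptions, not values taken from the experiments.
    """
    def __init__(self, gain, damping=0.0, dt=0.05, horizon=200):
        self.g, self.d, self.dt, self.horizon = gain, damping, dt, horizon

    def reset(self, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        self.t = 0
        self.p = rng.uniform(-1.0, 1.0, size=2)   # position
        self.v = np.zeros(2)                      # velocity
        return np.concatenate([self.p, self.v])

    def step(self, u):
        u = np.clip(u, -1.0, 1.0)                 # bounded force input
        p_next = self.p + self.dt * self.v
        v_next = self.v + self.dt * (self.g * u - self.d * self.v)
        self.p, self.v = p_next, v_next
        self.t += 1
        reward = -np.linalg.norm(self.p)
        done = self.t >= self.horizon
        return np.concatenate([self.p, self.v]), reward, done, {}
```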
Figure 3.2: Point-mass example system: learned mapping e from gain g to one dimension of the embedding space E.

Figure 3.3: Point-mass example system: comparison of actual vs. estimated gain g (left) and one dimension of the embedding ε (right). The embedding separates parameters requiring disjoint behavior into clusters.

Figure 3.4: Point-mass system with redundant mass parameter: multidimensional visualization of the learned embedding. Axes represent the gain and mass parameters g, m. Colors represent both dimensions of the learned embedding ε: one dimension controls the amount of orange; the other dimension controls the amount of blue. Note that lines of constant gain/mass ratio are mapped to approximately the same color. The plot is restricted to the positive gain domain for clarity.

For our experiments, we select the embedding dimension d_E = 2, even though the true space Φ is only one-dimensional, to emphasize that making ε higher-dimensional than necessary is not detrimental. We show results in Figures 3.2 to 3.4. In Figure 3.2, we show one output dimension of the learned embedding function e : Φ ↦ E after applying our method to the point-mass system. We observe that the embedding "squashes" all positive gains to approximately the same value, and does the same for negative gains. In Figure 3.3, we evaluate the accuracy of the learned system identification functions in both our method and in the plain method. The scatter plots compare the ground truth values of ϕ (resp. ε) against the values estimated by id_ϕ (resp. id_ε). We again see the effect of the "squashing" done by id_ε, resulting in two easily distinguished clusters.

To demonstrate how the learned embedding can eliminate redundant parameters in Φ, we construct a version of the point-mass environment with a mass parameter m > 0 alongside the gain g, such that p̈ = (g/m)u. The dynamics of all combinations with the same ratio g/m are indistinguishable. This implies that in the plain method, the system identification loss will always be large. In Figure 3.4, we visualize both dimensions of our learned embedding for this version of the point-mass environment. Our framework learns an embedding where all (g, m) combinations with similar g/m ratios map to a similar embedding value.

3.4.2 Half-Cheetah environment

Figure 3.5: Variations of the Half-Cheetah environment produced by randomization of kinematic and dynamic properties.

As a more complex benchmark task, we demonstrate results on the Half-Cheetah planar locomotion environment from the OpenAI Gym (Brockman et al., 2016). We randomize the length of the torso and the lengths of the three segments in each leg. For each of the six rotational joints, we randomize the beginning and end of the angular range, the gear ratio of the actuator, a velocity-proportional damping constant, and the stiffness of a virtual spring pulling the joint towards a specified resting angle. In total, 37 parameters are randomized. For joint ranges, the limits on either side are shifted by ω ∼ Uniform(−0.3, 0.3) radians from nominal. All other parameters must take nonnegative values, and some should not be too close to zero. To ensure this, we multiply the nominal value of those parameters by a random nonnegative ratio β^p, where p ∼ Uniform(−1, 1). The parameter β > 1 controls the amount of randomness. For example, if β = 2, then 1/2 ≤ β^p ≤ 2. In our experiments, β = 1.75, which was the largest value we could use without generating too many overly difficult configurations. Some examples of the randomized half-cheetahs are shown in Figure 3.5.

Figure 3.6: Box plot of training and test reward for blind, ours, and plain policies in the randomized half-cheetah environment. Distributions are over the random sample of environments. Whiskers indicate the full extent (min/max) of rewards.

Due to the architecture of the MuJoCo physics simulator used in this environment (Todorov et al., 2012), it is not practical to sample new random dynamics parameters ϕ for each training iteration. Instead, we construct a "universe" of 256 models initially, and sample a new "universe" at test time. We select the embedding space E to be 8-dimensional. We did not observe significant sensitivity to this choice.

Results are shown in Figure 3.6. The variance of rewards is large due to the randomized dynamics. (For example, a random variant with strong damping will require more control effort to reach the same speed.) The blind policy fails to achieve high rewards because it is unable to specialize its behavior to the dynamics parameters. Naturally, it also does not suffer a performance loss on the test set. The plain policy, conditioned on ϕ instead of ε, achieves similar training rewards but suffers negative rewards in some test environments, while ours does not. In current experiments, we have not yet observed a significant effect of our observability reward term, so we do not include its results in Figure 3.6.

3.5 Discussion

Our experiments raise several questions. First, we observed that our learned embedding space e : Φ ↦ E offered only modest improvement over directly identifying ϕ. This may be because identifying ϕ for the half-cheetah environment is not difficult, despite its high dimensionality. Also, we hypothesized that our embedding would be useful for cases where two ϕ, ϕ′ ∈ Φ are indistinguishable, but if they are indistinguishable, then the adaptive policy does not need to differentiate them. Therefore, if the system identification module only outputs ϕ, the least-squares system identification loss will be nonzero but the policy reward will still be high. Second, we did not observe any effect of adding our observability reward.
The simplest explanation is that optimal behavior for the task reward already yields good observability as a side effect. Third, we observed a result during our experiments that we did not report in §3.4 because it was tangential to the main work: a performance gap between the universal policy and the average performance of individual "expert" policies trained for a single ϕ ∈ Φ. We investigate this phenomenon in a simplified setting in §3.6.

3.6 Simplified experiment: Universal policy versus experts

In this section we investigate the behavior of RL with a "universal policy" independent of any concern about system identification. We eliminate the testing phase from our protocol and focus purely on the performance during the training phase. Without the need to identify ϕ in the training phase, the system embedding and observability reward proposed in §3.3 are no longer useful, so we eliminate them from our experiments. Therefore, we are working in the same setup as Yu et al. (2017).

We compare the performance of a multi-system "universal policy" neural network against the aggregate performance of a collection of single-system policies. Recalling that ζ is some distribution over Φ, we sample a fixed set of N systems Φ_s = ϕ_1, ..., ϕ_N ∼ ζ^N. For each ϕ_i ∈ Φ_s, we train an "expert" policy π_i to optimize the RL objective for ϕ_i. We terminate the RL algorithm for π_i after a fixed number T of interactions with the MDP ϕ_i. We also train a multi-system policy π_multi, without using any embedding or observability reward, in the augmented MDP as described in §3.3.1 with the system distribution ζ = Uniform(Φ_s). Note that this distribution is uniform over the sample, not over all of Φ. We train π_multi for NT interactions with the augmented MDP. Therefore, the universal policy π_multi and the collection of expert policies each consume the same total amount of environment interactions in aggregate.

As a simple test environment, we use a linearization of the "planar quadrotor" (Singh et al., 2021). The system is linearized about its hover state. We randomize the mass and moment of inertia parameters, sampling from a log-uniform distribution over two (decimal) orders of magnitude. In this system, X = R⁶, U = R², and Φ ⊆ R². The reward is the negative LQR cost (2.25) with Q and R identity matrices. We train all policies using the implementation of the PPO algorithm (Schulman et al., 2017) from the popular reinforcement learning library stable-baselines3 (Raffin et al., 2021). We leave all hyperparameters of the policy class and RL algorithm at their default values.

A goal of multi-system learning is to leverage shared structure between systems to improve sample efficiency. If a multi-system learning method is able to do this, then we expect to see that for some values of T,

\[
\frac{1}{N} \sum_{i=1}^{N} J_{\phi_i}(\pi_i) \;\leq\; \frac{1}{N} \sum_{i=1}^{N} J_{\phi_i}(\pi_{\mathrm{multi}}),
\]

where J_{ϕ_i}(π) denotes the RL objective for system ϕ_i.
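A schematic sketch of this experimental protocol using stable-baselines3 PPO follows. The environment constructors `make_env` and `make_augmented_env` are hypothetical placeholders for the single-system and augmented-MDP environments; this is not the original experiment code, only a minimal illustration of the equal-budget comparison.

```python
from stable_baselines3 import PPO

def train_experts_and_multi(make_env, make_augmented_env, phis, T):
    """Train N single-system experts (T steps each) and one multi-system
    policy (N*T steps), matching the total interaction budget of Section 3.6.

    make_env(phi) and make_augmented_env(phis) are hypothetical constructors.
    """
    experts = []
    for phi in phis:                                   # one expert per sampled system
        model = PPO("MlpPolicy", make_env(phi), verbose=0)
        model.learn(total_timesteps=T)
        experts.append(model)

    multi = PPO("MlpPolicy", make_augmented_env(phis), verbose=0)
    multi.learn(total_timesteps=len(phis) * T)         # same aggregate budget
    return experts, multi
```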
We visualize results for this experiment in Figure 3.7. To account for the different optimal policy costs for different systems, we plot the ratio of the learned policy's cost to the cost of the LQR-optimal linear controller for the infinite-horizon version of the problem. (Note that this normalization is performed only for plotting, not for the RL reward.) Learning curves for each random planar quadrotor are plotted individually. The training steps on the horizontal axis are per-system, so each step corresponds to an equivalent amount of training time for both the multi-system policy and the aggregated single-system policies.

We observe that the multi-system policy does initially achieve lower LQR cost for most systems. However, as training progresses, in most of the systems the "expert" policies eventually become more optimal than the multi-system policy. This suggests that the multi-system setup is able to exploit shared structure to some degree, but there is also some phenomenon inhibiting the multi-system setup from reaching full optimality. It also appears that in the multi-system setting, PPO converged by around 1 million steps (100k steps per system) but became unstable after around 2 million steps (200k steps per system).

Discussion. We emphasize that this experiment is not a strong negative result against the "universal policy" architecture. It only shows that applying the simplest possible approach is not enough. Researchers have proposed more sophisticated approaches for RL over continuous system spaces (for example, Kalashnikov et al., 2021). We include this experiment only as a motivating example for our subsequent theoretical inquiry into reinforcement learning and multi-system control.

There are two broad categories of explanations for this result. One possible explanation is that the augmented MDP is somehow unfavorable for the RL algorithm. Another possible explanation is the policy class: it might be hard to represent a good multi-system policy with a neural network. These two possibilities motivated us to seek a better understanding of RL algorithms and multi-system control at a theoretical level, which we explore in Chapter 5 and Chapter 6 respectively.

Figure 3.7: Learning curves for the multi-system "universal policy" and single-system "expert" policies for nine random linearized planar quadrotor systems.

Chapter 4

Deformable Manipulation using Learned Models

In this chapter we investigate the middle ground between model-free reinforcement learning and traditional control methods based on physics-derived models. Our interest is motivated by the following observation: in MDPs for robotics, the reward function is often known, or even designed by an engineer. The main source of complexity in many tasks is the environment dynamics. For example, in the half-cheetah simulated locomotion environment discussed in Chapter 3, the reward is simply forward velocity, which is one of the system states. Therefore, we consider using learning to account for complex dynamics but traditional estimation and control techniques to act optimally within the learned model. We consider the test case of deformable object manipulation, where first-principle physical models are usually high-dimensional.

In the present work, we assume that an engineer can design an experiment that sufficiently explores the state space. After collecting a dataset of trajectories from the real system, we train a recurrent neural network (RNN) to approximate its input-output behavior with a latent state-space model. Unlike finite element models and other physics-derived models of deformable objects, the RNN internal state is low-dimensional enough to enable realtime nonlinear control methods.
We demonstrate a closed-loop control scheme with the RNN model using a standard nonlinear state observer and model-predictive controller. The work presented in this chapter was originally published in Preiss et al. (2022). It is reproduced here with minimal changes. 82 We apply our method to track a highly dynamic trajectory with a point on the deformable object, in real time and on real hardware. Our experiments show that the RNN model captures the true system’s frequency response and can be used to track trajectories outside the training distribution. In an ablation study, we find that the full method improves tracking accuracy compared to an open-loop version without the state observer. 4.1 IntroductionandRelatedWork Manipulating deformable objects represents a challenging area of robotics. In contrast to typical objects consisting of a single rigid body, deformable objects often admit limited control authority and have dy- namics that are difficult to predict. At the same time, many objects of human interest are deformable. Safe, reliable robotic manipulation of these objects is critical for capable general-purpose robots (Arriola-Rios et al., 2020; Zhu et al., 2021). As with many manipulation problems in robotics, there are multiple ways to model the dynamical behavior of the manipulated object. Broadly speaking, there are two classes of models: Fully data-driven methods based on an expressive function class with many parameters, and analytical methods based on a physical model with fewer parameters that can be identified from data. Physics-based models of deformable objects have been studied extensively in science and engineering contexts, and many constitutive models have been described for various materials (Holzapfel, 2002). These models are continuous in space and time, and simulation on computer hardware requires a discretization strategy. Finite element methods (FEM) have been used for decades to solve continuum mechanics prob- lems (Wriggers, 2008) and have shown promise for robotic control. Recent work with FEM modeling addresses dynamic control of deformable objects and soft robots with offline trajectory optimization (Zim- mermann et al., 2021; Bern et al., 2019; Duenser et al., 2018; Li et al., 2021; Qiao et al., 2020; Heiden et al., 2021a). FEM is appealing as it admits extensive theoretical analysis (Barbič and Popović, 2008; Thieffry 83 et al., 2018). For real-world problems, it is possible to estimate the parameters of FEM meshes in a prin- cipled way from exteroceptive sensors common in robotics (Hahn et al., 2019). Alternative discretizations include the material point method (Sulsky et al., 1994; Hu et al., 2019), (extended) position based dynamics (Macklin et al., 2016), and meshless shape matching (Müller et al., 2005). Additionally, task-specific reduced states can be formulated, which accelerate control, perception and planning (Mcconachie and Berenson, 2018; McConachie et al., 2020). Deformable objects can be complex to model, and for objects with resonant dynamics, seemingly mi- nor errors in model assumptions or material parameter estimates can cause large deviations in dynamic behavior over time (Bern et al., 2020). For these reasons, purely data-driven methods are an appealing alternative to physics-based models. 
Machine learning methods have been explored in deformable ma- nipulation (Mirza, 2020), for purely kinematic trajectory tracking (Bern et al., 2020), for cable-driven soft robot actuators (Bruder et al., 2019), for control of pneumatic deformable mechanisms (Gillespie et al., 2018), and for structures actuated by shape-memory alloys (Sabelhaus and Majidi, 2021). For scenarios where controllers can continuously interact with their environment to improve, model-based reinforce- ment learning has been proposed (Thuruthel et al., 2019). Other data-driven approaches studied in soft robot control include proper orthogonal decompositions (Tonkens et al., 2021). In this work, we propose a data-driven method for modeling and trajectory-tracking control of a de- formable object. Tracking fast trajectories serves as a benchmark for a method’s ability to predict and account for dynamics in the manipulated object. Our method models the dynamics with a long short- term memory (LSTM) recurrent neural network (RNN) (Hochreiter and Schmidhuber, 1997; Lipton, 2015), trained in a standard sequence modeling setup using input-output trajectory data from a real physical system. The internal state of the learned RNN is not physically meaningful, but the RNN still forms a discrete-time dynamical system that is compatible with standard methods in state-space nonlinear control. 84 We therefore apply model-predictive control (Allgöwer and Zheng, 2012) and extended Kalman filtering methods (Kalman, 1960) to track the trajectory using the RNN model. We implement our method to run online with a real robot, and use it to “draw in the air” along a fast trajectory with the free end of a pool noodle (long foam cylinder) held by the robot arm. To measure how well non-instantaneous dynamics are captured, we compare frequency response plots of the true system and the RNN model. To verify that a nonlinear observer helps compensate for model errors, we perform an ablation study comparing our method to an open-loop variant where the RNN model is assumed to be perfect. We find that the full closed-loop method reduces trajectory tracking error. RNNs have a long history within the broader field of nonlinear state-space system identification with partial observability (Nelles, 2001). Such models can be used as an intermediate tool for learning a control policy, e.g. in model-based reinforcement learning (Ha and Schmidhuber, 2018), or directly with nonlinear observers and controllers as in our method (Terzi et al., 2021). To our knowledge, our work represents the first application of the RNN/observer/MPC architecture to the task of manipulating deformable objects. 4.2 ProblemSettingandPreliminaries We consider a robot manipulating a deformable object such that a particular point on the object tracks a trajectory. We define a setting with the following assumptions: Assumption4.2.1 (Input). The deformable object is attached to the robot’s end effector by an unbreakable grasp at a fixed position. The control policy interacts with the robot by commanding the position and attitude of the grasp point. Assumption4.2.2 (Output). The robot’s perception system can accurately measure the three-dimensional position of a single point on the surface of the deformable object. 85 Figure 4.1: Physical testbed for our method. Trajectories are tracked by a Vicon motion capture marker attached to the end of a pool noodle, rigidly held by a Franka Emika Panda robot. 
Pitch/yaw inputs ū = (ϕ, ψ), tracking point measurement ȳ, and coordinate axes (X, Y, Z) are shown. (The fiducial marker visible in the image is not used.)

Assumption 4.2.3 (Objective). The robot should manipulate the deformable object such that the measured point on its surface tracks a given trajectory in three-dimensional space.

Assumption 4.2.4 (Protocol). Interaction with the environment is divided into two stages. In the preparation stage, the robot can interact with the deformable object, perform calculations, and store results. There is no time limit on this stage. In the testing stage, a supervisor reveals the trajectory to be tracked. The robot must track the trajectory promptly without slow pre-computations. The robot can use all the results stored from the preparation stage, but cannot interact with the deformable object in any way other than attempting to track the trajectory. This restriction is motivated by safety- and time-constrained tasks, where it may not be feasible for the robot to perform additional exploratory actions.

Figure 4.2: Diagram of our system. A small dataset of control inputs and tracked marker locations is obtained from the real system. A recurrent neural network (RNN) model is trained to predict input/output behavior with a latent state. The RNN forms the nonlinear dynamics model for an extended Kalman filter (EKF) and model-predictive controller (MPC) to track a dynamic input trajectory with the deformable object.

Formally, we assume that the coupled system of the robot and deformable object can be described by a continuous-time, deterministic, time-invariant dynamical system

\[
\dot{\bar{x}} = \bar{f}(\bar{x}, \bar{u}),
\qquad
\bar{y} = \bar{h}(\bar{x}),
\tag{4.1}
\]

where the state x̄ is an abstract infinite-dimensional quantity representing the full continuum state of the deformable object and the robot, the input ū ∈ SE(3) is the desired pose of the robot end effector, and the output ȳ ∈ R³ is the position of the measured point. The unknown dynamics f̄ encapsulates both the deformable object itself and a low-level controller that attempts to track the end effector pose input ū by issuing actuator commands, e.g. motor torques, to the robot. The measurement model h̄ extracts the position of the measured point from the continuum state.

Assumption 4.2.5. The true system has a unique globally exponentially stable equilibrium state x̄₀ corresponding to the steady-state identity input I, i.e. f̄(x̄₀, I) = 0. This assumption is realistic for damped elastic objects, such as closed-cell foam. We assume that no new plastic deformations to the foam (such as a permanent bend) are introduced during our experimental protocol.

The control policy interacts with the system (4.1) in discrete time steps of fixed length ∆t. In this paragraph we overload the same notations for discrete-time and continuous-time inputs, states, and outputs; for the remainder of the chapter we will refer to the discrete-time quantities exclusively. At step k ∈ N, the discrete-time input ū[k] is supplied to the continuous-time system as a zero-order hold, resulting in the continuous-time input signal ū(t) = ū[⌊t/∆t⌋] for all t ≥ 0, and the discrete-time output ȳ[k] is sampled such that ȳ[k] = ȳ(k∆t).
Our control task is specified by a discrete-time signal of K goal positions z[1], ..., z[K] ∈ R³ for the tracked point and the cost function

\[
J = \sum_{k=1}^{K} \big\| z[k] - \bar{y}[k] \big\|_W^2,
\tag{4.2}
\]

where the weighting matrix W ⪰ 0 encodes a (potentially) non-isotropic tracking objective.

Remark. The protocol in Assumption 4.2.4 rules out some methods that have been widely used for deformable manipulation. Trajectory optimization methods that build a local dynamics model around a reference trajectory by interacting with the environment, such as guided policy search (Thuruthel et al., 2019), violate the rule against non-task interaction in the testing stage. On the other hand, goal-conditioned reinforcement learning (e.g. Andrychowicz et al., 2017) would be compatible with our protocol.

4.3 Methods

The components of our method are outlined in Figure 4.2. To ensure that our system relies on the whip-like resonant behavior of the deformable object to track trajectories, rather than relying on large translational movements of the robot end effector, we restrict the inputs to only the pitch ϕ and yaw ψ angles of the end effector. We denote this restricted space as U ⊂ SE(3). In practice, we generate training data within a compact subset of pitch and yaw angles, and regularize the MPC problem so that inputs far from the identity I are penalized. Therefore, we parameterize U by R² (in radians) and ignore potential issues of multiple covering (§2.2.3).

4.3.1 Data collection

We begin by collecting a training dataset D containing N input-output trajectories from the real system. We denote by ū[j, k] and ȳ[j, k] the input and output from the k-th time step of the j-th trajectory in D. Before starting each trajectory, we apply the identity input and allow the system to settle for several seconds (10 in our experiments) to ensure that the system returns to a state very close to the rest state x̄₀. This will happen promptly due to Assumption 4.2.5. We then apply random sinusoidal pitch and yaw inputs. Sinusoidal inputs excite the system enough to demonstrate large-scale dynamics such as resonance that are important for manipulation, but are smooth enough to respect the actuator limits of our robot.

The sinusoidal inputs are chosen to respect a user-selected angular acceleration limit ω̇_max and absolute angle limits ϕ_min, ϕ_max and ψ_min, ψ_max to avoid triggering our robot arm's built-in safety stop when actuator limits are reached. For each angle, the sinusoid's frequency ν is sampled log-uniformly between 0.125 Hz and 3 Hz. The maximum angular acceleration of a sinusoid is given by 2πAν², where A is the amplitude. Therefore, ν and the acceleration limit induce a maximum amplitude A ≤ A_max = ω̇_max / (2πν²). We sample the amplitude uniformly from [0.1, A_max]. Finally, we sample the phase uniformly from [0, 2π) and add a constant offset, which is sampled uniformly from the range induced by the amplitude and angle limits.
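A minimal sketch of this sampling procedure, using the amplitude bound stated above, is shown below. The numerical limit arguments are placeholders, and we assume the acceleration limit and angle range are generous enough that the sampled amplitude fits inside them.

```python
import numpy as np

def sample_sinusoid(rng, accel_max, angle_min, angle_max,
                    freq_range=(0.125, 3.0), amp_floor=0.1):
    """Sample the parameters of one sinusoidal excitation signal (Section 4.3.1)."""
    lo, hi = freq_range
    nu = np.exp(rng.uniform(np.log(lo), np.log(hi)))       # log-uniform frequency (Hz)
    amp_max = accel_max / (2.0 * np.pi * nu ** 2)           # acceleration-limited amplitude
    A = rng.uniform(amp_floor, max(amp_max, amp_floor))
    phase = rng.uniform(0.0, 2.0 * np.pi)
    offset = rng.uniform(angle_min + A, angle_max - A)      # keep the signal inside the angle limits
    return A, nu, phase, offset

def evaluate_sinusoid(t, A, nu, phase, offset):
    """Evaluate the sampled sinusoid at time(s) t in seconds."""
    return offset + A * np.sin(2.0 * np.pi * nu * t + phase)
```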
4.3.2 RNN dynamics model

The true state x̄ of the deformable object is an infinite-dimensional continuum of points, which is not representable on a computer without discretization and approximation. Furthermore, the dynamics of the system are not purely dictated by the behavior of the deformable object—they also include the behavior of the low-level controller of the robot arm. We overcome both challenges by representing the system state as the hidden state of a learned RNN. In comparison to other methods of system identification from input/output data, such as linear methods (Brunton and Kutz, 2019), RNNs are a highly expressive class of nonlinear state-space models. Recall from §2.11.4 that the RNN is a generic function approximation scheme parameterized by a real-valued vector θ, and consists of a discrete-time dynamics model

\[
x[k+1] = f_\theta(x[k], u[k]),
\qquad
y[k] = h_\theta(x[k]),
\tag{4.3}
\]

where x ∈ R^n is the internal state, u is the input, y is the output, and f_θ and h_θ are the dynamics and measurement functions respectively. Both f_θ and h_θ are differentiable with respect to their arguments and the parameter θ. The particular form of the functions f_θ and h_θ must be carefully chosen to maximize expressiveness while preserving desirable properties for optimization, but from the perspective of estimation and control, their exact form is unimportant.

The RNN is trained on the dataset D to minimize a regression loss on the input-output map over complete sequences:

\[
\begin{aligned}
\underset{\theta}{\text{minimize}} \quad & \sum_{j=1}^{N} \sum_{k=1}^{K_j} \big\| \bar{y}[j,k] - h_\theta(x[j,k]) \big\|_2^2 \\
\text{subject to} \quad & x[j,0] = 0, \\
& x[j,k+1] = f_\theta(x[j,k], \bar{u}[j,k]),
\end{aligned}
\tag{4.4}
\]

where K_j is the length of the j-th trajectory in D. The fixed initial state of 0 is justified in our setting because each trajectory in D begins at the rest state x̄₀. We use input and output projections, as discussed in §2.11.4.1, with a tanh nonlinearity after (resp. before) the input (resp. output) projection. We find an approximate local optimum of (4.4) using stochastic gradient descent.

4.3.3 Model-predictive control with reduced-order model

After finding a value of the RNN parameter θ* that approximately optimizes the learning objective (4.4), we use the RNN model in a model-predictive control (MPC) framework to optimize the trajectory-tracking objective (4.2) in an online manner. Assume temporarily (we will return to this assumption in §4.3.4) that at time step k we know a value x̂[k] of the abstract RNN state that is consistent with the input and output history from previous time steps in the real-world system. We solve the short-horizon optimal control problem

\[
\begin{aligned}
\underset{\bar{u}[k],\,\ldots,\,\bar{u}[k+H-1]}{\text{minimize}} \quad
& \sum_{i=1}^{H} \big\| z[k+i] - h_{\theta^*}(x[k+i]) \big\|_W^2
\;+\; \alpha \sum_{i=0}^{H-1} d\big(\bar{u}[k+i-1], \bar{u}[k+i]\big)
\;+\; \beta \sum_{i=1}^{H} d\big(\bar{u}[k+i], I\big) \\
\text{subject to} \quad
& x[k] = \hat{x}[k], \\
& x[k+i+1] = f_{\theta^*}\big(x[k+i], \bar{u}[k+i]\big) \quad \forall i.
\end{aligned}
\tag{4.5}
\]

In the objective (4.5), H ≪ K is the MPC horizon. In the second and third terms, d : U × U ↦ R_{≥0} is a semimetric (§2.2.1) on the input space U. The first regularization term, weighted by the constant α ≥ 0, encourages smoothness of the control signal. The second regularization term, weighted by the constant β ≥ 0, encourages the robot to stay near its rest state.

Solving (4.5) yields a sequence of inputs ū*[k], ..., ū*[k+H−1] optimized to track the next H steps of the full goal trajectory. Following the standard moving horizon architecture (§2.8.6.1), we apply only the first input ū*[k] from the solution to the real system. Then, at time step k+1, we solve a new instance of (4.5). The optimization problem (4.5) is nonconvex, but the solution from the previous time step provides a high-quality initial guess, so local optimization methods typically perform well as long as the initial guess is not close to a bad local optimum (Allgöwer and Zheng, 2012). We obtain an approximate solution with a few steps of gradient descent with momentum. The momentum state from previous MPC steps is persisted and time-shifted in the same manner as the initial guess.
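A compact PyTorch sketch of one receding-horizon solve of (4.5) follows. Here `rnn_step` and `rnn_output` stand in for f_{θ*} and h_{θ*}, the semimetric d is taken to be squared Euclidean distance for illustration, and (unlike the full method) the momentum buffer is not persisted across MPC calls; those are all simplifying assumptions.

```python
import torch

def mpc_step(x_hat, z_goal, u_init, u_last, rnn_step, rnn_output,
             W, alpha=1.0, beta=0.1, lr=0.4, iters=5):
    """One gradient-descent-with-momentum solve of the MPC problem (4.5).

    x_hat:   estimated RNN state at time k (from the state observer)
    z_goal:  goal positions z[k+1..k+H], shape (H, 3)
    u_init:  warm-start input sequence, shape (H, input_dim)
    u_last:  the input applied at time k-1 (first smoothness term)
    """
    u = u_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([u], lr=lr, momentum=0.9)
    for _ in range(iters):
        opt.zero_grad()
        x, cost = x_hat, 0.0
        for i in range(u.shape[0]):
            x = rnn_step(x, u[i])                                   # roll the model forward
            err = z_goal[i] - rnn_output(x)
            cost = cost + err @ W @ err                             # tracking term
            u_prev = u[i - 1] if i > 0 else u_last
            cost = cost + alpha * torch.sum((u[i] - u_prev) ** 2)   # smoothness term
            cost = cost + beta * torch.sum(u[i] ** 2)               # stay near the identity input
        cost.backward()
        opt.step()
    u_star = u.detach()
    return u_star[0], u_star   # apply the first input; reuse the rest as the next warm start
```

The default values of `alpha`, `beta`, `lr`, and `iters` mirror the constants reported in Table 4.1.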
4.3.4 Estimating the RNN state

In §4.3.3, we assumed the availability of an RNN state x̂[k] that is consistent with previous inputs and outputs from the real-world system. To obtain x̂[k], it is not sufficient to simply evaluate f_{θ*} recursively on ū[1], ..., ū[k−1]. Because the RNN model is not perfect, the true outputs ȳ[1], ..., ȳ[k−1] obtained from the real-world system may diverge from the outputs h_{θ*}(x[1]), ..., h_{θ*}(x[k−1]) predicted by applying the RNN model in this open-loop manner.

Instead we observe that, although the RNN state has no direct physical meaning, the RNN model is still a nonlinear discrete-time dynamical system with known dynamics. Therefore, its state can be estimated from the input-output history using standard techniques from estimation theory.* In particular, we apply an extended Kalman filter (EKF) to the RNN model. The EKF is defined by linearizing the system about the current state and applying the standard linear Kalman filter covariance propagation and update steps, as discussed in §2.9.1.10. We now review the well-known EKF equations to highlight the role of the RNN.

(* In general, nonlinear state estimation techniques provide no guarantees of optimality or other notions of correctness.)

The EKF maintains a Gaussian-distributed belief over the RNN state. At time k, the belief is distributed according to x[k] ∼ N(µ[k], Σ[k]), where N(µ, Σ) is the Gaussian distribution with mean µ and covariance Σ. After sending an input ū[k] to the system, we then update the belief according to

\[
\mu[k \mid k-1] = f_{\theta^*}\big(\mu[k-1], \bar{u}[k]\big),
\qquad
\Sigma[k \mid k-1] = F[k]\, \Sigma[k-1]\, F[k]^\top + Q[k],
\tag{4.6}
\]

where F[k] = ∂f_{θ*}/∂x (µ[k−1], ū[k]) is the Jacobian of the RNN dynamics f_{θ*} with respect to the state. The process noise covariance Q[k] ⪰ 0 represents noise in the RNN abstract state, so it cannot be derived from a physical model or estimated from data. We simply set Q to a scaled identity matrix in this work.

Next, the measurement ȳ[k] is captured from the real system. We compute the measurement residual γ[k] = ȳ[k] − h_{θ*}(µ[k|k−1]). According to the current belief, γ[k] has covariance

\[
S[k] = H[k]\, \Sigma[k \mid k-1]\, H[k]^\top + R[k],
\]

where H[k] = ∂h_{θ*}/∂x (µ[k|k−1]) is the Jacobian of the RNN observation function h_{θ*} with respect to the state. The sensor noise covariance R[k] ≻ 0 represents a physically meaningful quantity that can be derived from a model or estimated from data. In this work we use a motion capture system with isotropic noise, so we set R to a constant scaled identity matrix. We then compute the Kalman gain

\[
K[k] = \Sigma[k \mid k-1]\, H[k]^\top S[k]^{-1}
\]

and update the belief according to

\[
\mu[k \mid k] = \mu[k \mid k-1] + K[k]\, \gamma[k],
\qquad
\Sigma[k \mid k] = \big(I - K[k] H[k]\big)\, \Sigma[k \mid k-1].
\tag{4.7}
\]

We then use the mean of the belief distribution as the initial state for the MPC problem (4.5), that is, x̂[k] = µ[k].
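A sketch of one EKF predict/update cycle (4.6)–(4.7) on the learned model is given below, computing the Jacobians F[k] and H[k] by automatic differentiation. Here `f` and `h` stand for f_{θ*} and h_{θ*}, and the use of `torch.autograd.functional.jacobian` is an assumed implementation choice rather than a description of the original code.

```python
import torch
from torch.autograd.functional import jacobian

def ekf_step(mu, Sigma, u, y, f, h, Q, R):
    """One EKF predict/update cycle (eqs. 4.6-4.7) on the RNN model.

    mu, Sigma: belief mean and covariance over the RNN state
    u, y:      applied input and captured measurement at this step
    f(x, u), h(x): stand-ins for f_theta* and h_theta* (built from torch ops)
    """
    # Predict (4.6): linearize the dynamics about the current mean.
    F = jacobian(lambda x: f(x, u), mu)
    mu_pred = f(mu, u)
    Sigma_pred = F @ Sigma @ F.T + Q

    # Update (4.7): linearize the measurement function about the prediction.
    H = jacobian(h, mu_pred)
    resid = y - h(mu_pred)                        # measurement residual gamma[k]
    S = H @ Sigma_pred @ H.T + R
    K = Sigma_pred @ H.T @ torch.linalg.inv(S)    # Kalman gain
    mu_new = mu_pred + K @ resid
    I = torch.eye(mu.shape[0], dtype=mu.dtype)
    Sigma_new = (I - K @ H) @ Sigma_pred
    return mu_new, Sigma_new
```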
4.3.5 Implementation

For the RNN reduced-order dynamics model f_θ, we select the long short-term memory (LSTM) architecture (§2.11.4.1). Numerical values of the architectural and training hyperparameters are listed in Table 4.1. We train the LSTM in PyTorch (Paszke et al., 2019). We also solve the model-predictive control problems and implement the EKF in PyTorch due to the ease of differentiating through the RNN model. We run the MPC control loop at 40 Hz.

Table 4.1: Values of user-chosen constants in our experiments.

  Q    EKF process covariance        10^-6 I
  R    EKF measurement covariance    10^-2 I
  H    MPC horizon                   25
  α    MPC smoothness weight         1.0
  β    MPC homing weight             1e-1
  —    MPC gradient descent rate     4e-1
  —    MPC gradient descent steps    5
  —    LSTM layers                   1
  n    Reduced state dimension       200
  —    LSTM SGD steps                1e4
  —    LSTM SGD learning rate        1e-3
  —    LSTM SGD batch size           10
  N    # trajectories in dataset     100

Figure 4.3: Schematic diagram of the pool noodle experimental setup showing the end effector pose ū and tracked surface point ȳ. Details are given in §4.4.

4.4 Experiments

In all experiments, our deformable body is a thin cylinder of uniform closed-cell polyethylene foam, commonly known as a pool noodle, with length 1.5 m and diameter 6.5 cm. To attach the object to the robot, we press-fit approximately 4 cm of one end of the pool noodle into a rigid 3D-printed ring attached to the robot end effector. To track the distal end of the object, we attach a rigid assembly of motion capture markers and track its full pose with a Vicon motion capture system. The measurement ȳ has a calibrated offset from the marker assembly to lie on the center line of the pool noodle. Our experimental setup is shown in Figure 4.1.

The robot is a Franka Emika Panda. To track the pose commands ū ∈ SE(3) output by the MPC controller, we apply a proportional-only control law in both the position and the attitude quaternion to generate desired linear and angular velocities v for the end effector. We compute the kinematic Jacobian from libfranka and produce desired joint velocities using the Jacobian pseudoinverse with null space optimization towards the "home" position (Siciliano et al., 2009).

4.4.1 Model frequency response

To validate our RNN model, we compare its empirical frequency response near the internal state x = 0 to that of the true physical system at its rest state x̄₀. Frequency response is a holistic property of the model that depends on its behavior in multiple parts of the state space. To avoid the complications of frequency-domain analysis for multiple-input/multiple-output systems, we actuate the system only in the yaw axis ψ and measure only the deflection of the pool noodle tip in the horizontal axis. Yaw inputs, as opposed to pitch inputs, avoid the asymmetric effect of gravity.

We sample 33 geometrically-spaced frequencies ranging from 0.125 Hz to 2 Hz. For each frequency, we apply a yaw input with a small amplitude (11.5° peak-to-peak) for 20 sec and record the trajectory of the pool noodle tip ȳ. The small input amplitude allows higher-frequency inputs before reaching the robot's actuator limits. We discard approximately the first half of the recording (on the nearest whole cycle boundary) to eliminate transient effects. We then compute the empirical gain and phase by taking the inner product of the output signal with complex exponential functions, analogous to the discrete Fourier transform. Because the input and output have different units, only the relative gain magnitudes between different frequencies are meaningful. We normalize the gains such that the gain of the real system for the slowest input frequency equals 1.
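A short NumPy sketch of this single-frequency gain/phase estimate follows; correlating the steady-state output with a complex exponential at the driving frequency is the standard construction, while the signal names, sample period, and normalization helper are assumptions for illustration.

```python
import numpy as np

def gain_and_phase(y, freq_hz, dt):
    """Estimate the response of a steady-state signal y (1-D array) at one frequency.

    Correlates y with a complex exponential at the driving frequency,
    analogous to evaluating a single DFT bin; dt is the sample period.
    """
    t = np.arange(len(y)) * dt
    coeff = np.mean(y * np.exp(-2j * np.pi * freq_hz * t))   # complex Fourier coefficient
    return 2.0 * np.abs(coeff), np.angle(coeff)               # amplitude and phase

def bode_points(signals, freqs_hz, dt):
    """Relative gains across frequencies, normalized to the slowest frequency."""
    responses = [gain_and_phase(y, f, dt) for y, f in zip(signals, freqs_hz)]
    gains = np.array([g for g, _ in responses])
    phases = np.array([p for _, p in responses])
    return gains / gains[0], phases
```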
Bode (gain and phase) plots for each system are overlaid in Figure 4.4. In the true physical system, we observe typical behavior of a resonant lowpass filter with a strong resonant peak around 0.8 Hz. The peak gain is approximately 4 times larger than the low-frequency gain. This represents a dramatic phenomenon that a faithful model must capture. In the RNN model, the gain response closely matches the real system. The phase response matches closely below the resonant frequency, but begins to lag behind the true system for high frequencies. We suspect this may be due to an unbalanced training data distribution. Overall, this experimental result suggests that the RNN model has successfully captured the frequency response of the physical system.

Figure 4.4: Frequency-domain gain and phase response (Bode plots) for the real pool noodle and the LSTM model.

Figure 4.5: Two-dimensional projections of paths traced by the pool noodle free end in MPC tracking experiments. See §4.4.2 for details and discussion.

Figure 4.6: Traces of rotation angle inputs (pitch, yaw) and horizontal and vertical components of the pool noodle free end (y, z) for MPC tracking of the circle trajectory (see Figure 4.5). The closed-loop variant of our method shows superior tracking performance.

4.4.2 MPC tracking

We apply our method to track several test trajectories which attempt to expose the controller's performance with regard to resonant dynamics. The goal trajectories z[1], ..., z[K] ∈ R³ are specified by the user. As described in eq. (4.2), our tracking cost is non-isotropic. We use the value W = diag(0, 1, 1) to focus only on tracking in the Y- and Z-axes, which represents the view from the front of the robot. Without this non-isotropic weight, the goal trajectory would need to be a carefully designed curve in R³ to be trackable with zero error.

We show the results from three trajectories in Figure 4.5. The first two trajectories are a circle of diameter 0.6 m and a figure-8 Lissajous curve of width 1 m and height 0.4 m. Both trajectories are sinusoidal in each axis, which matches the data collected during our training step. The circle and the vertical axis of the Lissajous curve are set to the system's resonant frequency 0.8 Hz as determined experimentally. The third trajectory is a rectangle with constant linear velocity along each edge (no stopping at corners). These trajectories are close to the robot's dynamic limits as discussed in §4.3.1. Improvements in our low-level controller are required to test more demanding trajectories that push the system further into nonlinearity.

To show that the EKF observer provides meaningful feedback to the controller (the "closed-loop" setup), we compare it to a setup that assumes the predicted feedforward state, i.e. the value yielded by applying the RNN dynamics f_{θ*} to the full sequence of past inputs, is always correct (the "open-loop" setup). The results of this comparison are visualized in Figures 4.5 and 4.6, and the tracking errors are compared numerically in Table 4.2.

Table 4.2: MPC tracking errors

  shape      kind         max error (cm)   mean error (cm)
  circle     closedloop   12.59            5.59
             openloop     29.15            11.61
  lissajous  closedloop   9.19             4.43
             openloop     14.85            4.77
  rectangle  closedloop   14.60            6.34
             openloop     14.62            6.01

For the circle trajectory, we observe significantly improved performance in both mean and maximum error from the closed-loop setup. In the traces over time, shown in Figure 4.6, the open-loop solution drifts towards stronger resonance in the vertical axis and weaker resonance in the horizontal axis. In
In 99 contrast, the error of the closed-loop solution does not grow over time, indicating that our EKF setup is able to compensate for model error. For the Lissajous trajectory, the closed-loop setup yields a minor improvement in mean tracking error but a significant improvement in maximum error. The rectangle trajectory shows little difference between the open- and closed-loop approaches, but it is noteworthy that both approaches track the rectangle without gross errors, even though it is physically infeasible and not similar to the training data. This result suggests that the LSTM model behaves reasonably for at least one sequence input that is dissimilar to those in the training data. 4.5 Conclusion We have described and demonstrated a system for optimal control in settings with unknown complex dynamics but a known reward function. Our approach is completely data-driven, and requires a fixed initial data-collection phase without further exploratory actions. We model our dynamical system as an LSTM recurrent neural network and design a nonlinear MPC controller. We use an EKF state observer to account for model error by estimating a value of the LSTM hidden state consistent with past inputs and outputs. We apply our method to the task of manipulating a deformable object such that a particular point on the object tracks a fast trajectory. We validate our model on a real hardware setup with a robot manipulator holding a foam pool noodle, measured by a motion capture system. Our experiments show that closing the loop with the EKF observer improves tracking performance compared to an open loop control scheme for several of the test trajectories. In future work, we aim to improve tracking accuracy by investigating other nonlinear state estimation methods and MPC implementations. The EKF is one of several ways to account for modeling error. Other nonlinear state estimators such as particle filters and moving window least-squares estimators could also be adapted to estimate RNN internal states. Alternatively, a pure deep-learning approach might add past 100 observations as inputs to the RNN. For MPC, a more sophisticated constrained optimization scheme could be applied to the nonlinear MPC problem (4.5) to help enforce smoothness and actuator limits. We are also interested in applying our work to non-position-based control tasks, such as force control for using deformable objects as tools. We also note that our method resembles the inner loop of a model-based reinforcement learning al- gorithm. An extended method could use the data gathered at test time to further update the model. This could relax the demand on good state space coverage in the initial training data. 101 Chapter5 VarianceofPolicyGradientforLQRproblems 5.1 Introduction We concluded Chapter 3 by discussing some possible reasons why reinforcement learning with neural network “universal policies” might lead to a suboptimal multi-system policy. One possibility is the inter- action between the RL algorithm and the problem. The RL algorithm was able to learn good policies for a singleϕ ∈Φ , but when deploying the same algorithm on the multi-system problem, it was unable to learn a multi-system policy that matched the single-system policies in aggregate. One possible reason is that some property of the multi-system MDP hindered the performance of the algorithm. More broadly, it leads us to ask the question: How do properties of the MDP dynamics impact the performance of RL algorithms? 
For RL in finite MDPs, performance bounds are most commonly given with respect to the number of states, actions, and rewards, and the horizon length (if finite) or discount factor (if infinite-horizon) (Agarwal et al., 2022). In the more challenging setting of undiscounted infinite-horizon MDPs with the average reward criterion, stronger assumptions on the MDP such as connectedness (resp. ergodicity) are required, and regret bounds are given in terms of related quantities such as diameter and/or span (resp. mixing and/or hitting time). Wei et al. (2020, Table 1) review some results of this type alongside their own. The work presented in this chapter was originally published in Preiss et al. (2019). The related work section has been expanded to include more recent research in this area. 102 Our results in Chapter 3 are difficult to investigate from a theoretical perspective due to the combi- nation of neural network function approximation, complex deep RL algorithms with many “tricks”, and nonlinear dynamics with contact (in the MuJoCo experiments). Adding the multi-system structure in- creases the complexity further. As we will discuss in §5.2, the theoretical understanding of reinforcement learning in continuous spaces has advanced considerably in the past several years. However, at the time this project was initiated, there were few theoretical guarantees for RL beyond finite MDPs. Therefore, we narrowed our focus to a very simple continuous MDP and a simple RL algorithm. Since the theoreti- cal picture for single-system MDPs is still far from complete, we leave the development of RL theory for multi-system problems to future work. The experiment in § 3.6 used PPO, a member of the policy gradient family of RL algorithms (§ 2.7.2). Policy gradient methods construct an unbiased estimate of the gradient of the RL objective with respect to the policy parameters, and perform stochastic gradient ascent/descent with this estimate. Policy gradient methods are widely used for RL in continuous spaces because, unlike Q-learning, they do not require evaluating an “argmax” over a continuousU, which is computationally intractable in general. However, the gradient estimate is known to suffer from high variance (Greensmith et al., 2004). The earliest and simplest policy gradient algorithm is REINFORCE (Williams, 1992). More recent algorithms such as TRPO and PPO (Schulman et al., 2015, 2017) extend the basic idea of REINFORCE with techniques to inhibit the possibility of making very large changes in the policy action distribution in a single step. In a benchmark test (Duan et al., 2016a), these algorithms generally learned better policies than REINFORCE, but their additional complexity makes them hard to analyze. In this chapter, we seek a more detailed understanding of how the policy gradient estimate variance relates to properties of the continuous-space Markov decision process (MDP) that defines the RL problem instance. As an analytically tractable class of continuous MDPs, we select the linear-quadratic regulator (LQR) problem, introduced in §2.9.1.6. Our primary contributions are derivations of bounds on the variance 103 of the REINFORCE gradient estimate as an explicit function of the dynamics, reward, and noise parameters of the LQR problem instance. We validate our bounds with comparisons to the empirical gradient variance in random problems. 
We also explore the relationship between gradient variance and sample complexity, but find it to be less straightforward, as the problem parameters that affect variance also affect the op- timization landscape. We emphasize that our goal is not to draw a conclusion about the utility of using REINFORCE to solve LQR problems, but rather to use LQR as an example system that is simple enough to allow us to “look inside” the REINFORCE policy gradient estimator. 5.2 Relatedwork PolicygradientmethodsforLQR LQR systems are a popular case study for analyzing policy gradient algorithms in continuous spaces. They have also been studied in the context of model-based and value- based RL algorithms, but we restrict our attention to policy gradient methods here. Fazel et al. (2018) showed that, even though the optimization landscape of LQR is neither convex nor smooth, gradient de- scent using a zeroth-order optimization approximation of the policy gradient enjoys global convergence and sample complexity bounds. The analysis centers on the gradient domination (also known as Polyak- Łojasiewicz) condition. Malik et al. (2018) and Bu et al. (2019a) strengthened this type of result and gen- eralized to other LQR variants. Bhandari and Russo (2019) generalized the gradient domination technique to other highly structured problems outside the LQR setting, including finite MDPs, an optimal stopping problem, and an inventory control problem. Alternatively, Mohammadi et al. (2019) obtained similar results for the continuous-time case by relating the gradient flow in a classic convex reparameterization of the LQR problem to the nonconvex policy gradient. Sun and Fazel (2021) generalized the reparameterization- based results to a broader family of LQR problems. Cassel and Koren (2021) obtained a nearly optimal regret bound for policy gradient methods with respect to the time horizon, matching a minimax lower bound of Simchowitz and Foster (2020) up to logarithmic factors. In contrast, Tu and Recht (2019) showed 104 that REINFORCE is strictly less efficient than model-based methods with respect to the state and action dimensionalities in the cheap-control setting. Most of the aforementioned works study exact policy gradients or policy gradient approximations obtained by parameter-space noise or perturbations, similar to generic derivative-free optimization meth- ods. This is in contrast to the action-space noise used by REINFORCE, as described in §2.7.2. The work of Tu and Recht (2019) is an exception, but their variance analysis only applies to the restricted class of cheap- control LQR problems used in their lower bound. To our knowledge, the work in this chapter represents the only generic variance upper bound for action-space policy gradient methods in the LQR setting. Nonlinear system classes Outside the LQR setting, researchers have also obtained theoretical guar- antees in settings where the state space, and possibly the action space, are continuous. The linear MDP, in which transition probabilities and rewards are linear in a feature space, is a popular case for anal- ysis (Jin et al., 2020). Several variations of the linear MDP model exist. Linear MDPs generalize finite MDPs. However, this class cannot express the LQR problem (Song and Sun, 2021), which suggests that it may be of limited relevance for robotics. Mania et al. (2022) gave a finite-sample complexity guarantee for the system identification problem in the expressive kernelized nonlinear regulator (KNR) system class. Kakade et al. 
(2020) proposed a model-based RL algorithm with a regret bound depending on information- theoretic quantities for KNRs, but their algorithm is computationally intractable. Song and Sun (2021) propose a tractable model-based method for KNRs. Boffi et al. (2021) prove a regret bound for a different class of nonlinear systems with a stronger stability assumption. Agarwal et al. (2019) give global conver- gence results for policy gradient methods in tabular MDPs, and in highly generic settings with smooth function approximation policies they give guarantees in terms of approximation error. Their results high- light the importance of exploration. Agarwal et al. (2020) build upon this framework and address the exploration issue using an ensemble of policies and exploration reward bonuses. Feng et al. (2021) extend this class of methods to handle more nonlinearity. Recently, Jin et al. (2021) and Du et al. (2021) proposed 105 new expressive classes of tractable MDPs that generalize both linear MDPs and LQRs. These provide a promising setting for future work on provably efficient algorithms for problems that combine features of finite MDPs (like difficult exploration) with features of physical systems (like stabilizing at an equilibrium). Reinforcementlearningtheorywithdeepneuralnetworks Until recently, research into the theo- retical properties of deep neural networks focused on static optimization problems like supervised learning. He and Tao (2020) provide a survey on such results. More recently, researchers have begun to study deep neural networks in reinforcement learning context. Fan et al. (2020) analyzed sample complexity for the Deep Q-Network algorithm (DQN, Mnih et al., 2013) under the assumption of an fixed i.i.d. sampling distri- bution, which they argue is a reasonable approximation of sampling from a large replay buffer generated byϵ -greedy exploration. Xu and Gu (2020) obtain the same rate for DQN but remove the i.i.d. assumption, learning only from the most recent interaction with the MDP. However, the actions are sampled from a fixed policy, as opposed to an ϵ -greedy policy with respect to the current estimate ofQ ⋆ . Therefore, both of these analyses sidestep the issue of exploration. Cai et al. (2019) analyze the closely related task of policy evaluation, i.e. finding a fixed point of the policy Bellman operator instead of the optimality Bellman op- erator, with neural networks. Yang et al. (2020) study the least-squares value iteration algorithm for both kernels and overparameterized neural networks using optimism in the face of uncertainty for exploration. Moving from value-based methods to policy-based methods, Wang et al. (2020) proved convergence rates for actor-critic methods with vanilla and natural policy gradients. Their analysis showed that the overparameterized neural networks can approximately satisfy thecompatiblefunctionapproximation con- dition, which leads to unbiased policy gradient estimates. Liu et al. (2019) obtain similar results specifically for the widely-used PPO and TRPO algorithms, but their analysis requires fully solving optimization prob- lems for each update instead of taking a single gradient step. Agazzi and Lu (2021) work in the mean-field regime with continuous-time dynamics to express the policy gradient dynamics as a partial differential equation, and show that all fixed points are global optima. 106 5.3 Problemsetting In this chapter,∥·∥ denotes the 2,2 operator norm of a matrix or the 2-norm of a vector. 
We study the REINFORCE policy gradient estimator, as defined in § 2.7.2. We consider a variant of the discrete-time LQR problem introduced in §2.9.1.6 that adds stochasticity to the dynamics but retains noise-free full state observability. Recall that the state space isX = R n and action spaceU = R m . The dynamics are linear dynamics with additive Gaussian noise: x t+1 =Ax t +Bu t +ϵ x t , ϵ x t ∼N (0,Σ x ), (5.1) for dynamics matricesA∈R n× n , B ∈R n× m , and noise covarianceΣ x ⪰ 0. The initial statex 1 follows an arbitrary zero-mean Gaussian distribution. We consider the finite horizon objective H X t=1 r t ≜ H X t=1 =− (x T t Qx t +u T t Ru t ) (5.2) for cost matrices Q ⪰ 0, R ≻ 0. (Note that the standard LQR cost has been negated to fit the reward- maximization formulation of RL.) As discussed in §2.9.1.6, for the deterministic version of the problem, the infinite-horizon LQR cost is minimized by a stationary linear policy u t = K ⋆ x t , K ⋆ ∈ R m× n , where the value of K ⋆ depends on the solution of an discrete-time algebraic Riccati equation with coefficients derived from (A,B,Q,R). To apply REINFORCE, the policy must be stochastic, so we consider linear stochastic policies u t =Kx t +ϵ u t , ϵ u t ∼N (0,Σ u ), (5.3) for K ∈ R m× n ,Σ u ∈ S m ++ . The state noise Σ x is an immutable property of the system, but the action noise Σ u is not. Instead, it is usually chosen by the user of the RL algorithm, or learned as a parameter 107 subject to optimization by the RL algorithm. Genuine noise in the actuators enters the system linearly through the matrixB, so it can be subsumed intoΣ x . In the RL literature, action noise is usually seen as either 1) a tool for exploring of the state space, 2) a method of regularization to avoid converging on bad local optima, or 3) a consequence of a probabilistic in- terpretation of the RL problem (Levine, 2018). Its effect on the RL optimization algorithm is less frequently discussed, but in this work we find that it can be significant. 5.4 Mainresult: VarianceboundsontheREINFORCEestimator In this section, we present bounds on the variance of the REINFORCE estimator for LQR systems. Back- ground on REINFORCE is given in § 2.7.2. The instantiation of REINFORCE for the system defined by eqs. (5.1) to (5.3) using a single trajectory is: ˆ g = H X t=1 Σ − 1 u ϵ u t x ⊤ t ! H X t=1 r t ! ∈R m× n . (5.4) The estimate ˆ g is a function of the independent random variables{ϵ u t ,ϵ x t } H t=1 . Although x t is a linear function of the previous noise variables{ϵ u τ ,ϵ x τ } t− 1 τ =1 , the rewardr t is quadratic inx t , so the overall form of ˆ g is a product of a sum and a sum of products of sums. Therefore, while it is possible to apply matrix concentration inequalities (Tropp, 2015) to bound∥x t − E[x t ]∥ with high probability, it is more difficult to bound the dispersion ofˆ g. Instead, we use a more specialized method to derive a bound on ν (ˆ g)≜ P m i=1 P n j=1 Var(ˆ g i,j )=E tr ˆ g ⊤ ˆ g − tr E[ˆ g] ⊤ E[ˆ g] , which we simplify by boundingE tr ˆ g ⊤ ˆ g . 108 Theorem5.4.1. Ifρ (A+BK)<1, thenE tr ˆ g ⊤ ˆ g ≤ O ¯n 4 C 2 1 C 2 2 , where C 1 =µ 2 Σ − 1 2 u (∥x 1 ∥+σH )H ′ , C 2 =∥R∥∥Σ u ∥H +µ 2 ∥Q∥+∥R∥∥K∥ 2 ∥x 1 ∥ 2 +σ 2 H H ′2 , ¯n ≜ max{n,m}, σ ≜ Σ 1 2 x + BΣ 1 2 u , H ′ ≜ min n H, 1 1− ρ (A+BK) o , and µ is a constant bounding the transient behavior of∥A+BK∥ t , with more details provided in §5.6. Proof Sketch. 1. Rewriteϵ u t ,ϵ x t asΣ 1 2 u δ u t ,Σ 1 2 x δ x t , where theδ u t ,δ x t are unit-Gaussian random variables. 2. 
Boundtr ˆ g ⊤ ˆ g byP , a polynomial function of theχ -distributed random variables{∥δ u t ∥,∥δ x t ∥} H t=1 with nonnegative coefficients. 3. Bound the sum of the coefficients of P by substituting1 for allχ random variables. 4. Bound for the expectation of each monomial inP using the moments of theχ distribution. A detailed proof of Theorem 5.4.1 is given in §5.6. For a special case of scalar states and actions, we also show a lower bound onE[ˆ g 2 ]. SinceE[ˆ g] = 0 at a local optimum, this lower bound corresponds to the variance caused strictly by noise in the system when the policy is already optimal. Here, the matricesA,B,K,Q,R are denoted asa,b,k,q,r, andσ x , σ u denote the standard deviation (not variance) of state and action noise. This notationr is different from the notationr t for reward. Theorem5.4.2. Ifm=n=1 and0≤ a+bk <1, thenE[ˆ g 2 ]≥ Ω( c 2 1 c 2 2 ), with c 1 = 1 σ u |x 1 |+σ √ H √ h ′ , c 2 =rσ 2 u H +(q+rk 2 )(x 2 1 +σ 2 H)h ′ , 109 whereσ ≜σ x +bσ u andh ′ ≜min n H, 1 1− (a+bk) 2 o . A detailed proof of Theorem 5.4.2 is given in §5.7. If we reduce the upper bound of Theorem 5.4.1 to its scalar case, all terms match with the notable exception of the horizon-related terms H and H ′ (com- pare toh ′ ), which appear squared in several places in Theorem 5.4.1 compared to the equivalent term in Theorem 5.4.2. There is another gap in the denominators ofH ′ andh ′ since 1 1− x 2 < 1 1− x on the domain x∈ (0,1). We conjecture our upper bound can be tightened by fully exploiting the independence of the noise variablesϵ x , ϵ u . 5.5 Experiments In the first set of experiments, we compare our upper bound of ν (ˆ g) to its empirical value when exe- cuting REINFORCE in randomly generated LQR problems. Our results show qualitative similarity in the parameters for which our upper and lower bounds match. On the other hand, the gap with respect to stability-related parameters is also visible. For each experiment shown here, we repeated the experiment with different random seeds and observed qualitatively identical results. We generate random LQR problems with the following procedure. We sample each entry in A and B i.i.d. fromN(0,σ =n − 1/2 ) andN(0,σ =m − 1/2 ) respectively. To construct a random k× k positive definite matrix, we sample from the Wishart(k − 1 I,k) distribution by computingQ = X T X forX i.i.d. analogous toA. The scale factork − 1 ensures that if the vectorx is distributed byN(0,k − 1 I), such that E[∥x∥ 2 ]=1, thenE[x T Qx]=1. We sampleQ,R,Σ x , andΣ u this way. In each experiment, we vary some of these parameters systematically while holding the others con- stant, allowing us to visualize the impact of each parameter onν (ˆ g). We plot the upper bound of Theo- rem 5.4.1 on the top row of Figure 5.1, and the empirical estimate ofν (ˆ g) on the bottom row. In both cases, since the variance depends on the initial statex 1 , we sampleN =100 initial statesx 1 fromN(0,n − 1 I) and 110 plotE x 1 ∼N (0,n − 1 I) ν (ˆ g). We estimateν (ˆ g)| x 1 =x for a particular initial statex by sampling30 trajectories with randomϵ u , ϵ x . 
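To make this sampling and estimation procedure concrete, the following NumPy sketch (ours, not the original implementation; all function names are illustrative) generates a random LQR problem as described above and estimates ν(ĝ) for a given initial state by Monte Carlo, using the single-trajectory REINFORCE estimate (5.4).

    import numpy as np

    def random_lqr(n=5, m=3, rng=None):
        # Random problem: A, B entries i.i.d. N(0, n^-1/2) and N(0, m^-1/2);
        # Q, R, Sigma_x, Sigma_u sampled via the Wishart(k^-1 I, k) construction
        # described above (Q = X^T X with X sampled i.i.d. like A).
        rng = np.random.default_rng() if rng is None else rng
        A = rng.normal(scale=n ** -0.5, size=(n, n))
        B = rng.normal(scale=m ** -0.5, size=(n, m))
        def wishart(k):
            X = rng.normal(scale=k ** -0.5, size=(k, k))
            return X.T @ X
        return A, B, wishart(n), wishart(m), wishart(n), wishart(m)

    def reinforce_grad(A, B, Q, R, Sx, Su, K, x1, H, rng):
        # Single-trajectory REINFORCE estimate (5.4) for the system (5.1)-(5.3).
        n, m = B.shape
        Su_inv = np.linalg.inv(Su)
        x, score, ret = x1.copy(), np.zeros((m, n)), 0.0
        for _ in range(H):
            eps_u = rng.multivariate_normal(np.zeros(m), Su)
            u = K @ x + eps_u
            ret -= x @ Q @ x + u @ R @ u            # r_t = -(x'Qx + u'Ru)
            score += np.outer(Su_inv @ eps_u, x)    # Sigma_u^{-1} eps_t^u x_t^T
            x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), Sx)
        return score * ret

    def empirical_variance(A, B, Q, R, Sx, Su, K, x1, H, trials=30, rng=None):
        # nu(g_hat) at x_1 = x: sum of elementwise variances over i.i.d. rollouts.
        rng = np.random.default_rng() if rng is None else rng
        grads = np.stack([reinforce_grad(A, B, Q, R, Sx, Su, K, x1, H, rng)
                          for _ in range(trials)])
        return grads.var(axis=0).sum()

The infinite-horizon optimal controller K* used in the experiments can be obtained from the discrete-time algebraic Riccati equation, e.g. P = scipy.linalg.solve_discrete_are(A, B, Q, R) and K* = -(R + B^T P B)^{-1} B^T P A.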
Figure 5.1: Comparison between our upper bound from Theorem 5.4.1 (top) and the empirically measured variance (bottom) as they relate to various parameters of the LQR problem: (a) relationship between Σ_x and Σ_u, (b) control authority ∥B∥, (c) stability ρ(A+BK). Behavior is qualitatively similar for action noise covariance (a) and control authority (b), but less similar for the stability (c), where our bounds are loose. Further discussion is in §5.5.

Effect of Σ_u  In this experiment, we generate a random LQR problem and replace Σ_u with σ_u I for σ_u geometrically spaced in the range [10^-2, 10^2]. Using a scaled identity matrix is common practice when applying RL to a problem where there is no a priori reason to correlate the noise between different action dimensions. We evaluate the variance at the value K = K*, where K* is the infinite-horizon optimal controller computed using traditional LQR synthesis, as described in §2.9.1.6. Using K* ensures that ρ(A+BK) < 1, a required condition to apply Theorem 5.4.1.

Results are shown in Figure 5.1a. The separate line plots correspond to scaling the random Σ_x by the values {0.1, 1, 10}, while the x-axis corresponds to the value of σ_u. For each value of σ_x, there appears to be a unique σ_u that minimizes ν(ĝ), and this value of σ_u increases with σ_x. This phenomenon appears in both the bound and the empirical variance.

Effect of ∥B∥  In this experiment, we generate a random problem where m = n and replace B with bI for b geometrically spaced in the range [10^-2, 10^2]. The resulting system essentially gives the policy direct control over each state. For each B, we compute a separate infinite-horizon optimal K and sample the variance for different x_1. Results are shown in Figure 5.1b. The separate line plots correspond to scaling both the random Σ_x and the random Σ_u by the values {0.1, 1, 10}. The x-axis corresponds to the value of b. Again, there appears to be a unique ∥B∥ that minimizes ν(ĝ), but its value changes minimally for different magnitudes of Σ_x, Σ_u.

Effect of ρ(A+BK)  Here we measure the change in variance with respect to the closed-loop spectral radius ρ(A+BK). To synthesize controllers K such that ρ(A+BK) attains a specified value, we use the pole placement algorithm of Tits and Yang (1996). (Pole placement is discussed in §2.9.4.) We sample a "prototype" set of n eigenvalues with λ_1, ..., λ_⌈n/2⌉ as complex conjugate pairs λ_i, λ_{i+1} = r e^{±iφ}, where r ~ Uniform([0,1]) and φ ~ Uniform([0, 2π)), and sample the remaining real λ_i from Uniform([-1, 1]). Then, for each desired ρ, we compute K_ρ = P(A, B, ρλ_1, ..., ρλ_n). By rescaling the same set of λ_i instead of sampling a new set for each ρ, we avoid confounding effects from changing other properties of K. A minimal sketch of this synthesis step is given below.
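SciPy's scipy.signal.place_poles implements the Tits and Yang algorithm as its default method; the snippet that follows (ours; the eigenvalue sampling is simplified and the function names are illustrative) uses it to construct K_ρ. Note that place_poles returns a gain F such that A - BF has the requested poles, so under our u_t = K x_t convention the controller is K = -F.

    import numpy as np
    import scipy.signal

    def prototype_eigenvalues(n, rng):
        # Complex-conjugate pairs r*exp(+/- i*phi) plus (if n is odd) one real
        # eigenvalue in [-1, 1]; a simplified version of the sampling above.
        lams = []
        while len(lams) < n - 1:
            r, phi = rng.uniform(0.0, 1.0), rng.uniform(0.0, 2.0 * np.pi)
            lams += [r * np.exp(1j * phi), r * np.exp(-1j * phi)]
        if len(lams) < n:
            lams.append(rng.uniform(-1.0, 1.0))
        return np.array(lams[:n])

    def controller_with_spectral_radius(A, B, lams, rho):
        # Place the closed-loop eigenvalues at rho * lams.  place_poles returns F
        # with eig(A - B F) at the requested poles, so K = -F gives A + B K.
        result = scipy.signal.place_poles(A, B, rho * lams)
        return -result.gain_matrix

The construction can be sanity-checked by confirming that max(abs(np.linalg.eigvals(A + B @ K))) is approximately equal to the requested ρ.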
Results are shown in Figure 5.1c. Again, we repeat the experiment for different magnitudes of Σ_x and Σ_u. Unlike the previous two experiments, here we see qualitatively different behavior between our upper bound and the empirical variance. The bound begins to increase rapidly near ρ = 1, corresponding to the growth of 1/(1-ρ) in the term H', but at ρ = 0.9 the H term becomes active in H', and the bound suddenly flattens. In contrast, the empirical variance grows more moderately and does not explode near the threshold of system instability. This further bolsters our confidence that the upper bound of Theorem 5.4.1 can be tightened to match the √H' and √H terms in the lower bound of Theorem 5.4.2.

Figure 5.2: Scatter plot of empirical ν(ĝ) (x-axis) and upper bound from Theorem 5.4.1 (y-axis) with varying state dimensionality n and time horizon H. Each point represents one random LQR problem.

Dimensionality parameters  In all of the preceding experiments, we arbitrarily chose the state and action dimensions n = 5, m = 3 and time horizon H = 10. To visualize the variance for other values of these parameters, we generate 1000 random LQR problems with n and H each varying over the set {3, 10, 30}. We fix m = ⌈n/2⌉. Results are shown in Figure 5.2. The overall positive trend with a slope greater than 1 shows that the bound grows superlinearly with respect to the empirical variance, as expected. One interesting property is the tighter clustering for large values of n. This may be due to several eigenvalue distribution results in random matrix theory which state that, as n → ∞, our random LQR problems tend to become similar up to a basis change (Tao, 2012).

Figure 5.3: Learning curves of REINFORCE for a random LQR problem with varying scales of action noise σ_u and environment noise σ_x. Larger σ_u can improve learning, despite larger variance of ĝ.

5.5.1 RL policy optimality for varying Σ_u

The results in §5.5 suggest that, for a fixed Σ_x, the magnitude of Σ_u has a significant effect on ν(ĝ). This is of practical interest because Σ_u is usually under the control of the RL practitioner. It is therefore natural to ask whether the change in variance corresponds to a change in the rate of convergence of REINFORCE. We test this empirically by executing REINFORCE in variants of one random LQR problem with different values of Σ_u and Σ_x. To avoid a confounding effect from larger Σ_u incurring a greater penalty from the -u_t^T R u_t term in r_t, we evaluate the trained policies in a modified version of the problem where Σ_u = Σ_x = 0. It can be shown that the optimal K* for the stochastic problem is also optimal for the deterministic problem, so each problem variant should converge to the same evaluation returns in the limit.

We initialize K by perturbing the elements of the LQR-optimal controller with i.i.d. Gaussian noise and scaling the perturbation until ρ(A+BK) ≈ 0.98. After every 10 iterations of REINFORCE, we evaluate the current policy in the noise-free environment. For each (Σ_u, Σ_x) pair, we repeat this experiment 10 times with different random seeds. The random seed only affects the ϵ_t^u, ϵ_t^x, and x_1 samples. The aggregate data are shown in Figure 5.3. Shaded bands correspond to one standard deviation across the separate runs of REINFORCE. The lowercase σ_u, σ_x refer to scaling factors applied to the initial samples of Σ_u, Σ_x in the random LQR problem. A minimal sketch of this training and evaluation protocol is given below.
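The sketch reuses reinforce_grad from the earlier snippet; the learning rate and batch size are illustrative assumptions rather than the values used in our experiments, and the function names are ours.

    import numpy as np

    def evaluate_noise_free(A, B, Q, R, K, x1, H):
        # Finite-horizon cost of u_t = K x_t with Sigma_u = Sigma_x = 0.
        x, cost = x1.copy(), 0.0
        for _ in range(H):
            u = K @ x
            cost += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u
        return cost

    def run_reinforce(A, B, Q, R, Sx, Su, K0, sample_x1, H,
                      iters=1000, batch=10, lr=1e-5, eval_every=10, rng=None):
        # Stochastic gradient ascent on the return using reinforce_grad (defined
        # in the earlier sketch), with periodic noise-free evaluation of K.
        rng = np.random.default_rng() if rng is None else rng
        K, curve = K0.copy(), []
        for it in range(iters):
            g = np.mean([reinforce_grad(A, B, Q, R, Sx, Su, K,
                                        sample_x1(rng), H, rng)
                         for _ in range(batch)], axis=0)
            K = K + lr * g                  # ascent: rewards are negated costs
            if it % eval_every == 0:
                curve.append(evaluate_noise_free(A, B, Q, R, K, sample_x1(rng), H))
        return K, curve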
The effect is different than one would predict under the hypothesis that the performance of REINFORCE is mainly dictated by its estimation variance. For all values of Σ_x in the experiment, problems with larger Σ_u converge faster, whereas Figure 5.1a would suggest that the "optimal" value of Σ_u changes with respect to Σ_x. The fact that larger Σ_u tends to make REINFORCE converge faster is not obvious, given the Σ_u^{-1} term in ĝ (5.4). Also, when Σ_u is very small and Σ_x is very large, the algorithm becomes unstable and sees large variations across different random seeds. For the middle values Σ_u ∈ {0.1, 1.0}, we observe that larger Σ_x causes faster convergence.

Figure 5.4: Suboptimality ratios of the policy after 1000 iterations of REINFORCE for a random LQR problem with varying scales of action noise σ_u and environment noise σ_x.

An alternate visualization of the same experiment is given in Figure 5.4. Here, as in Figure 5.1, the horizontal axis corresponds to the action variance σ_u and the line colors correspond to different values of the environment dynamics variance σ_x. However, instead of plotting the gradient estimate variance on the vertical axis as in Figure 5.1, here we plot the suboptimality ratio of the final policy after running REINFORCE for a fixed number of iterations. The ratio is taken relative to the LQR-optimal policy for the infinite-horizon version of the problem. In this plot we include results for larger values of σ_u than were shown in Figure 5.3. We observe that the suboptimality does increase for large enough values of σ_u, so the effect is not monotonic as Figure 5.3 might suggest. However, the value of σ_u for which the suboptimality ratio is closest to 1 does not move significantly for different values of σ_x. We remark further on these results in §5.8.

5.6 Proof of Theorem 5.4.1

In this section, we provide the detailed derivation of the upper bound stated in Theorem 5.4.1. We define the following notation: let δ_t^x and δ_t^u be independent random vectors that follow N(0, I_n) and N(0, I_m) respectively for all t. The ϵ_t^x and ϵ_t^u defined in Equations (5.1) and (5.3) can then be written as ϵ_t^x = Σ_x^{1/2} δ_t^x and ϵ_t^u = Σ_u^{1/2} δ_t^u. We use the following steps to upper-bound E[tr(ĝ^⊤ ĝ)]:

1. Bound tr(ĝ^⊤ ĝ) by P, a polynomial of d_t^x ≜ ‖δ_t^x‖ and d_t^u ≜ ‖δ_t^u‖ for t = 1, 2, ..., H. We require that P has only nonnegative coefficients. Since δ_t^x and δ_t^u are independent random vectors with standard normal distributions, d_t^x and d_t^u are independent random variables following the χ(n) and χ(m) distributions respectively.

2. Bound the sum of the (already nonnegative) coefficients of P by substituting 1 for all of the d_t^x, d_t^u. More formally, let

   P({d_t^x, d_t^u}_{t=1}^H) = Σ_i C_i ∏_{t=1}^H (d_t^x)^{α_{t,i}} ∏_{t=1}^H (d_t^u)^{β_{t,i}},

   where ∏_{t=1}^H (d_t^x)^{α_{t,i}} ∏_{t=1}^H (d_t^u)^{β_{t,i}} is the i-th monomial in P and C_i is its nonnegative coefficient. We then calculate Σ_i C_i by substituting 1 for all d_t^x, d_t^u. We use the notation 1(P) to denote this operation (and we define the 1(·) operator analogously for expressions other than P itself):

   1(P) = P |_{d_t^x = 1, d_t^u = 1 ∀t} = Σ_i C_i.

3. Bound the expectation of every monomial in P, i.e., find M such that

   E[∏_{t=1}^H (d_t^x)^{α_{t,i}} ∏_{t=1}^H (d_t^u)^{β_{t,i}}] ≤ M for all i.

With the three steps above, we can then bound E[tr(ĝ^⊤ ĝ)] ≤ E[P] ≤ Σ_i C_i M = M · 1(P).
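As a small worked example of steps 2 and 3 (ours, for exposition only): suppose H = 1 and P = 3 (d_1^u)^2 (d_1^x)^2 + 5 (d_1^x)^4. Then 1(P) = 3 + 5 = 8. Using the χ moments E[χ(k)^2] = k and E[χ(k)^4] = k(k+2) together with the independence of d_1^u and d_1^x, the two monomials have expectations mn and n(n+2), both at most n̄(n̄+2) with n̄ = max{n, m}, so M = n̄(n̄+2) suffices and E[P] ≤ 1(P) · M = 8 n̄(n̄+2). The actual polynomial P constructed below has degree 8, which is why the final bound scales with the 8th moment of a χ random variable, i.e., O(n̄^4).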
To calculateM, we can use the known formula for thek-th moment of aχ random variable. Recall that ˆ g = H X t=1 Σ − 1 u ϵ a t x ⊤ t ! H X t=1 r t ! , and thus tr ˆ g ⊤ ˆ g =tr H X t=1 Σ − 1 u ϵ u t x ⊤ t ! ⊤ H X t=1 Σ − 1 u ϵ u t x ⊤ t ! | {z } Term 1 H X t=1 r t ! 2 | {z } Term 2 . Thanks to the property 1(P) = 1(P 1 )1(P 2 ) for polynomialP ≡ P 1 P 2 , and 1(P) = 1(P 1 )+ 1(P 2 ) for P ≡ P 1 +P 2 , we do not need to directly find P and bound 1(P). Below in Section 5.6.1, we first bound∥x t ∥ by a polynomial⌈x t ⌉ of{d s τ ,d a τ }, and then find 1(⌈x t ⌉). In Section 5.6.2 and 5.6.3, we further upper bound 1(Term 1 ) and 1(Term 2 ) with the help of 1(⌈x t ⌉). Then finally we obtain an upper bound forE[tr ˆ g ⊤ ˆ g ] asM 1(Term 1 )1(Term 2 ). 117 5.6.1 Bounding∥x t ∥ In this section, we bound∥x t ∥ from above. Although the spectral radiusρ (A+BK) determines the asymp- totic stability of the closed-loop system, it guarantees little about the transient behavior–for example, for anyx>1 and0<ϵ< 1, the matrix A= " ϵ x 0 ϵ # has the properties ρ (A) < 1, ∥A∥ > x. Therefore, while the state magnitude∥x t ∥ is bounded by the operator norm∥A+BK∥ t− 1 , it is too restrictive to require∥A+BK∥ < 1. Instead, we will use the following result: Lemma5.6.1 (Trefethen and Embree (2005)). LetA∈R n× n be a matrix withρ (A)<1. Then there exists µ> 0 such that, for allk∈N, ∥A k ∥≤ µρ (A) k . (5.5) µ is bounded by the “resolvent condition”µ ≤ 2enr(A), wheree is the exponential constant and r(A)= sup z∈C,|z|>1 (|z|− 1)∥(zI− A) − 1 ∥. (5.6) The derivation and interpretation of (5.6) is a deep subject related to the matrix pseudospectrum, cov- ered extensively by Trefethen and Embree (2005). Intuitively,r(A) is large if a small perturbationϵ ∈R n× n would causeρ (ϵ +A)>1. Expanding the state transition function in Equation (5.1) with the linear stochastic policy in Equa- tion (5.3), we get x t =(A+BK) t− 1 x 1 + t− 1 X τ =1 (A+BK) t− τ − 1 (Σ 1 2 x δ x τ +BΣ 1 2 u δ u τ ). (5.7) 118 Recall that ∥·∥ denotes the ℓ 2 norm for vectors and the ℓ 2 − ℓ 2 operator norm for matrices. By Lemma 5.6.1, there exists µ such that∥(A + BK) k ∥ ≤ µρ (A) k . Let f = ρ (A + BK), σ 2 x = ∥Σ x ∥, σ 2 u =∥Σ u ∥, andb=∥B∥. By repeatedly applying the triangle inequality, ∥x t ∥= (A+BK) t− 1 x 1 + t− 1 X τ =1 (A+BK) t− τ − 1 (Σ 1 2 x δ x τ +BΣ 1 2 u δ u τ ) ≤ (A+BK) t− 1 x 1 + t− 1 X τ =1 (A+BK) t− τ − 1 (Σ 1 2 x δ x τ +BΣ 1 2 u δ u τ ) ≤ µf t− 1 ∥x 1 ∥+µ t− 1 X τ =1 f t− τ − 1 (σ x d x τ +bσ u d u τ ). (5.8) We denote the final bound in (5.8) as ⌈x t ⌉. The bound⌈x t ⌉ is linear in the random variables{d x t ,d u t } H t=1 with only positive coefficients. Furthermore, 1(⌈x t ⌉)=µf t− 1 ∥x 1 ∥+µ t− 1 X τ =1 f t− τ − 1 (σ x +bσ u ) =µf t− 1 ∥x 1 ∥+µ t− 2 X τ =0 f τ (σ x +bσ u ). (5.9) 5.6.2 BoundingTerm 1 In this subsection we show that tr H X t=1 Σ − 1 u ϵ u t x ⊤ t ! ⊤ H X t=1 Σ − 1 u ϵ u t x ⊤ t ! ≤∥ Σ − 1 u ∥ H X t=1 d u t ∥x t ∥ ! 2 . 119 Letξ t =Σ − 1 u ϵ u t . Then tr H X t=1 ξ t x ⊤ t ! ⊤ H X t=1 ξ t x ⊤ t ! =tr X 1≤ i,j≤ H x i ξ ⊤ i ξ j x ⊤ j = X 1≤ i,j≤ H ξ ⊤ i ξ j x ⊤ i x j ≤ X 1≤ i,j≤ H |ξ ⊤ i ξ j ||x ⊤ i x j | = X 1≤ i,j≤ H |δ u i ⊤ Σ − 1 u δ u j ||x ⊤ i x j | ≤ X 1≤ i,j≤ H ∥Σ − 1 u ∥∥δ u i ∥∥δ u j ∥∥x i ∥∥x j ∥ =∥Σ − 1 u ∥ H X t=1 ∥δ u t ∥∥x t ∥ ! 2 ≤∥ Σ − 1 u ∥ H X t=1 d u t ⌈x t ⌉ ! 2 , (5.10) in which we made use of the circulant property of the trace, the Cauchy-Schwarz inequality, and the fact thatΣ 1 2 Σ − 2 Σ 1 2 =Σ − 1 for positive semidefinite Σ . 120 5.6.3 BoundingTerm 2 We now boundC ≜ P H t=1 − r t from above. 
Note that− r t ≥ 0, sinceQ ⪰ 0 andR ≻ 0. Letq = ∥Q∥, r =∥R∥, andk =∥K∥. Then C = H X t=1 − r t = H X t=1 x ⊤ t Qx t +u ⊤ t Ru t ≤ H X t=1 q∥x t ∥ 2 +r∥u t ∥ 2 = H X t=1 q∥x t ∥ 2 +r∥Kx t +Σ 1 2 u δ u t ∥ 2 ≤ H X t=1 q∥x t ∥ 2 +r(k∥x t ∥+σ u ∥δ u t ∥) 2 ≤ H X t=1 q∥x t ∥ 2 +2rk 2 ∥x t ∥ 2 +2rσ 2 u ∥δ u t ∥ 2 ≤ (q+2rk 2 ) H X t=1 ⌈x t ⌉ 2 +2rσ 2 u H X t=1 (d u t ) 2 , (5.11) where the triangle inequality and the fact (a+b) 2 ≤ 2(a 2 +b 2 ) are used above. This bound on C is a quadratic polynomial in thed x ,d u . 5.6.4 Combiningbounds Combining Section 5.6.2 and Section 5.6.3, we have tr ˆ g ⊤ ˆ g ≤ P =C 2 ∥Σ − 1 u ∥ H X t=1 d u t ⌈x t ⌉ ! 2 . (5.12) For brevity, letα =∥Σ − 1 u ∥,β =q+2rk 2 ,γ =2rσ 2 u ,σ =σ x +bσ u , andH ′ ≜ min n H, 1 1− f o . The term H ′ reflects the stability of the closed-loop system: if highly stable ( f ≪ 1), we haveH ′ ≪ H, but when approaching instability (f →1),H ′ approachesH. 121 We expandP and substituted x t =1, d u t =1 for allt to compute the sum ofP ’s coefficients, using the notation 1(·) for the transformation of replacing alld with1. We first bound 1(C 2 ): 1(C 2 )≤ γH +β H X t=1 1(⌈x t ⌉) 2 ! 2 ≤ γH +βµ 2 H X t=1 f t− 1 ∥x 1 ∥+σH ′ 2 ! 2 ≤ γH +2βµ 2 H X t=1 f 2t− 2 ∥x 1 ∥ 2 +σ 2 H ′2 ! 2 ≤ γH +2βµ 2 (H ′ ∥x 1 ∥ 2 +σ 2 HH ′2 ) 2 (5.13) where the result is obtained by repeatedly applying the fact(a+b) 2 ≤ 2(a 2 +b 2 ). Next, 1( H X t=1 d u t ⌈x t ⌉ ! 2 )≤ µ H X t=1 f t− 1 ∥x 1 ∥+σH ′ ! 2 ≤ µ 2 H ′2 (∥x 1 ∥+σH ) 2 . (5.14) Finally, 1(P)≤ αµ 2 H ′2 γH +2βµ 2 (H ′ ∥x 1 ∥ 2 +σ 2 HH ′2 ) 2 (∥x 1 ∥+σH ) 2 =4∥Σ − 1 u ∥µ 2 H ′2 rσ 2 u H +µ 2 (q+2rk 2 )(H ′ ∥x 1 ∥ 2 +σ 2 HH ′2 ) 2 (∥x 1 ∥+σH ) 2 ≜ 1(P). (5.15) 1(P) is an order-8 polynomial in thed s ,d a . The formula for the 8th moment of aχ (n) random variable is E[X 8 ]=n(n+2)(n+4)(n+6), soE[tr ˆ g ⊤ ˆ g ]≤ 1(P)· O(¯n 4 ), where ¯n=max(n,m). (Since we are summing the variances ofO(¯n 2 ) random variables in ˆ g, we would expect scaling of no less than ¯n 2 compared to the scalar case.) 122 5.7 ProofofTheorem5.4.2 In this section, we provide a lower bound for E[ˆ g 2 ] in the scalar case, m = n = 1. The matrices A,B,K,Q,R are thus denoted as a,b,k,q,r here (notice that this notation r is different from the no- tation r t for reward). Other notations follow the definitions in § 5.6. We will analyze the case when 0≤ a+bk <1. Lemma5.7.1. E H X t=1 ϵ u t x t σ 2 u ! 2 H X t=1 r t ! 2 ≥ E H X t=1 ϵ u t x t σ 2 u ! 2 E H X t=1 r t ! 2 . (5.16) Proof. Whena+bk≥ 0, all terms have positive coefficients. Rename the 2H random variables{δ x t ,δ u t } H t=1 asx 1 ,...,x 2H . We can see that a monomial on the right-hand side: E x α 1 1 ··· x α 2H 2H E x β 1 1 ··· x β 2H 2H corresponds to the monomial on the left-hand side: E x α 1 +β 1 1 ··· x α 2H +β 2H 2H . Thex i are independent zero-mean normal random variables, so the propertyE[x α i ]E[x β i ]≤ E[x α +β i ] holds for any non-negative integersα,β . Combining with the fact that all coefficients are non-negative shows the lemma. 123 In the following two subsections, we lower bound the two terms on the right-hand side of Eq. (5.16) separately. Note that the first term can be simplified as H X t=1 ϵ u t x t σ 2 u ! 2 = H X t=1 δ u t x t σ u ! 2 = 1 σ 2 u H X t=1 δ u t x t ! 2 . 5.7.1 LowerboundingE P H t=1 δ u t x t 2 By the expansion ofx t in Eq. (5.7), we have E H X t=1 δ u t x t ! 2 =E H X t=1 δ u t (a+bk) t− 1 x 1 +L t ! 2 ≥ E H X t=1 δ u t (a+bk) t− 1 x 1 ! 2 +E H X t=1 δ u t L t ! 2 , whereL t ≜ P t− 1 τ =1 (a+bk) t− 1− τ (σ x δ s τ +bσ u δ a τ ). 
The first term is equal to H X t=1 (a+bk) 2t− 2 x 2 1 = 1− (a+bk) 2H 1− (a+bk) 2 x 2 1 . The second term can be further written as E H X t=1 δ u t t− 1 X τ =1 (a+bk) t− 1− τ (σ x δ s τ +bσ u δ a τ ) ! 2 =E H X t=1 (δ u t ) 2 t− 1 X τ =1 (a+bk) t− 1− τ (σ x δ s τ +bσ u δ a τ ) ! 2 =E H X t=1 t− 1 X τ =1 (a+bk) t− 1− τ (σ x δ s τ +bσ u δ a τ ) ! 2 =E " H X t=1 t− 1 X τ =1 (a+bk) 2t− 2− 2τ (σ 2 x +b 2 σ 2 u ) # =(σ 2 x +b 2 σ 2 u )E " H X t=1 1− (a+bk) 2t− 2 1− (a+bk) 2 # 124 =(σ 2 x +b 2 σ 2 u ) H 1− (a+bk) 2 − 1− (a+bk) 2H (1− (a+bk) 2 ) 2 . In the above several equalities, we use the independence amongδ u t ,δ x t . Combining two terms, we get E H X t=1 δ u t x t ! 2 ≥ x 2 1 +H(σ 2 x +b 2 σ 2 u ) 1− (a+bk) 2 − (a+bk) 2H 1− (a+bk) 2 x 2 1 − (σ 2 x +b 2 σ 2 u ) 1− (a+bk) 2H (1− (a+bk) 2 ) 2 = 1− (a+bk) 2H 1− (a+bk) 2 x 2 1 + H− 1− (a+bk) 2H 1− (a+bk) 2 σ 2 x +b 2 σ 2 u 1− (a+bk) 2 ≈ x 2 1 +H(σ 2 x +b 2 σ 2 u ) 1− (a+bk) 2 whenH ≫ 1 1− (a+bk) 2 H x 2 1 +H(σ 2 x +b 2 σ 2 u ) whenH ≪ 1 1− (a+bk) 2 ≈ min H, 1 1− (a+bk) 2 x 2 1 +H(σ 2 x +b 2 σ 2 u ) =H ′ x 2 1 +H(σ 2 x +b 2 σ 2 u ) . 5.7.2 LowerboundingE P H t=1 r t 2 5.7.3 We first lower bound this term by E H X t=1 r t ! 2 ≥ E " H X t=1 r t # 2 =E " H X t=1 qx 2 t +r(kx t +σ u δ u t ) 2 # 2 =E " H X t=1 (q+rk 2 )x 2 t +2rkx t σ u δ u t +rσ 2 u δ a2 t # 2 = (q+rk 2 )E " H X t=1 x 2 t # +Hrσ 2 u ! 2 125 E " H X t=1 x 2 t # =E " H X t=1 (a+bk) t− 1 x 1 +L t 2 # =E " H X t=1 (a+bk) 2t− 2 x 2 1 +2(a+bk) t− 1 x 1 L t +L 2 t # =E H X t=1 (a+bk) 2t− 2 x 2 1 + H X t=1 t− 1 X τ =1 (a+bk) t− 1− τ (σ x δ x t +bσ u δ u t ) ! 2 (the middle termE[(a+bk) t− 1 x 1 L t ] is zero becauseL t is a sum of zero-mean RVs) = 1− (a+bk) 2H 1− (a+bk) 2 x 2 1 +E " H X t=1 t− 1 X τ =1 (a+bk) 2t− 2− 2τ (σ 2 2 +b 2 σ 2 u ) # = 1− (a+bk) 2H 1− (a+bk) 2 x 2 1 +(σ 2 x +b 2 σ 2 u ) H 1− (a+bk) 2 − 1− (a+bk) 2H (1− (a+bk) 2 ) 2 ≈ min H, 1 1− (a+bk) 2 x 2 1 +H(σ 2 x +b 2 σ 2 u ) =H ′ x 2 1 +H(σ 2 x +b 2 σ 2 u ) . The second-to-last approximation is obtained similarly as in the previous subsection. 5.7.4 Combining Combining the results in previous two subsections, we get the final lower bound on E[ˆ g 2 ]: E[ˆ g 2 ]=E H X t=1 δ u t x t σ u ! 2 H X t=1 r t ! 2 ≥ E H X t=1 δ u t x t σ u ! 2 E H X t=1 r t ! 2 ≥ Ω( c 2 1 c 2 2 ), where (recallσ ≜σ x +bσ u ) c 1 = 1 σ u |x 1 |+σ √ H √ H ′ , c 2 =rσ 2 u H +(q+rk 2 )(x 2 1 +σ 2 H)H ′ . 126 5.8 Discussion In this chapter, we derived bounds on the variance of the REINFORCE policy gradient estimator in the stochastic linear-quadratic control setting. Our upper bound is fully general, while our lower bound applies to the scalar case at a stationary point. The bounds match with respect to all system parameters except the time horizonH and closed-loop spectral radiusρ (A+BK). We compared our bound prediction to the empirical variance in a variety of experimental settings, finding a close qualitative match in the parameters for which the bounds are tight. Our experiments in § 5.5.1 plotting the empirical convergence rate of REINFORCE suggest that the effect of action noise Σ u on the overall RL performance is not fully captured by its effect on the variance. An interesting direction for future work would be to investigate the role ofΣ u more closely and attempt to disentangle its effect on gradient magnitude, variance, exploration, and regularization. Such an analysis could lead to improved variance reduction methods or algorithms that manipulateΣ u to speed up the RL optimization. 
Chapter 6

Suboptimal Coverings

6.1 Introduction

In this chapter, we present work motivated by one of the questions in the discussion of §3.5: How hard is it to represent a good multi-system policy? We propose the α-suboptimal covering number to characterize multi-system control problems where the set of dynamical systems and/or cost functions is infinite, analogous to the cardinality of finite system sets. We study suboptimal covering numbers for continuous-time linear-quadratic regulator problems (§2.9.2.5) and construct a class of multi-system LQR problems amenable to analysis. For the scalar case, we show logarithmic dependence on the "breadth" of the space. For the matrix case, we present empirical results and intermediate theoretical results towards an equivalent theorem.

In the multi-system control paradigm described in §2.6, we did not specify any particular properties of the set of MDPs Φ. Let us consider the cardinality of the system set, using the example of a mobile robot as motivation. If the system set is finite, like selecting between "map an environment" and "deliver a package", then its size is naturally quantified by the number of systems. If the system set is infinite, like delivering packages with arbitrary mass and inertial properties, then its size is not so easily quantified. Even if Φ is equipped with a metric or measure, these structures may be only weakly linked to the diversity of behavior required for good performance on all systems.

Most work presented in this chapter was originally published in Preiss and Sukhatme (2021). The efforts towards the matrix case in §6.7 are new.

Suppose a multi-system policy that maps state and system parameters directly to actions is to be selected from a parameterized family of functions. As the system space expands from a singleton set, we expect to need a more expressive class of functions to represent a good multi-system policy. In this work, we propose the α-suboptimal covering number to capture this idea. For a system space Φ and a suboptimality ratio α > 1, we define N_α^cov(Φ) as the size of the smallest set of single-system policies C such that for every ϕ ∈ Φ, at least one π ∈ C has a cost ratio no greater than α relative to the optimal policy for ϕ. If the policies in C are parameterized functions, then C provides an upper bound on the number of parameters needed to represent an α-suboptimal multi-system policy. In switching-based adaptive control, where ϕ is unknown, a smaller C reduces the computational complexity of the controller and can reduce the number of switches per unit of time (Hespanha et al., 2000).

To study suboptimal covering numbers in a concrete setting, we consider continuous-time LQR problems (§2.9.2.5). LQR problems are a common setting to analyze learning algorithms because detailed properties are known (see §5.2 for examples). This has led to new inquiries into their fundamental properties (Bu et al., 2019b). Our work follows the latter spirit. We construct a family of well-behaved multi-system LQR problems where Φ is controlled by a "breadth" parameter θ ∈ [1, ∞), and for which N_α^cov(Φ_θ) is finite and increasing in θ. For the special case of a scalar LQR problem, we derive matching logarithmic upper and lower bounds on N_α^cov(Φ_θ) as a function of θ. As an effort towards analogous bounds for the matrix case, we present empirical results intended to shed light on the problem structure. For the upper bound, we analyze properties of a logical extension of our scalar cover.
For the lower bound, we visualize suboptimal neighborhoods for two choices of “extremal” systems, revealing surprising topological behavior for one choice. Finally, we present some intermediate theoretical tools that may be useful for the matrix case, and more suboptimal neighborhood visualizations for a generalization of our multi-system LQR family to include variations in the state dynamics matrixA as well as the input matrixB. 129 This chapter is an initial step towards a comprehensive theory. In addition to a more complete picture of deterministic LQR systems, ideas ofα -suboptimal coverings could be applied to a wide range of multi- system problems. We also hope they will lead to insights about function class expressiveness in learning- based multi-system control. 6.2 Problemsetting In this section we define α -suboptimal covering numbers with respect to an abstract multi-system control problem, independent of distinctions such as continuous vs. discrete time and stochastic vs. deterministic dynamics. We then instantiate these definitions for a particular class of LQR problems. Notation We consider a family of MDPs as defined in § 2.6, with state space X , action spaceU, and system spaceΦ . We also require a class of reference policiesΠ ref ⊆U X and a strictly positive objective functionJ : Φ ×U X 7→ R >0 . The partial application ofJ forϕ ∈ Φ is denoted byJ ϕ : U X 7→ R. The optimal reference cost for an system is denoted byJ ⋆ ϕ =inf π ∈Π ref J ϕ (π ). Definition 6.2.1. Consider a multi-system optimal control problem(X,U,Φ ,Π ref ) and a suboptimality ratioα> 1. Given a policyπ :X 7→U, we define its α -suboptimal neighborhood as N α (π )= ( ϕ ∈Φ: J ϕ (π ) J ⋆ ϕ ≤ α ) . A set of policiesC⊆U X is anα -suboptimal cover of Φ if [ π ∈C N α (π )=Φ . 130 (Note thatC need not belong toΠ ref – these definitions are still meaningful in the “improper” case.) The α -suboptimalcoveringnumber ofΦ , denotedN cov α (Φ) , is the size of the smallest finite α -suboptimal cover ofΦ if one exists, or∞ otherwise. Standard LQR problem In this chapter, we analyze suboptimal coverings for continuous-time, deter- ministic, infinite-horizon, LQR problems, as defined in §2.9.2.5. Whereas in (2.30) the LQR cost was defined for a particular initial statex(0), in this chapter we define the overall policy cost for a stabilizing controller K∈R m× n by J(K)= E x(0)∼N (0,I) J x(0) = E x(0)∼N (0,I) Z ∞ 0 h x(t) ⊤ Qx(t)+u(t) ⊤ Ru(t) i dt, (6.1) in other words the expected cost when the initial state is distributed by a unit Gaussian. With this defi- nition, ifP solves the algebraic Riccati equation (2.31) andK is the optimal controllerK =− R − 1 B ⊤ P , thenJ(K) = tr[P]. Furthermore, ifK is an arbitrary stabilizing controller, we have (Mohammadi et al., 2019): J(K)=tr[(Q+K ⊤ RK)W], (6.2) where W = Z ∞ 0 e t(A+BK) ⊤ e t(A+BK) dt. (6.3) W can be computed by solving the Lyapunov equation (A+BK) ⊤ W +W(A+BK)+I =0. Multi-dynamicsLQR A fully general formulation of multi-system LQR would allow variations in each of(A,B,Q,R), but this creates redundancy. Any LQR problem whereQ≻ 0 is equivalent via change of 131 coordinates to another LQR problem whereQ = I andR = I. To reduce redundancy, we consider only multi-dynamics LQR problems whereQ = I n× n andR = I m× m in this work. The reference policy class is linear: Π ref =R m× n . One way to then define a multi-dynamics LQR problem is by a simple product Φ= A× B for some sets A⊆ R n× n andB⊆ R n× m . However, it is not obvious how to designA andB. 
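The quantities defined above are straightforward to evaluate numerically. The following SciPy sketch (ours; function names are illustrative) computes J(K) from (6.2)-(6.3) via a Lyapunov solve, the optimal cost J*_φ = tr[P] via the algebraic Riccati equation, and the suboptimality ratio used to test membership in N_α(π), under the Q = I, R = I convention adopted in this chapter.

    import numpy as np
    import scipy.linalg

    def lqr_cost(A, B, K):
        # J(K) from (6.2)-(6.3) with Q = I, R = I: J(K) = tr[(I + K'K) W], where
        # W solves the Lyapunov equation (A + BK)' W + W (A + BK) + I = 0.
        n = A.shape[0]
        Acl = A + B @ K
        if np.max(np.linalg.eigvals(Acl).real) >= 0.0:
            return np.inf                  # K does not stabilize this system
        W = scipy.linalg.solve_continuous_lyapunov(Acl.T, -np.eye(n))
        return np.trace((np.eye(n) + K.T @ K) @ W)

    def lqr_optimal(A, B):
        # Optimal cost J* = tr[P] and controller K* = -B'P (Q = I, R = I), where
        # P solves the continuous-time algebraic Riccati equation.
        n, m = B.shape
        P = scipy.linalg.solve_continuous_are(A, B, np.eye(n), np.eye(m))
        return np.trace(P), -B.T @ P

    def suboptimality_ratio(A, B, K):
        # J_phi(K) / J*_phi; phi lies in N_alpha(K) iff this ratio is <= alpha.
        J_star, _ = lqr_optimal(A, B)
        return lqr_cost(A, B, K) / J_star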
To support an asymptotic analysis of N cov α (Φ) , the system space Φ should have a real-valued “breadth” parameter θ that sweeps from a single system to sets with arbitrarily large, but finite, covering numbers. Matrix norm balls are a popular representation of dynamics uncertainty in the robust control literature, but they can easily contain uncontrollable pairs, and removing the uncontrollable pairs can lead to an infinite covering number. For example, in the scalar problem A={a}, B=[− θ, 0)∪(0,θ ], wherea̸=0, it can be shown that noα -suboptimal cover is finite. These properties are worrying, but the example B is pathological. The zero crossing is analogous to reversing the direction of force applied by an actuator in a physical system. Allowing B to become arbitrarily close to zero means the system can become arbitrarily close to uncontrollable. A more relevant multi-dynamics problem is variations in mass or actuator strength, whose signs are fixed. We formalize this idea with the following definition. Definition6.2.2. FixA∈R n× n and a breadth parameterθ ≥ 1. LetA={A} and let B={UΣ V ⊤ :Σ ∈Σ }, where Σ ={diag(σ ):σ ∈[ 1 θ ,1] d }. The matrices U ∈ R n× d and V ∈ R m× d each have rank d, where 0 < d ≤ min{n,m}. The tuple (A,U,V,θ ) fully defines a multi-system LQR problem in decomposed dynamics form, or DDF problem for brevity. 132 We will abuse notation and associate Φ with both A× B and Σ when the meaning is clear from context. The continuity of the LQR cost (6.2) with respect to B and the compactness of Φ for any θ imply thatN cov α (Φ θ ) is always finite. Variations in A are redundant in the scalar case where we focus our theoretical work in this chapter. The definition can be extended to include them in future work. x y u 1 u 2 u 3 u 4 z Figure 6.1: Quadrotor helicopter with position states x,y,z, attitude states θ,ϕ,ψ , and per-rotor thrust inputsu 1 ,u 2 ,u 3 ,u 4 . The linearized dynamics at hover, subject to variations in mass, geometry, etc., can be expressed in decomposed dynamics form—see §6.2. Linearizedquadrotorexample As an example of a realistic DDF problem, we consider the quadrotor helicopter illustrated in Figure 6.1. Near the hover state, its full nonlinear dynamics are well approximated by a linearization. The state is given by x = (x,v,r,ω), where x ∈ R 3 is position, v ∈ R 3 is linear velocity, r ∈ R 3 is attitude Euler angles, andω ∈ R 3 is angular velocity. The inputs u ∈ R 4 ≥ 0 are the squared angular velocities of the propellers. Many factors influence the response to inputs, including geometry, mass, moments of inertia, motor properties, and propeller aerodynamics. These can be combined and partially nondimensionalized into 133 four control authority parameters to form ϕ ∈ Φ . The hover state occurs at x = 0, u ∝ 1, where the constant input counteracts gravity. The linearized dynamics are given by ˙ x= 0 I 0 0 0 0 G 0 0 0 0 I 0 0 0 0 | {z } A x+ 0 0 ˆ e z 0 0 0 0 I | {z } U σ z σ θ σ ϕ σ ψ | {z } Σ 1 1 1 1 1 − 1 − 1 1 − 1 − 1 1 1 1 − 1 1 − 1 | {z } V ⊤ u, G= 0 g 0 − g 0 0 0 0 0 , where g is the gravitational constant and ˆ e z = [0 0 1] ⊤ . The parameters (σ z , σ θ , σ ϕ , σ ψ ) denote the thrust, roll, pitch, and yaw authority constants respectively. Since we use the conventionσ ∈ [ 1 θ ,1], the maximum value of each constant can be varied by scaling the columns ofU. 6.3 Relatedwork Suboptimal coverings are closely related to several topics in control theory. 
Robust control synthesis under parametric uncertainty (Dullerud and Paganini, 2000) can be interpreted as seeking a policy that performs well on all of Φ without observing the particularϕ ∈Φ . Most problem formulations in robust synthesis admit problem instances with no solution; the goal is to find a robust policy if one exists. Gain-scheduled control considers a multi-system setup identical to ours, while adaptive control control adds the compli- cation thatϕ is not known to the policy. Adaptive and gain-scheduled policies of the self-tuning type synthesize a single-system policy after estimating ϕ , but this relies on the assumption that control synthesis can be computed quickly (Åström and Wittenmark, 2013). In contrast, methods of the multi-model type use a precomputed set of policies (Murray-Smith and Johansen, 1997). All multi-model methods impose some kind of coverage condition on the policy set. In a stable cover, each ϕ ∈ Φ is stabilized by at least one policy. Researchers often 134 focus more on the switching rule than the policy set. For example, Fu and Barmish (1986); Stilwell and Rugh (1999); Yoon et al. (2007) non-constructively assert the existence of a finite cover by continuity and compactness arguments. To address the need for small covers, Anderson et al. (2000); McNichols and Fadali (2003); Tan et al. (2004); Fekri et al. (2006); Du et al. (2012) propose constructive algorithms for various classes ofΦ , sometimes with arguments for minimality but without bounds on the covering number. Jalali and Golmohammad (2012) bound stability covering numbers in terms of worst-case Vinnicombe metric distances and sensitivity function norms across the system set. The most closely related work to ours is from Fu (1996), who shows a tight bound of2 n for the stability covering number of a relatively broadΦ . This result is complementary to ours: suboptimality is a stronger criterion than stability, but our class of Φ is more restrictive. We are not aware of prior work that bounds covering numbers in a setup based on local suboptimality, as opposed to a single global performance measure. Multi-system control is also a popular topic in deep learning research, where it is often motivated by ideas of lifelong skill acquisition in robotics. Domain randomization methods follow the spirit of robust control (Peng et al., 2017), but usually optimize for the average case instead of a worst-case guarantee. Many methods where the policy observesϕ use architectural constructs that can only be applied to finite system sets (Yang et al., 2017; Parisotto et al., 2016; Devin et al., 2017). A common approach for infinite system spaces is to treatϕ as a vector input alongside the system state. Yu et al. (2017) and Chen et al. (2018) use this approach for dynamics parameters; Schaul et al. (2015) use it for navigation goals. There is evidence that policy class influences these methods: in a recent benchmark (Yu et al., 2019), the concatenated-input architecture that supports infinite system spaces trails the multi-head architecture that only supports finite system spaces. Other investigations into the difficulty of learning policies for multi-system control include methods to condition the multi-system optimization landscape (Yu et al., 2020) or balance disparate cost ranges (van Hasselt et al., 2016). 135 6.4 Theoreticalresults In this section we show logarithmic upper and lower bounds on the growth of N cov α (Φ θ ) in θ for scalar DDF problems. 
We present several intermediate results in matrix form because they are needed for our empirical results later. We begin with a key lemma in the framework ofguaranteedcostcontrol (GCC) from Petersen and McFarlane (1994), simplified for our use case. Lemma 6.4.1 (GCC synthesis, Petersen and McFarlane (1994)). Given the multi-system LQR problem de- fined by A={A},B={B 1 ∆+ B 2 :∥∆ ∥≤ 1}, whereB 1 ,B 2 ∈R m× p are arbitrary for arbitraryp, and the state cost matrix isQ≻ 0, if there existsτ > 0 such thatP ≻ 0 solves the Riccati equation A ⊤ P +PA+P 1 τ B 1 B ⊤ 1 − 1 1+τ B 2 B ⊤ 2 P +Q=0, (6.4) then the controller K =− 1 1+τ B ⊤ 2 P has costJ B (K)≤ tr(P) for allB∈B. Also,tr(P) is a convex function ofτ . We use the notation P,τ,K = GCC(A,B 1 ,B 2 ,Q) to indicate that P,τ solve (6.4) and K is the cor- responding controller. It is straightforward to show that any DDF problem can be expressed in the form required by Lemma 6.4.1 with additional constraints on∆ . In the original presentation, Petersen and McFarlane (1994) treatB 1 as given, so they accept that (6.4) may have no solution. (For example, any timeB 2 =0 andA has unstable eigenvalues, due to uncontrol- lability.) Our application requires constructing values ofB 1 ,B 2 that guarantee a solution, motivating the following lemma. We abbreviate the reference text Lancaster and Rodman (1995) as Lan95. 136 Lemma 6.4.2 (existence ofα -suboptimal GCC). For the DDF problem (A,U,V,θ ), ifB ∈ B andα > 1, then there existsτ > 0 such that the GCC Riccati equation (6.4) withB 1 = τB andB 2 = B has a solution (P,τ ) satisfyingtr[P]≤ αJ ⋆ B . Proof. For this proof, it will be more convenient to write the algebraic Riccati equation as A ⊤ P +PA− PDP +Q=0, (6.5) where D ⪰ 0. Let D = {D ⪰ 0 : (A,D) is controllable}. Controllability of (A,B) implies that BB ⊤ ∈ D (Lan95, Corollary 4.1.3). Let Ric + denote the map fromD to the maximal solution of (6.5), which is continuous (Lan95, Theorem 11.2.1), and let D α ={D∈D :tr[Ric + (D)]<αJ ⋆ B }. The setD α is open inD by continuity and is nonempty because it containsBB ⊤ . Now define B 1 (τ )=τB forτ ∈(0, 1 2 ). The equivalent ofD in the GCC Riccati equation (6.4) becomes D(τ )=− 1 τ B 1 (τ )B 1 (τ ) ⊤ + 1 1+τ B 2 B ⊤ 2 = 1− τ − τ 2 1+τ BB ⊤ . As a positive multiple ofBB ⊤ , we knowD(τ )∈D, and becauselim τ →0 D(τ ) = BB ⊤ , the set ofτ for whichD(τ )∈D α is nonempty. Any suchτ andB 1 (τ ) provide a solution. Finally, the following comparison result will be useful in several places. Lemma6.4.3 (ARE comparison lemma). Given two algebraic Riccati equations A ⊤ P +PA− PBB ⊤ P +Q=0 and ˜ A ⊤ P +P ˜ A− P ˜ B ˜ B ⊤ P + ˜ Q=0, 137 with maximal solutionsP and ˜ P, letX = h Q A ⊤ A − BB ⊤ i and ˜ X = h ˜ Q ˜ A ⊤ ˜ A − ˜ B ˜ B ⊤ i . IfX⪰ ˜ X, thenP⪰ ˜ P. Proof. Lan95, Corollary 9.1.6. 6.4.1 Scalarupperbound We are now ready to bound the covering number for scalar systems. The first lemma bounding J ⋆ a,b will be useful for the lower bound also. We then construct a cover inductively. Lemma6.4.4. In a scalar LQR problem, ifa > 0 and0 < b≤ 1, then the optimal scalar LQR cost satisfies the bounds 2a b 2 <J ⋆ a,b < 2a+1 b 2 . Proof. The closed-form solution for the scalar Riccati equation is J ⋆ a,b = a+ √ a 2 +b 2 b 2 . The lower bound is immediately visible. The upper bound follows from observing thata 2 +b 2 ≤ (a+1) 2 . Lemma6.4.5. Ifp,τ,k = GCC(a,b 1 ,b 2 ,q), then for anyβ ∈(0,1), there existsk ′ ∈R such that β − 2 p,τ,k ′ = GCC a,βb 1 ,βb 2 ,β − 2 q . Proof. 
In the scalar system, the GCC matrix Riccati equation (6.4) reduces to the quadratic equation 1 τ b 2 1 − 1 1+τ b 2 2 p 2 +2ap+q =0. (6.6) 138 Substitutingp ′ =β − 2 p into (6.6) and multiplying byβ − 2 yields a new instance of (6.6) with the parameters b ′ 1 =βb 1 , b ′ 2 =βb 2 , q ′ =β − 2 q, for whichp ′ is a solution withτ unchanged. Theorem6.4.6. ForthescalarDDFproblemdefinedby A={a},wherea>0,andB= 1 θ ,1 ,ifα ≥ 2a+1 2a , thenN cov α (B)=O(logθ ). Proof. We construct a cover from the upper end ofB. By Lemma 6.4.4, the condition α ≥ 2a+1 2a implies thatJ ⋆ b=1 < α 2a < αJ ⋆ b=1 . Therefore, by Lemmas 6.4.1 and 6.4.2, there existsβ ∈ (0,1) andp,τ,k such that p,τ,k = GCC a, 1− β 2 , 1+β 2 ,1 andp≤ α 2a. Proceeding inductively, suppose that forN ≥ 1, we have covered[β N ,1] by the intervals B n =[β n+1 ,β n ] forn∈{0,...,N− 1}, and eachB n has a controllerk n such that β − 2n p,τ,k n = GCC a, β n − β n+1 2 , β n +β n+1 2 ,β − 2n . Then the existence of the desiredB N ,k N follows immediately from Lemma 6.4.5. By Lemma 6.4.3, for eachB n the GCC state costq n = β − 2n ≥ 1 is an upper bound on the cost if we replaceq n with1 to match the DDF problem. Therefore, for each intervalB n , for allb∈B n , αJ ⋆ b ≥ αJ ⋆ β n >β − 2n α 2a≥ β − 2n p≥ J b (k n ), where first inequality is due to Lemma 6.4.3, the second is due to Lemma 6.4.4, the third is by construction ofp, and last is due to the GCC guarantee ofk n . Hence,B n ⊆ N α (k n ). We cover the fullB whenβ N ≤ 1 θ , which is satisfied by N ≥− logθ/ logβ . 139 6.4.2 Scalarlowerbound For the matching lower bound, we begin by deriving a simplified overestimate of N α (k). We then show that the trueN α (k) is still a closed interval moving monotonically withk. Finally, we argue that the gaps between consecutive elements of a cover grow at most geometrically, while the range ofk values in a cover must grow linearly withθ . Lemma 6.4.7. For a scalar DDF problem with a ≥ 1, B = [ 1 θ ,1], for any k < 0, if α ≥ 3/2, then N α (k)⊆ 1 |k| [c 1 − c 2 , c 1 +c 2 ], wherec 1 andc 2 are constants depending onα anda. Proof. Beginning with the closed-form solution forJ b (k), which can be derived from (6.2), we define J b (k)= 1+k 2 − 2(a+bk) ≥ k 2 − 2(a+bk) ≜J b (k). (6.7) By Lemma 6.4.4, we have J ⋆ b < 3a b 2 ≜J ⋆ b , so˜ r = J b (k) J ⋆ b is a lower bound on the suboptimality ofk. Computing∂ 2 ˜ r/∂b 2 shows that˜ r is strictly convex in b on the domain a+bk < 0, so the α -sublevel set of ˜ r is the closed interval with boundaries where ˜ r =α . This equation is quadratic inb with the solutions b=− a(3α ± √ 9α 2 − 6α ) k . The resulting interval containsN α (k). Lemma 6.4.8. For a scalar DDF problem, if α > 1 and k < − 1, thenN α (k) is either empty or a closed interval[b 1 ,b 2 ], withb 1 andb 2 positive and nondecreasing ink. 140 Proof. The result follows from the quasiconvexity of both the suboptimality ratioJ b (k)/J ⋆ b and the cost J b (k). Showing these requires some tedious calculations and is deferred to §6.6. Theorem 6.4.9. For a scalar DDF problem defined by a = 1 and B = [ 1 θ ,1], if α ≥ 3/2, then N cov α (B)=Ω(log θ ). Proof. From the closed-form solution k ⋆ a,b =− a+ √ a 2 +b 2 b , we observe that k ⋆ b < − 1 for all b ∈ B. This, along with the quasiconvexity of J b (k) in k, implies that there exists a minimal α -suboptimal coverC for which all k i < − 1. SupposeC = k 1 ,...,k N is such a cover, ordered such thatk i < k i+1 . 
Then by Lemma 6.4.8,N α (k i ) andN α (k i+1 ) must intersect, so their overestimates according to Lemma 6.4.7 certainly intersect, therefore satisfying c 1 +c 2 − k i+1 ≥ c 1 − c 2 − k i =⇒ k i+1 k i ≤ c 1 +c 2 c 1 − c 2 =⇒ k N k 1 ≤ c 1 +c 2 c 1 − c 2 N− 1 . By Lemma 6.4.7, to coverb=1 controllerk 1 must satisfyk 1 ≥− (c 1 +c 2 ), and to coverb= 1 θ , controller k N must satisfyk N ≤− θ (c 1 − c 2 ). Along with the previous result, this implies c 1 +c 2 c 1 − c 2 N− 1 ≥ θ c 1 − c 2 c 1 +c 2 =⇒ N ≥ logθ log c 1 +c 2 c 1 − c 2 . Recalling thatc 1 andc 2 only depend ona andα , theΩ(log θ ) dependence onθ is established. Remarks • For the upper bound, it may be possible to compute or boundβ in the scalar case as a function ofa andα , but the analogous result will likely be harder to obtain in the matrix case. 141 • These results impose a lower bounds on α greater than 1. We believe this is a mild condition in control applications: if the application demands a suboptimality ratio very close to 1, then the size of the suboptimal cover is likely to become impractical for storage. However, further theoretical results building upon suboptimal coverings may require eliminating the bound. 6.5 Empiricalresults For matrix DDF problems, we present empirical results as a first step towards covering number bounds. The proof technique of §6.4 is not easily extended to thed>1 case. We discuss the difficulties and our in- termediate results further in §6.7. In this section, we empirically validate a possible cover construction and use visualization better understand the topological and geometric properties of suboptimal neighborhoods whend>1. 6.5.1 Geometricgridconstructionforupperbounds We begin by testing a cover construction. If the construction fails to achieve a conjectured upper bound in a numerical experiment, then either the conjecture is false, or the construction is not efficient. A natural idea is to extend the geometrically spaced sequence ofb values from Lemma 6.4.4 to multiple dimensions. We now make this notion, illustrated in Figure 6.2, precise. 0.1 0.5 1.0 σ 1 0.1 0.5 1.0 σ 2 Figure 6.2: Illustration of geometric grid partition (Definition 6.5.1). 142 Definition 6.5.1 (Geometric grid partition). Given a DDF problem with Σ = [ 1 θ ,1] d , and a grid pitch k∈N + , selects 1 ,...,s k+1 such thats 1 = 1 θ ,s k+1 =1, and s i+1 s i >0 is constant. For eachj∈{1,...,k} d , define the grid cell Σ (j) = Q d i=1 [s j(i) ,s j(i)+1 ], where j(i) is the i th component of j. The cells satisfy Σ = S j∈{1,...,k} dΣ (j), thus forming an partition (up to boundaries) ofΣ intok d cells. 6.5.1.1 EmpiricalupperboundonN cov α (Φ) . 10 0 10 1 10 2 θ 1 5 10 grid pitchk log Figure 6.3: Empirical upper bound on grid pitchk needed to construct geometric grid covering of linearized quadrotor using GCC synthesis. In this experiment, we construct2-suboptimal covers of the linearized quadrotor for varyingθ using geometric grids. We begin with the guessk = 1. For each grid cellΣ (j), we compute a controllerK(j) using GCC synthesis and check ifΣ (j)⊆ N 2 (K(j)). (This requires evaluating only one Lyapunov equa- tion due to Lemma 6.4.3.) If not, we incrementk and try again. Termination is guaranteed by continuity. Results for this experiment with θ ∈ [1,100] are shown in Figure 6.3. The required grid pitch k follows roughly logarithmic growth, as indicated by the linear least-squares best-fit curve in black. 
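The grid in this experiment follows Definition 6.5.1 directly. Below is a minimal sketch of how the cells can be enumerated; this is our own illustrative code (not the code used for the experiment), with `theta` the breadth, `d` the rank, and `k` the grid pitch.

import itertools
import numpy as np

def geometric_grid_cells(theta, d, k):
    """Enumerate the k**d cells of the geometric grid partition of
    Sigma = [1/theta, 1]**d from Definition 6.5.1. Each cell is a
    (lo, hi) pair of length-d vectors giving the per-axis interval
    [s_{j(i)}, s_{j(i)+1}] of singular values."""
    # Breakpoints s_1 = 1/theta < ... < s_{k+1} = 1 with constant ratio.
    s = np.geomspace(1.0 / theta, 1.0, k + 1)
    for j in itertools.product(range(k), repeat=d):
        yield s[list(j)], s[[ji + 1 for ji in j]]

# Example: breadth theta = 10, rank d = 2, pitch k = 3 gives 3**2 = 9 cells.
for lo, hi in geometric_grid_cells(theta=10.0, d=2, k=3):
    print(np.round(lo, 3), np.round(hi, 3))

Per Lemma 6.4.3, the worst case over each cell occurs at its lower corner `lo`, which is why the coverage check described above needs only one Lyapunov equation per cell.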
Small values ofθ are excluded from the fit (indicated by black markers), as we do not expect the asymptotic behavior to appear yet. 143 These results do not rule out thelog(θ ) d growth suggested by the geometric grid construction. Testing larger values ofθ is computationally difficult because the number of grid cells becomes huge and the GCC Riccati equation (6.4) becomes numerically unstable for very smallΣ . 6.5.1.2 Efficiencyofgeometricgridpartition. 1 1 10 σ 4 1.9 1.8 1.9 1.8 σ 2 = 1 10 1.8 1.4 1.8 1.5 σ 1 =1 σ 2 =1 1 10 1 σ 3 1 1 10 σ 4 1.9 1.8 2 1.9 1 10 1 σ 3 1.8 1.5 1.9 1.6 σ 1 = 1 10 Figure 6.4: Suboptimality ratios for corner cells in geometric grid covering of linearized quadrotor. Given anα -suboptimal geometric grid cover, we examine a measurable quantity that may reflect the “efficiency” of the cover. Intuitively, in a good cover we expect the worst-case suboptimality ratio of each controllerK(j) relative to its grid cellΣ (j) to be close toα . If it close toα for some cells but significantly less thanα for others, then the grid pitch around the latter cells is finer than necessary. We visualize results for this computation on the linearized quadrotor withθ = 10, k = 4 in Figure 6.4 — only the corners of the 4× 4× 4× 4 grid are shown. The suboptimality ratio is close to α = 2 for cells with low control authority (nearΣ = 1 θ I), but drops to around1.4 for cells with high control authority (nearΣ = I). The difference suggests that the geometric grid cover could be more efficient in the high-authority regime. 6.5.1.3 EfficiencyofGCCsynthesis. One possible source of the conservativeness of GCC in the high-authority regime is that Lemma 6.4.1 applies to the affine image of a m× n-dimensional matrix norm ball, but we only require guaranteed 144 0.03 1 σ 2 α = 1.05 α = 1.1 A =I α = 1.35 0.03 1 σ 1 0.03 1 σ 2 0.03 1 σ 1 0.03 1 σ 1 A = 1 n 1 Figure 6.5: α -suboptimal neighborhoods for geometric grid partition in 2D system. Top: minimum cou- pling;A=I. Bottom: maximum coupling;A= 1 n 1. Columns: varying suboptimality thresholdα . All axes are logarithmic. Colors have no meaning. Discussion in §6.5.2. cost on a d-dimensional affine subspace of matrices. In other words, we ask GCC synthesis to ensure α -suboptimality on systems that are not actually part ofΦ . If this is negatively affecting the result, then we should observe that the worst-case cost ofK(j) onΣ (j) is less than the trace of the solutionP for the GCC Riccati equation (6.4). The worst-case cost always occurs at the minimalΣ ∈Σ (j) by Lemma 6.4.3; we evaluate it with (6.2). For the quadrotor, a mismatch sometimes occurs for smaller values ofθ , but it does not occur for the large values ofθ . 6.5.2 Suboptimalneighborhoodvisualizations We now present intuition-building experiments towards a covering number lower bound for matrix DDF problems. A lower bound requires a class of DDF problem that can be instantiated for any dimensionality d. Two “extremal” systems come to mind: minimumcoupling, whereA=I, andmaximumcoupling, where A= 1 n 1. Note that for minimum coupling, anα -suboptimal policy is not necessarilyα -suboptimal on each scalar subsystem – if it were, the lower boundlog(θ ) d would trivially follow from the results in §6.4. 145 Figure 6.6: α -suboptimal neighborhoods for the three-dimensional decomposed dynamics system with minimal coupling (A = U = V ⊤ = I 3× 3 ) and breadth θ = 100. Neighborhoods shown for α ranging from1.04 to1.2 with a fixed controller. 
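The suboptimality ratios behind Figures 6.4–6.6 reduce to two computations: the optimal cost J⋆_B = tr(P) from an LQR Riccati equation and the cost J_B(K) of a fixed gain from a Lyapunov equation. A minimal sketch of this primitive for the decomposed-dynamics setting with Q = R = I follows; it is our own illustrative reimplementation, not the code used to produce the figures.

import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def opt_cost(A, B):
    """J*_B = tr(P), with P the stabilizing ARE solution (Q = R = I)."""
    n, m = B.shape
    P = solve_continuous_are(A, B, np.eye(n), np.eye(m))
    return np.trace(P)

def cost(A, B, K):
    """J_B(K) = tr[(I + K^T K) X] with (A+BK) X + X (A+BK)^T + I = 0.
    Returns +inf if u = Kx does not stabilize (A, B)."""
    n = A.shape[0]
    Acl = A + B @ K
    if np.max(np.linalg.eigvals(Acl).real) >= 0:
        return np.inf
    X = solve_continuous_lyapunov(Acl, -np.eye(n))
    return np.trace((np.eye(n) + K.T @ K) @ X)

def subopt_ratio(A, U, V, sigma, K):
    """Suboptimality ratio of K on the DDF system B = U diag(sigma) V^T."""
    B = U @ np.diag(sigma) @ V.T
    return cost(A, B, K) / opt_cost(A, B)

# Example: minimum-coupling system A = I, U = V = I in two dimensions.
n = 2
A, U, V = np.eye(n), np.eye(n), np.eye(n)
B0 = U @ np.diag([0.5, 0.5]) @ V.T
K0 = -B0.T @ solve_continuous_are(A, B0, np.eye(n), np.eye(n))  # LQR-optimal gain for sigma = (0.5, 0.5)
print(subopt_ratio(A, U, V, [0.4, 0.6], K0))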
We show approximate suboptimal neighborhoods for a two-dimensional system in Figure 6.5. We select a geometric grid ofΣ values (indicated by the circular markers) and synthesize their LQR-optimal controllers. Then, we evaluate the suboptimality ratio of each controller on a finer grid of Σ values to get approximate neighborhoods, indicated by the semi-transparent regions. We repeat this experiment with three values ofα for both choices ofA. Interestingly, the neighborhoods for A = I are not always connected. In the plot for α =1.05 (far left), the neighborhood for the minimal Σ has another component that overlaps other neighborhoods to its top and right. If we increase toα = 1.1, the components join into an “L”-shaped region. In contrast, the neighborhoods forA= 1 n 1 seem more well-behaved. For both choices ofA, the neighborhoods are of comparable size. To verify that this behavior is not an artifact of the two-dimensional case only, we repeat the experi- ment in three dimensions. Figure 6.6 shows neighborhoods of one controllerK = K ⋆ (2/θ )I forα ranging from 1.04 to 1.2. As α grows,N α (K) shows similar topological phases as the 2D case. In the simply- connected phase (largeα ), the neighborhood appears to include anyΣ where at least oneσ i is sufficiently small. If this property holds in higher dimensions, then it would be possible to construct a cover using only controllers of uniform gain in all dimensions for largeα . 146 6.6 ProofofLemma6.4.8 We first present some supporting material. The following facts about scalar LQR problems can be derived from the LQR Riccati equation and some calculus (not shown). Lemma6.6.1. For the scalar LQR problem witha > 0,b > 0 andq = r = 1, the optimal linear controller k ⋆ a,b is given by the closed-form expression k ⋆ a,b =min k∈R J a,b (k)=min k∈R 1+k 2 − 2(a+bk) =− a+ √ a 2 +b 2 b . Forfixed a,themapfrombtok ⋆ a,b iscontinuousandstrictlyincreasingonthedomainb∈(0,∞)andhasthe range(−∞ ,− 1). For anyk∈ (−∞ ,− 1), there exists a uniqueb k ∈ (−∞ ,− 1) for whichk = k ⋆ a,b k , given by b k = 2ak 1− k 2 . We now recall the statement of the lemma. Lemma 6.4.8. For a scalar DDF problem, if α > 1 and k < − 1, thenN α (k) is either empty or a closed interval[b 1 ,b 2 ], withb 1 andb 2 positive and nondecreasing ink. Instead of a monolithic proof, we present supporting material in Lemmas 6.6.2 and 6.6.3. We then show the main result in Lemma 6.6.4, which considersα -suboptimal neighborhoods on all ofR instead of restricted toB. Lemma 6.4.8 will follow as a corollary. We proceed with more setup. Recall that the scalar DDF problem is defined by A={a} andB=[ 1 θ ,1], wherea>0. For this section, let D ={(b,k)∈(0,∞)× R:a+bk <0} 147 (note that J b (k) < ∞ ⇐⇒ a +bk < 0). Denote its projections byD b (k) = {b : (b,k) ∈ D} and D k (b)={k :(b,k)∈D}. We compute the suboptimality ratior :D7→R by r(b,k)= J b (k) J ⋆ b = 1+k 2 − 2(a+bk) , a+ √ a 2 +b 2 b 2 = (1+k 2 )b 2 − 2(a+bk)(a+ √ a 2 +b 2 ) . We denote its sublevel sets with respect tob for fixed k by D b α (k)={b∈D b (k):r(b,k)≤ α }. Lemma 6.6.2. For fixed k < 0, the ratior(b,k) is quasiconvex onD b k , and there is at most oneb ∈ D b k at which∂r/∂b=0. Proof. By inspection, r(b,k) is smooth on D b . We now show that the second-order condition of Lemma 2.4.4(b) holds. To solve ∂r/∂b = 0 for b, we multiply ∂r/∂b (not shown due to length) by the strictly positive factor 2(a+bk) 2 a+ √ a 2 +b 2 2√ a 2 +b 2 ab(k 2 +1) and set the result equal to zero to get the equation 2a 2 +abk+b 2 =(− 2a− bk) p a 2 +b 2 . 
Squaring both sides (which may introduce spurious solutions) and collecting terms yields the equation − 2ak− bk 2 +b = 0, with the solutionb = 2ak 1− k 2 . This is the expression forb k from Lemma 6.6.1. Note that it is only positive fork <− 1. Ifk ∈ [− 1,0), then there are no stationary points inD b k . Otherwise, substitution into∂r/∂b confirms that this solution is not spurious, so it is the only stationary point of r 148 with respect tob. We now must check the second-order condition fork <− 1. Evaluating∂ 2 r/∂b 2 (not shown due to length) and multiplying by the strictly positive factor − a+ √ a 2 +b 2 (2a+2bk) k 2 +1 , we have sign ∂ 2 r ∂b 2 =sign b 4 a+ √ a 2 +b 2 (a 2 +b 2 ) 3 2 + 2b 4 a+ √ a 2 +b 2 2 (a 2 +b 2 ) + 2b 3 k (a+bk) a+ √ a 2 +b 2 √ a 2 +b 2 + 2b 2 k 2 (a+bk) 2 − 5b 2 a+ √ a 2 +b 2 √ a 2 +b 2 − 4bk a+bk +2 . Evaluating at the stationary pointb k , this reduces to sign ∂ 2 r ∂b 2 b k ,k =sign 2k 2 (k− 1) 2 (k+1) 2 (k 2 +1) 3 ! . (6.8) Recalling thatk <− 1, the sign is positive. The conclusion follows from Lemma 2.4.4(b). Lemma6.6.3. Forfixed b,thecostJ b (k)isquasiconvexonD k (b). Also,J b (k)isnotmonotonic,socase3. of Lemma 2.4.4(a) applies. Proof. We have J b (k)= 1+k 2 − 2(a+bk) . The numerator is nonnegative and convex on k ∈ R. The denominator is linear (hence concave) and positive onD k (b). Quasiconvexity follows from Lemma 2.4.4(c). Nonmonotonicity follows from the fact thatJ b (k) is smooth onD k (b) and has a unique optimum atk ⋆ b , which is not on the boundary ofD k (b). We now combine these into the main result. 149 Lemma 6.6.4. For a scalar DDF problem, if α > 1 and k < − 1, thenD b α (k) is either: a bounded closed interval[b 1 ,b 2 ], withb 1 andb 2 increasing ink, or a half-bounded closed interval[b 1 ,∞), withb 1 increasing ink. Proof. By Lemma 6.6.2, due to quasiconvexityD b α is convex. The only convex sets onR are the empty set and all types of intervals: open, closed, and half-open. We knowD b α is not empty because it containsb k . We can further assert thatD b α has a closed lower bound becauselim b→(− a/k) r(b,k) =∞ (see Boyd and Vandenberghe (2004, §A.3.3) for details). However, the upper bound may be closed or infinite. We handle the two cases separately. Bounded case. Fix k 0 < − 1. Suppose D b α (k 0 ) = [b 1 ,b 2 ] for 0 < b 1 < b 2 < ∞. By the implicit function theorem (IFT), at any (b 0 ,k 0 ) satisfying r(b 0 ,k 0 ) = α , if ∂r/∂b| b 0 ,k 0 ̸= 0 then there exists an open neighborhood around (b 0 ,k 0 ) for which the solution to r(b,k) = α can be expressed as (g(k),k), whereg is a continuous function ofk and ∂g(k) ∂k k 0 = − ∂r ∂b − 1 ∂r ∂k b 0 ,k 0 . By the continuity and quasiconvexity ofr, and the fact that∂r/∂b=0 only atb k (Lemma 6.6.2) we know thatr(b 1 ,k 0 )=r(b 2 ,k 0 )=α and ∂r ∂b b 1 ,k 0 <0 and ∂r ∂b b 2 ,k 0 >0. By Lemma 6.6.1, since k < − 1 there exists b k > 0 satisfying k = k ⋆ b k . Since r(b k ,k) = 1 and α > 1, we know b k 0 ∈ (b 1 ,b 2 ). Again by Lemma 6.6.1, the map from b to k ⋆ b is increasing in b. 150 Therefore, k ⋆ b 1 <k 0 <k ⋆ b 2 . By the quasiconvexity and nonmonotonicity of J b (k) from Lemma 6.6.3, via Lemma 2.4.4(a) we have ∂r ∂k b 1 ,k 0 ≥ 0 and ∂r ∂k b 2 ,k 0 ≤ 0. Therefore, the functionsg 1 ,g 2 satisfying the conclusion of the IFT in the neighborhoods around(b 1 ,k 0 ) and(b 2 ,k 0 ) respectively also satisfy ∂g 1 (k) ∂k b 1 ,k 0 ≥ 0 and ∂g 2 (k) ∂k b 2 ,k 0 ≥ 0. Therefore,b 1 andb 2 are locally nondecreasing ink. Unboundedcase. SupposeD b α (k)=[b 1 ,∞) forb 1 <∞. 
By the same IFT argument as in the bounded case, b 1 is increasing in k. By the quasiconvexity of r in b, the value of r is increasing for b > b k , but the definition of D b α (k) implies that r(b,k) ≤ α for all b > b k . Therefore, lim b→∞ r(b,k) exists and is bounded byα . In particular, lim b→∞ r(b,k)= lim b→∞ (1+k 2 )b 2 − 2(a+bk)(a+ √ a 2 +b 2 ) = lim b→∞ (1+k 2 )b 2 /b 2 − 2(a+bk)(a+ √ a 2 +b 2 )/b 2 =− 1+k 2 2k . Taking the derivative shows that this value is decreasing ink fork <0. Therefore, ifk <k ′ <0 then lim b→∞ r(b,k ′ )≤ lim b→∞ r(b,k)≤ α. The property that r(b,k ′ ) is increasing in b for b > b k further ensures that r(b,k ′ ) ≤ α for all b > b k . Therefore,D b α (k ′ ) is also unbounded. 151 For completeness, we prove Lemma 6.4.8. Proof. (of Lemma 6.4.8). By Lemma 6.6.4,D b α (k) is either a bounded closed interval[b 1 ,b 2 ], withb 1 andb 2 increasing in k, or a half-bounded closed interval [b 1 ,∞), with b 1 increasing in k. Recall thatN α (k) = D b α (k)∩B withB = [ 1 θ ,1]. Therefore, the half-bounded case can be reduced to the bounded case with b 2 =1. The intersection can be expressed as N α (k)=[max{b 1 , 1 θ },min{b 2 ,1}], where the interval [a,b] is defined as the empty set if a > b. Taking the maximum or minimum of a nonstrict monotonic function and a constant preserves the monotonicity, so we are done. 152 6.7 Effortstowardsmatrixcase In this section, we present intermediate results in our effort to prove or disprove the following conjecture: Conjecture 6.7.1. For a general DDF problem as defined in Definition 6.2.2 with rank d and breadth θ , N cov α (B)∈O(d logθ ). This conjecture is not particularly strong – essentially, it posits that the covering number may suffer from a “curse of dimensionality” with respect to d, but the dependency on θ matches that of the scalar case. We also visualize suboptimal neighborhoods for multi-system LQR problems where the variations are in theA matrix instead of theB matrix. To prove an upper bound for the matrix case using the geometric grid construction (Definition 6.5.1) with a similar proof technique to that of Theorem 6.4.6, the key obstacle appears to be Lemma 6.4.5. The proof of Lemma 6.4.5 relies on commutativity of scalar multiplication, which no longer holds in the matrix case. 6.7.1 Easycase: ScalarmultiplesofB As an intermediate step towards Conjecture 6.7.1, we can address the case whereA is arbitrary butB is a one-dimensional subspace as in the scalar case. To be precise, letB={σB 0 :σ ∈[ 1 θ ,1]}. It turns out that this case is not much more difficult than the scalar case, even though the unforced dynamics ˙ x=Ax can exhibit oscillation due to complex eigenvalues, may contain a mixture of stable and unstable eigenvalues, and so on. Whereas the proof for Theorem 6.4.6 relied on the closed-form solution for optimal LQR cost, here we will use a less constructive method. Our proof will require a more generic lower limit ofα , of which the constraintα ≥ 2a+1 2a for the scalar setting was a special case. In what follows, we will use the notation [s,t]B ={σB :σ ∈[s,t]} 153 and refer to such a set as an “interval”. For legibility, when referring to an LQR problem(A,B,Q,R) with variations in matrices other thanB, we will also use the notationJ ⋆ (A,B,Q,R) instead of the notation J ⋆ A,B,Q,R for the LQR problem’s optimal cost. We begin with the following generalization of Lemma 6.4.5: Lemma6.7.2. 
For(A,B) controllable andQ⪰ 0, ifP 0 is the maximal solution of the ARE A ⊤ P +PA− PBB ⊤ P +Q=0 associated with the LQR problem(A,B,Q,I), then for anyβ > 0, the matrixβ − 2 P 0 solves the ARE A ⊤ P +PA− β 2 PBB ⊤ P +β − 2 Q=0, (6.9) and is the unique optimal cost matrix for the LQR problem defined by (A,βB,β − 2 Q,I). Proof. It is easy to see thatβ − 2 P 0 solves the ARE (6.9). In general, AREs may have more than one positive semidefinite solution, and we would need to show that β − 2 P 0 is maximal to ensure that it actually rep- resents the LQR optimal cost matrix. However, for AREs defined by well-posed LQR problems, it can be shown that the LQR cost matrix is theunique positive semidefinite solution of the associated ARE (Lan95, Theorem 16.3.3). Therefore, we are assured thatβ − 2 P 0 is also the LQR cost matrix. Theorem6.7.3. Forthemulti-systemLQRproblemdefinedby A={A}andB=[ 1 θ ,1]B 0 ,where(A,B 0 ) is controllable, with fixed cost matrices Q≻ 0 andR=I, if there exists a constantc such that J ⋆ (A,B,sQ,sI)≤ cJ ⋆ (A,B,Q,sI) for alls>0, then ifα>c , we haveN cov α (B)=O(logθ ). 154 Proof. LetP 0 denote the solution to the ARE A T P +PA− PB 0 B T 0 P +Q=0, (6.10) so thatJ ⋆ B 0 =tr(P 0 ). By Lemmas 6.4.1 and 6.4.2, there existsβ ∈(0,1) andP GCC ,τ,K such that P GCC ,τ,K = GCC A, 1− β 2 B 0 , 1+β 2 B 0 ,Q (6.11) andtr(P GCC )≤ α c J ⋆ B 0 . Then for allB∈[β, 1]B 0 , we have J B (K)≤ tr(P GCC )≤ α c J ⋆ B 0 ≤ α c J ⋆ B , where the first inequality follows from the definition of GCC synthesis and the last inequality follows from the ARE comparison lemma (Lemma 6.4.3). Now let us consider covering [β N ,1]B 0 by the intervals{B n } N− 1 n=0 , whereB n = [β n+1 ,β n ]B 0 . By applying Lemma 6.7.2 to the Riccati equation (6.10) we find that J ⋆ (A,β n B 0 ,β − 2n Q,I)=β − 2n tr(P 0 )=β − 2n J ⋆ B 0 . Invoking Lemma 6.7.2 on the GCC Riccati equation (6.4) with the base case of (6.11) yields a controllerK n for eachB n such that β − 2n P GCC ,τ,K n = GCC A, β n − β n+1 2 B 0 , β n +β n+1 2 B 0 ,β − 2n Q . 155 The ARE comparison lemma with respect to the state cost matrix then implies that, for allB∈B n , J B (K n )≤ β − 2n tr(P GCC ). Putting it all together, for eachB n andK n , we have J B (K n )≤ β − 2n tr(P GCC )≤ β − 2n α c J ⋆ B 0 = α c J ⋆ (A,β n B 0 ,β − 2n Q,I)≤ αJ ⋆ (A,β n B 0 ,Q,I)≤ αJ ⋆ B for allB∈B n , where the second-to-last inequality is due to the hypothesis and the last is due to the ARE comparison lemma. Therefore,B n ⊆ N α (K n ). We cover the fullB whenβ N ≤ 1 θ , which is satisfied by N ≥− logθ/ logβ . Recalling that the original choice ofβ did not depend onθ , we are done. In the next section, we elaborate on the cost ratio limitc in the hypothesis of Theorem 6.7.3 and how it differs between the scalar and matrix cases. 6.7.2 Roleofα ’slowerbound We have been focusing on what happens asB is scaled down towards zero. From the LQR Riccati equation (2.31), we can see that scaling downB gives the same ARE solution as scaling up the control costR. To be precise, J ⋆ (A, 1 s B,Q,R)=J ⋆ (A,B,Q,s 2 R) for anys̸=0. However, in our proof for the scalar case, we do something equivalent to J ⋆ (A, 1 s B,Q,R)=J ⋆ (A,B,Q,s 2 R)≤ J ⋆ (A,B,s 2 Q,s 2 R). (6.12) 156 The role of α is to accommodate for the looseness of that inequality. In the scalar case we (implicitly) proved that the ratio J ⋆ (A,B,sQ,sR) J ⋆ (A,B,Q,sR) (6.13) never gets bigger than (2a+1)/2a. In Figure 6.7, we plot the ratio from Equation (6.13) for the scalar system whena=1. 
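A minimal sketch of this ratio computation, using a standard ARE solver on 1×1 matrices (our own illustrative code):

import numpy as np
from scipy.linalg import solve_continuous_are

def J_star(a, b, q, r):
    """Optimal LQR cost tr(P) for the scalar system xdot = a x + b u
    with state cost q and control cost r."""
    P = solve_continuous_are(np.array([[a]]), np.array([[b]]),
                             np.array([[q]]), np.array([[r]]))
    return P[0, 0]

a, b = 1.0, 1.0
for s in [1.0, 10.0, 100.0, 1e3, 1e4]:
    # Ratio (6.13): J*(a, b, s q, s r) / J*(a, b, q, s r) with q = r = 1.
    # From the scalar closed form, this tends to (1 + sqrt(2))/2 ~ 1.207.
    print(s, J_star(a, b, s, s) / J_star(a, b, 1.0, s))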
It converges to a value slightly larger than1.2. 10 0 10 1 10 2 10 3 10 4 s 1.0 1.1 1.2 J ⋆ (a,b/s,q,r) J ⋆ (a,b/s,s 2 q,r) Figure 6.7: “Approximation error” accounted for byα> 2a+1 2a assumption in scalar upper bound proof. We can also plot the ratio from Equation (6.13) for non-scalar LQR problems. We sample random A andB matrices with dimensions2≤ n,m≤ 10 and entries i.i.d. normally distributed. We reject samples whereρ + (A)≤ 0. Otherwise, we scaleA such thatρ + (A)=1, and scaleB such that∥B∥ 2,2 =1. These plots are shown in Figure 6.8. 10 0 10 1 10 2 10 3 10 4 s 10 0 10 1 10 2 J ⋆ (A, 1 s B,Q,R) J ⋆ (A, 1 s B,s 2 Q,R) Figure 6.8: Looseness introduced by the inequality (6.12) for random LQR problems. 157 In Figure 6.8, we see that the ratio of Equation (6.13) still appears to converge to a finite value for these examples, but it can be much bigger than 1.2. This explains the constant c in the hypothesis of Theorem 6.7.3. The large values seen in Figure 6.8 make this constraint unsatisfying, because a guarantee that only holds for a suboptimality ratioα ≫ 1 is unlikely to be useful in practice. On the other hand, it seems possible that the asymptotic dependence on θ obtained from adding this constraint would not be any different from the true dependence for arbitrary α . The existence of such a constant c is of course not guaranteed. Based on the empirical behavior in Figure 6.8, we conjecture that it does exist: Conjecture6.7.4. For any well-posed LQR problem(A,B,Q,R), there exists a constantc such that J ⋆ (A,B,sQ,sI)≤ cJ ⋆ (A,B,Q,sI) for alls>0. 6.7.3 FormofRiccatiperturbationforgeometricgridrecursion If we are to use a grid-based construction to prove an upper bound on the suboptimal covering numbers for DDF problems, it will likely involve some kind of induction or dynamic programming-style recursion. If we again start from theΣ= I case and move “down” to less control authority, then we will be recursing fromB = UΣ V ⊤ toB ′ = UΣ ′ V ⊤ with Σ ≻ Σ ′ . If we consider stepping from one grid cell to its face- sharing neighbors, then only one element ofΣ ′ will change. Without loss of generality, we may assume that elementΣ 1,1 is the one changing. Since only one element of Σ ′ changes, the change to B is a rank-one update. It is not immediately clear if this is useful. 158 In the scalar case of § 6.4 and the one-dimensional subspace case of § 6.7.1, we performed a series of algebraic manipulations on the Riccati equation for the optimal cost of the system withB ′ to arrive at a new value of the cost matrixP ′ expressed in terms of the solutionP for the system withB. Unfortunately, we were not able to use such a simple approach for the matrix case. We attempted representing the changes in B, P, Q as both additive and multiplicative perturbations. Ultimately they both failed due to the structural difference between the A ⊤ P+PA term and thePBB ⊤ P term in the Riccati equation. We present details in the next two subsubsections. 6.7.3.1 MultiplicativechangeinP Recall we are trying to recurse from a solution for the Riccati equation A ⊤ P +PA− PBB ⊤ P +Q=0 (6.14) to a solution for the Riccati equation A ⊤ P +PA− PB ′ B ′ ⊤ P +Q=0, withB ′ as described in §6.7.3. Let us first consider the simple case where d =n andU =V =I n× n . We use the notationΣ ′ =MΣ , whereM ⪯ I is diagonal and positive definite, to express the change in Σ as a multiplicative perturbation. The new Riccati equation then becomes A ⊤ P ′ +P ′ A− P ′ MΣΣ MP ′ +Q ′ =0. 
Our initial hope might be to follow the scalar case and test if something like P ′ =M − 1 PM − 1 , Q ′ =M − 1 QM − 1 159 works. This proposed solution converts the Riccati equation above to A ⊤ M − 1 PM − 1 +M − 1 PM − 1 A− M − 1 PΣΣ PM − 1 +M − 1 Q ′ M − 1 ? =0. (6.15) But now, to follow the same pattern as the scalar case, we want to perform some algebraic manipulation to get rid of all instances ofM and arrive back at the hypothesis (6.14), thus showing that eq. (6.15) holds. This does not seem possible. 6.7.3.2 AdditivechangeinP Now we will write the perturbation toB as additive instead of multiplicative. Suppose we have solved the Riccati equation A ⊤ P +PA− PBB ⊤ P +Q=0 forP . Now we recurse fromB toB ′ = B− ∆ B such thatBB ⊤ ⪰ B ′ B ′ ⊤ (and therefore∆ B⪰ 0). We are interested in the solutionP ′ for the Riccati equation A ⊤ P ′ +P ′ A− P ′ B ′ B ′ ⊤ P ′ +Q=0. Due to the ARE comparison lemma (Lemma 6.4.3), we know thatP ′ ⪰ P . If we writeP ′ = P +S with S⪰ 0, we get the Riccati equation A ⊤ (P +S)+(P +S)A− (P +S)B ′ B ′ ⊤ (P +S)+Q=0. 160 Expanding certain terms gives us A ⊤ P +PA+A ⊤ S+SA− (PBB ⊤ P − PB∆ B ⊤ P − P∆ BB ⊤ P +P∆ B∆ B ⊤ P) − PB ′ B ′ ⊤ S− SB ′ B ′ ⊤ P − SB ′ B ′ ⊤ S+Q=0. Subtracting the original Riccati equation yields A ⊤ S+SA− (− PB∆ B ⊤ P − P∆ BB ⊤ P +P∆ B∆ B ⊤ P) − PB ′ B ′ ⊤ S− SB ′ B ′ ⊤ P − SB ′ B ′ ⊤ S =0. Now if we define A ′ =A− B ′ B ′ ⊤ P and group like terms with respect toS, we get A ′ ⊤ S+SA ′ − SB ′ B ′ ⊤ S+P(BB ⊤ − B ′ B ′ ⊤ )P =0. So we have an expression for the change in cost as the solution to another Riccati equation. Regarding the expression A ′ =A− B ′ B ′ ⊤ P, we note that the optimal controller for the original problem was K = − B ⊤ P , so this term is closely related to A+BK, that is, the closed loop dynamics of the original system with its optimal controller. Therefore, by some continuity or gain margin argument, we may be able to argue thatA ′ is Hurwitz. This could be useful. 161 6.7.4 HowwewoulduseboundsoncostchangeduetoB perturbations Although we have not yet derived any useful results about the cost change resulting from a change toB as described in §6.7.3, let us look ahead and think about how such a result might be used by the full proof if we follow the geometric grid construction. Suppose we have a result like: IfP solves the ARE A ⊤ P +PA− PUΣ V ⊤ VΣ U ⊤ P +Q=0 andΣ ′ =diag(1,...,1,β i ,1,...,1)Σ forβ i <1, then the solutionP ′ to the ARE A ⊤ P +PA− PUΣ ′ V ⊤ VΣ ′ U ⊤ P +Q=0 satisfies tr[P ′ ]≤ β − 2 i c i tr[P], whereP ′ is the solution to the ARE for the new Riccati equation andc i > 0 andβ i ∈ (0,1) are some constants specific to the i th dimension ofΣ . The constants c 1 ,...,c d and β 1 ,...,β d should be allowed to depend on A,U,V in arbitrarily complex ways, but they need to be constant. Following the template for the proof of Theorem 6.7.3, we need to obtain two quantities for each cell of the geometric grid: • An upper bound on the numerator of the suboptimality ratio, which comes from a GCC Riccati equation, and • A lower bound on the denominator of the suboptimality ratio, which comes from a standard LQR Ric- cati equation applied to the “worst-case”B in the grid cell, which can be selected using Lemma 6.4.3. 162 In the scalar case, the numerator upper bound came from induction but the denominator lower bound came from the closed-form solution of the scalar Riccati equation. Since there is no closed-form solution for the matrix case, we will likely need some kind of inductive argument for the lower bound as well. 
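Although we have not been able to exploit this identity analytically, it is easy to check numerically. The sketch below (our own illustrative code) solves the AREs for B and for a shrunken B′, and verifies that S = P′ − P satisfies the Riccati equation derived in §6.7.3.2 with A′ = A − B′B′⊤P.

import numpy as np
from scipy.linalg import solve_continuous_are

rng = np.random.default_rng(0)
n, m = 4, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)

# Shrink one column of B so that B B^T >= B' B'^T.
Bp = B.copy()
Bp[:, 0] *= 0.5

P = solve_continuous_are(A, B, Q, R)    # A^T P + P A - P B B^T P + Q = 0
Pp = solve_continuous_are(A, Bp, Q, R)
S = Pp - P
print(np.linalg.eigvalsh(S).min())      # >= 0 by the comparison lemma (Lemma 6.4.3)

# Riccati equation from Section 6.7.3.2 for the cost change S.
Ap = A - Bp @ Bp.T @ P
residual = (Ap.T @ S + S @ Ap - S @ Bp @ Bp.T @ S
            + P @ (B @ B.T - Bp @ Bp.T) @ P)
print(np.abs(residual).max())           # numerically zero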
6.7.5 ExistingRiccatisolutionandperturbationbounds There are many published bounds on spectral properties of either 1) solutions to the ARE, or 2) changes in the solution to an ARE caused by perturbations to its matrix coefficients. Most of these results appear to be very general bounds phrased in terms of properties like matrix norms and minimum/maximum eigenvalues of the matrix coefficients (or their perturbations). It is not clear if any of these bounds are fine-grained enough to be useful with the highly structured setup of rank-one perturbations toB, that we would need to handle in the geometric grid construction for an upper bound onα -suboptimal covering numbers for DDF problems. Wang et al. (1986) give some fairly easy-to-derive and user-friendly bounds under strong assumptions onA: Lemma6.7.5 (Theorems 1 and 2, Wang et al. (1986)). For the algebraic Riccati equation A ⊤ P +PA− PRP +Q=0 whereAisHurwitz, the positive semidefinite solution P satisfies tr(P)≤ λ max (A s )+ p λ max (A s ) 2 +(λ min (R)/n)tr(Q) λ min (R)/n (6.16) and tr(P)≥ λ min (A s )+ p λ min (A s ) 2 +λ max (R)tr(Q) λ max (R) , (6.17) 163 whereA s =(A ⊤ +A)/2. Note that in Lemma 6.7.5, theλ max andλ min functions are only applied to symmetric matrices, so they are only comparing real eigenvalues. Regarding the requirement thatA is Hurwitz, we note that theA ′ in the Riccati equation of §6.7.3.2, whose solution tells us the how much the cost has changed additively, is Hurwitz. Many other Riccati bounds have been published. Bounds on various properties of the solution are collected in the survey of Kwon et al. (1996). Perturbation bounds are given by Konstantinov et al. (2003) and Sun (1998), among many others. 6.7.6 Lowerboundcandidates For a lower bound on the covering number, we may need as a lemma some overestimate of suboptimal neighborhoods, containing them into sets that are easy to work with like boxes, balls, etc. Overestimating suboptimal neighborhoods requires underestimating suboptimality ratio. Underestimating suboptimality ratios requires overestimatingJ ⋆ B and underestimatingJ B (K) for arbitraryK. 6.7.6.1 LowerboundforA=I Although the lower bound candidateA = I leads to complicated topology of suboptimal neighborhoods, as discussed in § 6.5.2, it is easier to work with algebraically than the candidate A = 1 n 1. We therefore begin our attempt to find a covering number lower bound using A=I, even though it may lead to a loose bound, so that we can get any bound at all (beyond the trivial dimension-independent lower bound we get from the scalar case). OverestimatingJ ⋆ B LetP denote the unique positive definite solution for the ARE AP +PA− PBB T P +I =0, 164 in whichA=I andB =diag(σ 1 ,...,σ n )≻ 0. ThenP is a diagonal matrix with entries P ii = 1+ q 1+σ 2 i σ 2 i , where the entry P ii corresponds to the optimal scalar cost for the scalar LQR system with a = 1 and b = σ i , as discussed in §6.4. An easy calculation verifies that the proposed P solves the given ARE. We see thatP ≻ 0 by construction, soP is the unique positive definite solution (Lan95, Theorem 16.3.3). The resulting optimal controllerK ⋆ B =− B T P is diagonal with (K ⋆ B ) ii = 1+ q 1+σ 2 i σ i . We can simplify the result forP by applying Lemma 6.4.4, concluding that tr(P)≤ 3 n X i=1 1 σ 2 i . (6.18) As in the scalar case, as the σ i approach zero asymptotically this “user-friendly” upper bound becomes tight up to the constant factor. Underestimating J B (K). 
We have to account for the possibility that a suboptimal covering contains non-diagonalK’s. However, for now we make the following conjecture: Conjecture 6.7.6. For the DDF problem defined by Φ with A = I and U = V = I, if the controller K∈R n× n is not diagonal, then there exists a diagonalK d such thatN α (K)⊆ N α (K d ). 165 If Conjecture 6.7.6 is true, then we can assume any suboptimal coveringC for this problem is composed entirely of diagonal controllers. We can then again build upon the scalar results by invoking Lemma 6.4.7. This gives us the following expression for diagonalK: J B (K)=− n X i=1 1+K 2 ii 2(1+σ i K ii ) ≥− n X i=1 K 2 ii 2(1+σ i K ii ) . Suboptimalityratio. These two bounds leave us with an optimistic suboptimality ratio estimate which we denote as ˆ r: J B (K) J ⋆ B ≥ ˆ r = n X i=1 K 2 ii 1+σ i K ii − 6 n X i=1 1 σ 2 i . This ratio-of-sums-of-ratios structure is challenging to work with. Considering that the shapes of theα - suboptimal neighborhoods for this system in Figure 6.5 do not appear easy to approximate, we did not devote further to working with this estimate. 6.7.6.2 LowerboundforA= 1 n 1 For the caseA= 1 n 1, we can no longer easily build upon the results from the scalar case. To upper-bound J ⋆ B , we would like to start by applying the upper bound of Lemma 6.7.5, but we cannot do it immediately because 1 n 1 is not Hurwitz. However, Wang et al. (1986) do not clearly state where the stability of A is used in their proof of Lemma 6.7.5. Further investigation is required. 6.7.7 Packing-basedstrategiesforlowerbounds In more well-known (often geometric) covering problems based on metrics/norms, the notion of coverings is closely related to the notion of packings. We can also consider packings for our problem. 166 Definition 6.7.7. Consider a multi-system optimal control problem(X,U,Φ ,Π ref ) and a suboptimality ratioα> 1. Given a systemϕ ∈Φ , we define its α -suboptimal policy neighborhood as P α (ϕ )= ( π ∈Π ref : J ϕ (π ) J ⋆ ϕ ≤ α ) . A set of systemsP ⊆ Φ is anα -suboptimal packing ofΦ if its corresponding family ofα -suboptimal policy neighborhoods{P α (ϕ )} ϕ ∈P is pairwise disjoint. The α -suboptimal packing number of Φ , denoted N pack α (Φ) , is the size of the largest α -suboptimal packing ofΦ . It is clear thatN pack α (Φ) ≤ N cov α (Φ) . Constructing aα -suboptimal packing could be a useful strategy for covering number lower bounds. With packings we have a “for all” condition with respect to the subop- timality ratio of arbitrary controllers for a finite set of Bs. As discussed in §2.9.1.6, the cost of an arbitrary suboptimal controllerK for a particularB depends on the solution to the (linear) Lyapunov equation (A+BK)X +X(A+BK) ⊤ +I =0 and is given by J(K)=tr h (Q+K ⊤ RK)X i . (6.19) On the other hand, the optimal cost for a particularB depends on the solution to the (quadratic) algebraic Riccati equation A ⊤ P +PA− PBR − 1 B ⊤ P +Q=0, but the optimal cost is simply J ⋆ B =tr[P]. 167 So for suboptimal costs we have a simpler matrix equation, butK appears both in the equation coefficients and in a product with the solution to get the final cost. It is not clear which is easier to work with. 6.7.8 Reparameterization The LQR cost is not a convex function of the controller matrix K, but it can be rendered convex by a reparameterization. We follow the presentation of Mohammadi et al. (2019). LetY = KX whereX ≻ 0. 
SubstitutingK =YX − 1 into (6.19) yields J(X,Y)=tr h QX +Y ⊤ RYX − 1 i , whereX solves the same Lyapunov equation as before, which under our reparameterization becomes AX +XA ⊤ − (BY +Y ⊤ B ⊤ )+I =0. Subsequently we name the linear operators (overloading notation with the sets of matrices in the general multi-system LQR problem) A(X)=AX +XA ⊤ , B(Y)=BY +Y ⊤ B ⊤ , and rewrite the Lyapunov equation asA(X)−B (Y) +I = 0. Then, under the assumption thatA is invertible,X becomes an affine function of Y : X(Y)=A − 1 (B(Y)− I). 168 We now denote the set of stabilizing solutions by S Y ={Y ∈R m× n :X(Y)≻ 0}, which is equivalent to the set of stabilizing controllers. We can then define J(Y)= J(X(Y),Y) :Y ∈S Y ∞ :otherwise . Mohammadi et al. (2019) show that, over thea-sublevel setS Y (a) = {Y : J(Y)≤ a}, the costJ(Y) is µ -strongly convex (see §2.4.2) with strong convexity constant µ = 2λ min (R)λ min (Q) a(1+a 2 η ) 2 , where η = ∥B∥ 2 λ min (Q)λ min (I) p νλ min (R) , where ν = λ 2 min (I) 4 ∥A∥ 2 p λ min (Q) + ∥B∥ 2 p λ min (R) ! − 2 . Mohammadi et al. (2019) do not explicitly state the norm to which the strong convexity constantµ applies, but they also give a Lipschitz smoothness constant with respect to the Frobenius norm∥Y∥ F = √ trY T Y , so we we will assume thatµ is also w.r.t. the Frobenius norm. 169 Note that the silly expressionλ min (I) appears because Mohammadi et al. (2019) give the result for an arbitrary initial state covariance, whereas we have already fixed it to I. Further applying our simplifying assumptionsQ=I andR=I, the latter becomes ν = 1 4 (∥A∥ 2 +∥B∥ 2 ) − 2 , so the former becomes η = ∥B∥ 2 q 1 4 (∥A∥ 2 +∥B∥ 2 ) − 2 =2∥B∥ 2 (∥A∥ 2 +∥B∥ 2 ), and thus the strong convexity constant becomes µ = 2 a 1+2a 2 ∥B∥ 2 (∥A∥ 2 +∥B∥ 2 ) 2 . (6.20) We are interested in the growth of the cost as we move away fromY ⋆ B ≜argmin Y∈S Y J(Y). The smooth- ness and convexity of the costJ imply that∇J(Y ⋆ )=0. Therefore, the linear term in the strong convexity definition (2.4) becomes zero when centered on Y ⋆ , leaving the lower bound J(Y)≥ J ⋆ B + µ 2 ∥Y − Y ⋆ B ∥ 2 F . (6.21) Unfortunately this bound seems to be vacuous in numerical experiments. We instantiate it for a scalar LQR problem with A = 1, B = 1 and plot a comparison between the actual LQR cost and the lower bound implied by (6.20) and (6.21) in Figure 6.9. The sublevel set constanta is chosen asa=αJ ⋆ B for two different values: α ∈{1.001,1.2}. Only those values ofY for whichJ B (YX(Y) − 1 )≤ a are shown, that is, we restrict each plot to the domainS Y (a) where the lower bound is valid. 170 0.84 0.86 Y 2.415 2.416 cost α = 1.001 0.8 1.0 1.2 Y 2.4 2.6 2.8 α = 1.2 variable actual lower_bound Figure 6.9: Scalar LQR problem: Actual cost (solid) and lower bound (dashed) based on the strong con- vexity constant derived by Mohammadi et al. (2019). The horizontal axis is the value Y in the convex reparameterizationY =KX as described in §6.7.8. The curvature of the lower bound is barely visible when plotted alongside the true cost. This is dis- appointing. It seems unlikely that this would be useful for bounding covering numbers. Note that this happens even for the small sublevel set induced byα =1.001, where we can see in Figure 6.9 that the true cost is close to a quadratic. Therefore, the lower bound appears loose relative to the second-order Taylor expansion, not just when far from the minimum. 6.7.9 SuboptimalneighborhoodsforvariationsinA Since most of our theoretical efforts do not appear fruitful, we return to empirical work. 
One major area of interest not explored in our original paper was variations in theA matrix instead of theB matrix. The key question is: what kind of sets of A matrices should we consider? While the decomposed dynamics form had an appealing interpretation in terms of actuator strength, applying the same construction to the A matrix does not seem to be as interpretable. Instead of proposing a particular structure for a set of A matrices, we can simply plot suboptimal neighborhoods for well-known control systems without requiring that they all share some structure. We show two-dimensional suboptimal neighborhoods plots analogous to Figure 6.5 for various systems. In all systems, the dynamics are parameterized by two real values, represented by the horizontal and vertical 171 graph axes. We sample a 3× 3 grid of parameter pairs, indicated by points on the plots, and synthesize their LQR-optimal controllers. To visualize approximate suboptimal neighborhoods, we evaluate the sub- optimality ratio of each controller on a finer grid of parameter pairs, indicated by the semi-transparent regions. We repeat each experiment with three values ofα . 6.7.9.1 Cart-polesystem m c m p ℓ θ x u g Figure 6.10: Cart-pole system. 0.2 8 mass 1 3e+01 gravity α = 1.01 0.2 8 mass α = 1.05 0.2 8 mass α = 1.2 Figure 6.11: α -suboptimal neighborhoods for cart-pole system. Each dot indicates an LQR-optimal con- troller for a particular (mass, gravity) pair; the surrounding transparent region indicates itsα -suboptimal neighborhood. Axes are logarithmic. Colors have no meaning. Discussion in §6.7.9.1. As our first example for variations in A, we consider the cart-pole system illustrated in Figure 6.10. The cart rolls on a frictionless surface. A rigid massless rod is attached to a frictionless hinge joint on the cart. At the other end of the rod is a point mass. The state space is the position of the cartx and the angle of the rodθ , and their derivatives. The inputu is a force upon the cart in the positive-x direction. The system has four physical parameters: cart massm c , pole massm p , pole lengthℓ, and gravitational constantg. All 172 are strictly positive. The cart-pole system has an unstable equilibrium atθ = 0 and a stable equilibrium atθ = π . We consider stabilization at theθ = 0 (pole-up) position. Linearizing the nonlinear dynamics (not shown) about the pole-up state yields the system matrices A= 0 0 1 0 0 0 0 1 0 gm p m c 0 0 0 g(m c +m p ) m c ℓ 0 0 , B = 0 0 1 m c 1 m c ℓ . We see that the parametersm c andℓ influence both A andB, but the parametersg andm p only influence A. If we fix m c = 1, ℓ = 1, then we are left with two parameters that affect only A. Although variations in gravity are perhaps not of concern in most applications, we proceed with this experiment and show the results in Figure 6.11. The parameter m p (horizontal axis) varies from 1/4 to 8 and the parameter g (vertical axis) varies from 1 to 32. We observe that the suboptimal neighborhoods are oblong and oriented diagonally to the grid. 6.7.9.2 Tworealeigenvalues 0.03 1 λ 1 0.03 1 λ 2 α = 1.0001 0.03 1 λ 1 α = 1.001 0.03 1 λ 1 α = 1.01 Figure 6.12: α -suboptimal neighborhoods for system in controllable canonical form (CCF) with A hav- ing two positive real eigenvalues. Each dot indicates an LQR-optimal controller for a particular pair of eigenvalues; the surrounding transparent region indicates itsα -suboptimal neighborhood. Axes are loga- rithmic. Colors have no meaning. Discussion in §6.7.9.2. 
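A sketch of the neighborhood computation for the cart-pole example of §6.7.9.1 follows. This is our own illustrative reimplementation, with m_c = 1 and ℓ = 1 fixed as above and Q = R = I assumed; the nominal (mass, gravity) pair is an arbitrary choice.

import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def cartpole_AB(mp, g, mc=1.0, ell=1.0):
    """Linearization of the cart-pole about the pole-up equilibrium,
    with state (x, theta, xdot, thetadot) as in Section 6.7.9.1."""
    A = np.array([[0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, g * mp / mc, 0, 0],
                  [0, g * (mc + mp) / (mc * ell), 0, 0]], dtype=float)
    B = np.array([[0.0], [0.0], [1.0 / mc], [1.0 / (mc * ell)]])
    return A, B

def lqr_gain(A, B):
    P = solve_continuous_are(A, B, np.eye(A.shape[0]), np.eye(B.shape[1]))
    return -B.T @ P

def cost(A, B, K):
    Acl = A + B @ K
    if np.max(np.linalg.eigvals(Acl).real) >= 0:
        return np.inf
    X = solve_continuous_lyapunov(Acl, -np.eye(A.shape[0]))
    return np.trace((np.eye(A.shape[0]) + K.T @ K) @ X)

# Gain synthesized for one (mass, gravity) pair ...
K = lqr_gain(*cartpole_AB(mp=1.0, g=9.8))

# ... evaluated on a finer grid of parameters to approximate its neighborhood.
for mp in np.geomspace(0.25, 8.0, 5):
    row = []
    for g in np.geomspace(1.0, 32.0, 5):
        A, B = cartpole_AB(mp, g)
        row.append(cost(A, B, K)
                   / np.trace(solve_continuous_are(A, B, np.eye(4), np.eye(1))))
    print(np.round(row, 2))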
173 Next we plot suboptimal neighborhoods for systems in controllable canonical form (defined in §2.9.3) whenA has two real eigenvaluesλ 1 ,λ 2 . Because we are mainly interested in stabilizing unstable systems, we select both eigenvalues to be positive: λ 1 ,λ 2 ∈[0.03,1]. Results are shown in Figure 6.12. Recall from §2.9.3 that in controllable canonical form, theA matrix is a permutation-invariant function of the eigenvalues. This is apparent in the plot, where we see that all suboptimal neighborhoods are mirror- symmetric across theλ 1 = λ 2 line. Therefore, we only show suboptimal neighborhoods for the optimal controllers of systems with λ 1 ≥ λ 2 . We observe that the suboptimal neighborhoods are elongated and highly curved. 6.7.9.3 Pairofconjugateeigenvalues 0 1 real 0 1 imag (positive) α = 1.0001 0 1 real α = 1.001 0 1 real α = 1.01 Figure 6.13: α -suboptimal neighborhoods for system in controllable canonical form (CCF) withA having two complex conjugate eigenvalues. Plot corresponds to upper right quadrant of complex plane. Each dot indicates an LQR-optimal controller for the corresponding eigenvalue and its conjugate. The surrounding transparent region indicates its α -suboptimal neighborhood. Axes are linear. Colors have no meaning. Discussion in §6.7.9.3. Next we plot suboptimal neighborhoods for systems in controllable canonical form whenA has a con- jugate pair of complex eigenvaluesλ, λ . In this case,A is parameterized by the two real valuesReλ, Imλ . Again, because we are interested in stabilizing unstable systems, we select Reλ, Imλ ∈ (0,1]. Results are shown in Figure 6.13. 174 Here we observe that the suboptimal neighborhoods are more aligned with the grid, but they are much narrower with respect to Reλ . In other words, the optimal controllers tend to be robust with respect to the LQR cost against variations inImλ , but not inReλ . 6.7.9.4 Spring-mass-damper 1kg k s k d u Figure 6.14: Spring-mass-damper system. There is no gravity. Finally, we plot suboptimal neighborhoods for a spring-mass-damper system under external forces, as illustrated in Figure 6.14. This is the canonical second-order linear ordinary differential equation given by ¨z =− kz− dz+u, where z is the displacement from the spring’s resting length, k is the spring constant, d is the damping constant, and the mass is fixed at 1. Raised into first-order state-space form, the dynamics are ˙ x= " 0 1 − k − d # | {z } A x+ " 0 1 # |{z} B u, where x = h z ˙ z i ⊤ . We observe that A is already in controllable canonical form. Also, unlike our other example systems, the unforced dynamics of this system are either passively stable (whend > 0) or marginally stable (whend=0). Results are shown in Figure 6.15. Here the suboptimal neighborhoods are the closest to balls of all the example systems. Further experiments are warranted to investigate potential links between stability of 175 0 1 spring constant 0 1 damping constant α = 1.0001 0 1 spring constant α = 1.001 0 1 spring constant α = 1.005 Figure 6.15: α -suboptimal neighborhoods for spring-mass-damper system. Each dot indicates an LQR- optimal controller for a particular (spring constant, damping constant) pair; the surrounding transparent region indicates itsα -suboptimal neighborhood. Axes are linear. Colors have no meaning. Discussion in §6.7.9.4. A and the behavior of suboptimal neighborhoods. It is also possible that systems with stableA could be easier to work with analytically, but this requires more theoretical work. 
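For reference, the A matrices used in §§6.7.9.2–6.7.9.4 are all instances of the two-dimensional companion (controllable canonical) form with B = [0, 1]⊤. A sketch of the constructors (our own illustrative code):

import numpy as np

def ccf(a0, a1):
    """Controllable canonical form for the characteristic polynomial
    s^2 + a1*s + a0, with the single input entering the second state."""
    A = np.array([[0.0, 1.0], [-a0, -a1]])
    B = np.array([[0.0], [1.0]])
    return A, B

def from_real_eigs(lam1, lam2):
    # Section 6.7.9.2: two real eigenvalues lam1, lam2.
    return ccf(lam1 * lam2, -(lam1 + lam2))

def from_conjugate_pair(re, im):
    # Section 6.7.9.3: eigenvalues re +/- i*im.
    return ccf(re**2 + im**2, -2.0 * re)

def spring_mass_damper(k, d):
    # Section 6.7.9.4: zddot = -k z - d zdot + u.
    return ccf(k, d)

A, _ = from_real_eigs(0.3, 0.7)
print(np.linalg.eigvals(A))   # [0.3, 0.7] up to ordering

The permutation symmetry noted in §6.7.9.2 is visible here: from_real_eigs depends on the eigenvalues only through their sum and product.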
One such theoretical question is to characterize the conditions under which a suboptimal controller can destabilize an inherently stable system.

6.7.9.5 Discussion

In summary, we visualized the suboptimal neighborhoods of LQR-optimal controllers for several families of simple linear control systems whose dynamics parameters Φ can be parameterized by two strictly positive real numbers. We saw that in general, the suboptimal neighborhoods are substantially nonconvex, not isotropic, and not aligned with the grid of parameters.

If we were to use a grid-based constructive covering of the Φ parameterization space, then we would need to find α-suboptimal policies for each grid cell. These experiments show that there exist systems with the following property: if the policy π is α-suboptimal for a particular grid cell, then it is also α-suboptimal for a substantial area outside the cell. We illustrate this in Figure 6.16 for the cart-pole system. Any axis-aligned box of systems for which the brown controller is α-suboptimal will be far smaller than its true α-suboptimal neighborhood.

Figure 6.16: Example of a poor match between a grid partition of Φ and true suboptimal neighborhoods of LQR-optimal controllers for Φ in the cart-pole system.

This example suggests that a grid-based covering would be an inefficient construction for the cart-pole system. Still, it is possible that this inefficiency is not significant from an asymptotic perspective.

6.8 Conclusion and future work

In this chapter, we introduced and motivated the α-suboptimal covering number to quantify infinite system spaces for multi-system control problems. We defined a particular class of multi-system linear-quadratic regulator problems amenable to analysis of the α-suboptimal covering number, and showed logarithmic dependency on the problem "breadth" parameter θ in the scalar case.

Towards analogous results for the matrix case, we presented empirical studies intended to shed light on possible proof techniques. For the upper bound, we considered a natural covering construction that would preserve logarithmic dependence on θ but give exponential dependence on dimensionality. Experiments did not rule out its validity. For the lower bound, we visualized suboptimal neighborhoods for two possible system classes and observed interesting topological behavior for the minimal-coupling class. We also presented some intermediate theoretical tools, reported on lines of inquiry that did not yield results, and visualized more suboptimal neighborhoods for a generalization of our original setting.

After extending our current results to the matrix case, in future work the analysis can be applied to other classes of multi-system LQR problems including variations in A, Q, R, discrete time, and stochastic dynamics. It will be interesting to see if there are major differences between LQR variants. We also hope that suboptimal covers and covering numbers will be a useful tool for analyzing how the size of the system space affects the required expressiveness of function classes used in practice as multi-system policies, such as neural networks.

Chapter 7

Conclusions

In this dissertation we presented four projects in the intersection of learning and control.

7.1 Summary of contributions

Our empirical work on deep reinforcement learning for domain adaptation shares an overall structure with self-tuning regulators in classic adaptive control, but is generic enough to apply to any parameterized family of MDPs.
Introducing deep neural networks enables learning an embedding space representation of the dynamics parameters that is both useful for adaptation and easier to identify from state-action trajectories. By representing the full multi-system policy as a pre-trained neural network, we decouple the adaptation process from the control synthesis process, allowing us to use arbitrarily large amounts of computation in the preparation stage. Although our experiments demonstrated the favorable properties of the learned embedding and showed a modest improvement in an ablation study, they also provoked our interest in more fundamental questions about algorithms and problem structure.

In our work on deformable manipulation, we demonstrated techniques to use a recurrent neural network dynamics model for control in a partially observable system. We compensated for the inevitable modeling error by using nonlinear state estimation to identify a value of the RNN internal state that is consistent with past inputs and outputs. For control, we implemented nonlinear model-predictive control using gradient descent, allowing us to take advantage of the performance optimizations for RNNs in deep learning libraries. Our technique is suitable for systems with complex dynamics but known reward functions, exemplified by the task of tracking a trajectory with a point on a highly deformable object.

The challenges we faced in empirical reinforcement learning projects motivated our interest in RL theory. Our work on bounding the variance of the REINFORCE policy gradient estimator for LQR systems enabled us to see how the variance is influenced by various problem parameters, including the dynamics stability, reward function coefficients, environment stochasticity, and policy stochasticity. Our upper bound was tight with respect to most parameters, and matched the behavior of the empirical variance qualitatively. On the other hand, we also demonstrated with experiments that the lowest gradient variance does not necessarily translate into the fastest convergence of REINFORCE.

Our work on suboptimal covering numbers attempted to answer a question raised by the first project: How can we quantify "how much" a good adaptive policy must change its behavior with respect to the problem parameters? We proposed the suboptimal covering number as a highly general, parameter-independent way to do this. As an example, we showed matching logarithmic covering number bounds for scalar and one-dimensional fully observable LQR problem families modeling adaptivity with respect to variations in actuator strength. Our results showed a mild dependency, indicating that for these systems we can handle a large range of actuator strengths, approaching uncontrollability, with only a few distinct policies. We also conjectured a bound exponential in the dimension for multi-input problems, but have not yet resolved it. Our intermediate mathematical results and intuition-building experiments suggest that answering this question requires nontrivial insight into the behavior of algebraic Riccati equations. Our choice to focus on suboptimality ratio instead of cost difference was deliberate, but appears to create extra challenges.

7.2 Future work

Our control framework for deformable manipulation involved standard choices for recurrent modeling, estimation, and control.
For modeling, we are interested in replacing LSTM networks with more specialized classes of learnable models, such as models that place expressive learnable components in a computational graph alongside fixed or low-parameter nodes derived from physics principles (Heiden et al., 2021b). For estimation and control, there are many other standard methods. More broadly, we note that our framework can be seen as the inner loop of a model-based RL algorithm. By closing the loop and updating the model continuously, we can relax the assumption that our initial data gathering sufficiently covers the state space. Our encouraging experimental results, along with the improved debugging and interpretability of our framework relative to model-free RL, make us enthusiastic about further research into model-based RL. Our work on RL theory suggests two possible follow-ups. As we discussed in § 5.2, the RL theory community has already made great strides on the LQR problem in the past few years. These results in- clude sample complexity bounds that tie the problem parameters more directly to the RL algorithm per- formance, instead of indirectly through the gradient variance. To our knowledge, the sample complexity upper bound for REINFORCE in LQR problems remains open. Such a bound might help us gain insight into the REINFORCE algorithm. However, due to the strict suboptimality of REINFORCE for cheap-control LQR problems discussed by Tu and Recht (2019), it would not necessarily improve our understanding of the LQR problem itself. Beyond LQR, our experiments also suggest there is an interesting and complex relationship between action-space noise, environmental noise, and algorithm performance in policy gradient methods. We find our results somewhat surprising, because action-space noise is historically thought to be related to ex- ploration, but exploration is not hard in linear dynamical systems. We are intrigued by the possibility of 181 designing an RL algorithm that controls action noise specifically for its effect on policy gradient optimiza- tion. If sufficient state-space exploration is already guaranteed, can additional action noise still be helpful for policy gradient methods? We view our work on suboptimal coverings as the first step in a potentially large line of inquiry. There is still much work to do on LQR problems. Beyond LQR, control theory offers many other classes of “almost linear” systems, such as systems with delays, actuator limits, dead zones, sector-bounded (Lur’e) nonlinearities, linear-in-features dynamics, and more. Bounding suboptimal covering numbers for these system classes could also lead to insights. Looking further, we hope to find applications where suboptimal covering numbers can be used to derive other useful properties about sets of dynamical systems. Connect- ing back to our original motivation, we are especially interested in the possibility of converting covering number bounds into insights about representing multi-system policies with neural networks. We also wish to explore designing RL or adaptive control algorithms based on switching between policies in a known suboptimal cover. Outside these specific projects, we hope that our theoretical and empirical work can overlap more. In the theoretical work in this dissertation, we have mainly used settings and mathematical tools from control theory to analyze the behavior of existing algorithms and to derive general insights into problem structure. 
However, the ultimate goal is to use insights and analysis to guide algorithm design. There is a large gap between the empirical success of deep RL and the systems for which provably efficient RL algorithms exist. We hope to contribute to bridging that gap, by building upon the current thorough understanding of simple cases and gradually relaxing assumptions without losing all theoretical guarantees. 182 Bibliography Abernethy, Jacob D. and Hazan, Elad (2016). Faster convex optimization: Simulated annealing with an efficient universal barrier. In International Conference on Machine Learning (ICML), pages 2520–2528. Agarwal, Alekh, Henaff, Mikael, Kakade, Sham, and Sun, Wen (2020). PC-PG: Policy cover directed exploration for provable policy gradient learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 13399–13412. Agarwal, Alekh, Jiang, Nan, Wen, Sham M. Kakade, and Sun (2022). Reinforcement learning: Theory and algorithms. Working draft. Agarwal, Alekh, Kakade, Sham M., Lee, Jason D., and Mahajan, Gaurav (2019). Optimality and approximation with policy gradient methods in Markov decision processes. CoRR, abs/1908.00261. Agarwal, Rishabh, Schwarzer, Max, Castro, Pablo Samuel, Courville, Aaron C, and Bellemare, Marc (2021). Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems (NeurIPS), pages 29304–29320. Agazzi, Andrea and Lu, Jianfeng (2021). Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. In International Conference on Learning Representations (ICLR). Allgöwer, Frank and Zheng, Alex (2012). Nonlinear model predictive control, volume 26. Birkhäuser. Amos, Brandon, Rodriguez, Ivan Dario Jimenez, Sacks, Jacob, Boots, Byron, and Kolter, J. Zico (2018). Differentiable MPC for end-to-end planning and control. CoRR, abs/1810.13400. Anderson, Brian D. O., Brinsmead, Thomas S., De Bruyne, Franky, Hespanha, Joao, Liberzon, Daniel, and Morse, A. Stephen (2000). Multiple model adaptive control. Part 1: Finite controller coverings. International Journal of Robust and Nonlinear Control, 10(11-12):909–929. Andrychowicz, Marcin, Crow, Dwight, Ray, Alex, Schneider, Jonas, Fong, Rachel, Welinder, Peter, McGrew, Bob, Tobin, Josh, Abbeel, Pieter, and Zaremba, Wojciech (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems (NIPS), pages 5048–5058. Andrychowicz, Marcin, Raichuk, Anton, Stanczyk, Piotr, Orsini, Manu, Girgin, Sertan, Marinier, Raphaël, Hussenot, Léonard, Geist, Matthieu, Pietquin, Olivier, Michalski, Marcin, Gelly, Sylvain, and Bachem, Olivier (2020). What matters in on-policy reinforcement learning? A large-scale empirical study. CoRR, abs/2006.05990. Antonova, Rika, Cruciani, Silvia, Smith, Christian, and Kragic, Danica (2017). Reinforcement learning for pivoting task. CoRR, abs/1703.00472. 183 Arriola-Rios, Veronica E., Guler, Puren, Ficuciello, Fanny, Kragic, Danica, Siciliano, Bruno, and Wyatt, Jeremy L. (2020). Modeling of deformable objects for robotic manipulation: A tutorial and review. Frontiers in Robotics and AI, 7:82. Åström, Karl J and Wittenmark, Björn (2013). Adaptive Control. Courier Corporation. Åström, Karl Johan and Murray, Richard M (2010). Feedback systems: An introduction for scientists and engineers. Princeton University Press. Barbič, Jernej and Popović, Jovan (2008). Real-time control of physically based simulations using gentle forces. 
ACM Transactions on Graphics, 27(5):1–10. Bern, James M, Banzet, Pol, Poranne, Roi, and Coros, Stelian (2019). Trajectory optimization for cable-driven soft robot locomotion. In Robotics: Science and Systems (RSS). Bern, James M., Schnider, Yannick, Banzet, Pol, Kumar, Nitish, and Coros, Stelian (2020). Soft robot control with a learned differentiable model. In International Conference on Soft Robotics (RoboSoft), pages 417–423. Bertsekas, Dimitri P. and Shreve, Steven E. (1978). Stochastic Optimal Control: The Discrete Time Case. Mathematics in Science and Engineering. Academic Press. Bhandari, Jalaj and Russo, Daniel (2019). Global optimality guarantees for policy gradient methods. CoRR, abs/1906.01786. Bhojanapalli, Srinadh, Wilber, Kimberly, Veit, Andreas, Rawat, Ankit Singh, Kim, Seungyeon, Menon, Aditya Krishna, and Kumar, Sanjiv (2021). On the reproducibility of neural network predictions. CoRR, abs/2102.03349. Boffi, Nicholas M., Tu, Stephen, and Slotine, Jean-Jacques E. (2021). Regret bounds for adaptive nonlinear control. In Conference on Learning for Dynamics and Control (L4DC), pages 471–483. Bousmalis, Konstantinos, Trigeorgis, George, Silberman, Nathan, Krishnan, Dilip, and Erhan, Dumitru (2016). Domain separation networks. In Advances in Neural Information Processing Systems (NIPS), pages 343–351. Boyd, Stephen and Vandenberghe, Lieven (2004). Convex Optimization. Cambridge University Press. Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech (2016). OpenAI Gym. CoRR, abs/1606.01540. Bruder, Daniel, Gillespie, Brent, Remy, C. David, and Vasudevan, Ram (2019). Modeling and control of soft robots using the Koopman operator and model predictive control. In Robotics: Science and Systems (RSS). Brunton, Steven L and Kutz, J Nathan (2019). Data-driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press. Bu, Jingjing, Mesbahi, Afshin, Fazel, Maryam, and Mesbahi, Mehran (2019a). LQR through the lens of first order methods: Discrete-time case. CoRR, abs/1907.08921. Bu, Jingjing, Mesbahi, Afshin, and Mesbahi, Mehran (2019b). On topological and metrical properties of stabilizing feedback gains: The MIMO case. CoRR, abs/1904.02737. 184 Bubeck, Sébastien (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357. Cai, Qi, Yang, Zhuoran, Lee, Jason D., and Wang, Zhaoran (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems (NeurIPS), pages 11312–11322. Cassel, Asaf B and Koren, Tomer (2021). Online policy gradient for model free learning of linear quadratic regulators with √ T regret. In International Conference on Machine Learning (ICML), pages 1304–1313. Chebotar, Yevgen, Hausman, Karol, Zhang, Marvin, Sukhatme, Gaurav, Schaal, Stefan, and Levine, Sergey (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning. In International Conference on Machine Learning (ICML), pages 703–711. Chen, Tao, Murali, Adithyavairavan, and Gupta, Abhinav (2018). Hardware conditioned policies for multi-robot transfer learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 9355–9366. Deisenroth, Marc Peter and Rasmussen, Carl Edward (2011). PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), pages 465–472. 
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li (2009). ImageNet: A large-scale hierarchical image database. In International Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. Devin, Coline, Gupta, Abhishek, Darrell, Trevor, Abbeel, Pieter, and Levine, Sergey (2017). Learning modular neural network policies for multi-task and multi-robot transfer. In International Conferenceon Robotics and Automation (ICRA), pages 2169–2176. Doyle, John C (1978). Guaranteed margins for LQG regulators. IEEE Transactions on Automatic Control, 23(4):756–757. Du, Jingjing, Song, Chunyue, and Li, Ping (2012). Multimodel control of nonlinear systems: An integrated design procedure based on gap metric andH ∞ loop shaping. Industrial & Engineering Chemistry Research, 51(9):3722–3731. Du, Simon, Kakade, Sham, Lee, Jason, Lovett, Shachar, Mahajan, Gaurav, Sun, Wen, and Wang, Ruosong (2021). Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning (ICML), pages 2826–2836. Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter (2016a). Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778. Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and Abbeel, Pieter (2016b). RL 2 : Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779. Duenser, Simon, Bern, James M, Poranne, Roi, and Coros, Stelian (2018). Interactive robotic manipulation of elastic objects. InInternationalConferenceonIntelligentRobotsandSystems(IROS), pages 3476–3481. Dullerud, Geir E. and Paganini, Fernando (2000). A Course in Robust Control Theory: A Convex Approach. Springer-Verlag New York. 185 Engstrom, Logan, Ilyas, Andrew, Santurkar, Shibani, Tsipras, Dimitris, Janoos, Firdaus, Rudolph, Larry, and Madry, Aleksander (2020). Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations (ICLR). Eysenbach, Benjamin and Levine, Sergey (2021). Maximum entropy RL (provably) solves some robust RL problems. CoRR, abs/2103.06257. Fan, Jianqing, Wang, Zhaoran, Xie, Yuchen, and Yang, Zhuoran (2020). A theoretical analysis of deep q-learning. In Conference on Learning for Dynamics and Control (L4DC), pages 486–489. Fazel, Maryam, Ge, Rong, Kakade, Sham, and Mesbahi, Mehran (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning (ICML), pages 1467–1476. Fekri, Sajjad, Athans, Michael, and Pascoal, Antonio (2006). Issues, progress and new results in robust adaptive control. International Journal of Adaptive Control and Signal Processing, 20(10):519–579. Feng, Fei, Yin, Wotao, Agarwal, Alekh, and Yang, Lin (2021). Provably correct optimization and exploration with non-linear policies. In International Conference on Machine Learning (ICML), pages 3263–3273. Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), pages 1126–1135. Fu, Minyue (1996). Minimum switching control for adaptive tracking. In Conference on Decision and Control (CDC), pages 3749–3754. Fu, Minyue and Barmish, B. Ross (1986). Adaptive stabilization of linear systems via switching control. IEEE Transactions on Automatic Control, 31(12):1097–1103. 
Gillespie, Morgan T., Best, Charles M., Townsend, Eric C., Wingate, David, and Killpack, Marc D. (2018). Learning nonlinear dynamic models of soft robots for model predictive control with neural networks. In International Conference on Soft Robotics (RoboSoft), pages 39–45. Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530. Ha, David and Schmidhuber, Jürgen (2018). Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NeurIPS), pages 2455–2467. Haarnoja, Tuomas, Zhou, Aurick, Abbeel, Pieter, and Levine, Sergey (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), pages 1856–1865. Hahn, David, Banzet, Pol, Bern, James M, and Coros, Stelian (2019). Real2sim: Visco-elastic parameter estimation from dynamic motion. ACM Transactions on Graphics (TOG), 38(6):1–13. He, Fengxiang and Tao, Dacheng (2020). Recent advances in deep learning theory. CoRR, abs/2012.10931. Heiden, Eric, Macklin, Miles, Narang, Yashraj S., Fox, Dieter, Garg, Animesh, and Ramos, Fabio (2021a). DiSECt: A differentiable simulation engine for autonomous robotic cutting. In Robotics: Science and Systems (RSS). 186 Heiden, Eric, Millard, David, Coumans, Erwin, Sheng, Yizhou, and Sukhatme, Gaurav S. (2021b). NeuralSim: Augmenting differentiable simulators with neural networks. In International Conference on Robotics and Automation (ICRA), pages 9474–9481. Henderson, Peter, Islam, Riashat, Bachman, Philip, Pineau, Joelle, Precup, Doina, and Meger, David (2018). Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence , pages 3207–3214. Hespanha, João P, Liberzon, Daniel, and Morse, A Stephen (2000). Bounds on the number of switchings with scale-independent hysteresis: Applications to supervisory control. In Conference on Decision and Control (CDC), pages 3622–3627. Higgins, Irina, Pal, Arka, Rusu, Andrei A., Matthey, Loïc, Burgess, Christopher, Pritzel, Alexander, Botvinick, Matthew, Blundell, Charles, and Lerchner, Alexander (2017). DARLA: improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning (ICML), pages 1480–1490. Hinrichsen, Diederich and Pritchard, Anthony J. (2005). Mathematical Systems Theory I: Modelling, State Space Analysis, Stability and Robustness. Texts in Applied Mathematics. Springer Berlin Heidelberg. Hochreiter, Sepp and Schmidhuber, Jürgen (1997). Long short-term memory. Neural computation, 9(8):1735–1780. Holzapfel, Gerhard A. (2002). Nonlinear solid mechanics: A continuum approach for engineering science. Meccanica, 37(4):489–490. Hu, Yuanming, Liu, Jiancheng, Spielberg, Andrew, Tenenbaum, Joshua B., Freeman, William T., Wu, Jiajun, Rus, Daniela, and Matusik, Wojciech (2019). ChainQueen: A real-time differentiable physical simulator for soft robotics. In International Conference on Robotics and Automation (ICRA), pages 6265–6271. Huang, Sandy H., Papernot, Nicolas, Goodfellow, Ian J., Duan, Yan, and Abbeel, Pieter (2017). Adversarial attacks on neural network policies. CoRR, abs/1702.02284. Hwangbo, Jemin, Sa, Inkyu, Siegwart, Roland, and Hutter, Marco (2017). Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters, 2(4):2096–2103. Jalali, Ali Akbar and Golmohammad, Hassan (2012). 
An optimal multiple-model strategy to design a controller for nonlinear processes: A boiler-turbine unit. Computers & Chemical Engineering, 46:48–58. Jin, Chi, Allen-Zhu, Zeyuan, Bubeck, Sébastien, and Jordan, Michael I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems (NeurIPS), pages 4868–4878. Jin, Chi, Liu, Qinghua, and Miryoosefi, Sobhan (2021). Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 13406–13418. Jin, Chi, Yang, Zhuoran, Wang, Zhaoran, and Jordan, Michael I (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory (COLT), pages 2137–2143. Kakade, Sham M., Krishnamurthy, Akshay, Lowrey, Kendall, Ohnishi, Motoya, and Sun, Wen (2020). Information theoretic regret bounds for online nonlinear control. CoRR, abs/2006.12466. 187 Kalashnikov, Dmitry, Irpan, Alex, Pastor, Peter, Ibarz, Julian, Herzog, Alexander, Jang, Eric, Quillen, Deirdre, Holly, Ethan, Kalakrishnan, Mrinal, Vanhoucke, Vincent, and Levine, Sergey (2018). Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning (CoRL), pages 651–673. Kalashnikov, Dmitry, Varley, Jake, Chebotar, Yevgen, Swanson, Benjamin, Jonschkowski, Rico, Finn, Chelsea, Levine, Sergey, and Hausman, Karol (2021). Scaling up multi-task robotic reinforcement learning. In Conference on Robot Learning (CoRL), pages 557–575. Kalman, R.E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45. Konstantinov, Mihail, Gu, D Wei, Mehrmann, Volker, and Petkov, Petko (2003). Perturbation Theory for Matrix Equations. Elsevier Science. Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. (2012). Imagenet classification with deep convolutional neural networks. InAdvancesinNeuralInformationProcessingSystems, pages 1106–1114. Kwon, Wook Hyun, Moon, Young Soo, and Ahn, Sang Chul (1996). Bounds in algebraic Riccati and Lyapunov equations: A survey and some new results. International Journal of Control, 64(3):377–389. Lancaster, Peter and Rodman, Leiba (1995). Algebraic Riccati Equations. Clarendon Press. Lattimore, Tor and Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge University Press. L’ecuyer, Pierre (1990). A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364–1383. Levine, Sergey (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909. Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373. Li, Yifei, Du, Tao, Wu, Kui, Xu, Jie, and Matusik, Wojciech (2021). DiffCloth: Differentiable cloth simulation with dry frictional contact. CoRR, abs/2106.05306. Lipton, Zachary Chase (2015). A critical review of recurrent neural networks for sequence learning. CoRR, abs/1506.00019. Liu, Boyi, Cai, Qi, Yang, Zhuoran, and Wang, Zhaoran (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems (NeurIPS), pages 10564–10575. Macklin, Miles, Müller, Matthias, and Chentanez, Nuttapong (2016). XPBD: Position-based simulation of compliant constrained dynamics. In International Conference on Motion in Games, pages 49–54. 
Malik, Dhruv, Pananjady, Ashwin, Bhatia, Kush, Khamaru, Koulik, Bartlett, Peter L., and Wainwright, Martin J. (2018). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. CoRR, abs/1812.08305. Mania, Horia, Jordan, Michael I., and Recht, Benjamin (2022). Active learning for nonlinear system identification with guarantees. Journal of Machine Learning Research, 23:32:1–32:30. 188 Mania, Horia, Tu, Stephen, and Recht, Benjamin (2019). Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems (NeurIPS), pages 10154–10164. Mcconachie, Dale and Berenson, Dmitry (2018). Estimating model utility for deformable object manipulation using multiarmed bandit methods. IEEE Transactions on Automation Science and Engineering, 15(3):967–979. McConachie, Dale, Dobson, Andrew, Ruan, Mengyao, and Berenson, Dmitry (2020). Manipulating deformable objects by interleaving prediction, planning, and control. The International Journal of Robotics Research, 39(8):957–982. McNichols, Kenneth H. and Fadali, M. Sami (2003). Selecting operating points for discrete-time gain scheduling. Computers & Electrical Engineering, 29(2):289–301. Mirza, Nada Masood (2020). Machine learning and soft robotics. In International Arab Conference on Information Technology (ACIT), pages 1–5. Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin (2013). Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602. Mohammadi, Hesameddin, Zare, Armin, Soltanolkotabi, Mahdi, and Jovanovic, Mihailo R. (2019). Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator. In Conference on Decision and Control (CDC), pages 7474–7479. Molchanov, Artem, Chen, Tao, Hönig, Wolfgang, Preiss, James A, Ayanian, Nora, and Sukhatme, Gaurav S (2019). Sim-to-(multi)-real: Transfer of low-level robust control policies to multiple quadrotors. In International Conference on Intelligent Robots and Systems (IROS), pages 59–66. Müller, Matthias, Heidelberger, Bruno, Teschner, Matthias, and Gross, Markus (2005). Meshless deformations based on shape matching. ACM Transactions on Graphics (TOG), 24(3):471–478. Murray-Smith, Roderick and Johansen, Tor Arne (1997). Multiple Model Approaches to Modelling and Control. Taylor and Francis, London. Nelles, Oliver (2001). Nonlinear System Identification . Springer, Berlin. Nesterov, Yurii (1983). A method for unconstrained convex minimization problem with the rate of convergenceo(1/k 2 ). In Proceedings of the USSR Academy of Sciences, pages 543–547. Nesterov, Yurii (2003). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media. Nocedal, Jorge and Wright, Stephen J. (2006). Numerical Optimization. Springer, New York, NY, USA, second edition. OpenAI, Akkaya, Ilge, Andrychowicz, Marcin, Chociej, Maciek, Litwin, Mateusz, McGrew, Bob, Petron, Arthur, Paino, Alex, Plappert, Matthias, Powell, Glenn, Ribas, Raphael, Schneider, Jonas, Tezak, Nikolas, Tworek, Jerry, Welinder, Peter, Weng, Lilian, Yuan, Qiming, Zaremba, Wojciech, and Zhang, Lei (2019). Solving Rubik’s cube with a robot hand. CoRR, abs/1910.07113. Papadimitriou, Christos H and Tsitsiklis, John N (1987). The complexity of Markov decision processes. Mathematics of operations research, 12(3):441–450. 189 Parisotto, Emilio, Ba, Lei Jimmy, and Salakhutdinov, Ruslan (2016). 
Actor-mimic: Deep multitask and transfer reinforcement learning. In International Conference on Learning Representations (ICLR). Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, Desmaison, Alban, Köpf, Andreas, Yang, Edward Z., DeVito, Zach, Raison, Martin, Tejani, Alykhan, Chilamkurthy, Sasank, Steiner, Benoit, Fang, Lu, Bai, Junjie, and Chintala, Soumith (2019). PyTorch: An imperative style, high-performance deep learning library. CoRR, abs/1912.01703. Peng, Xue Bin, Andrychowicz, Marcin, Zaremba, Wojciech, and Abbeel, Pieter (2017). Sim-to-real transfer of robotic control with dynamics randomization. CoRR, abs/1710.06537. Peshkin, Leonid, Meuleau, Nicolas, and Kaelbling, Leslie Pack (1999). Learning policies with external memory. In International Conference on Machine Learning (ICML), pages 307–314. Petersen, Ian R. and McFarlane, Duncan C. (1994). Optimal guaranteed cost control and filtering for uncertain linear systems. IEEE Transactions on Automatic Control, 39(9):1971–1977. Pinto, Lerrel, Davidson, James, Sukthankar, Rahul, and Gupta, Abhinav (2017). Robust adversarial reinforcement learning. CoRR, abs/1703.02702. Preiss, James A., Arnold, Sébastien M. R., Wei, Chen-Yu, and Kloft, Marius (2019). Analyzing the variance of policy gradient estimators for the linear-quadratic regulator. In NeurIPS Workshop on Optimization Foundations for Reinforcement Learning. Preiss, James A., Hausman, Karol, and Sukhatme, Gaurav S. (2018). Learning a system-ID embedding space for domain specialization with deep reinforcement learning. In NeurIPS Workshop on Reinforcement Learning under Partial Observability. Preiss, James A., Millard, David, Yao, Tao, and Sukhatme, Gaurav S. (2022). Tracking fast trajectories with a deformable object using a learned model. In International Conference on Robotics and Automation (ICRA). Preiss, James A. and Sukhatme, Gaurav S. (2021). Suboptimal coverings for continuous spaces of control tasks. In Conference on Learning for Dynamics and Control (L4DC), pages 547–558. Qiao, Yi-Ling, Liang, Junbang, Koltun, Vladlen, and Lin, Ming C. (2020). Scalable differentiable physics for learning and control. In International Conference on Machine Learning (ICML), pages 7847–7856. Raffin, Antonin, Hill, Ashley, Gleave, Adam, Kanervisto, Anssi, Ernestus, Maximilian, and Dormann, Noah (2021). Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8. Rosenblatt, Frank (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386. Rusu, Andrei A., Vecerík, Matej, Rothörl, Thomas, Heess, Nicolas, Pascanu, Razvan, and Hadsell, Raia (2017). Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning (CoRL), pages 262–270. Sabelhaus, Andrew P. and Majidi, Carmel (2021). Gaussian process dynamics models for soft robots with shape memory actuators. In 2021 IEEE 4th International Conference on Soft Robotics (RoboSoft), pages 191–198. 190 Sadeghi, Fereshteh and Levine, Sergey (2017). CAD2RL: real single-image flight without a single real image. In Robotics: Science and Systems (RSS). Safonov, Michael and Athans, Michael (1977). Gain and phase margin for multiloop LQG regulators. IEEE Transactions on Automatic Control, 22(2):173–179. Schaul, Tom, Horgan, Daniel, Gregor, Karol, and Silver, David (2015). 
Universal value function approximators. In International Conference on Machine Learning (ICML), pages 1312–1320. Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, et al. (2020). Mastering Atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609. Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael I., and Moritz, Philipp (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889–1897. Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347. Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik (2010). Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670. Shi, Guanya, Hönig, Wolfgang, Shi, Xichen, Yue, Yisong, and Chung, Soon-Jo (2021). Neural-Swarm2: Planning and control of heterogeneous multirotor swarms using learned interactions. IEEE Transactions on Robotics, pages 1–17. Shi, Guanya, Shi, Xichen, O’Connell, Michael, Yu, Rose, Azizzadenesheli, Kamyar, Anandkumar, Animashree, Yue, Yisong, and Chung, Soon-Jo (2019). Neural lander: Stable drone landing control using learned dynamics. In International Conference on Robotics and Automation (ICRA), pages 9784–9790. Siciliano, Bruno, Sciavicco, Lorenzo, Villani, Luigi, and Oriolo, Giuseppe (2009). Robotics: Modelling, Planning and Control. Advanced Textbooks in Control and Signal Processing. Springer-Verlag, London. Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan, Graepel, Thore, et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144. Simchowitz, Max and Foster, Dylan (2020). Naive exploration is optimal for online LQR. In International Conference on Machine Learning (ICML), pages 8937–8948. Singh, Sumeet, Richards, Spencer M, Sindhwani, Vikas, Slotine, Jean-Jacques E, and Pavone, Marco (2021). Learning stabilizable nonlinear dynamics with contraction-based regularization. The International Journal of Robotics Research, 40(10-11):1123–1150. Song, Yuda and Sun, Wen (2021). PC-MLP: Model-based reinforcement learning with policy cover guided exploration. In International Conference on Machine Learning (ICML), pages 9801–9811. Sontag, Eduardo D. (2013). Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer Science & Business Media. Sperry, Elmer A (1921). Aeroplane stabilizer. US Patent US1368226A. 191 Stilwell, Daniel J. and Rugh, Wilson J. (1999). Interpolation of observer state feedback controllers for gain scheduling. IEEE Transactions on Automatic Control, 44(6):1225–1229. Sulsky, D., Chen, Z., and Schreyer, H. L. (1994). A particle method for history-dependent materials. Computer Methods in Applied Mechanics and Engineering, 118(1):179–196. Sun, Ji-Guang (1998). Perturbation theory for algebraic Riccati equations. SIAM Journal on Matrix Analysis and Applications, 19(1):39–65. Sun, Yue and Fazel, Maryam (2021). Learning optimal controllers by policy gradient: Global optimality via convex parameterization. In Conference on Decision and Control (CDC), pages 4576–4581. Sutton, Richard S. and Barto, Andrew G. (2018). Reinforcement learning: An introduction. 
Adaptive computation and machine learning. MIT Press, second edition. Tan, Wen, Marquez, Horacio J., and Chen, Tongwen (2004). Operating point selection in multimodel controller design. In American Control Conference, pages 3652–3657. Tao, Terence (2012). Topics in random matrix theory. Graduate studies in mathematics. American Mathematical Society, Providence, RI. Teh, Yee Whye, Bapst, Victor, Czarnecki, Wojciech M., Quan, John, Kirkpatrick, James, Hadsell, Raia, Heess, Nicolas, and Pascanu, Razvan (2017). Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 4499–4509. Terzi, Enrico, Bonassi, Fabio, Farina, Marcello, and Scattolini, Riccardo (2021). Learning model predictive control with long short-term memory networks. International Journal of Robust and Nonlinear Control. Thieffry, Maxime, Kruszewski, Alexandre, Duriez, Christian, and Guerra, Thierry-Marie (2018). Control design for soft robots based on reduced-order model. IEEE Robotics and Automation Letters, 4(1):25–32. Thuruthel, Thomas George, Falotico, Egidio, Renda, Federico, and Laschi, Cecilia (2019). Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators. IEEETransactions on Robotics, 35(1):124–134. Tits, André L and Yang, Yaguang (1996). Globally convergent algorithms for robust pole assignment by state feedback. IEEE Transactions on Automatic Control, 41(10):1432–1452. Tobin, Josh, Fong, Rachel, Ray, Alex, Schneider, Jonas, Zaremba, Wojciech, and Abbeel, Pieter (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems (IROS), pages 23–30. Todorov, Emanuel, Erez, Tom, and Tassa, Yuval (2012). MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. Tonkens, Sander, Lorenzetti, Joseph, and Pavone, Marco (2021). Soft robot optimal control via reduced order finite element models. In International Conference on Robotics and Automation (ICRA), pages 12010–12016. Trefethen, Lloyd N and Embree, Mark (2005). Spectra and pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton University Press. Tropp, Joel A. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230. 192 Tu, Stephen and Recht, Benjamin (2019). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory (COLT), pages 3036–3083. van Baar, Jeroen, Sullivan, Alan, Cordorel, Radu, Jha, Devesh K., Romeres, Diego, and Nikovski, Daniel (2019). Sim-to-real transfer learning using robustified controllers in robotic tasks involving complex dynamics. In International Conference on Robotics and Automation (ICRA), pages 6001–6007. van Hasselt, Hado P, Guez, Arthur, Hessel, Matteo, Mnih, Volodymyr, and Silver, David (2016). Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pages 4287–4295. Wang, Lingxiao, Cai, Qi, Yang, Zhuoran, and Wang, Zhaoran (2020). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations (ICLR). Wang, Sheng-De, Kuo, Te-Son, and Hsu, Chen-Fa (1986). Trace bounds on the solution of the algebraic matrix Riccati and Lyapunov equation. IEEE Transactions on Automatic Control, 31(7):654–656. 
Wei, Chen-Yu, Jahromi, Mehdi Jafarnia, Luo, Haipeng, Sharma, Hiteshi, and Jain, Rahul (2020). Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning (ICML), pages 10170–10180. Wierstra, Daan, Foerster, Alexander, Peters, Jan, and Schmidhuber, Juergen (2007). Solving deep memory POMDPs with recurrent policy gradients. In International Conference on Artificial Neural Networks , pages 697–706. Williams, Ronald J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256. Wriggers, Peter (2008). Nonlinear Finite Element Methods. Springer-Verlag, Berlin Heidelberg. Xu, Pan and Gu, Quanquan (2020). A finite-time analysis of Q-learning with neural network function approximation. In International Conference on Machine Learning (ICML), pages 10555–10565. Yang, Zhaoyang, Merrick, Kathryn E, Abbass, Hussein A, and Jin, Lianwen (2017). Multi-task deep reinforcement learning for continuous action control. In International Joint Conference on Artificial Intelligence (IJCAI), pages 3301–3307. Yang, Zhuoran, Jin, Chi, Wang, Zhaoran, Wang, Mengdi, and Jordan, Michael I. (2020). Provably efficient reinforcement learning with kernel and neural function approximations. In Advances in Neural Information Processing Systems (NeurIPS). Yoon, Myung-Gon, Ugrinovskii, Valery A., and Pszczel, Marek (2007). Gain-scheduling of minimax optimal state-feedback controllers for uncertain LPV systems. IEEE Transactions on Automatic Control, 52(2):311–317. Yu, Tianhe, Kumar, Saurabh, Gupta, Abhishek, Levine, Sergey, Hausman, Karol, and Finn, Chelsea (2020). Gradient surgery for multi-task learning. CoRR, abs/2001.06782. Yu, Tianhe, Quillen, Deirdre, He, Zhanpeng, Julian, Ryan, Hausman, Karol, Finn, Chelsea, and Levine, Sergey (2019). Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), pages 1094–1100. 193 Yu, Wenhao, Tan, Jie, Liu, C. Karen, and Turk, Greg (2017). Preparing for the unknown: Learning a universal policy with online system identification. In Robotics: Science and Systems (RSS). Zhang, Chiyuan, Vinyals, Oriol, Munos, Rémi, and Bengio, Samy (2018). A study on overfitting in deep reinforcement learning. CoRR, abs/1804.06893. Zhu, Jihong, Cherubini, Andrea, Dune, Claire, Navarro-Alarcon, David, Alambeigi, Farshid, Berenson, Dmitry, Ficuciello, Fanny, Harada, Kensuke, Li, Xiang, Pan, Jia, and Yuan, Wenzhen (2021). Challenges and outlook in robotic manipulation of deformable objects. CoRR, abs/2105.01767. Zhu, Yuke, Wang, Ziyu, Merel, Josh, Rusu, Andrei A., Erez, Tom, Cabi, Serkan, Tunyasuvunakool, Saran, Kramár, János, Hadsell, Raia, de Freitas, Nando, and Heess, Nicolas (2018). Reinforcement and imitation learning for diverse visuomotor skills. In Robotics: Science and Systems (RSS). Zimmermann, Simon, Poranne, Roi, and Coros, Stelian (2021). Dynamic manipulation of deformable objects with implicit integration. IEEE Robotics and Automation Letters, 6(2):4209–4216. 194
Abstract
The interface between machine learning and control has enabled robots to move outside the laboratory into challenging real-world settings. Deep reinforcement learning can scale empirically to very complex systems, but we do not yet understand precisely when and why it succeeds. Control theory focuses on simpler systems, but delivers interpretability, mathematical understanding, and guarantees. We present projects that combine these strengths.
In empirical work, we propose a framework for tasks with complex dynamics but known reward functions. We restrict the use of learning to the dynamics modeling stage, and act based on this model using traditional state-space control. We apply this framework to robotic manipulation of deformable objects.
In theoretical work, we deploy the well-understood linear quadratic regulator (LQR) problem as a test case to "look inside" algorithms and problem structure. First, we investigate how reinforcement learning algorithms depend on properties of the dynamical system by bounding the variance of the REINFORCE policy gradient estimator as a function of the LQR system matrices. Second, we introduce the framework of suboptimal covering numbers to quantify how much a good multi-system policy must change with respect to the dynamics parameters, and bound the covering number for a simple class of LQR systems.
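To make the empirical framework summarized above concrete, the following is a minimal sketch (not code from the dissertation) of the same pattern: learning is confined to the dynamics model, a standard controller is derived from that model, and repeating the collect-refit-rederive cycle gives the model-based loop discussed in the conclusion. For simplicity, the learned component here is a residual that is linear in hand-chosen features (one of many possible model classes), and the env, make_controller, physics_step, and phi interfaces are hypothetical placeholders.

    import numpy as np

    def hybrid_step(x, u, W, physics_step, phi):
        # Dynamics model: a fixed physics prior plus a learned residual W @ phi(x, u)
        # that is linear in hand-chosen features.
        return physics_step(x, u) + W @ phi(x, u)

    def fit_residual(transitions, physics_step, phi):
        # Least-squares fit of the residual weights from observed (x, u, x_next) triples.
        Phi = np.stack([phi(x, u) for x, u, _ in transitions])                   # (N, d)
        resid = np.stack([xn - physics_step(x, u) for x, u, xn in transitions])  # (N, n)
        W, *_ = np.linalg.lstsq(Phi, resid, rcond=None)                          # (d, n)
        return W.T

    def model_based_loop(env, make_controller, physics_step, phi, feat_dim,
                         iters=10, horizon=200):
        # Outer loop: derive a controller from the current model, collect data by
        # acting with it, then refit the model, so the initial data set need not
        # cover the whole state space.
        W = np.zeros((env.state_dim, feat_dim))
        data = []
        for _ in range(iters):
            model = lambda x, u, W=W: hybrid_step(x, u, W, physics_step, phi)
            controller = make_controller(model)   # e.g., MPC or LQR on the model
            x = env.reset()
            for _ in range(horizon):
                u = controller(x)
                x_next = env.step(u)              # assumed to return the next state
                data.append((x, u, x_next))
                x = x_next
            W = fit_residual(data, physics_step, phi)
        return W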
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Closing the reality gap via simulation-based inference and control
Scaling robot learning with skills
Algorithms and systems for continual robot learning
Leveraging prior experience for scalable transfer in robot learning
Leveraging cross-task transfer in sequential decision problems
High-throughput methods for simulation and deep reinforcement learning
Program-guided framework for interpreting and acquiring complex skills with learning robots
Data scarcity in robotics: leveraging structural priors and representation learning
Robust loop closures for multi-robot SLAM in unstructured environments
Sample-efficient and robust neurosymbolic learning from demonstrations
Accelerating robot manipulation using demonstrations
Learning from planners to enable new robot capabilities
Efficiently learning human preferences for proactive robot assistance in assembly tasks
Leveraging structure for learning robot control and reactive planning
Iterative path integral stochastic optimal control: theory and applications to motor control
Motion coordination for large multi-robot teams in obstacle-rich environments
Decentralized real-time trajectory planning for multi-robot navigation in cluttered environments
Advancing robot autonomy for long-horizon tasks
Optimization-based whole-body control and reactive planning for a torque controlled humanoid robot
Learning objective functions for autonomous motion generation
Asset Metadata
Creator
Preiss, James Alan (author)
Core Title
Characterizing and improving robot learning: a control-theoretic perspective
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2022-08
Publication Date
07/21/2022
Defense Date
05/10/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
Control, control theory, machine learning, OAI-PMH Harvest, reinforcement learning, robotics
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sukhatme, Gaurav Suhas (committee chair), Ayanian, Nora (committee member), Nayyar, Ashutosh (committee member), Nikolaidis, Stefanos (committee member)
Creator Email
jamesalanpreiss@gmail.com,japreiss@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111373697
Unique identifier
UC111373697
Legacy Identifier
etd-PreissJame-10890
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Preiss, James Alan
Type
texts
Source
20220722-usctheses-batch-959 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu