BAYESIAN METHODS FOR AUTONOMOUS LEARNING SYSTEMS

by

Jo-Anne S. Y. Ting

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2009

Copyright 2009 Jo-Anne S. Y. Ting

Acknowledgements

This dissertation could not have been completed without the support and help of many. I have been very fortunate to have Stefan Schaal as an advisor, collaborator and sounding board for the many frustrations faced over the years. He has provided not only guidance and inspiration, but also the kind of nurturing and supportive environment for growth that anyone could hope for. I must thank Gaurav Sukhatme, Nicolas Schweighofer, Sethu Vijayakumar, Laurent Itti, Fei Sha and Leana Golubchik for their invaluable research (and career) advice and feedback. I would also like to thank my fellow collaborators Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, Peter Strick, Mitsuo Kawato and John Kalaska for their help with the neural-EMG data analyses. I am indebted to Sherri Fagan and Laura Lopez for all the work they have done to make the PhD-related paperwork process less painful.

All the folks at CLMC—past and present—have been a warm and welcoming family, helping me acclimatize to the sunny, concrete sprawl of Los Angeles. The “other” robotics labs on the fourth floor, RESL and the Interaction Lab, have also been great contributors to a socially stimulating environment at USC. A special thank you goes out to Aaron D’Souza for being such a wonderful mentor and friend, and to Mrinal Kalakrishnan, who has been a great help in the past year.

I could not have survived the last five years without those who came along for fun outdoor (beach, mountain & road) excursions and sing-and-drink distractions from research.
I would like especially to thank the Baker family—Sue, Teresa and Benji—and the Brasca family—Angiola and Carlo—for making me feel so welcomed in Los Angeles and Milan, respectively, and for being my “family away from family”. I am deeply grateful for the encouragement of my parents (Sharon and Joseph), my siblings (Michele, Mike and Adrian) and friends who, even from afar, have offered a ready ear, insightful words and unlimited moral support whenever it was needed. I would like to thank Leslie Cheng, Yiffie Zhu and Madelaine Saginur for being truly fantastic friends over the years. My final thank you goes to Claudio Brasca for his love and patience.

Table of Contents

Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Autonomous Learning
    1.1.1 What is Autonomous Learning?
    1.1.2 Autonomous Learning and Model Complexity
    1.1.3 Why Bayesian?
  1.2 Approximate Inference Methods
    1.2.1 Bayesian Learning: Quantities of Interest
    1.2.2 Variational Approximations
    1.2.3 Expectation-Propagation
    1.2.4 Laplace Approximations
    1.2.5 MCMC Sampling
  1.3 Key Challenges
  1.4 Dissertation Outline

Chapter 2: High Dimensional Linear Regression
  2.1 Computationally Tractable Linear Regression
  2.2 Variational Bayesian Least Squares
    2.2.1 EM-based Linear Regression
    2.2.2 Automatic Relevance Determination
  2.3 Real-time Implementation
  2.4 Evaluation
    2.4.1 Synthetic Data
      2.4.1.1 Data sets
      2.4.1.2 Methods
      2.4.1.3 Results
      2.4.1.4 Non-Normal Synthetic Data
    2.4.2 EMG Prediction from Neural Firing
      2.4.2.1 Data sets
      2.4.2.2 Methods
      2.4.2.3 Results
    2.4.3 Real-time Analysis for Brain-Machine Interfaces
  2.5 Interpretation of Neural Data Analyses
    2.5.1 Sergio & Kalaska (1998) data set
    2.5.2 Kakei, Hoffman & Strick (1999) and Kakei, Hoffman & Strick (2001) data sets
  2.6 Discussion

Chapter 3: Parameter Identification in Noisy Linear Regression
  3.1 Background
    3.1.1 Parameter Estimation in Rigid Body Dynamics
    3.1.2 Modeling Input Noise in Linear Regression
  3.2 Bayesian Regression with Input Noise
    3.2.1 EM-based Joint Factor Analysis
    3.2.2 Automatic Feature Detection
    3.2.3 Inference of the Regression Solution
  3.3 Post-processing for Physically Consistent Rigid Body Parameters
  3.4 Evaluation
    3.4.1 Synthetic Data
    3.4.2 Robotic Oculomotor Vision Head
    3.4.3 Robotic Anthropomorphic Arm
  3.5 Discussion

Chapter 4: Dealing with Outlier-Infested Data
  4.1 Background
  4.2 Linear Regression with Outliers
    4.2.1 Bayesian Regression for Automatic Outlier Detection
    4.2.2 Incremental Version
  4.3 An Outlier-Robust Kalman Filter
    4.3.1 Background
    4.3.2 The Kalman Filter
    4.3.3 The Robust Weighted Kalman Filter
    4.3.4 Monitoring the Residual Error
  4.4 Evaluation
    4.4.1 Linear Regression
      4.4.1.1 Synthetic Data
      4.4.1.2 LittleDog Robot
    4.4.2 Weighted Kalman Filter
      4.4.2.1 Synthetic Data
      4.4.2.2 LittleDog Robot
  4.5 Discussion

Chapter 5: Nonlinear High-Dimensional Regression
  5.1 Bayesian Local Kernel Shaping
    5.1.1 Model
    5.1.2 Inference
    5.1.3 Computational Complexity
  5.2 Extension to Gaussian Processes
  5.3 Experimental Results
    5.3.1 Synthetic Data
    5.3.2 Robot Data
  5.4 Discussion

Chapter 6: Conclusion
  6.1 Summary of Dissertation Contributions
  6.2 Summary of Chapters
    6.2.1 Chapter 2
    6.2.2 Chapter 3
    6.2.3 Chapter 4
    6.2.4 Chapter 5
  6.3 Opportunities for Future Research

Reference List

Appendix A: Variational Bayesian Least Squares
  A.1 Derivation of Variational Bayesian Least Squares
  A.2 Pseudocode for Variational Bayesian Least Squares
  A.3 EM Update Equations for Real-time Implementation

Appendix B: EMG Prediction from Neural Data: Plots
  B.1 Construction of Cross-validation Sets
  B.2 Plots of Percentage Matches with Baseline Study
  B.3 Real-time Analysis for Brain-Machine Interfaces
  B.4 Plots of EMG Traces and Velocities

Appendix C: Bayesian Factor Analysis for Regression
  C.1 Factor Analysis for Regression
  C.2 Derivation of Bayesian Joint Factor Analysis
  C.3 Monitoring the Incomplete Log Likelihood

Appendix D: Bayesian Local Kernel Shaping
  D.1 Deriving a Lower Bound Using Convex Duality
  D.2 Derivation of Bayesian Kernel Shaping

List Of Tables

2.1 Percentage relevant neuron matches found, averaged over all muscles
3.1 Root mean squared errors for position, velocity and feedback command for the robotic oculomotor vision head
3.2 Root mean squared errors for position, velocity and feedback command for the robotic anthropomorphic arm
4.1 Average prediction error for a linear function evaluated in batch form
5.1 Average prediction error for synthetic functions (i) and (ii)
5.2 Average prediction error for the nonlinear 2-d CROSS function

List Of Figures

2.1 Graphical model for linear regression
2.2 Graphical model for probabilistic backfitting
2.3 Graphical model for variational Bayesian least squares
2.4 Average prediction error for synthetic 100 input-dimensional data
2.5 Average prediction error for synthetic non-Normal 100 input-dimensional data
2.6 Average prediction error for Sergio & Kalaska (1998) M1 neural data
2.7 Average prediction error for Kakei et al. (1999) M1 neural data
2.8 Average prediction error for Kakei et al. (2001) PM neural data
2.9 Average number of relevant M1 neurons found for Sergio & Kalaska (1998) data
2.10 Average number of relevant M1 neurons found for Kakei et al. (1999) data
2.11 Average number of relevant PM neurons found for Kakei et al. (2001) data
2.12 Observed vs. predicted EMG traces under isometric force conditions for the infraspinatus muscle, given M1 neural firing from Sergio & Kalaska (1998)
2.13 Observed vs. predicted EMG traces under movement force conditions for the infraspinatus muscle from Sergio & Kalaska (1998)
3.1 Graphical model for joint factor analysis for regression
3.2 Graphical model for joint factor analysis for efficient estimation
3.3 Graphical model for Bayesian version of joint factor analysis
3.4 Sarcos robotic oculomotor vision head
3.5 Sarcos robotic anthropomorphic arm
3.6 Average prediction error on noiseless test data for synthetic 100 input-dimensional data with varying input SNR levels
4.1 Motorcycle data set and Old Faithful data set
4.2 Graphical model of Kalman filter and robust weighted Kalman filter
4.3 Predicted error for synthetic batch data sets, evaluated in an incremental manner
4.4 LittleDog quadruped robot by Boston Dynamics
4.5 Predicted vs. observed offset data between q IMU and q MOCAP
4.6 Predicted data shown for cosine function with noise and outliers
4.7 Predicted data shown for cosine function with noise and outliers
4.8 Observed quaternion data from LittleDog robot: a slowly drifting noisy signal with outliers
4.9
5.1 Graphical model of Bayesian local kernel shaping
5.2 Graphs of the function q i for different r values
5.3 Effect of outliers (in black circles) on local kernel shaping
5.4 Observed vs. predicted output made by stationary GP, augmented GP and Bayesian local kernel shaping for function (i)
5.5 Observed vs. predicted output made by stationary GP, augmented GP and Bayesian local kernel shaping for function (ii)
5.6 Observed vs. predicted output made by stationary GP, augmented GP and Bayesian local kernel shaping for function (iii)
5.7 Target function vs. predicted function produced by Bayesian local kernel shaping
5.8 Learnt weighting kernels in input space for 2-dimensional CROSS function
5.9 Predicted results from Bayesian local kernel shaping on motorcycle impact data set (Silverman 1985)
5.10 Predicted results from augmented GP on motorcycle impact data set (Silverman 1985)
5.11 Predicted results from infinite mixture of GP experts on motorcycle impact data set (Silverman 1985)
5.12 Predicted results from alternate infinite mixture of GP experts on motorcycle impact data set (Silverman 1985)
5.13 SensAble Phantom haptic robotic arm
5.14 Desired versus actual trajectories for SensAble Phantom robot arm
B.1 Cross-validation splits for the Sergio & Kalaska (1998) M1 neural data set
B.2 Cross-validation splits for the Kakei et al. (1999) M1 neural data set
B.3 Cross-validation splits for the Kakei et al. (2001) PM neural data set
B.4 Percentage of M1 neuron matches for the Sergio & Kalaska (1998) data
B.5 Percentage of M1 neuron matches for the Kakei et al. (1999) data
B.6 Percentage of PM neuron matches for the Kakei et al. (2001) data
B.7 Coefficient of determination values and number of relevant neurons found by batch VBLS and real-time VBLS in Sergio & Kalaska (1998) data set
B.8 Observed vs. predicted EMG traces for the ECRB muscle in the supinated wrist condition from Kakei et al. (1999)
B.9 Observed vs. predicted EMG traces for the ECRB muscle in the supinated wrist condition from Kakei et al. (2001)
B.10 Observed vs. predicted velocities in the x direction for the supinated wrist condition from Kakei et al. (1999)
B.11 Observed vs. predicted velocities in the y direction for the supinated wrist condition from Kakei et al. (1999)
B.12 Observed vs. predicted velocities in the x direction for the supinated wrist condition from Kakei et al. (2001)
B.13 Observed vs. predicted velocities in the y direction for the supinated wrist condition from Kakei et al. (2001)
Abstract

We propose a set of Bayesian methods that help us toward the goal of autonomous learning systems. Systems that can react autonomously, with minimal human supervision, have the potential for significant impact, especially in applications with considerable uncertainty in the environment. In current algorithms, parameter values are set using heuristic techniques, statistical cross-validation or other search procedures to find the “right” values. We rely on Bayesian inference as a principled way to eliminate open parameters, resulting in a black-box-like approach.

We are interested in scenarios where the input data is high-dimensional (and many input dimensions may be redundant or irrelevant) and where real-time, incremental learning may be needed. Such data sets are common in the domains of robotics and brain-machine interfaces, which are the main areas of evaluation in this dissertation. We start by examining the problem of regression, since classification can be performed by interpreting regression outputs in a binary way.

This dissertation first introduces a set of autonomous Bayesian methods that learn from data with the following properties: a high number of input dimensions, noise in the input data, and outliers. All these methods can be leveraged together to develop a local Bayesian kernel shaping framework for nonlinear regression. The Bayesian kernel shaping algorithm we propose is the first step towards realizing real-time autonomous learning systems. Even though the version described in this thesis is in batch form, it is computationally efficient and can be used in not only local methods, but also global nonlinear methods, such as Gaussian processes, for non-stationary function approximation.

Chapter 1

Introduction

1.1 Autonomous Learning

A long-standing dream of machine learning is to create autonomous learning systems that can operate, with minimal human supervision, in home, research and industrial applications.
An autonomous learning system learns from experience, not requiring any application-specific tuning of parameters by the human user, while having a “black-box”-like ability to work across different hardware platforms and environments. Parameters that require manual tuning, such as those used in gradient descent or structure selection, need to be minimized and, ideally, avoided. Most current learning algorithms require some amount of application-specific tailoring. One of the primary challenges in machine learning is to develop algorithms that are applicable across a broad range of applications with a minimal amount of modification. While a universal black box may not be possible for all systems, significant progress can be made in some domains.

In this dissertation, we address learning problems in sensor-rich and data-rich environments, such as those provided by autonomous vehicles, surveillance networks, biological systems and robotics. In these scenarios, the input data is high-dimensional and is used to make predictions, often in real-time.

1.1.1 What is Autonomous Learning?

Our goal is to devise algorithms that can deal autonomously and efficiently with high-dimensional data that is typically contaminated by noise, redundancy and irrelevant dimensions. The algorithms must learn nonlinear functions, potentially in an incremental and real-time fashion (where samples arrive sequentially, one at a time, instead of all together at the outset), for robust classification and regression. In this dissertation, an autonomous learning algorithm is loosely defined as an algorithm that can perform on different systems, environments and data sets with minimal tuning of parameters, interference, and involvement by the user.
1.1.2 Autonomous Learning and Model Complexity

Given our goal of autonomous learning systems, the challenge of the learning process is to remove any tuning of model parameters by the user and to include it as part of the inference procedure, in order to find a model that balances data fitting and model complexity, i.e., Occam’s razor. Existing approaches that attempt to find this “right” amount of model complexity include cross-validation (Stone 1974, Stone & Brooks 1990), early stopping (Finnof, Hergert & Zimmerman 1993), a penalized cost function with some regularization criterion, such as minimum description length (Rissanen 1978) or the Akaike Information Criterion (Akaike 1974), maximum-margin approaches (Taskar, Chatalbashev, Koller & Guestrin 2005), and maximizing Bayesian evidence (MacKay 1992), to name a few. It is this last approach of maximal Bayesian evidence that we choose to adopt in the following chapters.

1.1.3 Why Bayesian?

A statistical Bayesian framework offers many advantages. These include automatic complexity regularization; tractable approximate inference (when combined with approximation methods to reduce computational complexity), which is especially important in data-rich, real-time domains; confidence measures of performance; the incorporation of domain (or prior) knowledge; and integration into other probabilistic systems.

However, Bayesian approaches require that prior distributions be chosen, and this is typically done for analytical convenience rather than from real knowledge of the problem. Additionally, computationally intractable integrals arise that are hard to solve analytically (Jordan, Ghahramani, Jaakkola & Saul 1999).

1.2 Approximate Inference Methods

Various approximation methods have been proposed to cope with the intractable integrals that result from Bayesian inference. Some of these include variational approximations, e.g., (Jordan et al. 1999, Ghahramani & Beal 2000, Beal 2003), Expectation-Propagation (EP) (Minka 2001), Laplace’s method (MacKay 2003, Bishop 2006) and Markov Chain Monte Carlo (MCMC) sampling methods, e.g., (Gelfand 1996).

Variational methods, EP and Laplace’s method are deterministic techniques which approximate the marginal posterior distribution with a Gaussian. In contrast, Monte Carlo sampling methods are nondeterministic and rely on the law of large numbers to evaluate the integral. Some of these methods, such as EP and variational EM, are iterative. All attempt to choose the best approximation to the probability density function from within a tractable class of distributions (e.g., Gaussian, exponential, concave or convex, factorized, etc.).

Before we delve into the details of the various approximate inference methods, let us introduce, in the next section, a few quantities used in Bayesian learning.

1.2.1 Bayesian Learning: Quantities of Interest

To perform Bayesian inference, let us first assume that a data set D has been observed and that we want to estimate the underlying model of how the data was generated. We denote the model parameters as θ and the model as M.

In a Bayesian framework, probability distributions are placed over parameters in order to describe beliefs and uncertainties about parameter values. The prior distribution p(θ) describes the belief about the true values of θ. We can account for the information in the observed data D and use it to update the belief in θ to produce a posterior distribution p(θ|D). The three quantities of interest in Bayesian learning consist of the following:

• Evidence for model M:

p(D|M) = ∫ p(D,θ|M) dθ = ∫ p(D|θ) p(θ|M) dθ    (1.1)

Given a maximal Bayesian evidence framework, our goal is to maximize the evidence in order to trade off data fitting against model complexity. We can do this by maximizing the marginal likelihood of Eq. (1.1), in which the model parameters have been integrated out.
• Posterior distribution of parameters: We can use Bayes’ rule to compute the posterior distribution as:

p(θ|D) = p(D|θ) p(θ) / p(D)    (1.2)

where p(D|θ) is the likelihood and p(D) is the marginal likelihood.

• Predictive distribution: Given a new data sample x, the predictive distribution is:

p(x|D) = ∫ p(x,θ|D) dθ = ∫ p(x|θ) p(θ|D) dθ    (1.3)

Notice that the posterior distribution p(θ|D) is needed to calculate the predictive distribution.

A common problem is that the posterior p(θ|D) cannot be evaluated in closed form because the marginal likelihood p(D) consists of an integral that is analytically intractable. Indeed, the integrals in Eqs. (1.1) and (1.3) may also be hard to solve analytically. The next section describes some of the methods used to approximate analytically intractable integrals.

1.2.2 Variational Approximations

Variational methods have been applied in the fields of physics, statistics and control theory in the form of the calculus of variations, linear and non-linear moment problems, and dynamic programming. In the recent past, variational methods have been developed for approximate inference and estimation in the probabilistic models commonly found in machine learning. Variational methods convert a complex problem into a simpler one by decoupling the degrees of freedom in the original problem. The decoupling is achieved by introducing additional parameters, known as variational parameters, that must be optimized for the problem.

In this thesis, we leverage variational approximations in order to get a tractable lower bound on the marginal log likelihood. In particular, we use two approximations: a factorial variational approximation and a variational approach for concave or convex functions.

Variational Mean Field Theory: Variational mean field theory approximates a complex distribution by fully factorizing it into individual component distributions. Mean field theory has been used by Jaakkola & Jordan (2000), Jordan et al. (1999) and Ghahramani & Beal (2000) (the latter, under the name “variational Bayes”), to name a few, for finding a lower bound on the marginal likelihood and for providing analytical approximations to parameter posterior distributions.

The factorial variational approximation can be incorporated into the Expectation-Maximization (EM) algorithm (Dempster, Laird & Rubin 1977) to iteratively optimize a lower bound on the marginal likelihood. The iterations of variational Bayesian EM consist of a variational E-step, where the hidden variables are inferred using an ensemble of models (according to their posterior probability), and a variational M-step, where a posterior distribution over model parameters is inferred. Instead of adopting a fully factorized assumption on the posterior distribution, it is possible to use a more structured mean field approach, where the variational distribution factors across partitioned, disjoint sets of variables (Beal 2003).

Convex Duality: Variational transformations can also be achieved through the principle of convex duality (Rockafellar 1972) to obtain lower (or upper) bounds on a function. The principle of convex duality states that a concave function f(x) can be represented via a conjugate or dual function as follows:

f(x) = min_λ { λᵀx − f*(λ) }    (1.4)

where the conjugate function f*(λ) can be obtained from the dual expression:

f*(λ) = min_x { λᵀx − f(x) }    (1.5)

The framework of convex duality applies to convex functions as well, with the minimum operator replaced by the maximum operator.

A disadvantage of factorized variational methods is that the quality of the approximation may not necessarily be high, depending on the dependencies between hidden variables. Note that, aside from the two briefly sketched here, there exist other variational approaches for approximate Bayesian inference. These include the Bethe and Kikuchi family of variational methods (Yedidia, Freeman & Weiss 2001), approximations that bound partition functions with higher-order polynomials (Leisink & Kappen 2001), and more elaborate variational methods for upper bounds on partition functions (Wainwright, Jaakkola & Willsky 2002).

1.2.3 Expectation-Propagation

EP is an iterative algorithm that tries to choose the best approximation of the posterior within some tractable class of distributions. It is an extension of assumed-density filtering (ADF), e.g., (Maybeck 1979), which is a one-pass sequential method for computing an approximate posterior distribution. ADF processes observations sequentially, one by one, updating the posterior distribution after each data sample is observed. The sequential nature of ADF means that information discarded early on may turn out to be important at a later step. In contrast, EP makes additional passes, incorporating information from later observations so that choices made earlier can be refined. This iterative refinement means that the resulting approximate posterior is independent of the observation order and more accurate than that of ADF.

EP exploits the fact that the posterior can be expressed as a product of simpler parametric terms (i.e., in the exponential family). These simpler terms are approximated such that the Kullback-Leibler (KL) divergence (Kullback & Leibler 1951) between the true posterior and the approximate posterior is minimized. The result is a system of coupled equations that are iterated to reach a fixed point. While EP has no issues with local minima and is typically faster than other approaches, it is not guaranteed to converge.

1.2.4 Laplace Approximations

Laplace’s method approximates the posterior by a Gaussian distribution, which is found using a second-order Taylor expansion around the mode. Since the approximation is symmetric and locally fitted around the mode, the tails of the true posterior may not be covered, giving a poor approximation.
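To make the Laplace idea concrete, the following sketch fits a Gaussian at the mode of a Beta-shaped posterior, e.g. the posterior over a coin's bias after observing 4 heads and 2 tails under a uniform prior. The Beta(5, 3) target and all function names here are illustrative assumptions for this sketch, not algorithms from this dissertation:

```python
import math

def laplace_beta(a, b):
    """Gaussian (mean, std) fitted at the mode of a Beta(a, b) density."""
    # Mode of the Beta density (requires a, b > 1).
    mode = (a - 1.0) / (a + b - 2.0)
    # The negative second derivative of the log density at the mode gives
    # the precision of the fitted Gaussian:
    #   log p(t) = (a-1) log t + (b-1) log(1-t) + const
    #   -d2/dt2 log p(t) = (a-1)/t^2 + (b-1)/(1-t)^2
    precision = (a - 1.0) / mode ** 2 + (b - 1.0) / (1.0 - mode) ** 2
    return mode, math.sqrt(1.0 / precision)

mean, std = laplace_beta(5, 3)
print(round(mean, 4), round(std, 4))  # mode 2/3 ≈ 0.6667, std 1/√27 ≈ 0.1925
```

Note that the fitted Gaussian is centered at the mode 2/3, while the exact Beta(5, 3) posterior mean is 5/8 = 0.625: a small, concrete instance of the mode-centered bias and tail mismatch just described.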
1.2.5 MCMC Sampling MCMC methods generate samples from the posterior p(θ|D) by using evaluations of the unnormalizedposteriorp(D|θ)p(θ). Thestatisticsofthesamplesareusedtoapproximate properties of the posterior distribution. A Markov chain of parameter valuesθ 0 ,θ 1 ,...,θ n is generated such that the distribution of θ n asymptotically reaches that of the true posterior as the sequence length n increases. AdifficultywithMCMCsamplingmethodsisdetermininghowmanystepsareneeded in order to converge to the desired distribution, i.e., the Markov chain’s mixing time. However, MCMC sampling methods are advantageous when the posterior has a complex shape that is not necessarily Gaussian. Popular MCMC sampling methods include the Metropolis-Hastings method (Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953, Hastings1970),Gibbssampling(Geman&Geman1984),importancesampling(Robert& Casella 2005) and slice sampling (Neal 2003). More details on MCMC sampling methods can be found in (Neal 1993) and (MacKay 2003). 9 1.3 Key Challenges In this dissertation, we consider supervised learning problems and are interested in sce- narios with many data samples, high-dimensional data and where real-time learning (and fast computation) is needed. In particular, we focus on regression problems and learning models. Typicalmachinelearningdomains,incontrast,dealwithofflineanalysisoftrain- ing data and do not need any “online” learning, i.e., models are not updated or re-learnt during test time. Challenges to autonomous real-time learning can be divided into modeling challenges and algorithmic constraints. The existence of noise in inputs, outliers, irrelevant and redundant variables, and high dimensions pose modeling challenges for algorithms. Ad- ditionally, we would like algorithms to satisfy the following algorithmic constraints: be runnable in real-time, computationally efficient, and autonomous (requiring no cross- validation or tuning of parameters). 
There exists a wealth of methods in the fields of machine learning and statistics that address some, but not all, of these challenges, and rarely in an "autonomous" way. For the rest of the manuscript, we will be putting the pieces of the puzzle together, introducing individual autonomous algorithms that progressively build on each other and that address larger subsets of these challenges. Leveraging these autonomous methods, we will describe an automatic kernel shaping algorithm that learns the region of data samples that a local model should consider and cover. The algorithm can be applied not only to local methods but also to global methods.

1.4 Dissertation Outline

The remainder of this dissertation is organized as follows:

• Chapter 2 discusses how to perform linear regression in high-dimensional domains, where the input data has a large number of input dimensions, many of which are redundant or irrelevant.

• Chapter 3 considers the problem of high-dimensional linear regression when the input data is contaminated with noise. In such situations, typical regression methods are unable to handle noise in the input data and, consequently, produce biased estimates.

• Chapter 4 considers more realistic sensory data where outliers, as well as noise, are present in observed data. We propose a Bayesian formulation of weighted least squares that can automatically detect outliers in real-time. We incorporate this idea into the Kalman filter in order to learn an outlier-robust filter.

• Chapter 5 moves on to the nonlinear high-dimensional regression problem. We introduce a Bayesian kernel shaping algorithm that automatically learns the bandwidth of a local model. We demonstrate that Bayesian kernel shaping can be leveraged not only in local methods (such as locally weighted regression) but also in global methods that are linear in the parameters (such as Gaussian process regression).

• Chapter 6 concludes with a summary of contributions presented in this thesis and a discussion of future work.
Chapter 2

High Dimensional Linear Regression

In recent years, there has been growing interest in large-scale analyses of brain activity with respect to associated behavioral variables. In the area of brain-machine interfaces, neural firing has been used to control an artificial system like a robot (Nicolelis 2001, Taylor, Tillery & Schwartz 2002), to control a cursor on a computer screen via non-invasive brain signals (Wolpaw & McFarland 2004), or to classify visual stimuli presented to a subject (Kamitani & Tong 2004, Haynes & Rees 2005). The brain signals are typically high-dimensional, with large numbers of redundant and irrelevant signals. Linear modeling techniques like linear regression are among the primary analysis tools (Wessberg & Nicolelis 2004, Musallam, Corneil, Greger, Scherberger & Andersen 2004) for such data. However, the computational problem of data analysis not only involves data fitting, but also requires that the model extracted from the data has good generalization properties. Good generalization is crucial for predicting behavior from future neural recordings. Two examples where accurate behavior prediction is relevant include i) the continual online interpretation of brain activity to control prosthetic devices and ii) longitudinal scientific studies of information processing in the brain.

Surprisingly, robust linear modeling of high-dimensional data is non-trivial, as the danger of fitting noise and numerical problems is high. Classical techniques like ridge regression, stepwise regression (Draper & Smith 1981) or Partial Least Squares (PLS) regression (Wold 1975) are prone to overfitting and require careful human supervision for useful results. Other methods such as Least Absolute Shrinkage and Selection Operator (LASSO) regression (Tibshirani 1996) attempt to shrink certain regression coefficients to zero by L1-norm regularization, resulting in interpretable models that are sparse.
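To make the L1 shrinkage mechanism concrete, here is a minimal coordinate-descent sketch of LASSO (an illustration of mine, not this dissertation's implementation); the soft-thresholding step is what drives weakly relevant coefficients exactly to zero:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)              # precomputed column norms
    for _ in range(n_iters):
        for m in range(d):
            # partial residual that excludes coordinate m
            r = y - X @ b + X[:, m] * b[m]
            rho = X[:, m] @ r
            # soft-thresholding: exact zeros for weakly correlated inputs
            b[m] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[m]
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]              # only the first two inputs matter
b = lasso_coordinate_descent(X, y, lam=20.0)
# The three irrelevant coefficients are shrunk exactly to zero.
```

Note the price mentioned in the text: the penalty weight `lam` is an open parameter that must be set by the user or by cross-validation.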
However, these L1-regularized regression methods typically have an open parameter, such as a regularization parameter, that needs to be set or optimized. Methods for solving L1-regularized regression problems include convex optimization techniques such as sequential quadratic programming or interior-point methods, e.g., (Kim, Koh, Lustig, Boyd & Gorinevsky 2007), coordinate descent methods (Friedman, Hastie & Tibshirani 2007), the Gauss-Seidel method (Shevade & Keerthi 2003), generalized iterative scaling (Goodman 2004), and iteratively re-weighted least squares (Lokhorst 1999, Lee, Lee, Abbeel & Ng 2006).

In this chapter, we focus on improving linear data analysis for the high-dimensional scenarios described above. Our goal is to develop a statistically robust "black-box" approach that automatically detects the most relevant features for generalization. With the help of a variational Bayesian approximation and the introduction of "probabilistic backfitting" for linear regression, we develop a computationally efficient Bayesian treatment of linear regression with automatic relevance detection (ARD) (Neal 1994).

We demonstrate that the algorithm, called Variational Bayesian Least Squares (VBLS), can significantly improve the computational efficiency of sparse Bayesian learning, while performing feature detection and automatic relevance determination. Additionally, VBLS avoids any potentially expensive cross-validation or tuning of meta-parameters by the user, offering a statistically robust, automatic method that can be applied across data sets from various systems. VBLS can be interpreted as a Bayesian version of backfitting that does not require any sampling, making it suitable for implementation in incremental form for real-time applications (e.g., in application domains such as robotics, brain-machine interfaces, and tracking systems).
The algorithm's iterative nature is invaluable in real-time situations where decisions need to be made quickly, such that an approximate solution is acceptable. In these scenarios, waiting a longer time for a very accurate solution may not be an acceptable alternative. The algorithm is most advantageous when embedded in other iterative methods, such as the Relevance Vector Machine of Tipping (2001), since it is able to offer significant relative computational improvement. In this way, we can apply the algorithm to high-dimensional problems in both linear and nonlinear scenarios.

We also present an incremental, real-time version of the algorithm, demonstrating its suitability for real-time interfaces between brains and machines. In addition to evaluations on several synthetic data sets and benchmark data sets, we apply the algorithm to the reconstruction of electromyographic (EMG) data from motor cortical firing, for the purpose of identifying whether M1 and PM cortical neurons can directly predict EMG traces (Todorov 2000).

2.1 Computationally Tractable Linear Regression

Before developing our algorithm, it is useful to briefly revisit classical linear regression techniques. Assuming there are N observed data samples in the data set D = {x_i, y_i}_{i=1}^{N} (where x_i ∈ R^{d×1} are inputs and y_i are scalar outputs), the standard model for linear regression is:

y_i = Σ_{m=1}^{d} b_m x_{im} + ε    (2.1)

where b is the regression vector composed of the components b_m, d is the number of input dimensions, and ε is additive mean-zero noise. Given D, the goal is to estimate the optimal linear coefficients b = [b_1 b_2 ... b_d]^T that combine the input dimensions to produce the output y. It is easy to see that, under our current noise model, the optimal estimate of the regression parameters (in a least-squares or maximum-likelihood sense) is given by:

b_OLS = (X^T X)^{-1} X^T y    (2.2)

where X denotes a matrix whose rows contain the vectors x_i^T and y is a column vector containing the corresponding y_i. Eq.
(2.2) is also known as the ordinary least squares (OLS) solution. The main problem with OLS regression in high-dimensional input spaces is that the full-rank assumption on X^T X (required for its inversion) is often violated due to underconstrained data sets.

Ridge regression (Hoerl & Kennard 1970) can "fix" such problems numerically by stabilizing the matrix inversion with a diagonal term, i.e., (X^T X + αI)^{-1}, but usually introduces uncontrolled bias. Additionally, if the input dimensionality exceeds around 1000 dimensions, the matrix inversion can become prohibitively computationally expensive.

Several ideas exist for improving over OLS. First, stepwise regression (Draper & Smith 1981) can be employed. However, it has been strongly criticized for its potential for overfitting and its inconsistency in the presence of collinearity in the input data (Derksen & Keselman 1992). To deal with such collinearity directly, dimensionality reduction techniques like Principal Components Regression (PCR) and Factor Regression (FR) (Massey 1965) are useful. These methods retain components in input space with large variance, regardless of whether these components influence the prediction (Schaal, Vijayakumar & Atkeson 1998), and can even eliminate low-variance inputs that may have high predictive power for the outputs (Frank & Friedman 1993).

Another class of linear regression methods are projection regression techniques, most notably PLS regression (Wold 1975). PLS regression performs computationally inexpensive O(d) univariate regressions along projection directions, chosen according to the correlation between inputs and outputs. While slightly heuristic in nature, PLS regression is a surprisingly successful algorithm for high-dimensional, ill-conditioned regression problems, although it also has a tendency to overfit (Schaal et al. 1998).
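The numerical issue behind the ridge "fix" above can be illustrated in a few lines (a sketch of mine, using standard NumPy): with a redundant input column, X^T X is singular and its direct inversion is ill-posed, while a small diagonal term αI stabilizes the solve at the price of some bias.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x1 = rng.standard_normal(N)
X = np.column_stack([x1, 2.0 * x1])        # second column redundant: X^T X is singular
y = x1 + 0.01 * rng.standard_normal(N)

XtX = X.T @ X
# OLS would need to invert the singular matrix XtX;
# its condition number is (numerically) enormous.
cond_ols = np.linalg.cond(XtX)

# Ridge regression stabilizes the inversion with the diagonal term alpha * I.
alpha = 1e-6
b_ridge = np.linalg.solve(XtX + alpha * np.eye(2), X.T @ y)
pred = X @ b_ridge
# The ridge fit still predicts y well even though XtX itself is singular.
```

The choice of α here is arbitrary, which is exactly the kind of open tuning parameter this chapter aims to eliminate.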
There are also more efficient methods for matrix inversion, e.g., (Hastie & Tibshirani 1990, Strassen 1969), but these methods assume a well-conditioned regression problem a priori and degrade in the presence of collinearities in the inputs. Other sparse matrix factorization and conjugate gradient techniques, e.g., (Golub & Van Loan 1989, Chow & Saad 1998, Smola & Schölkopf 2000), are preconditioners, attempting to transform matrices by factoring them into more computationally convenient and analytically simpler forms before inverting them.

Finally, there is a class of sparsity-inducing methods such as LASSO (Least Absolute Shrinkage and Selection Operator) regression (Tibshirani 1996) that, along with other L1-regularized regression methods, attempt to shrink certain regression coefficients in the solution to zero by using an L1 penalty norm, instead of the L2 penalty norm used by ridge regression. These methods are suitable for high-dimensional data sets, at the expense of requiring an open parameter (i.e., a fixed bound on the penalty norm). In these methods, the regularization parameter needs to be optimized, using cross-validation, convex optimization or other efficient search techniques.

In the next section, we describe a variational Bayesian linear regression algorithm (VBLS) that automatically regularizes against problems of overfitting. The iterative nature of the algorithm (due to its formulation as an EM problem) avoids the computational cost and numerical problems of matrix inversions that are faced in high-dimensional OLS regression and in the previously proposed sparse variational linear regression methods of Tipping (2001) and Bishop (2006). VBLS can be embedded into other iterative methods in order to realize computationally efficient updates, as we shall demonstrate in subsequent chapters.
Note, however, that if accurate results are needed (and computational resources are unlimited) for data sets with fully relevant input dimensions, VBLS is not as efficient as the matrix inversion in OLS. The advantage of VBLS arises when dealing with high-dimensional input spaces, where it serves as an efficient and robust "automatic" regression method.

Figure 2.1: Graphical model for linear regression. Random variables are in circular nodes, observed random variables are in shaded double circles, and point estimated parameters are in square nodes.

Conceptually, the algorithm can be interpreted as a Bayesian version of either backfitting or PLS regression.

2.2 Variational Bayesian Least Squares

2.2.1 EM-based Linear Regression

Figures 2.1 to 2.3 illustrate the progression of graphical models needed to develop a robust Bayesian version of linear regression (D'Souza, Vijayakumar & Schaal 2004, Ting, D'Souza, Yamamoto, Yoshioka, Hoffman, Kakei, Sergio, Kalaska, Kawato, Strick & Schaal 2005).

Figure 2.1 depicts the standard linear regression model. Part of the inspiration for our algorithm comes from PLS regression, motivated by the question of how to find maximally predictive projections in input space, which is also part of various other "subset" selection techniques in regression (Wessberg & Nicolelis 2004). If we knew the optimal projection direction of the input data, the entire regression problem could be solved by a univariate regression between the projected data and the outputs: this optimal projection direction is simply the true gradient between inputs and outputs.

Figure 2.2: Graphical model for probabilistic backfitting. Random variables are in circular nodes, observed random variables are in shaded double circles, and point estimated parameters are in square nodes.
Since we do not know this projection direction, we now encode its coefficients as hidden variables, in the tradition of EM algorithms (Dempster et al. 1977). Figure 2.2 shows the corresponding graphical model. The unobserved variables z_im (where i = 1,...,N denotes the index into the data set of N data points) are the result of the input variables being projected onto the respective projection direction component (i.e., b_m). The z_im are then summed up to form a predicted output y_i. More formally, we can modify the linear regression model in Eq. (2.1) to become:

y_i = Σ_{m=1}^{d} z_{im} + ε_y    (2.3)

z_{im} = b_m x_{im} + ε_{zm}    (2.4)

For a probabilistic treatment with EM, we make a standard normal assumption for all distributions in the form of:

y_i | z_i ~ Normal(y_i; 1^T z_i, ψ_y)
z_{im} | x_{im} ~ Normal(z_{im}; b_m x_{im}, ψ_{zm})    (2.5)

where 1 = [1, 1, ..., 1]^T. While this model is still identical to OLS, notice that in the graphical model of Figure 2.2, the regression coefficients b_m are behind the fan-in to the outputs y_i. We call this model Probabilistic Backfitting, since we can view the resulting derived update equation for the regression coefficients b_m as a probabilistic version of backfitting.

With the introduction of the unobserved random variables z_im, we are essentially in a situation where we wish to optimize the parameters φ = { {b_m, ψ_{zm}}_{m=1}^{d}, ψ_y }, given the observed variables D = {x_i, y_i}_{i=1}^{N}. This situation fits very naturally into the framework of maximum-likelihood estimation via the EM algorithm.

The EM algorithm maximizes the incomplete log likelihood log p(y|X) by maximizing the expected complete log likelihood ⟨log p(y, Z|X; φ)⟩ (where ⟨·⟩ denotes the expectation operator):

log p(y, Z|X; φ) = log [ Π_{i=1}^{N} p(y_i|z_i) p(z_i|x_i) ]
= Σ_{i=1}^{N} log p(y_i|z_i) + Σ_{i=1}^{N} log p(z_i|x_i)
= -(N/2) log ψ_y - (1/(2ψ_y)) Σ_{i=1}^{N} (y_i - 1^T z_i)²
  - (N/2) Σ_{m=1}^{d} log ψ_{zm} - Σ_{m=1}^{d} (1/(2ψ_{zm})) Σ_{i=1}^{N} (z_{im} - b_m x_{im})² + const    (2.6)

where z_i ∈ R^{d×1} consists of the elements z_im and Z ∈ R^{N×d} contains the vectors z_i in its rows.
The resulting EM updates require standard manipulations of normal distributions (please refer to Appendix B.4 of D'Souza (2004) for the derivations) and are shown below:

E-step:

1^T Σ_z 1 = (Σ_{m=1}^{d} ψ_{zm}) [ 1 - (1/s) Σ_{m=1}^{d} ψ_{zm} ]    (2.7)

σ²_{zm} = ψ_{zm} (1 - (1/s) ψ_{zm})    (2.8)

⟨z_{im}⟩ = b_m x_{im} + (1/s) ψ_{zm} (y_i - b^T x_i)    (2.9)

M-step:

b_m = Σ_{i=1}^{N} ⟨z_{im}⟩ x_{im} / Σ_{i=1}^{N} x_{im}²    (2.10)

ψ_y = (1/N) Σ_{i=1}^{N} (y_i - 1^T ⟨z_i⟩)² + 1^T Σ_z 1    (2.11)

ψ_{zm} = (1/N) Σ_{i=1}^{N} (⟨z_{im}⟩ - b_m x_{im})² + σ²_{zm}    (2.12)

where ⟨·⟩ denotes the expectation operator, we define s = ψ_y + Σ_{m=1}^{d} ψ_{zm}, and Σ_z = Cov(z|y, X) is the covariance matrix of z. This EM version of least squares regression is guaranteed to converge to the same solution as OLS (D'Souza et al. 2004).

One EM update has a computational complexity of O(Nd) per EM iteration, where d is the number of input dimensions, instead of the O(d³) associated with OLS regression, or even O(d²) if more efficient and robust matrix inversion methods are used. This efficiency comes at the cost of an iterative solution, instead of a one-shot solution for b as in OLS. Should the number of EM iterations be significant, the run-time of the EM algorithm could be as long as that of non-iterative approaches. However, as previously mentioned, the true benefit of our iterative approach arises when dealing with real-time applications (where decisions need to be made quickly, such that an approximate solution is acceptable) and also when embedded in other iterative methods in order to realize more computationally efficient update equations.

The new EM algorithm appears to merely replace the matrix inversion in OLS by an iterative method, as others have done with alternative algorithms (Strassen 1969, Hastie & Tibshirani 1990). However, the convergence guarantee of EM is an improvement over previous approaches.
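A minimal sketch of the probabilistic backfitting iteration (my own illustration, keeping the noise variances ψ_y and ψ_zm fixed for simplicity, which the full algorithm does not do): the fixed point of the b_m update satisfies the normal equations, so the iterates converge toward the OLS solution.

```python
import numpy as np

def probabilistic_backfitting(X, y, n_iters=500, psi_y=1.0):
    """EM-style updates of Eqs. (2.9)-(2.10), with fixed noise variances."""
    N, d = X.shape
    psi_z = np.ones(d)                     # kept fixed in this simplified sketch
    s = psi_y + psi_z.sum()
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iters):
        res = y - X @ b                    # residuals y_i - b^T x_i
        # E-step, Eq. (2.9): <z_im> = b_m x_im + (psi_zm / s) * residual
        Z = X * b + np.outer(res, psi_z / s)
        # M-step, Eq. (2.10): b_m = sum_i <z_im> x_im / sum_i x_im^2
        b = (Z * X).sum(axis=0) / col_sq
    return b

rng = np.random.default_rng(2)
X = rng.standard_normal((400, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + 0.1 * rng.standard_normal(400)
b_em = probabilistic_backfitting(X, y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # iterates converge toward OLS
```

Each pass costs O(Nd), with no d×d matrix ever formed or inverted, which is the computational point made above.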
The true power of this probabilistic formulation becomes apparent when we add a Bayesian layer to achieve robustness in the face of ill-conditioned data.

2.2.2 Automatic Relevance Determination

From a Bayesian point of view, the parameters b_m should be treated probabilistically as well, such that we can integrate them out to safeguard against overfitting.

Figure 2.3: Graphical model for variational Bayesian least squares. Random variables are in circular nodes, observed random variables are in shaded double circles, and point estimated parameters are in square nodes.

For this purpose, as shown in Figure 2.3, we introduce precision variables α_m over each regression parameter b_m, as previously done in (Tipping 2001):

b | α ~ Π_{m=1}^{d} Normal(b_m; 0, 1/α_m)
α ~ Π_{m=1}^{d} Gamma(α_m; a_{αm,0}, b_{αm,0})    (2.13)

where α ∈ R^{d×1} consists of the components α_m and {a_{αm,0}, b_{αm,0}} are the initial hyperparameter values for α_m.

We now have a mechanism that infers the significance of each dimension's contribution to the observed output y. The key quantity that determines the relevance of a regression input is the parameter α_m. A priori, we assume that every b_m has a mean-zero distribution with broad variance 1/α_m. If the posterior value of α_m turns out to be very large after all model parameters are estimated (equivalent to a very small variance of b_m), then the corresponding distribution of b_m must be sharply peaked at zero. Such a posterior gives strong evidence that b_m is very close to 0 and that the regression input x_m has no contribution to the output. Thus, this Bayesian model automatically detects irrelevant input dimensions and regularizes against ill-conditioned data sets.

Even though Eq. (2.13) looks very similar to that of Tipping (2001) and the later work of Bishop (2006), our model has the key property that it is computationally efficient, requiring O(Nd) per EM iteration.
In contrast, the methods of Bishop (2006) and Tipping (2001) take O(d³) and O(N³) per iteration, respectively, becoming prohibitively expensive for large data sets with a very large input dimensionality d. It is the efficient nature of our proposed algorithm, Variational Bayesian Least Squares, that makes it suitable for real-time analysis of very large amounts of very high-dimensional data, as required in brain-machine interfaces. We discuss this application in more detail in the Evaluation section.

The final model for VBLS has the following distributions:

y_i | z_i ~ Normal(1^T z_i, ψ_y)
z_{im} | b_m, α_m, x_{im} ~ Normal(b_m x_{im}, ψ_{zm}/α_m)
b_m | α_m ~ Normal(0, 1/α_m)
α_m ~ Gamma(a_{αm,0}, b_{αm,0})    (2.14)

As a note, the Gaussian prior used above for b_m is a standard prior in Bayesian linear regression, e.g., (Bishop 2006). However, a Laplace prior could be used as well, and the result, when used with MAP estimation, would be similar to LASSO. We choose not to pursue this direction, but note that the Laplace density can be re-written in a hierarchical manner, as done above, by modeling the variance of b_m as a Gamma distribution with one hyperparameter, i.e., an exponential, as done by (Figueiredo 2003). Integrating out the hyperparameter gives the Laplace marginal prior.

The dependency of z_im on the precision α_m may seem unnecessary, but Gelman, Carlin, Stern & Rubin (2000) provide a justification: it is reasonable to assume that the variance of z_im scales with the variance of b_m, since increasing our uncertainty in the prior of b_m should imply a corresponding increase in the uncertainty of z_im as well. In this case, we obtain a joint posterior distribution Q(b, α), which is then marginalized to get the individual distributions Q(b) and Q(α). With this formulation, the marginal distribution over b is now a product of Student-t distributions instead of Gaussian distributions.
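For completeness, a short derivation (standard material, not specific to this dissertation) of why integrating the Gamma-distributed precision out of the Gaussian prior yields a Student-t marginal:

```latex
\[
p(b_m) = \int_0^\infty \mathcal{N}\!\left(b_m; 0, \tfrac{1}{\alpha_m}\right)
         \mathrm{Gamma}(\alpha_m; a, b)\, d\alpha_m
       = \int_0^\infty \frac{\alpha_m^{1/2}}{\sqrt{2\pi}}
         e^{-\frac{\alpha_m b_m^2}{2}}\,
         \frac{b^{a}}{\Gamma(a)}\, \alpha_m^{a-1} e^{-b \alpha_m}\, d\alpha_m .
\]
Collecting the powers of $\alpha_m$ leaves a Gamma integral:
\[
p(b_m) = \frac{b^{a}}{\sqrt{2\pi}\,\Gamma(a)}
         \int_0^\infty \alpha_m^{a - \frac{1}{2}}
         e^{-\alpha_m \left(b + \frac{b_m^2}{2}\right)} d\alpha_m
       = \frac{b^{a}\,\Gamma\!\left(a + \tfrac{1}{2}\right)}
              {\sqrt{2\pi}\,\Gamma(a)\left(b + \tfrac{b_m^2}{2}\right)^{a + \frac{1}{2}}},
\]
```

which is an (unnormalized-looking but exactly normalized) Student-t density in b_m with 2a degrees of freedom; its heavier-than-Gaussian tails are what permit the hypothesis tests mentioned later in this chapter.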
As the graphical model in Figure 2.3 shows, the set of unobserved variables in the model is now {b, α, {z_i}_{i=1}^{N}}. An EM-like algorithm (Ghahramani & Beal 2000) can be used to find the posterior updates of all distributions, where we maximize the incomplete log likelihood log p(y|X) by maximizing the expected complete log likelihood ⟨log p(y, Z, b, α|X; φ)⟩:

log p(y, Z, b, α|X; φ)
= log p(y, Z, b, α|X; Ψ_z, ψ_y, a_α, b_α)
= log [ (Π_{i=1}^{N} p(y_i|z_i) p(z_i|b, α, x_i)) p(b|α) p(α) ]
= Σ_{i=1}^{N} log p(y_i|z_i) + Σ_{i=1}^{N} Σ_{m=1}^{d} log p(z_{im}|b_m, α_m, x_{im}) + Σ_{m=1}^{d} log p(b_m|α_m) + Σ_{m=1}^{d} log p(α_m)
= -(N/2) log ψ_y - (1/(2ψ_y)) Σ_{i=1}^{N} (y_i - 1^T z_i)² - (N/2) Σ_{m=1}^{d} log (ψ_{zm}/α_m) - Σ_{m=1}^{d} (α_m/(2ψ_{zm})) Σ_{i=1}^{N} (z_{im} - b_m x_{im})²
  + (1/2) Σ_{m=1}^{d} log α_m - (1/2) Σ_{m=1}^{d} α_m b_m² + Σ_{m=1}^{d} (a_{αm,0} - 1) log α_m - Σ_{m=1}^{d} b_{αm,0} α_m + const    (2.15)

where a_{αm,0} and b_{αm,0} are the initial parameter values, set to reflect our confidence in the prior distribution of b_m. In order to obtain a tractable posterior distribution over all hidden variables b, z_i and α, we use a factorial variational approximation of the true posterior (Ghahramani & Beal 2000): Q(α, b, Z) = Q(α, b) Q(Z).

Note that the connection from α_m to the corresponding z_im in Figure 2.3 is an intentional design choice. As previously mentioned, under this graphical model, the marginal distribution of b_m becomes a Student t-distribution, allowing for traditional hypothesis testing (Gelman et al. 2000). The minimal factorization of the posterior into Q(α, b)Q(Z) would not be possible without this special design.

The variational Bayesian approximation used here allows us to reach a tractable posterior distribution over all hidden variables, such that we can proceed to infer the posterior distributions. Variational Bayesian learning approximates the intractable joint distribution over hidden states and parameters with a simpler distribution, e.g., assuming independence between hidden states and parameters such that the posterior distributions are factorized.
An exact Bayesian solution is not feasible, since one would need to compute the marginals of the joint posterior distribution, and this is not analytically possible. For discussions on the quality of variational Bayesian approximations and how they compare to the true solution, please refer to (Jordan et al. 1999, Jaakkola & Jordan 2000, Attias 2000, Ghahramani & Beal 2000). We will return to this point in the Discussion section.

After some algebraic manipulations (details of the derivations can be found in Appendix A.1), the final EM posterior update equations become:

E-step:

Σ_z = ( (1/ψ_y) 1 1^T + Ψ_z^{-1} ⟨A⟩ )^{-1}
    = Ψ_z ⟨A⟩^{-1} - [ Ψ_z ⟨A⟩^{-1} 1 1^T Ψ_z ⟨A⟩^{-1} ] / [ ψ_y + 1^T Ψ_z ⟨A⟩^{-1} 1 ]    (2.16)

⟨z_i⟩ = Σ_z ( (1/ψ_y) 1 y_i + Ψ_z^{-1} ⟨A⟩ ⟨B|A⟩ x_i )
     = [ Ψ_z ⟨A⟩^{-1} 1 / (ψ_y + 1^T Ψ_z ⟨A⟩^{-1} 1) ] y_i + [ ⟨B|A⟩ - Ψ_z ⟨A⟩^{-1} 1 1^T ⟨B|A⟩ / (ψ_y + 1^T Ψ_z ⟨A⟩^{-1} 1) ] x_i    (2.17)

σ²_{bm|αm} = (ψ_{zm}/⟨α_m⟩) ( Σ_{i=1}^{N} x_{im}² + ψ_{zm} )^{-1}    (2.18)

⟨b_m|α_m⟩ = ( Σ_{i=1}^{N} x_{im}² + ψ_{zm} )^{-1} ( Σ_{i=1}^{N} ⟨z_{im}⟩ x_{im} )    (2.19)

â_{αm} = a_{αm,0} + N/2    (2.20)

b̂_{αm} = b_{αm,0} + (1/(2ψ_{zm})) [ Σ_{i=1}^{N} ⟨z_{im}²⟩ - ( Σ_{i=1}^{N} x_{im}² + ψ_{zm} )^{-1} ( Σ_{i=1}^{N} ⟨z_{im}⟩ x_{im} )² ]    (2.21)

⟨α_m⟩ = â_{αm} / b̂_{αm}    (2.22)

M-step:

ψ_y = (1/N) Σ_{i=1}^{N} (y_i - 1^T ⟨z_i⟩)² + 1^T Σ_z 1    (2.23)

ψ_{zm} = (1/N) Σ_{i=1}^{N} ⟨α_m⟩ (⟨z_{im}⟩ - ⟨b_m|α_m⟩ x_{im})² + ⟨α_m⟩ σ²_{zm} + ⟨α_m⟩ σ²_{bm|αm} (1/N) Σ_{i=1}^{N} x_{im}²    (2.24)

where ⟨A⟩, ⟨B|A⟩ and Ψ_z are diagonal matrices of ⟨α⟩, ⟨b|α⟩ and ψ_z, respectively, and Σ_z is a diagonal covariance matrix with diagonal vector σ²_z. Note that ⟨z_{im}²⟩ = ⟨z_{im}⟩² + σ²_{zm}, where σ²_{zm} is the m-th element of the vector σ²_z.

Initialization of Priors: The hyperparameters of α_m are learnt using EM, as shown by Eqs. (2.20) and (2.21). We set the initial values of the hyperparameters, a_{α,0} and b_{α,0}, in an uninformative way and use values of a_{αm,0} = 10^{-8} and b_{αm,0} = 10^{-8} for all m = 1,...,d. This means that the initial value of α_m is 1, with high uncertainty, i.e., α_m has a rather flat prior distribution.
These initial hyperparameter values for α_m never need to be changed, regardless of the data set or system. In this way, the algorithm retains a "black-box" quality.

Notice that the update equation for ⟨b_m|α_m⟩ can be rewritten recursively, so that the posterior mean of b_m in the (n+1)-th EM iteration is:

⟨b_m|α_m⟩^{(n+1)} = [ Σ_{i=1}^{N} x_{im}² / ( Σ_{i=1}^{N} x_{im}² + ψ_{zm} ) ] ⟨b_m|α_m⟩^{(n)} + (ψ_{zm}/(s α_m)) [ Σ_{i=1}^{N} (y_i - ⟨b|α⟩^{(n)T} x_i) x_{im} / ( Σ_{i=1}^{N} x_{im}² + ψ_{zm} ) ]    (2.25)

where s = ψ_y + 1^T Ψ_z ⟨A⟩^{-1} 1. Examining Eq. (2.25), we see that the first term is a decaying term. In the absence of a correlation between the current input dimension and the residual error (i.e., if the second term is zero), the mean of the current regression coefficient ⟨b_m|α_m⟩ will approach zero after some number of EM iterations. The resulting regression solution regularizes over the number of retained inputs in the final regression vector, performing a functionality similar to Automatic Relevance Determination (ARD) (Neal 1994).

The update equations of VBLS, Eqs. (2.16) to (2.24), have an algorithmic complexity of O(Nd) per EM iteration. However, the number of EM iterations required before convergence is an open issue and could be large. Hence, VBLS is most advantageous in incremental, real-time scenarios where, due to hard time constraints, an approximately accurate solution is satisfactory in lieu of an accurate solution that takes unacceptably long to compute. One can further show that the marginal distribution of each b_m is a t-distribution with t = ⟨b_m|α_m⟩/σ_{bm|αm} and 2â_{αm} degrees of freedom, which allows a principled way of determining, by means of standard hypothesis testing, whether a regression coefficient was excluded. The pseudocode for VBLS can be found in Appendix A.2.
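Since Appendix A.2 is not reproduced here, the following is an unofficial sketch of the batch VBLS updates, Eqs. (2.16) to (2.24), exploiting the diagonal-plus-rank-one structure of Σ_z so that one iteration costs O(Nd); variable names are mine, and the authoritative pseudocode remains the appendix's.

```python
import numpy as np

def vbls(X, y, n_iters=300, a0=1e-8, b0=1e-8):
    """Sketch of batch Variational Bayesian Least Squares, Eqs. (2.16)-(2.24)."""
    N, d = X.shape
    psi_y = 1.0
    psi_z = np.ones(d)
    alpha = np.ones(d)                   # <alpha_m>, initially 1 with a flat prior
    b = np.zeros(d)                      # <b_m | alpha_m>
    col_sq = (X ** 2).sum(axis=0)        # sum_i x_im^2
    for _ in range(n_iters):
        # ---- E-step ----
        p = psi_z / alpha                # diagonal of Psi_z <A>^{-1}
        s = psi_y + p.sum()
        sigma2_z = p - p ** 2 / s        # diagonal of Sigma_z, Eq. (2.16)
        res = y - X @ b
        Z = X * b + np.outer(res, p / s)             # <z_im>, expanded Eq. (2.17)
        zx = (Z * X).sum(axis=0)                     # sum_i <z_im> x_im
        sigma2_b = p / (col_sq + psi_z)              # Eq. (2.18)
        b = zx / (col_sq + psi_z)                    # Eq. (2.19)
        a_hat = a0 + N / 2.0                         # Eq. (2.20)
        z2 = (Z ** 2).sum(axis=0) + N * sigma2_z     # sum_i <z_im^2>
        b_hat = b0 + (z2 - zx ** 2 / (col_sq + psi_z)) / (2.0 * psi_z)  # Eq. (2.21)
        alpha = a_hat / b_hat                        # Eq. (2.22)
        # ---- M-step ----
        res_y = y - Z.sum(axis=1)
        psi_y = (res_y ** 2).mean() + p.sum() - p.sum() ** 2 / s        # Eq. (2.23)
        psi_z = (alpha * ((Z - X * b) ** 2).mean(axis=0)
                 + alpha * sigma2_z
                 + alpha * sigma2_b * col_sq / N)                       # Eq. (2.24)
    return b, alpha

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.05 * rng.standard_normal(300)  # 3rd input irrelevant
b, alpha = vbls(X, y)
# The irrelevant coefficient b[2] is driven toward zero while alpha[2] grows.
```

The shrinkage behavior visible here (a growing precision ⟨α_m⟩ suppressing b_m) is exactly the ARD mechanism of Eq. (2.25).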
2.3 Real-time Implementation

Due to its computationally efficient nature, the VBLS algorithm presented in Algorithm 1 lends itself to scenarios where fast, online learning with large amounts of high-dimensional data is required, such as real-time brain-machine interfaces. Previous work by Sato & Ishii (2000) and Sato (2001) has shown that an online version of the variational Bayes framework can be derived, such that online model selection can be done with guaranteed convergence. A scalar discount factor, or forgetting rate, is typically introduced in order to forget estimates that were calculated earlier (and hence were less accurate). Sato & Ishii (2000) and Sato (2001) introduce a time-dependent schedule for the discount factor and prove convergence of the online EM-based algorithm. Since the main focus of this chapter is on the batch form of the algorithm, we show only a proof-of-concept and use a constant-valued discount factor (with a heuristically set value) in order to demonstrate that the batch VBLS algorithm can be translated into incremental form. We leave the detailed theoretical development of the online version of the algorithm with a discount factor schedule for future work.

In particular, we introduce a forgetting rate, 0 ≤ λ ≤ 1, to exponentially discount data collected in the past, as done in (Ljung & Söderström 1983). The forgetting rate enters the algorithm by accumulating the sufficient statistics of the batch algorithm in an incremental way. Setting λ = 0 ensures that all past samples are forgotten, while setting λ = 1 ensures that none of the observed samples are forgotten. We can then extract the sufficient statistics by examining the batch EM equations, Eqs. (2.16) to (2.24). The incremental EM update equations are listed in Appendix A.3.
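Since Appendix A.3 is not reproduced here, the following is an illustrative sketch (my own, not the appendix's equations) of how exponentially discounted sufficient statistics accumulate; batch quantities such as Σ_i x_im² and Σ_i ⟨z_im⟩x_im would be maintained this way:

```python
import numpy as np

class DiscountedStats:
    """Exponentially discounted sufficient statistics with forgetting rate lam.

    lam = 1 keeps all past samples; lam = 0 forgets everything but the newest.
    """
    def __init__(self, d, lam=0.99):
        self.lam = lam
        self.sum_x2 = np.zeros(d)      # discounted version of sum_i x_im^2
        self.sum_zx = np.zeros(d)      # discounted version of sum_i <z_im> x_im

    def update(self, x, z):
        self.sum_x2 = self.lam * self.sum_x2 + x ** 2
        self.sum_zx = self.lam * self.sum_zx + z * x

    def b_mean(self, psi_z):
        # incremental analogue of Eq. (2.19)
        return self.sum_zx / (self.sum_x2 + psi_z)

stats = DiscountedStats(d=2, lam=0.99)
rng = np.random.default_rng(4)
for _ in range(2000):
    x = rng.standard_normal(2)
    z = 1.5 * x                        # pretend posterior means <z_im> = 1.5 x_im
    stats.update(x, z)
b = stats.b_mean(psi_z=np.ones(2))
# b approaches 1.5 as the discounted statistics accumulate
```

With a constant λ the effective memory is roughly 1/(1 - λ) samples, which is why a heuristically set value suffices for the proof-of-concept described above.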
2.4 Evaluation

We turn to the application and evaluation of VBLS in the context of predicting EMG data from neural data recorded in the primary motor (M1) and premotor (PM) cortices of monkeys (Ting et al. 2005, Ting, D'Souza, Yamamoto, Yoshioka, Hoffman, Kakei, Sergio, Kalaska, Kawato, Strick & Schaal 2008). The key questions addressed in this application were i) whether EMG data can be reconstructed accurately with good generalization, ii) how many neurons contribute to the reconstruction of each muscle, and iii) how well the VBLS algorithm compares to other analysis techniques. The underlying assumption of this analysis was that the relationship between cortical neural firing and muscle activity is approximately linear.

Before applying VBLS to real data, however, we first run it on synthetic data sets where "ground truth" is known, in order to better evaluate its performance in a controlled setting.

2.4.1 Synthetic Data

2.4.1.1 Data sets

We generated random input training data consisting of 100 dimensions, 10 of which were relevant. The other 90 were either irrelevant or redundant dimensions, as we explain below. Each of the first 10 input dimensions was drawn from a Gaussian distribution with some random covariance. The output data was then generated from the relevant input data using the vector b ∈ R^{10×1}, where each coefficient b_m of b was drawn from a Normal(0, 100) distribution, subject to the constraint that it cannot be zero (since a zero coefficient would indicate an irrelevant dimension). Additive mean-zero Gaussian noise of varying levels was added to the outputs.

Noise in the outputs was parameterized with the coefficient of determination, r², of standard linear regression, defined as:

r² = (σ²_y - σ²_res) / σ²_y

where σ²_y is the variance of the outputs and σ²_res is the variance of the residual error. We added noise scaled to the variance of the noiseless outputs ȳ such that σ²_noise = c σ²_ȳ, where c = 1/r² - 1.
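The noise-scaling recipe above can be sketched as follows (a hypothetical helper of mine, not the dissertation's code): adding noise with variance c·σ²_ȳ, c = 1/r² - 1, yields outputs whose best linear fit attains roughly the target r².

```python
import numpy as np

def add_output_noise(y_clean, r2, rng):
    """Add Gaussian noise so the best linear fit attains roughly the target r^2.

    The noise variance is c * var(y_clean) with c = 1/r^2 - 1, as in the text.
    """
    c = 1.0 / r2 - 1.0
    noise_std = np.sqrt(c * y_clean.var())
    return y_clean + noise_std * rng.standard_normal(y_clean.shape)

rng = np.random.default_rng(5)
X = rng.standard_normal((5000, 4))
y_clean = X @ np.array([2.0, -1.0, 0.5, 3.0])
y = add_output_noise(y_clean, r2=0.8, rng=rng)

# Empirical check of the r^2 that OLS achieves on this data:
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
r2_hat = (y.var() - resid.var()) / y.var()
# r2_hat is close to the requested value of 0.8
```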
Results are quantified as normalized mean squared errors (nMSE), that is, the mean squared error on the test set normalized by the variance of the outputs of the test set. Note that the best normalized mean squared training error that can be achieved by the learning system under this noise level is 1 - r², unless the system overfits the data. We used a value of r² = 0.8 for high output noise and a value of r² = 0.9 for lower output noise.

A varying number of redundant data vectors was added to the input data, generated from random convex combinations of the 10 relevant vectors. Finally, we added irrelevant data columns, drawn from a Normal(0, 1) distribution, until a total of 100 input dimensions was reached, generating training input data that contained both irrelevant and redundant dimensions.

We created the test data set in a similar manner, except that the input data and output data were left noise-free. For our experiments, we considered a synthetic training data set with N = 1000 data samples and a synthetic test data set with 20 data samples. We examined the following four different combinations of redundant (v) and irrelevant (u) input dimensions in order to better analyze the performance of the algorithms on different data sets:

i) v = 0, u = 90 (all 90 extra input dimensions are irrelevant)
ii) v = 30, u = 60
iii) v = 60, u = 30
iv) v = 90, u = 0 (all 90 extra input dimensions are redundant)

2.4.1.2 Methods

We compared VBLS to four other methods that were previously described in Section 2.1: i) ridge regression, ii) stepwise regression, iii) PLS regression and iv) LASSO regression. For ridge regression, we introduced a small ridge parameter value of 10^{-10} to avoid ill-conditioned matrix inversions. We used Matlab's "stepwisefit" function to run stepwise regression. The number of PLS projections for each data set fit was found by leave-one-out cross-validation. Finally, we chose the optimal tuning parameter in LASSO regression using k-fold cross-validation.
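For concreteness, the nMSE criterion used throughout the evaluation can be computed as follows (a small illustrative helper, not the dissertation's code):

```python
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared error normalized by the variance of the targets."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# A predictor that always outputs the mean of the targets has nMSE = 1,
# while a perfect predictor has nMSE = 0.
y = np.array([1.0, 2.0, 3.0, 4.0])
baseline = np.full_like(y, y.mean())
print(nmse(y, baseline))   # 1.0
print(nmse(y, y))          # 0.0
```

This normalization is what makes the 1 - r² noise floor quoted above directly comparable across data sets with different output variances.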
2.4.1.3 Results

For evaluation, we calculated the prediction error on noiseless test data, using the learnt regression coefficients from each technique. Results are quantified as normalized mean squared errors (nMSE). Figure 2.4 shows the average prediction error for noiseless test data, given training data where the output noise is either high (r² = 0.8) or low(er) (r² = 0.9).

Figure 2.4: Average normalized mean squared prediction error for synthetic 100 input-dimensional data with a varying level of output noise in the training data, averaged over 10 trials; panel (a) shows training data where the output noise has r² = 0.9, and panel (b) where it has r² = 0.8. The number of redundant dimensions is denoted by v, and the number of irrelevant dimensions is u.

All the algorithms were executed on 10 randomly generated sets of data. The predictive nMSE results reported in Figure 2.4 were averaged over the 10 trials. Note that the best training nMSE values possible under the two noise conditions are 0.1 for the low noise case and 0.2 for the high noise case. The training nMSE values were omitted from both graphs since all algorithms attained training errors that were around the lowest possible values.

From Figures 2.4(a) and 2.4(b), we see that regardless of the output noise level, VBLS achieves either the lowest predictive nMSE value or a predictive nMSE value comparable with that of the other four algorithms.
In general, as the number of redundant input dimensions increases and the number of irrelevant input dimensions decreases, the prediction error improves (i.e., it decreases). This may be attributed to the fact that redundancy in the input data provides more "information", making the problem easier to solve.

The performance of stepwise regression degrades as the number of redundant dimensions increases, as shown in Figures 2.4(a) and 2.4(b), due to its inability to cope with collinear data. LASSO regression appears to perform quite well compared with PLS regression and ridge regression. This is unsurprising, given that it is known for its ability to produce sparse solutions.

In summary, we can confirm that VBLS performs very well on synthetic tests, as well as or better than classical robust regression methods (such as LASSO). Interestingly, PLS regression and ridge regression are significantly inferior in problems that have a large number of irrelevant dimensions. Stepwise regression's performance deteriorates as soon as collinear inputs are introduced.

2.4.1.4 Non-Normal Synthetic Data

Figure 2.5: Average normalized mean squared prediction error for synthetic non-Normal 100 input-dimensional data, with an output noise of r² = 0.9999 in the training data, averaged over 10 trials. The number of redundant dimensions is denoted by v, and the number of irrelevant dimensions is u.

We can also examine synthetic data sets which do not correspond to the generative model (i.e., data and noise that are not generated from Normal distributions) in order to evaluate how dependent our model is on the assumed Normal prior distributions. Synthetic data is generated in a similar fashion as in Section 2.4.1.1, with 100 dimensions, 10 of which are relevant.
The other 90 dimensions are chosen to be either irrelevant or redundant. However, there are two differences between this synthetic data and that of Section 2.4.1.1.

Firstly, the first 10 relevant input dimensions were generated from a multi-modal distribution, instead of a Normal distribution. Specifically, each of the 10 relevant input dimensions was drawn from a sum/mixture of 10 Gaussian distributions, with each Gaussian distribution having a different mean and variance, i.e., x_m ~ Σ_{p=1}^{10} Normal(μ_p, σ²_p), for m = 1, ..., 10, where σ_p is drawn randomly from a uniform distribution between 0 and 2 and μ_p is drawn similarly from a uniform distribution between 0 and 2. The second difference between the non-Normal synthetic data set and the data of Section 2.4.1.1 is the additive output noise. Instead of Gaussian distributed noise, noise drawn from a Student t-distribution was added to the outputs. We chose a noise level of r² = 0.9999 for the output noise, such that the noise was scaled to the variance of the noiseless outputs ȳ. Redundant and irrelevant data vectors were added to the input data in the same way as described in Section 2.4.1.1. The test data was created in a similar manner, except that the input and output data were left noise-free. As in Section 2.4.1.1, we considered synthetic training data with N = 1000 data samples and a synthetic test data set with 20 data samples.

Figure 2.5 shows the prediction nMSE values, averaged over 10 trials. We can observe that both VBLS and LASSO outperform the other classical regression methods on non-Normal synthetic data sets. This figure demonstrates that even for data sets that do not follow the Normal prior distributions assumed in our generative model, VBLS continues to perform quite competitively.

2.4.2 EMG Prediction from Neural Firing

2.4.2.1 Data sets

We investigated data from two different neurophysiological experiments.
In the first experiment by Sergio & Kalaska (1998), a monkey moved a manipulandum in a center-out task in eight different directions, equally spaced in a horizontal planar circle of 8 cm radius. A variation of this experiment held the manipulandum rigidly in place, while the monkey applied isometric forces in the same eight directions. In both conditions (whether the monkey was performing a movement or applying an isometric force), feedback was given through a visual display on a monitor. Neural activity for 71 M1 neurons was recorded in all conditions, along with the EMG outputs of 11 muscles².

After preprocessing, we obtained a total of 2320 data samples for each neuron/muscle pair, collected over all eight directions and for both movement and isometric force conditions. Each data sample consisted of the average firing rates from a particular neuron (averaged over a window of 10 msec) and the corresponding EMG activation³ from a particular muscle. A sampling interval of 10 msec was used. For each sample in this data set, a delay of 50 msec between M1 cortical neural firing and EMG muscle activation was empirically chosen, based on estimates from measurements.

² The 11 arm muscles analyzed included the 1) supraspinatus, 2) infraspinatus, 3) subscapularis, 4) rostral trapezius, 5) caudal trapezius, 6) posterior deltoid, 7) medial deltoid, 8) anterior deltoid, 9) triceps medial head, 10) brachialis and 11) pectoralis muscles.

³ EMG was recorded from pairs of shoulder and elbow muscles, implanted percutaneously with Teflon-coated single-stranded stainless steel wires. EMG activity was amplified, rectified and integrated (over 10 msec bins) to generate summed histograms of activity. The EMG data had no physically meaningful units.

The second experiment, conducted by Kakei, Hoffman & Strick (1999, 2001), involved a monkey trained to perform eight different combinations of wrist flexion-extension and
radial-ulnar movements while in three different arm postures (pronated, supinated and midway between the two). These experiments resulted in two data sets. For the first data set (Kakei et al. 1999), the EMG outputs of 7 contributing muscles⁴ were recorded, along with the neural data of 92 M1 neurons at all three wrist postures, resulting in 2616 data samples for each neuron/muscle pair. Similar to the Sergio & Kalaska (1998) data set, each data sample consisted of the average firing rates from a particular neuron (averaged over a window of 10 msec) and the corresponding EMG activation from a particular muscle. A sampling interval of 10 msec was used. For each sample in the Kakei et al. (1999) data set, a delay of 20 msec⁵ between M1 cortical neural firing and EMG muscle activation was chosen empirically, based on estimates from measurements. The second data set (Kakei et al. 2001) also included EMG outputs of the same 7 muscles, but this time contained the recorded spiking data of 72 PM neurons at the three wrist postures. After preprocessing, the second Kakei et al. (2001) data set had 2592 data samples for each neuron/muscle pair. For each sample, a delay of 30 msec⁶ between PM cortical neural firing and EMG muscle activation was assumed.

⁴ EMG was recorded using pairs of single-stranded stainless steel wires placed transcutaneously into each muscle. The 7 arm muscles considered were the 1) extensor carpi ulnaris (ECU), 2) extensor digitorum 2 and 3 (ED23), 3) extensor digitorum communis (EDC), 4) extensor carpi radialis brevis (ECRB), 5) extensor carpi radialis longus (ECRL), 6) abductor pollicis longus (APL), and 7) flexor carpi radialis (FCR) muscles.

⁵ The results of our analyses are insensitive to a delay in the range of 20-60 msec since there was only a very small numerical difference between the quality of the fit of the data in this interval. Delays of 50 msec or higher are physiologically more plausible.
⁶ Within a delay range of 30-80 msec, there is no real difference in the quality of fit of our analyses.

2.4.2.2 Methods

As a baseline comparison, EMG reconstruction was obtained through a combinatorial search over possible regression models. This approach served as our baseline study (referred to as ModelSearch in the figures). A particular model is characterized by a subset of neurons that is used to predict the EMG data. For the Sergio & Kalaska (1998) data, given 71 neurons, the number of possible models that exist for a particular muscle is:

Σ_{m=1}^{71} (71 choose m)

Since the order of the contributing neurons is not important, the above expression counts combinations instead of permutations of neurons. This value is too large for an exhaustive search. Therefore, we considered only combinations of up to 20 neurons, which required several weeks of computation on a 30-node cluster computer. The optimal predictive subset of neurons was determined from a series of 8-fold cross-validation sets.

For both data sets, the cross-validation procedure used in the baseline study was used to determine the optimal subset of neurons. Cross-validation was done in the context of the behavioral experiments and not in a statistically randomized way. For the Sergio & Kalaska (1998) experiment, the data was separated into different force categories (isometric force versus force generated during movement) and movement directions in space. Thus, cross-validation asked the meaningful question of whether isometric and movement conditions are predictive of each other and whether there is spatial generalization. Similarly, for the Kakei et al. experiments, data was separated into directional movements at the wrist (supinated, pronated and midway between the two) and directional movements in space, which again allowed cross-validation to make meaningful statements about generalization over postures and space.
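The size of the model space quoted above, and why the search had to be capped at subsets of 20 neurons, can be checked directly (an illustrative sketch; variable names are ours):

```python
from math import comb

# Number of candidate neuron subsets for one muscle: every nonempty subset of
# the 71 recorded M1 neurons, i.e. sum_{m=1}^{71} C(71, m) = 2^71 - 1.
full_search = sum(comb(71, m) for m in range(1, 72))
assert full_search == 2**71 - 1

# Restricting the search to subsets of at most 20 neurons, as in the baseline
# ModelSearch, shrinks the space by many orders of magnitude, though it still
# required weeks of computation on a 30-node cluster.
restricted = sum(comb(71, m) for m in range(1, 21))
print(f"full: {full_search:.3e}, capped at 20: {restricted:.3e}")
```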
Figure B.1 in Appendix B.1 shows how these 8 cross-validation sets are constructed from the Sergio & Kalaska (1998) data.

This baseline study (i.e., ModelSearch) served as a comparison for ridge regression, stepwise regression, PLS regression, LASSO regression and VBLS. These five algorithms used the same validation sets employed in the baseline study. Again, as described in Section 2.4.1.2, ridge regression was implemented with a small ridge parameter of 10⁻¹⁰ in order to avoid ill-conditioned matrices. We used Matlab's "stepwisefit" to run stepwise regression, and the number of PLS projections for each data fit was found by leave-one-out cross-validation. The average normalized mean squared error values depicted in Figure 2.6 demonstrate how well each algorithm performs, averaging the generalization performances over all the cross-validation sets from Figure B.1.

The average number of relevant neurons⁷ (i.e., not including irrelevant neurons and neurons providing redundant information), shown in Figure 2.9, was calculated by averaging over the number of relevant neurons in each of the 8 training sets in Figure B.1. The final set of relevant neurons, used in Figure B.4 to calculate the percentage match of relevant neurons relative to those found by the baseline study (ModelSearch), was obtained for each algorithm (except VBLS) by taking the common neurons found to be relevant over the 8 cross-validation sets. The relevant neurons found by VBLS and

⁷ Relevant neurons are those that contribute to the regression result in a statistically sound way, according to a t-test with p < 0.05.
It should be noted that, in noisy data, two neurons that carry the same signal but have independent noise will usually both remain significant in our algorithm, as the combined signal of both neurons helps to average out the noise, in the spirit of population coding.

reported in Figure B.4 were obtained by using the entire data set, since no cross-validation procedure is required by VBLS (i.e., dividing the data into separate training and test sets is not necessary). As with all Bayesian methods, VBLS performs more accurately as the data size increases, without the danger of overfitting.

Inference of relevant neurons in PLS was based on the subspace spanned by the PLS projections, while relevant neurons in VBLS were inferred from t-tests on the regression parameters, using a significance of p < 0.05. Stepwise regression determined the number of relevant neurons from the inputs that were included in the final model. Note that since ridge regression retained all input dimensions, this algorithm was omitted from the relevant neuron comparisons.

Analogous to the first data set, a combinatorial analysis was performed on the Kakei et al. (1999) M1 neural and Kakei et al. (2001) PM neural data sets in order to determine the optimal set of M1 and PM neurons contributing to each muscle (i.e., producing the lowest possible prediction error) in a series of 6-fold cross-validation sets. Figures B.2 and B.3 in Appendix B.1 show the 6 cross-validation sets used for the M1 and PM neural data sets. PLS, stepwise regression, ridge regression and VBLS were applied using the same cross-validation sets, employing the same procedure described for the Sergio & Kalaska (1998) data set. The average normalized mean squared error values shown in Figures 2.7 and 2.8 illustrate the generalization performance of each algorithm, averaged over all the cross-validation sets shown in Figures B.2 and B.3⁸.
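The coefficient t-test used to flag relevant inputs can be illustrated with a generic least-squares version. This is a stand-in under stated assumptions: the thesis applies the analogous test to the regression parameters inferred by VBLS, whereas the sketch below fits ordinary least squares and uses a normal approximation to the t-distribution (adequate for the large sample sizes here).

```python
import numpy as np
from math import erfc, sqrt

def relevant_inputs(X, y, alpha=0.05):
    """Illustrative relevance test: ordinary least squares plus a two-sided
    test on each coefficient's t-statistic. Not the VBLS posterior test
    itself, just the same kind of significance criterion (p < alpha)."""
    N, d = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = resid @ resid / (N - d)              # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # coefficient covariance
    t = b / np.sqrt(np.diag(cov))
    # two-sided p-values via the normal approximation: p = erfc(|t| / sqrt(2))
    p = np.array([erfc(abs(ti) / sqrt(2)) for ti in t])
    return p < alpha
```

For example, on data where only the first two of ten inputs drive the output, the returned mask flags those two dimensions and (up to the 5% false-positive rate) rejects the rest.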
The average number of relevant neurons shown in Figures 2.10 and 2.11 was calculated by averaging over the number of relevant neurons found in each of the 6 training sets from Figures B.2 and B.3.

⁸ Note that the partitioning of the data into training and test cross-validation sets was essentially an intuitive process that tried to use insights from the different experimental conditions in which the data was collected.

Figure 2.6: Normalized mean squared error (training and test), averaged over all cross-validation sets and over all muscles for the Sergio & Kalaska (1998) M1 neural data set.

As with the Sergio & Kalaska (1998) data set, the final set of relevant neurons, used in Figures 2.10 and 2.11, was obtained for each algorithm (except VBLS) by taking the common neurons found to be relevant over the 6 cross-validation sets.

2.4.2.3 Results

Generalization Performance: Figures 2.6 to 2.8 show that VBLS resulted in a generalization error comparable to that produced by the baseline study. On the Kakei et al. (1999) M1 and Kakei et al. (2001) PM neural data sets, all algorithms performed similarly. However, ridge regression, stepwise regression, PLS regression and LASSO regression performed far worse on the Sergio & Kalaska (1998) M1 neural data set, with ridge regression attaining the worst error. Such performance is typical for traditional linear regression methods on ill-conditioned high-dimensional data, motivating the development of VBLS.

Figure 2.7: Normalized mean squared error (training and test), averaged over all cross-validation sets and over all muscles for the Kakei et al. (1999) M1 neural data.
Figure 2.8: Normalized mean squared error (training and test), averaged over all cross-validation sets and over all muscles for the Kakei et al. (2001) PM neural data.

Figure 2.9: Average number of relevant M1 neurons found over all cross-validation sets for the Sergio & Kalaska (1998) data set.

Interestingly, in Figure 2.7, we observe that the prediction errors of ridge regression and of the baseline study (i.e., ridge regression using a selected subset of M1 neurons) are quite similar for the Kakei et al. (1999) M1 neural data set. This suggests that, for this particular data set, there is little advantage in performing a time-consuming manual search for the optimal subset of neurons. A similar observation can be made for the Kakei et al. (2001) PM neural data set when examining Figure 2.8, although the effect is less pronounced there. In contrast, Figure 2.6 shows a sharp difference between the predictive error values of ridge regression and the baseline study's combinatorial-like model search. This may be attributed to the fact that the Sergio & Kalaska (1998) M1 neural data set is much richer and, hence, more challenging to analyze.

Figure 2.10: Average number of relevant M1 neurons found over all cross-validation sets for the Kakei et al. (1999) data.

Figure 2.11: Average number of relevant PM neurons found over all cross-validation sets for the Kakei et al. (2001) data.
Table 2.1: Percentage of neuron matches found by each algorithm, as compared to those found by the baseline study (ModelSearch), averaged over all muscles of each data set.

                                     STEP    PLS     LASSO   VBLS
  Sergio & Kalaska (1998) M1 data    7.2%    7.4%    5.4%    94.2%
  Kakei et al. (1999) M1 data        65.1%   42.9%   80.6%   94.4%
  Kakei et al. (2001) PM data        22.9%   14.2%   44.5%   91.5%

Average Number of Relevant Neurons: The average number of relevant M1 neurons found by VBLS was slightly higher than in the baseline study, as seen in Figures 2.9 to 2.11. This is unsurprising, since the baseline studies did not consider all possible combinations of neurons. For example, the baseline study for the Sergio & Kalaska (1998) data set considered combinations of up to only 20 neurons, instead of the full set of 71 neurons. In particular, notice that in Figures 2.10 and 2.11, only small fractions of the total 92 M1 neurons and 72 PM neurons were found to be relevant by the baseline study for certain muscles (e.g., muscles 1, 6 and 7).

Percentage Relevant Neuron Match: We compared the relevant neurons identified by each algorithm with those found by the baseline combinatorial-like model search in order to evaluate how well each algorithm performs in comparison to the model search approach. Table 2.1 shows that the relevant neurons identified by VBLS coincided at a very high percentage with those of the baseline model search results, while stepwise and PLS regression had inferior outcomes. The table illustrates that VBLS was able to reproduce results comparable to a combinatorial-like model search approach. However, the main advantage of VBLS lies in its speed: VBLS took 8 hours for all validation sets on a standard PC, while the model search took weeks on a cluster computer. LASSO regression matched a high percentage of the relevant M1 and PM neurons in the Kakei et al. data sets, but fared far worse on the Sergio & Kalaska (1998) data set. These percentage values for the Kakei et al.
data sets are perhaps inflated and should be given less consideration, since the numbers of relevant M1 and PM neurons found by the baseline study are relatively small for certain muscles. Figures B.4, B.5 and B.6 in Appendix B.2 show the detailed breakdown of percentage M1 and PM neuron matches for each algorithm on each muscle.

The consistent and good generalization properties of VBLS on all neural data sets, as shown in Figures 2.6, 2.7 and 2.8, suggest that the Bayesian approach of VBLS sufficiently regularizes the participating neurons such that no overfitting occurs, despite finding a larger number of relevant neurons.

In general, VBLS achieved performance comparable with the baseline study when reconstructing EMG data from M1 or PM neurons. Note that VBLS is an iterative statistical method, which is slower than classical "one-shot" linear least squares methods (i.e., on the order of several minutes for the data sets in our analyses). Nevertheless, it achieves results comparable with our combinatorial model search, while running at much faster speeds.

2.4.3 Real-time Analysis for Brain-Machine Interfaces

The neural data sets analyzed in Section 2.4.2 are inherently real-time data: they were collected online, stored and then analyzed in batch form (i.e., a sampling interval is used, and a delay between neural firing and EMG activity is empirically chosen in order to extract the data samples used in the batch form of the data). As a result, in the real-time simulations, we took the batch form of the data and presented it sequentially, one data sample at each time step. We applied the real-time version of VBLS, derived in Section 2.3, to the Sergio & Kalaska (1998) data set, since this was the most interesting of the three data sets presented in Section 2.4.2. We used a forgetting rate of λ = 0.999, assumed each sample of the data set arrived sequentially at a different time step, and iterated through the incremental VBLS equations, Eqs. (A.13) to (A.12), twice for each time step.
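The incremental VBLS equations themselves are given in Appendix A. To convey the flavor of how a forgetting rate λ < 1 down-weights old samples in a sequential update, here is a generic recursive least-squares sketch. This is explicitly not the VBLS recursion: the class name and structure are our own, and the update is plain RLS with exponential forgetting.

```python
import numpy as np

class ForgettingRLS:
    """Recursive least squares with exponential forgetting: a generic stand-in
    for the style of incremental update used by real-time VBLS. With lam < 1,
    the influence of old samples decays geometrically."""

    def __init__(self, d, lam=0.999, delta=1e3):
        self.lam = lam
        self.P = delta * np.eye(d)    # (scaled) inverse input-covariance estimate
        self.b = np.zeros(d)          # current regression coefficients

    def update(self, x, y):
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)          # gain vector
        self.b += k * (y - x @ self.b)        # correct the prediction error
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.b
```

Streaming samples from a fixed linear model through `update` drives `b` toward the true coefficients, while the forgetting rate lets the estimate track slow drift in the model.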
Figure B.7(a) shows the coefficient of determination values, r² (where r² = 1 − nMSE), for both the batch and real-time versions of VBLS on the entire Sergio & Kalaska (1998) data set. Figure B.7(b) shows the number of relevant M1 neurons found by batch VBLS and real-time VBLS for the same data set. For the real-time version of VBLS, the r² values and relevant neurons reported were taken from the last time step. We can see from both figures that the real-time and batch versions of VBLS achieve a similar level of performance. The average r² values, averaged over all 11 muscles, confirm this: batch VBLS had an average r² value of 0.7998, while real-time VBLS had an average r² value of 0.7966.

2.5 Interpretation of Neural Data Analyses

While the main focus of this chapter lies in the introduction of a robust linear regression technique for high-dimensional data, we would like to discuss how our analysis technique can be exploited for the interpretation of the neurophysiological data used in this study. We show plots of EMG activity for a muscle in the Sergio & Kalaska (1998) data set; further plots referred to in the text below can be found in Appendix B.4.

Each of the eight EMG plots in Figures 2.12 and 2.13 shows the following three EMG traces:

i) the raw average EMG trajectories
ii) the predicted EMG activity, as obtained by VBLS using all available data in all conditions (VBLS-full)
iii) the predicted EMG activity, as obtained by VBLS using only half the data for fitting⁹ (VBLS-cv)

This last cross-validated fit tests how well isometric M1 neural recordings can predict movement EMG and how well movement-related M1 neural recordings can predict isometric EMG. Alternatively, it tests whether the neuron-to-EMG relationship is the same between the isometric and the movement conditions.
2.5.1 Sergio & Kalaska (1998) data set

One of the main results reported by Sergio & Kalaska (1998) was that the firing of the reported M1 neurons had strong correlations with EMG-like (or force-like) signals in both movement and isometric conditions. In contrast, evidence for correlations with kinematic data (such as movement direction, velocity, or target direction) was less pronounced.

We generated Figures 2.12 and 2.13, both of which reproduce illustrations similar to Figures 3A and 3B in Sergio & Kalaska (1998). The two figures show the EMG activity

⁹ For the isometric condition, only movement data was used for fitting. For the movement condition, only isometric data was used for fitting.

Figure 2.12: Observed vs. predicted EMG traces under isometric force conditions for the infraspinatus muscle, given M1 neural firing from Sergio & Kalaska (1998). The center plot, taken from Sergio & Kalaska (1998), shows the trajectories in eight different directions, in the (x,y) plane, taken by the hand. Each of the eight plots surrounding this center plot shows the EMG traces over time for each hand trajectory.

Figure 2.13: Observed vs.
predicted EMG traces under movement force conditions for the infraspinatus muscle, given M1 neural firing from Sergio & Kalaska (1998). The center plot, taken from Sergio & Kalaska (1998), shows the trajectories in eight different directions, in the (x,y) plane, taken by the hand. Each of the eight plots surrounding this center plot shows the EMG traces over time for each hand trajectory.

of the infraspinatus muscle in all eight isometric force production directions (Figure 2.12) and movement directions (Figure 2.13). The trajectories taken by the hand, shown in (x,y) coordinates, are illustrated in the center of each figure. These center plots are taken from the original figures of Sergio & Kalaska (1998), since we did not have access to the hand trajectory data.

As Figures 2.12 and 2.13 both show, M1 neural firing predicts the EMG traces very well in general. The cross-validation tests also demonstrate very good EMG reconstruction, thus confirming Sergio & Kalaska (1998)'s result that the recorded M1 neurons carry sufficient information to extract signals of the time-varying dynamics and the temporal envelopes of EMG activity.

2.5.2 Kakei et al. (1999) and Kakei et al. (2001) data sets

The main message of Kakei et al. (1999) and Kakei et al. (2001) was that one can find neurons in M1 that carry intrinsic (muscle-based) information and neurons that carry extrinsic, that is, (x,y) task space, information. In contrast, the PM cortex had predominantly extrinsic neurons.

For our analyses, we had access to the average firing rates of the M1 and PM neurons and the corresponding EMG traces, as well as the (x,y) movement as performed by the hand. Thus, we used VBLS to predict the EMG activity in all three arm posture conditions (pronated, supinated and midway between the two) from the neural firing, and to predict the (x,y)-velocity trajectories from neural firing. Note that all this data was obtained from the same highly trained monkey, such that it was possible to i) re-use
Note that all this data was obtained from the same highly trained monkey, such that it was possible to i) re-use 53 EMG data obtained during the M1 experiment as target for the PM data and ii) share the same (x,y) data across the M1 and PM experiment. We illustrate our results in a similar form as in Figures 2.12 and 2.13, showing plots for the extensor carpi radialis brevis (ECRB) muscle and only for the supination posture. FigureB.8(a)showstheEMGfitsforM1neurons,whileFigureB.9(a)showsthesamefits for PM neurons. The center plots illustrate recorded (x,y) movement in the horizontal planeinthisposture. Interestingly,bothM1andPMneuronsachieveaverygood EMG reconstruction 10 . Figures B.10(a), B.12(a), B.11(a) and B.13(a) demonstrate the (x,y)-velocity fits for M1 and PM neurons, respectively, in the supination condition 11 . The quality of fit appearsreducedincomparisontotheEMGdata,butitishardtoquantifythisstatement as EMG and (x,y)-velocities have quite different noise levels such that r 2 values cannot be compared. In order to judge whether M1 or PM neurons achieve better fits for EMG and (x,y)- velocity data, we compared the r 2 values from all experimental conditions in a pairwise student’s t-test. No significant difference could be found between either the quality of EMG fitting or the (x,y)-velocity fits. Thus, our analysis concludes that both M1 and PM carry sufficient information to predict EMG activity. It should be noted, however, that in Kakei et al.’s original 10 It should be noted that, potentially, the hand movement from Kakei et al. is of significant lower complexity than the arm movement data of Sergio & Kalaska (1998). The temporal profiles of the EMG data in Kakei et al. is much simpler, such that it may be easier to predict it. Support for this latter hypothesis comes from the fact that essentially all statistical methods we tested performed equally well on the EMG prediction problem. 
Thus, future work will have to examine whether PM neurons would also be able to predict more complex EMG traces.

¹¹ The optimal delay between M1 cortical neural firing and the resulting direction of movement was found to be 80 msec, since this value led to the lowest fitting error. In a similar fashion, the optimal delay between PM cortical neural firing and the resulting direction of movement was found to be 90 msec.

experiment, neurons were classified into extrinsic or intrinsic neurons according to how much their tuning properties were compatible with intrinsic or extrinsic variables. Their analysis was a single-neuron analysis, while our investigation looked at the predictive capabilities of the entire population of neurons. Thus, our results are not in contradiction with Kakei et al., but rather demonstrate the important difference between the predictive capabilities of a single neuron vs. those of the population code. The latter is of particular importance for brain-machine interfaces, and our results provide further evidence for the information richness of cortical areas that, from the view of single-neuron analysis, seemed to be much more specialized.

We also analyzed the neurons that were found to be relevant for EMG prediction and (x,y)-velocity prediction, using t-tests performed on the inferred regression coefficients. In particular, we wondered whether some neurons in PM and M1 would specialize in EMG prediction, while others would prefer (x,y)-velocity prediction. However, no interesting specialization could be found. For example, of all 72 PM neurons, we found that 4.17% were relevant to (x,y)-velocity prediction only, 15.28% were relevant to EMG prediction only, and 79.17% were relevant to both velocity and EMG prediction (leaving 1.39% of PM neurons irrelevant to both velocity and EMG prediction). Of all 92 M1 neurons, we found that 4.35% were relevant to (x,y)-velocity prediction only, 26.09% were relevant to EMG prediction only, and 65.22% were relevant to both velocity and EMG prediction.
Thus, the majority of neurons were involved in both EMG and velocity prediction. This rich information about different movement variables in both M1 and PM most likely contributes to the success of various brain-machine interface projects, where the precise placement of electrode arrays seemingly does not matter too much.

2.6 Discussion

This chapter addresses the problem of analyzing high-dimensional data with linear regression techniques, as typically encountered in neuroscience and the new field of brain-machine interfaces. In order to achieve robust statistical results, we introduced a novel Bayesian technique for linear regression analysis with automatic relevance determination, called Variational Bayesian Least Squares (VBLS). In contrast to previously proposed variational linear regression methods, VBLS is computationally efficient, requiring O(d) updates per EM iteration instead of O(d^3). It is guaranteed to converge to the global solution. VBLS has no parameters that need to be tuned; the prior distribution of α can be set to be wide and uninformative (e.g., the hyperparameters a_{α,0} and b_{α,0} can be set to 10^-8) and need never be changed from data set to data set, making the algorithm "black-box"-like and autonomous in its execution.

While VBLS is an iterative statistical method (and hence slower than classical one-shot linear least squares), it can be embedded into other, more complex iterative methods to realize savings in computational efficiency. Its iterative nature means that it is most advantageous in real-time scenarios, where time constraints favor an approximately accurate solution that is computed quickly over an extremely accurate solution that takes unacceptably long. VBLS also lends itself to incremental implementation, as would be needed in real-time analyses of brain information.

A point of concern that one could raise against the VBLS algorithm is how far the variational approximation in this algorithm affects the quality of function approximation.
It is known that factorial approximations to a joint distribution create more peaked distributions, such that one could potentially suspect that VBLS might tend slightly towards overfitting. It is important to note, however, that in the case of VBLS, a more peaked posterior distribution over b_m actually entails a stronger bias towards excluding the associated input dimensions: a more peaked distribution over b_m pushes the regression parameter closer to zero. Thus, VBLS errs on the slightly pessimistic side of function fitting and is unlikely to overfit, which corresponds to our empirical experience.

EMG Prediction from Neural Firing: Our final application of VBLS examined how well motor cortical activity can predict EMG activity and end-effector velocity data, as collected in monkey experiments in previous publications (Sergio & Kalaska 1998, Kakei et al. 1999, Kakei et al. 2001). Our analysis confirmed that neurons in M1 carry significant information about EMG activity and end-effector velocity. These results were also obtained in the original papers, but with single-neuron analysis techniques and not a population-code read-out as essentially performed by VBLS. Interestingly, we also discovered that PM carries excellent information about EMG and end-effector velocity, although it had previously been suggested that end-effector information is the primary variable coded in PM. Most likely, this result is due to using a population-code-based analysis instead of single-neuron analysis.

Our findings did not suggest that either M1 or PM has a significant specialized population of neurons that correlates only with either EMG or end-effector data. Instead, we found that most neurons were statistically significant for both EMG and end-effector data prediction.
This rich information in the motor cortices most likely contributes significantly to the success of brain-machine interface experiments, where electrode arrays are placed over large cortical areas and the reconstruction of behavioral variables seems to be relatively easy. VBLS offers an interesting new method to perform such read-outs, even in real time, with high statistical robustness.

Chapter 3

Parameter Identification in Noisy Linear Regression

Learning the equations of motion of a complex physical system for the purpose of control is a common problem in robotics. A typical system identification approach would first collect a representative data set from the robot by measuring positions and motor commands during some explorative movements. Then, velocity and acceleration information would be obtained by numerical differentiation of the position data. The data would also be digitally filtered to reduce noise. As a third step, a function approximator would be applied to learn the mapping from positions, velocities and accelerations to motor commands; such a function often has hundreds of inputs for complex robots. Finally, this mapping would be inserted into the control loop of the robot, where appropriate motor commands are predicted from desired position, velocity and acceleration information, all of which are noiseless data.

The example scenario above is representative of a large number of system identification problems. From a machine learning point of view, the interesting aspects are that the learning data is high dimensional, has irrelevant and redundant dimensions and, despite digital filtering, usually contains a significant amount of noise in the inputs to the function approximator. Moreover, predictions are required from noiseless input data, since the inputs generated during control originate from a planning system without noise.
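The preprocessing steps above (numerical differentiation of sampled positions, followed by light filtering) can be sketched as follows. This is an illustrative example, not code from the thesis; the sampling rate and the first-order filter constant are assumptions.

```python
import math

def differentiate(x, dt):
    """Central-difference derivative of a uniformly sampled signal.
    One-sided differences are used at the two boundary samples."""
    n = len(x)
    dx = [0.0] * n
    for i in range(1, n - 1):
        dx[i] = (x[i + 1] - x[i - 1]) / (2.0 * dt)
    dx[0] = (x[1] - x[0]) / dt
    dx[-1] = (x[-1] - x[-2]) / dt
    return dx

def lowpass(x, alpha=0.5):
    """First-order IIR low-pass: y[i] = alpha*x[i] + (1-alpha)*y[i-1].
    As noted in the text, such filtering always leaves some residual noise."""
    y = [x[0]]
    for v in x[1:]:
        y.append(alpha * v + (1.0 - alpha) * y[-1])
    return y

# Sanity check on q(t) = sin(t): the derivative should approximate cos(t).
dt = 0.001
q = [math.sin(i * dt) for i in range(2001)]
qd = differentiate(q, dt)
err = max(abs(qd[i] - math.cos(i * dt)) for i in range(1, 2000))
```

In practice the same differentiation would be applied twice (for accelerations), which is exactly why input noise gets amplified, as discussed below.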
The quality of control in advanced controllers strongly depends on the quality of the learnt internal model, which is critical in many robotic applications such as haptic devices, surgical robotics, and safe, compliant assistive robots in human environments. Ideally, system identification can be performed based on the CAD data of a robot provided by the manufacturer, at least in the context of rigid body dynamics (RBD) systems, which will be the exemplary scope of this chapter. However, many modern lightweight robots, such as humanoid robots, have significant additional nonlinear dynamics beyond the rigid body dynamics model, due to actuator dynamics, routing of cables, use of protective shells and other sources. In such cases, instead of trying to explicitly model all possible nonlinear effects in the robot, empirical data-driven system identification methods appear to be more useful. Under the assumption that a rigid body dynamics model is sufficient to capture the entire robot dynamics, this problem is theoretically straightforward, as all unknown parameters of the robot, such as the mass, center-of-mass and inertial parameters, appear linearly in the rigid body dynamics equations (An, Atkeson & Hollerbach 1988). Hence, after appropriate re-arrangement of the RBD equations of motion, parameter identification can be performed with linear regression techniques.

In this chapter, we address the problem above in the context of linear regression, since an extension to nonlinear regression is straightforward using locally weighted learning methods (Atkeson, Moore & Schaal 1997). If we wanted to use traditional linear regression techniques for this scenario, we would encounter several deficiencies.

Ill-conditioned data: For high-dimensional robotic systems, it is not easy to generate sufficiently rich data such that all parameters are properly identifiable.
As a result, the regression problem for RBD parameter estimation is almost always numerically ill-conditioned and bears the danger of generating parameter estimates that strongly deviate from the true values, despite a seemingly low-error fit of the data. For such ill-conditioned data sets in high-dimensional spaces, most traditional linear regression techniques break down numerically, since they are unable to generate sparse and unbiased solutions that identify redundant and/or irrelevant dimensions.

Noisy sensory data: Observed sensory data is typically noisy. Noise sources exist in both the input and output data, and this effect is additionally amplified by the numerical differentiation needed to obtain derivative data. Even digital filtering will always leave some noise in the signal in order to avoid oversmoothing of the data. Traditional linear regression techniques like OLS regression are only capable of dealing with noise in the output data, and the presence of input noise introduces a persistent bias into the regression solution. Alternative methods such as Total Least Squares (TLS) (Golub & Van Loan 1989, Van Huffel & Vanderwalle 1991), otherwise known as orthogonal least squares regression (Hollerbach & Wampler 1996) or, in statistics, as errors-in-variables (EIV) regression when applied to a linear model (Van Huffel & Lemmerling 2002), address input noise, but they assume that the variances of the input noise and output noise are the same (Rao & Principe 2002). In real-world systems, this assumption is not necessarily true and, again, the resulting estimates will be biased, leading to inferior generalization. There also exist noise-robust versions of standard algorithms, such as Principal Component Analysis (PCA) (Sanguinetti, Milo, Rattray & Lawrence 2005), as well as other approaches, such as Factor Analysis for regression, e.g., (Massey 1965), that attempt to explicitly model the input noise.
Physically implausible parameter estimates: Finally, there is no mechanism in the regression problem for RBD model identification that ensures the identified parameters are physically plausible. Particularly in the light of insufficiently rich data and nonlinearities beyond the RBD model, one often encounters physically incorrect parameter estimates, such as negative values on the diagonal of an inertial matrix.

Various methods exist to deal with some of the problems mentioned above, such as regression based on singular value decomposition (SVD) or ridge regression to cope with ill-conditioned data (Belsley, Kuh & Welsch 1980); stepwise regression (Draper & Smith 1981), LASSO regression (Tibshirani 1996) or other L1-regularized methods to produce sparse solutions; and TLS/orthogonal least squares/EIV, Factor Analysis or noise-robust PCA to address input noise (Hollerbach & Wampler 1996). Nevertheless, a comprehensive approach addressing the entire set of issues has not been suggested so far. Recent work such as Rao, Erdogmus, Rao & Principe (2003) has addressed the problem of input noise, but in the context of system identification of a time series, while ignoring the problems associated with ill-conditioned data in high-dimensional spaces.

In this chapter, we motivate the problem of input noise in linear regression applications and identify possible solutions. Leveraging the Bayesian framework for linear regression developed in Chapter 2, we present an extension of the algorithm that can handle noisy high-dimensional input data, while using Bayesian regularization methods to ensure robustness to ill-conditioned data (Ting, D'Souza & Schaal 2006). A post-processing step ensures that the rigid body parameters are physically consistent, by nonlinearly projecting the results of the Bayesian estimate onto the constraints. We evaluate our approach on a parameter estimation problem for the RBD model (Ting, Mistry, Peters, Schaal & Nakanishi 2006).
Our Bayesian estimation approach to RBD parameter estimation has all of the following desired properties:

• Explicitly identifies input and output noise in the data
• Is robust in the face of ill-conditioned data
• Detects non-identifiable parameters
• Produces physically correct parameter estimates

3.1 Background

3.1.1 Parameter Estimation in Rigid Body Dynamics

Let us examine some of the problems associated with traditional system identification methods before introducing our de-noising solution. We embed our discussion in the context of RBD parameter estimation, a problem that is linear in the open parameters despite the high level of nonlinearity of the RBD equations of motion. We discuss general nonlinear system identification at the end of this chapter.

The general RBD equations of motion are (Sciavicco & Siciliano 1996):

M(q) q̈ + C(q, q̇) q̇ + G(q) = τ    (3.1)

where q, q̇, q̈ denote the vectors of joint positions, velocities and accelerations, respectively. The matrix M(q) is the RBD inertial matrix, the matrix C(q, q̇) contains terms for Coriolis and centripetal forces, and the vector G(q) represents torques due to gravity. Eq. (3.1) has one row for every degree of freedom (DOF) of the robot, e.g., 30-50 rows for a humanoid robot. Every DOF is physically characterized by at least 10 parameters: a mass parameter, a center-of-mass vector, and a positive definite inertial matrix; friction parameters can increase the number of parameters. Thus, for robot systems with many DOFs, identifying RBD parameters is a problem involving hundreds of dimensions. Interestingly, these parameters appear linearly in Eq. (3.1), such that, after some complex rearrangement of the terms in Eq. (3.1), the system identification problem for RBD becomes a linear regression problem. We can now switch to viewing this system identification problem from the stance of machine learning.
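To make the linearity in the parameters concrete, consider a minimal one-DOF example (a damped pendulum, not taken from the thesis): τ = (m l²) q̈ + (m l g) sin(q) + b q̇. Collecting the unknowns into β = [m l², m l g, b]ᵀ and the known signals into the features [q̈, sin q, q̇] turns identification into ordinary linear regression. The sketch below uses assumed parameter values, analytic derivatives (to sidestep numerical differentiation), and a hand-rolled 3×3 solver to stay dependency-free.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

# Assumed "true" parameters: beta = [m*l^2, m*l*g, b]
beta_true = [0.5, 2.45, 0.1]

# A two-frequency excitation trajectory with analytic derivatives.
rows, taus = [], []
for i in range(200):
    t = 0.01 * i
    q = math.sin(1.3 * t) + 0.5 * math.sin(2.9 * t)
    qd = 1.3 * math.cos(1.3 * t) + 1.45 * math.cos(2.9 * t)
    qdd = -1.69 * math.sin(1.3 * t) - 4.205 * math.sin(2.9 * t)
    phi = [qdd, math.sin(q), qd]          # features: linear in beta
    rows.append(phi)
    taus.append(sum(p * w for p, w in zip(phi, beta_true)))

# Least squares via the normal equations: (Phi^T Phi) beta = Phi^T tau
A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
rhs = [sum(r[i] * y for r, y in zip(rows, taus)) for i in range(3)]
beta_hat = solve3(A, rhs)
```

With noiseless data the parameters are recovered essentially exactly; the chapter's concern is precisely what happens once the features (q, q̇, q̈) themselves are noisy.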
Let us assume we have a data set {x_i, y_i}_{i=1}^N consisting of N samples, where x_i ∈ R^{d×1} (d is the dimensionality of the input data) and y_i is a scalar. As mentioned previously, the RBD equations can be re-arranged to yield this structure. We create a matrix X ∈ R^{N×d}, where the input vectors x_i are arranged in the rows of X, and a vector y ∈ R^{N×1}, where the corresponding scalar outputs y_i are the coefficients of y.

3.1.2 Modeling Input Noise in Linear Regression

In Chapter 2, we introduced a computationally efficient Bayesian linear regression algorithm that is suitable for high-dimensional data. Unfortunately, it does not model noise in the input data. To address this, we express the general model for linear regression with noise-contaminated input and output data as follows:

y_i = Σ_{m=1}^d w_zm t_im + ε_y_i
x_im = w_xm t_im + ε_x_im    (3.2)

where t_i is the noiseless input data composed of elements t_im; w_z and w_x are regression vectors composed of elements w_zm and w_xm, respectively; and ε_y and ε_x are additive mean-zero Gaussian noise. Only X and y are observable. If the input data is noiseless (i.e., x_m = w_xm t_m), then we obtain the familiar linear regression equation y = b_OLS^T x + ε_y, where b_OLS,m = w_zm / w_xm. We can show this using simple algebraic manipulation. Given:

y_i = Σ_{m=1}^d w_zm t_im + ε_y_i    (3.3)
x_im = w_xm t_im, or, equivalently, t_im = x_im / w_xm    (3.4)

substituting Eq. (3.4) into Eq. (3.3) gives y_i = Σ_{m=1}^d b_OLS,m x_im + ε_y, where b_OLS,m = w_zm / w_xm. The slightly more general formulation in Eq. (3.2), with distinct w_xm and w_zm coefficients, will be useful in preparing our new algorithm. None of the linear methods reviewed in Chapter 2 account for noise in the input data. When the input data is contaminated with noise, it can be shown that the OLS estimate will be b_OLS,noise = γ b_true, where 0 < γ ≤ 1 (and b_true = b_OLS), and the exact value of γ depends on the amount of input noise.
Consider the predictive distribution p(y_q | x_q, X, y) = ∫ p(y_q, t | x_q, X, y) p(t) dt. If we assume that the prior distributions on ε_y and t are Gaussian, then we know that p(y_q | x_q, X, y) is Gaussian as well. We can then evaluate the integral and infer the mean and variance of p(y_q | x_q, X, y), resulting in the following:

⟨y_q | x_q⟩ = W_z W_x (W_x^T W_x + Ψ_x)^{-1} x_q = b_OLS,noise^T x_q

where W_z is a diagonal matrix with entries w_zm on its diagonal (and similarly for W_x and Ψ_x). Further algebraic manipulation of the expression for b_OLS,noise reveals the following:

b_OLS,noise = W_z W_x (W_x^T W_x + Ψ_x)^{-1}
            = W_z W_x W_x^{-1} W_x^{-1} (W_x^T W_x W_x^{-1} W_x^{-1} + Ψ_x W_x^{-1} W_x^{-1})^{-1}
            = W_z W_x^{-1} (I + Ψ_x W_x^{-1} W_x^{-1})^{-1}
            = b_OLS (I + Ψ_x W_x^{-1} W_x^{-1})^{-1}    (3.5)

where, in the second line of Eq. (3.5), we introduce a multiplicative factor of W_x^{-1} W_x^{-1} W_x W_x = I (the identity matrix). Hence, b_OLS,noise is always less than or equal to b_true. Thus, OLS regression underestimates the regression vector and produces biased predictions, a problem that cannot be fixed by adding more training data.

[Figure 3.1: Graphical model for joint Factor Analysis for regression. Random variables are in circular nodes, observed random variables are in shaded double circles, and point-estimated parameters are in square nodes.]

Intentionally, the input/output noise model formulation in Eq. (3.2) was chosen such that it coincides with a version of Factor Analysis (Massey 1965) tailored for regression problems (please refer to Appendix C.1 for more details on Factor Analysis for regression). The intuition of this model is given in Figure 3.1: every observed input x_im and output y_i is assumed to be generated by a set of hidden variables t_im and contaminated with some noise, as given in Eq. (3.2).
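A quick simulation illustrates the attenuation effect. The sketch below (illustrative only; the parameter values are assumptions) generates data from the model of Eq. (3.2) with d = 1, w_z = 2, w_x = 1 and unit input-noise variance, then fits OLS on the noisy inputs; the expected OLS slope is w_z w_x / (w_x² + ψ_x) = 1, half of b_true = w_z / w_x = 2.

```python
import random

random.seed(0)
N = 20000
w_z, w_x = 2.0, 1.0          # assumed generative parameters (Eq. 3.2, d = 1)
psi_y, psi_x = 0.1, 1.0      # output / input noise variances

xs, ys = [], []
for _ in range(N):
    t = random.gauss(0.0, 1.0)                        # hidden noiseless input
    ys.append(w_z * t + random.gauss(0.0, psi_y ** 0.5))
    xs.append(w_x * t + random.gauss(0.0, psi_x ** 0.5))

# OLS slope on the *noisy* inputs: cov(x, y) / var(x)
mx = sum(xs) / N
my = sum(ys) / N
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / N
var = sum((x - mx) ** 2 for x in xs) / N
b_noise = cov / var

b_true = w_z / w_x   # the slope that maps *noiseless* inputs to y
# b_noise comes out near 1.0, clearly attenuated relative to b_true = 2.0,
# and no amount of extra data removes the bias.
```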
The graphical model in Figure 3.1 compactly describes the full multi-dimensional system: the variables x_im, t_im, w_xm and w_zm are duplicated d times for the d input dimensions of the data, as represented by the four nodes in the plate indexed by m. The other plate, indexed by i, shows that there are N samples of observed {x_i, y_i} data. The goal of learning is to find the parameters w_xm and w_zm, which can only be achieved by estimating the hidden variables t_im and the variances of all random variables. In order to ensure that all parameters of the model are well-constrained, it must be assumed that all t_im follow a Gaussian distribution with mean zero and unit variance, i.e., t_im ~ Normal(0, 1).

The specific version of Factor Analysis for regression depicted in Figure 3.1 is called joint-space Factor Analysis regression, or Joint Factor Analysis (JFA) regression, as both input and output variables are treated the same in the estimation process (i.e., only their joint distribution matters). While Joint Factor Analysis regression is well-suited for modeling regression problems with noisy input data, it does not handle ill-conditioned data very well and is computationally expensive in high dimensions, due to a repeated high-dimensional matrix inversion in the ensuing iterative estimation procedure. Once the parameters, hidden variables and variances have been estimated, optimal prediction can be performed with either noisy or noiseless inputs, by deriving the appropriate conditional distributions.

In the following section, we develop a Bayesian treatment of Joint Factor Analysis regression that is robust to ill-conditioned data, automatically detects non-identifiable parameters, and detects noise in the input and output data.
3.2 Bayesian Regression with Input Noise

3.2.1 EM-based Joint Factor Analysis

To start with, we introduce the hidden variables z_im, as done in Section 2.2, so that our noisy linear regression model from Eq. (3.2) now becomes:

y_i = Σ_{m=1}^d z_im + ε_y_i
z_im = w_zm t_im + ε_z_im
x_im = w_xm t_im + ε_x_im    (3.6)

The new model is shown graphically in Figure 3.2. We use the EM algorithm to determine all open parameters using maximum likelihood estimation, with the following probability distributions over the random variables:

y_i | z_i ~ Normal(1^T z_i, ψ_y)
z_im | t_im, w_zm ~ Normal(w_zm t_im, ψ_zm)
x_im | t_im, w_xm ~ Normal(w_xm t_im, ψ_xm)
t_im ~ Normal(0, 1)    (3.7)

where 1 = [1, 1, ..., 1]^T, z_i ∈ R^{d×1} is composed of elements z_im, w_z ∈ R^{d×1} is composed of elements w_zm, and w_x, ψ_z and ψ_x are similarly composed of elements w_xm, ψ_zm and ψ_xm, respectively. As Figure 3.2 shows, the regression coefficients w_zm are now behind the fan-in to the output y_i.

[Figure 3.2: Graphical model for joint factor analysis for efficient estimation. Random variables are in circular nodes, observed random variables are in shaded double circles, and point-estimated parameters are in square nodes. Note that square nodes for the noise variances ψ_y and ψ_zm have been omitted from the model for graphical clarity.]

This efficient Joint Factor Analysis formulation decouples the input dimensions and operates with O(d) cost per EM iteration, where d is the number of input dimensions, instead of the approximately O(d^3) per EM iteration of traditional Joint Factor Analysis.

3.2.2 Automatic Feature Detection

The efficient maximum likelihood formulation of Joint Factor Analysis regression is, however, still vulnerable to ill-conditioned data. Thus, we introduce a Bayesian layer on top of this model by treating the regression parameters w_z and w_x probabilistically to protect against overfitting, as shown in Figure 3.3.
To do this, we introduce so-called "precision" variables α_m over each regression parameter w_zm. The same α_m is also placed over each w_xm, leading to a coupled regularization of w_zm and w_xm. As a result, the regression parameters are now distributed as w_zm ~ Normal(0, 1/α_m) and w_xm ~ Normal(0, 1/α_m), where α_m takes on a Gamma distribution with parameters a_αm and b_αm, as shown below:

w_z | α ~ Π_{m=1}^d Normal(w_zm; 0, 1/α_m)
w_x | α ~ Π_{m=1}^d Normal(w_xm; 0, 1/α_m)
α ~ Π_{m=1}^d Gamma(α_m; a_αm,0, b_αm,0)    (3.8)

where a_αm,0 and b_αm,0 are the initial hyperparameter values for the distribution of α_m. The rationale of this Bayesian modeling technique is as follows. The key quantity that determines the relevance of a regression input is the parameter α_m. A priori, we assume that every w_zm has a mean-zero distribution with broad variance 1/α_m. We also assume that the precision α_m has an initial value of 1 with large variance, by setting both initial values a_αm,0 and b_αm,0 to 10^-6. If the posterior value of α_m turns out to be very large after all model parameters are estimated, then the corresponding posterior distribution of w_zm must be sharply peaked at zero. This gives strong evidence that w_zm = 0 and that the input t_m contributes no information to the regression model. If an input t_m contributes no information to the output, then it is also irrelevant how much it contributes to x_im; that is, the corresponding inputs x_m can be treated as pure noise. Coupling both w_zm and w_xm with the same precision variable α_m accomplishes exactly this effect. In this way, the Bayesian approach automatically detects irrelevant input dimensions and regularizes against ill-conditioned data sets.

Even with the Bayesian layer added, the entire regression problem can be treated as an EM-like learning problem (Ghahramani & Beal 2000).

[Figure 3.3: Graphical model for the Bayesian version of joint factor analysis for noisy linear regression. Random variables are in circular nodes, observed random variables are in shaded double circles, and point-estimated parameters are in square nodes. Note that square nodes for the noise variances ψ_y and ψ_zm have been omitted from the model for graphical clarity.]

Our goal is to maximize the log likelihood log p(y|X), i.e., the "incomplete" log likelihood, in which all hidden probabilistic variables are marginalized out. Due to analytical problems, we do not have access to this incomplete log likelihood, but rather only to a lower bound of it. This lower bound is based on the expected value of the so-called "complete" data likelihood, ⟨log p(y, Z, T, w_z, w_x, α | X)⟩, formulated over all variables of the learning problem, where:

log p(y, Z, T, w_z, w_x, α | X)
  = Σ_{i=1}^N log p(y_i | z_i) + Σ_{i=1}^N Σ_{m=1}^d log p(z_im | w_zm, t_im) + Σ_{i=1}^N Σ_{m=1}^d log p(x_im | w_xm, t_im)
  + Σ_{i=1}^N Σ_{m=1}^d log p(t_im) + Σ_{m=1}^d log{p(w_zm | α_m) p(α_m)}
  + Σ_{m=1}^d log{p(w_xm | α_m) p(α_m)} + const_{y,Z,T,w_z,w_x,α}    (3.9)

and where Z ∈ R^{N×d} contains the vectors z_i in its rows and T ∈ R^{N×d} contains the vectors t_i in its rows. The expectation of this complete data likelihood should be taken with respect to the true posterior distribution of all hidden variables, Q(α, w_z, w_x, Z, T). Unfortunately, this is an analytically intractable expression. Instead, a lower bound can be formulated using a technique from variational calculus, in which we make a factorial approximation of the true posterior: Q(α, w_z, w_x, Z, T) = Q(α) Q(w_z) Q(w_x) Q(Z, T). We now have a mechanism that infers the significance of each dimension's contribution to the observed output y and observed inputs X.
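The relevance test just described can be sketched numerically. The snippet below is an illustrative sketch, not thesis code: it applies the conjugate Gamma update for α_m (posterior shape â = a_0 + 1 and posterior rate b̂ = b_0 + (⟨w_zm²⟩ + ⟨w_xm²⟩)/2, the shape growing by 1 because w_zm and w_xm share the same precision) and flags a dimension as irrelevant when the posterior mean ⟨α_m⟩ = â/b̂ becomes very large. The threshold and the example second moments are assumptions.

```python
def alpha_posterior_mean(w2_zm, w2_xm, a0=1e-6, b0=1e-6):
    """Posterior mean of the precision alpha_m under the coupled Gamma prior.
    Both w_zm and w_xm share alpha_m, so the shape grows by 1 (two halves)."""
    a_hat = a0 + 1.0
    b_hat = b0 + 0.5 * (w2_zm + w2_xm)
    return a_hat / b_hat

# Second moments <w^2> for three hypothetical input dimensions after a fit:
# dim 0 clearly contributes, dim 1 weakly, dim 2 has been shrunk to ~zero.
w2_z = [1.0, 0.04, 1e-10]
w2_x = [0.8, 0.05, 1e-10]

alphas = [alpha_posterior_mean(a, b) for a, b in zip(w2_z, w2_x)]
relevant = [a < 1e3 for a in alphas]   # assumed relevance threshold
# dim 2's precision explodes (~1e6), so that input is treated as pure noise.
```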
We can derive the EM update equations using standard manipulations of Normal and Gamma distributions (please refer to Appendix C.2 for derivations), arriving at the following.

E-step:

σ²_wzm = 1 / ( (1/ψ_zm) Σ_{i=1}^N ⟨t²_im⟩ + ⟨α_m⟩ )    (3.10)
⟨w_zm⟩ = (σ²_wzm / ψ_zm) Σ_{i=1}^N ⟨z_im t_im⟩    (3.11)
σ²_wxm = 1 / ( (1/ψ_xm) Σ_{i=1}^N ⟨t²_im⟩ + ⟨α_m⟩ )    (3.12)
⟨w_xm⟩ = (σ²_wxm / ψ_xm) Σ_{i=1}^N x_im ⟨t_im⟩    (3.13)
â_αm = a_αm,0 + 1    (3.14)
b̂_αm = b_αm,0 + (⟨w²_zm⟩ + ⟨w²_xm⟩)/2    (3.15)

M-step:

ψ_y = (1/N) Σ_{i=1}^N ( y²_i − 2 y_i 1^T ⟨z_i⟩ + 1^T ⟨z_i z_i^T⟩ 1 )    (3.16)
ψ_zm = (1/N) Σ_{i=1}^N ( ⟨z²_im⟩ − 2 ⟨w_zm⟩ ⟨z_im t_im⟩ + ⟨w²_zm⟩ ⟨t²_im⟩ )    (3.17)
ψ_xm = (1/N) Σ_{i=1}^N ( x²_im − 2 ⟨w_xm⟩ ⟨t_im⟩ x_im + ⟨w²_xm⟩ ⟨t²_im⟩ )    (3.18)

where the covariance matrix Σ of the joint posterior distribution of Z and T is

Σ = [ Σ_zz  Σ_zt ; Σ_tz  Σ_tt ]

with:

Σ_zz = M − (M 1 1^T M) / (ψ_y + 1^T M 1)    (3.19)
Σ_tt = K^{-1} + K^{-1} ⟨W_z⟩^T Ψ_z^{-1} Σ_zz Ψ_z^{-1} ⟨W_z⟩ K^{-1}    (3.20)
Σ_zt = −Σ_zz ⟨W_z⟩ Ψ_z^{-1} K^{-1}    (3.21)
Σ_tz = Σ_zt^T    (3.22)
K = I + ⟨W_x^T W_x⟩ Ψ_x^{-1} + ⟨W_z^T W_z⟩ Ψ_z^{-1}    (3.23)
M = Ψ_z + ⟨W_z⟩ ( I + ⟨W_x^T W_x⟩ Ψ_x^{-1} + (Σ_Wz)_mm Ψ_z^{-1} )^{-1} ⟨W_z⟩^T    (3.24)

and where ⟨W_x⟩ is a diagonal d×d matrix with ⟨w_x⟩ along its diagonal. Similarly, ⟨W_z⟩, Ψ_x and Ψ_z are d×d diagonal matrices with diagonal vectors ⟨w_z⟩, ψ_x and ψ_z, respectively. The E-step updates for Z and T are then:

⟨z_i⟩ = (y_i / ψ_y) 1^T Σ_zz + x_i ⟨W_x⟩^T Ψ_x^{-1} Σ_tz    (3.25)
⟨t_i⟩ = (y_i / ψ_y) 1^T Σ_zz ⟨W_z⟩ Ψ_z^{-1} K^{-1} + x_i ⟨W_x⟩^T Ψ_x^{-1} Σ_tt    (3.26)
The result is an efficient Bayesian algorithm that is robust to high dimensional ill-conditioned noisy data. Initialization of Parameters: Initialization of parameters is crucial when using the EM algorithm and will affect the quality of the final converged solution. The observed {x i ,y i } data is assumed to be preprocessed to have a mean of 0 and a variance of 1. Included below is a description of how each parameter is initialized. Note that this set of initializations does not need to be modified from data set to data set. • w x : These weights are all initialized to 1, with a little bit of randomness added to them (i.e. 1+randn(1,d)∗0.1, where d is the number of input dimensions). • w z : These weights are initialized to the correlation vector found using Partial Least Squares (x T y) and normalized such that the length is 1. 75 • ψ x : The variance of x is set to the observed variance of x (that is, 1, since x has been preprocessed already). • ψ z : The variance of the hidden variablez is set to 1 with a little bit of randomness added (1+randn(1,d)∗0.1). • ψ y : ψ y is initialized to 1. • {a αm,0 ,b αm,0 }: Both a αm,0 and b αm,0 are initialized to 10 −8 so that the prior over α m is flat and uninformative. That is to say, the precision α m has an initial mean of 1, with a very large variance. For details on how to check for convergence of the EM algorithm, please refer to Appendix C.3. 3.2.3 Inference of the Regression Solution Estimating the rather complex probabilistic Bayesian model for Joint Factor Analysis regression gives us the distributions and mean values for all hidden variables. However, one additional step is required to infer the final regression parameters, which, in our application, are the RBD parameters. For this purpose, we consider the predictive dis- tribution p(y q |x q ) for a new noisy test input x q and its unknown output y q . 
We can calculate ⟨y_q | x_q⟩, the mean of the distribution p(y_q | x_q), by conditioning y_q on x_q and marginalizing out all hidden variables. Since an analytical solution of the resulting integral is only possible for the probabilistic Joint Factor Analysis regression model in Figure 3.2, and not for the full Bayesian treatment, we restrict our computations to this simpler probabilistic model and assume that W_x and W_z are replaced by their point estimates ⟨W_x⟩ and ⟨W_z⟩, such that our results hold in approximation for the Bayesian model. Thus, the predictive distribution is:

p(y_q | x_q, X, Y) = ∫∫ p(y_q, Z, T | x_q, X, Y) dZ dT    (3.30)

where X and Y are the noisy input and output data used for training. From solving this integral, we can infer the value of the regression estimate b̂, since ⟨y_q | x_q⟩ = b̂^T x_q. The resulting regression estimate, given noisy inputs x_q and noisy outputs y_q, is b̂_noise:

b̂_noise = [ ψ_y 1^T B^{-1} / (ψ_y − 1^T B^{-1} 1) ] Ψ_z^{-1} ⟨W_z⟩ A_noise^{-1} ⟨W_x⟩^T Ψ_x^{-1}    (3.31)

where Ψ_x is a diagonal matrix with the vector ψ_x on its diagonal (⟨W_x⟩, ⟨W_z⟩ and Ψ_z are similarly defined diagonal matrices with vectors ⟨w_x⟩, ⟨w_z⟩ and ψ_z on their diagonals, respectively) and where:

A_noise = I + ⟨W_x^T W_x⟩ Ψ_x^{-1} + ⟨W_z^T W_z⟩ Ψ_z^{-1}    (3.32)
B = (1 1^T)/ψ_y + Ψ_z^{-1} − Ψ_z^{-1} ⟨W_z⟩^T A_noise^{-1} ⟨W_z⟩ Ψ_z^{-1}    (3.33)

Let us compare b̂_noise in Eq. (3.31) to b̂_JFA, the regression estimate derived for Joint Factor Analysis regression (which can also be arrived at by conditioning y on x and marginalizing out the latent variables):

b̂_JFA = W_z A_JFA^{-1} W_x^T Ψ_x^{-1}    (3.34)
A_JFA = I + W_x^T W_x Ψ_x^{-1}

We can see that b̂_noise contains an additional term ⟨W_z^T W_z⟩ Ψ_z^{-1} in its A expression, due to the introduction of the hidden variables z; b̂_noise is also scaled by an additional variance-related term for the same reason. It is important to note that the regression vector b̂_noise given by Eq. (3.31) is for optimal prediction from noisy input data.
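As a sanity check on Eq. (3.34), consider the scalar case d = 1, where b̂_JFA reduces to w_z w_x / (ψ_x + w_x²). The short sketch below (with assumed parameter values, purely illustrative) confirms that this estimate is attenuated relative to w_z / w_x whenever ψ_x > 0, and approaches w_z / w_x as the input-noise variance shrinks.

```python
def b_jfa_scalar(w_z, w_x, psi_x):
    """Eq. (3.34) for d = 1: w_z * A^{-1} * w_x / psi_x with A = 1 + w_x^2/psi_x,
    which simplifies to w_z * w_x / (psi_x + w_x^2)."""
    return w_z * (w_x / psi_x) / (1.0 + w_x ** 2 / psi_x)

w_z, w_x = 2.0, 1.0
b_ols = w_z / w_x                          # slope for noiseless inputs

b_noisy = b_jfa_scalar(w_z, w_x, 1.0)      # substantial input noise
b_clean = b_jfa_scalar(w_z, w_x, 1e-9)     # nearly noiseless inputs
# b_noisy is attenuated (here exactly 1.0), while b_clean ~ 2.0 = b_ols.
```

This is exactly the distinction between "optimal prediction from noisy inputs" and the true regression vector, which motivates the limit taken next in the text.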
However, for system identification in RBD, we are interested in obtaining the true regression vector, i.e., the regression vector that predicts the output from noiseless inputs. Thus, the result in Eq. (3.31) is not quite suitable; what we actually want to calculate is the mean of p(y_q | t_q), where t_q are noiseless inputs. To address this issue, we can take the limit of b̂_noise as ψ_x → 0 and interpret the resulting expression as the true regression vector for noiseless inputs (as ψ_x → 0, the amount of input noise approaches 0). The resulting regression vector estimate b̂_true becomes:

b̂_true = [ ψ_y 1^T C^{-1} / (ψ_y − 1^T C^{-1} 1) ] Ψ_z^{-1} ⟨W_z⟩^T ⟨W_x⟩^{-1}    (3.35)

where C = (1 1^T)/ψ_y + Ψ_z^{-1}. This is the desired regression vector estimate for noiseless data, and it is the one we use in our evaluations.

3.3 Post-processing for Physically Consistent Rigid Body Parameters

Before our evaluations, we need to return for a moment to the specifics of our intended application domain of RBD parameter estimation. Given a Bayesian estimate of the RBD parameters, we would like to ensure that the inferred regression vector satisfies the constraints given by positive definite inertia matrices and the parallel axis theorem. In our RBD estimation problem, there are 11 RBD parameters for each DOF, which we arrange in an 11-dimensional vector θ consisting of the following parameters: the mass, the three center-of-mass coefficients multiplied by the mass, and six inertial parameters. This choice of parameterization is the only one that is identifiable using linear regression (An et al. 1988). Additionally, we include viscous friction as the 11th parameter.

In order to enforce the aforementioned physical constraints, we introduce an 11-dimensional virtual parameter vector θ̂ that we assume is used in a nonlinear transformation to generate θ, i.e., θ = f(θ̂).
This nonlinear transformation between virtual parameters θ̂ and actual parameters θ is shown below for one DOF:

θ_1 = θ̂_1²
θ_2 = θ̂_2 θ̂_1²
θ_3 = θ̂_3 θ̂_1²
θ_4 = θ̂_4 θ̂_1²
θ_5 = θ̂_5² + (θ̂_4² + θ̂_3²) θ̂_1²
θ_6 = θ̂_5 θ̂_6 − θ̂_2 θ̂_3 θ̂_1²   (3.36)
θ_7 = θ̂_5 θ̂_7 − θ̂_2 θ̂_4 θ̂_1²
θ_8 = θ̂_6² + θ̂_8² + (θ̂_2² + θ̂_4²) θ̂_1²
θ_9 = θ̂_6 θ̂_7 + θ̂_8 θ̂_9 − θ̂_3 θ̂_4 θ̂_1²
θ_10 = θ̂_7² + θ̂_9² + θ̂_10² + (θ̂_2² + θ̂_3²) θ̂_1²
θ_11 = θ̂_11²

In essence, the virtual parameters θ̂ correspond to the square root of the mass, the true center-of-mass coordinates (i.e., not multiplied by the mass), a Cholesky decomposition of the DOF's inertia matrix at the center of gravity (to ensure positive definiteness of the inertia matrix), and the square root of the viscous friction coefficient. The functions in Eq. (3.36) encode the parallel axis theorem and some additional constraints, ensuring that the mass and viscous friction coefficients remain strictly positive. Given the above formulation, any arbitrary set of virtual parameters gives rise to a physically consistent set of actual parameters for the RBD problem. For a robotic system with s DOFs, Eq. (3.36) is repeated for each DOF. Since there are 11 features for each DOF, the result is an 11s-dimensional regression vector θ, where θ_m = f_m(θ̂) for m = 1..d and d = 11s.

There are at least two possible ways to enforce the physical constraints of the RBD parameters in our Bayesian estimation algorithm. The first (ideal) approach involves reformulating our algorithm using the virtual parameters θ̂ described previously instead of the actual parameters θ. Unfortunately, this method leads to an analytically intractable set of update equations due to the nonlinear relationship between virtual and actual parameters. In the second approach, we can consider a post-processing step, where the unconstrained parameters are appropriately projected onto the constrained parameters.
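As a concrete illustration, a minimal NumPy sketch of the mapping in Eq. (3.36) for a single DOF is given below. The function name and the indexing convention (th[0] corresponding to θ̂_1) are our own; the inertia matrix assembled in the usage example follows the standard parallel axis theorem:

```python
import numpy as np

def virtual_to_actual(th):
    """Map the 11-dim virtual parameter vector theta_hat of one DOF to
    physically consistent RBD parameters theta, following Eq. (3.36).
    th[0] corresponds to theta_hat_1, ..., th[10] to theta_hat_11."""
    t = np.asarray(th, dtype=float)
    m = t[0] ** 2                                  # mass, strictly positive
    theta = np.empty(11)
    theta[0] = m
    theta[1:4] = t[1:4] * m                        # center of mass times mass
    # inertia at the link frame: Cholesky-parameterized inertia at the
    # center of gravity plus the parallel-axis terms
    theta[4] = t[4]**2 + (t[3]**2 + t[2]**2) * m               # I_xx
    theta[5] = t[4]*t[5] - t[1]*t[2] * m                       # I_xy
    theta[6] = t[4]*t[6] - t[1]*t[3] * m                       # I_xz
    theta[7] = t[5]**2 + t[7]**2 + (t[1]**2 + t[3]**2) * m     # I_yy
    theta[8] = t[5]*t[6] + t[7]*t[8] - t[2]*t[3] * m           # I_yz
    theta[9] = t[6]**2 + t[8]**2 + t[9]**2 + (t[1]**2 + t[2]**2) * m  # I_zz
    theta[10] = t[10] ** 2                         # viscous friction
    return theta

theta = virtual_to_actual([1.2, 0.3, -0.1, 0.2, 0.9, 0.1, -0.2, 0.8, 0.05, 0.7, 0.4])
```

Because every term is a square, a Cholesky product or a parallel-axis shift, any real-valued θ̂ yields a positive mass, a non-negative friction coefficient and a positive definite inertia matrix.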
For this purpose, we assume that we would like to find the optimal virtual parameters in a least-squares sense, i.e., by minimizing the cost function:

J = (1/2)(y − Xθ)^T (y − Xθ)   (3.37)

where X and y are input and output data, subject to the constraints θ_m = f_m(θ̂). For the moment, we will ignore issues of noise in the input data and of ill-conditioned data sets. Let us assume that some arbitrary estimation algorithm generated an estimate of the unconstrained parameters, θ_uc. The constrained parameters can then be written as θ = θ_uc + Δθ, where Δθ denotes the difference between constrained and unconstrained parameters. Substituting this into Eq. (3.37) results in:

J = (1/2)(y − Xθ_uc)^T (y − Xθ_uc) − (y − Xθ_uc)^T X Δθ + (1/2)Δθ^T X^T X Δθ   (3.38)

Minimizing this cost function with respect to the virtual parameters only requires consideration of the second and third terms of Eq. (3.38), since the first term does not depend on the virtual parameters.

Now, let us consider algorithms to generate θ_uc. Among the most straightforward is OLS, which is equivalent to reformulating Eq. (3.37) in terms of θ_uc:

J_uc = (1/2)(y − Xθ_uc)^T (y − Xθ_uc)   (3.39)

taking the derivative ∂J_uc/∂θ_uc and setting it to zero:

∂J_uc/∂θ_uc = −(y − Xθ_uc)^T X = 0   (3.40)

If we insert this result into Eq. (3.38), we see that the second term of this cost function equals zero, leaving only the third term to be considered in order to obtain the optimal virtual parameters. Thus, we can conclude that for optimal projection of the unconstrained parameters onto the constrained parameters, all we need to do is minimize the difference between unconstrained and constrained parameters under the metric X^T X.

We can consider other algorithms (other than OLS) to generate θ_uc. For instance, SVD regression (Belsley et al. 1980) performs OLS in a subspace of the original input dimensionality of the regression problem. Thus, the cost functions in Eqs.
(3.39) and (3.38) would be formulated only over the input dimensions that were identified as relevant to the regression problem. Hence, the results regarding the minimization of the difference between unconstrained and constrained parameters hold as well.

More interestingly, if we use our Bayesian estimation method to generate θ_uc, the result will be similar to SVD regression in that some of the input dimensions will be eliminated. Additionally, the algorithm also estimates the noise in the inputs and returns a regression vector that can be applied to noiseless query points. If we re-express the noisy inputs X as X_t + Γ, where X_t are the noiseless inputs and Γ is the input noise, then we can re-write the third term of Eq. (3.38) in terms of de-noised quantities:

(1/2)Δθ^T X_t^T X_t Δθ   (3.41)

The second term of Eq. (3.38) does not yield exactly zero as in an OLS regression but, empirically, it is very close to zero, such that only the term in Eq. (3.41) matters in the actual optimization problem.

In summary, we can see that in order to minimize the least-squared error in Eq. (3.37) with respect to the physically constrained parameters of RBD, we can follow an approximate two-step procedure. First, we apply our Bayesian algorithm (or any other algorithm, for that matter) to come up with an optimal unconstrained parameter estimate θ_uc. Then, we find the virtual parameter estimates θ̂ (and the corresponding physically consistent parameter estimates θ) such that the error between θ and θ_uc is minimized in the sense of Eq. (3.41). If the noiseless inputs are not estimated explicitly, the term X_t is replaced by the noisy inputs X. The optimization of Eq. (3.41) is easily achieved numerically, as it is a simple convex function with a unique global minimum. If θ_uc is estimated by OLS or SVD regression, the results for the constrained parameters are optimal. If θ_uc is estimated by our Bayesian or any other nonlinear method, the results for the constrained parameters are approximately optimal.
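The two-step procedure can be sketched as follows. Here, virtual_to_actual implements Eq. (3.36) for one DOF, and the monotone finite-difference descent is our own stand-in for the generic numerical optimization the text refers to (all starting points and constants are illustrative):

```python
import numpy as np

def virtual_to_actual(t):
    # Eq. (3.36) for one DOF: t[0]^2 is the mass, t[1:4] the center of
    # mass, t[4:10] a Cholesky factor of the inertia and t[10]^2 friction.
    m = t[0] ** 2
    return np.array([
        m, t[1] * m, t[2] * m, t[3] * m,
        t[4]**2 + (t[3]**2 + t[2]**2) * m,
        t[4]*t[5] - t[1]*t[2] * m,
        t[4]*t[6] - t[1]*t[3] * m,
        t[5]**2 + t[7]**2 + (t[1]**2 + t[3]**2) * m,
        t[5]*t[6] + t[7]*t[8] - t[2]*t[3] * m,
        t[6]**2 + t[8]**2 + t[9]**2 + (t[1]**2 + t[2]**2) * m,
        t[10]**2])

def project_to_consistent(theta_uc, X, iters=500):
    """Step 2: minimize Eq. (3.41), the distance between f(theta_hat)
    and theta_uc under the metric X^T X, by monotone descent with a
    finite-difference gradient and backtracking line search."""
    M = X.T @ X
    def cost(t):
        d = virtual_to_actual(t) - theta_uc
        return 0.5 * d @ M @ d
    t, step, eps = np.ones(11), 1e-3, 1e-5
    c = cost(t)
    for _ in range(iters):
        g = np.array([(cost(t + eps * e) - cost(t - eps * e)) / (2 * eps)
                      for e in np.eye(11)])
        while step > 1e-12:
            t_new = t - step * g
            c_new = cost(t_new)
            if c_new < c:          # accept only strict improvements
                break
            step *= 0.5
        else:
            break                  # no improving step found: converged
        t, c, step = t_new, c_new, step * 1.5
    return virtual_to_actual(t)

# Step 1: any unconstrained estimate, here plain OLS on simulated data
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 11))
t_true = np.array([1.1, 0.2, -0.1, 0.3, 0.8, 0.1, 0.0, 0.9, 0.1, 0.7, 0.5])
y = X @ virtual_to_actual(t_true) + 0.01 * rng.standard_normal(500)
theta_uc = np.linalg.lstsq(X, y, rcond=None)[0]
theta = project_to_consistent(theta_uc, X)
```

The descent only ever accepts steps that lower the cost, so the projected parameters are guaranteed to be at least as close to θ_uc (under X^T X) as the starting point, while remaining physically consistent by construction.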
Empirically, we found that the proposed procedure always achieves satisfactory results.

3.4 Evaluation

We evaluated our algorithm on both synthetic data and robotic data for the task of system identification. The goal of these evaluations was to determine how well our Bayesian de-noising algorithm performs compared to other standard techniques for parameter estimation in the presence of noisy input and noisy output data.

First, we evaluate our algorithm on a synthetic data set in order to illustrate its effectiveness at de-noising input and output data. Then, we apply the algorithms to a 7 DOF robotic oculomotor vision head and a 10 DOF robotic anthropomorphic arm, shown in Figures 3.4 and 3.5, respectively, for the task of parameter estimation in rigid body dynamics.

Figure 3.4: Robotic oculomotor vision head by Sarcos (Cambridge, MA).

Figure 3.5: Robotic anthropomorphic arm by Sarcos (Cambridge, MA).

3.4.1 Synthetic Data

We synthesized random input training data consisting of 10 relevant dimensions and 90 irrelevant and redundant dimensions. The first 10 input dimensions were drawn from a multi-dimensional Gaussian distribution with a random covariance matrix. The output data was generated using an ordered regression vector b_true = [1, 2, ..., 10]^T. Output noise was added with a signal-to-noise ratio (SNR) of 5. Then, we added Gaussian noise with varying SNRs (an SNR of 2 for strongly noisy input data and an SNR of 5 for less noisy input data) to the 10 relevant input dimensions. A varying number of redundant data vectors was added to the input data; these were generated from random convex combinations of the 10 noisy relevant data vectors. Finally, we added irrelevant data columns, drawn from a Normal(0,1) distribution, until a total of 100 input dimensions was attained. The result was an input training data set that contained irrelevant and redundant dimensions.
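The data-generation recipe above can be sketched as follows (the sample count, seed and the particular redundant/irrelevant split are illustrative choices; SNR is implemented here as signal standard deviation over noise standard deviation):

```python
import numpy as np

def make_synthetic(N=1000, n_rel=10, n_red=60, n_irr=30,
                   snr_in=2.0, snr_out=5.0, seed=0):
    rng = np.random.default_rng(seed)
    # 10 relevant dimensions from a Gaussian with a random covariance
    A = rng.standard_normal((n_rel, n_rel))
    T = rng.standard_normal((N, n_rel)) @ A.T          # noiseless inputs
    b_true = np.arange(1.0, n_rel + 1.0)               # b = [1, 2, ..., 10]^T
    y_clean = T @ b_true
    y = y_clean + rng.standard_normal(N) * y_clean.std() / snr_out
    # noisy versions of the relevant dimensions
    Xr = T + rng.standard_normal((N, n_rel)) * T.std(axis=0) / snr_in
    # redundant columns: random convex combinations of the noisy columns
    C = rng.random((n_rel, n_red))
    C /= C.sum(axis=0)                                 # columns sum to one
    # irrelevant columns drawn from Normal(0, 1)
    X = np.hstack([Xr, Xr @ C, rng.standard_normal((N, n_irr))])
    return X, y, T, y_clean

X, y, T, y_clean = make_synthetic()
```

The redundant block spans no new directions beyond the 10 noisy relevant columns, which is exactly what makes the resulting regression problem ill-conditioned.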
Test data was created using the same method outlined above, except that input and output data were both noiseless. We compared our Bayesian de-noising algorithm with the following methods: i) OLS regression; ii) stepwise regression (Draper & Smith 1981), which tends to be inconsistent in the presence of collinear inputs (Derksen & Keselman 1992); iii) PLS regression (Wold 1975), a slightly heuristic but empirically successful regression method for high-dimensional data; iv) LASSO regression (Tibshirani 1996), which gives sparse solutions by shrinking certain coefficients to zero under the control of a regularization parameter; v) our probabilistic treatment of Joint Factor Analysis regression in Figure 3.2; and vi) our Bayesian de-noising algorithm shown in Figure 3.3. In this synthetic evaluation, there was no need to constrain parameters according to physical consistency rules.

Figure 3.6: Average normalized mean squared prediction errors (nMSE) on noiseless test data for a 100-dimensional data set with 10 relevant input dimensions and various combinations of redundant input dimensions r and irrelevant input dimensions u, averaged over 10 trials. Output data has SNR = 5. (a) Average prediction error for training data with an SNR of 2; (b) average prediction error for training data with an SNR of 5. Algorithms evaluated include OLS, stepwise regression (STEP), PLS regression (PLS), LASSO regression (LASSO), Joint Factor Analysis regression (JFA) and our Bayesian de-noising algorithm (BAYES).

The Bayesian de-noising algorithm had an improvement of 10 to 300% compared to the other algorithms, as the black bars in Figures 3.6(a) and 3.6(b) illustrate.
One interesting observation is that in the case where the 90 input dimensions are all irrelevant, the Bayesian de-noising algorithm did not give as significant a reduction in error as in the other three scenarios. This result can be explained by the fact that the other algorithms suffer primarily from redundant inputs, but not so much from irrelevant inputs, which do not cause numerical problems. The true power of our Bayesian algorithm lies in its ability to identify relevant dimensions in the presence of redundant and irrelevant data.

3.4.2 Robotic Oculomotor Vision Head

Table 3.1: Root mean squared errors for position (in radians), velocity (in radians/sec) and feedback command (in Newton-meters) for the robotic oculomotor vision head.

Algorithm            Position (rad)  Velocity (rad/s)  Feedback (Nm)
Ridge Regression     0.0291          0.2465            0.3969
Bayesian De-noising  0.0243          0.2189            0.3292
LASSO Regression     0.0308          0.2517            0.4274
Stepwise Regression  FAILURE         FAILURE           FAILURE

The Sarcos robotic oculomotor vision head, shown in Figure 3.4, has 7 DOFs, giving 77 features in total (there are 11 features per DOF). The kinematic structure of robotic systems always creates non-identifiable parameters and, thus, redundancies (An et al. 1988). We implemented a computed torque control law on the robot, using the estimated parameters from each technique. Table 3.1 shows the root mean squared errors averaged over all DOFs. The Bayesian parameter estimation approach performed around 10 to 20% better than ridge regression with gradient descent, while LASSO regression performed worse. Stepwise regression produced RBD parameters that were physically impossible to run on the robotic head. This can be attributed to stepwise regression's failure to identify the relevant features in the data set.

3.4.3 Robotic Anthropomorphic Arm

We also evaluated the parameter estimation algorithms on a 10 DOF Sarcos robotic anthropomorphic arm, shown in Figure 3.5, and the evaluations were done in a similar way as for the robot head.
We collected about a million data points from the arm and downsampled the data to a more manageable size of 500,000. Table 3.2 shows the results averaged over all 10 DOFs. The Bayesian parameter estimation approach performed around 5 to 17% better than the other techniques. LASSO regression failed due to over-aggressive clipping of relevant dimensions, and stepwise regression produced RBD parameters that were impossible to run on the robotic arm.

Table 3.2: Root mean squared errors for position (in radians), velocity (in radians/sec) and feedback command (in Newton-meters) for the robotic anthropomorphic arm.

Algorithm            Position (rad)  Velocity (rad/s)  Feedback (Nm)
Ridge Regression     0.0210          0.1119            0.5839
Bayesian De-noising  0.0201          0.0930            0.5297
LASSO Regression     FAILURE         FAILURE           FAILURE
Stepwise Regression  FAILURE         FAILURE           FAILURE

3.5 Discussion

This chapter addresses the problem of learning for system identification, as, for example, in a scenario where we have observed a system through empirical data and would like to uncover its true parameters. Learning for system identification differs from learning for prediction, which is the more common problem setting in many machine learning techniques for regression. Good prediction is often possible without modeling all components of the generative system. For instance, linear regression with noise in both the input and output data achieves surprisingly good prediction results on test data that has the same noise properties as the training data, despite the well-known fact that linear regression is not built to deal with noise in the input data.

The interesting new component of system identification comes from the desire to use the identified model in other ways than in the training scenario. In robotics, a typical example is the use of the system model for prediction with noiseless input data. In this scenario, the training data might have been contaminated by a large amount of input noise.
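This distinction between good prediction and correct identification is easy to reproduce. In the toy sketch below (our own scalar example), input noise with the same variance as the signal attenuates the OLS slope toward half its true value, yet that biased slope predicts equally noisy test data better than the true parameter would:

```python
import numpy as np

rng = np.random.default_rng(1)
N, b_true = 100_000, 2.0
t = rng.standard_normal(N)              # noiseless inputs
x = t + rng.standard_normal(N)          # observed inputs (input SNR = 1)
y = b_true * t + 0.1 * rng.standard_normal(N)

b_ols = (x @ y) / (x @ x)               # OLS slope through the origin
# classic attenuation: E[b_ols] = b_true * var(t) / (var(t) + var(noise)) = 1.0

# fresh test data with the same noise properties as the training data
t2 = rng.standard_normal(N)
x2 = t2 + rng.standard_normal(N)
y2 = b_true * t2 + 0.1 * rng.standard_normal(N)
mse_ols = np.mean((y2 - b_ols * x2) ** 2)
mse_true = np.mean((y2 - b_true * x2) ** 2)
```

Here b_ols is badly biased as a system parameter, yet mse_ols is lower than mse_true, which is exactly why good prediction alone does not certify correct identification.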
Another typical application is to create an analytical inverse of an identified model, as often needed in model-based control. For such applications, the system model needs to be identified as accurately as possible. This is only possible if all parameters of the data-generating model (in particular, all noise processes) are identified accurately.

We address linear system identification for situations where noise exists in both input and output data, a typical case in most robotic applications, where data is derived from noisy sensors. Additionally, we allow for the case of high-dimensional data, where many input dimensions are potentially redundant or irrelevant. To date, no efficient and robust algorithm has been suggested for such a problem setup. Inspired by factor analysis regression, a classical machine learning technique, we develop a novel full Bayesian treatment of the linear system identification problem. Due to effective Bayesian regularization, this algorithm is robust to high-dimensional, ill-conditioned data with noise-contaminated input and output data, and it remains computationally efficient, i.e., O(d) per iteration of the underlying EM-like algorithm, where d is the number of input dimensions. This algorithm has no parameters that need manual tuning. The algorithm is, however, iterative, but so is probabilistic factor analysis. The iterative nature of the algorithm allows it to be embedded into other, more complex iterative methods and makes it suitable for real-time learning scenarios.

We used this algorithm to estimate parameters in rigid body dynamics, an estimation problem that is linear in the unknown parameters. Since these parameters have physical meaning, it was necessary to enforce physically consistent parameters with a post-processing step. The physical constraints arose from the positive definiteness of inertia matrices, the positiveness of mass parameters, and the parallel axis theorem.
We demonstrated the efficiency of our algorithm by applying it to a synthetic data set, a 7 DOF robotic vision head and a 10 DOF robotic anthropomorphic arm. Our algorithm successfully identified the system parameters with 10 to 300% higher accuracy than alternative methods on synthetic data for parameter estimation in linear regression. It performed 5 to 25% better on real robot data, proving to be a competitive alternative for parameter estimation on complex, high degree-of-freedom robotic systems.

If desired, our Bayesian algorithm can easily be extended to nonlinear system identification in the framework of Locally Weighted Learning (LWL) (Atkeson et al. 1997). The only modification needed is to change the linear regression problem to a Bayesian weighted linear regression problem (Gelman et al. 2000). Thus, a piecewise linear model identification can be achieved, similar to Schaal & Atkeson (1998) and Vijayakumar, D'Souza & Schaal (2005). Parameters identified in such a nonparametric way usually lack any physical interpretability, such that our suggested post-processing to enforce physical correctness of the parameters is not applicable. We will leave the problem of full nonlinear system identification for future work.

Chapter 4
Dealing with Outlier-Infested Data

Robotic systems and their control mechanisms rely crucially on the quality of sensory data in order to make robust control decisions. While certain sensors, such as potentiometers or optical encoders, are inherently easy to assess in their noise characteristics, other sensors, such as visual systems, global positioning system (GPS) devices and sonar sensors, may provide measurements that are infested with outliers. Thus, robust and reliable outlier removal is necessary in order to include these types of data in control processes.
The particular application domain of legged locomotion is especially vulnerable to perceptual data of poor quality, as one undetected outlier can potentially disturb the balance controller to the point that the robot loses stability.

Additionally, for real-time applications, storing data samples may not be an option due to the high frequency of sensory data and insufficient memory resources. In this scenario, sensor data is made available one sample at a time (arriving sequentially over time) and must be discarded once it has been observed.

4.1 Background

An outlier is generally defined as an observation that "lies outside some overall pattern of distribution" (Moore & McCabe 1999). Outliers may arise from sensor noise (producing values that fall outside the valid range of values), temporary sensor failures or unanticipated disturbances in the environment (e.g., a brief change of lighting conditions for a visual sensor). A typical approach to detecting outliers is to characterize what normal observations look like, and then to single out samples that deviate from these normal properties. Existing methods for outlier detection include i) methods that classify a data sample based on a (Mahalanobis) distance from the expected value, ii) approaches that use information-theoretic principles, such as selecting the subset of data points that minimizes the prediction error, and iii) techniques that assume that the data was generated by some special generative model.

Outlier classification based on a Mahalanobis distance can work well, but it requires the setting of a threshold that defines whether a point is an outlier or not. This threshold is typically determined using expert domain knowledge or tuned manually beforehand in order to determine its empirically optimal value for the system.
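A minimal sketch of such a distance-based classifier is given below; the threshold of 3.0 is exactly the kind of hand-tuned constant the text refers to, and the example data is our own:

```python
import numpy as np

def mahalanobis_outliers(X, threshold=3.0):
    """Flag samples whose Mahalanobis distance from the sample mean
    exceeds a user-chosen threshold."""
    mu = X.mean(axis=0)
    prec = np.linalg.inv(np.cov(X, rowvar=False))   # inverse sample covariance
    diff = X - mu
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, prec, diff))
    return d > threshold

rng = np.random.default_rng(0)
inliers = rng.standard_normal((500, 2))
data = np.vstack([inliers, [[10.0, 10.0]]])         # one gross outlier
flags = mahalanobis_outliers(data)
```

Note that the method already shows the weakness the text criticizes: a different threshold, or a covariance badly inflated by the outliers themselves, changes which points get flagged.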
In information-theoretic approaches, outlier detection may be done through active learning (Abe, Zadrozny & Langford 2006), clustering (Breitenbach & Grudic 2005, Ng, Jordan & Weiss 2001) or mixture models (Aitkin & Wilson 1980, Scott 2005). These methods may require sampling or the setting of certain parameters (e.g., the optimal k in k-means clustering (MacQueen 1967)), and not all of them lend themselves to a real-time implementation. Another common method is the Random Sample Consensus (RANSAC) algorithm (Fischler & Bolles 1981). RANSAC tries to find, in an iterative fashion, the subset of data samples that produces the lowest error. Unfortunately, this may be too computationally intensive for real-time applications and may involve heuristic methods to narrow down the searchable space of subsets.

Mixture models fall into the second and third categories, assuming the data was generated by some underlying structure, e.g., a mixture of a Gaussian distribution and a uniform distribution (Fox, Burgard, Dellaert & Thrun 1999, Konolige 2001, Faul & Tipping 2001). The probabilistic assumptions of this approach, however, may be restrictive and may not work as well on data sets where outliers and inliers are not demarcated by a large margin. Finally, there also exist outlier-robust versions of standard algorithms such as Independent Component Analysis (ICA) and PCA, e.g., (Hubert, Rousseeuw & Vanden Branden 2005).

The ideal algorithm should detect and remove outliers in real time, without the need for parameter tuning, sampling or model assumptions. In this chapter, we propose a novel Bayesian algorithm that automatically detects outliers in general linear models. We introduce our Bayesian linear regression algorithm before presenting a modified version that can be implemented in real time. We evaluate our algorithm on both synthetic and robotic data, demonstrating that it performs at least as well as other standard approaches. In certain cases, it outperforms well-tuned alternative methods.
Finally, we extend this idea to the Kalman filter, producing a filter that tracks observations and detects outliers in them. We show that this outlier-robust Kalman filter performs as well as other robust Kalman filter algorithms and requires no parameter tuning by the user, offering a competitive alternative for filtering sensor data.

Before going into greater detail, though, it is important to realize that in order to distinguish outliers from inliers, some amount of prior knowledge about the presence of outliers is necessary. As an illustrative example, consider Figures 4.1(a) and 4.1(b), which show the motorcycle impact data (Silverman 1985) and the eruption times of the Old Faithful geyser in Yellowstone National Park (Azzalini & Bowman 1990). It would be tricky to distinguish noise or outliers from the data structure if the source of the data set (i.e., how noisy the data is or how many outliers appear in the data) was not known. This domain knowledge translates naturally into a noise or outlier prior.

Figure 4.1: (a) Motorcycle impact data set from (Silverman 1985); (b) Old Faithful geyser eruption data set (Azzalini & Bowman 1990).

4.2 Linear Regression with Outliers

Given an observed data set D = {x_i, y_i}_{i=1}^N with N data samples, where x_i ∈ R^{d×1}, y_i is a scalar and d is the number of input dimensions, we can arrange the input vectors x_i in the rows of the matrix X and set the corresponding scalar outputs y_i to be the coefficients of the vector y. A general model for linear regression is then:

y_i = b^T x_i + ε_i   (4.1)

where b ∈ R^{d×1} and ε_i is additive mean-zero Gaussian noise. The OLS estimate of the regression vector, b_OLS, is (X^T X)^{-1} X^T y.
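As a quick numeric illustration of Eq. (4.1) and the sensitivity of b_OLS to even a single outlier (a toy example of our own, not from the evaluations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
b = np.array([1.0, -2.0])
y = X @ b + 0.01 * rng.standard_normal(50)

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # (X^T X)^{-1} X^T y

# a single grossly corrupted output is enough to bias the estimate
y_bad = y.copy()
y_bad[0] += 100.0
b_biased = np.linalg.lstsq(X, y_bad, rcond=None)[0]
```

With clean outputs, the estimate recovers b almost exactly; corrupting one of fifty outputs pulls the estimate noticeably away from the true vector.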
However, it is not uncommon for observed data to contain outliers, and if the outliers are not removed, the regression estimate b_OLS will be biased.

4.2.1 Bayesian Regression for Automatic Outlier Detection

We can modify Eq. (4.1) so that the observed outputs y have heteroscedastic (i.e., unequal) variances, introducing a weight w_i for each y_i such that the variance of y_i is weighted by w_i. Gelman et al. (2000) do this, but assume that the weights are known beforehand. Using incorrect estimates for the weights may lead to deteriorated performance. As a result, we favor a different approach and treat the weights probabilistically in order to learn them.

Another robust regression algorithm with a Bayesian treatment is that of Faul & Tipping (2001). A Gaussian prior is placed over b, and hyperparameters are introduced in order to perform automatic relevance determination on the input data features. However, a mixture model is used to explain outliers (with a uniform or Gaussian distribution used to capture them). In contrast, our model makes no assumptions about the underlying data structure. It is a Bayesian treatment of weighted regression that detects and eliminates outliers automatically (Ting, D'Souza & Schaal 2007). It has the following prior probability distributions on its random variables:

y_i ∼ Normal(b^T x_i, σ²/w_i)
b ∼ Normal(b_0, Σ_{b,0})   (4.2)
w_i ∼ Gamma(a_{w_i}, b_{w_i})

where b_0 ∈ R^{d×1} is the prior mean of b; Σ_{b,0} is the prior covariance of b and a d-by-d diagonal matrix; and σ² is the variance of the mean-zero, normally distributed output noise.

We can treat the entire regression problem as an EM learning problem. Our goal is to maximize the log likelihood log p(y|X). Due to analytical issues, we do not have access to the log likelihood itself, but only to a lower bound of it.
The lower bound is based on the expected value of the "complete" data likelihood, ⟨log p(y, b, w|X)⟩, where:

log p(y, b, w|X) = Σ_{i=1}^N log p(y_i|x_i, w_i, b) + log p(b) + Σ_{i=1}^N log p(w_i)   (4.3)

The expectation of the complete data likelihood should be taken with respect to the true posterior distribution of all hidden variables, Q(b, w). Since this is analytically intractable, a lower bound can be formulated using a technique from variational calculus in which we make a factorial approximation of the true posterior: Q(b, w) = Q(b)Q(w). While losing a small amount of accuracy, all resulting posterior distributions over hidden variables become analytically tractable. The final posterior EM update equations are listed below:

Σ_b = (Σ_{b,0}^{-1} + (1/σ²) Σ_{i=1}^N ⟨w_i⟩ x_i x_i^T)^{-1}   (4.4)

⟨b⟩ = Σ_b (Σ_{b,0}^{-1} b_0 + (1/σ²) Σ_{i=1}^N ⟨w_i⟩ y_i x_i)   (4.5)

⟨w_i⟩ = (a_{w_i,0} + 1/2) / (b_{w_i,0} + (1/(2σ²))(y_i − ⟨b⟩^T x_i)² + (1/(2σ²)) x_i^T Σ_b x_i)   (4.6)

σ² = (1/N) Σ_{i=1}^N ⟨w_i⟩ [(y_i − ⟨b⟩^T x_i)² + x_i^T Σ_b x_i]   (4.7)

These update equations need to be run iteratively until all parameters and the complete log likelihood converge to steady values.

Examining Eq. (4.6) reveals that if the prediction error in y_i is so large that it dominates the other denominator terms, then the weight ⟨w_i⟩ of that point will be very small. As this prediction error term in the denominator goes to ∞, ⟨w_i⟩ approaches 0. As can be seen in both Eqs. (4.4) and (4.5), a data point with an extremely small weight will have a smaller contribution to the calculation of the regression estimate ⟨b⟩. This effect is equivalent to the detection and removal of an outlier if the weight of the data point (x_i, y_i) is small enough.

Initialization of Priors: A few comments should be made regarding the initialization of the priors used in Eqs. (4.4) to (4.7).
First of all, Σ_{b,0}, the prior covariance of b, need only be set to a large enough value (e.g., 10³I, where I is the identity matrix), which corresponds to an uninformative prior on b (i.e., the probability distribution is a relatively flat Gaussian). Σ_{b,0} in Eq. (4.4) can be interpreted as a stabilizing, ridge-like term, similar to that of ridge regression, ensuring that the regression does not break down in the presence of collinear input data. Secondly, b_0 is usually initialized to zero, unless informative prior knowledge is available. Since b_0 is multiplied by Σ_{b,0}^{-1}, it does not have any real influence on the update equations unless Σ_{b,0} is chosen to be informative. Thirdly, the prior scale parameters a_{w_i,0} and b_{w_i,0} should be selected so that the weights ⟨w_i⟩ are 1 with some confidence. That is to say, we start by assuming that all points are inliers. For example, we can set a_{w_i,0} = 1 and b_{w_i,0} = 1 so that ⟨w_i⟩ has a prior mean of a_{w_i,0}/b_{w_i,0} = 1 with a variance of a_{w_i,0}/b_{w_i,0}² = 1. With these values, the maximum value of ⟨w_i⟩ is capped at 1.5. If the user has good reason to insert strong biases towards particular parameter values (e.g., some prior knowledge of the amount of outliers), then those values should be used. Otherwise, the set of prior parameter values outlined above can be used.

The key point of this Bayesian treatment of weighted regression with heteroscedastic variance is that each data point is assigned a posterior weight that is indicative of the amount of variance it has, relative to the average variance of the data set. Consequently, a data point is downweighted if its variance is much higher than the average variance. This algorithm does not require any tuning of threshold values or any user intervention beforehand, performing automatic outlier detection and removal in a black-box fashion.
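With the uninformative priors just described (Σ_{b,0} = 10³I, b_0 = 0, a_{w_i,0} = b_{w_i,0} = 1), the EM updates of Eqs. (4.4) to (4.7) can be sketched as follows (the function name, iteration count and toy data are our own):

```python
import numpy as np

def robust_bayes_regression(X, y, n_iter=100):
    """EM updates of Eqs. (4.4)-(4.7) with uninformative priors:
    Sigma_b0 = 1e3 * I, b0 = 0, a_w0 = b_w0 = 1."""
    N, d = X.shape
    Sb0_inv = np.eye(d) / 1e3
    m0 = np.zeros(d)
    a_w0 = b_w0 = 1.0
    w = np.ones(N)                  # start assuming all points are inliers
    s2 = y.var()
    for _ in range(n_iter):
        # Eq. (4.4): posterior covariance of b
        Sb = np.linalg.inv(Sb0_inv + (X.T * w) @ X / s2)
        # Eq. (4.5): posterior mean of b
        b = Sb @ (Sb0_inv @ m0 + X.T @ (w * y) / s2)
        err2 = (y - X @ b) ** 2
        quad = np.einsum('ij,jk,ik->i', X, Sb, X)     # x_i^T Sigma_b x_i
        # Eq. (4.6): posterior weight means (capped at 1.5 by the priors)
        w = (a_w0 + 0.5) / (b_w0 + (err2 + quad) / (2.0 * s2))
        # Eq. (4.7): output noise variance
        s2 = np.mean(w * (err2 + quad))
    return b, w, s2

# usage: 5% of the outputs are corrupted by a large offset
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
b_true = np.array([1.0, -2.0, 0.5])
y = X @ b_true + 0.1 * rng.standard_normal(200)
y[:10] += 20.0                      # outliers
b_est, w_est, s2_est = robust_bayes_regression(X, y)
```

After convergence, the corrupted samples receive near-zero weights and effectively drop out of Eqs. (4.4) and (4.5), so the regression estimate is recovered from the inliers alone.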
4.2.2 Incremental Version

The algorithm above is suitable if the data D is available in batch form. However, as in most robotic systems, data is often available from sensors one sample at a time, and filtering of the data needs to be done in a real-time, incremental (i.e., online) fashion. Hence, we take the Bayesian weighted model from Eq. (4.2) and modify it to make it an online algorithm. As is typical of online algorithms, we introduce a forgetting rate to specify the window over which we wish to average data (Ljung & Soderstrom 1983). We use a scalar forgetting rate λ, where 0 ≤ λ ≤ 1, to exponentially discount data collected in the past. The forgetting rate enters the algorithm by accumulating the sufficient statistics of the batch algorithm in an incremental way. The sufficient statistics can be extracted by examining the EM update equations in Eqs. (4.4) to (4.7). As the kth data point becomes available from the sensors, we can calculate the update equations for b and σ² as follows:

Σ_b^{(k)} = (Σ_{b,0}^{-1} + (1/σ²) sum_{wxx^T,k})^{-1}   (4.8)

⟨b⟩^{(k)} = Σ_b^{(k)} (Σ_{b,0}^{-1} b_0 + (1/σ²) sum_{wyx,k})   (4.9)

σ²^{(k)} = (1/N_k) [sum_{wy²,k} − 2 sum_{wyx,k}^T ⟨b⟩^{(k)} + (⟨b⟩^{(k)})^T sum_{wxx^T,k} ⟨b⟩^{(k)} + 1^T diag{sum_{wxx^T,k} Σ_b^{(k)}}]   (4.10)

where the sufficient statistics, exponentially discounted by λ, are:

N_k = 1 + λ N_{k−1}   (4.11)

sum_{wxx^T,k} = ⟨w_k⟩ x_k x_k^T + λ sum_{wxx^T,k−1}   (4.12)

sum_{wyx,k} = ⟨w_k⟩ y_k x_k + λ sum_{wyx,k−1}   (4.13)

sum_{wy²,k} = ⟨w_k⟩ y_k² + λ sum_{wy²,k−1}   (4.14)

and N_k, sum_{wxx^T,k}, sum_{wyx,k} and sum_{wy²,k} are all 0 for k = 0.

Notice that the calculation of the posterior covariance of b in Eqs. (4.4) and (4.8) requires a matrix inversion, resulting in a computational complexity of O(d³). This is fine for low-dimensional systems. However, for systems where the data has a large number of input dimensions, the matrix inversion becomes computationally prohibitive. In such situations, Eq.
(4.8) can be re-written recursively, as in Recursive Least Squares (Ljung & Soderstrom 1983, Bierman 1977), in order to reduce the computational complexity to O(d) per EM iteration. Given knowledge of the frequency of incoming data, the value of λ can be set accordingly, since the number of data samples that is not "forgotten" is 1/(1 − λ). Additionally, the regression estimates come with a measure of confidence (the posterior covariance of b), such that the quality of the estimates and predictions can be judged.

Naturally, this incremental approximation of the batch Bayesian algorithm comes at a cost, since data points that initially appeared to be outliers may actually have been inliers (something we only realize once enough data samples have been collected). If the forgetting rate λ is small enough, this effect will be less pronounced, since the window of past data samples we average over will be small as well. Hence, if such an inlier falls outside the window of the past 1/(1 − λ) data samples, the effect of mistaking it for an outlier will be less pronounced. At the same time, λ should not be too small, in order to ensure that the discrepancy in results between the incremental and batch versions is not too great. This trade-off between preserving equivalency with the batch version and discounting past events is a known issue with the use of forgetting factors in incremental algorithms.

4.3 An Outlier-Robust Kalman Filter

4.3.1 Background

The Kalman filter (Kalman 1960, Kalman & Bucy 1961) is widely used for estimating the state of a dynamic system, given noisy measurement data. It is the optimal linear estimator for linear Gaussian systems, giving the minimum mean squared error (Morris 1976).
Unlike techniques that require access to the entire set of observed samples, such as the Kalman smoother, e.g., (Jazwinski 1970, Bar-Shalom, Li & Kirubarajan 2001), the Kalman filter assumes that only observations up to the current time step have been observed, making it suitable for real-time tracking. Using state estimates, the filter can also estimate what the corresponding (output) data should be. However, the performance of the Kalman filter degrades when the observed data contains outliers. To address this, previous work has tried to make the Kalman filter more robust to outliers by addressing the sensitivity of the squared error criterion to outliers (Tukey 1960, Huber 1964).

One class of approaches considers non-Gaussian distributions for random variables, e.g., (Sorensen & Alspach 1971, West 1981, West 1982, Smith & West 1983), since multivariate Gaussian distributions are known to be susceptible to outliers. For example, Meinhold & Singpurwalla (1989) use multivariate Student-t distributions. However, the resulting estimation of parameters may be quite complicated for systems with transient disturbances.

Other efforts have modeled observation and state noise as non-Gaussian, heavy-tailed distributions in order to account for non-Gaussian noise and outliers, e.g., (Masreliez 1975, Masreliez & Martin 1977, Schick & Mitter 1994). These filters are typically more difficult to implement and may no longer provide the conditional mean of the state vector. Other approaches use resampling techniques, e.g., (Kitagawa 1987, Kramer & Sorenson 1988), or numerical integration, e.g., (Kitagawa 1996, Kitagawa & Gersch 1996), but these may require heavy computation not suitable for real-time applications.

Yet another class of methods uses a weighted least squares approach, as done in robust least squares (Ryan 1997, Huber 1973), where the measurement residual error is assigned some statistical property.
Some of these algorithms fall under the first category of approaches as well, assuming non-Gaussian distributions for variables. Each data sample is assigned a weight that indicates its contribution to the hidden state estimate at each time step. This technique has been used to produce a Kalman filter that is more robust to outliers, e.g., (Durovic & Kovacevic 1999, Chan, Zhang & Tse 2005). Nevertheless, these methods usually model the weights as some heuristic function of the data, e.g., the Huber function (Huber 1973), and often require tuning or cross-validation of threshold parameters for optimal performance. Using incorrect or inaccurate estimates for the weights may lead to deteriorated performance, so special attention and care are necessary when using these techniques.

In this section, we are interested in making the Kalman filter more robust to outliers in the observations (i.e., the filter should identify and eliminate possible outliers as it tracks observed data). Estimation of the system dynamics and detection of outliers in the states are different problems and left for future work. We extend the weighted least squares approach of Section 4.2 to the Kalman filter, resulting in a filter that detects outliers in the observations without any parameter tuning or heuristic methods (Ting, Theodorou & Schaal 2007). The filter learns the weights of each data sample and the system dynamics, using an incremental EM framework (Dempster et al. 1977). For ease of analytical computation, we assume Gaussian distributions for variables and states.

4.3.2 The Kalman Filter

Let us assume we have data observed over N time steps, \{z_k\}_{k=1}^N, and the corresponding hidden states \{\theta_k\}_{k=1}^N, where \theta_k \in \Re^{d_2 \times 1} and z_k \in \Re^{d_1 \times 1}.
Assuming a time-invariant system, the Kalman filter system equations are:

z_k = C \theta_k + v_k
\theta_k = A \theta_{k-1} + s_k    (4.15)

where C \in \Re^{d_1 \times d_2} is the observation matrix, A \in \Re^{d_2 \times d_2} is the state transition matrix, v_k \in \Re^{d_1 \times 1} is the observation noise at time step k, and s_k \in \Re^{d_2 \times 1} is the state noise at time step k. We assume v_k and s_k to be uncorrelated additive mean-zero Gaussian noise: v_k \sim \mathrm{Normal}(0, R), s_k \sim \mathrm{Normal}(0, Q), where R \in \Re^{d_1 \times d_1} is a diagonal matrix with r \in \Re^{d_1 \times 1} on its diagonal, and Q \in \Re^{d_2 \times d_2} is a diagonal matrix with q \in \Re^{d_2 \times 1} on its diagonal. R and Q are covariance matrices for the observation and state noise, respectively.

[Figure 4.2 appears here, with panels (a) Kalman filter and (b) Robust weighted Kalman filter.] Figure 4.2: Graphical model of Kalman filter and robust weighted Kalman filter. Circular nodes are random variables and shaded double circles are observed random variables. System matrices have been omitted for the sake of graphical clarity.

Figure 4.2(a) shows the graphical model for the standard Kalman filter. Its corresponding filter propagation and update equations are, for k = 1, .., N:

Propagation:
\theta'_k = A \langle \theta_{k-1} \rangle    (4.16)
\Sigma'_k = A \Sigma_{k-1} A^T + Q    (4.17)

Update:
S'_k = \left( C \Sigma'_k C^T + R \right)^{-1}    (4.18)
K'_k = \Sigma'_k C^T S'_k    (4.19)
\langle \theta_k \rangle = \theta'_k + K'_k \left( z_k - C \theta'_k \right)    (4.20)
\Sigma_k = \left( I - K'_k C \right) \Sigma'_k    (4.21)

where \langle \theta_k \rangle is the posterior mean vector of the state \theta_k, \Sigma_k is the posterior covariance matrix of \theta_k, and S'_k is the inverse of the covariance matrix of the residual prediction error, all at time step k. In this problem, the system dynamics (C, A, R and Q) are unknown, and it is possible to use a maximum likelihood framework to estimate these parameter values (Myers & Tapley 1976). Unfortunately, this standard Kalman filter model considers all data samples to be part of the data cloud and is not robust to outliers.
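For intuition, the propagation and update equations (4.16) to (4.21) can be written out for a scalar system (d_1 = d_2 = 1), where the matrix inverse in Eq. (4.18) reduces to a division. This is only an illustrative sketch: the values A = C = 1, Q = 1e-4, R = 0.01 and the constant observation stream are assumptions for the example, not values from the text.

```python
# Scalar Kalman filter step, Eqs. (4.16)-(4.21), with A = C = 1.
# Following the text's notation, S'_k is the INVERSE of the residual
# prediction error covariance, so K'_k = Sigma'_k * C * S'_k.

def kf_step(theta, Sigma, z, A=1.0, C=1.0, Q=1e-4, R=0.01):
    # Propagation, Eqs. (4.16)-(4.17)
    theta_p = A * theta
    Sigma_p = A * Sigma * A + Q
    # Update, Eqs. (4.18)-(4.21)
    S = 1.0 / (C * Sigma_p * C + R)               # (4.18)
    K = Sigma_p * C * S                           # (4.19)
    theta_new = theta_p + K * (z - C * theta_p)   # (4.20)
    Sigma_new = (1.0 - K * C) * Sigma_p           # (4.21)
    return theta_new, Sigma_new

theta, Sigma = 0.0, 1.0   # illustrative initial state estimate
for _ in range(100):
    theta, Sigma = kf_step(theta, Sigma, z=5.0)   # constant observation
```

Fed a constant observation z = 5, the state estimate converges to 5, and the posterior variance Sigma settles to a small steady-state value determined by Q and R.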
4.3.3 The Robust Weighted Kalman Filter

To overcome this limitation, we introduce a scalar weight w_k for each observed data sample z_k such that the variance of z_k is weighted with w_k, as done in Section 4.2.1. We model the weights as Gamma distributed random variables, as done previously in Section 4.2.1 for weighted linear regression, and learn estimates for the system dynamics at each time step. A Gamma prior distribution is chosen for the weights in order to ensure they remain positive. Figure 4.2(b) shows the graphical model of the robust weighted Kalman filter. The resulting prior distributions of the random variables are:

z_k | \theta_k, w_k \sim \mathrm{Normal}(C\theta_k, R/w_k)
\theta_k | \theta_{k-1} \sim \mathrm{Normal}(A\theta_{k-1}, Q)
w_k \sim \mathrm{Gamma}(a_{w_k}, b_{w_k})    (4.22)

Using an incremental EM framework, we can learn the weights of each data sample and the system dynamics by maximizing the log likelihood \log p(z_{1:N}). We have access to only a lower bound of this measure, based on the expected value of the complete log likelihood, which, up to time step k, is given by:

\log p(z_{1:k}, \theta_{0:k}, w) = \log \left\{ p(\theta_0) \prod_{i=1}^{k} p(z_i | \theta_i, w_i)\, p(\theta_i | \theta_{i-1}) \right\}    (4.23)

The expectation of the complete log likelihood should be taken with respect to the true posterior distribution of all hidden variables, Q(w, \theta). Since this is an analytically intractable expression, we use a technique from variational calculus to construct a lower bound and make a factorial approximation of the true posterior as follows: Q(w, \theta) = \prod_{i=1}^{k} Q(w_i) \prod_{i=1}^{k} Q(\theta_i | \theta_{i-1})\, Q(\theta_0). The factorization of \theta considers the influence of each \theta_i from within its Markov blanket, conserving the Markov property that Kalman filters, by definition, have. While losing a small amount of accuracy, all resulting posterior distributions over hidden variables become analytically tractable. This factorial approximation was chosen purposely so that Q(w_k) is independent from Q(\theta_k); performing joint inference of w_k and \theta_k does not make sense in the context of our generative model.
The posterior distribution of the weights w can then be inferred, in a variational E-step, to be:

\langle w_k \rangle = \frac{a_{w_k,0} + \frac{1}{2}}{b_{w_k,0} + \frac{1}{2} \left\langle (z_k - C_k\theta_k)^T R_k^{-1} (z_k - C_k\theta_k) \right\rangle}    (4.24)

To derive the posterior mean and update of \theta, we can first derive the recursive propagation and update equations of the Kalman filter using Bayes' rule and the Markov assumption. The resulting propagation and update equations of the robust weighted Kalman filter are:

Propagation:
\theta'_k = A \langle \theta_{k-1} \rangle    (4.25)
\Sigma'_k = A \Sigma_{k-1} A^T + Q    (4.26)

Update:
S'_k = \left( C \Sigma'_k C^T + \frac{1}{\langle w_k \rangle} R \right)^{-1}    (4.27)
K'_k = \Sigma'_k C^T S'_k    (4.28)
\langle \theta_k \rangle = \theta'_k + K'_k \left( z_k - C \theta'_k \right)    (4.29)
\Sigma_k = \left( I - K'_k C \right) \Sigma'_k    (4.30)

Eqs. (4.25) to (4.30) can be re-expressed, with a little algebraic manipulation, in terms of the posterior mean and variance of \theta:

\Sigma_k = \left( \langle w_k \rangle C_k^T R_k^{-1} C_k + \left( A \Sigma_{k-1} A^T + Q_k \right)^{-1} \right)^{-1}    (4.31)
\langle \theta_k \rangle = \Sigma_k \left( \left( A \Sigma_{k-1} A^T + Q_k \right)^{-1} A_k \langle \theta_{k-1} \rangle + \langle w_k \rangle C_k^T R_k^{-1} z_k \right)    (4.32)

The system dynamics (C, A, R and Q) can be estimated in the M-step to be:

C_k = \left( \sum_{i=1}^{k} \langle w_i \rangle z_i \langle \theta_i \rangle^T \right) \left( \sum_{i=1}^{k} \langle w_i \rangle \langle \theta_i \theta_i^T \rangle \right)^{-1}    (4.33)
A_k = \left( \sum_{i=1}^{k} \langle \theta_i \rangle \langle \theta_{i-1} \rangle^T \right) \left( \sum_{i=1}^{k} \langle \theta_{i-1} \theta_{i-1}^T \rangle \right)^{-1}    (4.34)
r_{km} = \frac{1}{k} \sum_{i=1}^{k} \langle w_i \rangle \left\langle (z_{im} - C_k(m,:)\theta_i)^2 \right\rangle    (4.35)
q_{kn} = \frac{1}{k} \sum_{i=1}^{k} \left\langle (\theta_{in} - A_k(n,:)\theta_{i-1})^2 \right\rangle    (4.36)

where m = 1, .., d_1 and n = 1, .., d_2; r_{km} is the mth coefficient of the vector r_k; q_{kn} is the nth coefficient of the vector q_k; C_k(m,:) is the mth row of the matrix C_k; A_k(n,:) is the nth row of the matrix A_k; and a_{w_k,0} and b_{w_k,0} are prior scale parameters for the weight w_k.

Since storing sensor data is not possible in real-time applications, Eqs. (4.33) to (4.36), which require access to all observed data samples up to time step k, need to be re-written using only values observed, calculated or used in the current time step k. We can do this by collecting sufficient statistics in Eqs.
(4.33) to (4.36) and rewriting them as:

C_k = \mathrm{sum\_wz\theta^T}_k \left( \mathrm{sum\_w\theta\theta^T}_k \right)^{-1}    (4.37)
A_k = \mathrm{sum\_\theta\theta'}_k \left( \mathrm{sum\_\theta'\theta'}_k \right)^{-1}    (4.38)
r_{km} = \frac{1}{k} \left[ \mathrm{sum\_wzz}_{km} - 2 C_k(m,:)\, \mathrm{sum\_wz\theta}_{km} + \mathrm{diag}\left\{ C_k(m,:)\, \mathrm{sum\_w\theta\theta^T}_k\, C_k(m,:)^T \right\} \right]    (4.39)
q_{kn} = \frac{1}{k} \left[ \mathrm{sum\_\theta^2}_{kn} - 2 A_k(n,:)\, \mathrm{sum\_\theta\theta'}_{kn} + \mathrm{diag}\left\{ A_k(n,:)\, \mathrm{sum\_\theta'\theta'}_k\, A_k(n,:)^T \right\} \right]    (4.40)

where m = 1, .., d_1 and n = 1, .., d_2. The sufficient statistics, which are all a function of values observed, calculated or used in time step k (e.g., \langle w_k \rangle, z_k, \langle \theta_k \rangle, \langle \theta_{k-1} \rangle, etc.), are:

\mathrm{sum\_wz\theta^T}_k = \langle w_k \rangle z_k \langle \theta_k \rangle^T + \mathrm{sum\_wz\theta^T}_{k-1}
\mathrm{sum\_w\theta\theta^T}_k = \langle w_k \rangle \langle \theta_k \theta_k^T \rangle + \mathrm{sum\_w\theta\theta^T}_{k-1}
\mathrm{sum\_\theta\theta'}_k = \langle \theta_k \rangle \langle \theta_{k-1} \rangle^T + \mathrm{sum\_\theta\theta'}_{k-1}
\mathrm{sum\_\theta'\theta'}_k = \langle \theta_{k-1} \theta_{k-1}^T \rangle + \mathrm{sum\_\theta'\theta'}_{k-1}
\mathrm{sum\_wzz}_{km} = \langle w_k \rangle z_{km}^2 + \mathrm{sum\_wzz}_{k-1,m}
\mathrm{sum\_wz\theta}_{km} = \langle w_k \rangle z_{km} \langle \theta_k \rangle + \mathrm{sum\_wz\theta}_{k-1,m}
\mathrm{sum\_\theta^2}_{kn} = \langle \theta_{kn}^2 \rangle + \mathrm{sum\_\theta^2}_{k-1,n}
\mathrm{sum\_\theta\theta'}_{kn} = \langle \theta_{kn} \rangle \langle \theta_{k-1} \rangle + \mathrm{sum\_\theta\theta'}_{k-1,n}

Eqs. (4.24) to (4.30) and (4.37) to (4.40) should be computed once for each time step k, e.g., (Ghahramani & Hinton 1996, Neal & Hinton 1999), when the data sample z_k becomes available.

Initialization of Priors: A few remarks should be made regarding the initialization of priors used in the equations above. In particular, the prior scale parameters a_{w_k,0} and b_{w_k,0} should be selected so that the weights \langle w_k \rangle are 1 with some confidence. That is to say, the algorithm starts by assuming most data samples are inliers. As described in Section 4.2.1, we set a_{w_k,0} = 1 and b_{w_k,0} = 1 and use these values for any data set unless prior information regarding the presence of outliers is available. If the user has prior knowledge regarding the strong or weak presence of outliers in the data set (and hence, a good reason to insert strong biases towards particular parameter values), the prior scale parameters of the weights can be modified accordingly to reflect this.
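The effect of these default priors can be checked numerically. For a scalar observation, Eq. (4.24) with a_{w_k,0} = b_{w_k,0} = 1 reduces to (a_0 + 1/2) / (b_0 + residual^2 / (2R)), so a large prediction error drives the weight toward zero, which inflates R / \langle w_k \rangle in Eq. (4.27) and shrinks the Kalman gain. The residual values and R = 0.01 below are illustrative assumptions, not numbers from the text.

```python
# Posterior mean of the weight, Eq. (4.24), for a scalar observation:
#   <w_k> = (a0 + 1/2) / (b0 + 0.5 * residual**2 / R)
# with the default priors a0 = b0 = 1 from Section 4.3.3.

def weight(residual, a0=1.0, b0=1.0, R=0.01):
    return (a0 + 0.5) / (b0 + 0.5 * residual * residual / R)

w_inlier = weight(0.05)   # small prediction error -> weight near 1
w_outlier = weight(3.0)   # large prediction error -> weight near 0

# The weighted Kalman gain of Eqs. (4.27)-(4.28), scalar case with C = 1:
# a small weight inflates R/w, which shrinks the gain.
def gain(w, Sigma_p=1.0, C=1.0, R=0.01):
    return Sigma_p * C / (C * Sigma_p * C + R / w)
```

With these numbers the outlier's weight is roughly 400 times smaller than the inlier's, so the corresponding Kalman gain, and hence the outlier's influence on \langle \theta_k \rangle, is sharply reduced.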
Since some prior knowledge about the observed data's properties must be known in order to distinguish whether a data sample is an outlier or part of the data's structure, this Bayesian approach provides a natural framework to incorporate this information. Secondly, the algorithm is relatively insensitive to the initialization of A and C and will always converge to the same final solution, regardless of these values. For our experiments, we initialize C = A = I, where I is the identity matrix. Finally, the initial values of R and Q should be set based on the user's initial estimate of how noisy the observed data is (e.g., R = Q = 0.01 I for noisy data, R = Q = 10^{-4} I for less noisy data (Maybeck 1979)).

Outlier detection in the Kalman filter emerges in a similar manner to linear regression. If the prediction error of a data sample z_k is very large, then the weight \langle w_k \rangle of that data sample will be very small. This leads to a very small S'_k and a small Kalman gain K'_k. In short, the influence of the data sample z_k will be downweighted when predicting \theta_k. The resulting robust weighted Kalman filter has a computational complexity on the same order as that of a standard Kalman filter, since matrix inversions are still needed (for the calculation of covariance matrices). In comparison to other Kalman filters that use heuristics or require more involved implementation, this outlier-robust Kalman filter is principled and easy to implement.

4.3.4 Monitoring the Residual Error

A common sanity check is to monitor the residual error of the data z_{1:N} and the hidden states \theta_{1:N} in order to ensure that the residual error values stay within the 3\sigma bounds computed by the filter (Maybeck 1979). If we had access to the true state \theta_k for time step k, we would plot the residual state error (\theta_k - \langle \theta_k \rangle) for all time steps k, along with the corresponding \pm 3\sigma_k values, where \sigma_k^2 = \mathrm{diag}\{\Sigma_k\}.
We would also plot the residual prediction error (z_k - CA\langle \theta_{k-1} \rangle) for all time steps k, along with the corresponding \pm 3\sigma_{z_k} values, where \sigma_{z_k}^2 = \mathrm{diag}\{(S'_k)^{-1}\}. With these graphs, we should observe the residual error values remaining within the \pm 3\sigma bounds and check that the residual error does not diverge over time. Residual monitoring may be useful to verify that spurious data samples are rejected, since processing of these samples may result in corrupted filter computations. It offers a peek into the Kalman filter, providing insights as to how the filter performs.

4.4 Evaluation

4.4.1 Linear Regression

We evaluated our algorithm's ability to automatically detect outliers on a synthetic data set, before implementing it on a robotic dog, LittleDog, manufactured by Boston Dynamics Inc. (Cambridge, MA). We compared it to four other techniques for outlier detection: i) a thresholding approach that classifies a data sample as an outlier if its Mahalanobis distance exceeds an optimal hand-tuned threshold; ii) a 2-component mixture model (a Gaussian distribution to account for inliers and a uniform distribution for outliers); iii) robust least squares regression with bisquare weights (Hoaglin 1983); and iv) Faul & Tipping's variational Bayesian algorithm for robust regression (see Section 4.2.1).

Table 4.1: Average normalized mean squared prediction error for a linear function with 5 input dimensions, evaluated in batch form over 10 trials for all algorithms. \sigma is the standard deviation of the true conditional output mean, and the SNR of the outputs is 10.

                                  Distance of outliers from mean
Algorithm                         +3\sigma   +2\sigma   +\sigma
Thresholding                      0.0903     0.0503     0.0232
Mixture Model                     0.1327     0.0688     0.0286
Robust Least Squares              0.1890     0.1518     0.0880
Robust Regression                 0.1320     0.0683     0.0282
Bayesian weighted regression      0.0273     0.0270     0.0210

4.4.1.1 Synthetic Data

First, we evaluated all five algorithms on a linear regression problem, where the data is available in batch form.
The synthetic data set had 5 input dimensions, 1000 data points (20% of which were outliers), and additive Gaussian noise with a signal-to-noise ratio (SNR) of 10. Outliers were created to be k\sigma away from the true mean of the outputs, where \sigma is the standard deviation of the true conditional mean of the outputs and k is an integer scaling factor. Table 4.1 shows the average prediction error on noiseless test data for all algorithms. Bayesian weighted linear regression achieves the lowest average normalized mean squared error (nMSE). Thresholding works well when the threshold value is hand-tuned optimally, while the remaining methods are less robust to outliers.

Next, the algorithms were evaluated incrementally, using a forgetting rate of \lambda = 0.999, on the synthetic training data from the first experiment. Robust least squares was omitted since it is a batch algorithm, and making it recursive is non-trivial. Figures 4.3(a) and 4.3(b) track the error in the predicted outputs on the training data. The Bayesian weighted algorithm, shown in the dark blue dotted line, reduces the error to the lowest value.

[Figure 4.3 appears here, with panels (a) outliers at least 3\sigma from the true output mean and (b) outliers at least 2\sigma from the true output mean.] Figure 4.3: Normalized mean squared error values for the synthetic batch data sets used in Table 4.1, evaluated in an incremental manner for thresholding, mixture models, robust regression of Faul & Tipping (2001), and Bayesian weighted regression (Bayes. regression). \lambda = 0.999.

Figure 4.4: LittleDog quadruped robot by Boston Dynamics (Cambridge, MA).

4.4.1.2 LittleDog Robot

We evaluated the algorithms on a 12 degree-of-freedom (DOF) robotic dog, LittleDog, as shown in Figure 4.4.
The dog has two sources that measure its orientation: the motion capture (MOCAP) system and an on-board inertia measurement unit (IMU). Both provide a quaternion q of the robot's orientation: q_MOCAP from the MOCAP and q_IMU from the IMU. q_IMU drifts over time, since the IMU cannot provide stable orientation estimation, but its signal is clean. q_MOCAP has outliers and noise, but no drift. We want to estimate the offset between q_MOCAP and q_IMU, and this offset is a noisy, outlier-infested, slowly drifting signal.

Figure 4.5(a) shows the offset data between q_MOCAP and q_IMU for one of the four quaternion coefficients, collected over 6000 data samples. The thresholding, mixture model and variational Bayesian robust regression approaches appear sensitive to outliers (occurring between the 4000th and 5000th sample). In comparison, the incremental version of Bayesian weighted regression is far less sensitive to outliers.

[Figure 4.5 appears here, with panels (a) predicted versus observed outputs and (b) a magnified view of Figure 4.5(a).] Figure 4.5: Predicted versus observed offset data between q_IMU and q_MOCAP, shown for one of the four quaternion coefficients (\lambda = 0.999). Observed outputs are noisy and contain outliers.

4.4.2 Weighted Kalman Filter

We evaluated our robust weighted Kalman filter on synthetic and robotic data sets and compared it with three other filters. We omitted the filters of Durovic & Kovacevic (1999) and Chan et al. (2005) since we had difficulty implementing them and getting them to work. Instead, we used a hand-tuned thresholded Kalman filter to serve as a baseline comparison. We compared our robust weighted Kalman filter to two other filters: i) the standard Kalman filter and ii) a Kalman filter where outliers are determined by thresholding on the Mahalanobis distance.
If the Mahalanobis distance of a data sample is less than a certain threshold value, the sample is considered an inlier and processed; otherwise, it is an outlier and ignored. This threshold value is hand-tuned manually in order to find the optimal value for a particular data set. If we have a priori access to the entire data set and are able to tune this threshold value accordingly, the thresholded Kalman filter gives near-optimal performance.

First, we simulate a real data set where the hidden states are unknown and only access to the observed data is available. Although they are linear, Kalman filters are commonly used to track more interesting "nonlinear" behaviors (i.e., not just a straight line). For this reason, we try the methods on a synthetic data set exhibiting nonlinear behavior, where the system dynamics are unknown. We also conducted experiments on a synthetic data set where the system dynamics of the generative model are known. These experiments yield performance results similar to those where the system dynamics are unknown. Finally, we run all Kalman filters on data collected from LittleDog.

For these experiments, we are interested in the Kalman filter's prediction of the observed (output) data and detection of outliers in the observations. We are not interested in the estimation of the system dynamics or in the estimation (or outlier detection) of the states. Estimation of the system matrices for the purpose of parameter identification and detection of outliers in the states are different problems and left to another paper.

4.4.2.1 Synthetic Data

We created data exhibiting nonlinear behavior, where C, A, R, Q and the states are unknown, high noise is added to the (output) data, and a data sample is an outlier with 1% probability. One-dimensional data is used for ease of visualization, and Figure 4.6(a) shows a noisy cosine function with outliers, over 500 time steps.
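The synthetic data just described can be generated in a few lines. The noise level, outlier magnitude and cosine frequency below are illustrative assumptions; the text specifies only the 1% outlier probability, the 500 time steps and, qualitatively, the high noise level.

```python
import math
import random

# Generate a noisy cosine over 500 time steps where each sample is an
# outlier with 1% probability. Noise and outlier magnitudes are
# illustrative assumptions, not values from the text.
random.seed(1)
data = []
outlier_flags = []
for k in range(500):
    y = math.cos(0.05 * k) + random.gauss(0.0, 0.2)   # high output noise
    is_outlier = random.random() < 0.01
    if is_outlier:
        y += random.choice([-1.0, 1.0]) * 2.0          # gross error
    data.append(y)
    outlier_flags.append(is_outlier)
```

A data set like this exercises both failure modes discussed in the text: the nonlinear "shape" that the filter must track and the occasional gross errors it must downweight.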
For optimal performance, C, A, R and Q were manually tuned for the standard Kalman filter, a tricky and time-consuming process. In contrast, the system dynamics were learnt for the thresholded Kalman filter using a maximum likelihood framework (i.e., using Eqs. (4.37) to (4.40) without any weights). Figure 4.6(b) shows how sensitive the standard Kalman filter is to outliers, while the weighted robust Kalman filter detects them quite well. In Figure 4.7(a), we compare the weighted robust Kalman filter with the thresholded filter. Both filters appear to perform equally well, which is unsurprising, given the amount of manual tuning required by the thresholded Kalman filter. Figure 4.7(b) shows that the residual prediction error on the outputs stays within the \pm 3\sigma bounds. Graphs showing the estimated states were omitted, but they show similar trends in the accuracy results.

[Figure 4.6 appears here, with panels (a) observed noisy output data with outliers and (b) predicted data for the Kalman filter (KF) and the weighted robust KF.] Figure 4.6: One-dimensional data showing a cosine function with noise and outliers (and unknown system dynamics) for 500 samples at 1 sample/time step.

[Figure 4.7(a) appears here.] (a) Predicted data for the thresholded KF, alternative KF and weighted robust KF.
All perform similarly. [Figure 4.7(b) appears here.] (b) Residual prediction error for the weighted robust Kalman filter, with \pm 3\sigma bounds. Figure 4.7: One-dimensional data showing a cosine function with noise and outliers (and unknown system dynamics) for 500 samples at 1 sample/time step.

4.4.2.2 LittleDog Robot

Figure 4.8: Observed quaternion data from LittleDog robot: a slowly drifting noisy signal with outliers. Data is shown for 6000 samples. We assume that each data sample is made available at each time step.

We present the tracking performance of the filters on quaternion data from LittleDog, described previously in Section 4.4.1.2. Figure 4.8 shows the quaternion data over 6000 data samples, at 1 sample/time step. There are various approaches to estimating this slowly drifting signal, depending on the quality of estimate desired. We can estimate it with a straight line, as done in Section 4.4.1.2. However, if we want to estimate this slowly drifting signal more accurately, we can use the proposed outlier-robust Kalman filter to track it. For optimal performance, we once again manually tuned C, A, R and Q for the standard Kalman filter. The system dynamics of the thresholded Kalman filter were learnt, and its threshold parameter was manually tuned for best performance on this data set.

The standard Kalman filter fails to detect outliers occurring between the 4000th and 5000th sample, as seen in Figure 4.9(a).
Figure 4.9(b) shows that the thresholded Kalman filter does not react as violently as the standard Kalman filter to outliers and, in fact, appears to perform as well as the robust weighted Kalman filter. This is to be expected, given that we hand-tuned the threshold parameter for optimal performance.

[Figure 4.9 appears here, with panels (a) predicted quaternion data for the Kalman filter and robust weighted Kalman filter and (b) predicted quaternion data for the thresholded Kalman filter and robust weighted Kalman filter.] Figure 4.9: Predicted quaternion data for the Kalman filter and robust weighted Kalman filter, shown for 6000 samples. We assume that each data sample is made available at each time step. Note the change of scale in axis from Figure 4.8.

In this experiment, the advantages offered by the weighted Kalman filter are clear. It outperforms the traditional Kalman filter, while achieving a level of performance on par with a thresholded Kalman filter (where the threshold value is manually tuned for optimal performance).

4.5 Discussion

We introduced a Bayesian weighted regression algorithm that automatically detects and eliminates outliers in real-time. We extended this approach to produce a Kalman filter that is able to track and detect outliers in the observations. Both methods are easy to use and do not require any parameter tuning, interference from the user, heuristics or sampling. Both learn the values of the weights associated with the data samples, but they also require an initial weight prior to be set (that is, an outlier prior that indicates how strong the presence of outliers is in the data). In order to perform outlier detection correctly, this prior is necessary to distinguish outliers from structure in the data.
The Bayesian weighted regression algorithm outperforms standard methods in batch or incremental settings. The outlier-robust Kalman filter performs as well as a "hand-tuned" Kalman filter that requires a priori knowledge of the entire data set.

The calculation of the posterior covariance of b in Eqs. (4.4) and (4.8) requires a matrix inversion, resulting in a computational complexity of O(d^3). This will be fine for low-dimensional systems. However, for systems where the data has a large number of input dimensions, the matrix inversion becomes computationally prohibitive. In such situations, Eq. (4.8) can be re-written recursively, as in Recursive Least Squares (Ljung & Soderstrom 1983, Bierman 1977), in order to reduce the computational complexity to O(d) per EM iteration. The regression estimates also come with a measure of confidence, such that the quality of the estimates and predictions can be judged.

Chapter 5

Nonlinear High-Dimensional Regression

Having addressed the problem of noisy high-dimensional linear regression, we now move to nonlinear high-dimensional regression problems. Gaussian processes (GPs) (Bernardo & Smith 1994, Williams & Rasmussen 1995, Rasmussen 1996) are competitive function approximators for nonlinear regression problems. However, they can be computationally prohibitive for large data sets, since they require a matrix inversion that takes around O(N^3), where N is the number of data samples.
Additionally, they are not suitable for high-dimensional ill-conditioned problems (where many of the input dimensions are redundant or irrelevant). Although recent work has tried to overcome these limitations, e.g., (Csato & Opper 2002, Lawrence, Seeger & Herbrich 2003, Y. Shen & Seeger 2006, Snelson & Ghahramani 2006a, Snelson & Ghahramani 2006c), Gaussian process regression is not quite ready for real-time robot learning applications, where the input data is high-dimensional and fast incremental learning is necessary. In such scenarios, local methods such as Locally Weighted Projection Regression (LWPR) (Vijayakumar et al. 2005) may be more suitable and have been shown to be effective for real-time learning of motor skills on humanoid robots.

We could consider how one would perform Bayesian locally weighted regression in a batch setting (i.e., the entire data set is available at the outset), where the locality of each model is learnt probabilistically. In Schaal & Atkeson (1994), locally weighted learning was applied to robotic devilsticking, and the locality of each model was determined by incremental gradient descent based on stochastic leave-one-out cross-validation.

We could also move beyond locally weighted regression and consider the more general category of kernel-based methods, which includes Parzen windows, kernel regression, locally weighted regression, radial basis function networks, Reproducing Kernel Hilbert Spaces, Support Vector Machines, and Gaussian process regression. Most algorithms start with parameterizations that are the same for all kernels, independent of where in data space the kernel is used, but later recognize the advantage of locally adaptive kernels (Friedman 1984, Poggio & Girosi 1990, Fan & Gijbels 1996). Such locally adaptive kernels are useful in scenarios where the data characteristics vary greatly in different parts of the workspace (e.g., in terms of data density, curvature and output noise).
For instance, in Gaussian process (GP) regression, using a nonstationary covariance function, e.g., (Paciorek & Schervish 2004), allows for such a treatment. Performing optimizations individually for every kernel, however, becomes rather complex and is prone to overfitting due to a flood of open parameters. Previous work has suggested gradient descent techniques with cross-validation methods or involved statistical hypothesis testing for optimizing the shape and size of a kernel in a learning system (Fan & Gijbels 1992, Fan & Gijbels 1995, Schaal & Atkeson 1994, Friedman 1984).

In this chapter, we consider local kernel shaping by averaging over data samples with the help of local polynomial models and formulate this approach, in a Bayesian framework, for function approximation with both piecewise linear models and nonstationary GP regression. Our local kernel shaping algorithm (Ting, Kalakrishnan, Vijayakumar & Schaal 2008) is computationally efficient (capable of handling large data sets), can deal with functions of strongly varying curvature, data density and output noise, and even rejects outliers automatically. The algorithm automatically learns the size of the data neighborhood contributing to each local model (i.e., the "bandwidth" or spatial distance metric of the local model). A Bayesian approach offers error bounds on the distance metric and incorporates this uncertainty in the predictive distributions.

Our approach to nonstationary GP regression differs from previous work by avoiding Markov Chain Monte Carlo (MCMC) sampling (Rasmussen & Ghahramani 2002, Meeds & Osindero 2005) and by exploiting the full nonparametric characteristics of GPs in order to accommodate nonstationary data.
Other approaches to kernel shaping include the random varying coefficient model (Longford 1993), local polynomial modeling (Fan & Gijbels 1996), the use of asymptotic analysis (Schucany 1995), and entropy-based measures (Ormoneit & Hastie 1999), where the volume instead of the bandwidth of the kernel is optimized, to name a few. Many of these methods, however, require cross-validation or are sensitive to the initialization of parameters.

One of the core application domains for our work is learning control, where computationally efficient function approximation and highly accurate local linearizations from data are crucial for deriving controllers and for optimizing control along trajectories (Atkeson & Schaal 1997). The high variations from fitting noise, seen in Figures 5.4(a) and 5.5(a), are harmful to the learning system, potentially causing the controller to be unstable. Our final evaluations illustrate such a scenario by learning an inverse kinematics model for a real robot arm.

5.1 Bayesian Local Kernel Shaping

We develop our approach in the context of nonparametric locally weighted regression with locally linear polynomials (Atkeson et al. 1997), assuming, for notational simplicity, only a one-dimensional output; extensions to multi-output settings are straightforward. We assume a training set of N samples, D = \{x_i, y_i\}_{i=1}^N, drawn from a nonlinear function:

y = f(x) + \epsilon

that is contaminated with mean-zero (but potentially heteroscedastic) noise \epsilon. Each data sample consists of a d-dimensional input vector x_i and an output y_i. We wish to approximate a locally linear model of this function at a query point x_q \in \Re^{d \times 1} in order to make a prediction y_q = b^T x_q, where b \in \Re^{d \times 1}. We assume the existence of a spatially localized weighting kernel:

w_i = K(x_i, x_q, h)

that assigns a scalar weight to every \{x_i, y_i\} according to its Euclidean distance in input space from the query point x_q.
A popular choice of the function $K$ is the Gaussian kernel:
\[ w_i = \exp\left( -0.5\, (\mathbf{x}_i - \mathbf{x}_q)^T \mathbf{H} (\mathbf{x}_i - \mathbf{x}_q) \right) \]
(where $\mathbf{H}$ is a positive semi-definite diagonal matrix with $\mathbf{h}$ on its diagonal), since the weight of a data sample is then lower when its input point is further away from the query input $\mathbf{x}_q$. That is to say, we assume that the further away a training sample is from the query point in input space, the more we downweight it. However, other kernels may be used as well (Atkeson et al. 1997).

The bandwidth $\mathbf{h} \in \mathbb{R}^{d \times 1}$ of the kernel represents how wide the weighting kernel is and dictates the quality of fit of the local model. $\mathbf{h}$ is a form of distance metric: a measure that determines the size and shape of the weighting kernel, i.e., the size of the local regime in input space to be linearized. $\mathbf{h}$ should be chosen as a function of the local curvature of $f(\mathbf{x})$ and the data density around $\mathbf{x}_q$. If we find the "right" bandwidth as a function of $\mathbf{x}_q$ to avoid oversmoothing or overfitting the data, nonlinear function approximation can be solved accurately and efficiently. Our goal is to find a Bayesian formulation for determining $\mathbf{b}$ and $\mathbf{h}$ simultaneously.

As previously mentioned, past work has relied on cross-validation, involved statistical hypothesis testing or search to find the optimal distance metric value. However, these methods may be sensitive to initialization values (for gradient descent), require manual meta-parameter tuning or be quite computationally involved. In the next section, we propose a variational Bayesian algorithm that learns both $\mathbf{b}$ and $\mathbf{h}$ simultaneously in an EM-like framework.

5.1.1 Model

For the locally linear model at the query point $\mathbf{x}_q$, we can introduce hidden random variables $\mathbf{z}$ (D'Souza et al.
2004) and modify the linear model $y_i = \mathbf{b}^T \mathbf{x}_i$ so that:
\[ y_i = \sum_{m=1}^{d} z_{im} + \epsilon, \qquad z_{im} = \mathbf{b}_m^T \mathbf{x}_{im} + \epsilon_{zm} \tag{5.1} \]
where $\epsilon$ and $\epsilon_{zm}$ are both additive noise terms: $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$ and $\epsilon_{zm} \sim \mathrm{Normal}(0, \psi_{zm})$. The $\mathbf{z}$ variables allow us to derive computationally efficient and numerically robust $O(d)$ EM-like updates, as we will see later. Note that $\mathbf{x}_{im} = [x_{im} \; 1]^T$ and $\mathbf{b}_m = [b_m \; b_{m0}]^T$, where $x_{im}$ is the $m$th coefficient of $\mathbf{x}_i$, $b_m$ is the $m$th coefficient of $\mathbf{b}$ and $b_{m0}$ is the offset value.

[Figure 5.1: Graphical model of Bayesian local kernel shaping. Random variables are in circles, and observed random variables are in shaded double circles.]

The prediction at the query point $\mathbf{x}_q$ is then $\sum_{m=1}^{d} \mathbf{b}_m^T \mathbf{x}_{qm}$. We assume the following prior distributions for our model, shown graphically in Figure 5.1:
\[
\begin{aligned}
p(y_i | \mathbf{z}_i) &\sim \mathrm{Normal}\left( \mathbf{1}^T \mathbf{z}_i, \sigma^2 \right) \\
p(z_{im} | \mathbf{x}_{im}) &\sim \mathrm{Normal}\left( \mathbf{b}_m^T \mathbf{x}_{im}, \psi_{zm} \right) \\
p(\mathbf{b}_m | \psi_{zm}) &\sim \mathrm{Normal}\left( \mathbf{0}, \psi_{zm} \boldsymbol{\Sigma}_{bm,0} \right) \\
p(\psi_{zm}) &\sim \text{Scaled-Inv-}\chi^2\left( n_{m0}, \psi_{zm,0} \right)
\end{aligned} \tag{5.2}
\]
where $\mathbf{1}$ is a vector of 1s, $\mathbf{z}_i \in \mathbb{R}^{d \times 1}$, $z_{im}$ is the $m$th coefficient of $\mathbf{z}_i$, and $\boldsymbol{\Sigma}_{bm,0}$ is the prior covariance matrix of $\mathbf{b}_m$ and a $2 \times 2$ diagonal matrix. $n_{m0}$ and $\psi_{zm,0}$ are the prior parameters of the Scaled-Inverse-$\chi^2$ distribution ($n_{m0}$ is the degrees-of-freedom parameter and $\psi_{zm,0}$ is the scale parameter). The Scaled-Inverse-$\chi^2$ distribution was used for $\psi_{zm}$ since it is the conjugate prior for the variance parameter of a Gaussian distribution.

In contrast to classical treatments of Bayesian weighted regression (Gelman et al. 2000), where the weights enter as a heteroscedastic correction on the noise variance of each data sample, we associate a scalar indicator-like weight, $w_i \in \{0, 1\}$, with each sample $\{\mathbf{x}_i, y_i\}$ in $\mathcal{D}$. The sample is fully included in the local model if $w_i = 1$ and excluded if $w_i = 0$. We define the weight $w_i$ to be:
\[ w_i = \prod_{m=1}^{d} w_{im} \tag{5.3} \]
where $w_{im}$ is the weight component in the $m$th input dimension.
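To make the weighting concrete, here is a minimal sketch (our illustrative code, not the thesis implementation) of the Gaussian kernel from the previous section and the product-of-dimensions weight of Eq. (5.3):

```python
import numpy as np

def gaussian_weight(x_i, x_q, h):
    """w_i = exp(-0.5 (x_i - x_q)^T H (x_i - x_q)) with H = diag(h)."""
    diff = np.asarray(x_i, float) - np.asarray(x_q, float)
    return float(np.exp(-0.5 * diff @ (np.asarray(h, float) * diff)))

def product_weight(w_im):
    """w_i = prod_m w_im, Eq. (5.3): the sample weight across all input dims."""
    return float(np.prod(w_im))

# the weight decays with distance from the query point
w_near = gaussian_weight([0.1, 0.0], [0.0, 0.0], [1.0, 1.0])
w_far = gaussian_weight([2.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```

A sample at the query point itself receives weight 1, and the weight falls off with the squared distance scaled by the bandwidth vector $\mathbf{h}$.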
While previous methods model the weighting kernel $K$ as some explicit function, we treat the weights $w_{im}$ probabilistically and model them as Bernoulli-distributed random variables:
\[ p(w_{im}) \sim \mathrm{Bernoulli}(q_{im}) \tag{5.4} \]
choosing a symmetric bell-shaped function for the parameter $q_{im}$:
\[ q_{im} = \frac{1}{1 + (x_{im} - x_{qm})^{2r}\, h_m} \tag{5.5} \]
where $x_{qm}$ is the $m$th coefficient of $\mathbf{x}_q$, $h_m$ is the $m$th coefficient of $\mathbf{h}$, and $r > 0$ is a positive integer. The function for $q_{im}$ has the property that data samples are downweighted more if they are located further from the query point in input space. $(x_{im} - x_{qm})$ is taken to the power $2r$ in order to ensure that the resulting expression is positive. Adjusting $r$ affects how long the tails of the kernel are, as Figure 5.2 shows. For smaller values of $r$, the weighting kernel takes on a shape with a narrower peak, but longer tails. We use $r = 2$ for all our experiments to avoid long tails.

[Figure 5.2: Graphs of the function $q_i = 1/(1 + (x_i - x_q)^{2r} h)$ for 1-dimensional inputs $x_i$, where $x_q = 0$ and $h = 1$, over different $r$ values ($r = 1, 2, 3$).]

We model $w_{im}$ with a Bernoulli distribution since the weights $\mathbf{w}$ need to be positive and between 0 and 1. $q_{im}$ is then the parameter of the Bernoulli distribution and is, by definition, also restricted to values between 0 and 1. As pointed out in Atkeson et al. (1997), the particular mathematical formulation of a weighting kernel is largely computationally irrelevant for locally weighted learning. Our choice of weighting kernel (i.e., choice of function for $q_{im}$) was dominated by the desire to obtain analytically tractable learning updates (see the next section for more details).

We place a Gamma prior over the bandwidth $h_m$:
\[ p(h_m) \sim \mathrm{Gamma}\left( a_{hm0}, b_{hm0} \right) \tag{5.6} \]
where $a_{hm0}$ and $b_{hm0}$ are parameters of the Gamma distribution, to ensure a positive weighting kernel width.

5.1.2 Inference

We can treat the entire regression problem as an EM learning problem (Dempster et al. 1977, Ghahramani & Beal 2000).
Defining $\mathbf{X}$ to be a matrix with input vectors $\mathbf{x}_i$ arranged in its rows and $\mathbf{y}$ as a vector with coefficients $y_i$, we would like to maximize the log likelihood $\log p(\mathbf{y}|\mathbf{X})$ for generating the observed data. We can maximize this incomplete log likelihood by maximizing the expected value of the complete log likelihood, where the complete likelihood factorizes as:
\[ p\left( \mathbf{y}, \mathbf{Z}, \mathbf{b}, \mathbf{w}, \mathbf{h}, \sigma^2, \boldsymbol{\psi}_z \,\middle|\, \mathbf{X} \right) = \prod_{i=1}^{N} p\left( y_i, \mathbf{z}_i, \mathbf{b}, \mathbf{w}_i, \mathbf{h}, \sigma^2, \boldsymbol{\psi}_z \,\middle|\, \mathbf{x}_i \right) \]
In our model, each data sample $i$ has an indicator-like scalar weight $w_i$ associated with it, allowing us to express the complete log likelihood $L$, in a similar fashion to mixture models, as:
\[ L = \log\left[ \prod_{i=1}^{N} \left[ p(y_i|\mathbf{z}_i,\sigma^2)\, p(\mathbf{z}_i|\mathbf{x}_i,\mathbf{b},\boldsymbol{\psi}_z) \right]^{w_i} \prod_{m=1}^{d} p(w_{im}) \; \prod_{m=1}^{d} p(\mathbf{b}_m|\psi_{zm})\, p(\psi_{zm})\, p(h_m)\; p(\sigma^2) \right] \]
which can be expanded to:
\[
\begin{aligned}
L = {}& \sum_{i=1}^{N} w_i \log p(y_i|\mathbf{z}_i,\sigma^2) + \sum_{i=1}^{N} w_i \log p(\mathbf{z}_i|\mathbf{x}_i,\mathbf{b},\boldsymbol{\psi}_z) + \sum_{i=1}^{N}\sum_{m=1}^{d} \log p(w_{im}) \\
&+ \sum_{m=1}^{d} \log p(\mathbf{b}_m|\psi_{zm}) + \sum_{m=1}^{d} \log p(\psi_{zm}) + \sum_{m=1}^{d} \log p(h_m) + \log p(\sigma^2)
\end{aligned} \tag{5.7}
\]
If we expand the term $\sum_{i=1}^{N}\sum_{m=1}^{d} \log p(w_{im})$ in Eq. (5.7) above, we get:
\[
\begin{aligned}
&\sum_{i=1}^{N}\sum_{m=1}^{d} w_{im} \log p(w_{im}=1) + \sum_{i=1}^{N}\sum_{m=1}^{d} (1-w_{im}) \log p(w_{im}=0) \\
&= \sum_{i=1}^{N}\sum_{m=1}^{d} \left[ w_{im} \log \frac{1}{1+(x_{im}-x_{qm})^{2r} h_m} + (1-w_{im}) \log\left( 1 - \frac{1}{1+(x_{im}-x_{qm})^{2r} h_m} \right) \right] \\
&= \sum_{i=1}^{N}\sum_{m=1}^{d} w_{im} \log \frac{1}{1+(x_{im}-x_{qm})^{2r} h_m} + \sum_{i=1}^{N}\sum_{m=1}^{d} (1-w_{im}) \log \frac{(x_{im}-x_{qm})^{2r} h_m}{1+(x_{im}-x_{qm})^{2r} h_m} \\
&= \sum_{i=1}^{N}\sum_{m=1}^{d} (1-w_{im}) \log (x_{im}-x_{qm})^{2r} h_m - \sum_{i=1}^{N}\sum_{m=1}^{d} \log\left( 1+(x_{im}-x_{qm})^{2r} h_m \right)
\end{aligned} \tag{5.8}
\]
Given that the prior on $h_m$ is a Gamma distribution, there is a problematic log term, $-\log\left( 1+(x_{im}-x_{qm})^{2r} h_m \right)$, in the expression above that prevents us from deriving an analytically tractable expression for the posterior of $h_m$. To address this, we use a variational approach on concave/convex functions suggested by Jaakkola & Jordan (2000) in order to produce analytically tractable expressions.
We can find a lower bound to the term $-\log\left( 1+(x_{im}-x_{qm})^{2r} h_m \right)$ so that:
\[ -\log\left( 1+(x_{im}-x_{qm})^{2r} h_m \right) \geq -\lambda_{im}\, (x_{im}-x_{qm})^{2r} h_m + \mathrm{const}_{h_m} \tag{5.9} \]
where $\lambda_{im}$ is a variational parameter to be optimized in the M-step of our final EM-like algorithm; this can be done by optimizing with respect to the expected complete log likelihood:
\[ \lambda_{im} = \frac{1}{1+(x_{im}-x_{qm})^{2r}\, \langle h_m \rangle} \tag{5.10} \]
Please refer to Appendix D.1 for derivations of the lower bound in Eq. (5.9).

Our choice of weighting kernel $q_{im}$ allows us to find a lower bound to $L$ in this manner. We explored the use of other weighting kernels (such as a quadratic negative exponential, among others). However, we encountered difficulties in finding a lower bound to the problematic terms in $\log p(w_{im})$ so that analytically tractable inference for the posterior of $h_m$ could be done. The resulting lower bound to $L$, $\hat{L}$, is then:
\[
\begin{aligned}
L \geq \hat{L} = {}&- \sum_{i=1}^{N} w_i \log 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{N} w_i \left( y_i - \mathbf{1}^T \mathbf{z}_i \right)^2 - \frac{1}{2} \sum_{i=1}^{N} w_i \sum_{m=1}^{d} \log 2\pi\psi_{zm} \\
&- \sum_{i=1}^{N} w_i \sum_{m=1}^{d} \frac{1}{2\psi_{zm}} \left( z_{im} - \mathbf{b}_m^T \mathbf{x}_{im} \right)^2 + \sum_{i=1}^{N}\sum_{m=1}^{d} (1-w_{im}) \log (x_{im}-x_{qm})^{2r} h_m \\
&- \sum_{i=1}^{N}\sum_{m=1}^{d} \lambda_{im} (x_{im}-x_{qm})^{2r} h_m - \frac{1}{2} \sum_{m=1}^{d} \log |\boldsymbol{\Sigma}_{bm,0}| - \frac{1}{2} \sum_{m=1}^{d} \log \psi_{zm} \\
&- \sum_{m=1}^{d} \frac{1}{2\psi_{zm}} \mathbf{b}_m^T \boldsymbol{\Sigma}_{bm,0}^{-1} \mathbf{b}_m - \sum_{m=1}^{d} \left( \frac{n_{m0}}{2}+1 \right) \log \psi_{zm} - \sum_{m=1}^{d} \frac{n_{m0}\, \psi_{zmN0}}{2\psi_{zm}} \\
&+ \sum_{m=1}^{d} (a_{hm0}-1) \log h_m - \sum_{m=1}^{d} b_{hm0} h_m + \mathrm{const}_{\mathbf{y},\mathbf{Z},\mathbf{b},\mathbf{w},\mathbf{h},\sigma^2,\boldsymbol{\psi}_z}
\end{aligned} \tag{5.11}
\]
We would like to maximize the lower bound to the log likelihood in Eq. (5.11) and find the corresponding parameter values. The expectation of this complete data likelihood $\hat{L}$ should be taken with respect to the true posterior distribution of all hidden variables $Q(\mathbf{b}, \boldsymbol{\psi}_z, \mathbf{z}, \mathbf{h})$. Since this is analytically intractable, a lower bound can be formulated using a variational factorial approximation of the true posterior, e.g., (Ghahramani & Beal 2000), as follows: $Q(\mathbf{b}, \boldsymbol{\psi}_z, \mathbf{h}, \mathbf{z}) = Q(\mathbf{b}, \boldsymbol{\psi}_z)\, Q(\mathbf{h})\, Q(\mathbf{z})$. All resulting posterior distributions over hidden variables become analytically tractable.
Assuming that $\theta = \{\mathbf{b}, \boldsymbol{\psi}_z, \mathbf{h}\}$, the posterior of $w_{im}$, $p(w_{im}=1 \,|\, y_i, \mathbf{z}_i, \mathbf{x}_i, \theta, \mathbf{w}_{i,k \neq m})$, can be inferred using Bayes' rule:
\[ \frac{ p(y_i, \mathbf{z}_i | \mathbf{x}_i, \theta, \mathbf{w}_{i,k\neq m}, w_{im}=1)^{\prod_{t=1,t\neq m}^{d} \langle w_{it} \rangle}\, p(w_{im}=1) }{ p(y_i, \mathbf{z}_i | \mathbf{x}_i, \theta, \mathbf{w}_{i,k\neq m}, w_{im}=1)^{\prod_{t=1,t\neq m}^{d} \langle w_{it} \rangle}\, p(w_{im}=1) + p(w_{im}=0) } \tag{5.12} \]
where $\mathbf{w}_{i,k\neq m}$ denotes the set of weights $\{w_{ik}\}_{k=1,k\neq m}^{d}$. For the dimension $m$, we account for the effect of weights in the other $d-1$ dimensions. This is a result of $w_i$ being defined as the product of weights in all dimensions, as seen in Eq. (5.3). The posterior mean of $w_{im}$ is then $\langle p(w_{im}=1 | y_i, \mathbf{z}_i, \mathbf{x}_i, \theta, \mathbf{w}_{i,k\neq m}) \rangle$, and $\langle w_i \rangle = \prod_{m=1}^{d} \langle w_{im} \rangle$, where $\langle \cdot \rangle$ denotes the expectation operator. The final posterior EM update equations (along with posterior updates for $w_{im}$) are listed below.

E-step:
\[ \boldsymbol{\Sigma}_{bm} = \left( \boldsymbol{\Sigma}_{bm,0}^{-1} + \sum_{i=1}^{N} \langle w_i \rangle \mathbf{x}_{im} \mathbf{x}_{im}^T \right)^{-1} \tag{5.13} \]
\[ \langle \mathbf{b}_m \rangle = \boldsymbol{\Sigma}_{bm} \left( \sum_{i=1}^{N} \langle w_i \rangle \langle z_{im} \rangle \mathbf{x}_{im} \right) \tag{5.14} \]
\[ \boldsymbol{\Sigma}_{z_i|y_i,\mathbf{x}_i} = \frac{\boldsymbol{\Psi}_{zN}}{\langle w_i \rangle} - \frac{1}{s_i} \frac{\boldsymbol{\Psi}_{zN}}{\langle w_i \rangle} \mathbf{1}\mathbf{1}^T \frac{\boldsymbol{\Psi}_{zN}}{\langle w_i \rangle} \tag{5.15} \]
\[ \langle \mathbf{z}_i \rangle = \frac{\boldsymbol{\Psi}_{zN} \mathbf{1}\, y_i}{s_i \langle w_i \rangle} + \left( \mathbf{I}_{d,d} - \frac{\boldsymbol{\Psi}_{zN} \mathbf{1}\mathbf{1}^T}{s_i \langle w_i \rangle} \right) \mathbf{b}_{\mathbf{x}_i} \tag{5.16} \]
\[ \psi_{zmN} = \frac{ \sum_{i=1}^{N} \langle w_i \rangle \left( \langle z_{im} \rangle - \langle \mathbf{b}_m \rangle^T \mathbf{x}_{im} \right)^2 + \sum_{i=1}^{N} \langle w_i \rangle \left( \boldsymbol{\Sigma}_{z_i|y_i,\mathbf{x}_i} \right)_{mm} + n_{m0}\, \psi_{zmN0} + \langle \mathbf{b}_m \rangle^T \boldsymbol{\Sigma}_{bm,0}^{-1} \langle \mathbf{b}_m \rangle }{ n_{m0} + \sum_{i=1}^{N} \langle w_i \rangle } \tag{5.17} \]
\[ \langle w_{im} \rangle = \frac{ q_{im}\, A_i^{\prod_{k=1,k\neq m}^{d} \langle w_{ik} \rangle} }{ q_{im}\, A_i^{\prod_{k=1,k\neq m}^{d} \langle w_{ik} \rangle} + 1 - q_{im} } \tag{5.18} \]
\[ \langle h_m \rangle = \frac{ a_{hm0} + N - \sum_{i=1}^{N} \langle w_{im} \rangle }{ b_{hm0} + \sum_{i=1}^{N} \lambda_{im} (x_{im}-x_{qm})^{2r} } \tag{5.19} \]

M-step:
\[ \sigma^2 = \frac{1}{\sum_{i=1}^{N} \langle w_i \rangle} \sum_{i=1}^{N} \left\langle w_i \left( y_i - \mathbf{1}^T \mathbf{z}_i \right)^2 \right\rangle \tag{5.20} \]
\[ \lambda_{im} = \frac{1}{1 + (x_{im}-x_{qm})^{2r} \langle h_m \rangle} \tag{5.21} \]
where:
\[ \langle w_i \rangle = \prod_{m=1}^{d} \langle w_{im} \rangle, \qquad s_i = \sigma^2 + \mathbf{1}^T \frac{\boldsymbol{\Psi}_{zN}}{\langle w_i \rangle} \mathbf{1}, \qquad q_{im} = \lambda_{im} = \frac{1}{1 + (x_{im}-x_{qm})^{2r} \langle h_m \rangle} \]
\[ A_i = \mathrm{Normal}\left( y_i;\, \mathbf{1}^T \langle \mathbf{z}_i \rangle,\, \sigma^2 \right) \prod_{m=1}^{d} \mathrm{Normal}\left( \langle z_{im} \rangle;\, \langle \mathbf{b}_m \rangle^T \mathbf{x}_{im},\, \psi_{zm} \right) \]
and $\mathbf{I}_{d,d}$ is a $d \times d$ identity matrix, $\mathbf{b}_{\mathbf{x}_i}$ is a $d \times 1$ vector with coefficients $\langle \mathbf{b}_m \rangle^T \mathbf{x}_{im}$, and $\boldsymbol{\Psi}_{zN}$ is a diagonal matrix with $\boldsymbol{\psi}_{zN}$ on its diagonal. Note that to avoid division by zero, $\langle w_i \rangle$ needs to be capped to some small non-zero value. Please refer to Appendix D.2 for derivations of the EM update equations listed above.
[Figure 5.3: Effect of outliers (in black circles) on local kernel shaping: training data vs. fits by a stationary GP and by kernel shaping.]

Closer examination of Eq. (5.14), the expression for $\langle \mathbf{b}_m \rangle$, shows that it is a standard Bayesian weighted regression update (Gelman et al. 2000), i.e., a data sample $i$ with lower weight $w_i$ will be downweighted in the regression. Since the weights are influenced by the residual error at each data point, as Eq. (5.18) shows, an outlier will be downweighted appropriately and eliminated from the local model. Figure 5.3 shows how local kernel shaping is able to ignore outliers that a classical GP fits. Note that the posterior mean of $h_m$ is a function of the number of samples that are excluded from the local model and the weighted distance of each sample from the query point (i.e., the denominator of $\langle h_m \rangle$).

Initialization of Priors: A few remarks should be made regarding the initialization of priors used in Eqs. (5.13) to (5.21). $\boldsymbol{\Sigma}_{bm,0}$ can be set to $10^6 \mathbf{I}$ to reflect a large uncertainty associated with the prior distribution of $\mathbf{b}$. The initial noise variance, $\psi_{zm,0}$, should be set to the best guess on the noise variance. To adjust the strength of this prior, $n_{m0}$ can be set to the number of samples one believes to have seen with noise variance $\psi_{zm,0}$: the more points, the stronger the prior. For instance, $n_{m0}$ could be set to a fraction of the number of samples in the training set. Finally, the initial $\mathbf{h}$ of the weighting kernel should be set so that the kernel is broad and wide. We use values of $a_{hm0} = b_{hm0} = 10^{-6}$ so that $h_{m0} = 1$ with high uncertainty. In a similar fashion to outlier detection in Chapter 4, some sort of initial belief about the noise level is needed to distinguish between noise and structure in the training data. Aside from the initial prior on $\psi_{zm}$, we used the same priors for all data sets in our evaluations.
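The prior settings just described can be collected into a small configuration helper. This is our own sketch (the function name, dictionary keys and `prior_frac` argument are ours), packaging the values stated in the text:

```python
import numpy as np

def default_priors(N, noise_guess=0.01, prior_frac=0.1):
    """Uninformative prior settings per input dimension, as described in the
    text. prior_frac sets n_m0 as a fraction of the training set size N."""
    return {
        "Sigma_bm0": 1e6 * np.eye(2),   # broad prior over b_m (2x2: slope + offset)
        "psi_zm0": noise_guess,         # best guess of the output noise variance
        "n_m0": prior_frac * N,         # prior strength as a pseudo-sample count
        "a_hm0": 1e-6,                  # Gamma prior on h_m ...
        "b_hm0": 1e-6,                  # ... so that h_m0 = 1 with high uncertainty
    }
```

Only `noise_guess` typically needs adjusting from data set to data set; the remaining values stay uninformative.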
5.1.3 Computational Complexity

For one local model, the EM update equations have a computational complexity of $O(Nd)$ per EM iteration, where $d$ is the number of input dimensions and $N$ is the size of the training set. This efficiency arises from the introduction of the hidden random variables $\mathbf{z}_i$, which allows $\langle \mathbf{z}_i \rangle$ and $\boldsymbol{\Sigma}_{z_i|y_i,\mathbf{x}_i}$ to be computed in $O(d)$ and avoids a $d \times d$ matrix inversion, which would typically require $O(d^3)$. Nonstationary GP methods, e.g., (Paciorek & Schervish 2004), require $O(N^3) + O(N^2)$ for training and prediction, while other more efficient stationary GP methods, e.g., (Snelson & Ghahramani 2006b), have $O(M^2 N) + O(M^2)$ training and prediction costs (where $M \ll N$ is the number of pseudoinputs used in Snelson & Ghahramani (2006b)). Our algorithm requires $O(N d I_{EM})$, where $I_{EM}$ is the number of EM iterations (with a maximal cap of 1000 iterations used). The algorithm also does not require any MCMC sampling, as in Rasmussen & Ghahramani (2002) and Meeds & Osindero (2005), making it more appealing for real-time applications.

5.2 Extension to Gaussian Processes

We can apply the algorithm in Section 5.1 not only to locally weighted learning with linear models, but also to derive a non-stationary GP method. A GP is defined by a mean and a covariance function, where the covariance function $K$ captures dependencies between any two points as a function of the corresponding inputs, i.e., $k(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{cov}\left( f(\mathbf{x}_i), f(\mathbf{x}_j) \right)$, where $i, j = 1, \ldots, N$. Standard GP models use a stationary covariance function, where the covariance between any two points in the training data is a function of the distances $|\mathbf{x}_i - \mathbf{x}_j|$, not of their locations. Stationary GPs perform suboptimally for functions that have different properties in various parts of the input space (e.g., discontinuous functions), where the stationarity assumption fails to hold. Various methods have been proposed to specify non-stationary GPs.
These include defining a non-stationary Matérn covariance function (Paciorek & Schervish 2004), adopting a mixture-of-local-experts approach (Tresp 2000, Rasmussen & Ghahramani 2002, Shi, Murray-Smith & Titterington 2005, Meeds & Osindero 2005) to use independent GPs to cover data in different regions of the input space, and using multidimensional scaling to map a non-stationary spatial GP into a latent space (Schmidt & O'Hagan 2003).

Given the data set $\mathcal{D}$ drawn from the function $y = f(\mathbf{x}) + \epsilon$, as previously introduced in Section 5.1.1, we propose an approach to specify a non-stationary covariance function. Assuming the use of a quadratic negative exponential covariance function, the covariance function of a stationary GP is as follows:
\[ k(\mathbf{x}_i, \mathbf{x}_j) = v_1^2 \exp\left( -0.5 \sum_{m=1}^{d} h_m (x_{im} - x_{jm})^2 \right) + v_0 \]
where the $(d+2)$ hyperparameters $\{h_1, h_2, \ldots, h_d, v_0, v_1\}$ are optimized. In a non-stationary GP, the covariance function could then take the following form (Gibbs 1997):
\[ k(\mathbf{x}_i, \mathbf{x}_j) = v_1^2 \prod_{m=1}^{d} \left( \frac{2\sqrt{h_{im} h_{jm}}}{h_{im} + h_{jm}} \right)^{1/2} \exp\left( -\sum_{m=1}^{d} \frac{(x_{im} - x_{jm})^2}{2\,(h_{im} + h_{jm})} \right) + v_0 \tag{5.22} \]
where $h_{im}$ is the bandwidth of the local model centered at $x_{im}$ and $h_{jm}$ is the bandwidth of the local model centered at $x_{jm}$. The hyperparameters to be optimized now consist of only $v_0$ and $v_1$.

We first learn the values of $\{h_{im}\}_{m=1}^{d}$ for all training data samples $i = 1, \ldots, N$ using our proposed local kernel shaping algorithm and then optimize the hyperparameters $v_0$ and $v_1$. To make a prediction for a test sample $\mathbf{x}_q$, we also learn the values of $\{h_{qm}\}_{m=1}^{d}$, i.e., the bandwidth of the local model centered at $\mathbf{x}_q$. Importantly, since the covariance function of the GP is derived from locally constant models, we learn with locally constant, instead of locally linear, polynomials. We use $r = 1$ for the weighting kernel in order to keep the degree of nonlinearity consistent with that in the covariance function (i.e., quadratic).
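A covariance of this form is straightforward to evaluate once the per-input bandwidths have been learnt. The sketch below is our illustrative reading of Eq. (5.22) (the exact placement of the per-dimension prefactor is an assumption on our part), not the thesis code:

```python
import numpy as np

def nonstationary_cov(x_i, x_j, h_i, h_j, v0=0.1, v1=1.0):
    """Non-stationary covariance in the form of Eq. (5.22): h_i and h_j are
    the per-dimension bandwidth vectors learnt at inputs x_i and x_j."""
    x_i, x_j, h_i, h_j = (np.asarray(a, float) for a in (x_i, x_j, h_i, h_j))
    # per-dimension normalizer; equals 1 when the two bandwidths agree
    prefactor = np.prod((2.0 * np.sqrt(h_i * h_j) / (h_i + h_j)) ** 0.5)
    expo = np.exp(-np.sum((x_i - x_j) ** 2 / (2.0 * (h_i + h_j))))
    return v1 ** 2 * prefactor * expo + v0
```

When the two bandwidths agree and the inputs coincide, the covariance reduces to $v_1^2 + v_0$; mismatched bandwidths shrink the covariance through the prefactor.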
Even though the weighting kernel used in the local kernel shaping algorithm is not a quadratic negative exponential, it has a similar bell shape, but with a flatter top and shorter tails. Because of this, our proposed method is an approximate form of a non-stationary GP. Nonetheless, the augmented GP is able to capture non-stationary properties of the function $f$ without needing MCMC sampling, unlike previously proposed non-stationary GP methods (Rasmussen & Ghahramani 2002, Meeds & Osindero 2005).

5.3 Experimental Results

We evaluate our Bayesian local kernel shaping algorithm on synthetic data sets, in order to establish that its performance is competitive with other state-of-the-art techniques for nonlinear regression such as locally weighted projection regression (LWPR) and Gaussian process regression (GPR). We also demonstrate the effectiveness of our algorithm on a real robotic data set, learning an inverse kinematics model for a robot arm.

5.3.1 Synthetic Data

First, we show our local kernel shaping algorithm's bandwidth adaptation abilities on several synthetic data sets, comparing it to a stationary GP and our proposed augmented non-stationary GP.

One-dimensional synthetic data: For ease of visualization, we consider the following one-dimensional functions, similar to those in Paciorek & Schervish (2004): (i) a function with a discontinuity, (ii) a spatially inhomogeneous function, and (iii) a straight line. The data set for function (i) consists of 250 training samples, 201 test inputs (evenly spaced across the input space) and output noise with $\sigma^2 = 0.3025$. The data set for function (ii) consists of 250 training samples, 101 test inputs and an output signal-to-noise ratio (SNR) of 10. The data set for function (iii) has 50 training samples, 21 test inputs and an output SNR of 100. Figures 5.4 to 5.6 show the predicted outputs of a stationary GP, augmented non-stationary GP and the local kernel shaping algorithm for data sets (i)-(iii).
The local kernel shaping algorithm smooths over regions where a stationary GP overfits and yet still manages to capture regions of highly varying curvature, as seen in Figures 5.4(a) and 5.6(a). It correctly adjusts the bandwidths $\mathbf{h}$ with the curvature of the function. When the data looks linear, the algorithm opens up the weighting kernel so that all data samples are considered, as Figure 5.6(b) shows. Our proposed augmented non-stationary GP can handle the non-stationary nature of the data sets as well, and its performance is quantified in Table 5.1. Returning to our motivation to use these algorithms to obtain linearizations for learning control, it is important to realize that the high variations from fitting noise, as shown by the stationary GP in Figures 5.4(a) and 5.5(a), are detrimental for learning algorithms, as the slope (or tangent hyperplane, for high-dimensional data) would be wrong.

Table 5.1 reports the normalized mean squared prediction error (nMSE) values for the function (i) and function (ii) data sets, averaged over 20 random data sets. We also performed Gibbs sampling with Adaptive Rejection Sampling to sample from the conditional distribution of $\mathbf{h}$.

[Figure 5.4: Observed vs. predicted output made by stationary GP, augmented GP and Bayesian local kernel shaping for function (i). Figure 5.4(b) shows the bandwidths learnt by local kernel shaping and the corresponding weighting kernels (in dotted black lines) for various input query points (shown in red circles).]
[Figure 5.5: Observed vs. predicted output made by stationary GP, augmented GP and Bayesian local kernel shaping for function (ii). Figure 5.5(b) shows the bandwidths learnt by local kernel shaping and the corresponding weighting kernels (in dotted black lines) for various input query points (shown in red circles).]

[Figure 5.6: Observed vs. predicted output made by stationary GP, augmented GP and Bayesian local kernel shaping for function (iii). Figure 5.6(b) shows the bandwidths learnt by local kernel shaping and the corresponding weighting kernels (in dotted black lines) for various input query points (shown in red circles).]

Table 5.1: Average normalized mean squared prediction error values for a stationary GP model, our augmented non-stationary GP, and local kernel shaping. Results, averaged over 20 random data sets, are shown for functions (i) and (ii).

Method | Function (i) | Function (ii)
Stationary GP | 0.1251 ± 0.013 | 0.0230 ± 0.0047
Augmented non-stationary GP | 0.0110 ± 0.0078 | 0.0212 ± 0.0067
Local Kernel Shaping | 0.0092 ± 0.0068 | 0.0217 ± 0.0058

2-dimensional synthetic data: We also evaluated Bayesian local kernel shaping on 500 noisy training data samples drawn from the 2-dimensional function (CROSS 2D) generated from:
\[ y = \max\left( \exp(-10 x_1^2),\; \exp(-50 x_2^2),\; 1.25 \exp\left( -5 (x_1^2 + x_2^2) \right) \right) \]
as previously examined in Schaal et al. (1998) and Vijayakumar et al. (2005). Mean-zero noise with a variance of 0.01 was added to the outputs. Figure 5.7(a) shows the target function, evaluated over a noiseless test input, resulting in 1681 data points on a $41 \times 41$ grid over the $2 \times 2$ square in input space.
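Training data of the kind described above can be generated with a short script. This is a sketch of our own; in particular, we assume the inputs are sampled uniformly over the $2 \times 2$ square $[-1, 1]^2$:

```python
import numpy as np

def cross_2d(x1, x2):
    """CROSS 2D target: max of two axis-aligned ridges and a central bump."""
    return np.maximum.reduce([np.exp(-10.0 * x1 ** 2),
                              np.exp(-50.0 * x2 ** 2),
                              1.25 * np.exp(-5.0 * (x1 ** 2 + x2 ** 2))])

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))                  # assumed sampling region
y = cross_2d(X[:, 0], X[:, 1]) + rng.normal(0.0, np.sqrt(0.01), size=500)
```

At the origin the central bump dominates, giving $y = 1.25$; far from both axes all three terms decay toward zero.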
Figure 5.7(b) shows the predicted outputs for Bayesian local kernel shaping, and Figure 5.8 depicts the learnt distance metric values $\mathbf{h}$ over a subset of the test data points scattered over the $41 \times 41$ grid (shown as red circles). As before, we see that the width of the weighting kernel adjusts according to the curvature of the function. Table 5.2 compares the performance of Bayesian local kernel shaping to GPR and LWPR, averaged over 10 randomly chosen training data sets. Performance was quantified in terms of the normalized mean squared prediction error (nMSE) value on the noiseless test sets.

[Figure 5.7: a) nonlinear 2-dimensional CROSS target function to be approximated; b) approximated/predicted function produced by Bayesian local kernel shaping.]

[Figure 5.8: Learnt weighting kernels in input space for the 2-dimensional CROSS target function in Figure 5.7. Small red circles indicate the test input points and centers of the weighting kernels. Training data has output noise with a variance of 0.01.]

Table 5.2: Average normalized mean squared error (nMSE) comparisons between Gaussian process regression (GPR), LWPR and Bayesian local kernel shaping for the nonlinear 2-d CROSS function. Results shown are averaged over 10 trials, and the training data set has N = 500 samples.

Algorithm | nMSE | std-dev
GPR | 0.01991 | 0.00314
LWPR | 0.02556 | 0.00416
Local Kernel Shaping | 0.02609 | 0.00532

Motorcycle data set: Figures 5.9 and 5.10 show the results of the local kernel shaping algorithm and the proposed augmented non-stationary GP on the "real-world" motorcycle data set (Silverman 1985), consisting of 133 samples (with 80 equally spaced input query points used for prediction). We also show results from two previously proposed MCMC-based non-stationary GP methods: an infinite mixture of GP experts from Rasmussen & Ghahramani (2002), shown in Figure 5.11, and an alternate infinite mixture of GP experts (Meeds & Osindero 2005), shown in Figure 5.12. We can see that the augmented non-stationary GP and the local kernel shaping algorithm both capture the leftmost flatter region of the function, as well as some of the more nonlinear and noisier regions after 30 msec.

[Figure 5.9: Predicted results from Bayesian local kernel shaping on the motorcycle impact data set from Silverman (1985).]

[Figure 5.10: Predicted results from our augmented GP method on the motorcycle impact data set from Silverman (1985).]

[Figure 5.11: Predicted results from infinite mixture of GP experts (iMGPE) on the motorcycle impact data set from Silverman (1985). Graph is taken from Rasmussen & Ghahramani (2002).]

[Figure 5.12: Predicted results from alternate infinite mixture of GP experts (AiMoGPE) on the motorcycle impact data set from Silverman (1985). Graph is taken from Meeds & Osindero (2005).]

[Figure 5.13: SensAble Phantom haptic robotic arm.]

5.3.2 Robot Data

Next, we move on to an example application: learning an inverse kinematics model for a 3 degree-of-freedom (DOF) haptic robot arm (manufactured by SensAble, shown in Figure 5.13) in order to control the end-effector along a desired trajectory. This will allow us to verify that the kernel shaping algorithm can successfully deal with a large, noisy real-world data set with outliers and non-stationary properties: typical characteristics of most control learning problems.
We collected 60,000 data samples from the arm while it performed random sinusoidal movements within a constrained box volume of Cartesian space. Each sample consists of the arm's joint angles $\mathbf{q}$, joint velocities $\dot{\mathbf{q}}$, end-effector position in Cartesian space $\mathbf{x}$, and end-effector velocities $\dot{\mathbf{x}}$. From this data, we first learn a forward kinematics model:
\[ \dot{\mathbf{x}} = \mathbf{J}(\mathbf{q})\, \dot{\mathbf{q}} \tag{5.23} \]
where $\mathbf{J}(\mathbf{q})$ is the Jacobian matrix. The transformation from $\dot{\mathbf{q}}$ to $\dot{\mathbf{x}}$ can be assumed to be locally linear at a particular configuration $\mathbf{q}$ of the robot arm. We learn the forward model using kernel shaping, building a local model around each training point only if that point is not already sufficiently covered by an existing local model (e.g., having an activation weight of less than 0.2). Using insights into robot geometry, we localize the models only with respect to $\mathbf{q}$, while the regression of each model is trained only on a mapping from $\dot{\mathbf{q}}$ to $\dot{\mathbf{x}}$; these geometric insights are easily incorporated as priors in the Bayesian model. This procedure resulted in 56 models being built to cover the entire space of training data.

[Figure 5.14: Desired versus actual trajectories for the SensAble Phantom robot arm: (a) analytical solution; (b) learnt solution.]

We artificially introduce a redundancy in our inverse kinematics problem on the 3-DOF arm by specifying the desired trajectory $(\mathbf{x}, \dot{\mathbf{x}})$ only in terms of $x$, $z$ positions and velocities, i.e., the movement is supposed to be in a vertical plane in front of the robot. Analytically, the inverse kinematics equation is:
\[ \dot{\mathbf{q}} = \mathbf{J}^{\#}(\mathbf{q})\, \dot{\mathbf{x}} - \alpha \left( \mathbf{I} - \mathbf{J}^{\#} \mathbf{J} \right) \frac{\partial g}{\partial \mathbf{q}} \tag{5.24} \]
where $\mathbf{J}^{\#}(\mathbf{q})$ is the pseudo-inverse of the Jacobian. The second term is an optimal solution to the redundancy problem, specified here by a cost function $g$ in terms of the joint angles $\mathbf{q}$. To learn a model for $\mathbf{J}^{\#}$, we can reuse the local regions of $\mathbf{q}$ from the forward model, where $\mathbf{J}^{\#}$ is also locally linear.
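The analytical baseline of Eq. (5.24) can be sketched directly with a pseudo-inverse. This is our illustrative code (the gain `alpha` and the cost gradient `grad_g` are generic placeholders, not values from the experiments):

```python
import numpy as np

def ik_step(J, xdot, grad_g, alpha=0.1):
    """Resolved-rate inverse kinematics, Eq. (5.24):
    qdot = J# xdot - alpha (I - J# J) dg/dq, with J# the pseudo-inverse of J."""
    J = np.asarray(J, float)
    J_pinv = np.linalg.pinv(J)
    n = J.shape[1]
    null_proj = np.eye(n) - J_pinv @ J    # projects grad_g into the null space of J
    return J_pinv @ np.asarray(xdot, float) - alpha * null_proj @ np.asarray(grad_g, float)
```

For a square, non-singular Jacobian the null-space projector vanishes and the update reduces to $\dot{\mathbf{q}} = \mathbf{J}^{-1} \dot{\mathbf{x}}$; for a redundant arm the second term resolves the extra degrees of freedom by descending the cost $g$ without disturbing the task.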
The redundancy issue can be solved by applying an additional weight to each data point according to a reward function (Peters & Schaal 2008). In our case, the task is specified in terms of $\{\dot{x}, \dot{z}\}$, so we define a reward based on a desired $y$ coordinate, $y_{des}$, that we would like to enforce as a soft constraint. Our reward function is $g = e^{-\frac{1}{2} h \left( k (y_{des} - y) - \dot{y} \right)^2}$, where $k$ is a gain and $h$ specifies the steepness of the reward. This ensures that the learnt inverse model chooses a solution which produces a $\dot{y}$ that pushes the $y$ coordinate toward $y_{des}$. We invert each forward local model using a weighted linear regression, where each data point is weighted by the weight from the forward model and additionally weighted by the reward.

We test the performance of this inverse model (Learnt IK) in a figure-eight tracking task, as shown in Figure 5.14. As seen, the learnt model performs as well as the analytical inverse kinematics (IK) solution, with root mean squared tracking errors in positions and velocities very close to those of the analytical solution. This demonstrates that kernel shaping is an effective learning algorithm for use in robot control learning applications.

Applying an arbitrary nonlinear regression method (such as a GP) to the inverse kinematics problem would, in fact, lead to unpredictably bad performance. The inverse kinematics problem is a one-to-many mapping and requires careful design of a learning problem to avoid problems with non-convex solution spaces (Jordan & Rumelhart 1991). Our suggested method of learning linearizations with a forward mapping (which is a proper function), followed by learning an inverse mapping within the local region of the forward mapping, is one of the few clean approaches to the problem. Instead of using locally linear methods, one could also use density-based estimation techniques like mixture models (Ghahramani 1994).
However, these methods must select the correct mode in order to arrive at a valid solution, and this final step may be computationally intensive or involve heuristics. For these reasons, applying an MCMC-type approach or GP-based method to the inverse kinematics problem was omitted as a comparison.

5.4 Discussion

We introduced a Bayesian formulation of spatially local adaptive kernels for locally weighted regression. The local kernel shaping algorithm is computationally efficient, making it suitable for large data sets, and can handle non-stationary functions where the data density, curvature and amount of noise in the data vary spatially. The algorithm can also be integrated into nonlinear algorithms such as GPs, offering an approach that does not require any sampling and is therefore more appealing for real-time problems. We show that our local kernel shaping method is particularly useful for learning control, demonstrating results on an inverse kinematics problem, and envision extensions to more complex problems with redundancy, e.g., learning inverse dynamics models of complete humanoid robots.

Note that our algorithm requires only one prior to be set by the user, i.e., the prior on the output noise. All other biases are initialized the same for all data sets and kept uninformative. In general, the bias on the output noise is needed in order to distinguish structure from noise in data, which is also the case for outlier detection. In our algorithm, the noise bias needs to be specified only in a very coarse way.

In its current form, our Bayesian kernel shaping algorithm is built for high-dimensional inputs due to its low computational complexity: it scales linearly with the number of input dimensions. However, numerical problems may arise in the case of redundant and irrelevant input dimensions. Future work will address this issue through the use of an automatic relevance determination feature.
Note that all comments about the functionality of local function approximation in high-dimensional spaces are the same as those made in (Vijayakumar et al. 2005). Other future extensions include an online implementation of the local kernel shaping algorithm.

Chapter 6

Conclusion

In this dissertation, we presented a series of Bayesian algorithms for the development of autonomous learning systems. The algorithms are black-box-like, requiring minimal tuning of parameters by the user. A Bayesian approach allows the user to express domain-specific knowledge intuitively. Instead of having to search for specific values or ranges of values to set parameters to, the user need only specify a prior probability distribution over the parameters.

6.1 Summary of Dissertation Contributions

Algorithmic contributions:

• Reformulation of linear regression as a sparse, numerically robust, computationally efficient (for an iterative method), automatic and truly black-box method
• Accurate model identification (of a generative model) with input and output noise
• Automatic outlier detection
• Nonlinear regression for heteroscedastic and non-stationary functions

Contributions to application domains:

• Modeling of data for brain-machine interfaces
• Accurate parameter identification for rigid body dynamics
• Robust state estimation for tracking
• Learning nonlinear internal models for robotics

6.2 Summary of Chapters

6.2.1 Chapter 2

Chapter 2 addresses how linear regression can be performed in high dimensions, especially where the input data has a large number of input dimensions, many of which are redundant or irrelevant. Such data are typically encountered in the fields of neuroscience and brain-machine interfaces.
Variational Bayesian least squares (VBLS), the algorithm introduced in the chapter, is computationally efficient, requiring O(d) updates per EM iteration, where d is the input dimensionality; it is guaranteed to converge to the global solution and has no parameters to be tuned. The initial prior distribution over the precision variables α can be set to be wide and uninformative. The hyperparameter values for the initial prior of α never need to be changed from data set to data set, making VBLS autonomous and black-box-like.

Even though VBLS is an iterative statistical method, it can be embedded into other iterative methods to make them more computationally efficient. Its iterative nature also makes it suitable for real-time learning scenarios, where time constraints dictate that an approximately accurate solution is better than an extremely accurate one that takes too long to compute.

6.2.2 Chapter 3

Chapter 3 considers the high-dimensional scenarios introduced in Chapter 2, but takes into account the more realistic setting where observed, sensory data (in particular, observed input data) is contaminated with noise. A novel full Bayesian treatment of linear system identification is introduced in the chapter. The algorithm is robust to high-dimensional, ill-conditioned data with noisy input and output data, remains computationally efficient (again, O(d) per EM iteration) and has no parameters that need to be tuned. The algorithm is, however, iterative, but then again, so is probabilistic factor analysis for regression. As with VBLS, the iterative nature of the algorithm makes it suitable for real-time learning scenarios and also for incorporation into other iterative methods to de-noise input data.

The algorithm was applied to the problem of estimating parameters in rigid body dynamics, an estimation problem that is linear in the unknown parameters. Physical constraints on the parameters (such as a positive mass, a positive definite inertial matrix, the parallel axis theorem, etc.)
had to be satisfied. As a result, we introduced an additional post-processing step, executed after the algorithm, to ensure that the resulting parameter estimates satisfy the physical constraints. This post-processing step projects the inferred solution onto the feasible space of possible solutions.

The chapter focuses on the problem of learning for system identification (for the purpose of finding the true parameters of a system), as opposed to learning for prediction. Good prediction is often possible without modeling all components of the generative system, while parameter estimation becomes useful when the identified model is used in ways that differ from the training scenario. For example, in the application domain of robotics, the analytical inverse of the identified model is often needed in model-based control.

6.2.3 Chapter 4

Chapter 4 examines how outliers in noisy, sensory data can be detected and removed, especially in real time. In particular, we propose an automatic approach to outlier detection and removal that can be incorporated into the Kalman filter, as well as into other more complex filters. The Bayesian weighted regression framework introduced in the chapter is easy to use. It does not require any parameter tuning beyond the weight prior (which is, in fact, an outlier prior, needed in order to distinguish outliers from data structure), interference from the user, heuristics or sampling. In this framework, each data sample has a scalar weight associated with it, and the optimal values of the weights are learnt.

6.2.4 Chapter 5

Finally, Chapter 5 moves on to the nonlinear high-dimensional regression problem. We introduce a Bayesian kernel shaping algorithm that automatically learns the bandwidth of a local model. The Bayesian kernel shaping algorithm is computationally efficient and can handle non-stationary functions where the data density, curvature and amount of noise in the data vary spatially.
The algorithm can be leveraged not only in local methods, such as locally weighted regression, but also in global methods that are linear in the parameters, such as Gaussian process regression, in order to capture non-stationary properties in data. The algorithm requires that only one prior be set by the user, i.e., the output noise prior. As in outlier detection, this noise prior is needed to distinguish noise from structure in the data. The noise bias in Bayesian kernel shaping needs to be specified only coarsely.

6.3 Opportunities for Future Research

The next step towards realizing autonomous systems is to bring the Bayesian kernel shaping algorithm developed in Chapter 5, as well as the other algorithms, to real-time learning scenarios. The version of Bayesian kernel shaping presented in this thesis also needs to be modified to handle data with many redundant and irrelevant input features.

A potential extension is to move the estimation of a local model's bandwidth from a spatial to a temporal setting: specifically, to the estimation of forgetting rates in incremental learning (e.g., as in recursive least squares). Estimating the optimal value of a forgetting rate is equivalent to determining the window of previously observed samples to consider and average over when making predictions.

A related, interesting research direction is the problem of online model selection and how it can be done in an autonomous, black-box-like manner. The order in which samples are observed will affect the resulting model that is selected. Nevertheless, if one could determine whether a sample should be discarded or taken into account once it is observed, then the resulting mechanism may be an effective alternative solution to bandwidth adaptation.

Reference List

Abe, N., Zadrozny, B. & Langford, J. (2006), Outlier detection by active learning, in 'Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining', ACM Press, New York, NY, USA, pp. 767–772.
Aitkin, M. & Wilson, G. T. (1980), 'Mixture models, outliers and the EM algorithm', Technometrics 22, 325–331.

Akaike, H. (1974), 'A new look at the statistical model identification', IEEE Transactions on Automatic Control 19, 716–723.

An, C. H., Atkeson, C. G. & Hollerbach, J. M. (1988), Model-based control of a robot manipulator, MIT Press.

Atkeson, C., Moore, A. & Schaal, S. (1997), 'Locally weighted learning', AI Review 11, 11–73.

Atkeson, C. & Schaal, S. (1997), Robot learning from demonstration, in 'Proceedings of the 14th International Conference on Machine Learning', Morgan Kaufmann, pp. 12–20.

Attias, H. (2000), A variational Bayesian framework for graphical models, in 'Advances in Neural Information Processing Systems 13', MIT Press.

Azzalini, A. & Bowman, A. W. (1990), 'A look at some data on the Old Faithful geyser', Applied Statistics 39, 357–365.

Bar-Shalom, Y., Li, X. R. & Kirubarajan, T. (2001), Estimation with Applications to Tracking and Navigation, Wiley.

Beal, M. J. (2003), Variational algorithms for approximate Bayesian inference, PhD thesis, Gatsby Computational Neuroscience Unit, University College London.

Belsley, D. A., Kuh, E. & Welsch, R. E. (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley.

Bernardo, J. M. & Smith, A. F. (1994), Bayesian Theory, John Wiley.

Bierman, G. J. (1977), 'Factorization methods for discrete sequential estimation', Mathematics in Science and Engineering 128.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer.

Breitenbach, M. & Grudic, G. Z. (2005), Clustering through ranking on manifolds, in 'Proceedings of the 22nd International Conference on Machine Learning', ACM Press, New York, NY, USA, pp. 73–80.

Chan, S. C., Zhang, Z. G. & Tse, K. W. (2005), A new robust Kalman filter algorithm under outliers and system uncertainties, in 'Proceedings of the IEEE International Symposium on Circuits and Systems', IEEE Press, pp. 4317–4320.

Chow, E. & Saad, Y.
(1998), 'Approximate inverse preconditioners via sparse-sparse iterations', SIAM Journal of Scientific Computing 19(3), 995–1023.

Csato, L. & Opper, M. (2002), 'Sparse online Gaussian processes', Neural Computation 14(2), 641–669.

Dempster, A., Laird, N. & Rubin, D. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society, Series B 39(1), 1–38.

Derksen, S. & Keselman, H. (1992), 'Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables', British Journal of Mathematical and Statistical Psychology 45, 265–282.

Draper, N. R. & Smith, H. (1981), Applied Regression Analysis, Wiley.

D'Souza, A. (2004), Towards tractable parameter-free statistical learning, PhD thesis, Department of Computer Science, University of Southern California.

D'Souza, A., Vijayakumar, S. & Schaal, S. (2004), The Bayesian backfitting relevance vector machine, in 'Proceedings of the 21st International Conference on Machine Learning', ACM Press.

Durovic, Z. M. & Kovacevic, B. D. (1999), 'Robust estimation with unknown noise statistics', IEEE Transactions on Automatic Control 44, 1292–1296.

Fan, J. & Gijbels, I. (1992), 'Variable bandwidth and local linear regression smoothers', Annals of Statistics 20, 2008–2036.

Fan, J. & Gijbels, I. (1995), 'Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation', Journal of the Royal Statistical Society, Series B 57, 371–395.

Fan, J. & Gijbels, I. (1996), Local Polynomial Modeling and its Applications, Chapman and Hall.

Faul, A. C. & Tipping, M. E. (2001), A variational approach to robust regression, in 'Proceedings of the International Conference on Artificial Neural Networks', pp. 95–102.

Figueiredo, M. (2003), 'Adaptive sparseness for supervised learning', IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1150–1159.

Finnof, W., Hergert, F. & Zimmerman, H. G.
(1993), 'Improving model selection by nonconvergent methods', Neural Networks 6, 771–783.

Fischler, M. A. & Bolles, R. C. (1981), 'Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography', Communications of the ACM 24(6), 381–395.

Fox, D., Burgard, W., Dellaert, D. & Thrun, S. (1999), Monte Carlo localization: efficient position estimation for mobile robots, in 'Proceedings of the National Conference on Artificial Intelligence', pp. 343–349.

Frank, I. & Friedman, J. (1993), 'A statistical view of some chemometric regression tools', Technometrics 35, 109–135.

Friedman, J. H. (1984), A variable span smoother, Technical report, Stanford University.

Friedman, J., Hastie, T. & Tibshirani, R. (2007), Pathwise coordinate optimization, Technical report, Department of Statistics, Stanford University.

Gelfand, A. E. (1996), Model determination using sampling-based methods, in W. R. Gilks, S. Richardson & D. J. Spiegelhalter, eds, 'Markov Chain Monte Carlo in Practice', Chapman and Hall, chapter 9, pp. 145–161.

Gelman, A., Carlin, J., Stern, H. & Rubin, D. (2000), Bayesian Data Analysis, Chapman and Hall.

Geman, S. & Geman, D. (1984), 'Stochastic relaxation, Gibbs distribution and Bayesian restoration of images', IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.

Ghahramani, Z. (1994), Solving inverse problems using an EM approach to density estimation, in 'Proceedings of the 1993 Connectionist Models Summer School', Erlbaum Associates, pp. 316–323.

Ghahramani, Z. & Beal, M. (2000), Graphical models and variational methods, in D. Saad & M. Opper, eds, 'Advanced Mean Field Methods - Theory and Practice', MIT Press.

Ghahramani, Z. & Hinton, G. (1996), Parameter estimation for linear dynamical systems, Technical report, University of Toronto.

Ghahramani, Z. & Hinton, G. E. (1997), The EM algorithm for mixtures of factor analyzers, Technical report, University of Toronto.
Gibbs, M. N. (1997), Bayesian Gaussian processes for regression and classification, PhD thesis, Department of Physics, University of Cambridge.

Golub, G. H. & Van Loan, C. (1989), Matrix Computations, Johns Hopkins University Press.

Goodman, J. (2004), Exponential priors for maximum entropy models, in 'Proceedings of the Annual Meeting of the Association for Computational Linguistics', Morgan Kaufmann.

Hastie, T. J. & Tibshirani, R. J. (1990), Generalized Additive Models, number 43 in 'Monographs on Statistics and Applied Probability', Chapman and Hall.

Hastings, W. K. (1970), 'Monte Carlo sampling methods using Markov chains and their applications', Biometrika 57(1), 97–109.

Haynes, J. & Rees, G. (2005), 'Predicting the orientation of invisible stimuli from activity in human primary visual cortex', Nature Neuroscience 8, 686.

Hoaglin, D. C. (1983), Letter values: a set of selected order statistics, in D. C. Hoaglin, F. Mosteller & J. W. Tukey, eds, 'Understanding Robust and Exploratory Data Analysis', Wiley, pp. 33–57.

Hoerl, A. & Kennard, R. W. (1970), 'Ridge regression: biased estimation for nonorthogonal problems', Technometrics 12(3), 55–67.

Hollerbach, J. M. & Wampler, C. W. (1996), The calibration index and the role of input noise in robot calibration, in G. Giralt & G. Hirzinger, eds, 'Robotics Research: The Seventh International Symposium', Springer, pp. 558–568.

Huber, P. J. (1964), 'Robust estimation of a location parameter', Annals of Mathematical Statistics 35, 73–101.

Huber, P. J. (1973), Robust Statistics, Wiley.

Hubert, M., Rousseeuw, P. & Vanden Branden, K. (2005), 'ROBPCA: a new approach to robust principal component analysis', Technometrics 47, 64–79.

Jaakkola, T. S. & Jordan, M. I. (2000), 'Bayesian parameter estimation via variational methods', Statistics and Computing 10, 25–37.

Jazwinski, A. H. (1970), Stochastic Processes and Filtering Theory, Academic Press.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. & Saul, L. K.
(1999), An introduction to variational methods for graphical models, in M. I. Jordan, ed., 'Learning in Graphical Models', MIT Press.

Jordan, M. I. & Rumelhart, D. E. (1991), Internal world models and supervised learning, in 'Machine Learning: Proceedings of the Eighth International Workshop', Morgan Kaufmann, pp. 70–85.

Kakei, S., Hoffman, D. & Strick, P. (1999), 'Muscle and movement representations in the primary motor cortex', Science 285, 2136–2139.

Kakei, S., Hoffman, D. & Strick, P. (2001), 'Direction of action is represented in the ventral premotor cortex', Nature Neuroscience 4, 1020–1025.

Kalman, R. E. (1960), 'A new approach to linear filtering and prediction problems', Transactions of the ASME - Journal of Basic Engineering 82, 35–45.

Kalman, R. E. & Bucy, R. S. (1961), 'New results in linear filtering and prediction theory', Journal of Basic Engineering, Transactions ASME, Series D 83, 95–108.

Kamitani, Y. & Tong, F. (2004), 'Decoding the visual and subjective contents of the human brain', Nature Neuroscience 8, 679.

Kim, S.-J., Koh, K., Lustig, M., Boyd, S. & Gorinevsky, D. (2007), 'A method for large-scale l1-regularized least squares', IEEE Journal on Selected Topics in Signal Processing 1(4), 606–617.

Kitagawa, G. (1987), 'Non-Gaussian state-space modeling of non-stationary time series', Journal of the American Statistical Association 82, 1032–1063.

Kitagawa, G. (1996), 'Monte Carlo filter and smoother for non-Gaussian nonlinear state-space models', Journal of the American Statistical Association 93, 1203–1215.

Kitagawa, G. & Gersch, W. (1996), Smoothness priors analysis of time series, in 'Lecture Notes in Statistics', Springer-Verlag.

Konolige, K. (2001), Robot motion: probabilistic model; sampling and Gaussian implementations; Markov localization, Technical report, SRI International.

Kramer, S. C. & Sorenson, H. W. (1988), 'Recursive Bayesian estimation using piece-wise constant approximations', Automatica 24(6), 789–801.

Kullback, S. & Leibler, R. A.
(1951), 'On information and sufficiency', The Annals of Mathematical Statistics 22(1), 79–86.

Lawrence, N., Seeger, M. & Herbrich, R. (2003), Fast sparse Gaussian process methods: the informative vector machine, in 'Proceedings of Advances in Neural Information Processing Systems 15', Vol. 15, MIT Press.

Lee, S., Lee, H., Abbeel, P. & Ng, A. (2006), Efficient l1-regularized logistic regression, in 'Proceedings of the 21st National Conference on Artificial Intelligence', AAAI Press.

Leisink, M. A. R. & Kappen, H. J. (2001), 'A tighter bound for graphical models', Neural Computation 13(9), 2149–2170.

Ljung, L. & Soderstrom, T. (1983), Theory and Practice of Recursive System Identification, MIT Press.

Lokhorst, J. (1999), The LASSO and generalized linear models, Technical report, Department of Statistics, University of South Adelaide, South Australia, Australia.

Longford, N. T. (1993), Random Coefficient Models, Oxford University Press.

Mackay, D. J. C. (1992), 'A practical Bayesian framework for backpropagation networks', Neural Computation 4, 448–472.

MacKay, D. J. C. (2003), Information Theory, Inference, and Learning Algorithms, Cambridge University Press.

MacQueen, J. B. (1967), 'Some methods for classification and analysis of multivariate observations', Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 1, 281–297.

Masreliez, C. (1975), 'Approximate non-Gaussian filtering with linear state and observation relations', IEEE Transactions on Automatic Control 20, 107–110.

Masreliez, C. & Martin, R. (1977), 'Robust Bayesian estimation for the linear model and robustifying the Kalman filter', IEEE Transactions on Automatic Control 22, 361–371.

Massey, W. (1965), 'Principal component regression in exploratory statistical research', Journal of the American Statistical Association 60, 234–246.

Maybeck, P. S. (1979), Stochastic Models, Estimation, and Control, Vol. 141 of Mathematics in Science and Engineering, Academic Press.

Meeds, E.
& Osindero, S. (2005), An alternative infinite mixture of Gaussian process experts, in 'Advances in Neural Information Processing Systems 17', MIT Press.

Meinhold, R. J. & Singpurwalla, N. D. (1989), 'Robustification of Kalman filter models', Journal of the American Statistical Association pp. 479–486.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953), 'Equation of state calculations by fast computing machines', Journal of Chemical Physics 21, 1087–1092.

Minka, T. P. (2001), A family of algorithms for approximate Bayesian inference, PhD thesis, MIT Media Lab, Massachusetts Institute of Technology.

Moore, D. S. & McCabe, G. P. (1999), Introduction to the Practice of Statistics, W. H. Freeman & Company.

Morris, J. M. (1976), 'The Kalman filter: a robust estimator for some classes of linear quadratic problems', IEEE Transactions on Information Theory 22, 526–534.

Musallam, S., Corneil, B., Greger, B., Scherberger, H. & Andersen, R. (2004), 'Cognitive control signals for neural prosthetics', Science 305, 258–262.

Myers, K. A. & Tapley, B. D. (1976), 'Adaptive sequential estimation with unknown noise statistics', IEEE Transactions on Automatic Control 21, 520–523.

Neal, R. (1994), Bayesian learning for neural networks, PhD thesis, Department of Computer Science, University of Toronto.

Neal, R. M. (1993), Probabilistic inference using Markov chain Monte Carlo methods, Technical report, University of Toronto.

Neal, R. M. (2003), 'Slice sampling', Annals of Statistics 31, 705–767.

Neal, R. M. & Hinton, G. E. (1999), A view of the EM algorithm that justifies incremental, sparse, and other variants, in M. I. Jordan, ed., 'Learning in Graphical Models', MIT Press, pp. 355–368.

Ng, A., Jordan, M. & Weiss, Y. (2001), On spectral clustering: analysis and an algorithm, in 'Proceedings of Advances in Neural Information Processing Systems 14'.

Nicolelis, M. (2001), 'Actions from thoughts', Nature 409, 403–407.

Ormoneit, D. & Hastie, T.
(1999), Optimal kernel shapes for local linear regression, in 'Proceedings of Advances in Neural Information Processing Systems', MIT Press.

Paciorek, C. J. & Schervish, M. J. (2004), Non-stationary covariance functions for Gaussian process regression, in 'Advances in Neural Information Processing Systems 16', MIT Press.

Peters, J. & Schaal, S. (2008), 'Learning to control in operational space', International Journal of Robotics Research 27, 197–212.

Poggio, T. & Girosi, F. (1990), 'Regularization algorithms for learning that are equivalent to multilayer networks', Science 247, 213–225.

Rao, Y. N., Erdogmus, D., Rao, G. Y. & Principe, J. (2003), Fast error whitening algorithms for system identification and control, in 'Proceedings of the International Workshop on Neural Networks for Signal Processing', IEEE Press, pp. 309–318.

Rao, Y. N. & Principe, J. (2002), Efficient total least squares method for system modeling using minor component analysis, in 'Proceedings of the International Workshop on Neural Networks for Signal Processing', IEEE Press, pp. 259–268.

Rasmussen, C. E. (1996), Evaluation of Gaussian processes and other methods for non-linear regression, PhD thesis, Department of Computer Science, University of Toronto.

Rasmussen, C. E. & Ghahramani, Z. (2002), Infinite mixtures of Gaussian processes, in 'Advances in Neural Information Processing Systems 14', MIT Press.

Rissanen, J. (1978), 'Modeling by shortest data description', Automatica 14, 465–471.

Robert, C. P. & Casella, G. (2005), Monte Carlo Statistical Methods, Springer Texts in Statistics, Springer.

Rockafellar, R. T. (1972), 'State constraints in convex control problems of Bolza', SIAM Journal on Control and Optimization 10, 691–715.

Rustagi, J. S. (1976), 'Variational methods in statistics', Mathematics in Science and Engineering 121.

Ryan, T. P. (1997), Modern Regression Methods, Wiley.

Sanguinetti, G., Milo, M., Rattray, M. & Lawrence, N. D.
(2005), 'Accounting for probe-level noise in principal component analysis of microarray data', Bioinformatics 21(19), 3748–3754.

Sato, M. (2001), 'Online model selection based on the variational Bayes', Neural Computation 13(7), 1649–1681.

Sato, M. & Ishii, S. (2000), 'Online EM algorithm for the normalized Gaussian network', Neural Computation 12(2), 407–432.

Schaal, S. & Atkeson, C. (1994), Memory-based robot learning, in 'Proceedings of the IEEE International Conference on Robotics and Automation', IEEE Press, pp. 2928–2933.

Schaal, S. & Atkeson, C. G. (1998), 'Constructive incremental learning from only local information', Neural Computation 10(8), 2047–2084.

Schaal, S., Vijayakumar, S. & Atkeson, C. (1998), Local dimensionality reduction, in M. Jordan, M. Kearns & S. Solla, eds, 'Proceedings of Advances in Neural Information Processing Systems 11', MIT Press.

Schick, I. C. & Mitter, S. K. (1994), 'Robust recursive estimation in the presence of heavy-tailed observation noise', Annals of Statistics 22(2), 1045–1080.

Schmidt, A. M. & O'Hagan, A. (2003), 'Bayesian inference for non-stationary spatial covariance structure via spatial deformations', Journal of the Royal Statistical Society, Series B 65, 745–758.

Schucany, W. (1995), 'Adaptive bandwidth choice for kernel regression', Journal of the American Statistical Association 90, 535–540.

Sciavicco, L. & Siciliano, B. (1996), Modeling and Control of Robot Manipulators, McGraw-Hill.

Scott, D. W. (2005), Outlier detection and clustering by partial mixture modeling, in 'Proceedings in Computational Statistics: 16th Symposium', Physica-Verlag, Heidelberg, pp. 453–465.

Sergio, L. & Kalaska, J. (1998), 'Changes in the temporal pattern of primary motor cortex activity in a directional isometric force versus limb movement task', Journal of Neurophysiology 80, 1577–1583.

Shevade, S. & Keerthi, S. (2003), 'A simple and efficient algorithm for gene selection using sparse logistic regression', Bioinformatics 19(17), 2246–2253.
Shi, J. Q., Murray-Smith, R. & Titterington, D. M. (2005), 'Hierarchical Gaussian process mixtures for regression', Statistics and Computing 15(1), 31–41.

Silverman, B. W. (1985), 'Some aspects of the spline smoothing approach to non-parametric regression curve fitting', Journal of the Royal Statistical Society, Series B 47, 1–52.

Smith, A. F. M. & West, M. (1983), 'Monitoring renal transplants: an application of the multiprocess Kalman filter', Biometrics 39, 867–878.

Smola, A. & Scholkopf, B. (2000), Sparse greedy matrix approximation for machine learning, in 'Proceedings of the 17th International Conference on Machine Learning', Morgan Kaufmann.

Snelson, E. & Ghahramani, Z. (2006a), Sparse Gaussian processes using pseudo-inputs, in 'Proceedings of Advances in Neural Information Processing Systems 18', Vol. 18, MIT Press.

Snelson, E. & Ghahramani, Z. (2006b), Sparse Gaussian processes using pseudo-inputs, in 'Advances in Neural Information Processing Systems 18', MIT Press.

Snelson, E. & Ghahramani, Z. (2006c), Variable noise and dimensionality reduction for sparse Gaussian processes, in 'Proceedings of Uncertainty in Artificial Intelligence 22', Vol. 22, AUAI Press.

Sorenson, H. W. & Alspach, D. L. (1971), 'Recursive Bayesian estimation using Gaussian sums', Automatica 7, 467–479.

Stone, M. (1974), 'Cross-validatory choice and assessment of statistical predictions', Journal of the Royal Statistical Society, Series B 36(1), 111–147.

Stone, M. & Brooks, R. J. (1990), 'Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression', Journal of the Royal Statistical Society, Series B 52(2), 237–269.

Strassen, V. (1969), 'Gaussian elimination is not optimal', Numerische Mathematik 13, 354–356.

Taskar, B., Chatalbashev, V., Koller, D. & Guestrin, C. (2005), Learning structured prediction models: a large margin approach, in 'International Conference on Machine Learning', pp. 896–903.
Taylor, D., Tillery, S. & Schwartz, A. (2002), 'Direct cortical control of 3D neuroprosthetic devices', Science 296, 1829–1832.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the LASSO', Journal of the Royal Statistical Society, Series B 58(1), 267–288.

Ting, J., D'Souza, A. & Schaal, S. (2006), Bayesian regression with input noise for high dimensional data, in 'Proceedings of the 23rd International Conference on Machine Learning', ACM Press, pp. 937–944.

Ting, J., D'Souza, A. & Schaal, S. (2007), Automatic outlier detection: a Bayesian approach, in 'IEEE International Conference on Robotics and Automation'.

Ting, J., D'Souza, A., Yamamoto, K., Yoshioka, T., Hoffman, D., Kakei, S., Sergio, L., Kalaska, J., Kawato, M., Strick, P. & Schaal, S. (2005), Predicting EMG data from M1 neurons with variational Bayesian least squares, in 'Proceedings of Advances in Neural Information Processing Systems 18', MIT Press.

Ting, J., D'Souza, A., Yamamoto, K., Yoshioka, T., Hoffman, D., Kakei, S., Sergio, L., Kalaska, J., Kawato, M., Strick, P. & Schaal, S. (2008), 'Variational Bayesian least squares: an application to brain-machine interface data', Neural Networks: Special Issue on Neuroinformatics 21(8), 1112–1131.

Ting, J., Kalakrishnan, M., Vijayakumar, S. & Schaal, S. (2008), Bayesian kernel shaping for learning control, in 'Proceedings of Advances in Neural Information Processing Systems 21', MIT Press.

Ting, J., Mistry, M., Peters, J., Schaal, S. & Nakanishi, J. (2006), A Bayesian approach to nonlinear parameter identification for rigid body dynamics, in 'Proceedings of Robotics: Science and Systems'.

Ting, J., Theodorou, E. & Schaal, S. (2007), Learning an outlier-robust Kalman filter, in 'European Conference on Machine Learning', Vol. 4701 of Lecture Notes in Computer Science, Springer, pp. 748–756.

Tipping, M. E. (2001), 'Sparse Bayesian learning and the Relevance Vector Machine', Journal of Machine Learning Research 1, 211–244.

Todorov, E.
(2000), 'Direct cortical control of muscle activation in voluntary arm movements: a model', Nature Neuroscience 3, 391–398.

Tresp, V. (2000), Mixtures of Gaussian processes, in 'Advances in Neural Information Processing Systems 13', MIT Press.

Tukey, J. W. (1960), A survey of sampling from contaminated distributions, in I. Olkin, ed., 'Contributions to Probability and Statistics', Stanford University Press, pp. 448–485.

Van Huffel, S. & Lemmerling, P. (2002), Total Least Squares and Errors-in-Variables Modeling: Analysis, Algorithms and Applications, Kluwer Academic Publishers, Dordrecht, Netherlands.

Van Huffel, S. & Vanderwalle, J. (1991), 'The total least squares problem: computational aspects and analysis', Society for Industrial and Applied Mathematics.

Vijayakumar, S., D'Souza, A. & Schaal, S. (2005), 'Incremental online learning in high dimensions', Neural Computation 17, 1–33.

Wainwright, M. J., Jaakkola, T. & Willsky, A. S. (2002), A new class of upper bounds on the log partition function, in 'Proceedings of Uncertainty in Artificial Intelligence 18', Vol. 18, Morgan Kaufmann.

Wessberg, J. & Nicolelis, M. (2004), 'Optimizing a linear algorithm for real-time robotic control using chronic cortical ensemble recordings in monkeys', Journal of Cognitive Neuroscience 16, 1022–1035.

West, M. (1981), 'Robust sequential approximate Bayesian estimation', Journal of the Royal Statistical Society, Series B 43, 157–166.

West, M. (1982), Aspects of recursive Bayesian estimation, PhD thesis, Department of Mathematics, University of Nottingham.

Williams, C. K. I. & Rasmussen, C. E. (1995), Gaussian processes for regression, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Proceedings of Advances in Neural Information Processing Systems 8', Vol. 8, MIT Press.

Wold, H. (1975), Soft modeling by latent variables: the nonlinear iterative partial least squares approach, in J. Gani, ed., 'Perspectives in Probability and Statistics, Papers in Honor of M. S. Bartlett', Academic Press.
Wolpaw, J. & McFarland, D. (2004), 'Control of a two-dimensional movement signal by a noninvasive brain-computer interface in humans', Proceedings of the National Academy of Sciences 101, 17849–17854.

Shen, Y., Ng, A. & Seeger, M. (2006), Fast Gaussian process regression using KD-trees, in 'Proceedings of Advances in Neural Information Processing Systems 18', Vol. 18, MIT Press.

Yedidia, J. S., Freeman, W. T. & Weiss, Y. (2001), Bethe free energy, Kikuchi approximations and belief propagation algorithms, Technical Report TR2001-16, Mitsubishi Electric Research Laboratories.

Appendix A

Variational Bayesian Least Squares

A.1 Derivation of Variational Bayesian Least Squares

Consider the complete log likelihood of the model in Section 2.2.2, i.e., Eq. (2.15). Maximizing the lower bound on the log likelihood amounts to maximizing the functional F(Q, φ) over the space of probability distributions Q(v_H), where v_H denotes the unobserved random variables. The calculus of variations, e.g., (Rustagi 1976), allows general updates to be derived for distributions that are individually factored. For the case where we assume individually factored distributions over the random variables:

Q(v_H) = Q_1(v_1) Q_2(v_2) ··· Q_n(v_n)

the solution for the individual posterior distributions that maximizes the functional F(Q, φ) under this factorization assumption is:

Q_j(v_j) = exp⟨log p(x_D, v_H; φ)⟩_{Q_{k≠j}} / ∫ exp⟨log p(x_D, v_H; φ)⟩_{Q_{k≠j}} dv_j

or equivalently:

log Q_j(v_j) = ⟨log p(x_D, v_H; φ)⟩_{Q_{k≠j}} + const_{v_j}   (A.1)

where x_D is the observed data, ⟨·⟩_{Q_{k≠j}} denotes the expectation taken with respect to all distributions Q_k except Q_j, and j = 1, 2, ..., n.

For the scenario where we do not make a fully factorized assumption on v_H and instead assume, for example:

Q(v_H) = Q_12(v_1, v_2) Q_3(v_3) ··· Q_n(v_n)

we would need to infer the joint distribution Q_12(v_1, v_2). One approach is to assume that the joint distribution factorizes as Q_1(v_1|v_2) Q_2(v_2) Q_3(v_3) ··· Q_n(v_n).
The resulting posterior distribution for Q_1(v_1|v_2) can be derived using Eq. (A.1) above. However, when maximizing the functional with respect to Q_2, we would need to include the terms that depend on v_2 in the numerator (or, equivalently, we could use Eq. (A.1) and then marginalize out v_1):

Q_2(v_2) ∝ exp( ⟨log p(x_D, v_H; φ)⟩_{Q_{k≠2}} − ∫ Q_1(v_1|v_2) log Q_1(v_1|v_2) dv_1 )

or, equivalently:

log Q_2(v_2) = ⟨log p(x_D, v_H; φ)⟩_{Q_{k≠2}} + Entropy{Q_1(v_1|v_2)} + const   (A.2)

With the above in mind, we can derive the final EM updates for the model in Section 2.2.2, assuming that:

Q(b, α, Z) = Q(b, α) Q(Z) = Q(b|α) Q(α) Q(Z)

We can then infer the posterior distributions analytically:

• For Q(b, α):

log Q(b, α) = (N/2) Σ_{m=1}^d log α_m − Σ_{m=1}^d (α_m/(2ψ_zm)) Σ_{i=1}^N ⟨(z_im − b_m x_im)²⟩
+ (1/2) Σ_{m=1}^d log α_m − (1/2) Σ_{m=1}^d α_m b_m²
+ Σ_{m=1}^d (a_{αm,0} − 1) log α_m − Σ_{m=1}^d b_{αm,0} α_m + const_{α,b}

From the above equation, we can deduce:

Q(b, α) = Q(α) Π_{m=1}^d Q(b_m|α_m)
Q(b_m|α_m) = Normal( ⟨b_m|α_m⟩, σ²_{bm|αm} )
σ²_{bm|αm} = (ψ_zm/⟨α_m⟩) ( Σ_{i=1}^N x_im² + ψ_zm )^{-1}
⟨b_m|α_m⟩ = ( Σ_{i=1}^N x_im² + ψ_zm )^{-1} ( Σ_{i=1}^N ⟨z_im⟩ x_im )   (A.3)

Note that the posterior mean of b_m|α_m in Eq. (A.3) is independent of α_m.
As a result, we can rewrite the conditional posterior of b_m as:

log Q(b, α) = −(1/2) Σ_{m=1}^d [ (1/σ²_{bm|αm})(b_m − ⟨b_m|α_m⟩)² − ⟨b_m|α_m⟩²/σ²_{bm|αm} + (α_m/ψ_zm) Σ_{i=1}^N ⟨z_im²⟩ ]
+ Σ_{m=1}^d [ (a_{αm,0} + N/2 + 1/2 − 1) log α_m − b_{αm,0} α_m ] + const_{α,b}

Taking the exponent of the above expression and integrating out b, we can derive the following posterior distribution for α:

log Q(α) = −(1/2) Σ_{m=1}^d [ −log 2πσ²_{bm|αm} + (α_m/ψ_zm) Σ_{i=1}^N ⟨z_im²⟩ − ⟨b_m|α_m⟩²/σ²_{bm|αm} ] + Σ_{m=1}^d [ (a_{αm,0} + (N+1)/2 − 1) log α_m − b_{αm,0} α_m ] + const_α
= −(1/2) Σ_{m=1}^d [ (α_m/ψ_zm) Σ_{i=1}^N ⟨z_im²⟩ − ⟨b_m|α_m⟩²/σ²_{bm|αm} ] + Σ_{m=1}^d [ (a_{αm,0} + N/2 − 1) log α_m − b_{αm,0} α_m ] + const_α

From this expression, we can infer that the posterior distribution over α is:

Q(α) = Π_{m=1}^d Gamma( â_αm, b̂_αm )
â_αm = a_{αm,0} + N/2
b̂_αm = b_{αm,0} + (1/(2ψ_zm)) [ Σ_{i=1}^N ⟨z_im²⟩ − ( Σ_{i=1}^N x_im² + ψ_zm )^{-1} ( Σ_{i=1}^N ⟨z_im⟩ x_im )² ]

• For Q(Z):

log Q(Z) = −(1/(2ψ_y)) Σ_{i=1}^N ⟨(y_i − 1^T z_i)²⟩ − Σ_{m=1}^d (⟨α_m⟩/(2ψ_zm)) Σ_{i=1}^N ⟨(z_im − b_m x_im)²⟩

From the above, we can infer the posterior distribution of Z to be:

Q(Z) = Π_{i=1}^N Normal( ⟨z_i⟩, Σ_z )
Σ_z = ( (1/ψ_y) 11^T + Ψ_z^{-1}⟨A⟩ )^{-1} = Ψ_z⟨A⟩^{-1} − Ψ_z⟨A⟩^{-1} 11^T Ψ_z⟨A⟩^{-1} / ( ψ_y + 1^T Ψ_z⟨A⟩^{-1} 1 )
⟨z_i⟩ = Σ_z ( (1/ψ_y) 1 y_i + Ψ_z^{-1}⟨A⟩⟨B|A⟩ x_i )
= ( Ψ_z⟨A⟩^{-1} 1 / ( ψ_y + 1^T Ψ_z⟨A⟩^{-1} 1 ) ) y_i + ( ⟨B|A⟩ − Ψ_z⟨A⟩^{-1} 11^T ⟨B|A⟩ / ( ψ_y + 1^T Ψ_z⟨A⟩^{-1} 1 ) ) x_i

Componentwise:

⟨z_im⟩ = ⟨b_m|α_m⟩ x_im + (1/s)(ψ_zm/⟨α_m⟩)( y_i − ⟨b|α⟩^T x_i )
⟨z_im²⟩ = ψ_zm/⟨α_m⟩ − (1/s)(ψ_zm/⟨α_m⟩)² + ⟨z_im⟩²
1^T⟨z_i⟩ = (1/s)( Σ_{m=1}^d ψ_zm/⟨α_m⟩ ) y_i + ( 1 − (1/s) Σ_{m=1}^d ψ_zm/⟨α_m⟩ ) ⟨b|α⟩^T x_i
1^T⟨z_i z_i^T⟩1 = Σ_{m=1}^d ψ_zm/⟨α_m⟩ − (1/s)( Σ_{m=1}^d ψ_zm/⟨α_m⟩ )² + ( 1^T⟨z_i⟩ )²

where Ψ_z^{-1}⟨A⟩ = diag[⟨α_m⟩/ψ_zm] and s = ψ_y + 1^T Ψ_z⟨A⟩^{-1} 1.

A.2 Pseudocode for Variational Bayesian Least Squares

The pseudocode for VBLS is listed below in Algorithm 1. To know when to stop iterating through the EM-based algorithm, we should monitor the incomplete log likelihood and stop when its value appears to have converged.
However, since the calculation of the true posterior distribution Q(α, b, Z) is intractable, we cannot determine the true incomplete log likelihood. Hence, for the purpose of monitoring the incomplete log likelihood in the EM algorithm, we monitor a lower bound of the incomplete log likelihood instead. In the derivation of VBLS, we approximated Q(θ), where θ = {α, b, Z}, as Q(α, b)Q(Z). Using this variational approximation, we can derive the lower bound to the incomplete log likelihood, where φ = {ψ_y, ψ_z}, to be:

log p(y|X; φ) ≥ ∫ Q(θ) log [ p(y, θ|X; φ) / Q(θ) ] dθ
= ∫ Q(θ) log p(y, θ|X; φ) dθ − ∫ Q(θ) log Q(θ) dθ
= ⟨log p(y, θ|X; φ)⟩_{Q(θ)} − ∫ Q(θ) log Q(θ) dθ   (A.4)

where Eq. (A.4) simplifies to:

log p(y|X; φ) ≥ −(N/2) log ψ_y − (1/(2ψ_y)) Σ_{i=1}^N ( y_i² − 2y_i 1^T⟨z_i⟩ + 1^T⟨z_i z_i^T⟩1 )
− (N/2) Σ_{m=1}^d log ψ_zm − Σ_{m=1}^d (⟨α_m⟩/(2ψ_zm)) Σ_{i=1}^N ( ⟨z_im²⟩ − 2⟨z_im⟩⟨b_m|α_m⟩x_im + ⟨(b_m|α_m)²⟩x_im² )
− (1/2) Σ_{m=1}^d ⟨α_m⟩⟨(b_m|α_m)²⟩ − ((N−1)/2) Σ_{m=1}^d ( log b̂_αm − â_αm )
− (1/2) log|Σ_z^{-1}| − Σ_{m=1}^d log b̂_αm + (1/2) Σ_{m=1}^d ( ⟨α_m⟩σ²_{bm|αm} + 1 ) + const_{y,φ}   (A.5)

We stop iterating when the lower bound to the incomplete log likelihood has converged, i.e., when a certain likelihood tolerance, t, has been reached. Additionally, note that the input and output data are assumed to be centered (i.e., to have a mean of 0) before the data set is analyzed with VBLS.

0: Initialization: a_{α,0} = 10^{-8}·1, b_{α,0} = 10^{-8}·1; threshold value t = 10^{-6} for the lower bound to the incomplete log likelihood
1: Start EM iterations:
2: repeat
3:   Perform the E-step: calculate Eqs. (2.16) to (2.22)
4:   Perform the M-step: calculate Eqs. (2.23) and (2.24)
5:   Monitor the lower bound to the incomplete log likelihood, Eq. (A.5), to see if the likelihood tolerance t has been reached
6: until convergence of Eq. (A.5)
Algorithm 1: Pseudocode for variational Bayesian least squares

A.3 EM Update Equations for Real-time Implementation

The incremental EM update equations for the kth time step, when data sample {x_k, y_k} is available, are listed below.
Note that ψ_y^(k−1) denotes the value of ψ_y at the (k−1)th time step; the same notation is used for all other parameters.

E-step:

Σ_z^(k) = Ψ_z^(k−1)⟨A⟩^{-1(k−1)} − Ψ_z^(k−1)⟨A⟩^{-1(k−1)} 11^T Ψ_z^(k−1)⟨A⟩^{-1(k−1)} / s^(k)   (A.6)
⟨z⟩^(k) = ( Ψ_z^(k−1)⟨A⟩^{-1(k−1)} 1 / s^(k) ) y_k + ⟨B⟩^(k−1) x_k − ( Ψ_z^(k−1)⟨A⟩^{-1(k−1)} 11^T ⟨B⟩^(k−1) / s^(k) ) x_k   (A.7)
σ²_bm^(k) = ( ψ_zm^(k−1)/⟨α_m⟩^(k−1) ) ( Σ_{x²}^(k) + ψ_zm^(k−1) )^{-1}   (A.8)
⟨b_m⟩^(k) = ( Σ_{x²}^(k) / ( Σ_{x²}^(k) + ψ_zm^(k−1) ) ) ⟨b_m⟩^(k−1) + ( ψ_zm^(k−1)/⟨α_m⟩^(k−1) ) ( Σ_{yx}^(k) − Σ_{xxT}^(k)⟨b⟩^(k−1) ) / ( ( Σ_{x²}^(k) + ψ_zm^(k−1) ) s^(k) )   (A.9)
⟨α_m⟩^(k) = ( a_{αm,0} + 0.5 Σ_N^(k) ) / ( b_{αm,0} + (1/(2ψ_zm^(k−1))) ( Σ_N^(k) Σ_z^(k) + Σ_{⟨z²⟩}^(k) − ( Σ_{x²}^(k) + ψ_zm^(k−1) )^{-1} (Σ_{⟨z⟩x}^(k))² ) )   (A.10)

M-step:

ψ_y^(k) = (1/Σ_N^(k)) [ ( 1 − 1^T Ψ_z^(k−1)⟨A⟩^{-1(k−1)} 1 / s^(k) )² ( Σ_{y²}^(k) − 2 ⟨b⟩^{(k−1)T} Σ_{yx}^(k) + ⟨b⟩^{(k−1)T} Σ_{xxT}^(k) ⟨b⟩^(k−1) ) + 1^T Σ_z^(k) 1 ]   (A.11)
ψ_zm^(k) = ( 1/⟨α_m⟩^(k) ) ( ψ_zm^(k−1)/s^(k) )² ( Σ_{y²}^(k) − 2 ⟨b⟩^{(k)T} Σ_{yx}^(k) + ⟨b⟩^{(k)T} Σ_{xxT}^(k) ⟨b⟩^(k) ) / Σ_N^(k) + ⟨α_m⟩^(k) σ²_bm^(k) Σ_{x²}^(k) / Σ_N^(k) + ⟨α_m⟩^(k) Σ_z^(k)   (A.12)

where A^{-1} = ⟨A⟩^{-1} (since A is a diagonal matrix consisting of the diagonal vector α) and:

s^(k) = ψ_y^(k−1) + 1^T Ψ_z^(k−1)⟨A⟩^{-1(k−1)} 1   (A.13)

Additionally, the sufficient statistics used in Eqs. (A.6) to (A.12) are:

Σ_{x²}^(k) = Σ_{x²}^(k−1) + x_km²
Σ_{y²}^(k) = Σ_{y²}^(k−1) + y_k²
Σ_{yx}^(k) = Σ_{yx}^(k−1) + y_k x_km
Σ_{xxT}^(k) = Σ_{xxT}^(k−1) + x_k x_k^T
Σ_N^(k) = λ Σ_N^(k−1) + 1
Σ_{⟨z⟩x}^(k) = λ Σ_{⟨z⟩x}^(k−1) + ⟨z_km⟩ x_km
Σ_{⟨z²⟩}^(k) = λ Σ_{⟨z²⟩}^(k−1) + ⟨z_km⟩²

with certain sufficient statistics discounted by λ, as necessary.

Appendix B
EMG Prediction from Neural Data: Plots

B.1 Construction of Cross-validation Sets

The next three figures, Figures B.1 to B.3, show how cross-validation sets are constructed for each of the data sets discussed in Section 2.4.2. In Figure B.1, for each type of force applied by the monkey to the manipulandum, there are eight possible directions in which the manipulandum could have been moved.
The data is from (Sergio & Kalaska 1998) and involves neural firing from the M1 cortex. Each circle shown in the figure is partitioned into eight equal portions, corresponding to the eight directional movements and numbered in increasing order (clockwise) starting from one.

In Figure B.2, there are also eight possible directional movements for each of the three wrist positions¹. The data is from Kakei et al. (1999) and involves neural firing data from the M1 cortex. As with the previous figure, each circle is partitioned into eight equal sections, corresponding to the eight directional movements and numbered in increasing order (clockwise) starting from one.

Figure B.3 shows the construction of cross-validation sets, done in a similar fashion to Figure B.2, except that it involves neural firing data from the PM cortex, taken from the Kakei et al. (2001) data set.

¹ The three possible wrist positions are i) supinated, ii) pronated and iii) midway in between the two positions.

[Figure B.1: Cross-validation splits for the Sergio & Kalaska (1998) M1 neural data set; panels (a)-(h) show cross-validation sets 1-8.]
[Figure B.2: Cross-validation splits for the Kakei et al. (1999) M1 neural data set; panels (a)-(f) show cross-validation sets 1-6.]
[Figure B.3: Cross-validation splits for the Kakei et al. (2001) PM neural data set; panels (a)-(f) show cross-validation sets 1-6.]

B.2 Plots of Percentage Matches with Baseline Study

The following three plots, Figures B.4 to B.6, show the percentage matches of neurons found to be relevant by each algorithm, as compared to those found by the baseline study (ModelSearch).
Percentage matches are shown for each muscle for all three neural data sets.

[Figure B.4: Percentage of M1 neuron matches found by each algorithm (STEP, PLS, LASSO, VBLS), as compared to those found by the baseline study (ModelSearch), shown for each muscle in the Sergio & Kalaska (1998) data set.]
[Figure B.5: Percentage of M1 neuron matches found by each algorithm (STEP, PLS, LASSO, VBLS), as compared to those found by the baseline study (ModelSearch), shown for each muscle in the Kakei et al. (1999) data set.]
[Figure B.6: Percentage of PM neuron matches found by each algorithm (STEP, PLS, LASSO, VBLS), as compared to those found by the baseline study (ModelSearch), shown for each muscle in the Kakei et al. (2001) data set.]

B.3 Real-time Analysis for Brain-Machine Interfaces

[Figure B.7: Coefficient of determination values, r² = 1 − nMSE (panel a), and number of relevant M1 neurons found (panel b) by VBLS—both the batch and real-time versions—for the Sergio & Kalaska (1998) data set. For the real-time, incremental version of VBLS, the number of relevant neurons found in the last time step is shown.]

B.4 Plots of EMG Traces and Velocities

The following six plots show the EMG traces and (x, y) velocities for various muscles under certain wrist conditions in the Kakei et al. (1999) and Kakei et al. (2001) data sets.

The first two plots, Figures B.8(a) and B.9(a), show the observed and predicted EMG traces for the ECRB muscle while in the supinated wrist condition, given firing from M1 and PM neurons, respectively.
For each figure, the center plot shows the trajectories in eight different directions—in the (x, y) plane—taken by the hand. Each of the eight plots surrounding the center plot shows the EMG traces over time for each hand trajectory, illustrating: i) the observed averaged EMG activity, and ii) the predicted EMG activity, as obtained by VBLS using data from all conditions.

The third and fourth figures, Figures B.10(a) and B.11(a), show the observed and predicted velocities in the x and y directions, given M1 neural firing, while the hand is in the supinated wrist condition. These figures are for the Kakei et al. (1999) M1 neural data set. As with the first two figures, the center plot shows the trajectories in eight different directions—in the (x, y) plane—taken by the hand. Each of the eight plots surrounding this center plot shows the velocities (in m/sec) over time for each hand trajectory, illustrating: i) the observed velocities, and ii) the predicted velocities, as obtained by VBLS using data from all conditions.

Finally, the last two figures, Figures B.12(a) and B.13(a), show observed and predicted velocities in the x and y directions similar to those of Figures B.10(a) and B.11(a), except that these correspond to neural firing in the PM cortex. The neural data set is from Kakei et al. (2001).

[Figure B.8: Observed vs. predicted EMG traces (eight movement directions) for the ECRB muscle in the supinated wrist condition from the Kakei et al. (1999) M1 neural data set. VBLS-full is the result of applying VBLS on the entire data set.]
[Figure B.9: Observed vs. predicted EMG traces (eight movement directions) for the ECRB muscle in the supinated wrist condition from the Kakei et al. (2001) PM neural data set. VBLS-full is the result of applying VBLS on the entire data set.]
[Figure B.10: Observed vs. predicted velocities in the x direction for the supinated wrist condition from the Kakei et al. (1999) M1 neural data set.]
[Figure B.11: Observed vs. predicted velocities in the y direction for the supinated wrist condition from the Kakei et al. (1999) M1 neural data set.]
[Figure B.12: Observed vs. predicted velocities in the x direction for the supinated wrist condition from the Kakei et al. (2001) PM neural data set.]
[Figure B.13: Observed vs. predicted velocities in the y direction for the supinated wrist condition from the Kakei et al. (2001) PM neural data set.]

Appendix C
Bayesian Factor Analysis for Regression

C.1 Factor Analysis for Regression

Factor analysis, e.g., (Ghahramani & Hinton 1997), is a latent variable model used to detect structure in the relationships between variables and to reduce the number of variables. The observed data y_i ∈ R^{p×1} is assumed to be generated from K latent variables, or factors, t_i:

y_i = W t_i + ε_i

for all i, where 1 ≤ i ≤ N and N is the number of samples observed in the data set. The number of latent variables K is unknown and can be found, for example, by selecting the number of eigenvalues greater than 1. If the latent variables and the noise are independently distributed as:

t_i ∼ Normal(0, I),  ε_i ∼ Normal(0, Ψ),

then the factor loadings W and the diagonal noise variance matrix Ψ can be estimated using the EM algorithm (Ghahramani & Hinton 1997) or Bayesian techniques (Ghahramani & Beal 2000).
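The E-step of the EM algorithm just mentioned requires the posterior mean of the factors given an observation, ⟨t_i|y_i⟩ = W^T(WW^T + Ψ)^{-1}y_i, which, by the matrix inversion lemma, can equivalently be computed as (I + W^TΨ^{-1}W)^{-1}W^TΨ^{-1}y_i at the cost of only a K×K inversion. A minimal pure-Python sketch checks that the two forms agree; all numbers below are made-up example values (p = 2 observed dimensions, K = 1 factor), not from the thesis:

```python
# Check of the factor-analysis E-step posterior mean <t|y> for K = 1, p = 2.
# Variable names (w, psi, y) are illustrative only.

w = [2.0, 1.0]      # factor loadings W (p x 1, stored as a list)
psi = [0.5, 1.0]    # diagonal entries of the noise covariance Psi
y = [1.0, -0.5]     # one observed sample

# Full form: <t|y> = W^T (W W^T + Psi)^{-1} y, with the 2x2 inverse by hand.
S = [[w[0] * w[0] + psi[0], w[0] * w[1]],
     [w[1] * w[0], w[1] * w[1] + psi[1]]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det, S[0][0] / det]]
t_full = sum(w[i] * sum(Sinv[i][j] * y[j] for j in range(2)) for i in range(2))

# Low-dimensional form: <t|y> = (1 + W^T Psi^{-1} W)^{-1} W^T Psi^{-1} y,
# which only requires inverting a K x K (here, scalar) quantity.
wPw = sum(w[i] * w[i] / psi[i] for i in range(2))
wPy = sum(w[i] * y[i] / psi[i] for i in range(2))
t_low = wPy / (1.0 + wPw)

print(t_full, t_low)  # both 0.35: the two forms agree
```

The low-dimensional form never inverts more than a K×K matrix, which is what makes the joint-space approach attractive when K is much smaller than the observed dimensionality.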
In factor analysis for regression (or joint-space factor analysis, since both x and y are considered jointly), we define a vector z that contains both the input data x ∈ R^{d×1} and the scalar output data y:

z ≡ [ x ; y ],  W ≡ [ W_x ; W_y ],  Ψ ≡ [ Ψ_x 0 ; 0^T ψ_y ]

where W_x, W_y are the factor loadings for x and y, respectively, and Ψ_x, ψ_y are the noise variances for x and y, respectively. Once we estimate W and Ψ for the joint data space of z, we can condition y on x and marginalize out the latent variables t. The joint distribution of the observed and latent variables has the following mean and covariance:

⟨[ z ; t ]⟩ = 0,  Cov[ z ; t ] = [ WW^T + Ψ , W ; W^T , I ]

The covariance matrix can then be partitioned as follows:

Cov[ x ; y ; t ] = [ W_xW_x^T + Ψ_x , W_xW_y^T , W_x ; W_yW_x^T , W_yW_y^T + ψ_y , W_y ; W_x^T , W_y^T , I ]

leading to the following conditional distribution:

⟨[ y ; t ] | x⟩ = [ W_yW_x^T ; W_x^T ] ( W_xW_x^T + Ψ_x )^{-1} x

Marginalizing out t, we obtain:

⟨y|x⟩ = W_yW_x^T ( W_xW_x^T + Ψ_x )^{-1} x = W_y ( I + W_x^TΨ_x^{-1}W_x )^{-1} W_x^TΨ_x^{-1} x ≡ b_JFR^T x

where ⟨·⟩ indicates expectation and the matrix inversion lemma has been applied in the second equality. Hence:

b_JFR = Ψ_x^{-1}W_x ( I + W_x^TΨ_x^{-1}W_x )^{-1} W_y^T   (C.1)

Note that the required matrix inversion of I + W_x^TΨ_x^{-1}W_x is of the order of the latent dimensionality K, which makes joint-space factor analysis for regression computationally attractive for problems in which the underlying latent variable manifold is known to be relatively low dimensional (i.e., K ≪ d).

C.2 Derivation of Bayesian Joint Factor Analysis

To derive the final EM update equations for the model in Section 3.2.2, we can proceed in a similar fashion as described in Appendix A.1. The complete log likelihood expression in Eq.
(3.9) can be expanded to:

log p(y, Z, T, w_z, w_x, α|X)
= −(N/2) log ψ_y − (1/(2ψ_y)) Σ_{i=1}^N ( y_i − 1^T z_i )²
− (N/2) Σ_{m=1}^d log ψ_zm − Σ_{m=1}^d (1/(2ψ_zm)) Σ_{i=1}^N ( z_im − t_im w_zm )²
− (N/2) Σ_{m=1}^d log ψ_xm − Σ_{m=1}^d (1/(2ψ_xm)) Σ_{i=1}^N ( x_im − t_im w_xm )²
+ (1/2) Σ_{m=1}^d log α_m − (1/2) Σ_{m=1}^d α_m w_zm² + (1/2) Σ_{m=1}^d log α_m − (1/2) Σ_{m=1}^d α_m w_xm²
− (1/2) Σ_{m=1}^d Σ_{i=1}^N t_im² + Σ_{m=1}^d ( a_{αm,0} − 1 ) log α_m − Σ_{m=1}^d b_{αm,0} α_m + const

The M-step equations can be derived by maximizing the complete log likelihood with respect to each of the parameters ψ_y, ψ_zm and ψ_xm. Assuming that:

Q(α, w_z, w_x, Z, T) = Q(α) Q(w_z) Q(w_x) Q(Z, T),

we can then infer the posterior distributions of the random variables (E-step updates) analytically:

• For Q(α):

log Q(α) = Σ_{m=1}^d [ log α_m − (1/2) α_m ⟨w_zm²⟩ − (1/2) α_m ⟨w_xm²⟩ + ( a_{αm,0} − 1 ) log α_m − b_{αm,0} α_m ] + const_α
= Σ_{m=1}^d [ ( a_{αm,0} + 1 − 1 ) log α_m − ( b_{αm,0} + ( ⟨w_zm²⟩ + ⟨w_xm²⟩ )/2 ) α_m ] + const_α

Assuming Q(α) = Π_{m=1}^d Q(α_m) and Q(α_m) = Gamma( â_αm, b̂_αm ):

â_αm = a_{αm,0} + 1
b̂_αm = b_{αm,0} + ( ⟨w_zm²⟩ + ⟨w_xm²⟩ )/2   (C.2)

• For Q(w_z):

log Q(w_z) = Σ_{m=1}^d [ −(1/(2ψ_zm)) Σ_{i=1}^N ⟨( z_im − w_zm t_im )²⟩ − (1/2) ⟨α_m⟩ w_zm² ] + const_{wz}
= −(1/2) Σ_{m=1}^d [ w_zm² ( (1/ψ_zm) Σ_{i=1}^N ⟨t_im²⟩ + ⟨α_m⟩ ) − 2 w_zm (1/ψ_zm) Σ_{i=1}^N ⟨z_im t_im⟩ + (1/ψ_zm) Σ_{i=1}^N ⟨z_im²⟩ ] + const_{wz}

Assuming Q(w_z) = Π_{m=1}^d Q(w_zm) and Q(w_zm) = Normal( ⟨w_zm⟩, σ²_wzm ):

σ²_wzm = ( (1/ψ_zm) Σ_{i=1}^N ⟨t_im²⟩ + ⟨α_m⟩ )^{-1}
⟨w_zm⟩ = σ²_wzm (1/ψ_zm) Σ_{i=1}^N ⟨z_im t_im⟩   (C.3)

• For Q(w_x):

log Q(w_x) = Σ_{m=1}^d [ −(1/(2ψ_xm)) Σ_{i=1}^N ⟨( x_im − w_xm t_im )²⟩ − (1/2) ⟨α_m⟩ w_xm² ] + const_{wx}
= −(1/2) Σ_{m=1}^d [ w_xm² ( (1/ψ_xm) Σ_{i=1}^N ⟨t_im²⟩ + ⟨α_m⟩ ) − 2 w_xm (1/ψ_xm) Σ_{i=1}^N x_im ⟨t_im⟩ + (1/ψ_xm) Σ_{i=1}^N x_im² ] + const_{wx}

Assuming Q(w_x) = Π_{m=1}^d Q(w_xm) and Q(w_xm) = Normal( ⟨w_xm⟩, σ²_wxm ):

σ²_wxm = ( (1/ψ_xm) Σ_{i=1}^N ⟨t_im²⟩ + ⟨α_m⟩ )^{-1}
⟨w_xm⟩ = σ²_wxm (1/ψ_xm) Σ_{i=1}^N x_im ⟨t_im⟩
(C.4)

• For Q(η), where η = {z_i, t_i}:

log Q(η) = −(1/(2ψ_y)) Σ_{i=1}^N ⟨( y_i − 1^T z_i )²⟩ − Σ_{m=1}^d (1/(2ψ_zm)) Σ_{i=1}^N ⟨( z_im − t_im w_zm )²⟩ − Σ_{m=1}^d (1/(2ψ_xm)) Σ_{i=1}^N ⟨( x_im − t_im w_xm )²⟩ − (1/2) Σ_{m=1}^d Σ_{i=1}^N ⟨t_im²⟩

= −(1/2) Σ_{i=1}^N [ (1/ψ_y) ⟨( y_i − 1^T z_i )²⟩ + ⟨( z_i − W_z t_i )^T Ψ_z^{-1} ( z_i − W_z t_i )⟩ ] − (1/2) Σ_{i=1}^N ⟨( x_i − W_x t_i )^T Ψ_x^{-1} ( x_i − W_x t_i )⟩ − (1/2) Σ_{i=1}^N ⟨t_i^T t_i⟩

= −(1/2) Σ_{i=1}^N [ −2 y_i 1^T z_i / ψ_y + z_i^T (11^T/ψ_y) z_i + z_i^T Ψ_z^{-1} z_i − 2 z_i^T Ψ_z^{-1} ⟨W_z⟩ t_i + t_i^T ⟨W_z^T W_z⟩ Ψ_z^{-1} t_i ] − (1/2) Σ_{i=1}^N [ t_i^T ( I + ⟨W_x^T W_x⟩ Ψ_x^{-1} ) t_i − 2 t_i^T ⟨W_x⟩^T Ψ_x^{-1} x_i ]

= −(1/2) Σ_{i=1}^N [ z_i^T ( 11^T/ψ_y + Ψ_z^{-1} ) z_i + t_i^T ( I + ⟨W_x^T W_x⟩ Ψ_x^{-1} + ⟨W_z^T W_z⟩ Ψ_z^{-1} ) t_i ] − (1/2) Σ_{i=1}^N [ −2 y_i Σ_{m=1}^d z_im / ψ_y − 2 Σ_{m=1}^d x_im ⟨w_xm⟩ t_im / ψ_xm − 2 Σ_{m=1}^d z_im ⟨w_zm⟩ t_im / ψ_zm ]   (C.5)

where W_z^T Ψ_z^{-1} W_z = ⟨W_z^T W_z⟩ Ψ_z^{-1}, since Ψ_z^{-1} and W_z are both diagonal matrices, and, similarly, W_x^T Ψ_x^{-1} W_x = ⟨W_x^T W_x⟩ Ψ_x^{-1}.

Now, since we know that the joint posterior of z_i and t_i is a joint Gaussian distribution, it will take the form:

( [ z_i ; t_i ] − μ )^T Σ^{-1} ( [ z_i ; t_i ] − μ ) = [ z_i ; t_i ]^T Σ^{-1} [ z_i ; t_i ] − 2 μ^T Σ^{-1} [ z_i ; t_i ] + μ^T Σ^{-1} μ   (C.6)

Expanding out the quadratic term in [z_i t_i]^T from Eq.
(C.6), we match up the coefficients of the terms in Eq. (C.5) to find that the inverse posterior covariance matrix is:

Σ^{-1} = [ A_zz , A_zt ; A_tz , A_tt ] = [ 11^T/ψ_y + Ψ_z^{-1} , −Ψ_z^{-1}⟨W_z⟩ ; −⟨W_z⟩^TΨ_z^{-1} , I + ⟨W_x^TW_x⟩Ψ_x^{-1} + ⟨W_z^TW_z⟩Ψ_z^{-1} ]   (C.7)

To find Σ, we use the formula for the inverse of a partitioned matrix to get:

Σ = [ Σ_zz , Σ_zt ; Σ_tz , Σ_tt ] = [ ( A_zz − A_zt A_tt^{-1} A_tz )^{-1} , −Σ_zz A_zt A_tt^{-1} ; −A_tt^{-1} A_tz Σ_zz , A_tt^{-1} + A_tt^{-1} A_tz Σ_zz A_zt A_tt^{-1} ]

We can use the Sherman–Morrison–Woodbury theorem to simplify Σ_zz, noting that ⟨W_z^TW_z⟩ = diag[ ⟨w_z w_z^T⟩ + σ²_wz ] = (Σ_Wz)_mm + ⟨W_z⟩^T⟨W_z⟩:

Σ_zz^{-1} = A_zz − A_zt A_tt^{-1} A_tz
= 11^T/ψ_y + Ψ_z^{-1} − Ψ_z^{-1}⟨W_z⟩ ( I + ⟨W_x^TW_x⟩Ψ_x^{-1} + ⟨W_z^TW_z⟩Ψ_z^{-1} )^{-1} ⟨W_z⟩^TΨ_z^{-1}
= 11^T/ψ_y + Ψ_z^{-1} − Ψ_z^{-1}⟨W_z⟩ ( I + ⟨W_x^TW_x⟩Ψ_x^{-1} + (Σ_Wz)_mmΨ_z^{-1} + ⟨W_z⟩^TΨ_z^{-1}⟨W_z⟩ )^{-1} ⟨W_z⟩^TΨ_z^{-1}
= 11^T/ψ_y + ( Ψ_z + ⟨W_z⟩ ( I + ⟨W_x^TW_x⟩Ψ_x^{-1} + (Σ_Wz)_mmΨ_z^{-1} )^{-1} ⟨W_z⟩^T )^{-1}

leading to:

Σ_zz = ( 11^T/ψ_y + M^{-1} )^{-1} = M − M 11^T M / ( ψ_y + 1^T M 1 )

where:

M = Ψ_z + ⟨W_z⟩ ( I + ⟨W_x^TW_x⟩Ψ_x^{-1} + (Σ_Wz)_mmΨ_z^{-1} )^{-1} ⟨W_z⟩^T

Therefore, the full posterior covariance of z and t consists of the following block matrices:

Σ = [ Σ_zz , Σ_zt ; Σ_tz , Σ_tt ]

where:

Σ_zz = M − M 11^T M / ( ψ_y + 1^T M 1 )
Σ_zt = −Σ_zz A_zt A_tt^{-1} = Σ_zz Ψ_z^{-1}⟨W_z⟩ ( I + ⟨W_x^TW_x⟩Ψ_x^{-1} + ⟨W_z^TW_z⟩Ψ_z^{-1} )^{-1}
Σ_tz = (Σ_zt)^T
Σ_tt = ( I + ⟨W_x^TW_x⟩Ψ_x^{-1} + ⟨W_z^TW_z⟩Ψ_z^{-1} )^{-1} + A_tt^{-1} A_tz Σ_zz A_zt A_tt^{-1}
M = Ψ_z + ⟨W_z⟩ ( I + ⟨W_x^TW_x⟩Ψ_x^{-1} + (Σ_Wz)_mmΨ_z^{-1} )^{-1} ⟨W_z⟩^T   (C.8)

Notice that all of the above equations involve diagonal matrices and, as a result, require a computational complexity that is only linear in the number of dimensions. To find the means of z_i and t_i, we refer to Eq. (C.6) and, in the same manner as for the covariance matrix, match up the coefficients of the term linear in [z_i t_i]^T to get:

−2 μ^T Σ^{-1} [ z_i ; t_i ] = −2 ( y_i Σ_{m=1}^d z_im / ψ_y + Σ_{m=1}^d x_im ⟨w_xm⟩ t_im / ψ_xm )
= −2 [ y_i 1^T/ψ_y , x_i^T ⟨W_x⟩^T Ψ_x^{-1} ] [ z_i ; t_i ]

so that:

⟨z_i⟩^T = [ y_i 1^T/ψ_y , x_i^T ⟨W_x⟩^T Ψ_x^{-1} ] [ Σ_zz ; Σ_tz ]
⟨t_i⟩^T = [ y_i 1^T/ψ_y , x_i^T ⟨W_x⟩^T Ψ_x^{-1} ] [ Σ_zt ; Σ_tt ]

Therefore:

⟨z_i⟩ = (y_i/ψ_y) Σ_zz 1 + Σ_zt Ψ_x^{-1}⟨W_x⟩ x_i
⟨t_i⟩ = (y_i/ψ_y) Σ_tz 1 + Σ_tt Ψ_x^{-1}⟨W_x⟩ x_i
σ²_z = diag(Σ_zz),  σ²_t = diag(Σ_tt),  cov(z, t) = diag(Σ_zt)   (C.9)

As a final note:

⟨w_zm²⟩ = ⟨w_zm⟩² + σ²_wzm,  ⟨w_xm²⟩ = ⟨w_xm⟩² + σ²_wxm,  ⟨t_im²⟩ = ⟨t_im⟩² + σ²_t,  ⟨z_im²⟩ = ⟨z_im⟩² + σ²_z

C.3 Monitoring the Incomplete Log Likelihood

To know when to stop iterating through the EM algorithm above, we should monitor the incomplete log likelihood and stop when its value appears to have converged. To calculate the incomplete log likelihood, we would need to integrate the variables α, w_z, w_x, Z and T out of the complete log likelihood expression. But since the calculation of the true posterior distribution Q(θ) is intractable, we cannot determine the true incomplete log likelihood. Hence, for the purpose of monitoring the incomplete log likelihood in the EM algorithm, we can monitor a lower bound of the incomplete log likelihood instead. In the derivation of the EM algorithm, we approximated Q(θ), where θ = {α, w_z, w_x, Z, T}, by Q(α)Q(w_z)Q(w_x)Q(Z,T).
The lower bound to the incomplete log likelihood, where φ = {ψ_y, ψ_z, ψ_x}, is:

log p(y|X; φ) ≥ ∫ Q(θ) log [ p(y, θ|X; φ) / Q(θ) ] dθ
= ∫ Q(θ) log p(y, θ|X; φ) dθ − ∫ Q(θ) log Q(θ) dθ
= ⟨log p(y, θ|X; φ)⟩_{Q(θ)} − ∫ Q(θ) log Q(θ) dθ

and this simplifies to:

log p(y|X; φ) ≥ −(N/2) log ψ_y − (1/(2ψ_y)) Σ_{i=1}^N ( y_i² − 2y_i 1^T⟨z_i⟩ + 1^T⟨z_i z_i^T⟩1 )
− (N/2) Σ_{m=1}^d log ψ_zm − Σ_{m=1}^d (1/(2ψ_zm)) Σ_{i=1}^N ( ⟨z_im²⟩ − 2⟨w_zm⟩⟨z_im t_im⟩ + ⟨w_zm²⟩⟨t_im²⟩ )
− (N/2) Σ_{m=1}^d log ψ_xm − Σ_{m=1}^d (1/(2ψ_xm)) Σ_{i=1}^N ( x_im² − 2⟨w_xm⟩⟨t_im⟩x_im + ⟨w_xm²⟩⟨t_im²⟩ )
− (1/2) Σ_{m=1}^d Σ_{i=1}^N ⟨t_im²⟩ − (1/2) Σ_{m=1}^d ⟨α_m⟩⟨w_zm²⟩ − (1/2) Σ_{m=1}^d ⟨α_m⟩⟨w_xm²⟩
− Σ_{m=1}^d a_{αm,0} log b̂_αm − Σ_{m=1}^d b_{αm,0}⟨α_m⟩
− (1/2) log|Σ_{z,t}^{-1}| − Σ_{m=1}^d log b̂_αm + (1/2) Σ_{m=1}^d log σ²_wzm + (1/2) Σ_{m=1}^d log σ²_wxm + const   (C.10)

Note that once the EM algorithm has converged and we have the final values of the unknown random variables and point-estimated parameters, we still need to infer the regression coefficient if we want to make predictions.

Appendix D
Bayesian Local Kernel Shaping

D.1 Deriving a Lower Bound Using Convex Duality

Eq. (5.8) contains a problematic term,

−log( 1 + (x_im − x_qm)^{2r} h_m ),

that prevents us from deriving an analytically tractable expression for the posterior of h_m. We can use the variational approach for concave/convex functions suggested by Jaakkola & Jordan (2000) to find analytically tractable expressions. Specifically, we can lower-bound the term using convex duality:

−log( 1 + (x_im − x_qm)^{2r} h_m ) ≥ max_{λ_im} [ −λ_im h_m (x_im − x_qm)^{2r} − f*(λ_im) ]   (D.1)

where λ_im is a variational parameter to be optimized and f* is the dual function:

f*(λ_im) = max_{h_m} [ −λ_im h_m (x_im − x_qm)^{2r} + log( 1 + (x_im − x_qm)^{2r} h_m ) ]

We can then find the value of h_m that attains this maximum:

∂f*(λ_im)/∂h_m = 0
−(x_im − x_qm)^{2r} λ_im + (x_im − x_qm)^{2r} / ( 1 + (x_im − x_qm)^{2r} h_m ) = 0
h_m (x_im − x_qm)^{2r} = 1/λ_im − 1

Hence:

f*(λ_im) = −λ_im ( 1/λ_im − 1 ) + log( 1 + ( 1/λ_im − 1 ) ) = −1 + λ_im − log λ_im   (D.2)

Substituting the resulting expression for f*(λ_im) back into the lower bound of Eq. (D.1) using Eq.
(D.2), we get:

−λ_im h_m (x_im − x_qm)^{2r} − f*(λ_im) = −λ_im h_m (x_im − x_qm)^{2r} + 1 − λ_im + log λ_im   (D.3)

and find the optimal value of the variational parameter λ_im by maximizing Eq. (D.3):

∂/∂λ_im [ −λ_im ⟨h_m⟩ (x_im − x_qm)^{2r} + 1 − λ_im + log λ_im ] = 0
−⟨h_m⟩ (x_im − x_qm)^{2r} − 1 + 1/λ_im = 0
λ_im = 1 / ( 1 + ⟨h_m⟩ (x_im − x_qm)^{2r} )

In summary, we can lower-bound the problematic log term in Eq. (5.8) as:

−log( 1 + (x_im − x_qm)^{2r} h_m ) ≥ −λ_im h_m (x_im − x_qm)^{2r} + 1 − λ_im + log λ_im   (D.4)

where the lower bound is maximized when:

λ_im = 1 / ( 1 + ⟨h_m⟩ (x_im − x_qm)^{2r} )   (D.5)

D.2 Derivation of Bayesian Kernel Shaping

To derive the final EM update equations for the model in Section 5.1.1, we can proceed in a similar fashion as described in Appendix A.1. Assuming that:

Q(b, ψ_z, h, z) = Q(b, ψ_z) Q(h) Q(z)

we can infer the posterior distributions of the random variables (E-step updates) analytically:

• For Q(b|ψ_z): Assuming that Q(b) = Π_{m=1}^d Q(b_m):

log Q(b_m|ψ_zm) = −(1/(2ψ_zm)) Σ_{i=1}^N ⟨w_i⟩ ⟨( z_im − b_m^T x_im )²⟩ − (1/(2ψ_zm)) b_m^T Σ_{bm,0}^{-1} b_m + const_{bm|ψzm}
= −(1/(2ψ_zm)) Σ_{i=1}^N ⟨w_i⟩ ( ⟨z_im²⟩ − 2⟨z_im⟩ b_m^T x_im + b_m^T x_im x_im^T b_m ) − (1/(2ψ_zm)) b_m^T Σ_{bm,0}^{-1} b_m + const_{bm|ψzm}
+ 1 2 {log(2πψ zm Σ bm )+1}+const ψzm Given that: 1 2ψ zm N X i=1 hw i i D z im −b T m x im 2 E = 1 2ψ zm " N X i=1 hw i i z 2 im −2 N X i=1 hw i iz im x T im hb m i+ N X i=1 hw i ix T im b m b T m x im # = 1 2ψ zm N X i=1 hw i i hz im i−x T im hb m i 2 + 1 2ψ zm N X i=1 hw i ix T im ψ zm Σ bm x im + 1 2ψ zm N X i=1 hw i iΣ z i |y i ,x i 206 and − 1 2ψ zm D b T m Σ −1 bm,0 b m E =− 1 2ψ zm hb m i T Σ −1 bm,0 hb m i we then get: logQ(ψ zm ) =logψ zm n m0 + N X i=1 hw i i ! − 1 2ψ zm h n m0 ψ zmN0 +hb m i T Σ −1 bm,0 hb m i + N X i=1 hw i i hz im i−x T im hb m i 2 + N X i=1 hw i iΣ z i |y i ,x i # +const ψzm giving us: n m =n m0 + N X i=1 hw i i ψ zmN = 1 n 0 + P N i=1 hw i i h n m0 ψ zmN0 +hb m i T Σ −1 bm,0 hb m i + N X i=1 hw i i hz im i−x T im hb m i 2 + N X i=1 hw i iΣ z i |y i ,x i # (D.7) • For Q(h): Assuming Q(h)= Q d m=1 Q(h m ): logQ(h)= d X m=1 N X i=1 (1−hw i i)logh m (x im −x qm ) 2r − d X m=1 N X i=1 λ i (x im −x qm ) 2r h m + d X m=1 (a hm0 −1)logh m − d X m=1 b hm0 h m +const h = d X m=1 a hm0 +N− N X i=1 hw i i−1 ! logh m − d X m=1 b hm0 + N X i=1 λ im (x im −x qm ) 2r ! h m +const h where λ im = 1 1+hhmi(x im −xqm) 2r . From the above, we can infer the posterior mean of h m to be: hh m i= a hm0 +N − P N i=1 hw i i b hm0 + P N i=1 λ im (x im −x qm ) 2r (D.8) 207 • For Q(z i ): Inference of the posterior distribution of z i is done in a similar manner as in Ap- pendix A.1. 208
Abstract
We propose a set of Bayesian methods that help us toward the goal of autonomous learning systems. Systems that can react autonomously, with minimal human supervision, have the potential for significant impact, especially in applications with considerable uncertainty in the environment. In current algorithms, parameter values are set using heuristic techniques, statistical cross-validation or other search procedures to find the "right" values. We rely on Bayesian inference as a principled way to eliminate open parameters, resulting in a black-box-like approach.