NOISE BENEFITS IN EXPECTATION-MAXIMIZATION ALGORITHMS

by

Osonde Adekorede Osoba

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2013

Copyright 2013 Osonde Adekorede Osoba

Acknowledgements

I am deeply grateful to my advisor Professor Bart Kosko for his patience, faith, and guidance throughout the development of this work. I am indebted to him for fostering much of my academic and personal growth over the past few years. I thank the other committee members, Professor Antonio Ortega and Professor James Moore II. Their insightful feedback helped shape this dissertation. I had the privilege of collaborating closely with Kartik Audhkhasi and Professor Sanya Mitaim. I am grateful for our many late-night math sessions. Finally I would like to thank my friends and family for supporting me through my graduate career. In particular I would like to thank my friend and teacher Michael Davis and the United Capoeira Association community in Los Angeles.

Contents

1 Preview of Dissertation Results
  1.1 Noisy Expectation-Maximization
  1.2 Applications of Noisy Expectation-Maximization
  1.3 Results on Bayesian Approximation
2 The Expectation-Maximization (EM) Algorithm
  2.1 A Review of Maximum Likelihood Estimation
    2.1.1 MLE Overview
    2.1.2 A Brief History of Maximum Likelihood Methods
    2.1.3 Numerical Maximum Likelihood Estimation
  2.2 The Expectation Maximization Algorithm
    2.2.1 Motivation & Historical Overview
    2.2.2 Formulation of the EM Algorithm
    2.2.3 Complete Data Spaces for Incomplete Data Models
  2.3 EM Convergence Properties
    2.3.1 Convergence of the (G)EM Algorithm
    2.3.2 Fisher Information in EM Algorithms
  2.4 Variations on the EM Theme
    2.4.1 M-step Variations
    2.4.2 E-step Variations
    2.4.3 MAP EM for Bayesian Parameter Estimation
  2.5 Examples of EM Algorithms
    2.5.1 EM for Curved-Exponential Family Models
    2.5.2 EM for Finite Mixture Models
    2.5.3 EM for Positron Emission Tomography
  2.6 MM: A Generalization for EM
  2.7 Conclusion
3 Noisy Expectation-Maximization (NEM)
  3.1 Noise Benefits and Stochastic Resonance
    3.1.1 Noise Benefits in the EM Algorithm
    3.1.2 Intuition on EM Noise Benefits
  3.2 Noisy Expectation Maximization Theorems
    3.2.1 NEM Theorem
    3.2.2 NEM for Finite Mixture Models
    3.2.3 The Geometry of the GMM- & CMM-NEM Condition
    3.2.4 NEM for Mixtures of Jointly Gaussian Populations
    3.2.5 NEM for Models with Log-Convex Densities
  3.3 The Noisy Expectation-Maximization Algorithm
    3.3.1 NEM via Deterministic Interference
  3.4 Sample Size Effects in the NEM Algorithm
    3.4.1 Large Sample Size Effects
    3.4.2 Small Sample Size: Sparsity Effect
    3.4.3 Asymptotic NEM Analysis
  3.5 Conclusion
4 NEM Application: Clustering and Competitive Learning Algorithms
  4.1 Clustering
    4.1.1 Noisy Expectation-Maximization for Clustering
    4.1.2 GMM-EM for Clustering
    4.1.3 Naive Bayes Classifier on GMMs
    4.1.4 The Clustering Noise Benefit Theorem
  4.2 The k-Means Clustering Algorithm
    4.2.1 k-Means Clustering as a GMM-EM Procedure
    4.2.2 k-Means Clustering and Adaptive Resonance Theory
  4.3 Competitive Learning Algorithms
  4.4 Conclusion
5 NEM Application: Baum-Welch Algorithm for Training Hidden Markov Models
  5.1 Hidden Markov Models
    5.1.1 The Baum-Welch Algorithm for HMM Parameter Estimation
  5.2 NEM for HMMs: The Noise-Enhanced HMM (NHMM)
  5.3 Simulation Results
  5.4 Conclusion
6 NEM Application: Backpropagation for Training Feedforward Neural Networks
  6.1 Backpropagation Algorithm for NN Training
    6.1.1 Summary of NEM Results for Backpropagation
  6.2 Backpropagation as Maximum Likelihood Estimation
  6.3 Backpropagation as an EM Algorithm
  6.4 NEM for Backpropagation Training
    6.4.1 NEM Conditions for Neural Network ML Estimation
  6.5 Simulation Results
  6.6 Conclusion
7 Bayesian Statistics
  7.1 Introduction: The Bayesian & The Frequentist
  7.2 Bayesian Inference
    7.2.1 Conjugacy
  7.3 Bayesian Point Estimation
    7.3.1 Bayes Estimates for Different Loss Functions
    7.3.2 Measures of Uncertainty for Bayes Estimates
  7.4 Conclusion
8 Bayesian Inference with Fuzzy Function Approximators
  8.1 Bayesian Inference with Fuzzy Systems
  8.2 Adaptive Fuzzy Function Approximation
    8.2.1 SAM Fuzzy Systems
    8.2.2 The Watkins Representation Theorem
    8.2.3 ASAM Learning Laws
    8.2.4 ASAM Approximation Simulations
    8.2.5 Approximating Non-conjugate Priors
    8.2.6 The SAM Structure of Fuzzy Posteriors
    8.2.7 Other Uniform Function Approximation Methods
  8.3 Doubly Fuzzy Bayesian Inference: Uniform Approximation
    8.3.1 The Bayesian Approximation Theorem
  8.4 Conclusion
9 Hierarchical and Iterative Bayesian Inference with Function Approximators
  9.1 Approximate Hierarchical Bayesian Inference
  9.2 Uniform Approximation for Hierarchical Bayesian Inference
    9.2.1 The Extended Bayesian Approximation Theorem
    9.2.2 Adaptive Fuzzy Systems for Hierarchical Bayesian Inference
    9.2.3 Triply Fuzzy Bayesian Inference
  9.3 Semi-conjugacy in Fuzzy Posterior Approximation
  9.4 Conclusion
10 Conclusion and Future Directions
  10.1 Conclusion
  10.2 Future Work
    10.2.1 NEM for Deep Learning
    10.2.2 NEM for Genomics: DNA Motif Identification
    10.2.3 NEM for PET & SPECT
    10.2.4 NEM for Magnetic Resonance Image Segmentation
    10.2.5 Rule Explosion in Approximate Bayesian Inference
Bibliography

List of Tables

7.1 Conjugacy relationships in Bayesian inference. A prior pdf of one type combines with its conjugate likelihood to produce a posterior pdf of the same type.
7.2 Three loss functions and their corresponding Bayes estimates.
8.1 Mean squared errors for the 11 normal posterior approximations.
10.1 EM algorithm applications with possible NEM extensions.

List of Figures

2.1 Demonstration: noise injection can speed up the convergence of the EM algorithm.
2.2 Demonstration: finite mixture models may not be uniquely identifiable.
3.1 Stochastic resonance on faint images using white Gaussian pixel noise.
3.2 EM noise benefit for a Gaussian mixture model.
3.3 EM noise benefit for a Cauchy mixture model.
3.4 EM noise benefit for a 2-D Gaussian mixture model.
3.5 The geometry of a NEM set for GMM- and CMM-NEM.
3.6 An illustration of a sample NEM set for a mixture of two 2-D jointly Gaussian populations.
3.7 EM noise benefit for the log-convex censored gamma model.
3.8 GMM-NEM using deterministic interference on the data samples instead of random noise.
3.9 GMM-CEM using chaotic deterministic interference on the data samples instead of random noise.
3.10 Probability of satisfying the NEM sufficient condition with different sample sizes M and at different noise standard deviations $\sigma_N$.
3.11 Comparing the effect of the NEM noise sampling model on GMM-EM at different sample sizes M.
3.12 Comparing the effects of noise injection via simulated annealing vs. noise injection via NEM on GMM-EM.
3.13 Noise benefits and sparsity effects in the Gaussian mixture NEM at different sample sizes M.
4.1 Noise benefit for classification accuracy on a GMM-EM model.
4.2 Noise benefit for the convergence speed of a k-clustering procedure.
4.3 Noise benefit in the convergence time of Unsupervised Competitive Learning (UCL).
5.1 Training a noisy hidden Markov model.
5.2 Noisy hidden Markov model training converges in fewer iterations than regular hidden Markov model training.
6.1 A 3-layer feedforward artificial neural network.
6.2 Comparison between backpropagation and NEM-BP using the squared-error cost function.
6.3 Geometry of NEM noise for cross-entropy backpropagation.
6.4 Geometry of NEM noise for least-squares backpropagation.
6.5 Comparison between backpropagation and NEM-BP using a cross-entropy cost function.
8.1 Probabilistic graphical model for Bayesian data models in Chapter 8.
8.2 Five fuzzy if-then rules approximate the beta prior $h(\theta) = \beta(8, 5)$.
8.3 Six types of if-part fuzzy sets in conjugate prior approximations.
8.4 ASAMs can use a limited number of random samples or noisy random samples to estimate the sampling pdf.
8.5 Comparison of conjugate beta priors and posteriors with their fuzzy approximators.
8.6 Comparison of conjugate gamma priors and posteriors with their fuzzy approximators.
8.7 Comparison of 11 conjugate normal posteriors with their fuzzy-based approximators based on a standard normal prior and 11 different normal likelihoods.
8.8 Comparison of a non-conjugate prior pdf $h(\theta)$ and its fuzzy approximator $H(\theta)$.
8.9 Approximation of a non-conjugate posterior pdf: comparison of a non-conjugate posterior pdf $f(\theta \mid x)$ and its fuzzy approximator $F(\theta \mid x)$.
8.10 Doubly fuzzy Bayesian inference: comparison of two normal posteriors and their doubly fuzzy approximators.
9.1 Probabilistic graphical model for Bayesian data models in Chapter 9.
9.2 Hierarchical Bayes posterior pdf approximation using a fuzzy hyperprior.
9.3 Comparison between an inverse-gamma (IG) hyperprior and its fuzzy approximation.
9.4 Triply fuzzy Bayesian inference: comparison between a 2-D posterior $f(\theta, \tau \mid x)$ and its triply fuzzy approximator $F(\theta, \tau \mid x)$.
9.5 Triply fuzzy Bayesian inference for a non-conjugate posterior: comparison between a 2-D non-conjugate posterior and its triply fuzzy approximator.
9.6 Conjugacy and semi-conjugacy of the doubly fuzzy posterior if-part set functions.
9.7 Doubly fuzzy posterior representation.
9.8 Semi-conjugacy for Laplace set functions.
10.1 A restricted Boltzmann machine.
10.2 A deep neural network.

Preface

This dissertation's main finding is that noise can speed up the convergence of the Expectation-Maximization (EM) algorithm. The Noisy Expectation Maximization (NEM) theorem states a sufficient condition under which controlled noise injection causes the EM algorithm to converge faster on average. NEM algorithms modify the basic EM scheme to take advantage of the NEM theorem's average speed-up guarantee. The dissertation also proves that neural, fuzzy, and other function approximators can uniformly approximate the more general posterior probabilities in Bayesian inference. The dissertation derives theorems and properties about the use of approximators in general Bayesian inference models.

The dissertation consists of 10 chapters. Chapter 1 is a short preview of all the main theorems and results in this dissertation. Chapter 2 gives a broad review of the EM algorithm. It also presents EM variations and examples of EM algorithms for different data models. This review lays the groundwork for discussing NEM and its convergence properties. Chapter 3 introduces Noisy Expectation Maximization.
This chapter discusses the intuition behind the NEM theorem. Then it derives the NEM theorem, which gives a condition under which noise injection improves the average convergence speed of the EM algorithm. Examples of NEM algorithms show this average noise benefit. This chapter includes a full derivation and demonstration of NEM for the popular Gaussian mixture model. The chapter ends with a corollary to the NEM theorem which states that noise-enhanced EM estimation is more beneficial for sparse data sets than for large data sets.

Chapters 4, 5, and 6 apply NEM to three important EM models: mixture models for clustering, hidden Markov models, and feedforward artificial neural networks. The training algorithms for these three models are standard and heavily used: k-means clustering, the Baum-Welch algorithm, and backpropagation training respectively. All three algorithms are EM algorithms. These chapters present proofs showing that these algorithms are EM algorithms. The EM-subsumption proof for backpropagation is new. The subsumption proofs for k-means and Baum-Welch are not new. These subsumptions imply that the NEM theorem applies to these algorithms. Chapters 4, 5, and 6 derive the NEM conditions for each model and show that NEM improves the speed of these popular training algorithms.

Chapter 7 reviews Bayesian statistics in detail. The chapter highlights the difference between the Bayesian approach to statistics and the frequentist approach (which underlies maximum likelihood techniques like EM). It also shows that the Bayesian approach subsumes the frequentist approach. This subsumption implies that subsequent Bayesian inference results also apply to EM models.

Chapters 8 and 9 analyze the effects of model-function approximation in Bayesian inference. These chapters present the Bayesian Approximation theorem and its extension, the Extended Bayesian Approximation theorem.
These theorems guarantee that uniform approximators for Bayesian model functions produce uniform approximators for the posterior pdf via Bayes theorem. Simulations with uniform fuzzy function approximators show sample posterior pdf approximations that validate the theorems. The use of fuzzy function approximators has the effect of subsuming most closed-functional-form Bayesian inference models via the recent Watkins Representation Theorem.

Chapter 10 discusses ongoing and future research that extends these results.

Abstract

This dissertation shows that careful injection of noise into sample data can substantially speed up Expectation-Maximization algorithms. Expectation-Maximization algorithms are a class of iterative algorithms for extracting maximum likelihood estimates from corrupted or incomplete data. The convergence speed-up is an example of a noise benefit or "stochastic resonance" in statistical signal processing. The dissertation presents derivations of sufficient conditions for such noise benefits and demonstrates the speed-up in some ubiquitous signal-processing algorithms. These algorithms include parameter estimation for mixture models, the k-means clustering algorithm, the Baum-Welch algorithm for training hidden Markov models, and backpropagation for training feedforward artificial neural networks. This dissertation also analyzes the effects of data and model corruption on the more general Bayesian inference estimation framework. The main finding is a theorem guaranteeing that uniform approximators for Bayesian model functions produce uniform approximators for the posterior pdf via Bayes theorem. This result also applies to hierarchical and multidimensional Bayesian models.

Chapter 1
Preview of Dissertation Results

The main aim of this dissertation is to demonstrate that noise injection can improve the average speed of Expectation-Maximization (EM) algorithms.
The EM discussion in Chapter 2 gives an idea of the power and generality of the EM algorithm schema. But EM algorithms have a key weakness: they converge slowly, especially on high-dimensional incomplete data. Noise injection can address this problem. The Noisy Expectation Maximization (NEM) theorem (Theorem 3.1) in Chapter 3 describes a condition under which injected noise causes faster EM convergence on average. This general condition reduces to a simpler condition (Corollary 3.2) for Gaussian mixture models (GMMs). The GMM noise benefit leads to EM speed-ups in clustering algorithms and in the training of hidden Markov models. The general NEM noise benefit also applies to the backpropagation algorithm for training feedforward neural networks. This noise benefit relies on the fact that backpropagation is indeed a type of EM algorithm (Theorem 6.1).

The secondary aim of this dissertation is to show that uniform function approximators can expand the set of model functions (likelihood functions, prior pdfs, and hyperprior pdfs) available for Bayesian inference. Bayesian statisticians often limit themselves to a small set of closed-form model functions either for ease of analysis or because they have no robust method for approximating arbitrary model functions. This dissertation shows a simple robust method for uniform model-function approximation in Chapters 8 and 9. Theorem 8.2 and Theorem 9.1 guarantee that uniform approximators for model functions lead to uniform approximators for posterior pdfs.

1.1 Noisy Expectation-Maximization

The Noisy Expectation Maximization (NEM) theorem (Theorem 3.1) is the major result in this dissertation.

Theorem [Noisy Expectation Maximization (NEM)]: An EM iteration noise benefit occurs on average if

$$\mathbb{E}_{Y,Z,N \mid \theta_*}\left[\ln \frac{f(Y+N, Z \mid \theta_k)}{f(Y, Z \mid \theta_k)}\right] \ge 0. \tag{1.1}$$

The theorem gives a sufficient condition under which adding noise $N$ to the observed data $Y$ leads to an increase in the average convergence speed of the EM algorithm.
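A Monte Carlo sketch can make the expectation in (1.1) concrete. For a single Gaussian population $N(\mu, \sigma^2)$ with no latent variable $Z$ (a degenerate toy case of our own choosing, not a model from the dissertation), the log density ratio reduces algebraically to $(2n(\mu - y) - n^2)/(2\sigma^2)$, so the expectation can be estimated by simple averaging:

```python
import numpy as np

# Toy Monte Carlo estimate of the NEM expectation (1.1) for a single
# Gaussian population N(mu, sigma^2) with no latent variable Z.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
y = rng.normal(mu, sigma, size=100_000)      # observed samples

def mean_log_ratio(n):
    """Average log density ratio ln f(y+n)/f(y) for a Gaussian pdf."""
    return np.mean((2.0 * n * (mu - y) - n**2) / (2.0 * sigma**2))

n_blind = rng.normal(0.0, 0.5, size=y.size)  # unscreened zero-mean noise
# Noise screened to satisfy n^2 <= 2n(mu - y) pointwise (else zeroed out):
n_nem = np.where(n_blind**2 <= 2.0 * n_blind * (mu - y), n_blind, 0.0)

# Blind noise makes the expectation negative on average;
# screened NEM noise makes every summand, and hence the mean, nonnegative.
assert mean_log_ratio(n_blind) < 0.0 <= mean_log_ratio(n_nem)
```

The screening step in the last lines previews the quadratic GMM condition stated next.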
This is the first description of a noise benefit for EM algorithms. It relies on the insight that noise can sometimes perturb the likelihood function favorably. Thus noise injection can lead to better iterative estimates for parameters. This sufficient condition is general and applies to any EM data model. The first major corollary (Corollary 3.2) applies this sufficient condition to EM algorithms on the Gaussian mixture model. This results in a simple quadratic noise-screening condition for the average noise benefit.

Corollary [NEM Condition for GMMs (in 1-D)]: The NEM sufficient condition holds for a GMM if the additive noise samples $n$ satisfy the following algebraic condition:

$$n^2 \le 2n\,(\mu_j - y) \quad \text{for all GMM sub-populations } j. \tag{1.2}$$

This quadratic condition defines the geometry (Figure 1.1) of the set of noise samples that can speed up the EM algorithm.

Figure 1.1: Geometry of NEM noise for a GMM. Noise samples in the blue overlapping region satisfy the NEM sufficient condition and lead to faster EM convergence. Noise samples in the green box satisfy a simpler quadratic NEM sufficient condition and also lead to faster EM convergence. Sampling from the green box is easier. This geometry is for a sample $y$ of a 2-D GMM with sub-populations centered at $\mu_1$ and $\mu_2$. §3.2.3 and §3.2.4 discuss these geometries in more detail.

Noise injection subject to the NEM condition leads to better EM estimates on average at each iteration and faster EM convergence. Combining NEM noise injection with a noise decay per iteration leads to much faster overall EM convergence. We refer to the combination of NEM noise injection and noise cooling as the NEM algorithm (§3.3). A comparison of the evolution of EM and NEM algorithms on a sample estimation problem shows that the NEM algorithm reaches the stationary point of the likelihood function in 30% fewer steps than the EM algorithm (see Figure 1.2).
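In practice the quadratic condition (1.2) acts as a screening test on candidate noise samples. A minimal sketch (the helper names are ours, not the dissertation's) rejection-samples Gaussian noise until it passes the test for every current sub-population mean, falling back to zero noise, which satisfies (1.2) trivially:

```python
import numpy as np

def nem_valid(n, y, means):
    """Check the 1-D GMM NEM condition n^2 <= 2n(mu_j - y) for every
    sub-population mean mu_j (condition (1.2) above)."""
    return all(n**2 <= 2.0 * n * (mu - y) for mu in means)

def sample_nem_noise(y, means, sigma_n, rng, max_tries=100):
    """Rejection-sample Gaussian noise until it satisfies the NEM
    screening condition; fall back to zero noise otherwise."""
    for _ in range(max_tries):
        n = rng.normal(0.0, sigma_n)
        if nem_valid(n, y, means):
            return n
    return 0.0  # zero noise never violates the condition

rng = np.random.default_rng(0)
means = [1.0, 2.0]
n = sample_nem_noise(y=0.5, means=means, sigma_n=0.3, rng=rng)
assert nem_valid(n, y=0.5, means=means)
```

For a sample below both means, only suitably small positive noise passes the test, which matches the geometric picture of Figure 1.1.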
Figure 1.2 (Log-likelihood comparison of EM and noise-enhanced EM): NEM noise injection can speed up the convergence of the EM algorithm. The plot shows the evolution of an EM algorithm on a log-likelihood surface with and without noise injection. Both algorithms start at the same initial estimate and converge to the same point on the log-likelihood surface. The EM algorithm converges in 10 iterations while the noise-enhanced algorithm converges in 7 iterations, 30% faster than the EM algorithm.

1.2 Applications of Noisy Expectation-Maximization

Finding the NEM noise benefit led to recasting other iterative statistical algorithms as EM algorithms to allow a noise boost. The NEM theorem is a general prescriptive tool for extracting noise benefits from arbitrary EM algorithms. So these reinterpretations serve as a basis for introducing NEM noise benefits into other standard iterative estimation algorithms. This dissertation shows NEM noise benefits in three such algorithms: the k-means clustering algorithm (Chapter 4), the Baum-Welch algorithm (Chapter 5), and the backpropagation algorithm (Chapter 6).

The most important of these algorithms is the backpropagation algorithm for feedforward neural network training. We show for the first time that the backpropagation algorithm is in fact a generalized EM (GEM) algorithm (Theorem 6.1) and thus benefits from proper noise injection:

Theorem [Backpropagation is a GEM Algorithm]: The backpropagation update equation for a feedforward neural-network likelihood function equals the GEM update equation. Thus backpropagation is a GEM algorithm.

This theorem illustrates a general theme in recasting estimation algorithms as EM algorithms: iterative estimation algorithms that make use of missing information and increase a data log-likelihood are usually (G)EM algorithms. Chapter 6 provides proof details and simulations of NEM noise benefits for backpropagation.
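The NEM algorithm of §1.1, screened noise injection plus per-iteration noise cooling, has the same outer-loop shape in every application. A generic skeleton makes that shape explicit; the function names and the $k^{-\tau}$ cooling schedule here are illustrative assumptions of ours, not the dissertation's exact algorithm:

```python
import numpy as np

def nem_estimate(y, e_step, m_step, theta0, sigma0, tau=2.0, n_iters=30, seed=0):
    """Skeleton of a NEM loop: an EM iteration whose additive noise scale
    decays ('cools') as sigma0 * k**(-tau).  e_step and m_step are
    user-supplied hooks for the chosen data model."""
    rng = np.random.default_rng(seed)
    theta = theta0
    for k in range(1, n_iters + 1):
        noise = rng.normal(0.0, sigma0 * k**(-tau), size=y.shape)
        q = e_step(y + noise, theta)      # E-step on the noise-boosted data
        theta = m_step(y + noise, q)      # M-step parameter update
    return theta

# Degenerate sanity check: estimating a plain Gaussian mean, where the
# E-step is trivial and the M-step is the sample mean.
data = np.random.default_rng(1).normal(3.0, 1.0, size=500)
theta_hat = nem_estimate(data, e_step=lambda y, th: None,
                         m_step=lambda y, q: y.mean(),
                         theta0=0.0, sigma0=1.0)
assert abs(theta_hat - data.mean()) < 0.05
```

By the final iterations the cooled noise is negligible, so the loop settles onto the ordinary EM fixed point.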
The NEM condition for backpropagation (Theorem 6.3) has interesting geometric properties, as the backpropagation noise ball in Figure 1.3 illustrates.

Figure 1.3 (Geometry of NEM noise for backpropagation): NEM noise for faster backpropagation using Gaussian output neurons. The NEM noise must fall inside the backpropagation "training error" sphere. This is the sphere with center $c = t - a$ (the error between the target output $t$ and the actual output $a$) and radius $r = \|c\|$. Noise from the noise-ball section that intersects the error sphere will speed up backpropagation training according to the NEM theorem. The error ball changes at each training iteration.

Deep neural networks can also benefit from the NEM noise benefit. Deep neural networks are "deep" stacks of restricted Boltzmann machines (RBMs). The depth of the network may help the network identify complicated patterns or concepts in complex data like video or speech. These deep networks are in fact bidirectional associative memories (BAMs). The stability and fast training properties of deep networks are direct consequences of the global stability property of BAMs.

Figure 1.4: A deep neural network consists of a stack of restricted Boltzmann machines (RBMs) or bidirectional associative memories (BAMs). The network has a visible layer, hidden layers, and an output layer.

The so-called Contrastive Divergence (CD) algorithm is the current standard algorithm for pre-training deep networks. It is an iterative algorithm for approximate maximum likelihood estimation (§10.2.1). CD is also a GEM algorithm. Theorem 10.1 and Theorem 10.2 give the NEM noise benefit conditions for training the RBMs in a deep network. The NEM condition for RBMs shares many geometrical properties with the NEM condition for backpropagation.

1.3 Results on Bayesian Approximation

The last major results in this dissertation are the Bayesian approximation theorems in Chapters 8 and 9. They address the effects of using approximate model functions for Bayesian inference.
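The "training error" ball of Figure 1.3 is easy to sample from directly. A sketch (our own helper, assuming the ball description above; the full screening condition for Gaussian output neurons is Theorem 6.3) draws noise uniformly from the ball with center $c = t - a$ and radius $r = \|c\|$:

```python
import numpy as np

def backprop_nem_noise(t, a, rng):
    """Sample noise uniformly from the ball with center c = t - a and
    radius r = ||c||.  Uniform-in-ball sampling scales a random unit
    direction by U**(1/d) * r, where d is the output dimension."""
    c = t - a
    r = np.linalg.norm(c)
    d = rng.normal(size=c.size)          # random direction
    d /= np.linalg.norm(d)
    return c + r * rng.uniform() ** (1.0 / c.size) * d

rng = np.random.default_rng(0)
t = np.array([1.0, 0.0, 0.0])            # target output (illustrative)
a = np.array([0.6, 0.2, 0.1])            # actual network output (illustrative)
n = backprop_nem_noise(t, a, rng)

# The sample always lies inside the error ball.
assert np.linalg.norm(n - (t - a)) <= np.linalg.norm(t - a) + 1e-12
```

Because the ball is centered at the current error $t - a$, the admissible noise region shrinks and moves as training reduces the error, which matches the caption's remark that the error ball changes at each iteration.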
Approximate model functions are common in Bayesian statistics because statisticians often have to estimate the true model functions from data or experts. This dissertation presents the first general proof that these model approximations do not degrade the quality of the approximate posterior pdf. Below is a combined statement of the two approximation theorems in this dissertation (Theorem 8.2 and Theorem 9.1):

Theorem [The Unified Bayesian Approximation Theorem]: Suppose the model functions (likelihoods $g$, prior $h$, and hyperpriors $\pi$) for a Bayesian inference problem are bounded and continuous. Suppose also that the joint product of the model functions' uniform approximators $G H \Pi$ is non-zero almost everywhere on the domain of interest $D$. Then the posterior pdf approximator

$$F = \frac{G H \Pi}{\int_D G H \Pi}$$

also uniformly approximates the true posterior pdf

$$f = \frac{g h \pi}{\int_D g h \pi}.$$

This approximation theorem gives statisticians the freedom to use approximators to approximate arbitrary model functions (even model functions that have no closed functional form) without worrying about the quality of their posterior pdfs. Statisticians can choose any uniform approximation method to reap the benefits of this theorem. Standard additive model (SAM) fuzzy systems are one such tool for uniform function approximation. Fuzzy systems can use linguistic information to build model functions. Figure 1.5 below shows an example of a SAM system approximating a pdf using 5 fuzzy rules. §8.2 discusses fuzzy function approximation in detail. Chapter 9 addresses the complexities of approximate Bayesian inference in hierarchical or iterative inference contexts.

Figure 1.5: A fuzzy function approximation for a $\beta(8, 5)$ prior pdf. An adaptive SAM (standard additive model) fuzzy system tuned five fuzzy sets to give a nearly exact approximation of the beta prior. Each fuzzy rule defines a patch or 3-D surface above the input-output planar state space.
The third rule has the form "If $\theta = A_3$ then $B_3$" where the then-part set $B_3$ is a fuzzy number centered at centroid $c_3$. This rule might have the linguistic form "If $\theta$ is approximately 1/2 then $F(\theta)$ is large."

This dissertation also contains other minor results of note, including a convergence theorem (Theorem 2.3) for a subset of minorization-maximization (MM) algorithms. MM algorithms generalize EM algorithms. But there are no published proofs of MM convergence. There is an extension of the GMM-NEM condition to mixtures of jointly Gaussian sub-populations (Corollary 3.4). There is also an alternate proof showing that the k-means algorithm is a specialized EM algorithm. Other proofs of this subsumption already exist in the literature. This dissertation ends with discussions about ongoing work to establish or demonstrate NEM noise benefits in genomics and medical imaging applications (§10.2.2, §10.2.3, & §10.2.4).

Chapter 2
The Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm [70], [112], [200] is an iterative statistical algorithm that estimates high-likelihood parameters from incomplete or corrupted data. This popular algorithm has a wide array of applications that includes data clustering [4], [47], automated speech recognition [151], [246], medical imaging [268], [319], genome sequencing [12], [187], radar denoising [292], and infectious-disease tracking [8], [249]. A prominent mathematical modeler even opined that the EM algorithm is "as close as data analysis algorithms come to a free lunch" [105, p. 177]. Other data-mining researchers consider the EM algorithm to be one of the top ten algorithms for data mining [305].

EM algorithms split an estimation problem into two steps: the Expectation (E) step and the Maximization (M) step. The E step describes the best possible complete model for the incomplete data given all current information.
The M step uses that new complete model to pick higher-likelihood estimates of the distribution parameters for the incomplete data. The improved parameter estimates from the M step lead to better complete models in the E step. The EM algorithm iterates between the E and M steps until the parameter estimates stop improving.

This chapter presents the EM algorithm in full detail. The EM algorithm is a generalization of maximum likelihood estimation (MLE). It inherits many properties from MLE. The next section reviews MLE and motivates the EM generalization. Then we formulate the EM algorithm and examine some of its convergence properties. The chapter ends with a survey of instances, notable variants, and a generalization of the EM algorithm.

2.1 A Review of Maximum Likelihood Estimation

Maximum likelihood estimation methods search for parameter values which maximize a likelihood function $\ell(\theta)$. The likelihood $\ell(\theta)$ is a statistic of the data $y$ that defines a preference map over all possible parameters $\theta \in \Theta$ for the data model [238]. This preference map quantifies how well a parameter $\theta_0$ describes or summarizes the sample $y$ relative to all other parameters $\theta_i \in \Theta$ [83], [238]. The likelihood $\ell(\theta)$ measures the quality of a parameter estimate using the observed sample $y$ and the data's parametric pdf $f(y \mid \theta)$. Fisher [93], [94] formalized the use of the likelihood as a method for evaluating candidate parameter estimates. He defined the likelihood $\ell(\theta)$ of $\theta$ for a sample $y$ of the observed random variable (rv) $Y$ as

$$\ell(\theta) = f(y \mid \theta) \tag{2.1}$$

where $Y$ has the probability density function (pdf) $f(y \mid \theta)$. $\ell(\theta)$ has the unusual property of being both a function of $\theta$ and a statistic of the data $y$. If an estimate $\theta_0$ has low likelihood $\ell(\theta_0)$ for a given sample $y$ then there is a low probability of observing $y$ under the hypothesis that the true data pdf is $f(y \mid \theta_0)$. A parameter $\theta_1$ is more likely than $\theta_0$ if it gives a higher likelihood than $\theta_0$.
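The E/M alternation described at the start of this section can be sketched for a two-component 1-D Gaussian mixture. This is a deliberately minimal illustration of ours, with known unit variances and equal mixing weights so that only the two means are estimated; the general GMM-EM updates appear later in the chapter:

```python
import numpy as np

def gmm_em_1d(y, mu, n_iters=50):
    """Minimal two-component 1-D GMM-EM: unit variances, equal weights,
    means only (a simplified sketch of the full algorithm)."""
    mu = np.array(mu, dtype=float)
    for _ in range(n_iters):
        # E step: posterior responsibility of each component for each sample
        dens = np.exp(-0.5 * (y[:, None] - mu[None, :])**2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted sample means
        mu = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])
mu_hat = np.sort(gmm_em_1d(y, mu=[-1.0, 1.0]))
assert abs(mu_hat[0] + 2.0) < 0.3 and abs(mu_hat[1] - 2.0) < 0.3
```

The loop stops after a fixed iteration budget here; a production implementation would instead stop when the parameter estimates (or the log-likelihood) stop improving, as the text describes.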
So there is a higher probability of observing y if the true sampling pdf were f(y|θ₁) instead of f(y|θ₀). One approach to statistics asserts that this likelihood function contains all the objective evidence the experiment (i.e. the random sampling of Y) provides about the unknown parameter θ. This assertion is the likelihood principle and it is often an implicit assumption in statistics applications [25] including Bayesian applications.

2.1.1 MLE Overview

A basic question in statistics is: which pdf f(y) best describes a set of observed random samples {yᵢ}ᵢⁿ? Statisticians often invoke the simplifying assumption that the data samples come from one pdf in a class of parametric sampling pdfs {f(y|θ)}_{θ∈Θ}.¹ Then the question becomes: which parameter estimate θ̂ gives the best parametric pdf f(y|θ) for describing the observed random samples {yᵢ}ᵢⁿ? The preceding discussion on likelihoods suggests that the best parameter estimate θ̂ is the value of the parameter that gives the highest observation probability for {yᵢ}ᵢⁿ, i.e. the most likely parameter. This is the Maximum Likelihood Estimate (MLE) [2], [83], [93], [238]. The likelihood ℓ(θ) of the independent random samples {yᵢ}ᵢⁿ is

ℓ(θ) = ∏ᵢⁿ f(yᵢ|θ).    (2.2)

Thus the ML estimate θ̂ is

θ̂ₙ = argmax_{θ∈Θ} ℓ(θ).    (2.3)

The product structure of the joint pdf and the exponential nature of many pdfs make it easier to optimize the logarithm of the likelihood ln ℓ(θ) instead of the likelihood ℓ(θ) itself.

¹ This assumption may be insufficient if the parametric pdf class is not general enough. A rigorous statistician would ensure that the parametric pdf class {f(y|θ)}_{θ∈Θ} is provably dense in the class of appropriate, possibly non-parametric pdfs for y. Most estimation applications just assume a convenient pdf family and do not address this assumption rigorously.
So we define the log-likelihood L(θ) as²

L(θ) = ln ℓ(θ).    (2.4)

The logarithmic transformation preserves the likelihood function's stationary points because L′(θ) = k ℓ′(θ) where k is a strictly positive scalar for all viable values³ of θ. So the ML estimate is equivalently

θ̂ₙ = argmax_{θ∈Θ} L(θ).    (2.5)

The EM algorithm applies a generalized MLE approach to find optimal parameters that fit complicated pdfs to incomplete data samples.

² I use L(θ) and ℓ(θ) to denote L(θ|y) and ℓ(θ|y) when the data random variable Y is unambiguous.

³ θ is viable if it has strictly positive likelihood ℓ(θ) > 0, i.e. f(y|θ) > 0 for the observed sample y. The viability condition is necessary because L′(θ) = ℓ′(θ)/ℓ(θ). Thus L preserves the stationary points of ℓ only if k = 1/ℓ(θ) is finite and nonzero, i.e. only if ℓ(θ) ≠ 0. The log-likelihood L(θ) also preserves the preference order specified by ℓ(θ) since ℓ(θ₀) ≤ ℓ(θ₁) ⟺ ln ℓ(θ₀) ≤ ln ℓ(θ₁) by the monotone increasing property of the log transformation.

2.1.2 A Brief History of Maximum Likelihood Methods

Ronald Fisher introduced⁴ maximum likelihood methods with his 1912 paper [92]. He disparaged the then state-of-the-art parameter estimation methods: the method of moments and the minimum mean squared estimation (MMSE) method. He criticized the fact that both methods select optimal parameters based on arbitrary criteria: matching finite-order moments and minimizing the mean squared error respectively. But his main criticism was that these methods give different estimates under different parameterizations of the underlying distribution. Suppose the sampling pdf f(y|θ) has a different functional representation f̃(y|φ) where φ = t(θ). Then the best estimate θ̂ for θ should identify the best estimate φ̂ for φ via the parameter transformation

φ̂ = t(θ̂).    (2.6)

The moments and MMSE methods do not obey this invariance principle, i.e. φ̂ ≠ t(θ̂) in general.
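The ML estimate itself does obey this invariance principle, and the point is easy to check numerically. The sketch below is a minimal illustration (hypothetical data and a crude pure-Python grid search, not a method from this dissertation): it maximizes a zero-mean Gaussian log-likelihood once over the variance v and once over the reparameterized standard deviation s = √v, and confirms that the two numerical ML estimates satisfy ŝ² ≈ v̂.

```python
import math
import random

random.seed(1)
# Hypothetical sample: 500 draws from a zero-mean Gaussian with true variance 4
y = [random.gauss(0.0, 2.0) for _ in range(500)]

def log_likelihood_var(v):
    # L(v) = sum_i ln f(y_i | v) for the N(0, v) model
    return sum(-0.5 * math.log(2 * math.pi * v) - yi**2 / (2 * v) for yi in y)

def log_likelihood_sd(s):
    # Same model reparameterized by the transformation v = t^{-1}(s) = s^2
    return log_likelihood_var(s * s)

# Crude grid-search MLE in each parameterization
grid = [0.01 * k for k in range(1, 1000)]
v_hat = max(grid, key=log_likelihood_var)   # MLE of the variance
s_hat = max(grid, key=log_likelihood_sd)    # MLE of the standard deviation

# Invariance: the MLE maps through the reparameterization, s_hat^2 = v_hat
assert abs(s_hat**2 - v_hat) < 0.1
print(v_hat, s_hat**2)
```

Up to grid resolution the two maximizations agree, while a moment-matching recipe applied to an arbitrary reparameterization need not.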
Fisher argued that the invariance principle should hold for any truly optimal estimate since the observed samples do not change with re-parameterizations of the pdf. Fisher then proposed an alternate method of statistical estimation that picks the parameters that maximize the likelihood of the parametric pdf. He showed that this method is invariant under parameter transformations. Fisher did not name the method until he wrote his 1922 paper [93] in which he formalized the idea of a likelihood as distinct from a probability. Likelihoods measure degree of certainty for parameters just as probabilities do for events. But likelihoods have no independent axiomatic foundation. Likelihoods, for example, do not have to integrate to one. And so there is no clear concept of likelihood marginalization over unwanted variables [18]. EM provides one likelihood equivalent to Bayesian posterior marginalization over unobserved data random variables.

⁴ Francis Y. Edgeworth had preempted Fisher [82], [243] in formulating the maximum likelihood principle in 1908. But his work did not gain attention until Fisher reformulated an equivalent idea outside the Bayesian setting many years later.

Fisher, Doob, and Wald later found other attractive statistical properties of maximum likelihood estimates. The ML estimate θ̂ₙ for a parameter θ is consistent (θ̂ₙ → θ in probability) [77]. It is strongly consistent (θ̂ₙ → θ with probability one) under additional weak conditions [289]. The MLE is asymptotically efficient [94]:

limₙ V[θ̂ₙ] = ℐₙ⁻¹(θ)    (2.7)

where ℐₙ(θ) is the Fisher information ℐₙ(θ) = −E_{Y|θ}[L″(θ)] [135]. And the MLE is asymptotically normal [77], [288]:

limₙ (θ̂ₙ − θ) ∼ N(0, ℐₙ⁻¹).    (2.8)

2.1.3 Numerical Maximum Likelihood Estimation

Maximum likelihood estimation converts a statistical estimation problem into an optimization problem. We can sometimes solve the optimization problem analytically. But numerical methods are necessary when equation (2.5) has no analytic solution.
ML estimates are roots of the derivative L′(θ) of the log-likelihood L(θ). This derivative L′(θ) is also the score statistic S(θ):

S(θ) = L′(θ) = ℓ′(θ)/ℓ(θ).    (2.9)

The Newton–Raphson (NR) method [312] is a Taylor-series-based iterative method for root-finding. NR uses the first derivative of the objective function to tune update directions. The first derivative of the score S′(θ) is the second derivative of the log-likelihood L″(θ). Its negative is the observed Fisher information [95], [238]:

I(θ) = −L″(θ)    (2.10)

or, for a vector parameter θ = (θ₁, …, θ_p), the p×p matrix with entries

[I(θ)]_{i,j} = −∂²L(θ)/∂θᵢ∂θⱼ.    (2.11)

Thus the NR update equation is

θ_{k+1} = θ_k + I⁻¹(θ_k) S(θ_k).    (2.12)

The Fisher scoring method uses the expected Fisher information ℐ(θ) = E_{Y|θ}[I(θ)] instead of the observed Fisher information I(θ) in (2.12). Analytic and numerical optimization methods work well for MLE on simple data models. But they often falter when the data model is complex or when the data has random missing-information artifacts.

2.2 The Expectation Maximization Algorithm

Many statistical applications require complicated data models to account for experimental effects like data corruption, missing samples, sample grouping, censorship, truncation, and additive measurement errors. The appropriate likelihood functions for these data models can be very complicated. The Expectation Maximization (EM) algorithm [70], [200] is an extension of MLE methods to such complicated data models.

2.2.1 Motivation & Historical Overview

The EM algorithm is an iterative ML technique that compensates for missing data by taking conditional expectations over the missing information given the observed data [70], [200]. The basic idea behind the EM algorithm is to treat the complex model for the observed data Y as an incomplete model. The EM algorithm augments the observed data random variable Y with a hidden (latent) random variable Z.
The aim of this augmentation is to complete the data model and obtain a simpler likelihood function for estimation. But the resulting complete log-likelihood function L_c(θ|Y,Z) needs to fit the observed incomplete data Y = y. The EM algorithm addresses this by sequentially updating its best guess for the complete log-likelihood function L_c(θ|Y,Z). The EM algorithm uses the conditional expectation of the complete log-likelihood E_Z[L_c(θ|y,Z) | Y = y, θ_k] given the observed data⁵ as its best guess for the compatible complete log-likelihood function on the k-th iteration:

E-Step:  Q(θ|θ_k) = E_{Z|y,θ_k}[L_c(θ|y,Z)]    (2.13)
M-Step:  θ_{k+1} = argmax_θ {Q(θ|θ_k)}    (2.14)

The EM algorithm is a synthesis of ML statistical methods for dealing with complexity due to missing information. Orchard and Woodbury's Missing Information Principle [223] was the first coherent formulation of the idea that augmenting a data model with missing information can simplify statistical analysis. There had been decades of statistical work weaving this idea into ad hoc solutions for missing information problems in different areas. Hartley used this principle to address truncation and censoring effects in discrete data [120]. The Baum–Welch algorithm [16], [298] uses the same missing information principle to estimate ML parameters for hidden Markov models. Sundberg [276], [277] identified a similar theme in the analysis of incomplete generalized-exponential family models.

⁵ Some researchers incorrectly assume that this conditional expectation just fills in the missing data Z with the current estimate E[Z|Y, θ_k]. The process of substituting estimates for missing random variables is imputation [191], [265]. EM is equivalent to direct imputation only when the data model has a log-linear likelihood [220], [295] like the exponential or gamma distributions (see §2.5.1). But EM does not impute estimates for Z directly when the data model has a nonlinear log-likelihood.
Beale and Little [20] applied the missing information idea to estimate standard errors for multivariate regression coefficients on incomplete samples. The Dempster et al. paper [70] distilled out the unifying principle behind these ad hoc solutions and extended the principle to a general formulation for solving other missing information MLE problems. The EM formulation was such a powerful synthesis that Efron argued 17 years later [86] that the field of missing data analysis was an outgrowth of the EM algorithm.⁶ The general EM formulation now applies to parameter estimation on a wide array of data models including Gaussian mixtures, finite mixtures, censored, grouped, or truncated models, mixtures of censored models, and multivariate t-distribution models.

The EM algorithm is not so much a single algorithm as it is an algorithm schema or family of similar MLE techniques for handling incomplete data. The conceptual simplicity and the mature theoretical foundations of the EM schema are powerful incentives for recasting disparate estimation algorithms as EM algorithms. These algorithms include the Baum–Welch algorithm [298], Iteratively Re-weighted Least Squares (IRLS) [71], older parameter estimation methods for normal mixtures [248], Iterative Conditional Estimation (ICE) [69], Gaussian Mean-Shift [196], k-means clustering [47], [226], backpropagation [6], etc.

This dissertation extends the EM algorithm schema via the use of noise injection. Noise injection causes the algorithm to explore more of the log-likelihood surface. Intelligent log-likelihood exploration can lead to faster average convergence speed for the EM algorithm. Figure 2.1 demonstrates an EM algorithm speed boost due to noise injection. The figure shows the evolution of an EM and a noisy EM (NEM) algorithm on the same log-likelihood surface. The noise-enhanced algorithm converges to the same solution 30% faster than the EM algorithm.
Subsequent chapters discuss the details of noise-enhanced EM algorithms like the one in this figure.

⁶ A claim which Rubin countered quite effectively in his rejoinder to Efron's paper [255].

Log-likelihood Comparison of EM and Noise-enhanced EM
Figure 2.1: Plot showing that noise injection can speed up the convergence of the EM algorithm. The plot shows the evolution of an EM algorithm on a log-likelihood surface with and without noise injection. Both algorithms estimate a 2-dimensional ML parameter on the same data. Both algorithms also start at the same initial estimate and converge to the same point on the log-likelihood surface. The EM algorithm converges in 10 iterations while the noise-enhanced algorithm converges in 7 iterations, 30% faster than the EM algorithm.

2.2.2 Formulation of the EM Algorithm

We now formulate the EM algorithm using the following notation:

θ ∈ Θ ⊆ ℝᵈ : unknown d-dimensional pdf parameter
𝒳 : complete data space
𝒴 : observed data space
Y : observed data random variable with pdf g(y|θ)
Z : unobserved latent random variable with pdf f(z|y,θ)
X = (Y,Z) : complete data random variable with pdf f(x|θ)
f(x|θ) = f(y,z|θ) : joint pdf of Z and Y
L_c(θ|x) = ln f(x|θ) : complete data log-likelihood
L(θ|y) = ln g(y|θ) : observed data log-likelihood
θ̂ : MLE for θ
θ_t : an estimate for θ̂ at iteration t

We use the full log-likelihood notation L(θ|y) instead of just L(θ) to avoid confusion between log-likelihoods for the different random variables. The goal of the EM algorithm is to find the estimate θ̂ that maximizes g(y|θ) given observed samples of Y, i.e.

θ̂ = argmax_θ ln g(y|θ) = argmax_θ L(θ|y).    (2.15)

The algorithm makes essential use of the complete data pdf f(x|θ) to achieve this goal. The EM scheme applies when we observe an incomplete data random variable Y = r(X) instead of the complete data random variable X. The function r : 𝒳 → 𝒴 models data corruption or information loss. X = (Y,Z) can often denote the complete data X where Z is a latent or missing random variable.
Z represents any statistical information lost during the observation mapping r(X). This corruption makes the observed data log-likelihood L(θ|y) complicated and difficult to optimize directly in (2.5). The EM algorithm addresses this difficulty by using the simpler complete likelihood L_c(θ|y,z) to derive a surrogate log-likelihood Q(θ|θ_t) as a replacement for L(θ|y). Q(θ|θ_t) is the average of L_c(θ|y,z) over all possible values of the unobserved latent variable Z given the observation Y = y and the current parameter estimate θ_t:

Q(θ|θ_t) = E_Z[L_c(θ|y,Z) | Y = y, θ_t] = ∫_𝒵 L_c(θ|y,z) f(z|y,θ_t) dz.    (2.16)

Dempster et al. [70] (DLR) first showed that any θ that increases Q(θ|θ_t) cannot reduce the likelihood difference ℓ(θ|y) − ℓ(θ_t|y). This "ascent property" led to an iterative method that performs gradient ascent on the likelihood ℓ(θ|y). This result underpins the EM algorithm and its many variants [90], [145], [192], [193], [204], [205].

A standard EM algorithm performs the following two steps iteratively on a vector y = (y₁, …, y_M) of observed random samples of Y (assume θ₀ is a suitable random initialization):

Algorithm 2.1: The Expectation Maximization Algorithm
Input: y = {yᵢ}ᵢ : vector of observed incomplete data
Output: θ̂_EM : EM estimate of parameter θ
1 while (‖θ_t − θ_{t−1}‖ ≥ 10^{−tol}) do
2   E-Step: Q(θ|θ_t) ← E_{Z|y,θ_t}[L_c(θ|y,Z)]
3   M-Step: θ_{t+1} ← argmax_θ {Q(θ|θ_t)}
4   t ← t + 1
5 θ̂_EM ← θ_t

L(θ_t|y) increases or remains constant with each EM iteration. Thus the algorithm performs a gradient ascent procedure on the likelihood surface L(θ|y). The algorithm stops when successive estimates differ by less than a given tolerance ‖θ_t − θ_{t−1}‖ < 10^{−tol} or when ‖L(θ_t|y) − L(θ_{t−1}|y)‖ < ε [112]. The algorithm converges when the estimate is close to a local optimum [36], [112], [200], [303].

2.2.3 Complete Data Spaces for Incomplete Data Models

EM algorithms optimize observed data log-likelihoods L(θ|Y) via the complete log-likelihood L_c(θ|X) on a complete data space (CDS). The CDS specifies the complete
The CDS species the complete 18 data random variable X and its likelihood function L c (jX). The complete data random variable X and the associated latent variable Z depend crucially on the data model. The selection of Z or X determines the E-step function Q(j t ). Careful CDS selection can improve the analytic, computational, and convergence properties of the EM algorithm [89], [90]. So far we have assumed that the complete data random variable X augments the observed data Y with a latent variable Z via a direct product operation X = (Y;Z). This direct product complete data space identies the complete data random variable as X = (Y;Z). Then the data corruption function r(X) is a projection onto the observed data spaceY. And the observed data pdf is just a marginal pdf of the complete data pdf: X = (Y;Z) (2.17) Y =r(X) =proj Y (X) (2.18) g(yj) = Z r 1 (y) f(xj) dx = Z Z f(y;zj) dz : (2.19) This approach is standard for estimating parameters of mixture models. Mixture model data comes from an unknown convex mixture of sub-populations. The latent random variable Z identies the sub-population to which a sample Y belongs. The direct product CDS models are illustrative for exposition purposes. General observation functions r(:) are often re-expressible as functions over the observed data Y and some ancillary variable Z. But direct products are not the only general way to model CDSs. Many observation functions r() do not identify the latent variable Z explicitly. Another approach is to use the complete random variable X as the latent random variable Z. All the action is in the observation transformation r(:) between the complete and the observed data spaces. The right-censored gamma model gives such an example [52], [279]. Censored models are common in survival and lifetime data analysis. The latent random variable Z is the unobserved uncensored gamma random 19 variable. 
X = Z    (2.20)
Y = r(X) = min{X, T}    (2.21)
g(y|θ) = f_X(y | X ∈ [0,T], θ).    (2.22)

Suppose the data Y represents the lifetime of subjects after a medical procedure. Right censorship occurs if the survey experiment keeps track of lifetime data only up until time index T. Then the latent random variable Z is the true unobserved lifetime of a subject, which may exceed the censorship time index T.

These CDSs are special cases. Identifying r(X) can be difficult for general data models. But a functional description of r(X) is not necessary. The main requirement for an EM algorithm is a description of how r(X) changes the complete likelihood L_c into the observed likelihood L:

Y = r(X)    (2.23)
g(y|θ) = ∫_{r⁻¹(y)} f(x|θ) dx    (2.24)
where r⁻¹(y) = {x ∈ 𝒳 | r(x) = y}.    (2.25)

Fessler and Hero [89], [90] introduced a further generalization of the idea of a CDS. They defined an admissible complete data space 𝒳 for the observed density g(y|θ) as a CDS whose joint density of X ∈ 𝒳 and Y ∈ 𝒴 satisfies

f(y,x|θ) = f(y|x) f(x|θ)    (2.26)

with f(y|x) independent of θ. The classical EM setup assumes that x selects y via the deterministic assignment y = r(x). The admissible CDS reduces to the classical deterministic CDS when the conditional pdf f(y|x) is just a delta function:

f(y,x|θ) = δ(y − r(x)) f(x|θ).    (2.27)

The admissible CDS definition in equation (2.26) allows for a more general case where the corrupting process r(·) is a random transformation. A CDS is admissible under this definition only if the transformation y = r(x) is independent of θ. The use of admissible complete data spaces adds another level of flexibility to EM algorithms. The cascade [266] and SAGE [90] variants of the EM algorithm manipulate admissible CDSs to speed up EM convergence for data models with large parameter spaces.

2.3 EM Convergence Properties

The ascent property of the EM algorithm [70] is its main strength. It ensures the stability of the EM algorithm. And it is the first step in the proof that the EM algorithm converges to desirable parameter estimates.
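The ascent property is easy to observe numerically before stating it formally. The sketch below is a minimal pure-Python illustration (hypothetical data, starting values, and unit component variances are my own choices, not examples from this dissertation): it runs the standard E and M steps of Algorithm 2.1 on a two-component Gaussian mixture, where the latent variable Z is the component label, and checks that the observed-data log-likelihood never decreases across iterations.

```python
import math
import random

random.seed(7)
# Hypothetical sample from a 2-component Gaussian mixture with unit variances
y = [random.gauss(-2.0, 1.0) for _ in range(150)] + \
    [random.gauss(3.0, 1.0) for _ in range(150)]

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(w, mu1, mu2):
    # L(theta|y) = sum_i ln g(y_i|theta) for the mixture density g
    return sum(math.log(w * norm_pdf(yi, mu1) + (1 - w) * norm_pdf(yi, mu2))
               for yi in y)

w, mu1, mu2 = 0.5, -1.0, 1.0          # crude initial estimate theta_0
history = [log_likelihood(w, mu1, mu2)]
for _ in range(30):
    # E-step: responsibilities p(Z=1 | y_i, theta_t) define Q(theta|theta_t)
    r = [w * norm_pdf(yi, mu1) /
         (w * norm_pdf(yi, mu1) + (1 - w) * norm_pdf(yi, mu2)) for yi in y]
    # M-step: closed-form maximizer of Q for the weight and the two means
    w = sum(r) / len(y)
    mu1 = sum(ri * yi for ri, yi in zip(r, y)) / sum(r)
    mu2 = sum((1 - ri) * yi for ri, yi in zip(r, y)) / sum(1 - ri for ri in r)
    history.append(log_likelihood(w, mu1, mu2))

# Ascent property: the log-likelihood sequence is non-decreasing
assert all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
```

The recorded log-likelihood sequence rises monotonically to a plateau, which is exactly the behavior the ascent property guarantees.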
Below is a statement and proof of this property:

Proposition 2.1 [The Ascent Property of the EM Algorithm]:

L(θ|y) − L(θ_t|y) ≥ Q(θ|θ_t) − Q(θ_t|θ_t)    (2.28)

Proof. By Bayes theorem:

f(z|y,θ) = f(z,y|θ) / g(y|θ)    (2.29)
ln f(z|y,θ) = ln f(z,y|θ) − L(θ|y)    (2.30)
L(θ|y) = ln f(z,y|θ) − ln f(z|y,θ).    (2.31)

Average out the latent random variable Z conditioned on the observed data y and the current parameter estimate θ_t by applying the conditional expectation operator E_{Z|y,θ_t}[·]:

E_{Z|y,θ_t}[L(θ|y)] = E_{Z|y,θ_t}[ln f(Z,y|θ)] − E_{Z|y,θ_t}[ln f(Z|y,θ)].    (2.32)

Therefore

L(θ|y) = Q(θ|θ_t) − H(θ|θ_t)    (2.33)

where

Q(θ|θ_t) ≜ E_{Z|y,θ_t}[ln f(Z,y|θ)]    (2.34)
H(θ|θ_t) ≜ E_{Z|y,θ_t}[ln f(Z|y,θ)].    (2.35)

L(θ|y) has no dependence on Z in (2.32). Thus the expectation leaves it unchanged. The goal of MLE is to maximize L(θ|y). So our immediate goal is to force L(θ|y) upwards. Compare L(θ|y) to its current value L(θ_t|y) using (2.33):

L(θ|y) − L(θ_t|y) = Q(θ|θ_t) − H(θ|θ_t) − Q(θ_t|θ_t) + H(θ_t|θ_t)    (2.36)
    = Q(θ|θ_t) − Q(θ_t|θ_t) + [H(θ_t|θ_t) − H(θ|θ_t)].    (2.37)

Examine the terms in square brackets:

H(θ_t|θ_t) − H(θ|θ_t) = ∫_𝒵 [ln f(z|y,θ_t) − ln f(z|y,θ)] f(z|y,θ_t) dz    (2.38)
    = ∫_𝒵 ln [f(z|y,θ_t)/f(z|y,θ)] f(z|y,θ_t) dz.    (2.39)

The expectation of the logarithm of a pdf ratio in the last line

∫_𝒵 ln [f(z|y,θ_t)/f(z|y,θ)] f(z|y,θ_t) dz = D(f(z|y,θ_t) ‖ f(z|y,θ))    (2.40)

is the Kullback–Leibler divergence or relative entropy D(·‖·) [61], [184]. D(·‖·) is always non-negative:

D(f(z|y,θ_t) ‖ f(z|y,θ)) ≥ 0.    (2.41)

Hence

H(θ_t|θ_t) − H(θ|θ_t) ≥ 0.    (2.42)

Thus equations (2.37) and (2.42) give the ascent property inequality:

L(θ|y) − L(θ_t|y) ≥ Q(θ|θ_t) − Q(θ_t|θ_t).    (2.43)

This implies that any value of θ that increases Q relative to its current value Q(θ_t|θ_t) will also increase L(θ|y) relative to L(θ_t|y). The standard EM algorithm chooses the value of θ that gives the maximum increase in Q, i.e.

θ_{t+1} = argmax_θ Q(θ|θ_t).    (2.44)

The Generalized Expectation-Maximization (GEM) algorithm relaxes the M-step by requiring only improved intermediate estimates Q(θ_{t+1}|θ_t) ≥ Q(θ_t|θ_t) instead of maximized intermediate estimates.
Algorithm 2.2: Modified M-Step for Generalized EM
1 M-Step: θ_{t+1} ← θ̃ such that Q(θ̃|θ_t) ≥ Q(θ_t|θ_t)

The ascent property implies that the EM algorithm produces a sequence of estimates {θ_t}_{t=0}^∞ such that

L(θ_{t+1}|y) ≥ L(θ_t|y).    (2.45)

If we assume that the likelihood is bounded from above, then we get a finite limit for the log-likelihood sequence:

lim_{t→∞} L(θ_t|y) = L* < ∞.    (2.46)

The existence of this limit does not mean that L* is a global maximum. L* can be any stationary point, i.e. a saddle point, a local maximum, or a global maximum. This limit also does not mean the sequence of EM iterates {θ_t}_t necessarily converges to a single point. DLR purported to prove this erroneous result. But Boyles [36] presented a counter-example of a GEM that fails to converge. The GEM in the counter-example produced non-convergent estimates stuck in an uncountable, compact, connected set.

2.3.1 Convergence of the (G)EM Algorithm

GEM estimates do not converge to the global maximizer (the ML estimate) in general. The only guarantee is that GEM estimates θ_t converge to a point in the set S of stationary points of the log-likelihood L. This section presents a proof of this claim. The proof is a direct application of Zangwill's Convergence Theorem A [315, p. 91]. The proof applies to the GEM algorithm and therefore to the EM algorithm by subsumption.

We first define some terms. A point-to-set map M : V → 2^V is a function that maps points θ ∈ V to subsets of V, i.e. M : θ ↦ M(θ) ⊆ V. A point-to-set map M is closed on J ⊆ V if for all points

θ_∞ ∈ J    (2.47)

such that

θ_t → θ_∞    (2.48)
and φ_t ∈ M(θ_t)    (2.49)
and φ_t → φ_∞ for some φ_∞    (2.50)

then

φ_∞ ∈ M(θ_∞).    (2.51)

This is a generalization of function continuity to point-to-set maps. A function f is continuous if

lim_{t→∞} f(x_t) = f(lim_{t→∞} x_t).    (2.52)

The closedness property for the map M is similar. The last equation is exactly analogous to the closed-map condition if we replace equality = with set membership ∈ and identify φ_∞ = lim_t M(θ_t).
Zangwill [315] defines an algorithm for non-linear optimization as a sequence of point-to-set maps M_t(·) that generate a sequence of estimates {θ_t} via the recursion

θ_{t+1} ∈ M(θ_t).    (2.53)

Algorithms solve optimization problems by generating a sequence that converges to elements of the problem's solution set S. The GEM algorithm is an example. The next theorem by Zangwill gives conditions under which general optimization algorithms converge. A sketch of Zangwill's proof follows the theorem.

Theorem 2.1 [Zangwill's Convergence Theorem A]: Suppose the algorithm M : V → 2^V generates a sequence {θ_t ∈ M(θ_{t−1})}_t. Let S be the set of solution points for a continuous objective function L(θ). Suppose also that

1. θ_t ∈ J ⊆ V for all t where J is compact.
2. M is closed over S^c.
3. If θ ∈ S^c then L(θ*) > L(θ) for all θ* ∈ M(θ).
4. If θ ∈ S then L(θ*) ≥ L(θ) for all θ* ∈ M(θ).

Then the limit of the sequence {θ_t} or of any of its convergent subsequences is in the solution set S.

Proof. See Zangwill [315, pp. 91–94]. The algorithm either stops at a solution or generates an infinite sequence. Suppose {θ_t} is an infinite sequence on a compact set by condition (1). So {θ_t} must have a convergent subsequence {θ_τ} with limit θ*. L is a continuous function. So lim_τ L(θ_τ) = L(θ*). The sequence {L(θ_t)}_t is monotone increasing by conditions (3) and (4). Thus the original sequence {L(θ_t)}_t and the subsequence {L(θ_τ)}_τ converge to the same value L(θ*):

lim_t L(θ_t) = lim_τ L(θ_τ) = L(θ*).    (2.54)

This holds for any subsequence of {L(θ_t)}_t.

Suppose θ* is not a solution for the objective function L. Define a derived subsequence {φ_τ}_τ such that

φ_τ ∈ M(θ_τ)    (2.55)

for all τ. This derived sequence is the original subsequence {θ_τ} after a single step advance. The derived sequence will also have a convergent subsequence {φ̃_k} with limit lim_k φ̃_k = θ*⁺. Then θ*⁺ ∈ M(θ*) by the closure condition (2). θ* is not a solution by assumption. So

L(θ*⁺) > L(θ*)    (2.56)

by condition (3). But an argument similar to the one for (2.54) shows that

L(θ*⁺) = L(θ*).    (2.57)

This contradicts (2.56).
Thus θ* must be a solution.

Theorem 2.2 [Wu's (G)EM Convergence Theorem]: Suppose the log-likelihood function L(θ|y) is continuous and bounded from above. Suppose also that the set

J = {θ ∈ Θ | L(θ|y) ≥ L(θ₀|y)}    (2.58)

is compact for all θ₀ and that Q(θ*|θ) is continuous in both θ* and θ. Then the limit points of a (G)EM sequence {θ_t} are stationary points of L(θ|y).

Proof. The theorem applies directly to GEM algorithms under the following identifications and assumptions [303]. The objective function L(θ) is the observed data log-likelihood L(θ|y). The solution set S is the set of interior stationary points of L:

S = {θ ∈ int(Θ) | L′(θ|y) = 0}.    (2.59)

The point-to-set algorithm map θ → M(θ_t) for the GEM algorithm is

M(θ_t) = {θ ∈ Θ | Q(θ|θ_t) ≥ Q(θ_t|θ_t)}.    (2.60)

The current estimate θ_t is an element of the set J_t:

θ_t ∈ J_t = {θ ∈ Θ | L(θ|y) ≥ L(θ_{t−1}|y)}.    (2.61)

If θ₀ is the initial estimate then the largest J_t set is

J = J₀ = {θ ∈ Θ | L(θ|y) ≥ L(θ₀|y)}    (2.62)

and J_t ⊆ J for all t by the set definitions and the GEM ascent property. The log-likelihood L(θ|y) is bounded from above. The set J is compact by assumption and θ_t ∈ J for all t, thus satisfying Zangwill's condition (1). This is a generalization of the case where the entire parameter space Θ is compact. It is hard to verify that arbitrary M-map algorithms are closed over S^c in general [315]. But M-maps for (G)EM algorithms are closed whenever Q(θ*|θ) is continuous in both θ* and θ [303], thus satisfying Zangwill's condition (2). Zangwill's conditions (3) and (4) follow from the ascent property and because the log-likelihood has an upper bound. These four conditions imply that Zangwill's convergence theorem holds. Thus the (G)EM iterates {θ_t}_t converge to the solution set S of stationary points of the log-likelihood L.

The final (G)EM estimate is a fixed point of the GEM point-to-set map M. This is because the algorithm converges to estimates θ* such that

θ* ∈ M(θ*).    (2.63)

Point-to-set maps or correspondences feature prominently in the theory of Nash equilibria in game theory.
Kakutani's fixed-point theorem [153] gives conditions under which these maps have fixed points. That theorem applies to (G)EM algorithms only if M(θ_t) is a convex set at each iteration. This is not always the case for EM algorithms.

In summary, (G)EM guarantees that the estimates {θ_t} converge into a set of stationary points for the log-likelihood when the M map is closed. The M map is closed when Q(θ*|θ) is continuous in both θ* and θ. This continuity condition holds for incomplete data models based on exponential-family pdfs [277], [303]. These exponential family models account for a large number of EM applications (e.g. censored gamma, Gaussian mixtures, convolutions of exponential pdfs). Stronger conditions on Q(θ*|θ) and its first-order partial derivatives can replace the closed M-map condition for convergence [191, cf. Theorem 8.2]. Further stronger conditions on Q(θ*|θ) may restrict the solution set to just local maxima instead of stationary points. The (G)EM sequence {θ_t} may not converge point-wise (θ_t → θ*) if the sequence gets trapped in a compact, connected, non-singleton set over which the likelihood is flat (as in Boyles' example [36]). The (G)EM sequence must be a Cauchy sequence or the likelihood must have unique stationary points to avoid this kind of pathology [36], [121], [303]. But this kind of pathology is usually easy to detect in practice. And the convergence of the log-likelihood sequence {L(θ_t|y)}_t to stationary points is more important than the point-wise convergence of the estimate sequence {θ_t}_t for MLE problems.

2.3.2 Fisher Information in EM Algorithms

The Fisher information

ℐ(θ) = E_{Y|θ}[(∂L(θ)/∂θ)²] = −E_{Y|θ}[∂²L(θ)/∂θ²]    (2.64)

plays an important part in MLE methods. The Fisher information measures the average amount of information each sample carries about the parameter θ. It is a measure of precision for ML estimates. This is because the ML estimator θ̂ₙ is asymptotically normal with variance equal to the inverse Fisher information [ℐₙ(θ)]⁻¹.
So the estimator θ̂ₙ for large n has standard error approximately equal to [ℐₙ(θ)]^{−1/2}. Both the observed Fisher information I(θ) = −L″(θ) and the expected Fisher information ℐ(θ) are useful for measuring the standard errors of EM estimates. But Efron and Hinkley [84] showed that the observed Fisher information is the preferred measure of standard errors.

The Fisher information has a further descriptive role for EM algorithms. It quantifies how much missing information the complete data random variable has to impute to complete the observed data. We can use equation (2.33) to analyze the relative information in the observed and complete data samples:

L(θ|y) = Q(θ|θ_t) − H(θ|θ_t)    (2.65)

⟹ ∂²L(θ|y)/∂θ² = ∂²Q(θ|θ_t)/∂θ² − ∂²H(θ|θ_t)/∂θ²    (2.66)

⟹ I_observed = I_complete − I_missing.    (2.67)

The negatives of these second derivatives are the observed Fisher information for the observed data, the complete data, and the missing data respectively. The expectation over the population gives the expected Fisher information ℐ = E[I]. The expectations are over Y:

E_{Y|θ}[∂²L(θ|y)/∂θ²] = E_{Y|θ}[∂²Q(θ|θ_t)/∂θ²] − E_{Y|θ}[∂²H(θ|θ_t)/∂θ²]    (2.68)

⟹ ℐ_observed(θ) = ℐ_complete(θ) − ℐ_missing(θ).    (2.69)

These observed and expected Fisher information decompositions in (2.67) and (2.69) are examples of Orchard and Woodbury's Missing Information Principle [223] in the EM scheme [195]. Equation (2.69) shows that missing information in an EM data model diminishes the information each sample of Y holds about the parameter θ. The more missing information the EM algorithm has to estimate, the less informative the observed samples are. Iterative MLE methods converge slowly to stated standard-error levels when the samples are less informative. Dempster et al. [70] showed that the observed Fisher information controls the rate of EM convergence when the current estimate θ_t is close to the fixed point (the local maximum) θ*.
The EM algorithm convergence rate λ_EM is proportional to the ratio of missing information to complete information [280]:

λ_EM ∝ I_missing / I_complete    (2.70)

in a small neighborhood of θ*. Thus higher missing information (higher I_missing) leads to slower EM convergence, in keeping with the Missing Information Principle.

2.4 Variations on the EM Theme

2.4.1 M-step Variations

There are a number of variations on the EM algorithm. EM variants usually modify either the M-step or the E-step. The M-step modifications are more straightforward. The first M-step variant is the GEM algorithm [70]. GEM is useful if the roots of the derivatives of Q are hard to find.

Another M-step variant is very useful for estimating multidimensional parameters θ = {θᵢ}ᵢᵈ. This variant replaces the M-step's single d-dimensional optimization for θ with a series of 1-dimensional conditional maximizations (CM) for each θᵢ with all other {θⱼ}_{j≠i} fixed. The M-step may also conditionally maximize sub-selections of the parameter vector instead of single parameter dimensions. These conditional maximizations are typically easier than performing a single multidimensional maximization. This variant is the Expectation Conditional Maximization (ECM) algorithm [204].

Fessler and Hero developed the idea of iteratively optimizing subsets of the EM parameter vector. They were working from a medical image reconstruction point of view. They based their approach on the relation in (2.70), which implies that the EM convergence rate λ_EM is inversely proportional to the amount of complete information embedded in the complete data space: λ_EM ∝ (I_complete)⁻¹. Subspaces of the full parameter space correspond to smaller CDSs that are less informative than the full CDS [90]. Therefore iterative optimization on parameter subspaces can lead to faster EM convergence. The algorithm may alternate between different subspaces at each EM iteration. These switches change the complete data space for the E-step.
So they called this EM variant Space Alternating Generalized Expectation-Maximization (SAGE). SAGE, ECM, and other parameter subspace methods are very appealing for applications with large parameter spaces. Image reconstruction and tomography applications [145], [268], [286] make heavy use of these EM methods. The parameter spaces in these applications are very large; every image pixel or voxel corresponds to a parameter.

Liu and Rubin [192] described a simple extra variation on ECM. ECM (and EM in general) maximizes the $Q$-function with the goal of improving the log-likelihood function $L(\theta|y)$. Liu and Rubin proposed replacing some of the CM steps of ECM with steps that conditionally maximize the log-likelihood $L(\theta|y)$ directly. They called this the Expectation Conditional Maximization Either (ECME) algorithm where the "either" implies that the CM-steps either conditionally maximize the $Q$-function or the log-likelihood $L(\theta|y)$. Meng and van Dyk later combined SAGE and ECME into the more general Alternating Expectation Conditional Maximization (AECM) algorithm [205].

2.4.2 E-step Variations

These E-step variants are no longer EM algorithms in the strict sense. They do not always satisfy the EM ascent property. But they share the EM theme and they may satisfy the ascent property with high probability in some cases. E-step variants try to simplify the computation of the conditional expectation and make the $Q$-function simpler to calculate. The $Q$-function is an expectation with respect to a conditional distribution:
$$Q(\theta|\theta_t) = \int_Z \ln f(y, z|\theta)\, f(z|Y{=}y, \theta_t)\, dz \quad (2.71)$$
$$= E_{Z|Y=y,\theta_t}[\ln f(y, Z|\theta)]. \quad (2.72)$$
A Monte Carlo approximation to this integral uses $M$ samples of $Z$ drawn from the current estimate of the conditional pdf $f(z|Y{=}y, \theta_t)$. Then the approximation is
$$\tilde{Q}(\theta|\theta_t) = \frac{1}{M}\sum_{m=1}^{M} \ln f(y, z_m|\theta). \quad (2.73)$$
This is the sample mean of the random variable $\ln f(y, Z|\theta)$ for samples of $Z$ with pdf $f(z|y, \theta_t)$.
Thus the Strong Law of Large Numbers [79] implies that
$$\tilde{Q}(\theta|\theta_t) \to Q(\theta|\theta_t) \text{ with probability one as } M \to \infty. \quad (2.74)$$
Replacing $Q$ with $\tilde{Q}$ gives the Monte Carlo Expectation Maximization (MCEM) algorithm [295]. MCEM is similar to multiple imputation where multiple conditional sample draws replace each missing point in a data set [191], [256]. The MCEM iterations become increasingly similar to EM iterations as the number of Monte Carlo samples $M$ increases. MCEM may need to run multiple secondary iterations of Markov chain Monte Carlo (MCMC) [251] to get samples from the appropriate distribution on each EM iteration.

Stochastic Expectation Maximization (SEM) [46] also uses Monte Carlo methods to approximate the EM approach. SEM "completes" each observed datum $y$ with a single random sample of $z$ drawn from $f(z|Y{=}y, \theta_t)$. The M-step maximizes the complete log-likelihood $L_c(\theta|Y{=}y, Z{=}z)$. The SEM algorithm is most useful for data models where the complete data space has a simple direct product structure (e.g. mixture models). SEM draws a single Monte Carlo sample $z$ per observed data sample $y$ per iteration while MCEM draws $M$ Monte Carlo samples per observed data sample $y$ per iteration. The difference between SEM and MCEM mirrors the difference between single imputation and multiple imputation as methods for handling missing data [191], [265].

2.4.3 MAP EM for Bayesian Parameter Estimation

Maximum A Posteriori (MAP) estimation picks parameters that maximize the posterior distribution $f(\theta|y)$ instead of the likelihood $g(y|\theta)$. Bayes theorem together with a prior pdf $h(\theta)$ specifies the posterior:
$$f(\theta|y) \propto h(\theta)\, g(y|\theta). \quad (2.75)$$
The MAP estimate is thus
$$\hat{\theta}_{MAP} = \arg\max_\theta f(\theta|y) \quad (2.76)$$
or
$$\hat{\theta}_{MAP} = \arg\max_\theta \ln f(\theta|y) \quad (2.77)$$
$$\hat{\theta}_{MAP} = \arg\max_\theta \left(\ln g(y|\theta) + \ln h(\theta)\right). \quad (2.78)$$
The MAP estimate is the mode of the posterior pdf $f(\theta|y)$. MAP estimation reduces to MLE when the prior pdf is a constant function. The log-prior term adds a penalty for poor choices of $\theta$.
This penalty effect makes MAP estimation an example of penalized maximum likelihood estimation [109]. However the penalty term for general penalized MLE methods may be arbitrary and motivated by optimization desiderata rather than specific Bayesian model considerations. The log-prior term usually makes the objective function more concave and thus makes the optimization problem easier to solve [35].

MAP estimation for missing information problems may use a modified variant of the EM algorithm. The MAP variant modifies the $Q$-function by adding a log-prior term $P(\theta) = \ln h(\theta)$:
$$Q(\theta|\theta_t) = E_{Z|Y,\theta_t}[L_c(\theta|y, Z)] + P(\theta). \quad (2.79)$$
The MAP-EM framework is very prominent in the field of medical image analysis [108], [125], [126]. The use of prior knowledge can help smooth the image and prevent reconstruction artifacts [67].

2.5 Examples of EM Algorithms

We now outline EM algorithms for some popular data models. The E-step changes with the data models. The M-step is independent of the model.

2.5.1 EM for Curved-Exponential Family Models

EM algorithms for curved-exponential family models apply to observed data with complete data random variables from exponential family distributions. An exponential-family pdf with vector parameter $\theta$ has the form
$$f(x|\theta) = B(x) \exp\left[T(x) \cdot \eta(\theta) - A(\theta)\right] \quad (2.80)$$
and log-likelihood
$$L(\theta|x) = T(x) \cdot \eta(\theta) - A(\theta) + \ln B(x). \quad (2.81)$$
Examples include the exponential ($\exp(\theta)$), gamma ($\Gamma(\alpha, \theta)$), Poisson ($p(\lambda)$), and normal ($N(\mu, \sigma^2)$) pdfs:
$$L_{\exp}(\theta|x) = -x\theta^{-1} - \ln\theta, \quad (2.82)$$
$$L_{\Gamma}(\alpha, \theta|x) = -x\theta^{-1} + (\alpha - 1)\ln x - \ln\Gamma(\alpha) - \alpha\ln\theta, \quad (2.83)$$
$$L_P(\lambda|x) = x\ln\lambda - \lambda - \ln(x!), \quad (2.84)$$
$$L_N(\mu, \sigma|x) = (x, x^2) \cdot \left(\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right) - \frac{\mu^2}{2\sigma^2} - 0.5\ln(2\pi\sigma^2). \quad (2.85)$$
EM algorithms on these models are the simplest [70], [276], [277] and have the best possible convergence properties [276], [303] for moderate amounts of missing information. Some of the $Q$-functions are simple polynomials or even linear functions in the case of exponential or geometric pdfs. One simple example is the EM algorithm on right-censored exponential data. The complete data random variable is $X \sim \exp(\theta)$.
Right-censorship occurs when the observer only records data up to a certain value $C$. This implies that the observed data $Y$ is the minimum of the complete data $X$ and the censorship point $C$. Censorship is common in time-limited trials (e.g. medical testing) and product reliability analyses. The $Q$-function is
$$Q(\theta|\theta_t) = E_{X|Y=y,\theta_t}\left[L(\theta|X)\right] \quad (2.86)$$
$$= -\ln\theta - \theta^{-1} E_{X|Y,\theta_t}[X]. \quad (2.87)$$
The log-linear $\exp(\theta)$ likelihood means that the EM algorithm just replaces the latent random variable $X$ with the current best estimate for the latent random variable $E_{X|Y,\theta_t}[X]$. The surrogate likelihood is easy to maximize:
$$Q'(\theta|\theta_t) = -\theta^{-1} + \theta^{-2} E_{X|Y,\theta_t}[X] \quad (2.88)$$
$$Q'(\theta|\theta_t)\big|_{\theta=\theta_{t+1}} = 0 \quad (2.89)$$
$$\Rightarrow \theta_{t+1} = E_{X|Y,\theta_t}[X]. \quad (2.90)$$
Equation (2.90) is the EM update for any incomplete data model on the exponential complete data space. The conditional expectation in the right-censorship case depends on the relation $Y = \min\{X, C\}$. Thus
$$E_{X|Y=y,\theta_t}[X] = \begin{cases} y & \text{if } y < C \\ E_{X|\{X \geq C\},\theta_t}[X] & \text{if } y = C \end{cases} \quad (2.91)$$
$$\Rightarrow E_{X|Y=y,\theta_t}[X] = \begin{cases} y & \text{if } y < C \\ C + \theta_t & \text{if } y = C \end{cases} \quad (2.92)$$
by the memoryless property of the exponential distribution. The gamma EM algorithm generalizes the exponential EM algorithm using the following $Q$-function for the vector parameter $\varphi = (\alpha, \theta)$:
$$Q(\varphi|\varphi_t) = E_{X|Y=y,\varphi_t}\left[-X\theta^{-1} + (\alpha - 1)\ln X\right] - \ln\Gamma(\alpha) - \alpha\ln\theta. \quad (2.93)$$
Taking the conditional expectation gives
$$Q(\varphi|\varphi_t) = -E_{X|y,\varphi_t}[X]\,\theta^{-1} + (\alpha - 1)E_{X|y,\varphi_t}[\ln X] - \ln\Gamma(\alpha) - \alpha\ln\theta \quad (2.94)$$
and
$$\nabla_\varphi Q(\varphi|\varphi_t) = \left\{E_{X|y,\varphi_t}[\ln X] - \frac{\partial \ln\Gamma(\alpha)}{\partial\alpha} - \ln\theta,\;\; E_{X|y,\varphi_t}[X]\,\theta^{-2} - \alpha\theta^{-1}\right\} \quad (2.95)$$
$$\nabla_\varphi Q(\varphi|\varphi_t)\big|_{\varphi=\varphi_{t+1}} = 0. \quad (2.96)$$
The transcendental nature of the $\Gamma$-function and its derivative results in intractable closed forms for the M-step for general values of $\alpha$. The M-step can use numerical estimates here. GEM algorithms on this model would have a simpler numerical M-step. The M-step just does a local search for an estimate that increases the $Q$-function. An ECM algorithm is also useful here because the $Q$-function is complicated only in the $\alpha$ coordinate. Users can decompose the M-step into an analytic conditional maximization in the $\theta$ coordinate and a numerical conditional maximization in the $\alpha$ coordinate.
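The censored-exponential updates (2.90)–(2.92) fit in a few lines of code. The sketch below is illustrative (the function and variable names are mine, not the dissertation's). The E-step imputes each censored sample with $C + \theta_t$ and the M-step averages the imputed values. One can verify that the fixed point of this iteration equals the closed-form censored-exponential MLE $\hat{\theta} = \sum_i y_i / \#\{y_i < C\}$.

```python
import random

def em_censored_exponential(y, C, theta0=1.0, iters=50):
    """EM for X ~ Exp(mean theta) right-censored at C, so Y = min(X, C).
    E-step: impute E[X | Y = y_i, theta_t] via (2.91)-(2.92).
    M-step: update (2.90) is the mean of the imputed values."""
    theta = theta0
    for _ in range(iters):
        # memoryless tail: a censored sample expects C + theta_t more life
        imputed = [yi if yi < C else C + theta for yi in y]
        theta = sum(imputed) / len(imputed)       # theta_{t+1}
    return theta

# Simulated check (hypothetical demo data, not from the dissertation):
random.seed(3)
true_theta, C = 2.0, 2.5
x = [random.expovariate(1 / true_theta) for _ in range(5000)]  # complete data
y = [min(xi, C) for xi in x]                                   # observed data
theta_hat = em_censored_exponential(y, C)
```

The iteration converges linearly at a rate set by the censored fraction, which is exactly the missing-to-complete information ratio in (2.70).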
2.5.2 EM for Finite Mixture Models

A finite mixture model [201], [248] is a convex combination of a finite set of sub-populations. The sub-population pdfs come from the same parametric family. Mixture models are useful for modeling mixed populations for statistical applications such as clustering and pattern recognition [307]. We use the following notation for mixture models. $Y$ is the observed mixed random variable. $K$ is the number of sub-populations. $Z \in \{1, \ldots, K\}$ is the hidden sub-population index random variable. The convex population mixing proportions $\alpha_1, \ldots, \alpha_K$ form a discrete pdf for $Z$: $P(Z = j) = \alpha_j$. The pdf $f(y|Z{=}j, \theta_j)$ is the pdf of the $j$-th sub-population where $\theta_1, \ldots, \theta_K$ are the pdf parameters for each sub-population. $\Theta$ is the vector of all model parameters $\Theta = \{\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K\}$. The joint pdf $f(y, z|\Theta)$ is
$$f(y, z|\Theta) = \sum_{j=1}^{K} \alpha_j f(y|j, \theta_j)\, \delta[z - j]. \quad (2.97)$$
The marginal pdf for $Y$ and the conditional pdf for $Z$ given $y$ are
$$f(y|\Theta) = \sum_j \alpha_j f(y|j, \theta_j) \quad (2.98)$$
and
$$p_Z(j|y, \Theta) = \frac{\alpha_j f(y|Z{=}j, \theta_j)}{f(y|\Theta)} \quad (2.99)$$
by Bayes theorem. We rewrite the joint pdf in exponential form for ease of analysis:
$$f(y, z|\Theta) = \exp\left[\sum_j \left(\ln(\alpha_j) + \ln f(y|j, \theta_j)\right)\delta[z - j]\right], \quad (2.100)$$
$$\ln f(y, z|\Theta) = \sum_j \delta[z - j] \ln\left[\alpha_j f(y|j, \theta_j)\right]. \quad (2.101)$$
EM algorithms for finite mixture models estimate $\Theta$ using the sub-population index $Z$ as the latent variable. An EM algorithm on a finite mixture model uses (2.99) to derive the $Q$-function:
$$Q(\Theta|\Theta_t) = E_{Z|y,\Theta_t}[\ln f(y, Z|\Theta)] \quad (2.102)$$
$$= \sum_z \left(\sum_j \delta[z - j] \ln\left[\alpha_j f(y|j, \theta_j)\right]\right) p_Z(z|y, \Theta_t) \quad (2.103)$$
$$= \sum_j \ln\left[\alpha_j f(y|j, \theta_j)\right] p_Z(j|y, \Theta_t). \quad (2.104)$$
This mixture model is versatile because we can replace the sub-populations by changing the function $f(y|j, \theta_j)$. We can also use different parametric pdfs for the sub-populations. The versatility of FMMs makes them very popular in areas such as data clustering, genomics, proteomics, image segmentation, speech recognition, and speaker identification. Mixture models are not identifiable [281], [282].
This means there may not be a one-to-one mapping between distinct mixture model distributions and distinct parameter vectors in the FMM parameter space. So there may be no unique maximum likelihood parameter estimate that specifies an FMM.

Figure 2.2: Demonstration: finite mixture models (FMMs) may not be uniquely identifiable. The figure shows a likelihood function for 8 samples of a mixture of two Gaussian distributions with equal-variance sub-populations. Points on the likelihood space coordinate $\psi = (\alpha, \mu)$ represent all possible FMM distributions of the form $F = \alpha N(\mu, 1.5) + (1 - \alpha) N(-\mu, 1.5)$. The 8 sample points generating the likelihood in the figure come from an FMM with true parameter values $(\alpha_*, \mu_*) = (0.3, 3)$. But the sample likelihood function has two maxima on the coordinate space at approximately $\psi_1 = (\alpha_*, \mu_*)$ and $\psi_2 = (1 - \alpha_*, -\mu_*)$. These are the dots in the contour plot. Both maxima represent alternate parameterizations of the same distribution.

2.5.3 EM for Positron Emission Tomography

Positron Emission Tomography (PET) constructs a radioisotope emission profile for a biological organ. The emission profile image represents levels of metabolic activity, blood flow, or chemical activity within the organ. A PET scan starts with an injection of a radioisotope into the tissue of interest. An array of detectors around the tissue captures the radioisotope emissions (positrons in PET scans) from within the tissue. The model assumes a "fine enough" grid over the tissue so that emission statistics are uniform within each grid square. PET scans usually generate a series of 2-D images. So each grid square is a picture element or pixel. Each pixel can emit particles. The emissions are discrete events and the detectors simply count the number of emissions along multiple lines-of-sight. The geometry of the detector array determines the mode of reconstruction. It also sets up a latent random variable structure that admits an EM model which Shepp and Vardi developed [268], [286].
The PET-EM model is a geometry-dependent finite mixture model where the sub-populations have a stable discrete distribution. The probabilistic model for PET image reconstruction via EM assumes that emissions from each pixel are Poisson-distributed. A detector can only observe the sum of all pixel emissions along the detector's line-of-sight. So the detector records a geometry-dependent sum of Poisson random variables. The detector response is the observed random variable. The individual pixel emissions are the complete unobserved random variables. And the pixel intensities (Poisson parameters for each pixel) are the parameters we want to optimize. The model starts with a 2-D array of tissue pixels
$$X = \{X_i\}_i^n : X_i \sim \text{Poisson}(\lambda_i) \quad (2.105)$$
where $n$ is the image size. The detector array is
$$Y = \{Y_j\}_j^d : Y_j \sim \text{Poisson}(\mu_j) \quad (2.106)$$
where $d$ is the number of detectors. An emission from $X_i$ can go to one of many detectors $Y_j$. The probability of emitting at $X_i$ and detecting at $Y_j$ is $p_{ij}$. This gives an array of parameters
$$P = ((p_{ij}))_{i,j}^{n,d} \quad (2.107)$$
where
$$p_{ij} = P\left(\text{Emit at } X_i,\ \text{Detect at } Y_j\right). \quad (2.108)$$
$P$ depends on the geometry of the detector array. Define $Z_{ij}$ as the portion of the $X_i$ emissions that detector $Y_j$ receives. Thinning the Poisson emission stream gives
$$Z_{ij} \sim \text{Poisson}(p_{ij}\lambda_i). \quad (2.109)$$
So
$$Y_j = \sum_i^n Z_{ij} \quad (2.110)$$
and
$$Y_j \sim \text{Poisson}\left(\sum_i^n p_{ij}\lambda_i\right). \quad (2.111)$$
The estimate for $\lambda = \{\lambda_i\}_i^n$ gives the average emission intensities at each pixel. This is the PET image reconstruction.

We use $Z = \{Z_{ij}\}_{i,j}$ as the complete data random variable for the EM model. This gives the complete data pdf under an independent pixel assumption
$$f(z|\lambda) = \prod_j \prod_i f(z_{ij}|\lambda) \quad (2.112)$$
$$= \prod_j \prod_i \exp(-p_{ij}\lambda_i)\, \frac{(p_{ij}\lambda_i)^{z_{ij}}}{z_{ij}!}. \quad (2.113)$$
Thus
$$\ln f(z|\lambda) = \sum_j \sum_i \left[-p_{ij}\lambda_i + z_{ij}\ln(p_{ij}\lambda_i) - \ln(z_{ij}!)\right]. \quad (2.114)$$
The $Q$-function for the EM model is
$$Q(\lambda|\lambda(t)) = E_{Z|Y,\lambda(t)}\left[\ln f(Z|\lambda)\right] \quad (2.115)$$
$$= \sum_j \sum_i \left[-p_{ij}\lambda_i + E_{Z|Y,\lambda(t)}[Z_{ij}]\ln(p_{ij}\lambda_i) - E_{Z|Y,\lambda(t)}[\ln(Z_{ij}!)]\right]. \quad (2.116)$$
The Poisson pdf is an exponential-family pdf.
So the $Q$-maximization has the closed form:
$$\lambda_i(t+1) = \lambda_i(t) \sum_j^d \frac{Y_j\, p_{ij}}{\sum_k^n \lambda_k(t)\, p_{kj}}. \quad (2.117)$$
There has been a lot of work done on PET with EM. Some EM innovations originated from this problem space. These include the use of penalty terms or Gibbs priors to avoid singularities in the objective function [91], [125], block iterative approaches to estimation [145], and methods for convexifying the reconstruction objective function [67]. The basic structure of the model remains the same.

2.6 MM: A Generalization for EM

The EM algorithm relies on an incomplete data structure for which we supply a complete data space. Some estimation problems do not fit into this structure (e.g. estimating logistic regression coefficients). This invalidates the E-step. Minorization-Maximization (or Majorization-Minimization) (MM) [21], [147] algorithms generalize the idea of a surrogate optimization function (the $Q$-function in EM). An MM algorithm specifies a tangent minorizing function. A function $q$ is a tangent minorizer for $L(\theta|y)$ if $q(\theta|\theta_t)$ is a $\theta$-parametrized function which depends on the current estimate $\theta_t$ such that [304]
$$q(\theta|\theta_t) \leq L(\theta|y) \quad \forall \theta \in \Theta \quad (2.118)$$
and
$$q(\theta_t|\theta_t) = L(\theta_t|y). \quad (2.119)$$
MM algorithms maximize the surrogate $q$-function instead of $L(\theta|y)$ just like the EM algorithm:
$$\theta_{t+1} = \arg\max_\theta q(\theta|\theta_t). \quad (2.120)$$
The difference of equations (2.118) and (2.119)
$$q(\theta|\theta_t) - q(\theta_t|\theta_t) \leq L(\theta|y) - L(\theta_t|y) \quad (2.121)$$
establishes an analogous ascent property for MM algorithms. The MM algorithm transfers the optimization from the original objective function to a minorizing (or majorizing) function. Ideally the minorizing function is easier to optimize, like the $Q$-function in the EM algorithm. Lange et al. [185] use the term optimization transfer instead of "MM" to highlight this transference behavior. The EM algorithm fits into the MM scheme with a simple modification: EM's $Q$-function needs a constant level-shift $L(\theta_t|y) - Q(\theta_t|\theta_t)$ to satisfy the MM conditions. Level-shifts do not change the optimization.
So MM subsumes the EM algorithm. This subsumption depends on the decomposition $L = Q - H$ (2.33) and the ascent property (Prop. 2.1). The minorizer for the EM algorithm is
$$q_{EM}(\theta|\theta_t) = Q(\theta|\theta_t) + \left[L(\theta_t|y) - Q(\theta_t|\theta_t)\right]. \quad (2.122)$$
The ascent property implies the minorizing condition in (2.118):
$$L(\theta|y) - L(\theta_t|y) \geq Q(\theta|\theta_t) - Q(\theta_t|\theta_t) \quad (2.123)$$
$$\Rightarrow L(\theta|y) \geq Q(\theta|\theta_t) + \left[L(\theta_t|y) - Q(\theta_t|\theta_t)\right]. \quad (2.124)$$
The tangency condition in (2.119) holds because
$$q_{EM}(\theta_t|\theta_t) = Q(\theta_t|\theta_t) - Q(\theta_t|\theta_t) + L(\theta_t|y) = L(\theta_t|y).$$
The M-step is the same (or may be one of the aforementioned M-step variants). This establishes EM as a special case of MM.

The EM subsumption argument shows that a primary property of EM and MM algorithms is that the gradient vectors of the objective and surrogate functions are parallel at the current estimate $\theta_t$. This ensures that the surrogate function inherits the objective function's direction of steepest ascent/descent at $\theta_t$. The MM tangency property is just an indirect means to this goal. The tangency condition implies this parallel gradient vector property because curves that are tangent (but not crossing) at a point have parallel gradient vectors at that point of tangency. The EM algorithm also satisfies this parallel gradient property without requiring tangency since
$$\nabla_\theta L(\theta|y)\big|_{\theta=\theta_t} = \nabla_\theta Q(\theta|\theta_t)\big|_{\theta=\theta_t}. \quad (2.125)$$
This is one of Oakes' results on the derivatives of the observed likelihood in EM algorithms [203], [222]. The result holds because the residual $H$ in the $L = Q - H$ decomposition also has a stationary point precisely at $\theta = \theta_t$.

The Quadratic Lower-Bound (QLB) algorithm [32] is another example of an MM algorithm. The QLB $q$-function comes from local convexity considerations instead of EM's missing information considerations.
For example, a QLB algorithm on an MLE problem would use the following minorizer:
$$q(\theta|\theta_t) = L(\theta_t|y) + (\theta - \theta_t)^T \nabla L(\theta_t|y) - \frac{1}{2}(\theta - \theta_t)^T B\, (\theta - \theta_t) \quad (2.126)$$
where $B$ is any positive definite matrix that satisfies the inequality
$$\nabla^2 L(\theta|y) + B \succeq 0. \quad (2.127)$$
The QLB update equation is thus:
$$\theta_{t+1} = \theta_t + B^{-1} \nabla L(\theta_t|y). \quad (2.128)$$
These QLB constraints effectively create local quadratic approximations for $L$ at every step $t$. Thus QLB produces concave quadratic minorizing functions at every QLB iteration. This is not necessarily true for the EM algorithm in general.

MM methods bypass the need for a CDS specification. But the user must specify custom tangent minorizers for each estimation problem. Proponents of MM methods argue that designing minorizer functions is less difficult than designing complete data spaces for an EM model. And MM methods provide an extra level of flexibility: there are often many MM approaches to the same problem. However there is no published proof of convergence for general MM algorithms [147]. I present a new convergence proof for a restricted class of MM algorithms. It mirrors Wu's EM convergence theorem (Theorem 2.2).

Theorem 2.3. [MM Convergence Theorem]: A minorization-maximization algorithm for optimizing a continuous, upper-bounded objective function $L(\theta|y)$ converges to the set of stationary points of $L(\theta|y)$ if the following conditions hold: the set $J$
$$J = \{\theta \in \Theta \,|\, L(\theta|y) \geq L(\theta_0|y)\} \quad (2.129)$$
is compact for all $\theta_0$, and the MM point-to-set algorithm map $\theta \to M(\theta_t)$ is closed over $S_{MM}^c$. This MM map is
$$M(\theta_t) = \{\theta \in \Theta \,|\, q(\theta|\theta_t) \geq q(\theta_t|\theta_t)\}. \quad (2.130)$$
Proof. Zangwill's Global Convergence Theorem A (Theorem 2.1) applies directly to MM algorithms under the following identifications and assumptions. The objective function $L(\theta)$ is the MM objective function $L(\theta|y)$. The solution set $S_{MM}$ is the set of interior stationary points of $L$:
$$S_{MM} = \{\theta \in \text{int}(\Theta) \,|\, L'(\theta|y) = 0\}. \quad (2.131)$$
The compactness assumption implies that Zangwill's condition (1) holds just like in the (G)EM case.
The closure assumption fulfills Zangwill's condition (2). Conditions (3) and (4) in Zangwill's theorem follow from the ascent property and the boundedness of the objective function $L$. Thus the iterates converge to limit points in the solution set of stationary points.

Majorization-minimization is equivalent to the minorization-maximization of the negative of the objective function. So the proof applies to both types of MM algorithms.

Continuity of the minorizing function $q(\cdot|\cdot)$ in both $\theta$ and $\theta_t$ implies that the $M$ map is closed over $S^c$. EM and QLB algorithms are examples of MM algorithms that satisfy this closure condition. But the class of MM algorithms may be broad enough to include algorithms which violate the closure condition.

2.7 Conclusion

The EM algorithm is a versatile tool for analyzing incomplete data. But the EM algorithm has some drawbacks. It may converge slowly for high-dimensional parameter spaces or when the algorithm needs to estimate large amounts of missing information (see §2.3.2, [200], [280]). EM implementations can also get very complicated for some models. And there is also no guarantee that the EM algorithm converges to the global maximum.

The next few chapters develop and demonstrate a new EM scheme that addresses the EM algorithm's slow convergence. It improves average EM convergence rates via a novel application of a noise benefit or "stochastic resonance". Chapter 3 presents the theory behind this approach. Chapters 4, 5, and 6 demonstrate the speed improvement in key EM applications.

Chapter 3: Noisy Expectation-Maximization (NEM)

This chapter introduces the idea of a noise benefit as a way to improve the average convergence speed of the EM algorithm. The result is a noise-injected version of the Expectation-Maximization (EM) algorithm: the Noisy Expectation Maximization (NEM) algorithm. The NEM algorithm (Algorithm 3.1) uses noise to speed up the convergence of the noiseless EM algorithm.
The NEM theorem (Theorem 3.1) shows that additive noise speeds up the average convergence of the EM algorithm to a local maximum of the likelihood surface if a positivity condition holds. Corollary results give special cases when noise improves the EM algorithm. We demonstrate these noise benefits on EM algorithms for three data models: the Gaussian mixture model (GMM), the Cauchy mixture model (CMM), and the censored log-convex gamma model. The NEM positivity condition simplifies to a quadratic inequality in the GMM and CMM cases. A final theorem shows that the noise benefit for independent identically distributed additive noise decreases with sample size in mixture models. This theorem implies that the noise benefit is most pronounced if the data is sparse.

The next section (§3.1) reviews the concept of a noise benefit or stochastic resonance (SR). We then formulate the idea of a noise benefit for EM algorithms and discuss some intuition behind noise benefits in this context. A Noisy EM algorithm is any EM algorithm that makes use of noise benefits to improve the performance of the EM algorithm. §3.2 introduces the theorem and corollaries that underpin the NEM algorithm. §3.3 presents the NEM algorithm and some of its variants. §3.4 presents a theorem that describes how sample size affects the NEM algorithm for mixture models when the noise is independent and identically distributed (i.i.d.). §3.4 also shows how the NEM positivity condition arises from the central limit theorem and the law of large numbers.

3.1 Noise Benefits and Stochastic Resonance

A noise benefit or stochastic resonance (SR) [40], [100], [173] occurs when noise improves a signal system's performance: small amounts of noise improve the performance while too much noise degrades it. Examples of such improvements include improvements in signal-to-noise ratio [41], Fisher information [49], [51], cross-correlation [57], [58], or mutual information [175] between the input and output signal.
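A toy simulation makes the SR effect concrete. The sketch below is an illustrative example of my own, not one of the dissertation's simulations. It passes a subthreshold sinusoid through a hard threshold detector and measures the input-output correlation: the correlation is zero without noise, rises for moderate noise, and falls again when the noise swamps the signal.

```python
import math
import random

def detection_correlation(noise_std, n=20_000, seed=7):
    """Toy threshold detector: a subthreshold signal s(t) in [-0.4, 0.4]
    passes through a hard threshold at 1.0. Without noise the output never
    fires; the right amount of noise lets the signal peek through."""
    rng = random.Random(seed)
    s = [0.4 * math.sin(2 * math.pi * t / 100) for t in range(n)]
    out = [1.0 if si + rng.gauss(0, noise_std) > 1.0 else 0.0 for si in s]
    # Pearson correlation between input signal and detector output.
    ms, mo = sum(s) / n, sum(out) / n
    cov = sum((a - ms) * (b - mo) for a, b in zip(s, out)) / n
    vs = sum((a - ms) ** 2 for a in s) / n
    vo = sum((b - mo) ** 2 for b in out) / n
    return cov / math.sqrt(vs * vo) if vs * vo > 0 else 0.0

corrs = {sd: detection_correlation(sd) for sd in (0.0, 0.5, 5.0)}
# Zero noise: the output never fires, so the correlation is 0. Moderate
# noise (0.5): the firing probability tracks s(t) and the correlation jumps.
# Large noise (5.0): the output is mostly random and the correlation falls.
```

Sweeping `noise_std` over a finer grid traces out the inverted-U signature curve that the chapter describes: the figure-of-merit peaks at an intermediate noise power.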
Much early work on noise benefits involved natural systems in physics [38], chemistry [98], [180], and biology [1], [60], [118], [215]. Early descriptions of noise benefits include Benzi's model for ice age periodicity [23], [24], noise-induced SNR improvements in bidirectional ring lasers [202], and Kramers' model of Brownian motion-assisted particle escape from chemical potential wells [180]. These early works inspired the search for noise benefits in nonlinear signal processing and statistical estimation [41], [50], [53], [99], [199], [236]. Some of these statistical signal processing works describe screening conditions under which a system exhibits noise benefits [237]. Forbidden interval theorems [179], for example, give necessary and sufficient conditions for SR in threshold detection of weak binary signals. These signal processing screening theorems also help explain some previously observed noise benefits in natural systems [178].

Figure 3.1 shows a typical example of a noise benefit. The underlying grayscale image is barely perceptible without any additive noise (leftmost panel). Additive pixel noise makes the image more pronounced (middle panels). The image becomes worse as the noise power increases (rightmost panel). This example demonstrates how noise can improve the detection of subthreshold signals. Noise benefits appear as non-monotonic curves of the system's figure-of-merit as a function of noise power. Noise improves the figure-of-merit only up to a point. Further noise degrades the figure-of-merit. Figure 3.2 below is an example of such a non-monotonic SR signature curve. This chapter presents a new type of noise benefit on the popular EM algorithm (§2).

Figure 3.1: Stochastic resonance on faint images (mandrill, Lena, and Elaine test images) using white Gaussian pixel noise. The images are faint because the gray-scale pixels pass through a suboptimal binary thresholder.
The faint images (leftmost panels) become clearer as the power of the additive white Gaussian pixel noise increases. But increasing the noise power too much degrades the image beyond recognition (rightmost panels).

A key weakness of the EM algorithm is its slow convergence speed in many applications [200]. Careful noise injection can increase the average convergence speed of the EM algorithm. Theorem 3.1 states a general sufficient condition for this EM noise benefit. The EM noise benefit does not involve a signal threshold unlike almost all SR noise benefits [100]. We apply this general sufficient condition and demonstrate the EM noise benefit on three data models: the ubiquitous Gaussian mixture model (Figure 3.2), the Cauchy mixture model (Figure 3.3), and the censored gamma model (Figure 3.7). The simulations in Figure 3.11 and Figure 3.12 also show that the noise benefit is faint or absent if the system simply injects blind noise that ignores the sufficient condition. This suggests that the sufficient condition may also be a necessary condition for some data models. The last results of this chapter show that the noise benefit occurs most sharply in sparse data sets.

3.1.1 Noise Benefits in the EM Algorithm

Theorem 3.1 below states a general sufficient condition for a noise benefit in the average convergence time of the EM algorithm. Figure 3.2 shows a simulation instance of this theorem for the important EM case of Gaussian mixture densities. Small values of the noise variance reduce convergence time while larger values increase it. This U-shaped noise benefit is the non-monotonic signature of stochastic resonance. The optimal noise speeds convergence by 27.2%. Other simulations on multidimensional GMMs have shown speed increases of up to 40%.

Figure 3.2: EM noise benefit for a Gaussian mixture model. The plot uses the noise-annealed NEM algorithm. Low intensity starting noise decreases convergence time while higher intensity starting noise increases it.
The optimal initial noise level has standard deviation $\sigma_N = 2.5$. The average optimal NEM speed-up over the noiseless EM algorithm is 27.2%. This NEM procedure adds noise with a cooling schedule. The noise cools at an inverse-square rate. The Gaussian mixture density is a convex combination of two normal sub-populations $N_1$ and $N_2$. The simulation uses 200 samples of the mixture normal distribution to estimate the standard deviations of the two sub-populations. The additive noise uses samples of zero-mean normal noise with standard deviation $\sigma_N$ screened through the GMM-NEM condition in (3.72). Each sampled point on the curve is the average of 100 trials. The vertical bars are 95% bootstrap confidence intervals for the mean convergence time at each noise level.

The EM noise benefit differs from almost all SR noise benefits because it does not involve the use of a signal threshold [100]. The EM noise benefit also differs from most SR noise benefits because the additive noise can depend on the signal. Independent noise can lead to weaker noise benefits than dependent noise in EM algorithms. This also happens with enhanced convergence in noise-injected Markov chains [99]. Figure 3.11 shows that the proper dependent noise outperforms independent noise at all tested sample sizes for a Gaussian mixture model. The dependent noise model converges up to 14.5% faster than the independent noise model.

3.1.2 Intuition on EM Noise Benefits

We can gain some intuition into the EM noise benefit by examining the space of plausible likelihood functions for the observed data. The likelihood function $\ell(\theta|y)$ is a statistic of the observed data $y$. Its functional behavior often overshadows its statistical behavior. We use the likelihood function's statistical behavior to produce a noise benefit. Suppose an experiment produces a data sample $y$.¹ Suppose also that the sampling pdf $f(y|\theta)$ is model-appropriate and sufficiently smooth.
Then there is an $n$-ball $B_n(y)$ of samples around $Y = y$ that the experiment could have emitted with almost equal probability. The size of this ball depends on the steepness and continuity of the sampling pdf at $y$. This understanding is consistent with the probabilistic model for the experiment but easy to ignore when working with sample realizations $y$. This $n$-ball of alternate representative samples specifies a set of likelihoods
$$\mathcal{L}(\theta|y) = \{\ell(\theta|y') \,|\, y' \in B_n(y)\} \quad (3.1)$$
that can also model the experiment. Some of the alternate samples $y' \in B_n(y)$ assign a higher likelihood $\ell(\theta_*|y')$ to the true $\theta_*$ than the current sample's likelihood $\ell(\theta_*|y)$ does. So it is useful to pay attention to $\mathcal{L}(\theta|y)$ while seeking estimates for $\theta$.

Analytic MLE finds the best estimate for $\theta$ in a single iteration. So there is little point in analyzing estimates from the set of alternative likelihoods $\mathcal{L}(\theta|y)$. The set $\mathcal{L}(\theta|y)$ provides enough information to calculate bootstrap estimates [85] for the standard error of $\hat{\theta}$ for example. But the analytic Fisher information already summarizes that information since the standard error is the inverse square root of the Fisher information.

¹ If the data model has a sufficient statistic $T(y_1, \ldots, y_n)$ then we can apply the subsequent analysis to the sampling pdf of $T(y_1, \ldots, y_n)$ instead.

EM and other iterative MLE techniques can benefit from the information $\mathcal{L}(\theta|y)$ contains. The NEM algorithm uses noise to pick favorable likelihoods in $\mathcal{L}(\theta|y)$ that result in higher likelihood intermediate estimates $\theta_k$. This leads to faster EM convergence. The NEM condition in Theorem 3.1 describes how to screen and pick favorable likelihoods at the current EM iteration. The idea behind the NEM condition is that sometimes small sample noise $n$ can increase the likelihood of a parameter $\theta$. This occurs at the local level when
$$f(y + n|\theta) > f(y|\theta) \quad (3.2)$$
for probability density function (pdf) $f$, realization $y$ of random variable $Y$, realization $n$ of random noise $N$, and parameter $\theta$.
This condition holds if and only if the logarithm of the pdf ratio is positive:
$$\ln\left(\frac{f(y + n|\theta)}{f(y|\theta)}\right) > 0. \quad (3.3)$$
The logarithmic condition (3.3) in turn occurs much more generally if it holds only on average with respect to all the pdfs involved in the EM algorithm:
$$E_{Y,Z,N|\theta_*}\left[\ln\frac{f(Y + N, Z|\theta_k)}{f(Y, Z|\theta_k)}\right] \geq 0 \quad (3.4)$$
where random variable $Z$ represents missing data in the EM algorithm and where $\theta_*$ is the limit of the EM estimates $\theta_k$: $\theta_k \to \theta_*$. The positivity condition (3.4) is precisely the sufficient condition for a noise benefit in Theorem 3.1 below.

3.2 Noisy Expectation Maximization Theorems

The EM noise benefit requires the use of the modified surrogate log-likelihood function
$$Q_N(\theta|\theta_k) = E_{Z|y,\theta_k}[\ln f(y + N, Z|\theta)] \quad (3.5)$$
and its maximizer
$$\theta_{k+1,N} = \arg\max_\theta \{Q_N(\theta|\theta_k)\}.$$
It quantifies the average information loss incurred by using the pdf f_1(x) instead of f_0(x) to describe random samples of X [26]. It is an example of a Bayes risk [26], [42]. Let f_0(x) be the final EM complete pdf and f_1(x) be the current EM or NEM complete pdf. Then the relative-entropy pseudo-distances in the noisy EM context are

c_k(N) = D(f(y, z|θ*) ‖ f(y+N, z|θ_k))   (3.12)

and

c_k = D(f(y, z|θ*) ‖ f(y, z|θ_k)) = c_k(0).   (3.13)

The average noise benefit (3.9) occurs when the final EM pdf f(y, z|θ*) is closer in relative entropy to the noisy pdf f(y+N, z|θ_k) than it is to the noiseless pdf f(y, z|θ_k). So noise benefits the EM algorithm when

c_k ≥ c_k(N).   (3.14)

This means that the noisy pdf is a better information-theoretic description of the complete data than the regular pdf. The noisy pdf incurs lower average information loss or Bayes risk than the regular pdf. The proof of the NEM theorem in the next section shows that the relative-entropy and expected-Q-difference formulations of the average noise benefit are equivalent. Manipulating the relative-entropy formulation leads to the NEM sufficient condition that guarantees the average EM noise benefit.

3.2.1 NEM Theorem

The Noisy Expectation Maximization (NEM) Theorem below uses the following notation. The noise random variable N has pdf f(n|y). So the noise N can depend on the data Y. Independence implies that the noise pdf becomes f(n|y) = f_N(n). {θ_k} is a sequence of EM estimates for θ. θ* = lim_{k→∞} θ_k is the converged EM estimate for θ. Assume that the differential entropy of all random variables is finite. Assume also that the additive noise keeps the data in the likelihood function's support.

Theorem 3.1. [Noisy Expectation Maximization (NEM)]: An EM iteration noise benefit

Q(θ*|θ*) − Q(θ_k|θ*) ≥ Q(θ*|θ*) − Q_N(θ_k|θ*)   (3.15)

occurs on average if

E_{Y,Z,N|θ*}[ln(f(Y+N, Z|θ_k) / f(Y, Z|θ_k))] ≥ 0.   (3.16)

Proof. First show that each expectation of Q-function differences in (3.9) is a distance pseudo-metric.
Rewrite Q as an integral:

Q(θ|θ_k) = ∫_Z ln[f(y, z|θ)] f(z|y, θ_k) dz.   (3.17)

c_k = D(f(y, z|θ*) ‖ f(y, z|θ_k)) is an expectation over Y because

c_k = ∬ [ln f(y, z|θ*) − ln f(y, z|θ_k)] f(y, z|θ*) dz dy   (3.18)
= ∬ [ln f(y, z|θ*) − ln f(y, z|θ_k)] f(z|y, θ*) f(y|θ*) dz dy   (3.19)
= E_{Y|θ*}[Q(θ*|θ*) − Q(θ_k|θ*)].   (3.20)

c_k(N) is likewise an expectation over Y:

c_k(N) = ∬ [ln f(y, z|θ*) − ln f(y+N, z|θ_k)] f(y, z|θ*) dz dy   (3.21)
= ∬ [ln f(y, z|θ*) − ln f(y+N, z|θ_k)] f(z|y, θ*) f(y|θ*) dz dy   (3.22)
= E_{Y|θ*}[Q(θ*|θ*) − Q_N(θ_k|θ*)].   (3.23)

Take the noise expectation of c_k and c_k(N). The distance c_k does not depend on N so

E_N[c_k] = c_k   (3.24)

while the noisy distance c_k(N) keeps its dependence on the noise under the expectation E_N[c_k(N)].   (3.25)

So the distance inequality

c_k ≥ E_N[c_k(N)]   (3.26)

guarantees that noise benefits occur on average:

E_{N,Y|θ*}[Q(θ*|θ*) − Q(θ_k|θ*)] ≥ E_{N,Y|θ*}[Q(θ*|θ*) − Q_N(θ_k|θ*)].   (3.27)

We use the inequality condition (3.26) to derive a more useful sufficient condition for a noise benefit. Expand the difference of relative-entropy terms c_k − c_k(N):

c_k − c_k(N) = ∬_{Y,Z} [ln(f(y, z|θ*) / f(y, z|θ_k)) − ln(f(y, z|θ*) / f(y+N, z|θ_k))] f(y, z|θ*) dy dz   (3.28)
= ∬_{Y,Z} [ln(f(y, z|θ*) / f(y, z|θ_k)) + ln(f(y+N, z|θ_k) / f(y, z|θ*))] f(y, z|θ*) dy dz   (3.29)
= ∬_{Y,Z} ln[(f(y, z|θ*) f(y+N, z|θ_k)) / (f(y, z|θ_k) f(y, z|θ*))] f(y, z|θ*) dy dz   (3.30)
= ∬_{Y,Z} ln[f(y+N, z|θ_k) / f(y, z|θ_k)] f(y, z|θ*) dy dz.   (3.31)

Take the expectation with respect to the noise term N:

E_N[c_k − c_k(N)] = c_k − E_N[c_k(N)]   (3.32)
= ∫_N ∬_{Y,Z} ln[f(y+n, z|θ_k) / f(y, z|θ_k)] f(y, z|θ*) f(n|y) dy dz dn   (3.33)
= ∬_{Y,Z} ∫_N ln[f(y+n, z|θ_k) / f(y, z|θ_k)] f(n|y) f(y, z|θ*) dn dy dz   (3.34)
= E_{Y,Z,N|θ*}[ln(f(Y+N, Z|θ_k) / f(Y, Z|θ_k))].   (3.35)

The assumption of finite differential entropy for Y and Z ensures that

ln[f(y, z|θ)] f(y, z|θ*)   (3.36)

is integrable. Thus the integrand is integrable.
So Fubini's theorem [97] permits the change in the order of integration in (3.35):

c_k ≥ E_N[c_k(N)]  iff  E_{Y,Z,N|θ*}[ln(f(Y+N, Z|θ_k) / f(Y, Z|θ_k))] ≥ 0.   (3.37)

Then an EM noise benefit occurs on average if

E_{Y,Z,N|θ*}[ln(f(Y+N, Z|θ_k) / f(Y, Z|θ_k))] ≥ 0.   (3.38)

The NEM theorem also applies to data models that use the complete data as their latent random variable. The proof for these cases is the same. The NEM positivity condition in these models changes to

E_{X,Y,N|θ*}[ln(f(X+N|θ_k) / f(X|θ_k))] ≥ 0.   (3.39)

The relative-entropy definition of the noise benefit also allows a more general version of the NEM condition in which users replace noise addition y + N with other methods of noise injection φ(y, N) such as multiplicative noise injection y · N. The NEM condition for generalized noise injection is

E_{Y,Z,N|θ*}[ln(f(φ(Y, N), Z|θ_k) / f(Y, Z|θ_k))] ≥ 0.   (3.40)

The proof of this general condition is exactly the same as in the additive-noise case with f(φ(y, N), z|θ_k) replacing f(y+N, z|θ_k).

The NEM Theorem implies that each iteration of a suitably noisy EM algorithm moves closer on average towards the EM estimate θ* than does the corresponding noiseless EM algorithm [229]. This holds because the positivity condition (3.16) implies that E_N[c_k(N)] ≤ c_k at each step k since c_k does not depend on N from (3.13). The NEM algorithm takes larger overall steps on average than the noiseless EM algorithm does for any number k of steps.

The NEM theorem's stepwise noise benefit leads to a noise benefit at any point in the sequence of NEM estimates. This is because the expected value of the noise-benefit inequality (3.7) takes the form

Q(θ_k|θ*) ≤ E[Q_N(θ_k|θ*)]  for any k.   (3.41)

Thus

Q(θ*|θ*) − Q(θ_k|θ*) ≥ Q(θ*|θ*) − E[Q_N(θ_k|θ*)]  for any k.   (3.42)

The EM (NEM) sequence converges when the left (right) side of inequality (3.42) equals zero. Inequality (3.42) implies that the difference on the right side is closer to zero at any step k. NEM sequence convergence is even stronger if the noise N_k decays to zero.
This noise annealing implies N_k → 0 with probability one. Continuity of Q as a function of Y implies that Q_{N_k}(θ|θ_k) → Q(θ|θ_k) as N_k → 0. This holds because Q(θ|θ_k) = E_{Z|y,θ_k}[ln f(y, Z|θ)] and because the continuity of Q implies that

lim_{N→0} Q_N(θ|θ_k) = E_{Z|y,θ_k}[ln f(lim_{N→0}(y+N), Z|θ)]   (3.43)
= E_{Z|y,θ_k}[ln f(y, Z|θ)]   (3.44)
= Q(θ|θ_k).   (3.45)

The evolution of EM algorithms guarantees that lim_k Q(θ_k|θ*) = Q(θ*|θ*). This gives the probability-one limit

lim_k Q_{N_k}(θ_k|θ*) = Q(θ*|θ*).   (3.46)

So for any ε > 0 there exists a k_0 such that for all k > k_0:

|Q(θ_k|θ*) − Q(θ*|θ*)| < ε  and   (3.47)
|Q_{N_k}(θ_k|θ*) − Q(θ*|θ*)| < ε  with probability one.   (3.48)

Inequalities (3.41) and (3.48) imply that Q(θ_k|θ*) is ε-close to its upper limit Q(θ*|θ*) and that

E[Q_{N_k}(θ_k|θ*)] ≥ Q(θ_k|θ*)  and  Q(θ*|θ*) ≥ Q(θ_k|θ*).   (3.49)

So the NEM and EM algorithms converge to the same fixed point by (3.46). And the inequalities (3.49) imply that NEM estimates are closer on average to optimal than EM estimates are at any step k.

3.2.2 NEM for Finite Mixture Models

The first corollary of Theorem 3.1 gives a dominated-density condition that satisfies the positivity condition (3.16) in the NEM Theorem. This strong pointwise condition is a direct extension of the pdf inequality in (3.2).

Corollary 3.1. [Dominance Condition for NEM]: The NEM positivity condition

E_{Y,Z,N|θ}[ln(f(Y+N, Z|θ) / f(Y, Z|θ))] ≥ 0   (3.50)

holds if

f(y+n, z|θ) ≥ f(y, z|θ)   (3.51)

for almost all n, y, and z.

Proof. The following inequalities need hold only for almost all y, z, and n:

f(y+n, z|θ) ≥ f(y, z|θ)   (3.52)
iff  ln[f(y+n, z|θ)] ≥ ln[f(y, z|θ)]   (3.53)
iff  ln[f(y+n, z|θ)] − ln[f(y, z|θ)] ≥ 0   (3.54)
iff  ln[f(y+n, z|θ) / f(y, z|θ)] ≥ 0.   (3.55)

Thus

E_{Y,Z,N|θ}[ln(f(Y+N, Z|θ) / f(Y, Z|θ))] ≥ 0.   (3.56)

Most practical densities do not satisfy the condition (3.51) in Corollary 3.1.² But Corollary 3.1 is still useful for setting conditions on the noise N to produce the effect of the condition (3.51). We use Corollary 3.1 to derive conditions on the noise N that produce NEM noise benefits for mixture models.
NEM mixture models use two special cases of Corollary 3.1. We state these special cases as Corollaries 3.2 and 3.3 below. The corollaries use the finite mixture model notation in §2.5.2. Recall that the joint pdf of Y and Z is

f(y, z|θ) = Σ_j α_j f(y|j, θ) δ[z − j].   (3.57)

Define the population-wise noise likelihood difference as

Δf_j(y, n) = f(y+n|j, θ) − f(y|j, θ).   (3.58)

Corollary 3.1 implies that noise benefits the mixture model estimation if the dominated-density condition holds:

f(y+n, z|θ) ≥ f(y, z|θ).   (3.59)

This occurs if

Δf_j(y, n) ≥ 0  for all j.   (3.60)

² PDFs for which condition (3.51) holds are monotonically increasing or non-decreasing like ramp pdfs. The condition holds only for positive noise samples n even in these cases.

The Gaussian mixture model (GMM) uses normal pdfs for the sub-population pdfs [123], [248]. Corollary 3.2 states a simple quadratic condition that ensures that the noisy sub-population pdf f(y+n|Z=j, θ) dominates the noiseless sub-population pdf f(y|Z=j, θ) for GMMs. The additive noise samples n depend on the data samples y.

Corollary 3.2. [NEM Condition for GMMs]: Suppose Y|_{Z=j} ~ N(μ_j, σ_j²) and thus f(y|j, θ) is a normal pdf. Then

Δf_j(y, n) ≥ 0   (3.61)

holds if

n² ≤ 2n(μ_j − y).   (3.62)

Proof. The proof compares the noisy and noiseless normal pdfs. The normal pdf is

f(y|θ) = (1 / (σ_j √(2π))) exp[−(y − μ_j)² / (2σ_j²)].   (3.63)

So f(y+n|θ) ≥ f(y|θ)

iff  exp[−(y+n−μ_j)² / (2σ_j²)] ≥ exp[−(y−μ_j)² / (2σ_j²)]   (3.64)
iff  ((y+n−μ_j) / σ_j)² ≤ ((y−μ_j) / σ_j)²   (3.65)
iff  (y−μ_j+n)² ≤ (y−μ_j)².   (3.66)

Inequality (3.66) holds because σ_j is strictly positive. Expand the left-hand side to get (3.62):

(y−μ_j)² + n² + 2n(y−μ_j) ≤ (y−μ_j)²   (3.67)
iff  n² + 2n(y−μ_j) ≤ 0   (3.68)
iff  n² ≤ −2n(y−μ_j)   (3.69)
iff  n² ≤ 2n(μ_j − y).   (3.70)

This proves (3.62).

Now apply the quadratic condition (3.62) to (3.60). Then

f(y+n, z|θ) ≥ f(y, z|θ)   (3.71)

holds when

n² ≤ 2n(μ_j − y)  for all j.   (3.72)

The inequality (3.72) gives a condition under which NEM estimates standard deviations σ_j faster than EM.
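The quadratic screen (3.72) is a pointwise test that code can apply directly to each candidate noise sample. A minimal sketch, assuming a 1-D GMM (the function name and the sample values are illustrative, not from the dissertation):

```python
def gmm_nem_ok(n, y, mus):
    """True if the noise sample n satisfies the GMM-NEM quadratic
    condition n^2 <= 2*n*(mu_j - y) for every sub-population mean mu_j."""
    return all(n * n <= 2.0 * n * (mu - y) for mu in mus)

# A sample to the right of both means (y = 5, means -2 and 2) admits only
# negative noise that pulls y back toward the means.
print(gmm_nem_ok(-3.0, 5.0, [-2.0, 2.0]))  # True
print(gmm_nem_ok(1.0, 5.0, [-2.0, 2.0]))   # False
```

Note that n = 0 always passes the test, so the noiseless EM step is always admissible.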
Such faster variance estimation can benefit Expectation-Conditional-Maximization (ECM) [204] methods.

Corollary 3.3 gives a similar quadratic condition for the Cauchy mixture model. The noise samples also depend on the data samples.

Corollary 3.3. [NEM Condition for CMMs]: Suppose Y|_{Z=j} ~ C(m_j, d_j) and thus f(y|j, θ) is a Cauchy pdf. Then

Δf_j(y, n) ≥ 0   (3.73)

holds if

n² ≤ 2n(m_j − y).   (3.74)

Proof. The proof compares the noisy and noiseless Cauchy pdfs. The Cauchy pdf is

f(y|θ) = 1 / (π d_j [1 + ((y − m_j) / d_j)²]).   (3.75)

Then f(y+n|θ) ≥ f(y|θ)

iff  1 / (π d_j [1 + ((y+n−m_j) / d_j)²]) ≥ 1 / (π d_j [1 + ((y−m_j) / d_j)²])   (3.76)
iff  1 + ((y−m_j) / d_j)² ≥ 1 + ((y+n−m_j) / d_j)²   (3.77)
iff  ((y−m_j) / d_j)² ≥ ((y+n−m_j) / d_j)².   (3.78)

Proceed as in the last part of the Gaussian case:

((y−m_j) / d_j)² ≥ ((y−m_j+n) / d_j)²   (3.79)
iff  (y−m_j)² ≥ (y−m_j+n)²   (3.80)
iff  (y−m_j)² ≥ (y−m_j)² + n² + 2n(y−m_j)   (3.81)
iff  0 ≥ n² + 2n(y−m_j)   (3.82)
iff  n² ≤ 2n(m_j − y).   (3.83)

This proves (3.74).

The conditions in Corollaries 3.2 and 3.3 simplify to n ≤ 2(μ_j − y) when N > 0 (with m_j in place of μ_j for the CMM). Again apply the quadratic condition (3.74) to (3.60). Then

f(y+n, z|θ) ≥ f(y, z|θ)   (3.84)

holds when

n² ≤ 2n(m_j − y)  for all j.   (3.85)

Figure 3.2 shows a simulation instance of noise benefits for EM estimation on a GMM. Figure 3.3 also shows that EM estimation on a CMM exhibits similar pronounced noise benefits. The GMM simulation estimates the sub-population standard deviations σ_1 and σ_2 from 200 samples of a Gaussian mixture of two 1-D sub-populations with known means μ_1 = −2 and μ_2 = 2 and mixing proportions α_1 = 0.5 and α_2 = 0.5. The true standard deviations are σ_1* = 2 and σ_2* = 2. Each EM and NEM procedure starts at the same initial point³ with σ_1(0) = 4.5 and σ_2(0) = 5. The simulation runs NEM on 100 GMM data sets for each noise level σ_N and counts the number of iterations before convergence for each instance. The average of these iteration counts is the average convergence time at that noise level σ_N.
The CMM simulation uses a similar setup for estimating the sub-population dispersions (d_1, d_2) with its own set of true values (α*, m_1, d_1, m_2, d_2) (see the inset table in Figure 3.3). The noise in the CMM-NEM simulation satisfies the condition in (3.85). Simulations also confirm that non-Gaussian noise distributions produce similar noise benefits.

The mixture model NEM conditions predict noise benefits for the estimation of specific distribution parameters: variance and dispersion parameters. The noise benefit also applies to the full EM estimation of all distribution parameters since the EM update equations decouple for the different parameters. So we can apply the NEM condition to just the update equations for the variance/dispersion parameters and use the regular EM update equations for all other parameters. The NEM condition still leads to a noise benefit in this more general estimation procedure. Figure 3.4 shows a simulation instance of this procedure. The simulation for this figure estimates all distribution parameters for a GMM in 2 dimensions.

3.2.3 The Geometry of the GMM- & CMM-NEM Condition

Both quadratic NEM inequalities in (3.72) and (3.85) reduce to

n [n − 2(μ_j − y)] ≤ 0  ∀j.   (3.86)

So the noise n must fall in the interval where the parabola n² − 2n(μ_j − y) is negative for all j. We call the set of such admissible noise samples the NEM set. This set forms the support of the NEM noise pdf. There are two possible solution sets for (3.86) depending on the values of μ_j and y. These solution sets are

N_j⁻(y) = [2(μ_j − y), 0]   (3.87)
N_j⁺(y) = [0, 2(μ_j − y)].   (3.88)

³ The use of fixed initial points and bootstrap confidence intervals for the figure of merit is in keeping with Shilane's [269] framework for the statistical comparison of evolutionary or iterative computation algorithms with random trajectories.

Figure 3.3: EM noise benefit for a Cauchy mixture model. This plot uses the noise-annealing variant of the NEM algorithm.
Low-intensity starting noise decreases convergence time while higher-intensity starting noise increases it. The optimal initial noise level has standard deviation σ_N* = 0.24 which gave a speed improvement of about 11%. This NEM procedure adds independent noise with a cooling schedule. The noise cools at an inverse-square rate. The data model is the Cauchy mixture model. The mixture Cauchy distribution is a convex combination of two Cauchy sub-populations C_1 and C_2. The simulation uses 200 samples of the mixture Cauchy distribution to estimate the dispersions of the two sub-populations. This NEM implementation adds noise to the data only when the noise satisfies the condition in (3.85). The additive noise is zero-mean normal with standard deviation σ_N. Each sampled point on the curve is the mean of 100 trials. The vertical bars are the 95% bootstrap confidence intervals for the convergence-time mean at each noise level.

The goal is to find the set N(y) of n values that satisfy the inequality in (3.72) for all j:

N(y) = ∩_j N_j(y)   (3.89)

where N_j(y) = N_j⁺(y) or N_j(y) = N_j⁻(y). Thus N(y) ≠ {0} holds only when the sample y lies on one side of all sub-population means (or location parameters) μ_j. This holds for

μ_j > y for all j  or  μ_j < y for all j.   (3.90)

Figure 3.4: EM noise benefit for a 2-D Gaussian mixture model. The plot shows that noise injection can improve EM convergence time. The simulation is similar to Figure 3.2 except that it estimates all distribution parameters (α, μ, σ) instead of just σ. And the GMM is for 2-D samples instead of 1-D samples. The optimal initial noise level has standard deviation σ_N* = 1.6. The average optimal NEM speed-up over the noiseless EM algorithm is 14%. The simulation uses 225 samples of the mixture normal distribution to estimate all distribution parameters for a 2-D 2-cluster GMM. The additive noise uses samples of zero-mean normal noise with standard deviation σ_N screened through the GMM-NEM condition in (3.72).
The NEM noise N takes values in ∩_j N_j⁻ if the data sample y falls to the right of all sub-population means (y > μ_j for all j). The NEM noise N takes values in ∩_j N_j⁺ if the data sample y falls to the left of all sub-population means (y < μ_j for all j). And N = 0 is the only valid value for N when y falls between sub-population means. Thus the noise N tends to pull the data sample y away from the tails and towards the cluster of sub-population means (or locations).

The GMM-NEM condition extends component-wise to d-dimensional GMM models with diagonal covariance matrices. Suppose y = {y_i}_{i=1}^d is one such d-D GMM sample. Then the vector noise sample n satisfies the NEM condition if

n_i² ≤ 2 n_i (μ_{j,i} − y_i)  ∀j, ∀i   (3.91)

where j denotes sub-populations and i denotes dimensions. The condition (3.91) holds because

∏_{i=1}^d f(y_i + n_i | j, θ) ≥ ∏_{i=1}^d f(y_i | j, θ)   (3.92)

when f(y_i + n_i | j, θ) ≥ f(y_i | j, θ) for each dimension i. So d-D GMM-NEM for vector sub-populations with diagonal covariance matrices is equivalent to running d separate GMM-NEM estimations.

The multidimensional quadratic positivity condition (3.91) has a simple geometric description. Suppose there exist noise samples n that satisfy the quadratic NEM condition. Then the valid noise samples n = {n_i}_{i=1}^d for the data y must fall in the interior or on the boundary of the minimal hyper-rectangle:

n ∈ ∏_{i=1}^d {n_i | n_i² ≤ 2 n_i μ_{min,i}}.   (3.93)

μ_min is the centroid of the NEM set. The differences {(μ_{j,i} − y_i)} determine the location of μ_min:

μ_min = (μ_{min,1}, …, μ_{min,d})   (3.94)

where μ_{min,i} is the signed difference (μ_{j,i} − y_i) of smallest magnitude over j = 1, …, K   (3.95)

for a GMM with K sub-populations. The differences {2(μ_{j,i} − y_i)} determine the bounds of the set (see Figure 3.5) just as in the 1-D case (3.87)-(3.89). This multidimensional NEM set is the product of 1-D GMM-NEM sets N_j(y_i) from each dimension i. It is a hyper-rectangle because the 1-D GMM-NEM sets are either intervals or the singleton set {0}. The NEM set always has one vertex anchored at the origin⁴ since the origin is always in the NEM set (3.93).
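The 1-D solution sets (3.87)-(3.88) and the product form (3.93) translate directly into code. A minimal sketch (the function names are illustrative assumptions) that builds the per-dimension NEM intervals and their hyper-rectangle product:

```python
def nem_interval(y, mus):
    """1-D GMM-NEM set N(y): the intersection of the per-component
    solution intervals N_j(y) from (3.87)-(3.88)."""
    if all(mu > y for mu in mus):                  # y left of all means
        return (0.0, 2.0 * min(mu - y for mu in mus))
    if all(mu < y for mu in mus):                  # y right of all means
        return (2.0 * max(mu - y for mu in mus), 0.0)
    return (0.0, 0.0)                              # y between means: only n = 0

def nem_box(y_vec, mus_vec):
    """Component-wise NEM hyper-rectangle (3.93) for a diagonal-covariance
    d-D GMM: one 1-D NEM interval per dimension."""
    return [nem_interval(y_i, [mu[i] for mu in mus_vec])
            for i, y_i in enumerate(y_vec)]
```

For the Figure 3.5 example, y = (3, 5) with means (1, 1) and (1, −1), this gives the rectangle [−4, 0] × [−8, 0], whose diagonal vertex is 2μ_min = (−4, −8).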
Figure 3.5 illustrates the geometry of a sample 2-D GMM-NEM set for a 2-D GMM sample y. GMM-NEM sets are lines for 1-D samples, rectangles for 2-D samples, cuboids for 3-D samples, and hyper-rectangles for (d > 3)-D samples. The geometry of the NEM set implies that NEM noise tends to pull the sample y towards the centroid of the set of sub-population means {μ_1, …, μ_K}. This centralizing action suggests that NEM converges faster for GMMs and CMMs because the injected noise makes the EM estimation more robust to outlying samples.

⁴ Thus the EM algorithm is always a special case of the NEM algorithm. This also means that contractions of convex NEM sets still satisfy the NEM condition. This property is important for implementing noise cooling in the NEM algorithm.

Figure 3.5: An illustration of the geometry of the GMM-NEM set in 2 dimensions. The 2-D NEM set is the green rectangle with centroid μ_min = (−2, −4) and a vertex at the origin. The sample y = (3, 5) comes from a 2-cluster GMM. The GMM sub-population means are μ_1 = (1, 1) and μ_2 = (1, −1). The differences (μ_{j,i} − y_i) determine the location of the centroid μ_min and 2μ_min is the diagonal vertex of the NEM set. The 1-D NEM intervals in the n_1 (intersection of the blue sets) and n_2 (intersection of the magenta sets) dimensions specify the side-lengths of the 2-D NEM set. Vectors in the NEM set draw the sample y towards the midpoint of μ_1 and μ_2.

3.2.4 NEM for Mixtures of Jointly Gaussian Populations

The d-dimensional NEM condition in equation (3.93) is valid when the sub-population covariance matrices are diagonal. More general jointly Gaussian (JG) sub-populations have correlations between dimensions and thus non-zero off-diagonal terms in their covariance matrices. NEM for such jointly Gaussian mixture model (JG-MM) sub-populations requires a more general sufficient condition. Corollary 3.4 states this condition.
The corollary uses the following notation for the quadratic forms in the JG pdf:

⟨u, v⟩_j = uᵀ Σ_j⁻¹ v   (3.96)
‖u‖²_j = uᵀ Σ_j⁻¹ u.   (3.97)

These are the inner product and the norm based on the inverse of the non-degenerate symmetric positive-definite covariance matrix Σ_j.

Corollary 3.4. [NEM Condition for JG-MMs]: Suppose Y|_{Z=j} ~ N(μ_j, Σ_j) for d-dimensional random vectors Y and thus f(y|j, θ) is a jointly Gaussian pdf. Then the NEM sufficient condition

E_{Y,Z,N|θ}[ln(f(Y+N, Z|θ) / f(Y, Z|θ))] ≥ 0   (3.98)

holds if

‖n‖²_j + 2⟨(y − μ_j), n⟩_j ≤ 0  ∀j.   (3.99)

Proof. By Corollary 3.1 the NEM condition

E_{Y,Z,N|θ}[ln(f(Y+N, Z|θ) / f(Y, Z|θ))] ≥ 0   (3.100)

holds if

f(y+n, z|θ) − f(y, z|θ) ≥ 0   (3.101)

for almost all n, y, z. This dominance condition is equivalent to

ln f(y+n, z|θ) − ln f(y, z|θ) ≥ 0   (3.102)

over the support of the pdf. The complete JG-MM log-likelihood ln f(y, z|θ) is

ln f(y, z|θ) = Σ_j δ[z − j] ln[α_j f(y|j, θ_j)]   (3.103)

using the exponential form of the finite mixture model pdf in (2.100). So ln f(y+n, z|θ) − ln f(y, z|θ) ≥ 0

iff  Σ_j δ[z − j] [ln(α_j f(y+n|j, θ_j)) − ln(α_j f(y|j, θ_j))] ≥ 0   (3.104)
iff  Σ_j δ[z − j] [ln f(y+n|j, θ_j) − ln f(y|j, θ_j)] ≥ 0.   (3.105)

The JG sub-population pdf and log-likelihood are

f(y|j, θ_j) = (1 / √((2π)^d |Σ_j|)) exp[−0.5 (y − μ_j)ᵀ Σ_j⁻¹ (y − μ_j)]   (3.106)
ln f(y|j, θ) = −0.5 d ln(2π) − 0.5 ln|Σ_j| − 0.5 (y − μ_j)ᵀ Σ_j⁻¹ (y − μ_j)   (3.107)

where |Σ_j| is the determinant of the covariance matrix Σ_j. Define w_j as

w_j = y − μ_j.   (3.108)

This simplifies the JG sub-population log-likelihood to

ln f(y|j, θ) = −0.5 (d ln(2π) + ln|Σ_j|) − 0.5 ‖w_j‖²_j.   (3.109)

Thus

Σ_j δ[z − j] [−0.5 (d ln(2π) + ln|Σ_j|) − 0.5 ‖w_j + n‖²_j + 0.5 (d ln(2π) + ln|Σ_j|) + 0.5 ‖w_j‖²_j] ≥ 0   (3.110)

iff  Σ_j δ[z − j] [−‖w_j + n‖²_j + ‖w_j‖²_j] ≥ 0   (3.111)
iff  Σ_j δ[z − j] [−‖n‖²_j − 2⟨w_j, n⟩_j] ≥ 0.   (3.112)

This gives the d-D JG-MM NEM condition

Σ_j δ[z − j] [‖n‖²_j + 2⟨w_j, n⟩_j] ≤ 0.   (3.113)

A sub-case of (3.113) satisfies the condition by ensuring that each summand is negative: ‖n‖²_j + 2⟨w_j, n⟩_j ≤ 0.
This sub-case eliminates the need for the summation in (3.113) and gives the more conservative JG-MM NEM condition:

‖n‖²_j + 2⟨(y − μ_j), n⟩_j ≤ 0  ∀j   (3.114)

since w_j = (y − μ_j). This is a direct analogue and generalization of the 1-D condition with one important difference: all estimated parameters (μ_j, Σ_j) occur in the condition. So the condition is not as useful as the specialized GMM-NEM condition for diagonal covariance matrices. Diagonal covariance matrices allow us to eliminate variances from the NEM condition and thus get exact NEM noise-benefit conditions for variance estimation. The condition in (3.99) provides a basis for approximate NEM noise-benefit conditions at best.

Figure 3.6 shows a sample NEM set for a 2-dimensional jointly Gaussian mixture with two clusters. The NEM sets for JG-MMs are intersections of ellipsoids. The component-wise d-D NEM condition (3.93) is a sub-case of (3.99) when the covariance matrices are diagonal.⁵ Thus the NEM set for (3.93) (the green box) is a subset of the more general JG-MM NEM set (the blue overlapping region) for the same sample y and data model. The general theme for JG and Gaussian mixture models is that the roots of second-order polynomials determine the boundaries of NEM sets. Intersections of ellipsoids do not factor into component dimensions the way products of intervals do. So noise sampling from the JG-MM NEM set requires the use of complicated joint distributions. This emphasizes the need for identifying simpler

⁵ The inner products reduce to sums of 1-D quadratic terms which we bound separately to get the component-wise NEM condition in (3.93).

Figure 3.6: An illustration of a sample NEM set for a mixture of two 2-D jointly Gaussian populations. NEM noise samples must come from the overlapping region in blue. This region is the intersection of two ellipses. This is the JG-MM NEM set for the same sample and data model in Figure 3.5.
The JG-MM NEM set is a superset of the product GMM-NEM set in Figure 3.5 (the green box here) that comes from manipulating each dimension separately.

interior NEM sets (like the green box in Figure 3.6) for noise sampling. A study of the geometry of NEM sets may lead to more intelligent noise sampling schemes for NEM. Ease of noise sampling becomes increasingly important as the sample dimension increases.

The NEM theorem produces better average performance than the EM algorithm for any EM estimation problem. This may seem to violate Wolpert and Macready's "No Free Lunch" (NFL) theorems [302] which assert that this sort of superior algorithmic performance is not possible over the whole space of all optimization problems. The apparent violation is an illusion. The NFL theorems apply to algorithms with static objective functions. The objective functions in NEM algorithms are dynamic: noise injection perturbs the likelihood/objective function at each iteration. And the NEM theorem specifically steers the likelihood-function evolution towards favorable solutions. But the use of evolving objective functions comes at a price: intelligent objective-function evolution raises the algorithm's computational complexity.

The noise sampling step accounts for the increased computational complexity in the NEM algorithm. Noise sampling complexity can sometimes lead to higher raw computation time even when the NEM algorithm converges in fewer iterations than the EM algorithm. This problem highlights the need for efficient noise sampling routines. Simple NEM conditions (e.g. GMM-NEM and CMM-NEM) can keep the noise sampling cost down. But the ideal case would inject unscreened noise and still produce noise benefits. §3.4 below discusses this ideal case.

3.2.5 NEM for Models with Log-Convex Densities

EM algorithms can satisfy the positivity condition (3.16) if they use the proper noise N. They can also satisfy the condition if the data model has an amenable complete data pdf f(x|θ).
Inequalities (3.72) and (3.85) can sculpt the noise N to satisfy (3.16) for Gaussian and Cauchy mixture models. The next corollary shows how the complete data pdf can induce a noise benefit. The corollary states that a log-convex complete pdf satisfies (3.16) when the noise is zero-mean. The corollary applies to data models with more general complete random variables X. These include models whose complete random variables X do not decompose into the direct product X = (Y, Z). Examples include censored models that use the unobserved complete random variable as the latent random variable Z = X [52], [112], [279].

Corollary 3.5. [NEM Condition for Log-Convex pdfs]: Suppose that f(x|θ) is log-convex in x and N is independent of X. Suppose also that E_N[N] = 0. Then

E_{X,N|θ*}[ln(f(X+N|θ_k) / f(X|θ_k))] ≥ 0.   (3.115)

Proof. f(y, z|θ) is log-convex in y and E_N[y+N] = y. So

E_N[ln f(y+N, z|θ_k)] ≥ ln f(E_N[y+N], z|θ_k).   (3.116)

The right-hand side becomes

ln f(E_N[y+N], z|θ_k) = ln f(y + E_N[N], z|θ_k)   (3.117)
= ln f(y, z|θ_k)   (3.118)

because E[N] = 0. So

E_N[ln f(y+N, z|θ_k)] ≥ ln f(y, z|θ_k)   (3.119)
iff  E_N[ln f(y+N, z|θ_k)] − ln f(y, z|θ_k) ≥ 0   (3.120)
iff  E_N[ln f(y+N, z|θ_k) − ln f(y, z|θ_k)] ≥ 0   (3.121)
iff  E_{Y,Z|θ*}[E_N[ln f(Y+N, Z|θ_k) − ln f(Y, Z|θ_k)]] ≥ 0   (3.122)
iff  E_{Y,Z,N|θ*}[ln(f(Y+N, Z|θ_k) / f(Y, Z|θ_k))] ≥ 0.   (3.123)

Inequality (3.123) follows because N is independent of θ.

The right-censored gamma data model gives a log-convex data model when the α-parameter of its complete pdf lies in the interval (0, 1). This holds because the gamma pdf is log-convex when 0 < α < 1. Log-convex densities often model data with decreasing hazard rates in survival analysis applications [9], [64], [244]. §2.5.1 describes the gamma data model and EM algorithm.

Figure 3.7 shows a simulation instance of noise benefits for a log-convex model. The simulation estimates the θ parameter from right-censored samples of a γ(0.65, 4) pdf. Samples are censored to values below a threshold of 4.72. The average optimal NEM speed-up over the noiseless EM algorithm is about 13.3%. A modification of Corollary 3.5 predicts a similar noise benefit if we replace the zero-mean additive noise with unit-mean multiplicative noise. The noise is also independent of the data.
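Corollary 3.5's Jensen-inequality mechanism is easy to check numerically for the gamma model. The sketch below is illustrative (the function names and the two-point ±ε noise model are assumptions, not the dissertation's simulation): it shows that zero-mean noise cannot lower the average complete log-likelihood when α < 1 makes the gamma pdf log-convex, while α > 1 (log-concave) reverses the sign.

```python
import math

def log_gamma_pdf(x, alpha, beta):
    """Log of the gamma(alpha, beta) pdf: (alpha-1)*ln x - x/beta
    minus the log normalizer."""
    return ((alpha - 1.0) * math.log(x) - x / beta
            - alpha * math.log(beta) - math.lgamma(alpha))

def jensen_gain(y, eps, alpha, beta):
    """Average log-likelihood under symmetric zero-mean noise {+eps, -eps}
    minus the noiseless log-likelihood. Log-convexity (alpha < 1) makes
    this nonnegative by Jensen's inequality, as in Corollary 3.5."""
    avg = 0.5 * (log_gamma_pdf(y + eps, alpha, beta)
                 + log_gamma_pdf(y - eps, alpha, beta))
    return avg - log_gamma_pdf(y, alpha, beta)
```

With the log-convex case α = 0.65 from the simulation, `jensen_gain(2.0, 0.5, 0.65, 4.0)` is positive; with a log-concave α = 3 it is negative.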
The average optimal NEM speed-up over the noiseless EM algorithm is about 13:3%. A modication of Corollary 3.5 predicts a similar noise benet if we replace the zero-mean additive noise with unit-mean multiplicative noise. The noise is also independent of the data. 69 Figure 3.7: EM noise benet for log-convex censored gamma model. This plot uses the annealed-noise NEM algorithm. The average optimal NEM speed-up over the noiseless EM algorithm is about 13:3%. Low intensity initial noise decreases convergence time while higher intensity starting noise increases it. This NEM procedure adds cooled i.i.d. normal noise that is independent of the data. The noise cools at an inverse-square rate. The log-convex gamma distribution is a (;) distribution with < 1. The censored gamma EM estimates the parameter. The model uses 375 censored gamma samples. Each sampled point on the curve is the mean of 100 trials. The vertical bars are 95% bootstrap condence intervals for the mean convergence time at each noise level. 3.3 The Noisy Expectation-Maximization Algorithm The NEM Theorem and its corollaries give a general method for modifying the noiseless EM algorithm. The NEM Theorem also implies that on average these NEM variants outperform the noiseless EM algorithm. Algorithm 3.1 below gives the Noisy Expectation-Maximization algorithm schema. The operation NEMNoiseSample(y) generates noise samples that satisfy the NEM condition for the current data model. The noise sampling distribution depends on the data vector y in the Gaussian and 70 Cauchy mixture models. 
Algorithm 3.1: The NEM Algorithm Input : y = (y 1 ;:::;y M ) : vector of observed incomplete data Output : ^ NEM : NEM estimate of parameter 1 while (k k k1 k 10 tol ) do 2 N S -Step: n k NEMNoiseSample(y) 3 N A -Step: y y y + n 4 E-Step: Q (j k ) E Zjy; k [lnf(y y ; Zj)] 5 M-Step: k+1 argmax fQ (j k )g 6 k k + 1 7 ^ NEM k The Maximum a Posteriori (MAP) variant of the NEM algorithm applies MAP-EM modication to the NEM algorithm: it replaces the EM Q-function with the NEM Q N -function: Q N (j k ) =E Zjy; k [lnf(y +N;Zj)] +P (): (3.124) Algorithm 3.2: Modied E-Step: for Noisy MAP-EM 1 E-Step: Q N (j k ) E Zjy; k [lnf(y y ; Zj)] +P () ; The E-Step in both cases takes the conditional expectation of a function of the noisy data y y given the noiseless data y. A deterministic decay factork scales the noise on thek th iteration. is the noise decay rate. The decay factor k reduces the noise at each new iteration. This factor drives the noise N k to zero as the iteration step k increases. The simulations in this paper use = 2 for demonstration. Values between = 1 and = 3 also work. N k still needs to satisfy the NEM condition for the data model. The cooling factor k must not cause the noise samples to violate the NEM condition. This usually means that 0<k 1 and that the NEM condition solution set is closed with respect to contractions. The decay factor reduces the NEM estimator's jitter around its nal value. This is important because the EM algorithm converges to xed-points. So excessive estimator jitter prolongs convergence time even when the jitter occurs near the nal solution. The simulations in this paper use polynomial decay factors instead of logarithmic 71 cooling schedules found in annealing applications [48], [103], [113], [159], [167]. The NEM algorithm inherits some variants from the classical EM algorithm schema. A NEM adaptation to the Generalized Expectation Maximization (GEM) algorithm is one of the simpler variations. 
The GEM algorithm replaces the EM maximization step with a gradient ascent step. The Noisy Generalized Expectation Maximization (NGEM) algorithm uses the same M-step:

Algorithm 3.3: Modified M-Step for NGEM
1  M-Step: θ_{k+1} ← θ̃ such that Q(θ̃|θ_k) ≥ Q(θ_k|θ_k)

The NEM algorithm schema also allows for some variations outside the scope of the EM algorithm. These involve modifications to the noise sampling step N_S-Step or to the noise addition step N_A-Step. One such modification does not require an additive noise term n_i for each y_i. This is useful when the NEM condition is stringent because then noise sampling can be time-intensive. This variant changes the N_S-Step by picking a random or deterministic sub-selection of y to modify. Then it samples the noise subject to the NEM condition for those sub-selected samples. This is the Partial Noise Addition NEM (PNA-NEM).

Algorithm 3.4: Modified N_S-Step for PNA-NEM
1  I ← {1, …, M}
2  J ← SubSelection(I)
3  for j ∈ J do
4    n_j ← k^{−τ} · NEMNoiseSample(y_j)

The NEM algorithm and its variants need a NEM noise generating procedure NEMNoiseSample(y). The procedure returns a NEM-compliant noise sample n_j at the desired noise level σ_N for the current data sample. This procedure will change with the EM data model. The noise generating procedures for the GMM and CMM models derive from Corollaries 3.2 and 3.3. The 1-D noise-generating procedure for the GMM simulations is as follows:

Algorithm 3.5: NEMNoiseSample for GMM- and CMM-NEM
Input: y and σ_N: current data sample and noise level
Output: n: noise sample satisfying the NEM condition
1  N(y) ← ∩_j N_j(y)
2  n ← a sample from the distribution TN(0, σ_N | N(y))

where TN(0, σ_N | N(y)) is the normal distribution N(0, σ_N) truncated to the support set N(y). The set N(y) is the interval intersection from (3.89). Multi-dimensional versions of the generator can apply the procedure component-wise.
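The truncated-normal draw in Algorithm 3.5 can be implemented by rejection sampling, since the 1-D NEM set is an interval that always contains 0. A minimal sketch, assuming a 1-D GMM (the function name and the rejection scheme are illustrative assumptions):

```python
import random

def nem_noise_sample(y, sigma_n, mus, max_tries=1000):
    """Rejection-sampling sketch of NEMNoiseSample for a 1-D GMM: draw
    N(0, sigma_n) noise until it lands in the NEM set
    N(y) = {n : n^2 <= 2*n*(mu_j - y) for all j}."""
    def in_nem_set(n):
        return all(n * n <= 2.0 * n * (mu - y) for mu in mus)
    for _ in range(max_tries):
        n = random.gauss(0.0, sigma_n)
        if in_nem_set(n):
            return n
    return 0.0  # n = 0 always satisfies the NEM condition

```

For y = 5 and means {−2, 2} every returned sample lies in the interval [−6, 0], which matches the interval intersection from (3.89).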
3.3.1 NEM via Deterministic Interference

The original formulation of the NEM theorem and the NEM conditions addresses the use of random perturbations in the observed data. But the NEM conditions do not preclude the use of deterministic perturbations in the data. Such nonrandom perturbation or deterministic interference falls under the ambit of the NEM theorem since any deterministic constant is also a random variable with a degenerate distribution. This interpretation of the NEM theorem leads to another notable variant of the NEM algorithm (Algorithm 3.6): the deterministic interference EM (DIEM) algorithm. This algorithm adds deterministic sample-dependent perturbations to the data using only perturbations that satisfy the NEM condition for the data model. DIEM also applies a cooling schedule like NEM.

Algorithm 3.6: The Deterministic Interference EM Algorithm
Input: $\mathbf{y} = (y_1, \ldots, y_M)$ : vector of observed incomplete data
Output: $\hat{\theta}_{DIEM}$ : DIEM estimate of parameter $\theta$
1 while ($\|\theta_k - \theta_{k-1}\| \geq 10^{-tol}$) do
2   $N_S$-Step: $\mathbf{n}_k \leftarrow$ SamplePerturbation($\mathbf{y}$)
3     where SamplePerturbation($y$) $\in \bigcap_j N_j(y)$
4   $N_A$-Step: $\mathbf{y}_\dagger \leftarrow \mathbf{y} + \mathbf{n}_k$
5   E-Step: $Q(\theta|\theta_k) \leftarrow \mathbb{E}_{Z|\mathbf{y},\theta_k}[\ln f(\mathbf{y}_\dagger, Z|\theta)]$
6   M-Step: $\theta_{k+1} \leftarrow \arg\max_\theta \{Q(\theta|\theta_k)\}$
7   $k \leftarrow k+1$
8 $\hat{\theta}_{DIEM} \leftarrow \theta_k$

The only difference from the NEM algorithm is in the $N_S$-Step, the "noise" sampling step.

Algorithm 3.7: SamplePerturbation for DIEM
Input: $y$ and $s_N$ : current data sample and perturbation scale ($s_N \in [0, 1]$)
Output: $n$ : sample perturbation in the NEM set for $y$
1 $N(y) \leftarrow \bigcap_j N_j(y)$
2 $n \leftarrow s_N \cdot$ Centroid$[N(y)]$

Figure 3.8 below shows how the DIEM algorithm performs on the same EM estimation problem as in Figure 3.2. The confidence bars here measure the variability caused by averaging convergence times across multiple instances of the data model. This simulation added deterministic perturbations starting at a value located somewhere on the line between the origin and the center of the NEM set for the current data sample. The initial perturbation factor $s_N$ controls how far from the origin the perturbations start.
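A sketch of SamplePerturbation (Algorithm 3.7) for the 1-D GMM case: it takes $s_N$ times the centroid (midpoint) of the NEM interval. The interval form of $N(y)$ is assumed from the quadratic GMM-NEM condition; the function name and signature are illustrative.

```python
def diem_perturbation(y, mus, s_n):
    """DIEM sample perturbation: s_n * Centroid[N(y)] for a 1-D GMM."""
    # NEM interval: intersection of I_j = [min(0, 2(mu_j - y)), max(0, 2(mu_j - y))].
    lo = max(min(0.0, 2.0 * (m - y)) for m in mus)
    hi = min(max(0.0, 2.0 * (m - y)) for m in mus)
    # Mixed-sign case gives lo = hi = 0, so the midpoint formula still applies.
    return s_n * 0.5 * (lo + hi)
```

With means 3 and 5 and a sample $y = 0$ the NEM interval is $[0, 6]$, so $s_N = 0.5$ gives a perturbation of 1.5; a sample between the means gets a zero perturbation.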
These perturbations then decay to zero at an inverse-squared cooling rate as the iteration count increases. A chaotic dynamical system [3], [233] can also provide the deterministic interference for the DIEM algorithm. This is the Chaotic EM (CEM) algorithm. Figure 3.9 shows an example of a GMM-CEM using deterministic interference from the chaotic logistic map

$z_{t+1} = 4 z_t (1 - z_t) \quad (3.125)$

with initial value $z_0 = 0.123456789$ [146], [208]. This logistic dynamical system starting at $z_0$ produces seemingly random samples from the unit interval $[0, 1]$. The GMM-CEM uses the scale parameter $A_N$ to fit the logistic map samples to the NEM sets for each data sample. The effective injected noise has the form

$n_t = A_N z_t. \quad (3.126)$

$A_N$ cools at an inverse-square rate.

3.4 Sample Size Effects in the NEM Algorithm

The noise-benefit effect depends on the size of the GMM data set. Analysis of this effect depends on the probabilistic event that the noise satisfies the GMM-NEM condition for the entire sample set. This analysis also applies to the Cauchy mixture model because its NEM condition is the same as the GMM's.

Figure 3.8: Plot showing the effects of GMM-NEM when using deterministic interference on the data samples instead of random noise. The data model and experimental setup are the same as in Figure 3.2. This DIEM algorithm perturbs the data samples with deterministic values in valid NEM sets. The scale of the perturbations decreases with iteration count just as in the NEM case. This GMM-DIEM algorithm converges faster than the regular EM algorithm on average. The improvement is roughly equivalent to the improvement in the random-noise GMM-NEM: about a 28% improvement over the baseline EM convergence time.
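The chaotic logistic-map interference of (3.125) and (3.126) can be sketched as follows. This is a toy illustration: the inverse-square factor plays the role of the cooling on $A_N$, and the step that scales each sample into its NEM set is omitted.

```python
def chaotic_interference(num_steps, z0=0.123456789, tau=2.0):
    """Logistic-map perturbation stream: z_{t+1} = 4 z_t (1 - z_t), cooled by t^{-tau}."""
    z, out = z0, []
    for t in range(1, num_steps + 1):
        out.append(z * t ** (-tau))   # cooled interference value n_t
        z = 4.0 * z * (1.0 - z)       # chaotic logistic-map update
    return out
```

The map keeps every iterate in $[0, 1]$, so each cooled value also lies in $[0, 1]$ and shrinks with the iteration count.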
Define $A_k$ as the event that the noise $N$ satisfies the GMM-NEM condition for the $k$th data sample:

$A_k = \{N^2 \leq 2N(\mu_j - y_k) \;|\; \forall j\}. \quad (3.127)$

Then define the event $A_M$ that the noise random variable $N$ satisfies the GMM-NEM condition for each data sample as

$A_M = \bigcap_k^M A_k \quad (3.128)$
$\;\;\; = \{N^2 \leq 2N(\mu_j - y_k) \;|\; \forall j \text{ and } \forall k\}. \quad (3.129)$

This construction is useful for analyzing NEM when we use independent and identically distributed (i.i.d.) noise $N_k \stackrel{d}{=} N$ for all $y_k$ while still enforcing the NEM condition.

Figure 3.9: Plot showing the effects of GMM-CEM using chaotic deterministic interference on the data samples instead of random noise. The data model and experimental setup are the same as in Figure 3.2. This CEM algorithm perturbs the data samples with samples from a chaotic logistic map scaled to fit in valid NEM sets. The scale of the chaotic perturbations decreases with iteration. This GMM-CEM algorithm converges faster than the regular EM algorithm on average. The improvement is roughly equivalent to the improvement in the random-noise GMM-NEM: about a 26% improvement over the baseline EM convergence time.

3.4.1 Large Sample Size Effects

The next theorem shows that the set $A_M$ shrinks to the singleton set $\{0\}$ as the number $M$ of samples in the data set grows. So the probability of satisfying the NEM condition for i.i.d. noise samples goes to zero as $M \to \infty$ with probability one.

Theorem 3.2. [Large Sample GMM- and CMM-NEM]: Assume that the noise random variables are i.i.d. Then the set of i.i.d. noise values

$A_M = \{N^2 \leq 2N(\mu_j - y_k) \;|\; \forall j \text{ and } \forall k\} \quad (3.130)$

that satisfy the Gaussian (Cauchy) NEM condition for all data samples $y_k$ decreases with probability one to the set $\{0\}$ as $M \to \infty$:

$P\left(\lim_{M\to\infty} A_M = \{0\}\right) = 1. \quad (3.131)$

Proof. Define the NEM-condition event $A_k$ for a single sample $y_k$ as

$A_k = \{N^2 \leq 2N(\mu_j - y_k) \;|\; \forall j\}. \quad (3.132)$

Then $N^2 \leq 2N(\mu_j - y_k)$ holds for all $j$ if $N$ satisfies the NEM condition.
So

$N^2 - 2N(\mu_j - y_k) \leq 0 \text{ for all } j \quad (3.133)$

and

$N(N - 2(\mu_j - y_k)) \leq 0 \text{ for all } j. \quad (3.134)$

This quadratic inequality's solution set $[a_j, b_j]$ for each $j$ is

$I_j = [a_j, b_j] = \begin{cases} [0,\; 2(\mu_j - y_k)] & \text{if } y_k < \mu_j \\ [2(\mu_j - y_k),\; 0] & \text{if } y_k > \mu_j \\ \{0\} & \text{if } y_k = \mu_j \end{cases} \quad (3.135)$

Define $b_k^+$ and $b_k^-$ as $b_k^+ = 2 \min_j (\mu_j - y_k)$ and $b_k^- = 2 \max_j (\mu_j - y_k)$. Then the maximal solution set $A_k = [a, b]$ over all $j$ is

$A_k = \bigcap_j^J I_j = \begin{cases} [0,\; b_k^+] & \text{if } y_k < \mu_j \;\forall j \\ [b_k^-,\; 0] & \text{if } y_k > \mu_j \;\forall j \\ \{0\} & \text{if } y_k \in [\min_j \mu_j, \max_j \mu_j] \end{cases} \quad (3.136)$

where $J$ is the number of sub-populations in the mixture density. There is a sorting such that the $I_j$ are nested for each sub-case in (3.136). So the nested interval theorem [273] (or Cantor's intersection theorem [257]) implies that $A_k$ is not empty because it is the intersection of nested bounded closed intervals. $A_k = \{0\}$ holds if the NEM condition fails for that value of $y_k$. This happens when some $I_j$ sets are positive and other $I_j$ sets are negative. The positive and negative $I_j$ sets intersect only at zero. So no non-zero value of $N$ will produce a positive average noise benefit. The additive noise $N$ must be zero.

Write $A_M$ as the intersection of the $A_k$ sub-events:

$A_M = \{N^2 \leq 2N(\mu_j - y_k) \;|\; \forall j \text{ and } \forall k\} \quad (3.137)$
$\;\;\; = \bigcap_k^M A_k \quad (3.138)$
$\;\;\; = \begin{cases} [0,\; \min_k b_k^+] & \text{if } y_k < \mu_j \;\forall j, k \\ [\max_k b_k^-,\; 0] & \text{if } y_k > \mu_j \;\forall j, k \\ \{0\} & \text{if } \exists k : y_k \in [\min_j \mu_j, \max_j \mu_j] \end{cases} \quad (3.139)$

Thus a second application of the nested interval property implies that $A_M$ is not empty. We now characterize the asymptotic behavior of the set $A_M$. $A_M$ depends on the locations of the samples $y_k$ relative to the sub-population means $\mu_j$. $A_M = \{0\}$ if there exists some $k_0$ such that $\min_j \mu_j \leq y_{k_0} \leq \max_j \mu_j$. Define the set $S = [\min_j \mu_j, \max_j \mu_j]$. Then by Lemma 3.1 below $\lim_{M\to\infty} \#_M(Y_k \in S) > 0$ holds with probability one. So there exists with probability one a $k_0 \in \{1 \ldots M\}$ such that $y_{k_0} \in S$ as $M \to \infty$. Then $A_{k_0} = \{0\}$ by equation (3.139).
Then with probability one:

$\lim_{M\to\infty} A_M = A_{k_0} \cap \lim_{M\to\infty} \bigcap_{k \neq k_0}^M A_k \quad (3.140)$
$\;\;\; = \{0\} \cap \lim_{M\to\infty} \bigcap_{k \neq k_0}^M A_k. \quad (3.141)$

So

$\lim_{M\to\infty} A_M = \{0\} \text{ with probability one} \quad (3.142)$

since $A_M$ is not empty by the nested intervals property and since $0 \in A_k$ for all $k$.

Lemma 3.1. [Borel's Law of Large Numbers]: Suppose that $S \subseteq R$ is Borel-measurable and that $R$ is the support of the pdf of the random variable $Y$. Let $M$ be the number of random samples of $Y$. Then as $M \to \infty$:

$\frac{\#_M(Y_k \in S)}{M} \to P(Y \in S) \text{ with probability one} \quad (3.143)$

where $\#_M(Y_k \in S)$ is the number of the random samples $y_1, \ldots, y_M$ of $Y$ that fall in $S$.

Proof. Define the indicator function random variable $I_S(Y)$ as

$I_S(Y) = \begin{cases} 1 & Y \in S \\ 0 & Y \notin S \end{cases} \quad (3.144)$

The strong law of large numbers implies that the sample mean $\bar{I}_S$

$\bar{I}_S = \frac{\sum_k^M I_S(Y_k)}{M} = \frac{\#_M(Y_k \in S)}{M} \quad (3.145)$

converges to $E[I_S]$ with probability one. Here $\#_M(Y_k \in S)$ is the number of the random samples $Y_1, \ldots, Y_M$ that fall in the set $S$. But $E[I_S] = P(Y \in S)$. So with probability one:

$\frac{\#_M(Y_k \in S)}{M} \to P(Y \in S) \quad (3.146)$

as claimed. Then $P(S) > 0$ implies that

$\lim_{M\to\infty} \frac{\#_M(Y_k \in S)}{M} > 0 \quad (3.147)$

and thus $\lim_{M\to\infty} \#_M(Y_k \in S) > 0$ with probability one since $M > 0$.

The proof shows that larger sample sizes $M$ place tighter bounds on the size of $A_M$ with probability one. The bounds shrink $A_M$ all the way down to the singleton set $\{0\}$ as $M \to \infty$. $A_M$ is the set of values that identically distributed noise $N$ can take to satisfy the NEM condition for all $y_k$. $A_M = \{0\}$ means that $N_k$ must be zero for all $k$ because the $N_k$ are identically distributed. This corresponds to cases where the NEM Theorem cannot guarantee improvement over the regular EM using just i.i.d. noise. So identically distributed noise has limited use in the GMM- and CMM-NEM framework. Theorem 3.2 is a "probability-one" result. But it also implies the following convergence-in-probability result. Suppose $\tilde{N}$ is an arbitrary continuous random variable.
Then the probability $P(\tilde{N} \in A_M)$ that $\tilde{N}$ satisfies the NEM condition for all samples falls to $P(\tilde{N} \in \{0\}) = 0$ as $M \to \infty$. Figure 3.10 shows a Monte Carlo simulation of how $P(\tilde{N} \in A_M)$ varies with $M$.

Figure 3.10: Probability of satisfying the NEM sufficient condition with different sample sizes $M$ and at different noise standard deviations $\sigma_N$. The Gaussian mixture density has means $\mu = [0, 1]$, standard deviations $\sigma = [1, 1]$, and weights $\alpha = [0.5, 0.5]$. The number $M$ of data samples varies from $M = 1$ to $M = 60$. The noise standard deviation varies from $\sigma_N = 0.1$ (top curve) to $\sigma_N = 1.0$ (bottom curve) in increments of 0.1. Monte Carlo simulation computed the probability $P(A_M)$ in equation (3.129) from $10^6$ samples.

Using non-identically distributed noise $N_k$ avoids the reduction in the probability of satisfying the NEM condition for large $M$. The NEM condition still holds when $N_k \in A_k$ for each $k$ even if $N_k \notin A_M = \bigcap_k A_k$. This noise sampling model adapts the $k$th noise random variable $N_k$ to the $k$th data sample $y_k$. This is the general NEM noise model. Figures 3.2 and 3.7 use the NEM noise model. This model is equivalent to defining the global NEM event $\tilde{A}_M$ as a Cartesian product of sub-events $\tilde{A}_M = \prod_k^M A_k$ instead of the intersection of sub-events $A_M = \bigcap_k A_k$. Thus the bounds of $\tilde{A}_M$ and its coordinate projections no longer depend on the sample size $M$. Figures 3.2 and 3.7 use the Cartesian product noise event. This is standard for NEM algorithms. Figures 3.11 and 3.12 compare the performance of the NEM algorithm with a pseudo-variant of simulated annealing on the EM algorithm. This version of EM adds annealed i.i.d. noise to the data samples $y$ without screening the noise through the NEM condition. Thus we call it blind noise injection⁶. Figure 3.11 shows that the NEM outperforms blind noise injection at all tested sample sizes $M$. The average convergence time is about 15% lower for the NEM noise model than for the blind noise model at large values of $M$.
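A Monte Carlo estimate of $P(\tilde{N} \in A_M)$ like the one behind Figure 3.10 can be sketched as follows. The mixture means match the figure; the noise level $\sigma_N = 0.5$ and the trial count are illustrative choices, and the mixture draw is a crude equal-weight version.

```python
import numpy as np

rng = np.random.default_rng(1)
mus = [0.0, 1.0]                      # sub-population means as in Figure 3.10

def satisfies_nem(n, ys):
    # Quadratic GMM-NEM condition N^2 <= 2N(mu_j - y_k) for all j and all k,
    # i.e. n(n - 2(mu_j - y_k)) <= 0.
    return all(n * (n - 2.0 * (m - y)) <= 0.0 for y in ys for m in mus)

def p_nem_all(M, trials=2000, sigma_n=0.5):
    """Monte Carlo estimate of P(N in A_M) for i.i.d. noise N ~ N(0, sigma_n)."""
    hits = 0
    for _ in range(trials):
        comp = rng.integers(0, 2, size=M)             # equal-weight component draw
        ys = rng.normal(np.asarray(mus)[comp], 1.0)   # M samples from the GMM
        n = rng.normal(0.0, sigma_n)                  # one shared i.i.d. noise value
        hits += satisfies_nem(n, ys)
    return hits / trials
```

The estimate falls quickly with $M$, in line with Theorem 3.2: a single shared noise value rarely satisfies the condition once any sample lands between the means.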
The two methods are close in performance only at small sample sizes. This is a corollary effect of Theorem 3.2 (see §3.4.2). Figure 3.12 shows that NEM outperforms blind noise injection at a single sample size $M = 225$. But it also shows that blind noise injection may fail to give any benefit even when NEM achieves faster average EM convergence for the same set of samples. Thus blind noise injection (or simple simulated annealing) performs worse than NEM and sometimes performs worse than EM itself.

3.4.2 Small Sample Size: Sparsity Effect

The i.i.d. noise model in Theorem 3.2 has an important corollary effect for sparse data sets. The size of $A_M$ decreases monotonically with $M$ because $A_M = \bigcap_k^M A_k$. Then for $M_0 < M_1$:

$P(N \in A_{M_0}) \geq P(N \in A_{M_1}) \quad (3.148)$

since $M_0 < M_1$ implies that $A_{M_1} \subseteq A_{M_0}$. Thus arbitrary noise $N$ (i.i.d. and independent of $Y_k$) is more likely to satisfy the NEM condition and produce a noise benefit for smaller sample sizes $M_0$ than for larger sample sizes $M_1$. The probability that $N \in A_M$ falls to zero as $M \to \infty$. So the strength of the i.i.d. noise benefit falls as $M \to \infty$. Figure 3.13 shows this sparsity effect. The improvement in relative entropy $D(f_* \| f_{NEM})$ decreases as the number of samples increases: the noise-benefit effect is more pronounced when the data is sparse.

3.4.3 Asymptotic NEM Analysis

We show last how the NEM noise benefit arises by way of the strong law of large numbers and the central limit theorem. This asymptotic analysis uses the sample mean $W_M$:

$W_M = \frac{1}{M} \sum_{k=1}^M W_k. \quad (3.149)$

The $M$ i.i.d. terms $W_k$ have the logarithmic form

$W_k = \ln \frac{f(Y_k + N_k, Z_k | \theta_t)}{f(Y_k, Z_k | \theta_t)}. \quad (3.150)$

The $W_k$ terms are independent because functions of independent random variables are independent. The random sampling framework of the EM algorithm just means that the underlying random variables are themselves i.i.d. Each $W_k$ term gives a sampling version of the left-hand side of (3.3) and thus of the condition that the added noise makes the signal value more probable.

⁶ Blind noise injection differs from both standard simulated annealing and NEM. Standard simulated annealing (SA) injects independent noise into the optimization variable $\theta$, while blind noise injection (and NEM) injects noise into the data $y$. Blind noise injection also differs from NEM because NEM noise depends on the data via the NEM condition, while blind noise injection uses noise that is independent of the data (hence the term "blind"). Thus while blind noise injection borrows characteristics from both NEM and SA, it is very different from both noise injection methods.

Figure 3.11: Comparing the effect of the NEM noise sampling model on GMM-EM at different sample sizes $M$. The dependent noise model uses the NEM condition. The independent noise model does not check the NEM condition. So the independent noise model has a lower probability of satisfying the NEM condition for all values of $M$. The plot shows that the dependent noise model outperforms the independent noise model at all sample sizes $M$. The dependent noise model converges in about 15% fewer steps than the independent noise model for large $M$. This Gaussian mixture density has sub-population means $\mu = \{0, 1\}$, standard deviations $\sigma = \{1, 1\}$, and weights $\alpha = \{0.5, 0.5\}$. The NEM procedure uses annealed Gaussian noise with initial noise power $\sigma_N = 0.17$.

Figure 3.12: Comparing the effects of noise injection via simulated annealing vs. noise injection via NEM on GMM-EM for a single sample size of $M = 225$. The dependent noise model uses the NEM condition. The independent noise model adds annealed noise without checking the NEM condition. So the independent noise model has a lower probability of satisfying the NEM condition by Theorem 3.2. The plot shows that NEM noise injection outperforms the simulated annealing noise injection. NEM converges up to about 20% faster than simulated annealing for this model. And simulated annealing shows no reduction in average convergence time. The Gaussian mixture density has means $\mu = \{0, 1\}$, standard deviations $\sigma = \{1, 1\}$, and weights $\alpha = \{0.5, 0.5\}$ with $M = 225$ samples.
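The $W_k$ terms and their sample mean can be illustrated with a toy complete-data model. This sketch assumes a single Gaussian density $f = N(\mu, 1)$ and a deterministic noise rule $n_k = s(\mu - y_k)$ with $s \in [0, 2]$, which satisfies the quadratic NEM condition $n(n - 2(\mu - y)) \leq 0$; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, M = 2.0, 5000
y = rng.normal(mu, 1.0, size=M)

# n = s (mu - y) with s in [0, 2] gives n(n - 2(mu - y)) = s(s - 2)(mu - y)^2 <= 0,
# so each noise term satisfies the NEM condition for this model.
n = 0.5 * (mu - y)

log_f = lambda x: -0.5 * (x - mu) ** 2    # Gaussian log-density up to a constant
w = log_f(y + n) - log_f(y)               # the W_k samples of (3.150)
w_bar = w.mean()                          # sample mean W_M of (3.149)
```

Here each $W_k = 0.375\,(y_k - \mu)^2 \geq 0$, so the sample mean is positive: the noise makes every sample more probable, which is the NEM condition in its sampling form.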
And simulated annealing shows no reduction in average convergence time.The Gaussian mixture density has mean =f0; 1g, standard deviations N =f1; 1g, and weights =f0:5; 0:5g with M = 225 samples. We observe rst that either the strong or weak law of large numbers [29] applies to the sample mean W M . The i.i.d. terms W k have population mean W =E[W ] and nite population variance 2 W =V [W ]. Then the strong (weak) law of large numbers states that the sample mean W M converges to the population mean W : W M ! W (3.151) with probability one (in probability) [29], [79], [88]. The population mean W diers from W in general for a given t because t need not equal until convergence. This dierence arises because the expectation W integrates against the pdff(y;z;nj t ) while the expectation W integrates against the pdff(y;z;nj ). But W ! W as t ! . So the law of large numbers implies that 83 Figure 3.13: Noise benets and sparsity eects in the Gaussian mixture NEM at dierent sample sizesM. The Gaussian mixture density consists of two sub-populations with mean 1 = 0 and 2 = 1 and standard deviations 1 = 2 = 1. The number M of data samples varies from M = 20 (top curve) to M = 1000 (bottom curve). The noise standard deviation varies from N = 0 (no noise or standard EM) to N = 1 at 0.1 incremental steps. The plot shows the average relative entropy D(f jjf NEM ) over 50 trials for each noise standard deviation N . f =f(xj) is the true pdf and f NEM =f(xj NEM ) is the pdf of NEM-estimated parameters. W M ! W (3.152) with probability one (in probability). So the sample mean converges to the expectation in the positivity condition (3.16). The central limit theorem (CLT) applies to the sample mean W M for large sample sizeM. The CLT states that the standardized sample mean of i.i.d. random variables with nite variance converges in distribution to a standard normal random variable ZN(0; 1) [29]. A noise benet occurs when the noise makes the signal more probable and thus when W M > 0. 
Then standardizing $W_M$ gives the following approximation for large sample size $M$:

$P(W_M > 0) = P\left(\frac{W_M - \mu_W}{\sigma_W / \sqrt{M}} > \frac{-\mu_W}{\sigma_W / \sqrt{M}}\right) \quad (3.153)$
$\;\;\; \approx P\left(Z > \frac{-\sqrt{M}\,\mu_W}{\sigma_W}\right) \text{ by the CLT} \quad (3.154)$
$\;\;\; = \Phi\left(\frac{\sqrt{M}\,\mu_W}{\sigma_W}\right) \quad (3.155)$

where $\Phi$ is the cumulative distribution function of the standard normal random variable $Z$. So $P(W_M > 0) > \frac{1}{2}$ if $\mu_W > 0$ and $P(W_M > 0) < \frac{1}{2}$ if $\mu_W < 0$. Suppose the positivity condition (3.4) holds so that $\mu_W > 0$. Then this probability $P(W_M > 0)$ goes to one as the sample size $M$ goes to infinity and as $\theta_k$ converges to $\theta_*$:

$\lim_{M\to\infty} P(W_M > 0) = 1. \quad (3.156)$

The same argument and (3.155) show that

$\lim_{M\to\infty} P(W_M > 0) = 0 \quad (3.157)$

if the positivity condition (3.4) fails so that $\mu_W < 0$. This analysis suggests a sample-mean version of the NEM condition in (3.16):

$W_M \geq 0 \quad (3.158)$

where

$W_M = \frac{1}{M} \sum_{k=1}^M \ln \frac{f(Y_k + N_k, Z_k | \theta_t)}{f(Y_k, Z_k | \theta_t)}. \quad (3.159)$

The sample-mean NEM condition (3.158) matches the NEM condition (3.16) asymptotically as the number of samples $M \to \infty$. So the sample-mean NEM condition can serve as a surrogate NEM sufficient condition for large sample-size data sets. This result stands in counterpoint to the discussion of NEM in the small-sample regime (§3.4.2): independent additive noise can produce EM noise benefits for small data sets, while the sample-mean NEM condition can produce noise benefits for large data sets. The sample-mean version may also be useful when there is no analytic NEM condition available for the data model. Such scenarios may occur with analytically intractable likelihoods or with empirical or nonparametric likelihoods [234], [235].

3.5 Conclusion

Careful noise injection can speed up the average convergence time of the EM algorithm. The various sufficient conditions for such a noise benefit involve a direct or average effect where the noise makes the signal data more probable. Special cases include mixture density models and log-convex probability density models.
Noise injection for the Gaussian and Cauchy mixture models improves the average EM convergence speed when the noise satisfies a simple quadratic condition. Even blind noise injection can benefit these systems when the data set is sparse. But NEM noise injection still outperforms blind noise injection in all data models tested. An asymptotic argument also shows that the sample-mean version of the EM noise-benefit condition obeys a similar positivity condition. Future research should address theories and methods for finding optimal noise levels for NEM algorithms. The current NEM implementations use a rudimentary search to find good noise levels. The noise-benefit argument depends on the behavior of the log-likelihood. The NEM condition depends on local variability in the observed log-likelihood surface. This suggests that the Fisher information, a normalized measure of the likelihood's average rate of change, controls the size and location of favorable noise injection regimes. Another open research question concerns the optimal shape of the additive noise distributions for the NEM algorithm. Most simulations in this dissertation sample scalar noise from a single distribution with varying noise power. Other simulations show that different noise distribution families also cause faster average EM convergence speed as long as the noise satisfies the NEM condition. But there has been no exploration of the relative performance of different noise distributions for noisy EM. An untested conjecture suggests that impulsive noise distributions like alpha-stable distributions [221], [263], [321] may help NEM find global ML estimates more easily. Such noise distribution-dependent benefits provide a level of flexibility that is unavailable to the regular EM algorithm. Future research may also study the comparative effects of random noise versus deterministic interference for EM algorithms. This includes the study of chaotic systems for deterministic interference.
The NEM theorem and algorithms are general. They apply to many data models. The next three chapters demonstrate NEM benefits in three important incomplete-data models: mixture models for clustering, hidden Markov models, and feedforward neural networks with hidden layers. The EM tuning algorithms for these models are very popular algorithms: k-means clustering, the Baum-Welch algorithm, and backpropagation respectively. We show that NEM produces speed improvements for these algorithms.

Chapter 4
NEM Application: Clustering and Competitive Learning Algorithms

Clustering algorithms feature prominently in large-scale commercial recommendation systems like Google News (news articles) [66], Netflix (movies) [163], [164], and Amazon (products) [190]. Such recommendations often rely on centroid-based clustering algorithms to classify customers and produce relevant recommendations. These clustering algorithms tend to be computationally expensive and slow [190]. This chapter shows that noise can provably speed up convergence in many centroid-based clustering algorithms. This includes the popular k-means clustering algorithm. The clustering noise benefit is a direct consequence of the general noise benefit for the EM algorithm (Theorem 3.1) because many clustering algorithms (including the k-means algorithm) are special cases of the EM algorithm [47], [307]. The noise benefit for clustering algorithms is a classification accuracy improvement for NEM-based classifiers compared to EM-based classifiers at the same number of training iterations. EM-based clustering algorithms use EM to train underlying data models and then use the optimized data models to classify samples into their best clusters. The data model is usually a mixture model. Both NEM and EM algorithms find locally optimal model parameters in the convergence limit.
But faster NEM convergence means that the NEM algorithm gives higher-likelihood pre-converged parameters on average than the EM algorithm for fixed iteration limits. The Clustering Noise Benefit Theorem (Theorem 4.1) below formalizes this classification-accuracy noise benefit. Figure 4.1 shows a simulation instance of the corollary clustering noise benefit of the NEM Theorem for a two-dimensional GMM with three Gaussian data clusters. Theorem 4.1 below states that such a noise benefit will occur. Each point on the curve reports how much two classifiers disagree on the same data set. The first classifier is the EM-classifier with fully converged EM-parameters. This is the reference classifier. The second classifier is the same EM-classifier with only partially converged EM-parameters. The two classifiers agree eventually if we let the second classifier's EM-parameters converge. But the figure shows that they agree faster with some noise than with no noise. We call the normalized number of disagreements the misclassification rate. The misclassification rate falls as the Gaussian noise power increases from zero. It reaches a minimum for additive white noise with standard deviation 0.3. More energetic noise does not reduce misclassification rates beyond this point. The optimal noise reduces misclassification by almost 30%. Figure 4.2 shows a similar noise benefit in the simpler k-means clustering algorithm on 3-dimensional Gaussian mixture data. The k-means algorithm is a special case of the EM algorithm as we show below in Theorem 4.2. So the EM noise benefit extends to the k-means algorithm. The figure plots the average convergence time for noise-injected k-means routines at different initial noise levels. The figure shows an instance where decaying noise helps the algorithm converge about 22% faster than without noise. Simulations also show that noise also speeds up convergence in stochastic unsupervised competitive learning (UCL), supervised competitive learning (SCL), and differential competitive learning (DCL). These competitive learning (CL) algorithms are not fully within the ambit of the NEM theorem. But they are generalized neural-network versions of the k-means algorithm. Thus these simulations hint that a related noise-benefit principle may apply to CL algorithms.

4.1 Clustering

Clustering algorithms divide data sets into clusters based on similarity measures between samples [78], [148], [307], [308]. The similarity measure attempts to quantify how samples differ statistically. Many algorithms use the Euclidean distance or Mahalanobis similarity measure. Clustering algorithms assign similar samples to the same cluster. Centroid-based clustering algorithms assign samples to the cluster with the closest centroid $\mu_1, \ldots, \mu_k$.

Figure 4.1: Noise benefit based on the misclassification rate for the Noisy Expectation-Maximization (NEM) clustering procedure on a 2-D Gaussian mixture model with three Gaussian data clusters (inset), where each cluster has a different covariance matrix. The plot shows that the misclassification rate falls as the additive noise power increases. The classification error rises if the noise power increases too much. The misclassification rate measures the mismatch between a NEM classifier with unconverged parameters $\theta_k$ and the optimal NEM classifier with converged parameters $\theta_*$. The unconverged NEM classifier's NEM procedure stops a quarter of the way to convergence. The dashed horizontal line indicates the misclassification rate for regular EM classification without noise. The red dashed vertical line shows the optimum noise standard deviation for NEM classification. The optimum noise has a standard deviation of 0.3.

This clustering framework is an attempt to solve an NP-hard optimization problem.
The algorithms define data clusters that minimize the total within-cluster deviation from the centroids. Suppose the $y_i$ are samples of a data set on a sample space $D$. Centroid-based clustering partitions $D$ into the $k$ decision classes $D_1, \ldots, D_k$ of $D$. The algorithms look for optimal cluster parameters that minimize an objective function. The k-means clustering method [197] minimizes the total sum of squared Euclidean within-cluster distances [47], [307]:

$\sum_{j=1}^K \sum_{i=1}^N \|y_i - \mu_j\|^2 \, I_{D_j}(y_i) \quad (4.1)$

where $I_{D_j}$ is the indicator function that indicates the presence or absence of pattern $y$ in $D_j$:

$I_{D_j}(y) = \begin{cases} 1 & \text{if } y \in D_j \\ 0 & \text{if } y \notin D_j \end{cases} \quad (4.2)$

Figure 4.2: Noise benefit in the k-means clustering procedure on 2500 samples of a 3-D Gaussian mixture model with four clusters. The plot shows that the convergence time falls as additive white Gaussian noise power increases. The noise decays at an inverse-square rate with each iteration. Convergence time rises if the noise power increases too much. The dashed horizontal line indicates the convergence time for regular k-means clustering without noise. The red dashed vertical line shows the optimum noise standard deviation for noisy k-means clustering. The optimum noise has a standard deviation of 0.45: the convergence time falls by about 22%.

There are many approaches to clustering [78], [148], [307]. Clustering algorithms come from fields that include nonlinear optimization, probabilistic clustering, neural network-based clustering [167], fuzzy clustering [27], [138], graph-theoretic clustering [54], [122], agglomerative clustering [318], and bio-mimetic clustering [96], [115].

4.1.1 Noisy Expectation-Maximization for Clustering

Probabilistic clustering algorithms may model the true data distribution as a mixture of sub-populations.
This mixture-model assumption converts the clustering problem into a two-fold problem: density estimation for the underlying mixture model and data classification based on the estimated mixture model. The naive Bayes classifier is one simple approach for discriminating between sub-populations. There are other more involved classifiers that may have better statistical properties, e.g. ensemble classifiers or boosted classifiers. The EM algorithm is a standard method for estimating parametric mixture densities. The parametric density estimation part of the clustering procedure can benefit from noise injection. This noise benefit derives from the application of the Noisy Expectation-Maximization (NEM) theorem to EM in the clustering framework. A common mixture model in EM clustering methods is the Gaussian mixture model (GMM). We can apply the NEM Theorem to clustering algorithms that assume Gaussian sub-populations in the mixture. The NEM positivity condition

$E_{Y,Z,N|\theta_*}\left[\ln \frac{f(Y+N, Z|\theta_k)}{f(Y, Z|\theta_k)}\right] \geq 0 \quad (4.3)$

gives a sufficient condition under which noise speeds up the EM algorithm's convergence to local optima. This condition implies that a suitably noisy EM algorithm estimates the EM estimate in fewer steps on average than does the corresponding noiseless EM algorithm. The positivity condition reduces to a much simpler algebraic condition for GMMs. The model satisfies the positivity condition (4.3) when the additive noise samples $\mathbf{n} = (n_1, \ldots, n_d)$ satisfy the following algebraic condition [229], [231] (misstated in [226] but corrected in [225]):

$n_i \left[n_i - 2(\mu_{ji} - y_i)\right] \leq 0 \text{ for all } j. \quad (4.4)$

This condition applies to the variance update in the EM algorithm. It needs the current estimate of the centroids $\mu_j$. The NEM algorithm also anneals the additive noise by multiplying the noise power $\sigma_N$ by constants that decay with the iteration count.
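The componentwise condition (4.4) is easy to check in code. This is a minimal sketch: `gmm_nem_ok` is a hypothetical helper name, and it tests the quadratic condition for every coordinate of every sub-population mean.

```python
import numpy as np

def gmm_nem_ok(n, y, mus):
    """Check the GMM-NEM condition (4.4): n_i (n_i - 2 (mu_{ji} - y_i)) <= 0
    for every coordinate i and every sub-population mean vector mu_j."""
    n, y = np.asarray(n, dtype=float), np.asarray(y, dtype=float)
    return all(np.all(n * (n - 2.0 * (np.asarray(mu, dtype=float) - y)) <= 0.0)
               for mu in mus)
```

For example, with sample $y = (0, 0)$ and means $(2, 2)$ and $(3, 3)$, the noise $(0.5, 0.5)$ satisfies the condition while $(5, 0.5)$ overshoots the first mean and fails it.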
We found that the best application of the algorithm uses inverse-square decaying constants $k^{-2}$ to scale the noise $\mathbf{N}$ [231].

4.1.2 GMM-EM for Clustering

We use the finite mixture model notation from §2.5.2. The Q-function for EM on a finite mixture model with $K$ sub-populations is:

$Q(\theta|\theta(t)) = \sum_{j=1}^K \ln[\alpha_j f(y|j, \theta_j)]\; p_Z(j|y, \theta(t)) \quad (4.5)$

where

$p_Z(j|y, \theta(t)) = \frac{\alpha_j f(y|Z=j, \theta_j(t))}{f(y|\theta(t))}. \quad (4.6)$

Equation (4.5) gives the E-step for the mixture model. The GMM uses the above Q-function with Gaussian pdfs for $f(y|j, \theta_j)$. Suppose there are $N$ data samples of the GMM distributions. The EM algorithm estimates the mixing probabilities $\alpha_j$, the sub-population means $\mu_j$, and the sub-population covariances $\Sigma_j$. The current estimate of the GMM parameters is $\theta(t) = \{\alpha_1(t), \ldots, \alpha_K(t), \mu_1(t), \ldots, \mu_K(t), \Sigma_1(t), \ldots, \Sigma_K(t)\}$. The iterations of the GMM-EM reduce to the following update equations:

$\alpha_j(t+1) = \frac{1}{N} \sum_{i=1}^N p_Z(j|y_i, \theta(t)) \quad (4.7)$
$\mu_j(t+1) = \sum_{i=1}^N y_i\, \gamma_j(t|y_i) \quad (4.8)$
$\Sigma_j(t+1) = \sum_{i=1}^N \gamma_j(t|y_i)\, (y_i - \mu_j(t))(y_i - \mu_j(t))^T \quad (4.9)$

where

$\gamma_j(t|y_i) = \frac{p_Z(j|y_i, \theta(t))}{\sum_{i=1}^N p_Z(j|y_i, \theta(t))}. \quad (4.10)$

These equations update the parameters $\alpha_j$, $\mu_j$, and $\Sigma_j$ with coordinate values that maximize the Q-function in (4.5) [78], [306]. The updates combine both the E-steps and M-steps of the EM procedure.

4.1.3 Naive Bayes Classifier on GMMs

GMM-EM clustering uses the membership probability density function $p_Z(j|y, \theta_{EM})$ as a maximum a posteriori classifier for each sample $y$. The classifier assigns $y$ to the $j$th cluster if $p_Z(j|y, \theta_{EM}) \geq p_Z(k|y, \theta_{EM})$ for all $k \neq j$. Thus

$\text{EMclass}(y) = \arg\max_j p_Z(j|y, \theta_{EM}). \quad (4.11)$

This is the naive Bayes classifier [76], [250] based on the EM-optimal GMM parameters for the data. NEM clustering uses the same classifier but with the NEM-optimal GMM parameters for the data:

$\text{NEMclass}(y) = \arg\max_j p_Z(j|y, \theta_{NEM}). \quad (4.12)$

4.1.4 The Clustering Noise Benefit Theorem

The next theorem shows that the noise benefit of the NEM Theorem extends to EM-clustering. The noise benefit occurs in misclassification relative to the EM-optimal classifier. The theorem uses the following notation:

$\text{class}_{opt}(Y) = \arg\max_j p_Z(j|Y, \theta_*)$ : EM-optimal classifier. It uses the optimal model parameters $\theta_*$.
$P_M[k] = P(\text{EMclass}_k(Y) \neq \text{class}_{opt}(Y))$ : Probability of EM-clustering misclassification relative to $\text{class}_{opt}$ using $k$th-iteration parameters.
$P_{M_N}[k] = P(\text{NEMclass}_k(Y) \neq \text{class}_{opt}(Y))$ : Probability of NEM-clustering misclassification relative to $\text{class}_{opt}$ using $k$th-iteration parameters.
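The GMM-EM updates (4.5)-(4.10) can be sketched in 1-D. This is an illustrative implementation, not the dissertation's code; it uses the standard convention of plugging the new means into the variance update, and all test numbers are assumptions.

```python
import numpy as np

def gmm_em_step(y, pis, mus, sigmas):
    """One EM update for a 1-D Gaussian mixture (sketch of (4.7)-(4.10))."""
    y = np.asarray(y, dtype=float)
    # E-step: membership probabilities p_Z(j | y_i, theta(t)), shape (K, N).
    dens = np.array([pi * np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
                     for pi, m, s in zip(pis, mus, sigmas)])
    resp = dens / dens.sum(axis=0, keepdims=True)
    # gamma_j(t | y_i): responsibilities normalized over the samples, as in (4.10).
    gamma = resp / resp.sum(axis=1, keepdims=True)
    # M-step: mixing weights (4.7), means (4.8), and standard deviations (4.9).
    new_pis = resp.mean(axis=1)
    new_mus = (gamma * y).sum(axis=1)
    new_sigmas = np.sqrt((gamma * (y - new_mus[:, None]) ** 2).sum(axis=1))
    return new_pis, new_mus, new_sigmas
```

Iterating this step on well-separated two-component data drives the estimated means toward the true sub-population means.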
and M-steps of the EM procedure.

4.1.3 Naive Bayes Classifier on GMMs

GMM-EM clustering uses the membership probability density function $p_Z(j\mid y,\theta_{EM})$ as a maximum a posteriori classifier for each sample $y$. The classifier assigns $y$ to the $j$-th cluster if $p_Z(j\mid y,\theta_{EM}) \geq p_Z(k\mid y,\theta_{EM})$ for all $k\neq j$. Thus

$$\mathrm{EMclass}(y) = \arg\max_j\, p_Z(j\mid y,\theta_{EM}). \qquad(4.11)$$

This is the naive Bayes classifier [76], [250] based on the EM-optimal GMM parameters for the data. NEM clustering uses the same classifier but with the NEM-optimal GMM parameters for the data:

$$\mathrm{NEMclass}(y) = \arg\max_j\, p_Z(j\mid y,\theta_{NEM}). \qquad(4.12)$$

4.1.4 The Clustering Noise Benefit Theorem

The next theorem shows that the noise benefit of the NEM Theorem extends to EM clustering. The noise benefit occurs in misclassification relative to the EM-optimal classifier. The theorem uses the following notation:

$\mathrm{class}_{opt}(Y) = \arg\max_j\, p_Z(j\mid Y,\theta^*)$: EM-optimal classifier. It uses the optimal model parameters $\theta^*$.

$P_M[k] = P(\mathrm{EMclass}_k(Y) \neq \mathrm{class}_{opt}(Y))$: probability of EM-clustering misclassification relative to $\mathrm{class}_{opt}$ using $k$-th iteration parameters.

$P_{M_N}[k] = P(\mathrm{NEMclass}_k(Y) \neq \mathrm{class}_{opt}(Y))$: probability of NEM-clustering misclassification relative to $\mathrm{class}_{opt}$ using $k$-th iteration parameters.

Theorem 4.1. [Clustering Noise Benefit Theorem (CNBT)] Consider the NEM and EM iterations at the $k$-th step. Then the NEM misclassification probability $P_{M_N}[k]$ is less than the noise-free EM misclassification probability $P_M[k]$:

$$P_{M_N}[k] \leq P_M[k] \qquad(4.13)$$

when the additive noise $N$ in the NEM-clustering procedure satisfies the NEM Theorem condition from (4.3):

$$\mathbb{E}_{Y,Z,N\mid\theta^*}\left[\ln\frac{f(Y+N,Z\mid\theta_k)}{f(Y,Z\mid\theta_k)}\right] \geq 0. \qquad(4.14)$$

This positivity condition (4.14) in the GMM-NEM model reduces to the simple algebraic condition (4.4) [229], [231] for each coordinate $i$:

$$n_i\left[n_i - 2\left(\mu_{j,i} - y_i\right)\right] \leq 0 \quad\text{for all } j.$$

Proof. Misclassification is a mismatch in argument maximizations: $\mathrm{EMclass}_k(Y) \neq \mathrm{class}_{opt}(Y)$ if and only if

$$\arg\max_j\, p_Z(j\mid Y,\theta_{EM}[k]) \neq \arg\max_j\, p_Z(j\mid Y,\theta^*). \qquad(4.15)$$

This mismatch disappears as $\theta_{EM}$ converges to $\theta^*$. Thus $\arg\max_j p_Z(j\mid Y,\theta_{EM}[k])$ converges to $\arg\max_j p_Z(j\mid Y,\theta^*)$ since

$$\lim_{k\to\infty} \left\|\theta_{EM}[k] - \theta^*\right\| = 0 \qquad(4.16)$$

by definition of the EM algorithm. So the argument maximization mismatch decreases as the EM estimates get closer to the optimum parameter $\theta^*$. But the NEM condition (4.14) implies that the following inequality holds on average at the $k$-th iteration:

$$\left\|\theta_{NEM}[k] - \theta^*\right\| \leq \left\|\theta_{EM}[k] - \theta^*\right\|. \qquad(4.17)$$

Thus for a fixed iteration count $k$:

$$P\bigl(\mathrm{NEMclass}_k(Y) \neq \mathrm{class}_{opt}(Y)\bigr) \leq P\bigl(\mathrm{EMclass}_k(Y) \neq \mathrm{class}_{opt}(Y)\bigr) \qquad(4.18)$$

on average. So

$$P_{M_N}[k] \leq P_M[k] \qquad(4.19)$$

on average. Thus noise reduces the probability of EM clustering misclassification
relative to the EM-optimal classifier on average when the noise satisfies the NEM condition. This means that an unconverged NEM classifier performs closer to the fully converged classifier than does an unconverged noiseless EM classifier on average. We next state the noise-enhanced EM GMM algorithm in 1-D.

Algorithm 4.1: Noisy GMM-EM Algorithm (1-D)
Input: $y_1,\ldots,y_N$ GMM data samples
Output: $\hat\theta_{NEM}$: NEM estimate of parameter $\theta$
while ($\|\theta_k - \theta_{k-1}\| \geq 10^{-tol}$) do
  N-Step: $z_i = y_i + n_i$ where $n_i$ is a sample of the truncated Gaussian $\mathcal{N}(0, \sigma_N/k^2)$ such that $n_i\left[n_i - 2(\mu_j - y_i)\right] \leq 0$ for all $i, j$
  E-Step: $Q(\theta\mid\theta_k) \leftarrow \sum_{i=1}^{N}\sum_{j=1}^{K} \ln\left[\alpha_j f(z_i\mid j,\theta_j)\right]\, p_Z(j\mid y_i,\theta_k)$
  M-Step: $\theta_{k+1} \leftarrow \arg\max_{\theta}\, Q(\theta\mid\theta_k)$
  $k \leftarrow k+1$
$\hat\theta_{NEM} \leftarrow \theta_k$

The D-dimensional GMM-EM algorithm runs the N-Step component-wise for each data dimension. Figure 4.1 shows a simulation instance of the predicted GMM noise benefit for 2-D cluster-parameter estimation. The figure shows that the optimum noise reduces GMM-cluster misclassification by almost 30%.

4.2 The k-Means Clustering Algorithm

k-means clustering is a non-parametric procedure for partitioning data samples into clusters [197], [307]. Suppose the data space $\mathbb{R}^d$ has $K$ centroids $\mu_1,\ldots,\mu_K$. The procedure tries to find $K$ partitions $D_1,\ldots,D_K$ with centroids $\mu_1,\ldots,\mu_K$ that minimize the within-cluster Euclidean distance from the cluster centroids:

$$\arg\min_{D_1,\ldots,D_K}\ \sum_{j=1}^{K}\sum_{i=1}^{N} \|y_i - \mu_j\|^2\, I_{D_j}(y_i) \qquad(4.20)$$

for $N$ pattern samples $y_1,\ldots,y_N$. The class indicator functions $I_{D_1},\ldots,I_{D_K}$ arise from the nearest-neighbor classification in (4.22) below. Each indicator function $I_{D_j}$ indicates the presence or absence of pattern $y$ in $D_j$:

$$I_{D_j}(y) = \begin{cases} 1 & \text{if } y \in D_j \\ 0 & \text{if } y \notin D_j. \end{cases} \qquad(4.21)$$

The k-means procedure finds local optima for this objective function.
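One local-optimization pass over the objective (4.20) alternates a nearest-centroid assignment with a centroid recomputation. A minimal NumPy sketch (the function name and the empty-cluster handling are illustrative assumptions, not from this dissertation):

```python
import numpy as np

def kmeans_step(y, mu):
    """One k-means iteration: nearest-centroid assignment, then
    centroid update as the mean of each partition.
    y: (N, d) pattern samples; mu: (K, d) current centroids."""
    # Squared Euclidean distance from every sample to every centroid.
    d2 = ((y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    labels = d2.argmin(axis=1)                                # winner index per sample
    mu_new = mu.copy()
    for j in range(mu.shape[0]):
        members = y[labels == j]
        if len(members) > 0:       # leave empty clusters unchanged (assumption)
            mu_new[j] = members.mean(axis=0)
    return mu_new, labels
```

Iterating this step to a fixed point yields a local optimum of the within-cluster objective.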
k-means clustering works in the following two steps:

Algorithm 4.2: k-Means Clustering Algorithm
Assign Samples to Partitions:

$$y_i \in D_j(t) \quad\text{if}\quad \|y_i - \mu_j(t)\| \leq \|y_i - \mu_k(t)\| \quad\text{for all } k \neq j \qquad(4.22)$$

Update Centroids:

$$\mu_j(t+1) = \frac{1}{|D_j(t)|}\sum_{i=1}^{N} y_i\, I_{D_j(t)}(y_i). \qquad(4.23)$$

4.2.1 k-Means Clustering as a GMM-EM Procedure

k-means clustering is a special case of the GMM-EM model [47], [124]. The key to this subsumption is the "degree of membership" function or "cluster-membership measure" $m(j\mid y)$ [117], [309]. It is a fuzzy measure of how much the sample $y_i$ belongs to the $j$-th sub-population or cluster. The GMM-EM model uses Bayes theorem to derive a soft cluster-membership function:

$$m(j\mid y) = p_Z(j\mid y,\theta) = \frac{\alpha_j f(y\mid Z=j,\theta_j)}{f(y\mid\theta)}. \qquad(4.24)$$

k-means clustering assumes a hard cluster-membership [117], [157], [309]:

$$m(j\mid y) = I_{D_j}(y) \qquad(4.25)$$

where $D_j$ is the partition region whose centroid is closest to $y$. The k-means assignment step redefines the cluster regions $D_j$ to modify this membership function. The procedure does not estimate the covariance matrices in the GMM-EM formulation. Noise also benefits the k-means procedure as Figure 4.2 shows since k-means is an EM procedure.

Theorem 4.2. [k-means is a sub-case of the EM algorithm] Suppose that the sub-populations have known spherical covariance matrices $\Sigma_j$ and known mixing proportions $\alpha_j$. Suppose further that the cluster-membership function is hard:

$$m(j\mid y) = I_{D_j}(y). \qquad(4.26)$$

Then GMM-EM reduces to k-means clustering:

$$\frac{\sum_{i=1}^{N} p_Z(j\mid y_i,\theta(t))\, y_i}{\sum_{i=1}^{N} p_Z(j\mid y_i,\theta(t))} = \frac{1}{|D_j(t)|}\sum_{i=1}^{N} y_i\, I_{D_j(t)}(y_i). \qquad(4.27)$$

Proof. The covariance matrices $\Sigma_j$ and mixing proportions $\alpha_j$ are constant. So the update equations (4.7) and (4.9) do not apply in the GMM-EM procedure.
The mean (or centroid) update equation in the GMM-EM procedure becomes

$$\mu_j(t+1) = \frac{\sum_{i=1}^{N} p_Z(j\mid y_i,\theta(t))\, y_i}{\sum_{i=1}^{N} p_Z(j\mid y_i,\theta(t))}. \qquad(4.28)$$

The hard cluster-membership function

$$m_t(j\mid y) = I_{D_j(t)}(y) \qquad(4.29)$$

changes the $t$-th iteration's mean update to

$$\mu_j(t+1) = \frac{\sum_{i=1}^{N} y_i\, m_t(j\mid y_i)}{\sum_{i=1}^{N} m_t(j\mid y_i)}. \qquad(4.30)$$

The sum of the hard cluster-membership function reduces to

$$\sum_{i=1}^{N} m_t(j\mid y_i) = N_j = |D_j(t)| \qquad(4.31)$$

where $N_j$ is the number of samples in the $j$-th partition. Thus the mean update is

$$\mu_j(t+1) = \frac{1}{|D_j(t)|}\sum_{i=1}^{N} y_i\, I_{D_j(t)}(y_i). \qquad(4.32)$$

Then the EM mean update equals the k-means centroid update:

$$\frac{\sum_{i=1}^{N} p_Z(j\mid y_i,\theta(t))\, y_i}{\sum_{i=1}^{N} p_Z(j\mid y_i,\theta(t))} = \frac{1}{|D_j(t)|}\sum_{i=1}^{N} y_i\, I_{D_j(t)}(y_i). \qquad(4.33)$$

The known diagonal covariance matrices $\Sigma_j$ and mixing proportions $\alpha_j$ can arise from prior knowledge or previous optimizations. Estimates of the mixing proportions (4.7) get collateral updates as learning changes the size of the clusters. Approximately hard cluster membership can occur in the regular EM algorithm when the sub-populations are well separated. An EM-optimal parameter estimate $\theta^*$ will result in very low posterior probabilities $p_Z(j\mid y,\theta^*)$ if $y$ is not in the $j$-th cluster. The posterior probability is close to one for the correct cluster. Celeux and Govaert proved a similar result by showing an equivalence between the objective functions for EM and k-means clustering [47], [307]. Noise injection simulations confirmed the predicted noise benefit in the k-means clustering algorithm.

4.2.2 k-Means Clustering and Adaptive Resonance Theory

k-means clustering resembles Adaptive Resonance Theory (ART) [43], [167], [307]. And so ART should also benefit from noise. k-means clustering learns clusters from input data without supervision. ART performs similar unsupervised learning on input data using neural circuits.
ART uses interactions between two fields of neurons: the comparison neuron field (or bottom-up activation) and the recognition neuron field (or top-down activation). The comparison field matches against the input data. The recognition field forms internal representations of learned categories. ART uses bidirectional "resonance" as a substitute for supervision. Resonance refers to the coherence between the recognition and comparison neuron fields. The system is stable when the input signals match the recognition field categories. But the ART system can learn a new pattern or update an existing category if the input signal fails to match any recognition category to within a specified level of "vigilance" or degree of match. ART systems are more flexible than regular k-means systems because ART systems do not need a pre-specified cluster count $k$ to learn the data clusters. ART systems can also update the cluster count on the fly if the input data characteristics change. Extensions to the ART framework include ARTMAP [44] for supervised classification learning and Fuzzy ART for fuzzy clustering [45]. An open research question is whether NEM-like noise injection will provably benefit ART systems.

4.3 Competitive Learning Algorithms

Competitive learning algorithms learn centroidal patterns from streams of input data by adjusting the weights of only those units that win a distance-based competition or comparison [111], [161], [167], [168]. Stochastic competitive learning behaves as a form of adaptive quantization because the trained synaptic fan-in vectors (centroids) tend to distribute themselves in the pattern space so as to minimize the mean-squared error of vector quantization [167]. Such a quantization vector also converges with probability one to the centroid of its nearest-neighbor class [168]. We will show that most competitive learning systems benefit from noise.
This further suggests that a noise benefit holds for ART systems because they use competitive learning to form learned pattern categories. Unsupervised competitive learning (UCL) is a blind clustering algorithm that tends to cluster like patterns together. It uses the implied topology of a two-layer neural network. The first layer is just the data layer for the input patterns $y$ of dimension $d$. There are $K$-many competing neurons in the second layer. The synaptic fan-in vectors to these neurons define the local centroids or quantization vectors $\mu_1,\ldots,\mu_K$. Simple distance matching approximates the complex nonlinear dynamics of the second-layer neurons competing for activation in an on-center/off-surround winner-take-all connection topology [167] as in an ART system. Each incoming pattern stimulates a new competition. The winning $j$-th neuron modifies its fan-in of synapses while the losing neurons do not change their synaptic fan-ins. Nearest-neighbor matching picks the winning neuron by finding the synaptic fan-in vector closest to the current input pattern. Then the UCL learning law moves the winner's synaptic fan-in centroid or quantizing vector a little closer to the incoming pattern. We first write the UCL algorithm as a two-step process of distance-based "winning" and synaptic-vector update. The first step is the same as the assignment step in k-means clustering. This equivalence alone argues for a noise benefit. But the second step differs in the learning increment. So UCL differs from k-means clustering despite their similarity. This difference prevents a direct subsumption of UCL from the EM algorithm. It thus prevents a direct proof of a UCL noise benefit based on the NEM Theorem. We also assume in all simulations that the initial $K$ centroid or quantization vectors equal the first $K$ random pattern samples: $\mu_1(1) = y(1),\ldots,\mu_K(K) = y(K)$. Other initialization schemes could identify the first $K$ quantizing vectors with any $K$ other pattern samples so long as they are random samples.
Setting all initial quantizing vectors to the same value can distort the learning process. All competitive learning simulations used linearly decaying learning coefficients $c_j(t) = 0.3\,(1 - t/1500)$.

Algorithm 4.3: Unsupervised Competitive Learning (UCL) Algorithm
Pick the Winner: The $j$-th neuron wins at $t$ if

$$\|y(t) - \mu_j(t)\| \leq \|y(t) - \mu_k(t)\| \quad\text{for all } k \neq j. \qquad(4.34)$$

Update the Winning Quantization Vector:

$$\mu_j(t+1) = \mu_j(t) + c_t\left[y(t) - \mu_j(t)\right] \qquad(4.35)$$

for a decreasing sequence of learning coefficients $\{c_t\}$.

A similar stochastic difference equation can update the covariance matrix $\Sigma_j$ of the winning quantization vector:

$$\Sigma_j(t+1) = \Sigma_j(t) + c_t\left[\bigl(y(t) - \mu_j(t)\bigr)^T\bigl(y(t) - \mu_j(t)\bigr) - \Sigma_j(t)\right]. \qquad(4.36)$$

A modified version can update the pseudo-covariations of alpha-stable random vectors that have no higher-order moments [158]. The simulations in this chapter do not adapt the covariance matrix. We can rewrite the two UCL steps (4.34) and (4.35) into a single stochastic difference equation. This rewrite requires that the distance-based indicator function $I_{D_j}$ replace the pick-the-winner step (4.34) just as it does for the assign-samples step (4.22) of k-means clustering:

$$\mu_j(t+1) = \mu_j(t) + c_t\, I_{D_j}(y(t))\left[y(t) - \mu_j(t)\right]. \qquad(4.37)$$

The one-equation version of UCL in (4.37) more closely resembles Grossberg's original deterministic differential-equation form of competitive learning in neural modeling [110], [167]:

$$\dot{m}_{ij} = S_j(y_j)\left[S_i(x_i) - m_{ij}\right] \qquad(4.38)$$

where $m_{ij}$ is the synaptic memory trace from the $i$-th neuron in the input field to the $j$-th neuron in the output or competitive field. The $i$-th input neuron has a real-valued activation $x_i$ that feeds into a bounded nonlinear signal function (often a sigmoid) $S_i$. The $j$-th competitive neuron likewise has a real-valued scalar activation $y_j$ that feeds into a bounded nonlinear signal function $S_j$. But competition requires that the output signal function $S_j$ approximate a zero-one decision function. This gives rise to the approximation $S_j \approx I_{D_j}$.
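The two UCL steps (4.34)-(4.35) amount to a few lines of code. The sketch below is a hypothetical helper that uses the linearly decaying learning coefficient $c_t = 0.3(1 - t/1500)$ from the simulations and updates only the winning quantization vector:

```python
import numpy as np

def ucl_update(y_t, mu, t, c0=0.3, t_max=1500):
    """One UCL step: pick the nearest-neighbor winner (4.34) and nudge
    only that quantization vector toward the sample (4.35).
    y_t: (d,) input pattern; mu: (K, d) quantization vectors (modified in place)."""
    c_t = c0 * (1.0 - t / t_max)                        # linearly decaying coefficient
    j = int(np.argmin(((y_t - mu) ** 2).sum(axis=1)))   # winner = nearest fan-in vector
    mu[j] = mu[j] + c_t * (y_t - mu[j])                 # move winner toward the pattern
    return mu, j
```

Losing neurons keep their fan-in vectors unchanged, which is exactly the indicator-gated form (4.37).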
The two-step UCL algorithm is the same as Kohonen's "self-organizing map" algorithm [160], [161] if the self-organizing map updates only a single winner. Both algorithms can update direct or graded subsets of neurons near the winner. These near-neighbor beneficiaries can result from an implied connection topology of competing neurons if the square $K$-by-$K$ connection matrix has a positive diagonal band with other entries negative. Supervised competitive learning (SCL) punishes the winner for misclassifications. This requires a teacher or supervisor who knows the class membership $D_j$ of each input pattern $y$ and who knows the classes that the other synaptic fan-in vectors represent. The SCL algorithm moves the winner's synaptic fan-in vector $\mu_j$ away from the current input pattern $y$ if the pattern $y$ does not belong to the winner's class $D_j$. So the learning increment gets a minus sign rather than the plus sign that UCL would use. This process amounts to inserting a reinforcement function $r$ into the winner's learning increment as follows:

$$\mu_j(t+1) = \mu_j(t) + c_t\, r_j(y)\left[y - \mu_j(t)\right] \qquad(4.39)$$

$$r_j(y) = I_{D_j}(y) - \sum_{i \neq j} I_{D_i}(y). \qquad(4.40)$$

Russian learning theorist Ya Tsypkin appears to be the first to have arrived at the SCL algorithm. He did so in 1973 in the context of an adaptive Bayesian classifier [284]. Differential competitive learning (DCL) is a hybrid learning algorithm [162], [167]. It replaces the win-lose competitive learning term $S_j$ in (4.38) with the rate of winning $\dot{S}_j$. The rate or differential structure comes from the differential Hebbian law [165]:

$$\dot{m}_{ij} = -m_{ij} + \dot{S}_i \dot{S}_j \qquad(4.41)$$

using the above notation for synapses $m_{ij}$ and signal functions $S_i$ and $S_j$. The traditional Hebbian learning law just correlates neuron activations rather than their velocities. The result is the DCL differential equation:

$$\dot{m}_{ij} = \dot{S}_j(y_j)\left[S_i(x_i) - m_{ij}\right]. \qquad(4.42)$$

Then the synapse learns only if the $j$-th competitive neuron changes its win-loss status.
The synapse learns in competitive learning only if the $j$-th neuron itself wins the competition for activation. The time derivative in DCL allows for both positive and negative reinforcement of the learning increment. This polarity resembles the plus-minus reinforcement of SCL even though DCL is a blind or unsupervised learning law. Unsupervised DCL compares favorably with SCL in some simulation tests [162]. We simulate DCL with the following stochastic difference equations:

$$\mu_j(t+1) = \mu_j(t) + c_t\, \Delta S_j(z_j)\left[S(y) - \mu_j(t)\right] \qquad(4.43)$$

$$\mu_i(t+1) = \mu_i(t) \quad\text{if } i \neq j \qquad(4.44)$$

when the $j$-th synaptic vector wins the metrical competition as in UCL. $\Delta S_j(z_j)$ is the time derivative of the $j$-th output neuron activation. We approximate it as the signum function of the time difference of the training sample $z$ [74], [162]:

$$\Delta S_j(z_j) = \mathrm{sgn}\left[z_j(t+1) - z_j(t)\right]. \qquad(4.45)$$

The competitive learning simulations in Figure 4.3 used noisy versions of the competitive learning algorithms just as the clustering simulations used noisy versions. The noise was additive white Gaussian vector noise $n$ with decreasing variance (annealed noise). We added the noise $n$ to the pattern data $y$ to produce the training sample $z$: $z = y + n$ where $n \sim \mathcal{N}(0, \Sigma(t))$. The noise covariance matrix $\Sigma(t)$ was just the scaled identity matrix $(\sigma/t^2)\,I$ for standard deviation or noise level $\sigma > 0$. This allows the scalar $\sigma$ to control the noise intensity for the entire vector learning process. We annealed or decreased the variance as $\Sigma(t) = (\sigma/t^2)\,I$ as in [229], [231]. So the noise vector random sequence $n(1), n(2), \ldots$ is an independent (white) sequence of similarly distributed Gaussian random vectors. We state for completeness the three-step noisy UCL algorithm.

Algorithm 4.4: Noisy UCL Algorithm
Noise Injection: Define

$$z(t) = y(t) + n(t) \qquad(4.46)$$

for $n(t) \sim \mathcal{N}(0, \Sigma(t))$ and annealing schedule $\Sigma(t) = (\sigma/t^2)\,I$.
Pick the Noisy Winner: The $j$-th neuron wins at $t$ if

$$\|z(t) - \mu_j(t)\| \leq \|z(t) - \mu_k(t)\| \quad\text{for all } k \neq j. \qquad(4.47)$$

Update the Winning Quantization Vector:

$$\mu_j(t+1) = \mu_j(t) + c_t\left[z(t) - \mu_j(t)\right] \qquad(4.48)$$

for a decreasing sequence of learning coefficients $\{c_t\}$.

We can define similar noise-perturbed versions of the SCL and DCL algorithms.

Algorithm 4.5: Noisy SCL Algorithm
Noise Injection: Define

$$z(t) = y(t) + n(t) \qquad(4.49)$$

for $n(t) \sim \mathcal{N}(0, \Sigma(t))$ and annealing schedule $\Sigma(t) = (\sigma/t^2)\,I$.
Pick the Noisy Winner: The $j$-th neuron wins at $t$ if

$$\|z(t) - \mu_j(t)\| \leq \|z(t) - \mu_k(t)\| \quad\text{for all } k \neq j. \qquad(4.50)$$

Update the Winning Quantization Vector:

$$\mu_j(t+1) = \mu_j(t) + c_t\, r_j(z)\left[z(t) - \mu_j(t)\right] \qquad(4.51)$$

where

$$r_j(z) = I_{D_j}(z) - \sum_{i \neq j} I_{D_i}(z) \qquad(4.52)$$

and $\{c_t\}$ is a decreasing sequence of learning coefficients.

Algorithm 4.6: Noisy DCL Algorithm
Noise Injection: Define

$$z(t) = y(t) + n(t) \qquad(4.53)$$

for $n(t) \sim \mathcal{N}(0, \Sigma(t))$ and annealing schedule $\Sigma(t) = (\sigma/t^2)\,I$.
Pick the Noisy Winner: The $j$-th neuron wins at $t$ if

$$\|z(t) - \mu_j(t)\| \leq \|z(t) - \mu_k(t)\| \quad\text{for all } k \neq j. \qquad(4.54)$$

Update the Winning Quantization Vector:

$$\mu_j(t+1) = \mu_j(t) + c_t\, \Delta S_j(z_j)\left[S(x) - \mu_j(t)\right] \qquad(4.55)$$

where

$$\Delta S_j(z_j) = \mathrm{sgn}\left[z_j(t+1) - z_j(t)\right] \qquad(4.56)$$

and $\{c_t\}$ is a decreasing sequence of learning coefficients.

Figure 4.3 shows that noise injection sped up UCL convergence by about 25%. The simulation used the four-cluster Gaussian data model shown in the inset of Figure 4.3. Noise perturbation gave the biggest relative improvement for the UCL algorithm in most of our simulations.

4.4 Conclusion

Noise can speed convergence in expectation-maximization clustering and in some types of competitive learning algorithms under fairly general conditions. This suggests that noise should improve the performance of other clustering and competitive-learning algorithms. An open research question is whether the Noisy Expectation-Maximization Theorem or some other mathematical result can guarantee the observed noise benefits for competitive learning and for similar clustering algorithms.
Figure 4.3: Noise benefit in the convergence time of Unsupervised Competitive Learning (UCL). [Plot: convergence time versus Gaussian noise level $\sigma$, with optimum near $\sigma^* = 0.6$.] The inset shows the four Gaussian data clusters with the same covariance matrix. The convergence time is the number of learning iterations before the synaptic weights stayed within 25% of the final converged synaptic weights. The dashed horizontal line shows the convergence time for UCL without additive noise. The figure shows that a small amount of noise can reduce convergence time by about 25%. The procedure adapts to noisy samples from a Gaussian mixture of four sub-populations. The sub-populations have centroids on the vertices of a rotated square of side-length 24 centered at the origin as the inset figure shows. The additive noise is zero-mean Gaussian.

Future work may also extend EM noise benefits to co-clustering methods [73], [119], [148], [207] (also known as biclustering or two-mode clustering). Co-clustering methods cluster both data samples and data features simultaneously. For example, a co-clustering method for a movie recommendation system would simultaneously cluster the users and the movies in the database. An important model for co-clustering algorithms is Probabilistic Latent Semantic Analysis (PLSA) [132]-[134], used in collaborative filtering [66], [134] and document analysis [132]. PLSA makes use of an EM algorithm to train a hierarchical multinomial model. EM noise benefits may apply here. EM noise benefits may also apply to other co-clustering methods based on generalized k-means or generalized mixture-model approaches.

Chapter 5
NEM Application: Baum-Welch Algorithm for Training Hidden Markov Models

This chapter shows that careful noise injection can speed up iterative ML parameter estimation for hidden Markov models (HMMs).
The proper noise appears to help the training process explore less probable regions of the parameter space. This new noisy HMM (NHMM) [7] is a special case of NEM [229], [231]. The NEM algorithm gives rise to the NHMM because the Baum-Welch algorithm [14]-[16] that trains the HMM parameters is itself a special case of the EM algorithm [298]. The NEM theorem gives a sufficient positivity condition for an average noise boost in an EM algorithm. The positivity condition (5.15) below states the corresponding sufficient condition for a noise boost in the Baum-Welch algorithm when the HMM uses a Gaussian mixture model at each state. Figure 5.1 describes the NHMM architecture based on the NEM algorithm. Simulations show that a noisy HMM converges faster than a noiseless HMM on the TIMIT data set. Figure 5.2 shows that noise produces a 37% reduction in the number of iterations that it takes to converge to the maximum-likelihood estimate. The simulations below confirm the theoretical prediction that proper injection of noise can improve speech recognition. This appears to be the first deliberate use of noise injection in the speech data itself. Earlier efforts [81], [181] used annealed noise to perturb the model parameters and to pick an alignment path between HMM states and the observed speech data. These earlier efforts neither added noise to the speech data nor found any theoretical guarantee of a noise benefit. The next section (§5.1) reviews HMMs and the Baum-Welch algorithm that tunes them. The section shows that the Baum-Welch algorithm is an EM algorithm as Welch [298] pointed out. §5.2 presents the sufficient condition for a noise boost in HMMs. §5.3 tests the new NHMM algorithm for training monophone models on the TIMIT corpus.

This chapter features work done in collaboration with Kartik Audhkhasi and first published in [7].
Figure 5.1: The training process of the NHMM. The NHMM algorithm adds annealed noise to the observations during the M-step in the EM algorithm if the noise satisfies the NEM positivity condition. This noise changes the GMM covariance estimate in the M-step.

5.1 Hidden Markov Models

An HMM [246] is a popular probabilistic latent-variable model for multivariate time series data. Its many applications include speech recognition [189], [246], [301], computational biology [80], [156], [181], computer vision [37], [310], wavelet-based signal processing [63], and control theory [87]. HMMs are especially widespread in speech processing and recognition. All popular speech recognition toolkits use HMMs: Hidden Markov Model Toolkit (HTK) [311], Sphinx [291], SONIC [241], RASR [260], Kaldi [242], Attila [272], BYBLOS [55], and Watson [107]. An HMM consists of a time-homogeneous Markov chain with $M$ states and a single-step transition matrix $A$. Let $S: \mathbb{Z}^+ \to \mathbb{Z}_M$ denote a function that maps time to state indices. Then

$$A_{i,j} = P\left[S(t+1) = j \mid S(t) = i\right] \qquad(5.1)$$

for all $t \in \mathbb{Z}^+$ and all $i, j \in \mathbb{Z}_M$. Each state contains a probability density function (pdf) of the multivariate observations. A GMM is a common choice for this purpose [245]. The pdf $f_i$ of an observation $\mathbf{o} \in \mathbb{R}^D$ at state $i$ is

$$f_i(\mathbf{o}) = \sum_{k=1}^{K} w_{i,k}\, \mathcal{N}(\mathbf{o};\, \mu_{i,k},\, \Sigma_{i,k}) \qquad(5.2)$$

where $w_{i,1},\ldots,w_{i,K}$ are convex coefficients and $\mathcal{N}(\mathbf{o};\, \mu_{i,k},\, \Sigma_{i,k})$ denotes a multivariate Gaussian pdf with population mean $\mu_{i,k}$ and covariance matrix $\Sigma_{i,k}$.

5.1.1 The Baum-Welch Algorithm for HMM Parameter Estimation

The Baum-Welch algorithm [16] is an EM algorithm for tuning HMM parameters. Let $O = (\mathbf{o}_1,\ldots,\mathbf{o}_T)$ denote a multivariate time series of length $T$. Let $S = (S(1),\ldots,S(T))$ and $Z = (Z(1),\ldots,Z(T))$ be the respective latent state and Gaussian index sequences. Then the ML estimate of the HMM parameters $\Theta$ is

$$\Theta^* = \arg\max_{\Theta}\, \log \sum_{S,Z} P\left[O, S, Z \mid \Theta\right]. \qquad(5.3)$$

The sum over latent variables makes it difficult to directly maximize the objective function (5.3).
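For concreteness, the state-conditional GMM emission density (5.2) can be computed as below. This is a sketch that assumes diagonal covariance matrices (as is common in speech HMMs); the function name is illustrative and not from this dissertation:

```python
import numpy as np

def state_pdf(o, w, mu, var):
    """GMM emission density f_i(o) from eq. (5.2), diagonal-covariance sketch.
    o: (D,) observation; w: (K,) mixture weights summing to 1;
    mu, var: (K, D) per-component means and diagonal variances."""
    # Log of the exponent term for each component.
    z = -0.5 * ((o - mu) ** 2 / var).sum(axis=1)              # (K,)
    # Normalizing constant of each D-dimensional diagonal Gaussian.
    norm = np.sqrt((2 * np.pi) ** o.shape[0] * var.prod(axis=1))
    return float((w * np.exp(z) / norm).sum())
```

A full-covariance version would replace the diagonal quadratic form with a Mahalanobis distance and the variance product with a determinant.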
EM uses Jensen's inequality [35] and the concavity of the logarithm to obtain the following lower bound on the observed-data log-likelihood $\log P[O \mid \Theta]$ at the current parameter estimate $\Theta^{(n)}$:

$$\log P\left[O \mid \Theta\right] \geq \mathbb{E}_{P[S,Z\mid O,\Theta^{(n)}]}\left\{\log P\left[O, S, Z \mid \Theta\right]\right\} = Q(\Theta \mid \Theta^{(n)}). \qquad(5.4)$$

The complete-data log-likelihood for an HMM factors as

$$\log P\left[O, S, Z \mid \Theta\right] = \sum_{i=1}^{M} I(S(1)=i) \log p_i(1) + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{k=1}^{K} I(S(t)=i,\, Z(t)=k)\left\{\log w_{i,k} + \log \mathcal{N}(\mathbf{o}_t \mid \mu_{i,k}, \Sigma_{i,k})\right\} + \sum_{t=1}^{T-1}\sum_{i=1}^{M}\sum_{j=1}^{M} I(S(t+1)=j,\, S(t)=i) \log A_{i,j} \qquad(5.5)$$

where $I(\cdot)$ is an indicator function and $p_i(1) = P[S(1)=i]$. The Q-function requires computing the following sets of variables:

$$\gamma_i^{(n)}(1) = P\left[S(1)=i \mid O, \Theta^{(n)}\right] \qquad(5.6)$$

$$\gamma_{i,k}^{(n)}(t) = P\left[S(t)=i,\, Z(t)=k \mid O, \Theta^{(n)}\right] \qquad(5.7)$$

$$\xi_{i,j}^{(n)}(t) = P\left[S(t+1)=j,\, S(t)=i \mid O, \Theta^{(n)}\right] \qquad(5.8)$$

for all $t \in \{1,\ldots,T\}$, $i,j \in \{1,\ldots,M\}$, and $k \in \{1,\ldots,K\}$. The Forward-Backward algorithm is a dynamic programming approach that efficiently computes these variables [246]. The resulting Q-function is

$$Q(\Theta \mid \Theta^{(n)}) = \sum_{i=1}^{M} \gamma_i^{(n)}(1) \log p_i(1) + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{k=1}^{K} \gamma_{i,k}^{(n)}(t)\left\{\log w_{i,k} + \log \mathcal{N}(\mathbf{o}_t \mid \mu_{i,k}, \Sigma_{i,k})\right\} + \sum_{t=1}^{T-1}\sum_{i=1}^{M}\sum_{j=1}^{M} \xi_{i,j}^{(n)}(t) \log A_{i,j}. \qquad(5.9)$$

Maximizing the auxiliary function $Q(\Theta \mid \Theta^{(n)})$ with respect to the parameters subject to sum-to-one constraints leads to the re-estimation equations for the M-step at iteration $n$:

$$p_i^{(n)}(1) = \gamma_i^{(n)}(1) \qquad(5.10)$$

$$A_{i,j}^{(n)} = \frac{\sum_{t=1}^{T-1} \xi_{i,j}^{(n)}(t)}{\sum_{t=1}^{T-1} \gamma_i^{(n)}(t)} \qquad(5.11)$$

$$w_{i,k}^{(n)} = \frac{\sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)}{\sum_{t=1}^{T} \gamma_i^{(n)}(t)} \qquad(5.12)$$

$$\mu_{i,k}^{(n)} = \frac{\sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)} \qquad(5.13)$$

$$\Sigma_{i,k}^{(n)} = \frac{\sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)\,\bigl(\mathbf{o}_t - \mu_{i,k}^{(n)}\bigr)\bigl(\mathbf{o}_t - \mu_{i,k}^{(n)}\bigr)^T}{\sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)}. \qquad(5.14)$$

The next section restates the NEM theorem and algorithm in the HMM context.

5.2 NEM for HMMs: The Noise-Enhanced HMM (NHMM)

A restatement of the NEM condition [229], [231] in the HMM context requires the following definitions. The noise random variable $\mathbf{N}$ has pdf $f(\mathbf{n} \mid \mathbf{o})$.
So the noise $\mathbf{N}$ can depend on the observed data $O$. $L$ denotes the latent variables in the model. $\{\Theta^{(n)}\}$ is a sequence of EM estimates for $\Theta$, and $\Theta^*$ is the converged EM estimate for $\Theta$: $\Theta^* = \lim_{n\to\infty} \Theta^{(n)}$. Define the noisy Q-function $Q_N(\Theta \mid \Theta^{(n)}) = \mathbb{E}_{L\mid O,\Theta^{(n)}}\left[\ln f(\mathbf{o} + \mathbf{N}, L \mid \Theta)\right]$. Then the NEM positivity condition for the Baum-Welch training algorithm is:

$$\mathbb{E}_{O,L,\mathbf{N}\mid\Theta^*}\left[\ln \frac{f(O + \mathbf{N}, L \mid \Theta^{(n)})}{f(O, L \mid \Theta^{(n)})}\right] \geq 0. \qquad(5.15)$$

The HMM uses GMMs at each state. So the HMM satisfies the NEM condition when we enforce the simplified GMM-NEM positivity condition for each GMM. The HMM-EM positivity condition (5.15) holds when the additive noise sample $\mathbf{N} = (N_1,\ldots,N_D)$ for each observation vector $\mathbf{o} = (o_1,\ldots,o_D)$ satisfies the following quadratic constraint:

$$N_d\left[N_d - 2\left(\mu_{i,k,d} - o_d\right)\right] \leq 0 \quad\text{for all } k. \qquad(5.16)$$

The state sequence $S$ and the Gaussian index $Z$ are the latent variables $L$ for an HMM. The noisy Q-function for the NHMM is

$$Q_N(\Theta \mid \Theta^{(n)}) = \sum_{i=1}^{M} \gamma_i^{(n)}(1) \log p_i(1) + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{k=1}^{K} \gamma_{i,k}^{(n)}(t)\left\{\log w_{i,k} + \log \mathcal{N}(\mathbf{o}_t + \mathbf{n}_t \mid \mu_{i,k}, \Sigma_{i,k})\right\} + \sum_{t=1}^{T-1}\sum_{i=1}^{M}\sum_{j=1}^{M} \xi_{i,j}^{(n)}(t) \log A_{i,j} \qquad(5.17)$$

where $\mathbf{n}_t \in \mathbb{R}^D$ is the noise vector for the observation $\mathbf{o}_t$. Then the $d$-th element $n_{t,d}$ of this noise vector satisfies the following positivity constraint:

$$n_{t,d}\left[n_{t,d} - 2\left(\mu_{i,k,d}^{(n-1)} - o_{t,d}\right)\right] \leq 0 \quad\text{for all } k \qquad(5.18)$$

where $\mu_{i,k}^{(n-1)}$ is the mean estimate at iteration $n-1$. Maximizing the noisy Q-function (5.17) gives the update equations for the M-step. Only the GMM mean and covariance update equations differ from the noiseless EM because the noise enters the noisy Q-function (5.17) only through the Gaussian pdf. But the NEM algorithm requires modifying only the covariance update equation (5.14) because it uses the noiseless mean estimates (5.13) to check the positivity condition (5.18).
Then the NEM covariance estimate is

$$\Sigma_{i,k}^{(n)} = \frac{\sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)\,\bigl(\mathbf{o}_t + \mathbf{n}_t - \mu_{i,k}^{(n)}\bigr)\bigl(\mathbf{o}_t + \mathbf{n}_t - \mu_{i,k}^{(n)}\bigr)^T}{\sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)}. \qquad(5.19)$$

Figure 5.2: NHMM training converges in fewer iterations than regular HMM training. The bar graph shows the percent reduction in the number of Baum-Welch iterations with respect to the HMM log-likelihood at iterations 10, 20, and 30. Noise significantly reduces the number of iterations for 8-, 16-, and 32-component GMMs. Noise also produces a greater reduction for iterations 20 and 30 due to the compounding effect of the log-likelihood improvement for the NHMM at each iteration. Noise produces only a marginal reduction for the 4-component GMM case at 10 iterations and no improvement for 20 and 30 iterations. This pattern of decreasing noise benefits comports with the data sparsity analysis in [231]: the probability of satisfying the NEM sufficient condition increases with fewer data samples for ML estimation.

Algorithm 5.1: Noise Injection Algorithm for Training NHMMs
Initialize parameters: $\Theta^{(1)} \leftarrow \Theta_{init}$
for $n = 1 \to n_{max}$ do
  Function E-Step($O, \Theta^{(n)}$):
    for $t = 1 \to T$, $i, j = 1 \to M$, and $k = 1 \to K$ do
      $\gamma_i^{(n)}(1) \leftarrow P[S(1)=i \mid O, \Theta^{(n)}]$
      $\gamma_{i,k}^{(n)}(t) \leftarrow P[S(t)=i,\, Z(t)=k \mid O, \Theta^{(n)}]$
      $\xi_{i,j}^{(n)}(t) \leftarrow P[S(t+1)=j,\, S(t)=i \mid O, \Theta^{(n)}]$
  Function M-Step($O, \gamma, \xi$):
    for $i, j = 1 \to M$ and $k = 1 \to K$ do
      $p_i^{(n)}(1) \leftarrow \gamma_i^{(n)}(1)$
      $A_{i,j}^{(n)} \leftarrow \sum_{t=1}^{T-1} \xi_{i,j}^{(n)}(t) \big/ \sum_{t=1}^{T-1} \gamma_i^{(n)}(t)$
      $w_{i,k}^{(n)} \leftarrow \sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t) \big/ \sum_{t=1}^{T} \gamma_i^{(n)}(t)$
      $\mu_{i,k}^{(n)} \leftarrow \sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)\, \mathbf{o}_t \big/ \sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)$
      $\mathbf{n}_t \leftarrow$ GenerateNoise($\mu_{i,k}^{(n)}, \mathbf{o}_t, n^{-\tau}\sigma_N^2$)
      $\Sigma_{i,k}^{(n)} \leftarrow \sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)\,(\mathbf{o}_t + \mathbf{n}_t - \mu_{i,k}^{(n)})(\mathbf{o}_t + \mathbf{n}_t - \mu_{i,k}^{(n)})^T \big/ \sum_{t=1}^{T} \gamma_{i,k}^{(n)}(t)$
  Function GenerateNoise($\mu_{i,k}^{(n)}, \mathbf{o}_t, \sigma^2$):
    $\mathbf{n}_t \sim \mathcal{N}(0, \sigma^2)$
    for $d = 1 \to D$ do
      if $n_{t,d}\left[n_{t,d} - 2(\mu_{i,k,d}^{(n-1)} - o_{t,d})\right] > 0$ for some $k$ then $n_{t,d} \leftarrow 0$
    return $\mathbf{n}_t$

5.3 Simulation Results

We modified the Hidden Markov Model Toolkit (HTK) [311] to train the NHMM.
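The GenerateNoise step of Algorithm 5.1 reduces to drawing Gaussian noise and zeroing any dimension that violates the constraint (5.18) for some mixture component. A standalone sketch (a hypothetical helper, not the actual HTK modification):

```python
import numpy as np

def generate_nem_noise(o_t, means, sigma2, rng):
    """GenerateNoise sketch in the spirit of Algorithm 5.1: draw Gaussian noise
    for one observation and zero any dimension that violates condition (5.18)
    for some mixture component.
    o_t: (D,) observation; means: (K, D) current GMM means for the state;
    sigma2: annealed noise variance; rng: numpy Generator."""
    n_t = rng.normal(0.0, np.sqrt(sigma2), size=o_t.shape)
    # Violation if n_{t,d} * (n_{t,d} - 2*(mu_{i,k,d} - o_{t,d})) > 0 for some k.
    viol = np.any(
        n_t[None, :] * (n_t[None, :] - 2.0 * (means - o_t[None, :])) > 0,
        axis=0,
    )
    n_t[viol] = 0.0    # zeroed noise trivially satisfies the condition
    return n_t
```

The returned vector always satisfies (5.18) since zero noise satisfies the quadratic condition with equality.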
HTK provides a tool called "HERest" that performs embedded Baum-Welch training for an HMM. This tool first creates a large HMM for each training speech utterance. It concatenates the HMMs for the sub-word units. The Baum-Welch algorithm tunes the parameters of this large HMM. The NHMM algorithm used (5.19) to modify covariance matrices in HERest. We sampled from a suitably truncated Gaussian pdf to produce noise that satisfied the NEM positivity condition (5.18). We used noise variances in $\{0.001, 0.01, 0.1, 1\}$. A deterministic annealing factor $n^{-\tau}$ scaled the noise variance at iteration $n$. The noise decay rate was $\tau > 0$. We used $\tau \in \{1,\ldots,10\}$. We then added the noise vector to the observations during the update of the covariance matrices (5.19). The simulations used the TIMIT speech dataset [101] with the standard setup in [114]. We parameterized the speech signal with 12 Mel-Frequency Cepstral Coefficients (MFCCs) computed over 20-msec Hamming windows with a 10-msec shift. We also appended the first- and second-order finite differences of the MFCC vector along with the energies of all three vectors. We used 3-state left-to-right HMMs to model each phoneme with a $K$-component GMM at each state. We varied $K$ over $\{1, 4, 8, 16, 32\}$ for the experiments and used two performance metrics to compare the NHMM with the HMM. The first metric was the percent reduction in EM iterations for the NHMM to achieve the same per-frame log-likelihood as the noiseless HMM at iterations 10, 20, and 30. The second metric was the median improvement in per-frame log-likelihood over 30 training iterations. Figure 5.2 shows the percent reduction in the number of training iterations for the NHMM compared to the HMM log-likelihood at iterations 10, 20, and 30. Noise substantially reduced the number of iterations for 16- and 32-component GMMs. But it only marginally improved the other cases.
This holds because the noise is more likely to satisfy the NEM positivity condition when the number of data samples is small relative to the number of parameters [231].

5.4 Conclusion

Careful addition of noise can speed the average convergence of iterative ML estimation for HMMs. The NEM theorem gives a sufficient condition for generating such noise. This condition reduces to a simple quadratic constraint in the case of HMMs with a GMM at each state. Experiments on the TIMIT data set show a significant improvement in per-frame log-likelihood and in time to convergence for the NHMM as compared with the HMM. Future work should develop algorithms to find the optimal noise variance and annealing decay factor. It should also explore noise benefits at other stages of EM training in an HMM.

Chapter 6

NEM Application: Backpropagation for Training Feedforward Neural Networks

An artificial neural network is a biomimetic mathematical model consisting of a network of artificial neurons. Artificial neurons are simplified models of biological neurons. These neurons transform input signals using an activation function (usually a sigmoid or squashing function). The connecting edges of the neural network (NN) simulate biological synapses between biological neurons. They amplify or inhibit signal transmission between neurons. Figure 6.1 is an example of an artificial neural network with three layers or fields of neurons: an input, hidden, and an output layer. It is an example of a multilayer feedforward neural network. The term "feedforward" refers to the absence of backward or self connections (or feedback) between neurons. Feedforward neural networks are a popular computational model for many pattern recognition and signal processing problems. Their applications include speech recognition [65], [211]-[214], [262], [267], machine translation of text [72], audio processing [116], artificial intelligence [22], computer vision [56], [217], [278], and medicine [143].
Neural networks learn to respond to input stimuli in the same way the brain learns to process sensory stimuli. Learning occurs in both artificial and biological neural networks when the network parameters change [167]. We train neural networks via a sequence of adaptations until the network produces acceptable responses to input stimuli. (This chapter features work done in collaboration with Kartik Audhkhasi and first published in [6].)

[Figure 6.1: a three-layer feedforward network with input field F_X, hidden field F_H, and output field F_Y.]

Figure 6.1: A Feedforward Neural Network with three layers of neurons. The nodes represent artificial neurons. The edges represent synaptic connections between neurons. The backpropagation algorithm trains the network by tuning the strength of these synapses. Feedforward neural networks have no feedback or recurrent synaptic connections.

The process of learning the correct input-output response is the same as learning to approximate an arbitrary function. Halbert White [139], [140] proved that multilayer feedforward networks can approximate any Borel-measurable function to arbitrary accuracy.

6.1 Backpropagation Algorithm for NN Training

Backpropagation (BP) [259], [299], [300] is the standard supervised training method for multilayer feedforward neural networks [167]. The goal of the BP algorithm is to train a feedforward network (FF-NN) to approximate an input-output mapping by learning from multiple examples of that mapping. The algorithm adapts the neural network by tuning the hidden synaptic weights of the FF-NN to minimize an error function over the example data set. The error function is an aggregated cost function for approximation errors over the training set. The BP algorithm uses gradient descent to minimize the error function. Feedforward neural networks and other "connectionist" learning architectures act
This lack of explanatory power complicates the training process. How does a training algorithm determine how much each unit in the network contributes to overall network failure (or success)? How does a training algorithm fairly distribute blame for errors among the many hidden units in the network? This is a key problem that plagues multi-layered connectionist learning architectures. Minsky [167], [206] called this problem the credit assignment problem. Backpropagation's solution to the credit assignment problem is to propagate the global error signal backwards to each local hidden unit via recursive applications of the chain rule. The application of the chain rule gives a measure the rate of change of the global error with respect to changes in each hidden network parameter. Thus the training algorithm can tune hidden local parameters to perform gradient descent on the global error function. 6.1.1 Summary of NEM Results for Backpropagation The main nding of this chapter is that the backpropagation algorithm for training feed- forward neural networks is a special case of the Generalized Expectation-Maximization algorithm (Theorem 6.1) [6]. This subsumption is consistent with the EM algorithm's missing information theme. BP estimates hidden parameters to match observed data. While EM estimates hidden variables or functions thereof to produce a high-likelihood ts for observed data. The EM subsumption result means that noise can speed up the convergence of the BP algorithm according the NEM theorem. The NEM positivity condition provides the template for sucient conditions under which injecting noise into training data speeds up BP training time of feedforward neural networks. Figures 6.2 and 6.5 show that the backpropagation error function falls faster with NEM noise than without NEM noise. Thus NEM noise injection leads to faster backpropagation convergence time. 
Matsuoka [198] and Bishop [30] hypothesized that injecting noise into the input field may act as a regularizer during BP training and improve generalization performance. G. An [5] re-examined Matsuoka's and Bishop's claims with a stochastic gradient descent analysis [33]. He showed that the regularization claim is invalid. But input and synaptic weight noise may indeed improve the network's generalization performance.

Figure 6.2: This figure shows the training-set squared error for backpropagation and NEM-backpropagation (NEM-BP) training of an auto-encoder neural network on the MNIST digit classification data set. There is a 5.3% median decrease in the squared error per iteration for NEM-BP when compared with backpropagation training. We added annealed independent and identically-distributed (i.i.d.) Gaussian noise to the target variables. The noise had mean a^t - t and a variance that decayed with the training epochs as \{0.1, 0.1/2, 0.1/3, \ldots\} where a^t is the vector of activations of the output layer and t is the vector of target values. The network used three logistic (sigmoidal) hidden layers with 20 neurons each. The output layer used 784 logistic neurons.

These previous works used white noise and focused on generalization performance. We instead add non-white noise using a sufficient condition that depends on the neural network parameters and output activations. And our goal is to reduce training time. The next section (§6.2) reviews the details of the backpropagation algorithm and recasts the algorithm as an MLE method. §6.3 presents the backpropagation algorithm as an EM algorithm for neural network training. §6.4 discusses the NEM sufficient conditions in the backpropagation context. §6.5 shows simulation results comparing BP to NEM-BP for NN training.

6.2 Backpropagation as Maximum Likelihood Estimation

Backpropagation performs ML estimation of a neural network's parameters. We use a 3-layer neural network for notational convenience.
The results in this chapter extend to deeper networks with more hidden layers. x is the vector of neuron values at the input layer of I neurons. a^h is the vector of hidden-neuron sigmoidal activations whose j-th element is

a_j^h = \frac{1}{1 + \exp\left(-\sum_{i=1}^{I} w_{ji} x_i\right)} = \sigma\left(\sum_{i=1}^{I} w_{ji} x_i\right), (6.1)

where w_{ji} is the weight of the link connecting the i-th visible and j-th hidden neuron. y represents the K-valued target variable and t is its 1-in-K encoding. t_k is the k-th output neuron's value with activation

a_k^t = \frac{\exp\left(\sum_{j=1}^{J} u_{kj} a_j^h\right)}{\sum_{k_1=1}^{K} \exp\left(\sum_{j=1}^{J} u_{k_1 j} a_j^h\right)} (6.2)
     = p(y = k \mid x, \Theta), (6.3)

where u_{kj} is the weight of the link connecting the j-th hidden and k-th target neuron. a_k^t depends on the input x and the parameter matrices U and W. Backpropagation minimizes the following cross entropy:

E = -\sum_{k=1}^{K} t_k \ln(a_k^t). (6.4)

The cross entropy equals the negative conditional log-likelihood of the targets given the inputs because

E = -\ln\Big[\prod_{k=1}^{K} (a_k^t)^{t_k}\Big] (6.5)
  = -\ln\Big[\prod_{k=1}^{K} p(y = k \mid x, \Theta)^{t_k}\Big] (6.6)
  = -\ln p(y \mid x, \Theta) = -L. (6.7)

Backpropagation updates the network parameters using gradient ascent to maximize the log-likelihood \ln p(y \mid x, \Theta). The partial derivative of this log-likelihood with respect to u_{kj} is

\frac{\partial L}{\partial u_{kj}} = (t_k - a_k^t)\, a_j^h, (6.8)

and with respect to w_{ji} is

\frac{\partial L}{\partial w_{ji}} = a_j^h (1 - a_j^h)\, x_i \sum_{k=1}^{K} (t_k - a_k^t)\, u_{kj}. (6.9)

(6.8) and (6.9) give the partial derivatives needed to perform gradient ascent on the log-likelihood L. Original formulations of backpropagation [259], [299], [300] used the squared error between the observed and target network output as the error function. This setup is sometimes called regression since it is equivalent to generalized least-squares regression. Hidden layers in this configuration correspond to higher-order interactions in the regression. Later theoretical developments [137], [271], [285] in backpropagation showed that the cross-entropy error function leads to better convergence properties.
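A minimal NumPy sketch of the forward pass (6.1)-(6.2) and the cross-entropy error (6.4) for a tiny 3-layer network; the weight matrices and target below are random toy values, not trained parameters:

```python
import numpy as np

def forward(x, W, U):
    """Forward pass of the 3-layer network: logistic hidden activations per
    (6.1) and Gibbs (softmax) output activations per (6.2)."""
    a_h = 1.0 / (1.0 + np.exp(-W @ x))   # hidden sigmoids, (6.1)
    z = U @ a_h
    e = np.exp(z - z.max())              # numerically stable softmax
    a_t = e / e.sum()                    # output activations, (6.2)
    return a_h, a_t

def cross_entropy(t, a_t):
    """Cross-entropy error E = -sum_k t_k ln(a_t_k) of (6.4)."""
    return -np.sum(t * np.log(a_t))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))              # hidden weights w_ji (toy)
U = rng.normal(size=(2, 3))              # output weights u_kj (toy)
x = rng.normal(size=4)                   # one input pattern
t = np.array([1.0, 0.0])                 # 1-in-K target encoding
a_h, a_t = forward(x, W, U)
```

The softmax outputs sum to one, so a_t is a valid conditional pmf p(y = k | x, \Theta) as (6.3) requires.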
The linear activation function

a_k^t = \sum_{j=1}^{J} u_{kj} a_j^h (6.10)

often replaces the Gibbs function at the output layer for a regression NN. The target values t of the output neuron layer are free to assume any real values for regression. Backpropagation then minimizes the following squared-error function:

E = \frac{1}{2} \sum_{k=1}^{K} (t_k - a_k^t)^2. (6.11)

We assume that the estimation error e = t - a^t is Gaussian with mean 0 and identity covariance matrix I in keeping with standard assumptions for least-squares regression. Least-squares regression estimates optimal regression parameters U and W that minimize the least-squares error. These least-squares estimates are the maximum likelihood estimates under the Gaussian assumption. So backpropagation also maximizes the following log-likelihood function:

L = \log p(t \mid x, \Theta) = \log N(t; a^t, I) (6.12)

for

N(t; a^t, I) = \frac{1}{(2\pi)^{d/2}} \exp\Big\{-\frac{1}{2} \sum_{k=1}^{K} (t_k - a_k^t)^2\Big\}. (6.13)

And thus the gradient partial derivatives of this log-likelihood function are the same as those for the K-class classification case in (6.8) and (6.9).

6.3 Backpropagation as an EM Algorithm

Both backpropagation and the EM algorithm seek ML estimates of a neural network's parameters. The next theorem shows that backpropagation is a generalized EM algorithm.

Theorem 6.1. [Backpropagation is a GEM Algorithm]: The backpropagation update equation for a differentiable likelihood function p(y \mid x, \Theta) at epoch n

\Theta^{n+1} = \Theta^n + \eta \nabla_\Theta \ln p(y \mid x, \Theta) \big|_{\Theta = \Theta^n} (6.14)

equals the GEM update equation at epoch n

\Theta^{n+1} = \Theta^n + \eta \nabla_\Theta Q(\Theta \mid \Theta^n) \big|_{\Theta = \Theta^n}, (6.15)

where the Q-function used in GEM is

Q(\Theta \mid \Theta^n) = E_{p(h \mid x, y, \Theta^n)}\big\{\ln p(y, h \mid x, \Theta)\big\}. (6.16)

Proof. We know that [31], [222]

\log p(y \mid x, \Theta) = Q(\Theta \mid \Theta^n) + H(\Theta \mid \Theta^n) (6.17)

if H(\Theta \mid \Theta^n) is the following cross entropy [61]:

H(\Theta \mid \Theta^n) = -\int \ln p(h \mid x, y, \Theta) \, dp(h \mid x, y, \Theta^n). (6.18)

Hence

H(\Theta \mid \Theta^n) = \log p(y \mid x, \Theta) - Q(\Theta \mid \Theta^n). (6.19)

Now expand the relative entropy:

D_{KL}(\Theta^n \,\|\, \Theta) = \int \ln\left(\frac{p(h \mid x, y, \Theta^n)}{p(h \mid x, y, \Theta)}\right) dp(h \mid x, y, \Theta^n) (6.20)
= \int \ln p(h \mid x, y, \Theta^n) \, dp(h \mid x, y, \Theta^n) - \int \ln p(h \mid x, y, \Theta) \, dp(h \mid x, y, \Theta^n) (6.21)
= -H(\Theta^n \mid \Theta^n) + H(\Theta \mid \Theta^n). (6.22)

So H(\Theta \mid \Theta^n) \geq H(\Theta^n \mid \Theta^n) for all \Theta because D_{KL}(\Theta^n \,\|\, \Theta) \geq 0. Thus \Theta^n minimizes H(\Theta \mid \Theta^n) and hence \nabla_\Theta H(\Theta \mid \Theta^n) = 0 at \Theta = \Theta^n. Putting this in (6.19) gives

\nabla_\Theta \log p(y \mid x, \Theta) \big|_{\Theta = \Theta^n} = \nabla_\Theta Q(\Theta \mid \Theta^n) \big|_{\Theta = \Theta^n}. (6.23)

Hence the backpropagation and GEM update equations are identical.

This makes sense because backpropagation is a greedy algorithm for optimizing the likelihood function as the last section showed. The basic intuition is that any greedy algorithm for likelihood-function optimization is a generalized EM algorithm if there is hidden or incomplete data involved in the estimation. The GEM algorithm involves a probabilistic description of the hidden-layer neurons. We assume that the hidden-layer neurons are Bernoulli random variables. Their activation is thus the following conditional probability:

a_j^h = p(h_j = 1 \mid x, \Theta). (6.24)

We can now formulate an EM algorithm for ML estimation of a feedforward neural network's parameters. The E-step computes the Q-function in (6.16). Computing the expectation in (6.16) requires 2^J values of p(h \mid x, y, \Theta^n). This is expensive for large values of J. So we resort to Monte Carlo sampling to approximate the Q-function. The strong law of large numbers ensures that this Monte Carlo approximation converges almost surely to the true Q-function. p(h \mid x, y, \Theta^n) becomes the following using Bayes theorem:

p(h \mid x, y, \Theta^n) = \frac{p(h \mid x, \Theta^n)\, p(y \mid h, \Theta^n)}{\sum_h p(h \mid x, \Theta^n)\, p(y \mid h, \Theta^n)}. (6.25)

p(h \mid x, \Theta^n) is easier to sample from because the h_j are independent given x. We replace p(h \mid x, \Theta^n) by its Monte Carlo approximation using M independent and identically-distributed (IID) samples:

p(h \mid x, \Theta^n) \approx \frac{1}{M} \sum_{m=1}^{M} \delta_K(h - h^m), (6.26)

where \delta_K is the J-dimensional Kronecker delta function.
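A toy numerical sketch of this Monte Carlo step and of the importance weights \gamma_m of (6.30) and the weighted Q-sum of (6.33) that follow; the activation values and per-sample likelihoods below are made-up stand-ins, not values from a trained network:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical hidden-layer activations a_h_j = p(h_j = 1 | x) for J = 3 units.
a_h = np.array([0.9, 0.2, 0.6])
M = 5                                           # Monte Carlo samples
H = (rng.random((M, 3)) < a_h).astype(float)    # h_m ~ p(h | x), as in (6.26)

# Hypothetical per-sample likelihoods p(y | h_m, Theta).
lik = np.array([0.3, 0.1, 0.4, 0.05, 0.15])
gamma = lik / lik.sum()                         # importance weights, (6.30)

# Importance-sampled Q-function (6.33): ln p(y, h_m | x, Theta) splits into
# the Bernoulli hidden-layer term (6.34) plus ln p(y | h_m, Theta).
log_p_h = (H * np.log(a_h) + (1 - H) * np.log(1 - a_h)).sum(axis=1)
Q = np.sum(gamma * (log_p_h + np.log(lik)))
```

Each sample's weight is proportional to how well it explains the observed target, so samples that the output layer finds likely dominate the Q-sum.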
The Monte Carlo approximation of the hidden-data conditional pdf becomes

p(h \mid x, y, \Theta^n) \approx \frac{\sum_{m=1}^{M} \delta_K(h - h^m)\, p(y \mid h, \Theta^n)}{\sum_h \sum_{m_1=1}^{M} \delta_K(h - h^{m_1})\, p(y \mid h, \Theta^n)} (6.27)
= \frac{\sum_{m=1}^{M} \delta_K(h - h^m)\, p(y \mid h^m, \Theta^n)}{\sum_{m_1=1}^{M} p(y \mid h^{m_1}, \Theta^n)} (6.28)
= \sum_{m=1}^{M} \delta_K(h - h^m)\, \gamma_m, (6.29)

where

\gamma_m = \frac{p(y \mid h^m, \Theta^n)}{\sum_{m_1=1}^{M} p(y \mid h^{m_1}, \Theta^n)} (6.30)

is the "importance" of h^m. (6.29) gives an importance-sampled approximation of p(h \mid x, y, \Theta^n) where each sample h^m is given weight \gamma_m. We can now approximate the Q-function as

Q(\Theta \mid \Theta^n) \approx \sum_h \sum_{m=1}^{M} \gamma_m\, \delta_K(h - h^m) \ln p(y, h \mid x, \Theta) (6.31)
= \sum_{m=1}^{M} \gamma_m \ln p(y, h^m \mid x, \Theta) (6.32)
= \sum_{m=1}^{M} \gamma_m \Big[\ln p(h^m \mid x, \Theta) + \ln p(y \mid h^m, \Theta)\Big], (6.33)

where

\ln p(h^m \mid x, \Theta) = \sum_{j=1}^{J} \Big[h_j^m \ln a_j^h + (1 - h_j^m) \ln(1 - a_j^h)\Big] (6.34)

for sigmoidal hidden-layer neurons. Gibbs activation neurons at the output layer give

\ln p(y \mid h^m, \Theta) = \sum_{k=1}^{K} t_k \ln a_k^{mt}, (6.35)

where a_j^h is given in (6.1) and

a_k^{mt} = \frac{\exp\left(\sum_{j=1}^{J} u_{kj} a_j^{mh}\right)}{\sum_{k_1=1}^{K} \exp\left(\sum_{j=1}^{J} u_{k_1 j} a_j^{mh}\right)}. (6.36)

Gaussian output-layer neurons give

\ln p(y \mid h^m, \Theta) = -\frac{1}{2} \sum_{k=1}^{K} (t_k - a_k^{mt})^2. (6.37)

The Q-function in (6.33) is equal to a sum of log-likelihood functions for two 2-layer neural networks between the visible-hidden and hidden-output layers. The M-step maximizes this Q-function by gradient ascent. It is equivalent to two disjoint backpropagation steps performed on these two 2-layer neural networks.

6.4 NEM for Backpropagation Training

Theorem 6.1 recasts the backpropagation training algorithm as a GEM algorithm. Thus the NEM theorem provides conditions under which BP training converges faster to high-likelihood network parameters. We restate the NEM positivity condition [229], [231] in the neural network training context. We use the following notation: the noise random variable N has pdf p(n \mid x), so the noise N can depend on the data x; h denotes the latent variables in the model; \{\Theta^{(n)}\} is a sequence of EM estimates for \Theta; and \Theta^* = \lim_{n \to \infty} \Theta^{(n)} is the converged EM estimate for \Theta.
Define the noisy Q-function Q_N(\Theta \mid \Theta^{(n)}) = E_{h \mid x, \Theta_k}[\ln p(x + N, h \mid \Theta)]. Assume that the differential entropy of all random variables is finite. Assume further that the additive noise keeps the data in the likelihood function's support. Then the NEM positivity condition for the neural network training algorithm is

E_{x, h, N \mid \Theta^*}\left[\ln \frac{p(x + N, h \mid \Theta_k)}{p(x, h \mid \Theta_k)}\right] \geq 0. (6.38)

The exact form of the noise-benefit condition depends on the activation function of the output neurons.

6.4.1 NEM Conditions for Neural Network ML Estimation

Consider adding noise n to the 1-in-K encoding t of the target variable y. We first present the noise-benefit sufficient condition for Gibbs activation output neurons used in K-class classification.

Theorem 6.2. [Forbidden Hyperplane Noise-Benefit Condition]: The NEM positivity condition holds for ML training of a feedforward neural network with Gibbs activation output neurons if

E_{t, h, n \mid x, \Theta^*}\big\{n^T \log(a^t)\big\} \geq 0. (6.39)

Proof. We add noise to the target 1-in-K encoding t. The likelihood ratio in the NEM sufficient condition becomes

\frac{p(t + n, h \mid x, \Theta)}{p(t, h \mid x, \Theta)} = \frac{p(t + n \mid h, \Theta)\, p(h \mid x, \Theta)}{p(t \mid h, \Theta)\, p(h \mid x, \Theta)} (6.40)
= \frac{p(t + n \mid h, \Theta)}{p(t \mid h, \Theta)} (6.41)
= \prod_{k=1}^{K} \frac{(a_k^t)^{t_k + n_k}}{(a_k^t)^{t_k}} = \prod_{k=1}^{K} (a_k^t)^{n_k}. (6.42)

So the NEM positivity condition becomes

E_{t, h, n \mid x, \Theta^*}\Big\{\log \prod_{k=1}^{K} (a_k^t)^{n_k}\Big\} \geq 0. (6.43)

This condition is equivalent to

E_{t, h, n \mid x, \Theta^*}\Big\{\sum_{k=1}^{K} n_k \log(a_k^t)\Big\} \geq 0. (6.44)

We can rewrite this positivity condition as the following matrix inequality:

E_{t, h, n \mid x, \Theta^*}\{n^T \log(a^t)\} \geq 0 (6.45)

where \log(a^t) is the vector of output-neuron log-activations.

The above sufficient condition requires that the noise n lie above a hyperplane with normal \log(a^t). Figure 6.3 illustrates this geometry.

Geometry of NEM Noise for Cross-Entropy Backpropagation

Figure 6.3: NEM noise for faster backpropagation using logistic output neurons. NEM noise must fall above a hyperplane through the origin in the noise space.
The output activation signal a^t controls the normal vector \log(a^t) of the slicing hyperplane. The hyperplane changes on each iteration.

The next theorem gives a sufficient condition for a noise benefit in the case of Gaussian output neurons.

Theorem 6.3. [Forbidden Sphere Noise-Benefit Condition]: The NEM positivity condition holds for ML training of a feedforward neural network with Gaussian output neurons if

E_{t, h, n \mid x, \Theta^*}\Big\{\|n - a^t + t\|^2 - \|a^t - t\|^2\Big\} \leq 0, (6.46)

where \|\cdot\| is the L_2 vector norm.

Proof. We add noise n to the K output neuron values t. The likelihood ratio in the NEM sufficient condition becomes

\frac{p(t + n, h \mid x, \Theta)}{p(t, h \mid x, \Theta)} = \frac{p(t + n \mid h, \Theta)\, p(h \mid x, \Theta)}{p(t \mid h, \Theta)\, p(h \mid x, \Theta)} (6.47)
= \frac{N(t + n; a^t, I)}{N(t; a^t, I)} (6.48)
= \exp\Big\{\frac{1}{2}\Big[\|t - a^t\|^2 - \|t + n - a^t\|^2\Big]\Big\}. (6.49)

So the NEM sufficient condition becomes

E_{t, h, n \mid x, \Theta^*}\Big\{\|n - a^t + t\|^2 - \|a^t - t\|^2\Big\} \leq 0. (6.50)

The above sufficient condition defines a forbidden noise region outside a sphere with center a^t - t and radius \|t - a^t\|. All noise inside this sphere speeds convergence of ML estimation in the neural network on average. Figure 6.4 illustrates this geometry.

The actual implementation of BP as a GEM algorithm uses a Monte Carlo approximation for the E-step. So we are really applying the NEM condition to a generalized Monte Carlo EM algorithm. The quality of the result depends on the approximation quality of the Monte Carlo substitution. This approximation quality degrades quickly as the number of hidden neurons J increases.

6.5 Simulation Results

We modified the Matlab code available in [128] to inject noise during EM-backpropagation training of a neural network. We used 10,000 training instances from the training set of the MNIST digit classification data set. Each image in the data set had 28 \times 28 pixels with each pixel value lying between 0 and 1. We fed each pixel into an input neuron of the neural network. We used a 5-layer neural network with 20 neurons in each of the three hidden layers and 10 neurons in the output layer for classifying the 10 digits.
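Both NEM screens of Theorems 6.2 and 6.3 reduce to simple vector tests at each training iteration. A minimal NumPy sketch, with made-up activation and target values rather than values from the MNIST experiments:

```python
import numpy as np

def nem_hyperplane_ok(n, a_t):
    """Theorem 6.2 screen for Gibbs outputs: keep noise with n^T log(a_t) >= 0,
    i.e. noise on or above the hyperplane whose normal is log(a_t)."""
    return float(n @ np.log(a_t)) >= 0.0

def nem_sphere_ok(n, a_t, t):
    """Theorem 6.3 screen for Gaussian outputs: keep noise inside the mismatch
    sphere ||n - (a_t - t)|| <= ||a_t - t|| implied by (6.46)."""
    c = a_t - t
    return float(np.sum((n - c) ** 2) - np.sum(c ** 2)) <= 0.0

a_t = np.array([0.7, 0.2, 0.1])     # softmax output activations (toy values)
t = np.array([1.0, 0.0, 0.0])       # 1-in-K target encoding
n_along = 0.01 * np.log(a_t)        # noise along the hyperplane normal
c = a_t - t                         # center of the mismatch sphere
```

Noise aligned with the normal \log(a^t) always passes the hyperplane screen, while the sphere's center and the origin both pass the sphere screen (the origin lies on the sphere's boundary).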
We also trained an auto-encoder neural network with 20 neurons in each of the three hidden layers and 784 neurons in the output layer for estimating the pixels of a digit's image.

Geometry of NEM Noise for Least-Squares Backpropagation

Figure 6.4: NEM noise for faster backpropagation using Gaussian output neurons. The NEM noise must fall inside the backpropagation "training mismatch" sphere. This is the sphere with center c = a^t - t (the difference between the actual output layer activation a^t and the target output t) and radius r = \|c\|. Noise from the noise-ball section that intersects the mismatch sphere will speed up backpropagation training according to the NEM theorem. The mismatch ball changes at each training iteration.

The output layer used the Gibbs activation function for the 10-class classification network and the logistic activation function for the auto-encoder. We used logistic activation functions in all other layers. Simulations used 10 Monte Carlo samples for approximating the Q-function in the 10-class classification network and 5 Monte Carlo samples for the auto-encoder. Figure 6.2 shows the training-set squared error for the auto-encoder neural network for backpropagation and NEM-backpropagation when we added annealed Gaussian noise with mean a^t - t and variance 0.1 epoch^{-1}.

Figure 6.5: Training-set cross entropy for backpropagation and NEM-BP training of a 10-class classification neural network on the MNIST digit classification data set. There is a 4.2% median decrease in the cross entropy per iteration for NEM-BP when compared with backpropagation training. We added annealed i.i.d. Gaussian noise to the target variables. The noise had mean 0 and a variance that decayed with training epochs as \{0.2, 0.2/2, 0.2/3, \ldots\}. The network used three logistic (sigmoidal) hidden layers with 20 neurons each. The output layer used 10 neurons with the Gibbs activation function in (6.2).
Figure 6.5 shows the training-set cross entropy for the two cases when we added annealed Gaussian noise with mean 0 and variance 0.2 epoch^{-1}. We used 10 Monte Carlo samples to approximate the Q-function. We observed a 5.3% median decrease in squared error and a 4.2% median decrease in cross entropy per iteration for the NEM-backpropagation algorithm compared to standard backpropagation.

6.6 Conclusion

This chapter showed that the backpropagation algorithm is a generalized EM algorithm. This allows us to apply the NEM theorem to develop noise-benefit sufficient conditions for speeding up the convergence of EM-backpropagation. Simulations on the MNIST digit recognition data set show that NEM noise injection reduces squared error and cross entropy in NN training by backpropagation. Feedforward neural networks are extremely popular in machine learning and data-mining applications. Most applications of feedforward NNs use backpropagation for training. So the backpropagation noise benefits in this chapter are available for these applications and can lead to shorter training times. Such training-time reductions are important for large-scale NN applications where training may take weeks. Stacked layers of stochastic neural networks (or "deep" neural networks) may also benefit from the NEM algorithm. §10.2.1 presents some theorems predicting noise benefits for the pre-training of such deep networks.

Chapter 7

Bayesian Statistics

7.1 Introduction: The Bayesian & The Frequentist

Bayesian inference methods subsume the ML estimation methods described in previous chapters. The defining feature of Bayesian inference is the use of Bayes theorem to revise prior beliefs based on new observed data. Rev. Thomas Bayes first introduced his eponymous theorem in a posthumous letter [19] to the Royal Society of London in 1763. Bayes theorem is now at the heart of many statistical applications including spam filtering, evidence-based medicine, and semantic web search.
The "Bayesian approach" to statistics sees probabilities as statements of beliefs about the state of a random parameter. The opposed "frequentist" view sees probabilities as long-run frequencies of the outcomes of experiments involving a fixed underlying parameter. The difference between the two approaches does not affect the basic Kolmogorov theory of probability. But acceptable statistical inference methods differ based on which view the statistician espouses. The Bayesian (e.g. de Finetti, Lindley, Savage) argues that his approach takes account of all available information (prior information and the data itself) when making an inference or a decision. The frequentist (e.g. Fisher, Student, E.S. Pearson) argues that inference should use only information provided by the data and should be free of subjective input.(1) Modern statistics tends to blend both approaches to fit individual applications. The statistical problem of point estimation highlights this Bayesian-versus-frequentist schism. The goal of point estimation is to find the best estimate for the parameters underlying the data distribution. MLE is a very popular method for point estimation because of its simplicity, its beautiful properties, and its intuitive interpretation. It is frequentist in conception and in spirit; it uses only the data for inference. Bayesian point estimation techniques are more general but they can be complex. They may also give different point estimates depending on available prior information and the cost of choosing bad estimates. The Bayesian framework subsumes the frequentist ML estimate as a possible solution when there is no prior information. Thus Bayesian point estimation is more powerful in cases where there is authoritative prior information.

(1) Fisher even argued [95] that Rev. Bayes withheld publication of his work because he was wary of the dangers involved in injecting subjective information into inferences on data.
The rest of the chapter gives a detailed introduction to Bayesian inference. This sets the foundation for the next chapter, which deals with the effects of model and data corruption in Bayesian inference. §7.3 shows where point estimation and MLE (including EM) fit in the Bayesian inference framework. Thus any subsequent results also apply to the previously discussed ML estimation frameworks.

7.2 Bayesian Inference

Bayesian inference models learning as computing a conditional probability based both on new evidence or data and on prior probabilistic beliefs. It builds on the simple Bayes theorem that shows how set-theoretic evidence should update competing prior probabilistic beliefs or hypotheses. The theorem gives the posterior conditional probability P(H_j \mid E) that the j-th hypothesis H_j occurs given that evidence E occurs. The posterior depends on all the converse conditional probabilities P(E \mid H_k) that E occurs given H_k and on all the unconditional prior probabilities P(H_k) of the disjoint and exhaustive hypotheses \{H_k\}:

P(H_j \mid E) = \frac{P(E \mid H_j)\, P(H_j)}{P(E)} = \frac{P(E \mid H_j)\, P(H_j)}{\sum_k P(E \mid H_k)\, P(H_k)}. (7.1)

The result follows from the definition of conditional probability P(B \mid A) = P(A \cap B) / P(A) for P(A) > 0 when the hypothesis sets H_j partition the state space of the probability measure P [172], [253]. P(H_j \mid E) is a measure of the degree to which the data or evidence E supports each of the competing hypotheses H_k. This represents the data-informed update of prior beliefs about the competing hypotheses \{H_k\}. More accurate beliefs allow for better discrimination between competing hypotheses. Bayesian inference usually works with a continuous version of (7.1).
Now the parameter value \theta corresponds to the hypothesis of interest and the evidence corresponds to the sample values x from a random variable X that depends on \theta:

f(\theta \mid x) = \frac{g(x \mid \theta)\, h(\theta)}{\int g(x \mid u)\, h(u)\, du} \propto g(x \mid \theta)\, h(\theta) (7.2)

where we follow convention and drop the normalizing term that does not depend on \theta, as we always can if \theta has a sufficient statistic [135], [136]. The model (7.2) assumes that the random variable X conditioned on \theta admits the random sample X_1, \ldots, X_n with observed realizations x_1, \ldots, x_n. So again the posterior pdf f(\theta \mid x) depends on the converse likelihood g(x \mid \theta) and on the prior pdf h(\theta). The posterior f(\theta \mid x) contains the complete probabilistic description of \theta given the observed data x. Its maximization is a standard optimality criterion in statistical decision making [28], [42], [68], [78]. The Bayesian inference structure in (7.2) involves a radical abstraction. The set or event hypothesis H_j in (7.1) has become the measurable function or random variable \Theta that takes on realizations \theta according to the prior pdf h(\theta): \Theta \sim h(\theta). The pdf h(\theta) can make or break the accuracy of the posterior pdf f(\theta \mid x) because it scales the data pdf g(x \mid \theta) in (7.2). Statisticians can elicit priors from an expert [102], [152]. Such elicited priors are "subjective" because they are ultimately opinions or guesses. Or the prior in "empirical Bayes" [42], [135] can come from "objective" data or from statistical hypothesis tests such as chi-squared or Kolmogorov-Smirnov tests for a candidate pdf [136].

7.2.1 Conjugacy

The prior pdf h(\theta) is the most subjective part of the Bayesian inference framework. The application determines the sampling pdf g(x \mid \theta). But the prior comes from preconceptions about the parameter \theta. These preconceptions could be in the form of information from experts or from collateral data about \theta. It is not always easy to articulate these sources of information into accurate pdfs for \theta. Thus most Bayesian applications resort to simplifications.
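When no convenient closed form for the posterior is available, (7.2) can always be evaluated numerically on a grid of parameter values. A toy NumPy sketch, assuming a flat prior and a binomial likelihood for 7 successes in 10 trials (both are illustrative choices, not part of the text above):

```python
import numpy as np

# Discretized Bayes update (7.2): posterior ∝ likelihood × prior on a grid.
theta = np.linspace(0.001, 0.999, 999)       # grid over the unit interval
prior = np.ones_like(theta)                  # flat prior h(theta)
likelihood = theta ** 7 * (1 - theta) ** 3   # g(x | theta) up to a constant
post = likelihood * prior
post /= post.sum()                           # normalize the grid masses
map_theta = theta[np.argmax(post)]           # posterior mode on the grid
```

The normalizing constant in (7.2) becomes a simple sum over the grid, and the grid maximizer lands on the analytic mode x/n = 0.7 of this likelihood.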
They restrict themselves to a limited set of closed-form pdfs for \theta. Many applications limit themselves to an even smaller subset of pdfs called "conjugate priors". Conjugate priors produce not only closed-form posterior pdfs but posteriors that come from the same family as the prior [28], [68], [135], [247]. The three most common conjugate priors in the literature are the beta, the gamma, and the normal. Table 7.1 displays these three conjugacy relationships. The posterior f(\theta \mid x) is beta if the prior h(\theta) is beta and if the data or likelihood g(x \mid \theta) is binomial or has a dichotomous Bernoulli structure. The posterior is gamma if the prior is gamma and if the data is Poisson or has a counting structure. The posterior is normal if the prior and data are normal.

Prior h(\theta)                 Likelihood g(x \mid \theta)     Posterior f(\theta \mid x)
Beta B(\alpha, \beta)           Binomial bin(n, \theta)         Beta B(\alpha + x, \beta + n - x)
Gamma \Gamma(\alpha, \beta)     Poisson p(\theta)               Gamma \Gamma(\alpha + x, \beta / (1 + \beta))
Normal N(\mu_0, \sigma_0^2)     Normal N(\theta, \sigma^2)      Normal N\left(\frac{\sigma_0^2 x + \sigma^2 \mu_0}{\sigma_0^2 + \sigma^2}, \frac{\sigma^2 \sigma_0^2}{\sigma_0^2 + \sigma^2}\right)

Table 7.1: Conjugacy relationships in Bayesian inference. A prior pdf of one type combines with its conjugate likelihood to produce a posterior pdf of the same type.

Consider first the beta prior on the unit interval:

B(\alpha, \beta): \quad h(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} (7.3)

if 0 < \theta < 1 for parameters \alpha > 0 and \beta > 0. Here \Gamma is the gamma function \Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx. Then \theta has population mean or expectation E[\theta] = \alpha / (\alpha + \beta). The beta pdf reduces to the uniform pdf if \alpha = \beta = 1. A beta prior is a natural choice when the unknown parameter \theta is the success probability for binomial data such as coin flips or other Bernoulli trials because the beta's support is the unit interval (0, 1) and because the user can adjust the \alpha and \beta parameters to shape the beta pdf over the interval. A beta prior is conjugate to binomial data with likelihood g(x_1, \ldots, x_n \mid \theta).
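The beta-binomial conjugate update reduces to two additions on the prior parameters. A toy numerical sketch (the helper name and the 7-heads-in-10-flips counts are ours, for illustration only):

```python
def beta_binomial_update(alpha, beta, x, n):
    """Conjugate update from Table 7.1: a Beta(alpha, beta) prior combined
    with x successes in n Bernoulli trials gives the
    Beta(alpha + x, beta + n - x) posterior."""
    return alpha + x, beta + n - x

# Uniform prior B(1, 1); observe x = 7 heads in n = 10 coin flips.
a_post, b_post = beta_binomial_update(1.0, 1.0, 7, 10)
post_mean = a_post / (a_post + b_post)   # E[theta | X=x] = (alpha+x)/(alpha+beta+n)
```

The posterior mean (alpha + x)/(alpha + beta + n) shrinks the raw frequency x/n toward the prior mean alpha/(alpha + beta), which is the mean-square optimal estimate under squared-error loss.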
This means that a beta prior h() combines with binomial sample data to produce a new beta posterior: f(jx) = (n + +) ( +x)(n +x) x+1 (1) nx+1 (7.4) 137 Here x is the observed sum of n Bernoulli trials and hence is an observed sucient statistic for [136]. So g(x 1 ;:::;x n j) = g(xj). This beta posterior f(jx) gives the mean-square optimal estimator as the conditional mean E[jX = x] = ( + x)=( + +n) if the loss function is squared-error [136]. A beta conjugate relation still holds when negative-binomial or geometric data replaces the binomial data or likelihood. The conjugacy result also extends to the vector case for the Dirichlet or multidimensional beta pdf. A Dirichlet prior is conjugate to multinomial data [68], [218]. Gamma priors are conjugate to Poisson data. The gamma pdf generalizes many right-sided pdfs such as the exponential and chi-square pdfs. The generalized (three- parameter) gamma further generalizes the Weibull and lognormal pdfs. A gamma prior is right-sided and has the form (;) : h() = 1 e = () if > 0. (7.5) The gamma random variable has population mean E[] = and variance V [] = 2 . The Poisson sample data x 1 ;:::;x n comes from the likelihood g(x 1 :::;x n j) = x 1 e x 1 ! xn e x n ! : (7.6) The observed Poisson sum x =x 1 + +x n is an observed sucient statistic for because the Poisson pdf also comes from an exponential family [28], [135]. The gamma prior h() combines with the Poisson likelihood g(xj) to produce a new gamma posterior f(jx) [136]: f(jx) = ( P n k=1 x k +1) e =[=(n+1)] ( P n k=1 x k +)[=(n + 1)] ( P n k=1 x k +) : (7.7) So E[jX =x] = ( +x)=(1 +) and V [jX =x] = ( +x) 2 =(1 +) 2 . A normal prior is self-conjugate because a normal prior is conjugate to normal data. A normal prior pdf has the whole real line as its domain and has the form [136] N( 0 ; 2 0 ) : h() = 1 p 2 0 e ( 0 ) 2 =2 2 0 (7.8) 138 for known population mean 0 and known population variance 2 0 . 
The normal prior $h(\theta)$ combines with normal sample data from $g(\bar{x}|\theta)=N(\theta,\sigma^2/n)$ given an observed realization $\bar{x}$ of the sample-mean sufficient statistic $\bar{X}_n$. This gives the normal posterior pdf $f(\theta|\bar{x})=N(\mu_n,\sigma_n^2)$. Here $\mu_n$ is the weighted-sum conditional mean

$E[\theta|\bar{X}=\bar{x}] = \frac{\sigma_0^2}{\sigma_0^2+\sigma^2/n}\,\bar{x} + \frac{\sigma^2/n}{\sigma_0^2+\sigma^2/n}\,\mu_0 \qquad (7.9)$

and

$\sigma_n^2 = \frac{(\sigma^2/n)\,\sigma_0^2}{\sigma_0^2+\sigma^2/n}. \qquad (7.10)$

A hierarchical Bayes model [42], [135] would write any of these priors as a function of still other random variables and their pdfs. Conjugate priors permit easy iterative or sequential Bayesian learning because the previous posterior pdf $f_{\text{old}}(\theta|x)$ becomes the new prior pdf $h_{\text{new}}(\theta)$ for the next experiment based on a fresh random sample: $h_{\text{new}}(\theta)=f_{\text{old}}(\theta|x)$. Such conjugacy relations greatly simplify iterative convergence schemes such as Gibbs sampling in Markov chain Monte Carlo estimation of posterior pdfs [42], [135].

7.3 Bayesian Point Estimation

Many statistical decision problems involve selecting an "optimal" point estimate for model parameters. Bayesian inference solves the hard question of how to update beliefs about the data-model's parameters given new observed data. It produces the posterior pdf $f(\theta|x)$: a measure of the spread of probable values of the model parameter based on observed data. But how can we use this information about the parameter spread to select an "optimal" parameter point estimate? The answer depends on the definition of an "optimal estimate". Each parameter estimate $d(X)$ represents a decision. The Bayesian point of view argues [62] that the concept of the "optimal" point estimate is incomplete without consideration of the losses incurred by making wrong decisions. Every parameter estimate $d(x)$ incurs a penalty proportional to how much the estimate $d(x)$ deviates from the parameter $\theta$. The loss function $\ell(d,\theta)$ models these losses. The parameter $\theta$ and thus $\ell(d,\theta)$ are random.
The magnitude of the average loss can serve as a measure of estimate optimality, with higher average loss being less desirable. The average loss is the Bayes risk $R(d(X),\theta)$ associated with the estimate $d(X)$. The posterior pdf enables us to calculate this risk subject to observed data:

$R(d(X),\theta) = E_{\theta|X}[\ell(d(X),\theta)\,|\,X] = \int_{\Theta} \ell(d(X),\theta)\,f(\theta|x)\,d\theta. \qquad (7.11)$

Bayesian point estimation defines the optimal estimate for $\theta$ as one that minimizes the Bayes risk. This estimate is the Bayes estimate $\hat{\theta}_{\text{Bayes}}(X)$:

$\hat{\theta}_{\text{Bayes}}(X) = \operatorname*{argmin}_{d(X)\in\Theta} R(d(X),\theta). \qquad (7.12)$

A utility function $u(d,\theta)$ (with sign opposite the loss function's) can model rewards for good estimates. Then the Bayes estimate becomes a maximization of the expected utility:

$\hat{\theta}_{\text{Bayes}}(X) = \operatorname*{argmax}_{d\in\Theta} E_{\theta}[u(d(X),\theta)\,|\,X]. \qquad (7.13)$

The utility-function formulation is more typical in classical decision theory [264], [290], game theory, and theories of rational economic choice. For example, a rational economic choice $d^*(X)$ is a choice that maximizes an economic agent's expected utility [287]. Bayesian point estimation is a posterior-based variant of the statistical decision framework of Wald, Neyman, and Pearson [219], [290]. The basic theme is to treat inference problems (including point estimation and hypothesis testing) as special cases of decision problems [26], [62].

7.3.1 Bayes Estimates for Different Loss Functions

The loss function determines the Bayes estimate. The loss function ideally mirrors estimation-error penalties from the application domain. Some common loss functions in engineering applications are the squared-error, absolute-error, and 0-1 loss functions. Table 7.2 lists these loss functions and their corresponding Bayes estimates. The Bayes estimates are solutions of the optimization problem in equation (7.12). The three loss functions (squared, absolute, and zero-one) are conceptually similar to $\ell_2$, $\ell_1$, and $\ell_0$ minimization respectively. Bayes estimation with the 0-1 loss function is equivalent to Maximum A Posteriori (MAP) estimation.
The MAP estimate is the mode of the posterior pdf:

$\hat{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\theta} f(\theta|x) = \operatorname*{argmax}_{\theta} g(x|\theta)\,h(\theta). \qquad (7.14)$

Loss Function $\ell(d,\theta)$ | Bayes estimate $\hat{\theta}_{\text{Bayes}}(X)$
squared-error loss $c\,(d-\theta)^2$ | $E_{\theta|X}[\theta\,|\,X]$
absolute-error loss $c\,|d-\theta|$ | $\operatorname{Median}(f(\theta|X))$
0-1 loss $1-\delta(d-\theta)$ | $\operatorname{Mode}(f(\theta|X))$

Table 7.2: Three Loss Functions and Their Corresponding Bayes Estimates.

If the prior distribution is uniform (and possibly improper) then the MAP estimate is equivalent to the Maximum Likelihood (ML) estimate

$\hat{\theta}_{\text{ML}} = \operatorname*{argmax}_{\theta} g(x|\theta). \qquad (7.15)$

Suppose $\theta$ has a uniform distribution. Then the prior $h(\theta)$ is constant. Thus

$h(\theta) = c \qquad (7.16)$

$\operatorname*{argmax}_{\theta} g(x|\theta)\,h(\theta) = \operatorname*{argmax}_{\theta}\,[c\,g(x|\theta)] \qquad (7.17)$

$= \operatorname*{argmax}_{\theta} g(x|\theta) \qquad (7.18)$

since argument maximization is invariant under scalar multiplication. Therefore

$\hat{\theta}_{\text{MAP}} = \hat{\theta}_{\text{ML}}. \qquad (7.19)$

This reduction from MAP to ML estimation is valid when $\theta$ takes values from a bounded subset of the parameter space. The same reduction holds for unbounded domains of $\theta$. This requires the use of improper prior pdfs, i.e. prior pdfs that are not integrable [42]. Thus MAP and ML estimation fit into the Bayesian estimation framework.

All the Bayes estimates above minimize risk functions. There is an alternative decision strategy that addresses worst-case scenarios: we can define an estimate $d^*(x)$ that minimizes the worst-case risk

$d^* = \operatorname*{argmin}_{d}\,\{\sup_{\theta} R(\theta,d)\}. \qquad (7.20)$

This is the minimax estimator. It is a deeply conservative estimator that is typically inadmissible,² having higher risk than some admissible estimator [26], [42]. The minimax approach to rational decisions makes the most sense in zero-sum game-theoretic scenarios [224]. Wald ([290, pp. 24-27]) showed that statistical decision problems have the same form as zero-sum two-person games between Nature and the experimenter. Minimax estimators represent minimax strategies for the experimenter. But minimax estimators are often too conservative for Bayesian statistics applications.

7.3.2 Measures of Uncertainty for Bayes Estimates

Point estimates need appropriate measures of uncertainty.
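The three Bayes estimates of Table 7.2 can be read off a posterior evaluated on a grid. The sketch below is an illustrative helper (not from the dissertation) that works for any unnormalized posterior on an evenly spaced grid:

```python
import numpy as np

def bayes_estimates(theta, post):
    """Posterior mean, median, and mode from a gridded posterior: the
    Bayes estimates for squared-error, absolute-error, and 0-1 loss
    (Table 7.2). `theta` is an increasing evenly spaced grid and `post`
    holds the (possibly unnormalized) posterior values on it."""
    dx = theta[1] - theta[0]
    post = post / (post.sum() * dx)           # normalize the pdf
    mean = (theta * post).sum() * dx          # squared-error loss
    cdf = np.cumsum(post) * dx
    median = theta[np.searchsorted(cdf, 0.5)] # absolute-error loss
    mode = theta[np.argmax(post)]             # 0-1 loss (MAP)
    return mean, median, mode

# Symmetric Beta(2, 2)-shaped posterior: all three estimates are near 1/2.
theta = np.linspace(0.0, 1.0, 2001)
mean, median, mode = bayes_estimates(theta, theta * (1.0 - theta))
```

For a symmetric unimodal posterior the three estimates coincide; they separate as the posterior skews, which is why the choice of loss function matters.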
The full posterior pdf is the most complete Bayesian description of uncertainty about the parameter. We can also specify more succinct measures of parameter variability depending on the type of Bayes estimate in use. Such measures lack the full generality of the posterior but they are simpler to use for fixed loss functions. The conditional variance $\operatorname{Var}[\theta|X]$ measures variability around the conditional-mean Bayes estimate $E[\theta|X]$. The inter-quartile range measures variability around the median Bayes estimate $\operatorname{Median}\{f(\theta|X)\}$. The highest posterior density (HPD) credible interval [26], [42] measures variability around the mode Bayes estimate $\operatorname{Mode}\{f(\theta|X)\}$.

The credible interval is most akin to the more familiar confidence interval $CI(\alpha)$ of frequentist statistical inference. The credible and confidence intervals are both subsets of the parameter space that highlight the characteristic spread of the point estimate $\hat{\theta}=d(X)$. The $(1-\alpha)$-level confidence interval is the random set $CI(\alpha)$ specified by the test statistic $\hat{\theta}$ such that

$(1-\alpha) = P_{\theta}(\theta\in CI(\alpha)). \qquad (7.21)$

A $(1-\alpha)$ credible interval is a connected set $C(\alpha)$ such that

$(1-\alpha) = P(\theta\in C(\alpha)\,|\,x) = \int_{C(\alpha)} f(\theta|x)\,d\theta. \qquad (7.22)$

² A decision rule (point estimate) $d(X)$ is inadmissible [26], [68] in the statistical sense if there exists an alternate decision rule (point estimate) $d^*(X)$ with lower Bayes risk for all values of the parameter, i.e. $R(d^*(X),\theta)\le R(d(X),\theta)$ for all values of $\theta$ with strict inequality for some values of $\theta$.

The key difference is that the confidence interval measures probabilities via a distribution on random sets with $\theta$ constant but unknown, while the credible interval measures probabilities via a posterior distribution on the random parameter $\theta$. The two intervals have different interpretations and are generally not equivalent except under very special conditions [239], [240], [296], [297]. Credible intervals are not unique. Any Bayes estimate can belong to a continuum of different credible intervals.
But the HPD credible interval is optimal [34] in the sense that it is the minimum-volume credible interval and it always contains the posterior mode.

7.4 Conclusion

This chapter highlighted the differences between the frequentist and Bayesian approaches to statistical inference and point estimation in particular. It also showed how the Bayesian framework subsumes frequentist ML point estimation in theory. The rest of this work addresses some issues with the Bayesian inference framework. The subsumption of MLE methods under Bayesian methods implies that these issues are also relevant to ML point estimation. The exposition so far assumes that the data-model functions (the prior pdfs and likelihood functions) are accurate. The Bayesian inference framework works well when this assumption is true. What happens when this assumption fails? Is Bayesian inference robust to corruption caused by incorrect model functions? The next chapter addresses these questions.

Chapter 8

Bayesian Inference with Fuzzy Function Approximators

The key assumptions in the Bayesian inference scheme are: (1) models for the observable data are accurate and (2) the source of confusion is the randomness of the data and the model parameters. Many applications of Bayesian inference involve inaccurate data models and possibly other forms of model corruption. This raises the questions: how reliable are Bayesian statistical estimates when the analytic data model does not match the true data model? Are these estimates useful when we have only approximations of the data model? This chapter addresses these questions by analyzing the effect of approximate model functions on Bayes theorem. The main result of this analysis is the Bayesian Approximation Theorem (BAT) for posterior pdfs: applying Bayes theorem to separate uniform approximators for the prior pdf and the likelihood function (the model functions) results in a uniform approximator for the posterior pdf.
This theorem guarantees that good model-function approximators produce good approximate posterior pdfs. The BAT applies to any type of uniform function approximator. We demonstrate this result with fuzzy rule-based uniform approximators for the model functions. Fuzzy approximation techniques have two main advantages over other approximation techniques for Bayesian inference. First, they allow users to express prior or likelihood descriptions in words rather than as closed-form probability density functions. Learning algorithms can tune approximators based on expert linguistic rules or just grow them from sample data. Second, they can represent any bounded closed-form model function exactly. Furthermore, the learning laws and fuzzy approximators have a tractable form because of the convex-sum structure of additive fuzzy systems. This convex-sum structure carries over to the fuzzy posterior approximator (see Theorem 8.1). We also show that fuzzy approximators are robust to noise in the data (see Figure 8.4). Simulations demonstrate this fuzzy approximation scheme on the priors and posteriors for the three most common conjugate models (see Figures 8.5-8.7): the beta-binomial, gamma-Poisson, and normal-normal conjugate models. Fuzzy approximators can also approximate non-conjugate priors and likelihoods as well as hyperpriors in hierarchical Bayesian inference. We later extend this approximation scheme to more general hierarchical Bayesian models in Chapter 9. Most of the approximation qualities carry over to this more general case.

(This chapter features work done in collaboration with Prof. Sanya Mitaim and first presented in [227], [228].)

Figure 8.1: Probabilistic graphical model for all Bayesian data models in this chapter. We observe $n$ samples of the data $X$ which depends on a hidden random parameter $\theta$. The likelihood function $g(x|\theta)$ captures this dependence. The prior $h(\theta)$ describes the distribution of the hidden parameter $\theta$.
We use the notation from Chapter 7: $f(\theta|x)$ is the posterior pdf that results from applying Bayes theorem to the prior $h(\theta)$ and the likelihood $g(x|\theta)$:

$f(\theta|x) = \frac{g(x|\theta)\,h(\theta)}{\int g(x|u)\,h(u)\,du} \;\propto\; g(x|\theta)\,h(\theta) \qquad (8.1)$

The probabilistic graphical model (PGM) in Figure 8.1 represents this data model succinctly.

8.1 Bayesian Inference with Fuzzy Systems

Additive fuzzy systems can extend Bayesian inference because they allow users to express prior or likelihood knowledge in the form of if-then rules. Fuzzy systems can approximate prior or likelihood probability density functions (pdfs) and thereby approximate posterior pdfs. This allows a user to describe priors with fuzzy if-then rules rather than with closed-form pdfs. The user can also train the fuzzy system with collateral data to adaptively grow or tune the fuzzy rules and thus to approximate the prior or likelihood. A simple two-rule system can also exactly represent a bounded prior pdf if such a closed-form pdf is available. So fuzzy rules extend the range of knowledge that a prior or likelihood can capture and they do so in an expressive linguistic framework based on multivalued or fuzzy sets [313]. Figure 8.2 shows how five tuned fuzzy rules approximate the skewed beta prior pdf $\beta(8,5)$. Learning has sculpted the five if-part and then-part fuzzy sets so that the approximation is almost exact. Users will not in general have access to such training data because they do not know the functional form of the prior pdf. They can instead use any noisy sample data at hand or just state simple rules of thumb in terms of fuzzy sets and thus implicitly define a fuzzy system approximator $F$.
The following prior rules define such an implied skewed prior that maps fuzzy-set descriptions of the parameter random variable $\theta$ to fuzzy descriptions $F(\theta)$ of the occurrence probability:

Rule 1: If $\theta$ is much smaller than $\frac{1}{2}$ then $F(\theta)$ is very small
Rule 2: If $\theta$ is smaller than $\frac{1}{2}$ then $F(\theta)$ is small
Rule 3: If $\theta$ is approximately $\frac{1}{2}$ then $F(\theta)$ is large
Rule 4: If $\theta$ is larger than $\frac{1}{2}$ then $F(\theta)$ is medium
Rule 5: If $\theta$ is much larger than $\frac{1}{2}$ then $F(\theta)$ is small

Learning shifts and scales the Cauchy bell curves that define the if-part fuzzy sets in Figure 8.2. The tuned bell curve in the third rule has shifted far to the right of the equi-probable value $\frac{1}{2}$. Different prior rules and fuzzy sets will define different priors just as will different sets of sample data. The simulation results in Figures 8.5-8.7 show that such fuzzy rules can quickly learn an implicit prior if the fuzzy system has access to data that reflects the prior. These simulations give probative evidence that an informed expert can use fuzzy sets to express reasonably accurate priors in Bayesian inference even when no training data is available. The uniform fuzzy approximation theorem in [169], [171] gives a theoretical basis for such rule-based approximations of priors or likelihoods. Theorem 2 below further shows that such uniform fuzzy approximation of priors or likelihoods leads in general to the uniform fuzzy approximation of the corresponding Bayesian posterior.

Bayesian inference itself has a key strength and a key weakness. The key strength is that it computes the posterior pdf $f(\theta|x)$ of a parameter given the observed data

Figure 8.2: Five fuzzy if-then rules approximate the beta prior $h(\theta)=\beta(8,5)$. The five if-part fuzzy sets are truncated Cauchy bell curves. An adaptive Cauchy SAM (standard additive model) fuzzy system tuned the sets' location and dispersion parameters to give a nearly exact approximation of the beta prior. Each fuzzy rule defines a patch or 3-D surface above the input-output planar state space.
The third rule has the form "If $\theta=A_3$ then $Y=B_3$" where the then-part set $B_3$ is a fuzzy number centered at centroid $c_3$. This rule might have the linguistic form "If $\theta$ is approximately $\frac{1}{2}$ then $F(\theta)$ is large." The training data came from 500 uniform samples of $\beta(8,5)$. The adaptive fuzzy system cycled through each training sample 6,000 times. The fuzzy approximator converged in fewer than 200 iterations. The adaptive system also tuned the centroids and areas of all five then-part sets (not pictured).

$x$. The posterior pdf gives all probabilistic information about the parameter given the available evidence. The key weakness is that this process requires the user to produce a prior pdf $h(\theta)$ that describes the unknown parameter. The prior pdf can inject "subjective" information into the inference process because it can be little more than a guess from the user or from some consulted expert or other source of authority. Priors can also capture "objective" information from a collateral source of data.

Additive fuzzy systems use if-then rules to map inputs to outputs and thus to model priors or likelihoods. A fuzzy system with enough rules can uniformly approximate any continuous function on a compact domain. Statistical learning algorithms can grow rules from unsupervised clusters in the input-output data or from supervised gradient descent. Fuzzy systems also allow users to add or delete knowledge by simply adding or deleting if-then rules. So they can directly model prior pdfs and approximate them from sample data if it is available. Inverse algorithms can likewise find fuzzy rules that maximize the posterior pdf or functionals based on it. These applications of adaptive fuzzy approximators to Bayesian inference do not involve unrelated efforts to fuzzify Bayes Theorem [166], [283]. The use of adaptive fuzzy systems allows for more accurate prior-pdf and likelihood-function estimation and thus improves the versatility and accuracy of classical Bayesian applications.
§8.2 reviews the theory behind fuzzy function approximation. We show this fuzzy approximation scheme with the three well-known conjugate priors and their corresponding posterior approximations (Figures 8.5, 8.6, and 8.7). The scheme works well even with non-conjugate data models (Figures 8.8 and 8.9). §8.3 further extends the fuzzy approach to doubly fuzzy Bayesian inference where separate fuzzy systems approximate the prior and the likelihood. This section also states and proves what we call the Bayesian Approximation Theorem: uniform approximation of the prior and likelihood results in uniform approximation of the posterior.

8.2 Adaptive Fuzzy Function Approximation

Additive fuzzy systems can uniformly approximate continuous functions on compact sets [167], [169], [171]. Hence the set of additive fuzzy systems is dense in the space of such functions. A scalar fuzzy system is the map $F:\mathbb{R}^n\to\mathbb{R}$ that stores $m$ if-then rules and maps vector inputs $x$ to scalar outputs $F(x)$. The prior and likelihood simulations below map not $\mathbb{R}^n$ but a compact real interval $[a,b]$ into the reals. So these systems also satisfy the approximation theorem but at the expense of truncating the domain of pdfs such as the gamma and the normal. Truncation still leaves a proper posterior pdf through the normalization in (8.1).

8.2.1 SAM Fuzzy Systems

A standard additive model (SAM) fuzzy system computes the output $F(x)$ by taking the centroid of the sum of the "fired" or scaled then-part sets: $F(x)=\operatorname{Centroid}(w_1 a_1(x)B_1+\cdots+w_m a_m(x)B_m)$. Then the SAM Theorem states that the output $F(x)$ is a simple convex-weighted sum of the then-part set centroids $c_j$ [167], [169], [171], [210]:

$F(x) = \frac{\sum_{j=1}^m w_j\,a_j(x)\,V_j\,c_j}{\sum_{j=1}^m w_j\,a_j(x)\,V_j} = \sum_{j=1}^m p_j(x)\,c_j. \qquad (8.2)$

Here $V_j$ is the finite area of then-part set $B_j$ in the rule "If $X=A_j$ then $Y=B_j$" and $c_j$ is the centroid of $B_j$. The convex weights $p_1(x),\ldots,p_m(x)$ have the form $p_j(x)=\frac{w_j\,a_j(x)\,V_j}{\sum_{i=1}^m w_i\,a_i(x)\,V_i}$.
The convex coefficients $p_j(x)$ change with each input $x$. The positive rule weights $w_j$ give the relative importance of the $j$th rule. They drop out in our case because they are all equal. The scalar set function $a_j:\mathbb{R}\to[0,1]$ measures the degree to which input $x\in\mathbb{R}$ belongs to the fuzzy or multivalued set $A_j$: $a_j(x)=\operatorname{Degree}(x\in A_j)$. The sinc set functions below map into the augmented range $[-.217,1]$ and so require some care in simulations. The fuzzy membership value $a_j(x)$ "fires" the rule "If $X=A_j$ then $Y=B_j$" in a SAM by scaling the then-part set $B_j$ to give $a_j(x)B_j$. The if-part sets can in theory have any shape but in practice they are parametrized pdf-like sets such as those we use below: sinc, Gaussian, triangle, Cauchy, Laplace, and generalized hyperbolic tangent. The if-part sets control the function approximation and involve the most computation in adaptation. Extensive simulations in [210] show that the sinc function (in 1-D and 2-D) tends to perform best among all six sets in terms of sum of squared approximation error. Users define a fuzzy system by giving the $m$ corresponding pairs of if-part $A_j$ and then-part $B_j$ fuzzy sets. Many fuzzy systems in practice work with simple then-part fuzzy sets such as congruent triangles or rectangles. SAMs define "model-free" statistical estimators in the following sense [171], [188], [210]:

$E[Y|X=x] = F(x) = \sum_{j=1}^m p_j(x)\,c_j \qquad (8.3)$

$V[Y|X=x] = \sum_{j=1}^m p_j(x)\,\sigma_{B_j}^2 + \sum_{j=1}^m p_j(x)\,[c_j-F(x)]^2. \qquad (8.4)$

The then-part set variance $\sigma_{B_j}^2$ is $\sigma_{B_j}^2=\int_{-\infty}^{\infty}(y-c_j)^2\,p_{B_j}(y)\,dy$. Then $p_{B_j}(y)=b_j(y)/V_j$ is an integrable pdf if $b_j:\mathbb{R}\to[0,1]$ is the integrable set function of then-part set $B_j$. The conditional variance $V[Y|X=x]$ gives a direct measure of the uncertainty in the SAM output $F(x)$ based on the inherent uncertainty in the stored then-part rules. This defines a type of confidence surface for the fuzzy system [188].
The first term in the conditional variance (8.4) measures the inherent uncertainty in the then-part sets given the current rule firings. The second term is an interpolation penalty because the rule "patches" $A_j\times B_j$ cover different regions of the input-output product space. The shape of the then-part sets affects the conditional variance of the fuzzy system but affects the output $F(x)$ only to the extent that the then-part sets $B_j$ have different centroids $c_j$ or areas $V_j$. The adaptive function approximations below tune only these two parameters of each then-part set. The conditional mean (8.3) and variance (8.4) depend on the realization $X=x$ and so generalize the corresponding unconditional mean and variance of mixture densities [135].

Figure 8.3: Six types of if-part fuzzy sets in conjugate prior approximations. Each type of set produces its own adaptive SAM learning law for tuning its location and dispersion parameters: (a) sinc set, (b) Gaussian set, (c) triangle set, (d) Cauchy set, (e) Laplace set, and (f) a generalized hyperbolic-tangent set. The sinc shape performed best in most approximations of conjugate priors and the corresponding fuzzy-based posteriors.

A SAM fuzzy system $F$ can always approximate a function $f$ ($F\approx f$) if the fuzzy system contains enough rules. But multidimensional fuzzy systems $F:\mathbb{R}^n\to\mathbb{R}$ suffer exponential rule explosion in general [149], [170]. Optimal rules tend to reside at the extrema or turning points of the approximand $f$ and so optimal fuzzy rules "patch the bumps" [170]. Learning tends to quickly move rules to these extrema and to fill in with extra rules between the extremum-covering rules. The supervised learning algorithms can involve extensive computation in higher dimensions [209], [210]. Our fuzzy prior approximations did not need many rules or extensive computation time because the fuzzy systems were 1-dimensional ($\mathbb{R}\to\mathbb{R}$). But iterative Bayesian inference can produce its own rule explosion (Chapter 9, [232]).
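The SAM convex sum (8.2) is a short computation. The sketch below is an illustrative helper with Gaussian if-part sets and equal rule weights (not code from the dissertation, whose simulations used C++):

```python
import numpy as np

def sam_output(x, m, d, c, V):
    """SAM output (8.2) with equal rule weights and Gaussian if-part
    sets a_j(x) = exp(-((x - m_j)/d_j)^2). m and d hold the if-part
    location and dispersion parameters; c and V hold the then-part
    centroids and areas."""
    a = np.exp(-((x - m) / d) ** 2)  # if-part firings a_j(x)
    p = a * V / np.sum(a * V)        # convex weights p_j(x)
    return np.sum(p * c)             # F(x) = sum_j p_j(x) c_j

# Two symmetric rules: halfway between them both rules fire equally,
# so F(x) is the average of the two centroids.
m = np.array([0.0, 1.0]); d = np.array([1.0, 1.0])
c = np.array([0.0, 1.0]); V = np.array([1.0, 1.0])
F_half = sam_output(0.5, m, d, c, V)
```

The convex-sum structure makes $F(x)$ an interpolation of the centroids $c_j$, which is what the conditional-mean interpretation (8.3) states.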
8.2.2 The Watkins Representation Theorem

Fuzzy systems can exactly represent a bounded pdf with a known closed form. Watkins has shown that in many cases a SAM system $F$ can exactly represent a function $f$ in the sense that $F=f$. The Watkins Representation Theorem [293], [294] states that $F=f$ if $f$ is bounded and if we know the closed form of $f$. The result is stronger than this because the SAM system $F$ exactly represents $f$ with just two rules with equal weights $w_1=w_2$ and equal then-part set volumes $V_1=V_2$:

$F(x) = \frac{\sum_{j=1}^2 w_j\,a_j(x)\,V_j\,c_j}{\sum_{j=1}^2 w_j\,a_j(x)\,V_j} \qquad (8.5)$

$= \frac{a_1(x)\,c_1+a_2(x)\,c_2}{a_1(x)+a_2(x)} \qquad (8.6)$

$= f(x) \qquad (8.7)$

if $a_1(x)=\frac{\sup f-f(x)}{\sup f-\inf f}$, $a_2(x)=1-a_1(x)=\frac{f(x)-\inf f}{\sup f-\inf f}$, $c_1=\inf f$, and $c_2=\sup f$. The representation technique builds $f$ directly into the structure of the two if-then rules. Let $h(\theta)$ be any bounded prior pdf such as the $\beta(8,5)$ pdf in the simulations below. Then $F(\theta)=h(\theta)$ holds for all realizations of $\theta$ if the SAM's two rules have the form "If $\theta=A$ then $Y=B_1$" and "If $\theta=$ not-$A$ then $Y=B_2$" for the if-part set function

$a(\theta) = \frac{\sup h-h(\theta)}{\sup h-\inf h} = 1-\frac{11^{11}}{7^7\,4^4}\,\theta^7(1-\theta)^4. \qquad (8.8)$

The not-$A$ if-part set function is $1-a(\theta)=\frac{11^{11}}{7^7\,4^4}\,\theta^7(1-\theta)^4$. Then-part sets $B_1$ and $B_2$ can have any shape from rectangles to Gaussians so long as $0<V_1=V_2<\infty$ with centroids $c_1=\inf h=0$ and $c_2=\sup h=\frac{\Gamma(13)}{\Gamma(8)\,\Gamma(5)}\big(\tfrac{7}{11}\big)^7\big(\tfrac{4}{11}\big)^4$. So the Watkins Representation Theorem lets a SAM fuzzy system directly absorb a closed-form bounded prior $h(\theta)$ if it is available. The same holds for a bounded likelihood or posterior pdf.

8.2.3 ASAM Learning Laws

An adaptive SAM (ASAM) $F$ can quickly approximate a prior $h(\theta)$ (or likelihood) if the following supervised learning laws have access to adequate samples $h(\theta_1), h(\theta_2),\ldots$ from the prior. This may mean in practice that the ASAM trains on the same
Figure 8.4 shows that an ASAM can learn the prior pdf even from noisy random samples drawn from the pdf. Unsupervised clustering techniques can also train an ASAM if there is sucient cluster data [167], [171], [308]. The ASAM prior simulations in the next section show howF approximatesh() when the ASAM trains on random samples from the prior. These approximations bolster the case that ASAMs will in practice learn the appropriate prior that corresponds to the available collateral data. Figure 8.4: ASAMs can use a limited number of random samples or noisy random samples to estimate the sampling pdf. The ASAMs for these examples use the tanh set function with 15 rules and they run for 6000 iterations. The ASAMs approximate empirical pdfs from the dierent sets of random samples. The shaded regions represent the approximation error between the ASAM estimate and the sampling pdf. Part (a) compares the (3; 10:4) pdf with ASAM approximations for some (3; 10:4) empirical pdfs. Each empirical pdf is a scaled histogram for a set of N random samples. The gure shows comparisons for the cases N = 500; 2500; 25000. Part (b) compares the (3; 10:4) pdf with ASAM approximations of 3 (3; 10:4) random sample sets corrupted by independent noise. Each set has 5000 random samples. The noise is zero-mean additive white Gaussian noise. The standard deviations n of the additive noise are 0:1, 0:05, and 0:025. The plots show that the ASAM estimate gets better as the number of samples increases. The ASAM has diculty estimating tail probabilities when the additive noise variance gets large. ASAM supervised learning uses gradient descent to tune the parameters of the 152 set functions a j as well as the then-part areas V j (and weights w j ) and centroids c j . 
The learning laws follow from the SAM's convex-sum structure (8.2) and the chain-rule decomposition $\frac{\partial E}{\partial m_j}=\frac{\partial E}{\partial F}\frac{\partial F}{\partial a_j}\frac{\partial a_j}{\partial m_j}$ for SAM parameter $m_j$ and error $E$ in the generic gradient-descent algorithm [171], [210]

$m_j(t+1) = m_j(t)-\mu_t\,\frac{\partial E}{\partial m_j} \qquad (8.9)$

where $\mu_t$ is a learning rate at iteration $t$. We seek to minimize the squared error

$E(\theta) = \tfrac{1}{2}\,(f(\theta)-F(\theta))^2 = \tfrac{1}{2}\,\varepsilon(\theta)^2 \qquad (8.10)$

of the function approximation. Let $m_j$ denote any parameter in the set function $a_j$. Then the chain rule gives the gradient of the error function with respect to the respective if-part set parameter $m_j$, the centroid $c_j$, and the volume $V_j$:

$\frac{\partial E}{\partial m_j}=\frac{\partial E}{\partial F}\frac{\partial F}{\partial a_j}\frac{\partial a_j}{\partial m_j};\quad \frac{\partial E}{\partial c_j}=\frac{\partial E}{\partial F}\frac{\partial F}{\partial c_j};\quad\text{and}\quad \frac{\partial E}{\partial V_j}=\frac{\partial E}{\partial F}\frac{\partial F}{\partial V_j} \qquad (8.11)$

with partial derivatives [171], [210]

$\frac{\partial E}{\partial F}=-(f(\theta)-F(\theta))=-\varepsilon(\theta)\quad\text{and}\quad \frac{\partial F}{\partial a_j}=[c_j-F(\theta)]\,\frac{p_j(\theta)}{a_j(\theta)}. \qquad (8.12)$

The SAM ratio (8.2) with equal rule weights $w_1=\cdots=w_m$ gives [171], [210]

$\frac{\partial F}{\partial c_j}=\frac{a_j(\theta)\,V_j}{\sum_{i=1}^m a_i(\theta)\,V_i}=p_j(\theta) \qquad (8.13)$

$\frac{\partial F}{\partial V_j}=\frac{a_j(\theta)\,[c_j-F(\theta)]}{\sum_{i=1}^m a_i(\theta)\,V_i}=[c_j-F(\theta)]\,\frac{p_j(\theta)}{V_j}. \qquad (8.14)$

Then the learning laws for the then-part set centroids $c_j$ and volumes $V_j$ have the final form

$c_j(t+1)=c_j(t)+\mu_t\,\varepsilon(\theta)\,p_j(\theta) \qquad (8.15)$

$V_j(t+1)=V_j(t)+\mu_t\,\varepsilon(\theta)\,[c_j-F(\theta)]\,\frac{p_j(\theta)}{V_j}. \qquad (8.16)$

The learning laws for the if-part set parameters follow in like manner by expanding $\frac{\partial a_j}{\partial m_j}$ in (8.11). The simulations below tune the location $m_j$ and dispersion $d_j$ parameters of the

Figure 8.5: Comparison of conjugate beta priors and posteriors with their fuzzy approximators. (a) An adapted sinc-SAM fuzzy system $F(\theta)$ with 15 rules approximates the three conjugate beta priors $h(\theta)$: $\beta(2.5,9)$, $\beta(9,9)$, and $\beta(8,5)$. (b) The sinc-SAM fuzzy priors $F(\theta)$ in (a) produce the SAM-based approximators $F(\theta|x)$ of the three corresponding beta posteriors $f(\theta|x)$ for the three corresponding binomial likelihoods $g(x|\theta)$ with $n=80$: $\mathrm{bin}(20,80)$, $\mathrm{bin}(40,80)$, and $\mathrm{bin}(60,80)$ where $g(x|\theta)=\mathrm{bin}(x,80)=\frac{80!}{x!\,(80-x)!}\,\theta^x(1-\theta)^{80-x}$.
So $X\sim\mathrm{bin}(x,80)$ and $X=20$ mean that there were 20 successes out of 80 trials in an experiment where the probability of success was $\theta$. Each of the three fuzzy approximations cycled 6,000 times through 500 uniform training samples from the corresponding beta priors.

if-part set functions $a_j$ for sinc, Gaussian, triangle, Cauchy, Laplace, and generalized hyperbolic tangent if-part sets. Figure 8.3 shows an example of each of these six fuzzy sets with the following learning laws.

Sinc ASAM learning law

The sinc set function $a_j$ has the form

$a_j(\theta) = \sin\!\Big(\frac{\theta-m_j}{d_j}\Big)\Big/\Big(\frac{\theta-m_j}{d_j}\Big) \qquad (8.17)$

with parameter learning laws [171], [210]

$m_j(t+1)=m_j(t)+\mu_t\,\varepsilon(\theta)\,[c_j-F(\theta)]\,\frac{p_j(\theta)}{a_j(\theta)}\Big(a_j(\theta)-\cos\Big(\frac{\theta-m_j}{d_j}\Big)\Big)\frac{1}{\theta-m_j}$

$d_j(t+1)=d_j(t)+\mu_t\,\varepsilon(\theta)\,[c_j-F(\theta)]\,\frac{p_j(\theta)}{a_j(\theta)}\Big(a_j(\theta)-\cos\Big(\frac{\theta-m_j}{d_j}\Big)\Big)\frac{1}{d_j}.$

Figure 8.6: Comparison of conjugate gamma priors and posteriors with their fuzzy approximators. (a) An adapted sinc-SAM fuzzy system $F(\theta)$ with 15 rules approximates the three conjugate gamma priors $h(\theta)$: $\Gamma(1,30)$, $\Gamma(4,12)$, and $\Gamma(9,5)$. (b) The sinc-SAM fuzzy priors $F(\theta)$ in (a) produce the SAM-based approximators $F(\theta|x)$ of the three corresponding gamma posteriors $f(\theta|x)$ for the three corresponding Poisson likelihoods $g(x|\theta)$: $p(35)$, $p(70)$, and $p(105)$ where $g(x|\theta)=p(x)=\theta^x e^{-\theta}/x!$. Each of the three fuzzy approximations cycled 6,000 times through 1,125 uniform training samples from the corresponding gamma priors.

Figure 8.7: Comparison of 11 conjugate normal posteriors with their fuzzy-based approximators based on a standard normal prior and 11 different normal likelihoods. An adapted sinc-SAM approximator with 15 rules first approximates the standard normal prior $h(\theta)=N(0,1)$ and then combines with the likelihood $g(x|\theta)=N(\theta,\sigma^2=\tfrac{1}{16})$. The variance is $1/16$ because $x$ is the observed sample mean of 16 standard-normal random samples $X_k\sim N(0,1)$. The 11 posteriors correspond to the 11 likelihoods $g(x|\theta)$ with $x=-4$, $-3.25$, $-2.5$, $-1.75$, $-1$, $-0.25$, $0.5$, $1.25$, $2$, $2.75$, and $3.5$.
The fuzzy approximation cycled 6,000 times through 500 uniform training samples from the standard-normal prior.

Gaussian ASAM learning law

The Gaussian set function $a_j$ has the form

$a_j(\theta) = \exp\Big\{-\Big(\frac{\theta-m_j}{d_j}\Big)^2\Big\} \qquad (8.18)$

with parameter learning laws

$m_j(t+1)=m_j(t)+\mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j-F(\theta)]\,\frac{\theta-m_j}{d_j^2} \qquad (8.19)$

$d_j(t+1)=d_j(t)+\mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j-F(\theta)]\,\frac{(\theta-m_j)^2}{d_j^3}. \qquad (8.20)$

Triangle ASAM learning law

The triangle set function has the form

$a_j(\theta) = \begin{cases} 1-\frac{m_j-\theta}{l_j} & \text{if } m_j-l_j\le\theta\le m_j \\ 1-\frac{\theta-m_j}{r_j} & \text{if } m_j\le\theta\le m_j+r_j \\ 0 & \text{else} \end{cases} \qquad (8.21)$

with parameter learning laws

$m_j(t+1) = \begin{cases} m_j(t)-\mu_t\,\varepsilon(\theta)\,[c_j-F(\theta)]\,\frac{p_j(\theta)}{a_j(\theta)}\,\frac{1}{l_j} & \text{if } m_j-l_j<\theta<m_j \\ m_j(t)+\mu_t\,\varepsilon(\theta)\,[c_j-F(\theta)]\,\frac{p_j(\theta)}{a_j(\theta)}\,\frac{1}{r_j} & \text{if } m_j<\theta<m_j+r_j \\ m_j(t) & \text{else} \end{cases}$

$l_j(t+1) = \begin{cases} l_j(t)+\mu_t\,\varepsilon(\theta)\,[c_j-F(\theta)]\,\frac{p_j(\theta)}{a_j(\theta)}\,\frac{m_j-\theta}{l_j^2} & \text{if } m_j-l_j<\theta<m_j \\ l_j(t) & \text{else} \end{cases}$

$r_j(t+1) = \begin{cases} r_j(t)+\mu_t\,\varepsilon(\theta)\,[c_j-F(\theta)]\,\frac{p_j(\theta)}{a_j(\theta)}\,\frac{\theta-m_j}{r_j^2} & \text{if } m_j<\theta<m_j+r_j \\ r_j(t) & \text{else} \end{cases}$

The Gaussian learning laws (8.19)-(8.20) can approximate the learning laws for the symmetric triangle set function $a_j(\theta)=\max\{0,\ 1-\frac{|\theta-m_j|}{d_j}\}$.

Cauchy ASAM learning law

The Cauchy set function $a_j$ has the form

$a_j(\theta) = \frac{1}{1+\big(\frac{\theta-m_j}{d_j}\big)^2} \qquad (8.22)$

with parameter learning laws

$m_j(t+1)=m_j(t)+\mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j-F(\theta)]\,\frac{\theta-m_j}{d_j^2}\,a_j(\theta) \qquad (8.23)$

$d_j(t+1)=d_j(t)+\mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j-F(\theta)]\,\frac{(\theta-m_j)^2}{d_j^3}\,a_j(\theta). \qquad (8.24)$

Laplace ASAM learning law

The Laplace or double-exponential set function $a_j$ has the form

$a_j(\theta) = \exp\Big\{-\frac{|\theta-m_j|}{d_j}\Big\} \qquad (8.25)$

with parameter learning laws

$m_j(t+1)=m_j(t)+\mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j-F(\theta)]\,\operatorname{sign}(\theta-m_j)\,\frac{1}{d_j} \qquad (8.26)$

$d_j(t+1)=d_j(t)+\mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j-F(\theta)]\,\frac{|\theta-m_j|}{d_j^2}. \qquad (8.27)$

Generalized hyperbolic tangent ASAM learning law

The generalized hyperbolic tangent set function has the form

$a_j(\theta) = 1+\tanh\Big(-\Big(\frac{\theta-m_j}{d_j}\Big)^2\Big)$
(8.28)

with parameter learning laws

$$m_j(t+1) = m_j(t) + \mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j - F(\theta)]\,(2 - 2a_j(\theta))\,\frac{\theta - m_j}{d_j^2} \qquad (8.29)$$

$$d_j(t+1) = d_j(t) + \mu_t\,\varepsilon(\theta)\,p_j(\theta)\,[c_j - F(\theta)]\,(2 - 2a_j(\theta))\,\frac{(\theta - m_j)^2}{d_j^3}. \qquad (8.30)$$

We can also reverse the learning process and adapt the SAM if-part and then-part set parameters by maximizing a given closed-form posterior pdf $f(\theta|x)$. The basic Bayesian relation (8.1) above leads to the following application of the chain rule for a set parameter $m_j$:

$$\frac{\partial f(\theta|x)}{\partial m_j} \propto g(x|\theta)\,\frac{\partial F}{\partial m_j} \qquad (8.31)$$

since $\partial g/\partial F = 0$ because the likelihood $g(x|\theta)$ does not depend on the fuzzy system $F$. The chain rule gives $\frac{\partial F}{\partial m_j} = \frac{\partial F}{\partial a_j}\frac{\partial a_j}{\partial m_j}$ and similarly for the other SAM parameters. Then the above learning laws can eliminate the product of partial derivatives to produce a stochastic gradient ascent or maximum-a-posteriori (MAP) learning law for the SAM parameters.

8.2.4 ASAM Approximation Simulations

We simulated six different types of adaptive SAM fuzzy systems to approximate the three standard conjugate prior pdfs and their corresponding posterior pdfs. The six types of ASAMs corresponded to the six if-part sets in Figure 8.3 and their learning laws above. We combined C++ software for the ASAM approximations with Mathematica to compute the fuzzy-based posterior $F(\theta|x)$ using (8.1). Mathematica's NIntegrate program computed the mean-squared errors between the conjugate prior $h(\theta)$ and the fuzzy-based prior $F(\theta)$ and between the posterior $f(\theta|x)$ and the fuzzy posterior $F(\theta|x)$.

Each ASAM simulation used uniform samples from a prior pdf $h(\theta)$. The program evenly spaced the initial if-part sets and assigned them equal but experimental dispersion values. The initial then-part sets had unit areas or volumes. The initial then-part centroids corresponded to the prior pdf's value at the location parameters of the if-part sets. A single learning iteration began with computing the approximation error at each uniformly spaced sample point.
The program cycled through all rules for each sample value and then updated each rule's if-part and then-part parameters according to the appropriate ASAM learning law. Each adapted parameter had a harmonic-decay learning rate $\mu_t = c/t$ for learning iteration $t$. Experimentation picked the numerator constants $c$ for the various parameters.

The approximation figures show representative simulation results. Figure 8.2 used Cauchy if-part sets for illustration only and not because they gave a smaller mean-squared error than sinc sets did. Figures 8.5-8.7 used sinc if-part sets even though we simulated all six types of if-part sets for all three types of conjugate priors. Simulations demonstrated that all six set functions produce good approximations for the prior pdfs. The sinc ASAM usually performed best.

We truncated the gamma priors at the right-side value of 150 and truncated the normal priors at $-4$ and $4$ because the overlap between the truncated prior tails and the likelihoods $g(x|\theta)$ was small. The likelihood functions $g(x|\theta)$ had narrow dispersions relative to the truncated supports of the priors. Larger truncation values or appended fall-off tails can accommodate unlikely $x$ values in other settings. We also assumed that the priors were strictly positive. So we bounded the ASAM priors below by a small positive value ($F(\theta) \ge 10^{-3}$) to keep the denominator integral in (8.1) well-behaved.

Figure 8.2 used only one fuzzy approximation. The fuzzy approximation of the $\beta(8, 5)$ prior had mean-squared error $4.2 \times 10^{-4}$. The Cauchy-ASAM learning algorithm used 500 uniform samples for 6,000 iterations. The fuzzy approximation of the beta priors $\beta(2.5, 9)$, $\beta(9, 9)$, and $\beta(8, 5)$ in Figure 8.5 had respective mean-squared errors $1.3 \times 10^{-4}$, $2.3 \times 10^{-5}$, and $1.4 \times 10^{-5}$. The sinc-ASAM learning used 500 uniform samples from the unit interval for 6,000 training iterations. The corresponding conjugate beta posterior approximations had respective mean-squared errors $3.0 \times 10^{-5}$, $6.9 \times 10^{-6}$, and $3.8 \times 10^{-5}$.
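The training loop just described can be sketched in code. The following is a minimal one-dimensional sketch of a Gaussian ASAM, not the authors' C++/Mathematica implementation: the rule count, epoch count, learning-rate constant, and beta(2, 2) target pdf are illustrative assumptions. The updates follow the Gaussian laws (8.19)-(8.20) plus the standard centroid learning law.

```python
import numpy as np

def gaussian_asam_train(h, theta, n_rules=15, epochs=200, c=0.05):
    """Train a SAM F(theta) with Gaussian if-part sets and unit then-part
    volumes to approximate the sampled pdf values h at the points theta."""
    lo, hi = theta.min(), theta.max()
    m = np.linspace(lo, hi, n_rules)            # if-part locations m_j
    d = np.full(n_rules, (hi - lo) / n_rules)   # if-part dispersions d_j
    cj = np.interp(m, theta, h)                 # then-part centroids c_j
    for t in range(1, epochs + 1):
        mu = c / t                              # harmonic-decay rate mu_t = c/t
        for x, hx in zip(theta, h):
            a = np.exp(-((x - m) / d) ** 2)     # Gaussian set values (8.18)
            p = a / a.sum()                     # convex coefficients p_j
            F = p @ cj                          # SAM output F(theta)
            err = hx - F                        # approximation error eps(theta)
            m += mu * err * p * (cj - F) * (x - m) / d ** 2       # (8.19)
            d += mu * err * p * (cj - F) * (x - m) ** 2 / d ** 3  # (8.20)
            d = np.maximum(d, 1e-3)             # keep dispersions positive
            cj += mu * err * p                  # centroid learning law
    return m, d, cj

def sam_eval(x, m, d, cj):
    a = np.exp(-((x[:, None] - m) / d) ** 2)
    return (a / a.sum(axis=1, keepdims=True)) @ cj

theta = np.linspace(0.01, 0.99, 200)
h = 6 * theta * (1 - theta)                     # beta(2, 2) pdf as target
m, d, cj = gaussian_asam_train(h, theta)
mse = np.mean((sam_eval(theta, m, d, cj) - h) ** 2)
```

The final mean-squared error plays the role of the NIntegrate comparison in the text.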
The fuzzy approximation of the gamma priors $\gamma(1, 30)$, $\gamma(4, 12)$, and $\gamma(9, 5)$ in Figure 8.6 had respective mean-squared errors $5.5 \times 10^{-5}$, $3.6 \times 10^{-6}$, and $7.9 \times 10^{-6}$. The sinc-ASAM learning used 1,125 uniform samples from the truncated interval $[0, 150]$ for 6,000 training iterations. The corresponding conjugate gamma posterior approximations had mean-squared errors $2.3 \times 10^{-5}$, $2.1 \times 10^{-7}$, and $2.3 \times 10^{-4}$. The fuzzy approximation of the single standard-normal prior that underlies Figure 8.7 had mean-squared error $7.7 \times 10^{-6}$. The sinc-ASAM learning used 500 uniform samples from the truncated interval $[-4, 4]$ for 6,000 training iterations.

Table 8.1: Mean-squared errors for the 11 normal posterior approximations

  Sample mean x    MSE
  -4               0.12
  -3.25            1.9 x 10^-3
  -2.5             3 x 10^-4
  -1.75            1.5 x 10^-4
  -1               3.1 x 10^-5
  -0.25            2.2 x 10^-6
  0.5              1.1 x 10^-5
  1.25             6.5 x 10^-5
  2                1.6 x 10^-4
  2.75             3 x 10^-4
  3.5              7.6 x 10^-3

The generalized-hyperbolic-tanh ASAMs in Figure 8.4 learn the beta prior $\beta(3, 10.4)$ from both noiseless and noisy random-sample (i.i.d.) draws $x_1, x_2, \ldots$ from the "unknown" prior because the ASAMs use only the histogram or empirical distribution of the pdf. The Glivenko-Cantelli Theorem [29] ensures that the empirical distribution converges uniformly to the original distribution. So sampling from the histogram of random samples increasingly resembles sampling directly from the unknown underlying pdf as the sample size increases. This ASAM learning is robust in the sense that the fuzzy systems still learn the pdf if independent white noise corrupts the random-sample draws.

The simulation draws $N$ random samples $x_1, x_2, \ldots, x_N$ from the pdf $h(\theta) = \beta(3, 10.4)$ and then bins them into 50 equally spaced bins of length $\Delta = 0.02$. We generate an empirical pdf $h_{emp}(\theta)$ for the beta distribution by rescaling the histogram.
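This binning-and-rescaling step can be sketched as follows. The sample size and bin width follow the text; the seeded beta sampler is an illustrative stand-in for draws from the "unknown" prior.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5000                                   # number of random draws
samples = rng.beta(3, 10.4, size=N)        # draws from the "unknown" prior

n_bins, delta = 50, 0.02                   # 50 bins of length 0.02 on [0, 1]
counts, _ = np.histogram(samples, bins=n_bins, range=(0.0, 1.0))

def h_emp(theta):
    """Staircase empirical pdf: value p[m] / (N * delta) on bin m."""
    idx = np.clip((np.asarray(theta) / delta).astype(int), 0, n_bins - 1)
    return counts[idx] / (N * delta)

# The rescaled staircase integrates to one, so it is a valid pdf approximant.
centers = (np.arange(n_bins) + 0.5) * delta
total_mass = float(h_emp(centers).sum() * delta)
```

The ASAM then trains on samples of `h_emp` exactly as it would on samples of the true prior.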
The rescaling converts the histogram into a staircase approximation of the pdf $h(\theta)$:

$$h_{emp}(\theta) = \sum_{m=1}^{\#\text{bins}} \frac{p[m]}{N\Delta}\,\mathrm{rect}\!\left(\frac{\theta - \theta_b[m]}{\Delta}\right) \qquad (8.32)$$

where $p[m]$ is the number of random samples in bin $m$ and where $\theta_b[m]$ is the central location of the $m$-th bin. The ASAM generates an approximation $H_{emp}(\theta)$ for the empirical distribution $h_{emp}(\theta)$. Figure 8.4(a) shows comparisons between $H_{emp}(\theta)$ and $h(\theta)$.

The second example starts with 5,000 random samples of the $\beta(3, 10.4)$ distribution. We add zero-mean white Gaussian noise to the random samples. The noise is independent of the random samples. The examples use respective noise standard deviations of 0.1, 0.05, and 0.025 in the three separate cases. The ASAM produces an approximation $H_{emp,n}(\theta)$ for this noise-modified function $h_{emp,n}(\theta)$. Figure 8.4(b) shows comparisons between $H_{emp,n}(\theta)$ and $h(\theta)$. The approximands $h_{emp}$ and $h_{emp,n}$ in Figures 8.4(a) and (b) are random functions. So these functions and their ASAM approximators are sample cases.

8.2.5 Approximating Non-conjugate Priors

We defined a prior pdf $h(\theta)$ as a convex bimodal mixture of normal and Maxwell pdfs: $h(\theta) = 0.4\,N(10, 1) + 0.3\,M(2) + 0.3\,M(5)$. The Maxwell pdfs have the form

$$M(\sigma):\quad h(\theta) \propto \theta^2 e^{-\theta^2/2\sigma^2} \quad \text{if } \theta > 0. \qquad (8.33)$$

The prior pdf modeled a location parameter $\theta$ for the normal mixture likelihood function: $g(x|\theta) = 0.7\,N(\theta, 2.25) + 0.3\,N(\theta + 8, 1)$. The prior $h(\theta)$ is not conjugate with respect to this likelihood function $g(x|\theta)$. The ASAM used sinc set functions to generate a fuzzy approximator $H(\theta)$ for the prior $h(\theta)$. The ASAM used 15 rules and 6,000 iterations on 500 uniform samples of $h(\theta)$. The two figures below show the quality of the prior and posterior fuzzy approximators. This example shows that fuzzy Bayesian approximation still works for non-conjugate pdfs.

Figure 8.8: Comparison of a non-conjugate prior pdf $h(\theta)$ and its fuzzy approximator $H(\theta)$. The pdf $h(\theta)$ is a convex mixture of normal and Maxwell pdfs: $h(\theta) = 0.4\,N(10, 1) + 0.3\,M(2) + 0.3\,M(5)$.
The Maxwell pdf $M(\sigma)$ is proportional to $\theta^2 e^{-\theta^2/2\sigma^2}$ for $\theta \ge 0$ and is 0 for $\theta < 0$. An adaptive sinc-SAM generated $H(\theta)$ using 15 rules and 6,000 training iterations on 500 uniform samples of $h(\theta)$.

8.2.6 The SAM Structure of Fuzzy Posteriors

The next theorem shows that the SAM's convex-weighted-sum structure passes over into the structure of the fuzzy-based posterior $F(\theta|x)$. The result is a generalized SAM [171] because the then-part centroids $c_j$ are no longer constant but vary both with the observed data $x$ and the parameter value $\theta$. This simplified structure for the posterior $F(\theta|x)$ comes in general at the expense of variable centroids that require several integrations for each observation $x$.

Theorem 8.1. Fuzzy Posterior Approximator is a SAM. The fuzzy posterior approximator is a SAM:

$$F(\theta|x) = \sum_{j=1}^{m} p_j(\theta)\,c'_j(\theta|x) \qquad (8.34)$$

Figure 8.9: Approximation of a non-conjugate posterior pdf. Comparison of a non-conjugate posterior pdf $f(\theta|x)$ and its fuzzy approximator $F(\theta|x)$. The fuzzy prior $H(\theta)$ and the mixture likelihood function $g(x|\theta) = 0.7\,N(\theta, 2.25) + 0.3\,N(\theta + 8, 1)$ produce the fuzzy approximator of the posterior pdf $F(\theta|x)$. The figure shows $F(\theta|x)$ for the single observation $x = 6$.

where the generalized then-part set centroids $c'_j(\theta|x)$ have the form

$$c'_j(\theta|x) = \frac{c_j\,g(x|\theta)}{\sum_{i=1}^{m} \int_D g(x|u)\,p_i(u)\,c_i\,du} \quad \text{for sample space } D. \qquad (8.35)$$

Proof. The proof equates the fuzzy-based posterior $F(\theta|x)$ with the right-hand side of (8.1) and then expands according to Bayes Theorem:

$$F(\theta|x) = \frac{g(x|\theta)\,F(\theta)}{\int_D g(x|u)\,F(u)\,du} \quad \text{by (8.1)} \qquad (8.36)$$

$$= \frac{g(x|\theta)\sum_{j=1}^{m} p_j(\theta)\,c_j}{\int_D g(x|u)\sum_{j=1}^{m} p_j(u)\,c_j\,du} \quad \text{by (8.2)} \qquad (8.37)$$

$$= \frac{g(x|\theta)\sum_{j=1}^{m} p_j(\theta)\,c_j}{\sum_{j=1}^{m}\int_D g(x|u)\,p_j(u)\,c_j\,du} \qquad (8.38)$$

$$= \sum_{j=1}^{m} p_j(\theta)\,\frac{c_j\,g(x|\theta)}{\sum_{i=1}^{m}\int_D g(x|u)\,p_i(u)\,c_i\,du} \qquad (8.39)$$

$$= \sum_{j=1}^{m} p_j(\theta)\,c'_j(\theta|x). \qquad (8.40)$$

We next prove two corollaries that hold in special cases that avoid the integration in (8.40) and thus are computationally tractable. The first special case occurs when $g(x|\theta)$ approximates a Dirac delta function centered at $x$: $g(x|\theta) \approx \delta(\theta - x)$.
This can arise when $g(x|\theta)$ concentrates on a region $D_g \subset D$ if $D_g$ is much smaller than $D_{p_j} \subset D$ and if $p_j(\theta)$ concentrates on $D_{p_j}$. Then

$$\int_D p_j(u)\,g(x|u)\,du \approx \int_{D_{p_j}} p_j(u)\,\delta(u - x)\,du = p_j(x). \qquad (8.41)$$

Then (8.40) becomes

$$F(\theta|x) \approx \frac{g(x|\theta)\,F(\theta)}{\sum_{j=1}^{m} p_j(x)\,c_j} \quad \text{by (8.41)} \qquad (8.42)$$

$$= \frac{g(x|\theta)\sum_{j=1}^{m} p_j(\theta)\,c_j}{F(x)} \quad \text{by (8.2)} \qquad (8.43)$$

$$= \sum_{j=1}^{m} p_j(\theta)\,\frac{c_j\,g(x|\theta)}{F(x)} = \sum_{j=1}^{m} p_j(\theta)\,c'_j(\theta|x). \qquad (8.44)$$

So a learning law for $F(\theta|x)$ needs to update only each then-part centroid $c_j$ by scaling it with $g(x|\theta)/F(x)$ for each observation $x$. This involves a substantially lighter computation than does the integration in (8.40).

The delta-pulse approximation $g(x|\theta) \approx \delta(\theta - x)$ holds for narrow bell curves such as normal or Cauchy pdfs when their variance or dispersion is small. It holds in the limit as the equality $g(x|\theta) = \delta(\theta - x)$ in the much more general case of alpha-stable pdfs [174], [221] with any shape if $x$ is the location parameter of the stable pdf and if the dispersion $\gamma$ goes to zero. Then the characteristic function is the complex exponential $e^{ix\omega}$ and thus Fourier transformation gives the pdf $g(x|\theta)$ exactly as the Dirac delta function [175]: $\lim_{\gamma \to 0} g(x|\theta) = \delta(\theta - x)$. Then $F(\theta|x)$ equals the right-hand side of (8.44). The approximation fails for a narrow binomial $g(x|\theta)$ unless scaling maintains unity status for the mass of $g(x|\theta)$ in (8.41) for a given $n$.

The second special case occurs when the SAM system $F$ uses several narrow rules to approximate the prior $h(\theta)$ and when the data is highly uncertain in the sense that $D_g$ is large compared with $D_{p_j}$. Then we can approximate $g(x|\theta)$ as the constant $g(x|m_j)$ over $D_{p_j}$:

$$\int_D p_j(u)\,g(x|u)\,du \approx \int_{D_{p_j}} p_j(u)\,g(x|m_j)\,du = g(x|m_j)\,U_{p_j} \qquad (8.45)$$

where $U_{p_j} = \int_{D_{p_j}} p_j(u)\,du$. So (8.40) becomes

$$F(\theta|x) \approx \frac{g(x|\theta)\,F(\theta)}{\sum_{j=1}^{m} g(x|m_j)\,U_{p_j}\,c_j} \quad \text{by (8.45)} \qquad (8.46)$$

$$= \sum_{j=1}^{m} p_j(\theta)\,\frac{c_j\,g(x|\theta)}{\sum_{i=1}^{m} g(x|m_i)\,U_{p_i}\,c_i} = \sum_{j=1}^{m} p_j(\theta)\,c'_j(\theta|x). \qquad (8.47)$$

We can pre-compute or estimate the if-part volume $U_{p_j}$ in advance. So (8.47) also gives a generalized SAM structure and another tractable way to adapt the variable then-part centroids $c'_j(\theta|x)$. This second special case holds for the normal likelihood

$$g(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\,e^{-(x-\theta)^2/2\sigma_0^2} \qquad (8.48)$$

if the widths or dispersions $d_j$ of the if-part sets are small compared with $\sigma_0$ and if there are a large number $m$ of fuzzy if-then rules that jointly cover $D_g$. This occurs if $D_g = (x - 3\sigma_0,\, x + 3\sigma_0)$ with if-part dispersions $d_j = \sigma_0/m$ and locations $m_j$. Then $p_j(\theta)$ concentrates on some $D_{p_j} = (m_j - \kappa,\, m_j + \kappa)$ where $0 < \kappa \ll \sigma_0$ and so $p_j(\theta) \approx 0$ for $\theta \notin D_{p_j}$. Then $\frac{\theta - m_j}{\sigma_0} \approx 0$ since $\kappa \ll \sigma_0$. So $\frac{x - \theta}{\sigma_0} \approx \frac{x - m_j}{\sigma_0}$ for all $\theta \in D_{p_j}$ and
So (8.47) also gives a generalized SAM structure and another tractable way to adapt the variable then-part centroids c 0 j (xj). This second special case holds for the normal likelihood g(xj) = 1 p 2 0 e (x) 2 =2 2 0 (8.48) if the widths or dispersions d j of the if-part sets are small compared with 0 and if there are a large number m of fuzzy if-then rules that jointly cover D g . This occurs if D g = ( 3 0 ; + 3 0 ) with if-part dispersions d j = 0 =m and locations m j . Then p j () concentrates on someD p j = (m j ;m j +) where 0< 0 and sop j () 0 for = 2D p j . Then xm j 0 xm j 0 since 0 . So x 0 xm j 0 for all 2D p j and 165 thus g(xj) 1 p 2 0 e (xm j ) 2 =2 2 0 =g(xjm j ) (8.49) for 2D p j . This special case also holds for the binomial g(xj) = n k x (1) nx for x = 0; 1;:::;n if nm and thus if there are fewer Bernoulli trials n than fuzzy if-then rules m in the SAM system. It holds because g(xj) concentrates on D g and because D g is wide compared with D p j when m n. This case also holds for the Poisson g(xj) = 1 x! x e if the number of timesx that a discrete event occurs is small compared with the number m of SAM rules that jointly cover D g = ( x 2 ; 3x 2 ) because again D g is large compared with D p j . So (8.45) follows. 8.2.7 Other Uniform Function Approximation Methods Fuzzy systems lend themselves well to prior pdf approximation because a lot of sub- jective information exists as linguistic rules. The approximation and representation theorems of Kosko [169] and Watkins [294] make fuzzy approximators well-suited for Bayesian inference. But there are other model-free approximation schemes for approx- imating pdfs and likelihood. Uniform approximation schemes are especially useful in practice because they guarantee that users can specify and achieve a global approxima- tion error tolerance. Examples of model-free uniform approximators include multilayer feedforward (FF) networks and algebraic rings of polynomials. 
There are even more approximation methods available if we do not insist on uniform approximation quality.

Hornik et al. [139] showed that multilayer FF networks can also uniformly approximate any Borel measurable function. FF networks can learn the approximand from sample data. Back-propagation training [259], [299], [300] is the standard method for training feedforward approximators with sample data. The approach is similar to the learning procedure for fuzzy ASAMs: gradient descent on an approximation-error surface (squared error or cross-entropy). But FF networks have very opaque internal structures compared to the rule sets of fuzzy ASAMs. So tuning a FF network with linguistic information is not as simple as in the fuzzy case. Exact function representation with FF networks is possible [127]. But the FF representation uses highly nonsmooth, possibly non-parametric activation functions [106], [182], [183]. Thus function representation is much harder with FF networks than with fuzzy systems.

Polynomial rings provide another method for universal function approximation on compact domains. Polynomial rings refer to the set of polynomials with real coefficients equipped with the natural addition and multiplication binary operations. Function approximation with polynomial rings usually relies on the Weierstrass Approximation Theorem and its generalizations [258], [274]. The approximation theorem guarantees the existence of a uniform polynomial approximator for any continuous function over a compact domain. Applying the theorem involves selecting a polynomial basis for the approximation. The set of available polynomial bases includes splines, Bernstein polynomials, Hermite polynomials, Legendre polynomials, and others. But polynomials tend to be unstable in the norm and are thus less attractive for implementing approximators.
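As a concrete instance of the Weierstrass route, a Bernstein-polynomial approximator on [0, 1] takes only a few lines. The degree and the beta(2, 2) target pdf here are illustrative assumptions; for this quadratic target the classical identity $B_n(x^2) = x^2 + x(1-x)/n$ makes the uniform error exactly $(6/n)\,x(1-x)$, so the maximum error at degree 20 is about 0.075 and doubling the degree halves it.

```python
import math

def bernstein_approx(f, n):
    """Degree-n Bernstein approximation on [0, 1]:
    B_n f(x) = sum_k f(k/n) * C(n, k) * x**k * (1 - x)**(n - k)."""
    coeffs = [f(k / n) for k in range(n + 1)]
    def B(x):
        return sum(c * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
                   for k, c in enumerate(coeffs))
    return B

h = lambda x: 6 * x * (1 - x)       # beta(2, 2) pdf as the approximand
B20 = bernstein_approx(h, 20)
max_err = max(abs(B20(i / 200) - h(i / 200)) for i in range(201))
```

The slow decay of the error illustrates why uniform polynomial approximators can be unattractive in practice despite their clean existence guarantee.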
8.3 Doubly Fuzzy Bayesian Inference: Uniform Approximation

We will use the term doubly fuzzy to describe Bayesian inference where separate fuzzy systems $H(\theta)$ and $G(x|\theta)$ approximate the respective prior pdf $h(\theta)$ and the likelihood $g(x|\theta)$. Theorem 8.1 above shows that the resulting fuzzy approximator $F$ of the posterior pdf $f(\theta|x)$ still has the convex-sum structure (8.34) of a SAM fuzzy system. The doubly fuzzy posterior approximator $F$ requires only $m_1 m_2$ rules if the fuzzy likelihood approximator $G$ uses $m_1$ rules and if the fuzzy prior approximator $H$ uses $m_2$ rules. The $m_1 m_2$ if-part sets of $F$ have a corresponding product structure as do the other fuzzy-system parameters. An exact two-rule representation in the sense of Watkins [294] reduces the corresponding rule number $m_1$ or $m_2$ to two. This is a tractable growth in rules for a single Bayesian inference. But the same structure leads in general to an exponential growth in posterior-approximator rules if the old posterior approximator becomes the new prior approximator in iterated Bayesian inference.

Figure 8.10 shows the result of doubly fuzzy Bayesian inference for two normal posterior pdfs. A 15-rule Gaussian SAM $G$ approximates two normal likelihoods while a 15-rule sinc SAM $H$ approximates a standard normal prior pdf.

8.3.1 The Bayesian Approximation Theorem

We call the next theorem the Bayesian Approximation Theorem (BAT). The BAT shows that doubly fuzzy systems can uniformly approximate posterior pdfs under some mild conditions. The proof derives an approximation error bound for $F(\theta|x)$ that

Figure 8.10: Doubly fuzzy Bayesian inference: comparison of two normal posteriors and their doubly fuzzy approximators. The doubly fuzzy approximations use the fuzzy prior-pdf approximator $H(\theta)$ and the fuzzy likelihood-pdf approximator $G(x|\theta)$. The sinc-SAM fuzzy approximator $H(\theta)$ uses 15 rules to approximate the normal prior $h(\theta) = N(0, 1)$.
The Gaussian-SAM fuzzy likelihood approximator $G(x|\theta)$ uses 15 rules to approximate the two likelihood functions $g(x|\theta) = N(0.25, \tfrac{1}{16})$ and $g(x|\theta) = N(2, \tfrac{1}{16})$. The two fuzzy approximators used 6,000 learning iterations based on 500 uniform sample points.

does not depend on $\theta$ or $x$. Thus $F(\theta|x)$ uniformly approximates $f(\theta|x)$. The BAT holds in general for any uniform approximators of the prior or likelihood. Corollary 8.1 shows how the centroid and convex-sum structure of SAM fuzzy approximators $H$ and $G$ specifically bound the posterior approximator $F$.

The statement and proof of the BAT require the following notation. Let $D$ denote the set of all $\theta$ and let $X$ denote the set of all $x$. Assume that $D$ and $X$ are compact. The prior is $h(\theta)$ and the likelihood is $g(x|\theta)$. $H(\theta)$ is a 1-dimensional SAM fuzzy system that uniformly approximates $h(\theta)$ in accord with the Fuzzy Approximation Theorem [169], [171]. $G(x|\theta)$ is a 2-dimensional SAM that uniformly approximates $g(x|\theta)$. Define the Bayes factors as $q(x) = \int_D h(\theta)\,g(x|\theta)\,d\theta$ and $Q(x) = \int_D H(\theta)\,G(x|\theta)\,d\theta$. Assume that $q(x) > 0$ so that the posterior $f(\theta|x)$ is well-defined for any sample data $x$. Let $\Delta Z = Z - z$ denote the approximation error for an approximator $Z$ of an approximand $z$.

Theorem 8.2. [Bayesian Approximation Theorem]: Suppose that $h(\theta)$ and $g(x|\theta)$ are bounded and continuous and that $H(\theta)\,G(x|\theta) \ne 0$ almost everywhere. Then the posterior pdf approximator $F(\theta|x) = HG/Q$ uniformly approximates $f(\theta|x)$. Thus for all $\epsilon > 0$: $|F(\theta|x) - f(\theta|x)| < \epsilon$ for all $x$ and $\theta$.

Proof. Write the posterior pdf $f(\theta|x)$ as $f(\theta|x) = \frac{h(\theta)\,g(x|\theta)}{q(x)}$ and its approximator $F(\theta|x)$ as $F(\theta|x) = \frac{H(\theta)\,G(x|\theta)}{Q(x)}$. The SAM approximations for the prior and likelihood functions are uniform [171]. So they have approximation error bounds $\epsilon_h$ and $\epsilon_g$ that do not depend on $x$ or $\theta$:

$$|\Delta H| < \epsilon_h \qquad (8.50)$$

and

$$|\Delta G| < \epsilon_g \qquad (8.51)$$

where $\Delta H = H(\theta) - h(\theta)$ and $\Delta G = G(x|\theta) - g(x|\theta)$. The posterior error $\Delta F$ is

$$\Delta F = F - f = \frac{HG}{Q(x)} - \frac{hg}{q(x)}. \qquad (8.52)$$

Expand $HG$ in terms of the approximation errors to get

$$HG = (\Delta H + h)(\Delta G + g) = \Delta H\,\Delta G + \Delta H\,g + h\,\Delta G + hg. \qquad (8.53)$$

We have assumed that $HG \ne 0$ almost everywhere and so $Q \ne 0$.
We now derive an upper bound for the Bayes-factor error $\Delta Q$:

$$\Delta Q = Q - q = \int_D (\Delta H\,\Delta G + \Delta H\,g + h\,\Delta G + hg - hg)\,d\theta. \qquad (8.54)$$

So

$$|\Delta Q| \le \int_D |\Delta H\,\Delta G + \Delta H\,g + h\,\Delta G|\,d\theta \qquad (8.55)$$

$$\le \int_D \left(|\Delta H|\,|\Delta G| + |\Delta H|\,g + h\,|\Delta G|\right)d\theta \qquad (8.56)$$

$$< \int_D \left(\epsilon_h\epsilon_g + \epsilon_h\,g + h\,\epsilon_g\right)d\theta \quad \text{by (8.50)-(8.51)}. \qquad (8.57)$$

The parameter set $D$ has finite Lebesgue measure $m(D) = \int_D d\theta < \infty$ because $D$ is a compact subset of a metric space and thus [216] it is (totally) bounded. Then the bound on $\Delta Q$ becomes

$$|\Delta Q| < m(D)\,\epsilon_h\epsilon_g + \epsilon_g + \epsilon_h \int_D g(x|\theta)\,d\theta \qquad (8.58)$$

because $\int_D h(\theta)\,d\theta = 1$.

We now invoke the extreme value theorem [97]. The extreme value theorem states that a continuous function on a compact set attains both its maximum and its minimum. This allows us to use maxima and minima instead of suprema and infima. Now $\int_D g(x|\theta)\,d\theta$ is a continuous function of $x$ because $g(x|\theta)$ is a continuous nonnegative function. The range of $\int_D g(x|\theta)\,d\theta$ is a subset of the right half-line $(0, \infty)$ and its domain is the compact set $X$. So $\int_D g(x|\theta)\,d\theta$ attains a finite maximum value. Thus

$$|\Delta Q| < \epsilon_q \qquad (8.59)$$

where we define the error bound $\epsilon_q$ as

$$\epsilon_q = m(D)\,\epsilon_h\epsilon_g + \epsilon_g + \epsilon_h \max_x \int_D g(x|\theta)\,d\theta. \qquad (8.60)$$

Rewrite the posterior approximation error $\Delta F$ as

$$\Delta F = \frac{q\,HG - Q\,hg}{qQ} = \frac{q(\Delta H\,\Delta G + \Delta H\,g + h\,\Delta G + hg) - Q\,hg}{q(q + \Delta Q)}. \qquad (8.61)$$

Inequality (8.59) implies that $q - \epsilon_q < Q < q + \epsilon_q$ and that $(q - \epsilon_q) < (q + \Delta Q) < (q + \epsilon_q)$. Then (8.50) and (8.51) give similar inequalities for $H$ and $G$. So

$$\frac{q\left[-\epsilon_h\epsilon_g - \min(g)\,\epsilon_h - \min(h)\,\epsilon_g\right] - \epsilon_q\,hg}{q(q - \epsilon_q)} < \Delta F < \frac{q\left[\epsilon_h\epsilon_g + \max(g)\,\epsilon_h + \max(h)\,\epsilon_g\right] + \epsilon_q\,hg}{q(q - \epsilon_q)}. \qquad (8.62)$$

The extreme value theorem ensures that the maxima in (8.62) are finite. The bound on the approximation error $\Delta F$ does not depend on $\theta$. But $q$ still depends on the value of the data sample $x$. So (8.62) guarantees at best a pointwise approximation of $f(\theta|x)$ when $x$ is arbitrary. We can improve the result by finding bounds for $q$ that do not depend on $x$. Note that $q(x)$ is a continuous function of $x \in X$ because $hg$ is continuous. So the extreme value theorem ensures that the Bayes factor $q$ has a finite upper bound and a positive lower bound. The term $q(x)$ attains its maximum and minimum by the extreme value theorem.
The minimum of $q(x)$ is positive because we assumed $q(x) > 0$ for all $x$. Hölder's inequality gives $|q| \le \left(\int_D |h|\,d\theta\right)\|g(x, \cdot)\|_\infty = \|g(x, \cdot)\|_\infty$ since $h$ is a pdf. So the maximum of $q(x)$ is finite because $g$ is bounded: $0 < \min\{q(x)\} \le \max\{q(x)\} < \infty$. Then

$$\epsilon^- < \Delta F < \epsilon^+ \qquad (8.63)$$

if we define the error bounds $\epsilon^-$ and $\epsilon^+$ as

$$\epsilon^- = \frac{\left(-\epsilon_h\epsilon_g - \min\{g\}\,\epsilon_h - \min\{h\}\,\epsilon_g\right)\min\{q\} - hg\,\epsilon_q}{\min\{q\}\left(\min\{q\} - \epsilon_q\right)} \qquad (8.64)$$

$$\epsilon^+ = \frac{\left(\epsilon_h\epsilon_g + \max\{g\}\,\epsilon_h + \max\{h\}\,\epsilon_g\right)\max\{q\} + hg\,\epsilon_q}{\min\{q\}\left(\min\{q\} - \epsilon_q\right)}. \qquad (8.65)$$

Now $\epsilon_q \to 0$ as $\epsilon_g \to 0$ and $\epsilon_h \to 0$. So $\epsilon^- \to 0$ and $\epsilon^+ \to 0$. The denominator of the error bounds must be non-zero for this limiting argument. We can guarantee this when $\epsilon_q < \min\{q\}$. This condition is not restrictive because the functions $h$ and $g$ fix or determine $q$ independently of the approximators $H$ and $G$ involved and because $\epsilon_q \to 0$ when $\epsilon_h \to 0$ and $\epsilon_g \to 0$. So we can achieve an arbitrarily small $\epsilon_q$ that satisfies $\epsilon_q < \min\{q\}$ by choosing appropriate $\epsilon_h$ and $\epsilon_g$. Then $\Delta F \to 0$ as $\epsilon_g \to 0$ and $\epsilon_h \to 0$. So $|\Delta F| \to 0$.

The BAT proof also shows how sequences of uniform approximators $H_n$ and $G_n$ lead to a sequence of posterior approximators $F_n$ that converges uniformly to $F$. Suppose we have such sequences $H_n$ and $G_n$ that uniformly approximate the respective prior $h$ and likelihood $g$. Suppose $\epsilon_{h,n+1} < \epsilon_{h,n}$ and $\epsilon_{g,n+1} < \epsilon_{g,n}$ for all $n$. Define $F_n = \frac{H_n G_n}{\int H_n G_n}$. Then for all $\epsilon > 0$ there exists an $n_0 \in \mathbb{N}$ such that for all $n > n_0$: $|F_n(\theta|x) - F(\theta|x)| < \epsilon$ for all $\theta$ and for all $x$. The positive integer $n_0$ is the first $n$ such that $\epsilon_{h,n}$ and $\epsilon_{g,n}$ satisfy (8.64)-(8.65). Hence $F_n$ converges uniformly to $F$.

Corollary 8.1 below reveals the fuzzy structure of the BAT's uniform approximation when the prior approximator $H$ and the likelihood approximator $G$ are uniform SAM approximators. The corollary shows how the convex-sum and centroidal structure of $H$ and $G$ produce centroid-based bounds on the fuzzy posterior approximator $F$. Recall first that Theorem 8.1 states that $F(\theta|x) = \sum_{j=1}^{m} p_j(\theta)\,c'_j(\theta|x)$ where

$$c'_j(\theta|x) = \frac{c_j\,g(x|\theta)}{\sum_{i=1}^{m}\int_D g(x|u)\,p_i(u)\,c_i\,du}.$$
Replace the likelihood $g(x|\theta)$ with its fuzzy SAM approximator $G(x|\theta)$ to obtain the doubly fuzzy posterior

$$F(\theta|x) = \sum_{j=1}^{m} p_j(\theta)\,C'_j(\theta|x) \qquad (8.66)$$

where the generalized then-part set centroids are

$$C'_j(\theta|x) = \frac{c_{h,j}\,G(x|\theta)}{\sum_{i=1}^{m}\int_D G(x|u)\,p_i(u)\,c_{h,i}\,du}. \qquad (8.67)$$

The $\{c_{h,k}\}_k$ are the then-part set centroids for the prior SAM approximator $H(\theta)$. $G(x|\theta)$ likewise has then-part set centroids $\{c_{g,j}\}_j$. Each SAM is a convex sum of its centroids from (8.34). This convex-sum structure induces bounds on $H$ and $G$ that in turn produce bounds on $F$. We next let the subscripts max and min denote the respective maximal and minimal centroids. The maximal centroids are positive. But the minimal centroids may be negative even though $h$ and $g$ are non-negative functions. We therefore assume that the minimal centroids are positive. So define the maximal and minimal product centroids as

$$c_{gh,\max} = \max_{j,k}\, c_{g,j}\,c_{h,k} = c_{g,\max}\,c_{h,\max} \quad \text{and} \quad c_{gh,\min} = \min_{j,k}\, c_{g,j}\,c_{h,k} = c_{g,\min}\,c_{h,\min}. \qquad (8.68)$$

Then the BAT gives the following SAM-based bound.

Corollary 8.1. Centroid-based bounds for the doubly fuzzy posterior $F$. Suppose that the set $D$ of all $\theta$ has positive Lebesgue measure. Then the centroids of the $H$ and $G$ then-part sets bound the posterior $F$:

$$\frac{c_{gh,\min}}{m(D)\,c_{gh,\max}} \le F(\theta|x) \le \frac{c_{gh,\max}}{m(D)\,c_{gh,\min}}. \qquad (8.69)$$

Proof. The convex-sum structure constrains the values of the SAMs: $H(\theta) \in [c_{h,\min},\, c_{h,\max}]$ for all $\theta$ and $G(x|\theta) \in [c_{g,\min},\, c_{g,\max}]$ for all $x$ and $\theta$. Then (8.67) implies

$$C'_j(\theta|x) \ge \frac{c_{gh,\min}}{c_{gh,\max}\sum_{i=1}^{m}\int_D p_i(u)\,du} \qquad (8.70)$$

$$= \frac{c_{gh,\min}}{m(D)\,c_{gh,\max}} \quad \text{for all } x \text{ and } \theta \qquad (8.71)$$

since $\sum_{i=1}^{m}\int_D p_i(u)\,du = \int_D \sum_{i=1}^{m} p_i(u)\,du = \int_D du = m(D)$ where $m(D)$ denotes the (positive) Lebesgue measure of $D$.
The same argument gives the upper bound:

$$C'_j(\theta|x) \le \frac{c_{gh,\max}}{m(D)\,c_{gh,\min}} \quad \text{for all } x \text{ and } \theta. \qquad (8.72)$$

Thus (8.71) and (8.72) give bounds for all centroids:

$$\frac{c_{gh,\min}}{m(D)\,c_{gh,\max}} \le C'_j(\theta|x) \le \frac{c_{gh,\max}}{m(D)\,c_{gh,\min}} \quad \text{for all } x \text{ and } \theta. \qquad (8.73)$$

This bounding interval applies to $F(\theta|x)$ because the posterior approximator also has a convex-sum structure. Thus

$$\frac{c_{gh,\min}}{m(D)\,c_{gh,\max}} \le F(\theta|x) \le \frac{c_{gh,\max}}{m(D)\,c_{gh,\min}} \quad \text{for all } x \text{ and } \theta. \qquad (8.74)$$

The size of the bounding interval depends on the size of the set $D$ and on the minimal centroids of $H$ and $G$. The lower bound is more sensitive to minimal centroids than the upper bound because dividing by a maximum is more stable than dividing by a minimum close to zero. The bounding interval becomes $[0, \infty)$ if any of the minimal centroids for $H$ or $G$ equal zero. The infinite bounding interval $[0, \infty)$ corresponds to the least informative case.

Similar centroid bounds hold for the multidimensional case. Suppose that the SAM-based posterior $F$ is the multidimensional approximator $F: \mathbb{R} \to \mathbb{R}^p$ with $p > 1$. Then the same argument applies to the components of the centroids along each dimension. There are $p$ bounding intervals

$$\frac{c^s_{gh,\min}}{m(D)\,c^s_{gh,\max}} \le F_s(\theta|x) \le \frac{c^s_{gh,\max}}{m(D)\,c^s_{gh,\min}} \qquad (8.75)$$

for each dimension $s$ of the range $\mathbb{R}^p$. These componentwise intervals define a bounding hypercube $\prod_{s=1}^{p}\left[\frac{c^s_{gh,\min}}{m(D)\,c^s_{gh,\max}},\; \frac{c^s_{gh,\max}}{m(D)\,c^s_{gh,\min}}\right] \subset \mathbb{R}^p$ for $F$.

8.4 Conclusion

Fuzzy systems allow users to encode prior and likelihood information through fuzzy rules rather than through only a handful of closed-form probability densities. Gradient-descent learning algorithms also allow fuzzy systems to learn and tune rules based on the same type of collateral data that an expert might consult or that a statistical hypothesis might use. Different learning algorithms should produce different bounds on the fuzzy prior or likelihood approximations, and those in turn should lead to different bounds on the fuzzy posterior approximation.
Hierarchical Bayes systems can model hyperpriors with fuzzy approximators or with other "intelligent" learning systems such as neural networks or semantic networks. An open research problem is how the use of semi-conjugate rules or other techniques can reduce the exponential rule explosion that doubly fuzzy Bayesian systems face in general in iterative Bayesian inference.

Chapter 9

Hierarchical and Iterative Bayesian Inference with Function Approximators

This chapter presents some extensions of the approximate Bayesian inference scheme in Chapter 8. The first extension is a demonstration of the fuzzy approximation on hierarchical Bayesian models where the user puts a second-order prior pdf or hyperprior on one of the uncertain parameters in the original prior pdf. Hierarchical Bayesian models are more complex and require more approximators. The next result, the Extended Bayesian Approximation Theorem (Theorem 9.1), shows that the increased complexity of hierarchical Bayesian models does not weaken the theoretical strength of our approximation scheme. This theorem generalizes the Bayesian Approximation Theorem (Theorem 8.2) to hierarchical and multidimensional Bayesian models. The final set of results addresses the computational complexity of iterative Bayesian schemes. Conjugate Bayesian models track a constant number of pdf parameters with each iteration of Bayes theorem. We propose the related idea of semi-conjugacy for fuzzy approximators as a way to limit the growth of pdf parameters in iterative Bayesian schemes. Semi-conjugacy refers to functional similarity between the if-part sets of the posterior and prior approximators. It is a looser condition than conjugacy. Semi-conjugacy allows users to strike a balance between the parameter explosion associated with non-conjugate models and the constant parameter size associated with overly restrictive

This chapter features work done in collaboration with Prof. Sanya Mitaim and was first presented in [228], [230], [232].
conjugate models.

Figure 9.1: Probabilistic graphical model for all Bayesian data models in this chapter. We observe $n$ samples of the data $X$, which depends on a hidden random parameter $\theta$. The likelihood function $g(x|\theta)$ captures this dependence. The hidden random parameter $\theta$ itself depends on another hidden hyperparameter $\tau$ which has an unconditional hyperprior pdf $\pi(\tau)$. The conditional prior $h(\theta|\tau)$ describes the distribution of the hidden parameter $\theta$ and its dependence on the hyperparameter $\tau$.

9.1 Approximate Hierarchical Bayesian Inference

Function approximation can also apply to nested priors or so-called hierarchical Bayesian techniques [42], [135]. Here the user puts a new prior or hyperprior pdf on an uncertain parameter that appears in the original prior pdf. This new hyperprior pdf can itself have a random parameter that leads to yet another new prior or hyper-hyperprior pdf and so on up the hierarchy of prior models. Figure 9.2 demonstrates this hierarchical approximation technique in the common case where an inverse-gamma hyperprior pdf models the uncertainty in the unknown variance of a normal prior pdf. This is the scalar case of the conjugate inverse Wishart prior [42] that often models the uncertainty in the covariance matrix of a normal random vector.

The simple Bayesian posterior pdf is $f(\theta|x) \propto g(x|\theta)\,h(\theta)$ where the likelihood is $g(x|\theta)$ and the prior pdf is $h(\theta)$. But now suppose that the prior pdf $h(\theta)$ depends on an uncertain parameter $\tau$: $h(\theta|\tau)$. We will model the uncertainty involving $\tau$ by making $\tau$ a random variable $T$ with its own pdf or hyperprior pdf $\pi(\tau)$. The probabilistic graphical model (PGM) in Figure 9.1 represents this data model succinctly. Conditioning the original prior $h(\theta)$ on $\tau$ adds complexity and a new dimension to the posterior pdf:

$$f(\theta, \tau|x) = \frac{g(x|\theta)\,h(\theta|\tau)\,\pi(\tau)}{\iint g(x|u)\,h(u|v)\,\pi(v)\,du\,dv}. \qquad (9.1)$$

Marginalizing or integrating over $\tau$ removes this extra dimension and returns the
Conditioning the original prior h() on adds complexity and a new dimension to the posterior pdf: f(;jx) = g(xj)h(j)() RR g(xj)h(j)() d d : (9.1) Marginalizing or integrating over removes this extra dimension and returns the 176 posterior pdf for the parameter of interest : f(jx) Z g(xj)h(j)()d: (9.2) Thus hierarchical Bayes has the benet of working with a more exible and descriptive prior but at the computational cost of a new integration. The approach of empirical Bayes [42], [135] would simply replace the random variable with a numerical proxy such as its most probable value. That approach is simpler to compute but ignores a lot of information in the hyperprior pdf. Figure 9.2 shows marginal posterior approximations for a hierarchical Bayesian model. The posterior approximations use fuzzy approximators for the prior and hyperprior pdfs. The data model for the posterior pdf is a hierarchical conjugate normal model with a variance hyperparameter. The likelihood is Gaussian with unknown mean g(xj) =N x = 1 16 j . The prior pdf h for is a zero-mean Gaussian with unknown variance . So h(j) is N(0;). We model the pdf of with an inverse gamma (IG) hyperprior pdf: IG(2; 1) where IG(;) =() = e = () +1 . The inverse gamma prior is conjugate to the normal likelihood and so the resulting posterior is inverse gamma. Thus we have conjugacy in both the mean and variance parameters. The posterior approximator F(jx) uses a fuzzy approximation of the truncated hyperprior (). Figure 9.3 shows the sinc-SAM fuzzy approximator () for the truncated hyperprior (). These gures show that approximate hierarchical Bayesian inference is feasible with function approximators like fuzzy ASAMs. The next section gives a theoretical guarantee that this kind of uniform posterior approximation is always possible. 
9.2 Uniform Approximation for Hierarchical Bayesian Inference

The posterior approximation scheme for hierarchical Bayes uses uniform approximators $\Pi(\tau)$, $H(\theta\mid\tau)$, and $G(x\mid\theta)$ in place of their respective approximands: the hyperprior pdf $\pi(\tau)$, the prior pdf $h(\theta\mid\tau)$, and the likelihood $g(x\mid\theta)$. This triple function approximation scheme builds on the double function approximation scheme for non-hierarchical models in Chapter 8. The double function approximation scheme uses approximators for just the prior pdf $h(\theta)$ and the likelihood $g(x\mid\theta)$.

Figure 9.2: Hierarchical Bayes posterior pdf approximation using a fuzzy hyperprior. The plot shows the fuzzy approximation for 11 normal posterior pdfs. These posterior pdfs use two levels of prior pdfs. The first prior pdf $h(\theta\mid\tau)$ is $N(0,\tau)$ where $\tau$ is a random variance hyperparameter. The distribution of $\tau$ is the inverse-gamma (IG) hyperprior pdf $\pi(\tau) = IG(2,1)$ where $IG(\alpha,\beta) = \beta^{\alpha} e^{-\beta/\tau} / (\Gamma(\alpha)\,\tau^{\alpha+1})$. The likelihood function is $g(x\mid\theta) = N(\theta,\tfrac{1}{16})$. The 11 pdfs are posteriors for the observations $x = -4, -3.25, -1.75, -1, -0.25, 0.5, 1.25, 2.75$, and $3.5$. The approximate posterior $F(\theta\mid x)$ uses a fuzzy approximation for the inverse-gamma hyperprior $\pi(\tau)$ (1000 uniform sample points on the support [0, 4], 15 rules, and 6000 learning iterations). The posterior pdfs show the distribution of $\theta$ given the data $x$.

Theorem 9.1 below proves that the triple function approximation scheme produces uniform posterior pdf approximations when the input model-function approximators are uniform. This result extends the Bayesian Approximation Theorem in Chapter 8. So we call the theorem the Extended Bayesian Approximation Theorem. The proof of this theorem is quite general and does not depend on the structure of the uniform approximators. So the approximators can be fuzzy systems, neural networks, polynomials, or any other uniform function approximators.

Figure 9.3: Comparison between the inverse-gamma (IG) hyperprior $\pi(\tau)$ and its fuzzy approximation.
The hyperprior pdf is the $IG(2,1)$ pdf that describes the random parameter $\tau$ that appears as the variance in a normal prior. The approximating fuzzy hyperprior $\Pi(\tau)$ used 15 rules in a SAM with Gaussian if-part sets. The fuzzy approximator used 1000 uniform samples from $[0,4]$ and 6000 training iterations.

9.2.1 The Extended Bayesian Approximation Theorem

The statement and proof of the Extended Bayesian Approximation Theorem require the following notation. The hyperprior pdf is $\pi(\tau)$. The prior is $h(\theta\mid\tau)$ and the likelihood is $g(x\mid\theta)$. The 2-D pdf $p(\theta,\tau) = h(\theta\mid\tau)\,\pi(\tau)$ describes the dependence between $\theta$ and $\tau$. $P(\theta,\tau)$ is a 2-dimensional uniform approximator for $p(\theta,\tau) = h(\theta\mid\tau)\,\pi(\tau)$. $G(x\mid\theta)$ is a 2-dimensional uniform approximator for $g(x\mid\theta)$. Let $D$ denote the set of all $(\theta,\tau)$ and let $X$ denote the set of all $x$. Assume that $D$ and $X$ are compact. Define the Bayes factors as $q(x) = \int_D p(\theta,\tau)\,g(x\mid\theta)\,d\theta\,d\tau$ and $Q(x) = \int_D P(\theta,\tau)\,G(x\mid\theta)\,d\theta\,d\tau$. Assume that $q(x) > 0$ so that the posterior $f(\theta,\tau\mid x)$ is well-defined for any sample data $x$.

We can now state the Extended Bayesian Approximation Theorem. The proof relies on the Extreme Value Theorem [216]. The proof applies to any kind of uniform approximator. The vector structure of the proof also allows the hyperprior to depend on its own hyperprior and so on.

Theorem 9.1 [Extended Bayesian Approximation Theorem]: Suppose that $h(\theta\mid\tau)$, $\pi(\tau)$, and $g(x\mid\theta)$ are bounded and continuous. Suppose that
$$\Pi(\tau)\,H(\theta\mid\tau)\,G(x\mid\theta) = P(\theta,\tau)\,G(x\mid\theta) \neq 0 \qquad (9.3)$$
almost everywhere. Then the posterior pdf approximator $F(\theta,\tau\mid x) = PG/Q$ uniformly approximates $f(\theta,\tau\mid x)$. Thus for all $\varepsilon > 0$:
$$|F(\theta,\tau\mid x) - f(\theta,\tau\mid x)| < \varepsilon$$
for all $x$ and all $(\theta,\tau)$.

Proof. Write the posterior pdf $f(\theta,\tau\mid x)$ as $f(\theta,\tau\mid x) = \frac{p(\theta,\tau)\,g(x\mid\theta)}{q(x)}$ and its approximator $F(\theta,\tau\mid x)$ as $F(\theta,\tau\mid x) = \frac{P(\theta,\tau)\,G(x\mid\theta)}{Q(x)}$. The SAM approximations for the prior and likelihood functions are uniform [171]. So they have approximation error bounds $\varepsilon_p$ and $\varepsilon_g$ that do not depend on $x$ or $(\theta,\tau)$:
$$|\Delta P| < \varepsilon_p \qquad (9.4)$$
and
$$|\Delta G| < \varepsilon_g \qquad (9.5)$$
where $\Delta P = P(\theta,\tau) - p(\theta,\tau)$ and $\Delta G = G(x\mid\theta) - g(x\mid\theta)$.
The posterior error $\Delta F$ is
$$\Delta F = F - f = \frac{PG}{Q(x)} - \frac{pg}{q(x)}. \qquad (9.7)$$
Expand $PG$ in terms of the approximation errors to get
$$PG = (\Delta P + p)(\Delta G + g) \qquad (9.8)$$
$$= \Delta P\,\Delta G + \Delta P\,g + p\,\Delta G + pg. \qquad (9.9)$$
We have assumed that $PG \neq 0$ almost everywhere and so $Q \neq 0$. We now derive an upper bound for the Bayes-factor error $\Delta Q = Q - q$:
$$\Delta Q = \int_D (\Delta P\,\Delta G + \Delta P\,g + p\,\Delta G + pg - pg)\,d\theta\,d\tau. \qquad (9.10)$$
So
$$|\Delta Q| \le \int_D |\Delta P\,\Delta G + \Delta P\,g + p\,\Delta G|\,d\theta\,d\tau \qquad (9.11)$$
$$\le \int_D \big(|\Delta P||\Delta G| + |\Delta P|\,g + p\,|\Delta G|\big)\,d\theta\,d\tau \qquad (9.12)$$
$$< \int_D (\varepsilon_p\,\varepsilon_g + \varepsilon_p\,g + p\,\varepsilon_g)\,d\theta\,d\tau \quad \text{by (9.4)-(9.5)}. \qquad (9.13)$$
The parameter set $D$ has finite Lebesgue measure $m(D) = \int_D d\theta\,d\tau < \infty$ because $D$ is a compact subset of a metric space and thus [216] it is (totally) bounded. Then the bound on $\Delta Q$ becomes
$$|\Delta Q| < m(D)\,\varepsilon_p\,\varepsilon_g + \varepsilon_g + \varepsilon_p \int_D g(x\mid\theta)\,d\theta \qquad (9.14)$$
because $\int_D p(\theta,\tau)\,d\theta\,d\tau = 1$ and $g$ has no dependence on $\tau$.

We now invoke the extreme value theorem [97]. The extreme value theorem states that a continuous function on a compact set attains both its maximum and its minimum. The extreme value theorem allows us to use maxima and minima instead of suprema and infima. Now $\int_D g(x\mid\theta)\,d\theta$ is a continuous function of $x$ because $g(x\mid\theta)$ is a continuous nonnegative function. The range of $\int_D g(x\mid\theta)\,d\theta$ is a subset of the right half-line $(0,\infty)$ and its domain is the compact set $X$. So $\int_D g(x\mid\theta)\,d\theta$ attains a finite maximum value. Thus
$$|\Delta Q| < \varepsilon_q \qquad (9.15)$$
where we define the error bound $\varepsilon_q$ as
$$\varepsilon_q = m(D)\,\varepsilon_p\,\varepsilon_g + \varepsilon_g + \varepsilon_p \max_x \int_D g(x\mid\theta)\,d\theta. \qquad (9.16)$$
Rewrite the posterior approximation error $\Delta F$ as
$$\Delta F = \frac{q\,PG - Q\,pg}{qQ} \qquad (9.17)$$
$$= \frac{q(\Delta P\,\Delta G + \Delta P\,g + p\,\Delta G + pg) - Q\,pg}{q(q + \Delta Q)}. \qquad (9.18)$$
Inequality (9.15) implies that $-\varepsilon_q < \Delta Q < \varepsilon_q$ and that $(q - \varepsilon_q) < (q + \Delta Q) < (q + \varepsilon_q)$. Then (9.4)-(9.5) give similar inequalities for $\Delta P$ and $\Delta G$. So
$$\frac{q\,[-\varepsilon_p\varepsilon_g - \min(g)\,\varepsilon_p - \min(p)\,\varepsilon_g] - \varepsilon_q\,pg}{q(q - \varepsilon_q)} < \Delta F < \frac{q\,[\varepsilon_p\varepsilon_g + \max(g)\,\varepsilon_p + \max(p)\,\varepsilon_g] + \varepsilon_q\,pg}{q(q - \varepsilon_q)}. \qquad (9.19)$$
The extreme value theorem ensures that the maxima in (9.19) are finite. The bound on the approximation error $\Delta F$ does not depend on $(\theta,\tau)$. But $q$ still depends on the value of the data sample $x$. So (9.19) guarantees at best a pointwise approximation of $f(\theta,\tau\mid x)$ when $x$ is arbitrary. We can improve the result by finding bounds for $q$ that do not depend on $x$.
Note that $q(x)$ is a continuous function of $x \in X$ because $pg$ is continuous. So the extreme value theorem ensures that the Bayes factor $q$ has a finite upper bound and a positive lower bound. The term $q(x)$ attains its maximum and minimum by the extreme value theorem. The minimum of $q(x)$ is positive because we assumed $q(x) > 0$ for all $x$. Hölder's inequality gives $|q| \le \left(\int_D |p|\,d\theta\,d\tau\right)\|g(x,\cdot)\|_\infty = \|g(x,\cdot)\|_\infty$ since $p$ is a pdf. So the maximum of $q(x)$ is finite because $g$ is bounded: $0 < \min\{q(x)\} \le \max\{q(x)\} < \infty$. Then
$$\varepsilon^- < \Delta F < \varepsilon^+ \qquad (9.20)$$
if we define the error bounds $\varepsilon^-$ and $\varepsilon^+$ as
$$\varepsilon^- = \frac{-(\varepsilon_p\varepsilon_g + \min\{g\}\,\varepsilon_p + \min\{p\}\,\varepsilon_g)\min\{q\} - \varepsilon_q\,pg}{\min\{q\}(\min\{q\} - \varepsilon_q)} \qquad (9.21)$$
$$\varepsilon^+ = \frac{(\varepsilon_p\varepsilon_g + \max\{g\}\,\varepsilon_p + \max\{p\}\,\varepsilon_g)\max\{q\} + \varepsilon_q\,pg}{\min\{q\}(\min\{q\} - \varepsilon_q)}. \qquad (9.22)$$
Now $\varepsilon_q \to 0$ as $\varepsilon_g \to 0$ and $\varepsilon_p \to 0$. So $\varepsilon^- \to 0$ and $\varepsilon^+ \to 0$. The denominator of the error bounds must be non-zero for this limiting argument. We can guarantee this when $\varepsilon_q < \min\{q\}$. This condition is not restrictive because the functions $p$ and $g$ fix or determine $q$ independently of the approximators $P$ and $G$ and because $\varepsilon_q \to 0$ when $\varepsilon_p \to 0$ and $\varepsilon_g \to 0$. So we can achieve an arbitrarily small $\varepsilon_q$ that satisfies $\varepsilon_q < \min\{q\}$ by choosing appropriate $\varepsilon_p$ and $\varepsilon_g$. Then $\Delta F \to 0$ as $\varepsilon_g \to 0$ and $\varepsilon_p \to 0$. So $|\Delta F| \to 0$.

Theorem 9.1 now follows from Lemma 9.1 (below). Lemma 9.1 implies Theorem 9.1 because we have uniform approximators for $f(\theta,\tau\mid x)$. We can integrate the nuisance parameter $\tau$ away to get a posterior approximation in terms of $\theta$ alone. Thus $F \to f$ uniformly implies $\int F\,d\tau \to \int f\,d\tau$ uniformly. ∎

Lemma 9.1 [Uniform Integral Approximation Lemma]: If $Y$ is compact and $f_n \to f$ uniformly then
$$\int_Y f_n(x,y,\vec z)\,dy \to \int_Y f(x,y,\vec z)\,dy \quad \text{uniformly}. \qquad (9.23)$$

Proof. The uniform convergence of the sequence $f_n$ to $f$ implies that for all $\varepsilon > 0$ there is an $n \in \mathbb{N}$ such that
$$|f_n(x,y,\vec z) - f(x,y,\vec z)| < \varepsilon$$
for all $(x,y,\vec z) \in X \times Y \times \prod Z_i$. Thus
$$-\varepsilon < f_n(x,y,\vec z) - f(x,y,\vec z) < \varepsilon. \qquad (9.24)$$
Thus
$$-\int_Y \varepsilon\,dy < \int_Y f_n(x,y,\vec z)\,dy - \int_Y f(x,y,\vec z)\,dy < \int_Y \varepsilon\,dy.$$
$Y$ has finite Lebesgue measure $m(Y) = \int_Y dy$ because $Y$ is a compact set in a finite-dimensional metric space. Define $s_n(x,\vec z) = \int_Y f_n(x,y,\vec z)\,dy$ and $s(x,\vec z) = \int_Y f(x,y,\vec z)\,dy$. Then
$$-\varepsilon\,m(Y) < s_n(x,\vec z) - s(x,\vec z) < \varepsilon\,m(Y). \qquad (9.25)$$
Thus
$$|s_n(x,\vec z) - s(x,\vec z)| < \varepsilon\,m(Y). \qquad (9.26)$$
Define $\varepsilon'$ as $\varepsilon' = \varepsilon\,m(Y)$. Then for all $\varepsilon' > 0$ there exists an $n \in \mathbb{N}$ such that $|s_n(x,\vec z) - s(x,\vec z)| < \varepsilon'$ for all $(x,\vec z) \in X \times \prod Z_i$. Therefore
$$\int_Y f_n(x,y,\vec z)\,dy \to \int_Y f(x,y,\vec z)\,dy \qquad (9.27)$$
uniformly in $x$ and $\vec z$. ∎

Lemma 9.1 guarantees that uniform approximation still holds after marginalizing a multidimensional uniform approximator. The proofs of Theorem 9.1 and Lemma 9.1 also extend to $n$-dimensional fuzzy posterior approximators for $n$-dimensional functions $f$, i.e. data models with more than one hyperparameter (PGMs with longer linear chains of unobserved parameters than Figure 9.1). Thus an $n$-dimensional approximator $F$ and its marginalizations uniformly approximate the $n$-dimensional posterior $f$ and its marginal pdfs.

9.2.2 Adaptive Fuzzy Systems for Hierarchical Bayesian Inference

Hierarchical Bayesian inference uses priors with hyperparameters and thus multi-dimensional priors and likelihoods. These model functions are multi-dimensional and so are their approximators. Standard additive model (SAM) fuzzy systems [169]–[171], [210] can uniformly approximate multi-dimensional functions according to the Fuzzy Approximation Theorem [169]. Theorem 9.1 guarantees that these fuzzy approximators will produce a uniform approximator for the posterior. A prior parameter with one hyperparameter will have a conditional pdf $h(\theta\mid\tau)$, which has a 2-D functional representation. A 2-D adaptive SAM (ASAM) $H$ can learn to approximate $h$ uniformly. 2-D ASAMs tune 2-D set functions (like the 2-D sinc and Gaussian set functions below) using the appropriate learning laws.
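These ASAM tuning laws can be sketched in code for the 1-D Gaussian case. The sketch below mirrors the 1-D versions of the Gaussian learning laws that follow; the class name, initialization, step size, training target, and the small floor on the dispersions are illustrative assumptions rather than the dissertation's implementation:

```python
import numpy as np

class GaussianASAM:
    """Sketch of a 1-D adaptive SAM with Gaussian if-part sets and equal rule
    weights and then-part volumes. Gradient-descent learning laws tune the
    centers m, dispersions d, and centroids c on the squared error."""

    def __init__(self, n_rules, lo, hi):
        self.m = np.linspace(lo, hi, n_rules)            # if-part set centers
        self.d = np.full(n_rules, (hi - lo) / n_rules)   # if-part dispersions
        self.c = np.zeros(n_rules)                       # then-part centroids

    def _sets(self, x):
        return np.exp(-((x - self.m) / self.d) ** 2)

    def output(self, x):
        a = self._sets(x)
        return a.dot(self.c) / a.sum()

    def train_step(self, x, target, mu=0.02):
        a = self._sets(x)
        p = a / a.sum()                  # convex coefficients p_j(x)
        F = p.dot(self.c)
        eps = target - F                 # approximation error epsilon(x)
        self.m += mu * eps * p * (self.c - F) * (x - self.m) / self.d ** 2
        self.d += mu * eps * p * (self.c - F) * (x - self.m) ** 2 / self.d ** 3
        self.d = np.maximum(self.d, 1e-3)    # stabilization choice (assumption)
        self.c += mu * eps * p               # centroid learning law
```

Training on uniform samples of a target pdf (for example the truncated $IG(2,1)$ shape of Figure 9.3) drives the squared approximation error down over the learning iterations.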
The 2-D sinc set function $a_j$ has the form
$$a_j(x,y) = \mathrm{sinc}\!\left(\frac{x - m_{x,j}}{d_{x,j}}\right)\mathrm{sinc}\!\left(\frac{y - m_{y,j}}{d_{y,j}}\right) \qquad (9.28)$$
with parameter learning laws [171], [210] for the location $m$:
$$m_{x,j}(t+1) = m_{x,j}(t) + \mu_t\,\varepsilon(x,y)\,[c_j - F(x,y)]\,\frac{p_j(x,y)}{a_j(x,y)}\left[a_j(x,y) - \cos\!\left(\frac{x - m_{x,j}}{d_{x,j}}\right)\mathrm{sinc}\!\left(\frac{y - m_{y,j}}{d_{y,j}}\right)\right]\frac{1}{x - m_{x,j}} \qquad (9.29)$$
and for the dispersion $d$:
$$d_{x,j}(t+1) = d_{x,j}(t) + \mu_t\,\varepsilon(x,y)\,[c_j - F(x,y)]\left[a_j(x,y) - \cos\!\left(\frac{x - m_{x,j}}{d_{x,j}}\right)\mathrm{sinc}\!\left(\frac{y - m_{y,j}}{d_{y,j}}\right)\right]\frac{p_j(x,y)}{a_j(x,y)}\,\frac{1}{d_{x,j}}. \qquad (9.30)$$
The Gaussian learning laws have the same functional form as in the 1-D case. We replace $a_j(x)$ with $a_j(x,y)$:
$$a_j(x,y) = \exp\left[-\left(\frac{x - m_{x,j}}{d_{x,j}}\right)^2 - \left(\frac{y - m_{y,j}}{d_{y,j}}\right)^2\right] \qquad (9.31)$$
with parameter learning laws
$$m_{x,j}(t+1) = m_{x,j}(t) + \mu_t\,\varepsilon(x,y)\,p_j(x,y)\,[c_j - F(x,y)]\,\frac{x - m_{x,j}}{d_{x,j}^2} \qquad (9.32)$$
$$d_{x,j}(t+1) = d_{x,j}(t) + \mu_t\,\varepsilon(x,y)\,p_j(x,y)\,[c_j - F(x,y)]\,\frac{(x - m_{x,j})^2}{d_{x,j}^3}. \qquad (9.33)$$
These learning laws result from gradient descent on the squared approximation error of the associated SAM system.

9.2.3 Triply Fuzzy Bayesian Inference

The term triply fuzzy describes the special case where $\Pi(\tau)$, $H(\theta\mid\tau)$, and $G(x\mid\theta)$ are uniform fuzzy approximators. Triply fuzzy approximations allow users to state priors and hyperpriors in words or rules as well as to adapt them from sample data. Figure 9.4 shows a simulation instance of triply fuzzy function approximation in accord with the Extended Bayesian Approximation Theorem. It shows that the 2-D fuzzy approximator $F(\theta,\tau\mid x)$ approximates the posterior pdf $f(\theta,\tau\mid x) \propto g(x\mid\theta)\,h(\theta\mid\tau)\,\pi(\tau)$ for hierarchical Bayesian inference. The sample data $x$ is normal. A normal prior pdf $h(\theta\mid\tau) = N(1,\sqrt\tau)$ models the population mean of the data. An inverse-gamma $IG(2,1)$ hyperprior models the variance of the prior. An inverse-gamma hyperprior $\pi(\tau) = IG(\alpha,\beta)$ has the form $\pi(\tau) = \frac{\beta^\alpha\,e^{-\beta/\tau}}{\Gamma(\alpha)\,\tau^{\alpha+1}}$ for $\tau > 0$ where $\Gamma$ is the gamma function. The posterior fuzzy approximator $F(\theta,\tau\mid x)$ is proportional to the triple-product approximator $G(x\mid\theta)\,H(\theta\mid\tau)\,\Pi(\tau)$. These three adaptive SAMs separately approximate the three corresponding Bayesian pdfs. $G(x\mid\theta)$ approximates the 1-D likelihood $g(x\mid\theta)$. $H(\theta\mid\tau)$ approximates the 2-D conditional prior pdf $h(\theta\mid\tau)$. And $\Pi(\tau)$ approximates the 1-D hyperprior pdf $\pi(\tau)$. Figure 9.4(a) shows the approximand or original posterior pdf. Figure 9.4(b) shows the adapted triply fuzzy approximator of the posterior pdf using a conditional 2-D approximator $H(\theta\mid\tau)$ for $h(\theta\mid\tau)$ and a separate 1-D approximator $\Pi(\tau)$ for $\pi(\tau)$. Figure 9.4(c) shows a simulation instance where the posterior approximator $F(\theta,\tau\mid x)$ uses a single 2-D approximator $P(\theta,\tau)$ for the joint prior pdf $p(\theta,\tau) = h(\theta\mid\tau)\,\pi(\tau)$. Both fuzzy posterior approximators $F(\theta,\tau\mid x) \propto G(x\mid\theta)\,H(\theta\mid\tau)\,\Pi(\tau)$ and $F(\theta,\tau\mid x) \propto G(x\mid\theta)\,P(\theta,\tau)$ uniformly approximate the posterior pdf $f(\theta,\tau\mid x)$.

Figure 9.5 shows an example of triply fuzzy function approximation for an arbitrary non-conjugate Bayesian model. The likelihood is zero-mean Gaussian with unknown standard deviation $\sigma$. The prior $h(\sigma\mid\tau)$ for the standard deviation is a beta distribution that depends on a hyperparameter $\tau$ with a mixture-beta hyperprior $\pi(\tau)$. The 2-D fuzzy approximator $F(\sigma,\tau\mid x)$ uniformly approximates the posterior pdf $f(\sigma,\tau\mid x)$ for this arbitrary model. The quality of this arbitrary posterior approximation shows that approximation feasibility does not require conjugacy in the Bayesian data model.

9.3 Semi-conjugacy in Fuzzy Posterior Approximation

Iterative fuzzy Bayesian inference can lead to rule explosion. Iterative Bayesian updates will propagate the convex SAM structure from the fuzzy approximators to the posterior approximator, as Theorem 9.2 (and Theorem 8.1) shows. But the updates produce exponentially growing sets of rules and parameters. This is also true for Bayesian models with more than two nested levels. Standard Bayesian applications avoid a similar parameter explosion by using conjugate models. Conjugacy keeps the number of parameters constant with each update. We define the related idea of semi-conjugacy for fuzzy approximators in Bayesian models.
Semi-conjugacy is the property by which the if-part sets of the posterior fuzzy approximator inherit the shape of the if-part sets of the prior approximators. Theorem 9.3 and Corollaries 9.3–9.6 show that updates also preserve the shapes of the if-part sets (semi-conjugacy of if-part set functions) if the SAM fuzzy systems use if-part set functions that belong to conjugate families in Bayesian statistics.

Figure 9.4: Triply fuzzy Bayesian inference: comparison between a 2-D posterior $f(\theta,\tau\mid x) \propto g(x\mid\theta)\,h(\theta\mid\tau)\,\pi(\tau)$ and its triply fuzzy approximator $F(\theta,\tau\mid x)$. The first panel shows the approximand $f(\theta,\tau\mid x)$. The second panel shows a triply fuzzy approximator $F(\theta,\tau\mid x)$ that used a 2-D fuzzy approximation $H(\theta\mid\tau)$ for the conditional prior $h(\theta\mid\tau)$, a 1-D fuzzy approximation $\Pi(\tau)$ for the hyperprior pdf $\pi(\tau)$, and a 1-D fuzzy likelihood-pdf approximator $G(x\mid\theta)$. The third panel shows a triply fuzzy approximator $F(\theta,\tau\mid x)$ that used a 2-D fuzzy approximation $P(\theta,\tau) = (H\Pi)(\theta,\tau)$ for the joint prior $p(\theta,\tau) = (h\pi)(\theta,\tau)$. The likelihood approximation is the same as in the second panel. The sinc-SAM fuzzy approximators $H(\theta\mid\tau)$ and $P(\theta,\tau)$ use 6 rules to approximate the respective 2-D pdfs $h(\theta\mid\tau) = N(1,\sqrt\tau)$ and $h(\theta\mid\tau)\,\pi(\tau) = N(1,\sqrt\tau)\,IG(2,1)$. The hyperprior Gaussian-SAM approximator $\Pi(\tau)$ used 12 rules to approximate an inverse-gamma pdf $\pi(\tau) = IG(2,1)$. The Gaussian-SAM fuzzy likelihood approximator $G(x\mid\theta)$ used 15 rules to approximate the likelihood function $g(x\mid\theta) = N(\theta,\tfrac{1}{16})$ for $x = -0.25$. The 2-D conditional prior fuzzy approximator $H(\theta\mid\tau)$ used 15000 learning iterations based on 6000 uniform sample points. The hyperprior fuzzy approximator $\Pi(\tau)$ used 6000 iterations on 120 uniform sample points. The likelihood fuzzy approximator used 6000 iterations based on 500 uniform sample points.

Figure 9.6 shows examples of such if-part sets from Corollaries 9.3–9.6. The conjugacy of Gaussian if-part sets is straightforward.
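This Gaussian product can be checked numerically. The sketch below assumes the parameterization $G(m,d,\kappa;\theta) = \kappa\,e^{-(\theta-m)^2/d^2}$ of Corollary 9.3 and the example sets of Figure 9.6(a); the function names are illustrative:

```python
import numpy as np

def gauss_set(m, d, kappa, theta):
    """Gaussian if-part set G(m, d, kappa; theta) = kappa exp(-((theta-m)/d)^2)."""
    return kappa * np.exp(-((theta - m) / d) ** 2)

def product_params(mh, dh, kh, mg, dg, kg):
    """Closed-form parameters of the product of two Gaussian if-part sets
    (eqs. 9.93-9.95): the product is again a Gaussian set."""
    mF = (dg**2 * mh + dh**2 * mg) / (dg**2 + dh**2)
    dF = np.sqrt(dg**2 * dh**2 / (dg**2 + dh**2))
    kF = kh * kg * np.exp(-(mh - mg) ** 2 / (dg**2 + dh**2))
    return mF, dF, kF

theta = np.linspace(-5.0, 5.0, 1001)
# Sets a_h = G(1,1,1) and a_g = G(3,2,1) as in Figure 9.6(a).
mF, dF, kF = product_params(1.0, 1.0, 1.0, 3.0, 2.0, 1.0)
lhs = gauss_set(1.0, 1.0, 1.0, theta) * gauss_set(3.0, 2.0, 1.0, theta)
rhs = gauss_set(mF, dF, kF, theta)
```

The pointwise product `lhs` and the closed-form Gaussian `rhs` coincide on the grid, with center $m_F = \tfrac{7}{5}$ and scale $\kappa_F = e^{-4/5}$ as in Figure 9.6(a).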
The conjugacy of the beta, gamma, and Laplace if-part sets is only partial (semi-conjugacy) because we cannot combine the functions' exponents and because two beta set functions or two gamma set functions need not share the same supports.

Figure 9.5: Triply fuzzy Bayesian inference: comparison between a 2-D non-conjugate posterior $f(\sigma,\tau\mid x) \propto g(x\mid\sigma)\,h(\sigma\mid\tau)\,\pi(\tau)$ and its triply fuzzy approximator $F(\sigma,\tau\mid x)$. The first panel shows the approximand $f(\sigma,\tau\mid x)$. The second panel shows a triply fuzzy approximator $F(\sigma,\tau\mid x)$ that used a 2-D fuzzy approximation $H(\sigma\mid\tau)$ for the conditional prior $h(\sigma\mid\tau)$, a 1-D fuzzy approximation $\Pi(\tau)$ for the hyperprior pdf $\pi(\tau)$, and a 1-D fuzzy likelihood-pdf approximator $G(x\mid\sigma)$. The Gaussian-SAM fuzzy approximator $H(\sigma\mid\tau)$ used 6 rules to approximate the 2-D pdf $h(\sigma\mid\tau) = \beta(6\tau + 2, 4)$. The hyperprior Gaussian-SAM approximator $\Pi(\tau)$ used 12 rules to approximate a beta mixture pdf $\pi(\tau) = \tfrac{1}{3}\beta(12,4) + \tfrac{2}{3}\beta(4,7)$. The Gaussian-SAM fuzzy likelihood approximator $G(x\mid\sigma)$ used 12 rules to approximate the likelihood function $g(x\mid\sigma) = N(0,\sigma)$ for $x = 0.25$. The 2-D conditional prior fuzzy approximator $H(\sigma\mid\tau)$ used 6000 learning iterations based on 3970 uniform sample points. The hyperprior fuzzy approximator $\Pi(\tau)$ used 15000 iterations on 1000 uniform sample points. The likelihood fuzzy approximator $G(x\mid\sigma)$ used 15000 iterations based on 300 uniform sample points.

Theorem 9.2 [Preservation of the SAM convex structure in fuzzy Bayesian inference]:
(i) Doubly fuzzy posterior approximators are SAMs with product rules.
Suppose an $m_1$-rule SAM fuzzy system $G(x\mid\theta)$ approximates (or represents) a likelihood $g(x\mid\theta)$ and another SAM fuzzy system $H(\theta)$ approximates (or represents) a prior pdf $h(\theta)$ with $m_2$ rules:
$$G(x\mid\theta) = \frac{\sum_{j=1}^{m_1} w_{g,j}\,a_{g,j}(\theta)\,V_{g,j}\,c_{g,j}}{\sum_{i=1}^{m_1} w_{g,i}\,a_{g,i}(\theta)\,V_{g,i}} = \sum_{j=1}^{m_1} p_{g,j}(\theta)\,c_{g,j} \qquad (9.34)$$
$$H(\theta) = \frac{\sum_{j=1}^{m_2} w_{h,j}\,a_{h,j}(\theta)\,V_{h,j}\,c_{h,j}}{\sum_{j=1}^{m_2} w_{h,j}\,a_{h,j}(\theta)\,V_{h,j}} = \sum_{j=1}^{m_2} p_{h,j}(\theta)\,c_{h,j} \qquad (9.35)$$
where $p_{g,j}(\theta) = \frac{w_{g,j}\,a_{g,j}(\theta)\,V_{g,j}}{\sum_{i=1}^{m_1} w_{g,i}\,a_{g,i}(\theta)\,V_{g,i}}$ and $p_{h,j}(\theta) = \frac{w_{h,j}\,a_{h,j}(\theta)\,V_{h,j}}{\sum_{i=1}^{m_2} w_{h,i}\,a_{h,i}(\theta)\,V_{h,i}}$ are convex coefficients: $\sum_{j=1}^{m_1} p_{g,j}(\theta) = 1$ and $\sum_{j=1}^{m_2} p_{h,j}(\theta) = 1$.

Figure 9.6: Conjugacy and semi-conjugacy of the doubly fuzzy posterior if-part set functions $a_F(\theta) = a_h(\theta)\,a_g(\theta)$. (a) Gaussian set functions: the if-part sets have the form of (9.84), where $a_h(\theta) = G(1,1,1;\theta)$ and $a_g(\theta) = G(3,2,1;\theta)$ give the Gaussian $a_F(\theta) = G(\tfrac{7}{5}, \tfrac{4}{5}, e^{-4/5};\theta)$. (b) Beta set functions: the if-part sets have the form of (9.105), where $a_h(\theta) = B(0,4,2,3,29;\theta)$ and $a_g(\theta) = B(1,6,6,12,9{\times}10^{4};\theta)$ give a semi-beta $a_F(\theta)$. (c) Gamma set functions: the if-part sets have the form of (9.115), where $a_h = \Gamma(0,1,2,3,2.7;\theta)$ and $a_g = \Gamma(1,1,2,0.5,7.4;\theta)$ give a semi-gamma $a_F(\theta)$. (d) Laplace set functions: the if-part sets have the form of (9.122), where $a_h(\theta) = L(1,2;\theta)$ and $a_g(\theta) = L(3,3;\theta)$ give a semi-Laplace $a_F(\theta)$.
Then (a) and (b) hold:
(a) The fuzzy posterior approximator $F(\theta\mid x)$ is a SAM system with $m = m_1 m_2$ rules:
$$F(\theta\mid x) = \frac{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta)\,V_{F,i}\,c_{F,i}}{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta)\,V_{F,i}}. \qquad (9.36)$$
(b) The $m$ if-part set functions $a_{F,i}(\theta)$ of the fuzzy posterior approximator $F(\theta\mid x)$ are the products of the likelihood approximator's if-part sets $a_{g,j}(\theta)$ and the prior approximator's if-part sets $a_{h,k}(\theta)$:
$$a_{F,i}(\theta) = a_{g,j}(\theta)\,a_{h,k}(\theta) \qquad (9.37)$$
for $i = m_2(j-1) + k$, $j = 1,\ldots,m_1$, and $k = 1,\ldots,m_2$. The weights $w_{F,i}$, the then-part set volumes $V_{F,i}$, and the centroids $c_{F,i}$ also have the same likelihood-prior product form:
$$w_{F,i} = w_{g,j}\,w_{h,k} \qquad (9.38)$$
$$V_{F,i} = V_{g,j}\,V_{h,k} \qquad (9.39)$$
$$c_{F,i} = \frac{c_{g,j}\,c_{h,k}}{Q(x)}. \qquad (9.40)$$

(ii) Triply fuzzy posterior approximators and $n$-many fuzzy posterior approximators are SAMs with product rules.

Suppose an $m_1$-rule SAM fuzzy system $G(x\mid\theta)$ approximates (or represents) a likelihood $g(x\mid\theta)$, an $m_2$-rule SAM fuzzy system $H(\theta,\tau)$ approximates (or represents) a prior pdf $h(\theta\mid\tau)$, and an $m_3$-rule SAM fuzzy system $\Pi(\tau)$ approximates (or represents) a hyperprior pdf $\pi(\tau)$:
$$G(x\mid\theta) = \frac{\sum_{j=1}^{m_1} w_{g,j}\,a_{g,j}(\theta)\,V_{g,j}\,c_{g,j}}{\sum_{i=1}^{m_1} w_{g,i}\,a_{g,i}(\theta)\,V_{g,i}} = \sum_{j=1}^{m_1} p_{g,j}(\theta)\,c_{g,j} \qquad (9.41)$$
$$H(\theta,\tau) = \frac{\sum_{j=1}^{m_2} w_{h,j}\,a_{h,j}(\theta,\tau)\,V_{h,j}\,c_{h,j}}{\sum_{j=1}^{m_2} w_{h,j}\,a_{h,j}(\theta,\tau)\,V_{h,j}} = \sum_{j=1}^{m_2} p_{h,j}(\theta,\tau)\,c_{h,j} \qquad (9.42)$$
$$\Pi(\tau) = \frac{\sum_{j=1}^{m_3} w_{\pi,j}\,a_{\pi,j}(\tau)\,V_{\pi,j}\,c_{\pi,j}}{\sum_{j=1}^{m_3} w_{\pi,j}\,a_{\pi,j}(\tau)\,V_{\pi,j}} = \sum_{j=1}^{m_3} p_{\pi,j}(\tau)\,c_{\pi,j} \qquad (9.43)$$
where
$$p_{g,j}(\theta) = \frac{w_{g,j}\,a_{g,j}(\theta)\,V_{g,j}}{\sum_{i=1}^{m_1} w_{g,i}\,a_{g,i}(\theta)\,V_{g,i}}, \qquad (9.44)$$
$$p_{h,j}(\theta,\tau) = \frac{w_{h,j}\,a_{h,j}(\theta,\tau)\,V_{h,j}}{\sum_{i=1}^{m_2} w_{h,i}\,a_{h,i}(\theta,\tau)\,V_{h,i}}, \quad\text{and} \qquad (9.45)$$
$$p_{\pi,j}(\tau) = \frac{w_{\pi,j}\,a_{\pi,j}(\tau)\,V_{\pi,j}}{\sum_{i=1}^{m_3} w_{\pi,i}\,a_{\pi,i}(\tau)\,V_{\pi,i}} \qquad (9.46)$$
are convex coefficients: $\sum_{j=1}^{m_1} p_{g,j}(\theta) = 1$, $\sum_{j=1}^{m_2} p_{h,j}(\theta,\tau) = 1$, and $\sum_{j=1}^{m_3} p_{\pi,j}(\tau) = 1$.
Then (a) and (b) hold:
(a) The fuzzy posterior approximator $F(\theta,\tau\mid x)$ is a SAM system with $m = m_1 m_2 m_3$ rules:
$$F(\theta,\tau\mid x) = \frac{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta,\tau)\,V_{F,i}\,c_{F,i}}{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta,\tau)\,V_{F,i}}. \qquad (9.47)$$
(b) The $m$ if-part set functions $a_{F,i}(\theta,\tau)$ of the fuzzy posterior approximator $F(\theta,\tau\mid x)$ are the products of the likelihood approximator's if-part sets $a_{g,j}(\theta)$, the prior approximator's if-part sets $a_{h,k}(\theta,\tau)$, and the hyperprior approximator's if-part sets $a_{\pi,l}(\tau)$:
$$a_{F,i}(\theta,\tau) = a_{g,j}(\theta)\,a_{h,k}(\theta,\tau)\,a_{\pi,l}(\tau) \qquad (9.48)$$
for $i = l + m_3(k-1) + m_2 m_3(j-1)$, $j = 1,\ldots,m_1$, $k = 1,\ldots,m_2$, and $l = 1,\ldots,m_3$. The weights $w_{F,i}$, the then-part set volumes $V_{F,i}$, and the centroids $c_{F,i}$ also have the same likelihood-prior-hyperprior product form:
$$w_{F,i} = w_{g,j}\,w_{h,k}\,w_{\pi,l} \qquad (9.49)$$
$$V_{F,i} = V_{g,j}\,V_{h,k}\,V_{\pi,l} \qquad (9.50)$$
$$c_{F,i} = \frac{c_{g,j}\,c_{h,k}\,c_{\pi,l}}{Q(x)} \qquad (9.51)$$
where $Q(x) = \int_D G(x\mid\theta)\,H(\theta,\tau)\,\Pi(\tau)\,d\theta\,d\tau$. This implies that the $n$-many fuzzy posterior approximators are also SAMs with product rules.

Proof. (i) Doubly fuzzy case. The fuzzy system $F(\theta\mid x)$ has the form
$$F(\theta\mid x) = \frac{H(\theta)\,G(x\mid\theta)}{\int_D H(t)\,G(x\mid t)\,dt} \qquad (9.52)$$
$$= \frac{1}{Q(x)}\left(\sum_{j=1}^{m_1} p_{g,j}(\theta)\,c_{g,j}\right)\left(\sum_{j=1}^{m_2} p_{h,j}(\theta)\,c_{h,j}\right) \qquad (9.53)$$
$$= \sum_{j=1}^{m_1}\sum_{k=1}^{m_2} p_{g,j}(\theta)\,p_{h,k}(\theta)\,\frac{c_{g,j}\,c_{h,k}}{Q(x)} \qquad (9.54)$$
$$= \sum_{j=1}^{m_1}\sum_{k=1}^{m_2} \frac{w_{g,j}\,a_{g,j}(\theta)\,V_{g,j}}{\sum_{i=1}^{m_1} w_{g,i}\,a_{g,i}(\theta)\,V_{g,i}}\,\frac{w_{h,k}\,a_{h,k}(\theta)\,V_{h,k}}{\sum_{i=1}^{m_2} w_{h,i}\,a_{h,i}(\theta)\,V_{h,i}}\,\frac{c_{g,j}\,c_{h,k}}{Q(x)} \qquad (9.55)$$
$$= \frac{\sum_{j=1}^{m_1}\sum_{k=1}^{m_2} w_{g,j}\,w_{h,k}\,a_{g,j}(\theta)\,a_{h,k}(\theta)\,V_{g,j}\,V_{h,k}\,\frac{c_{g,j}\,c_{h,k}}{Q(x)}}{\sum_{j=1}^{m_1}\sum_{k=1}^{m_2} w_{g,j}\,w_{h,k}\,a_{g,j}(\theta)\,a_{h,k}(\theta)\,V_{g,j}\,V_{h,k}} \qquad (9.56)$$
$$= \frac{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta)\,V_{F,i}\,c_{F,i}}{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta)\,V_{F,i}} \qquad (9.57)$$
$$F(\theta\mid x) = \sum_{i=1}^{m} p_{F,i}(\theta)\,c_{F,i} \qquad (9.58)$$
where the $p_{F,i}(\theta)$ are the convex coefficients defined as
$$p_{F,i}(\theta) = \frac{w_{F,i}\,a_{F,i}(\theta)\,V_{F,i}}{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta)\,V_{F,i}}. \qquad (9.59)$$

(ii) Triply fuzzy case. The fuzzy system $F(\theta,\tau\mid x)$ has the form
$$F(\theta,\tau\mid x) = \frac{G(x\mid\theta)\,H(\theta,\tau)\,\Pi(\tau)}{\iint_D G(x\mid t)\,H(t,s)\,\Pi(s)\,dt\,ds} \qquad (9.60)$$
$$= \frac{1}{Q(x)}\left(\sum_{j=1}^{m_1} p_{g,j}(\theta)\,c_{g,j}\right)\left(\sum_{j=1}^{m_2} p_{h,j}(\theta,\tau)\,c_{h,j}\right)\left(\sum_{j=1}^{m_3} p_{\pi,j}(\tau)\,c_{\pi,j}\right) \qquad (9.61)$$
$$= \sum_{j=1}^{m_1}\sum_{k=1}^{m_2}\sum_{l=1}^{m_3} p_{g,j}(\theta)\,p_{h,k}(\theta,\tau)\,p_{\pi,l}(\tau)\,\frac{c_{g,j}\,c_{h,k}\,c_{\pi,l}}{Q(x)} \qquad (9.62)$$
$$= \sum_{j=1}^{m_1}\sum_{k=1}^{m_2}\sum_{l=1}^{m_3} \frac{w_{g,j}\,a_{g,j}(\theta)\,V_{g,j}}{\sum_{i=1}^{m_1} w_{g,i}\,a_{g,i}(\theta)\,V_{g,i}}\,\frac{w_{h,k}\,a_{h,k}(\theta,\tau)\,V_{h,k}}{\sum_{i=1}^{m_2} w_{h,i}\,a_{h,i}(\theta,\tau)\,V_{h,i}}\,\frac{w_{\pi,l}\,a_{\pi,l}(\tau)\,V_{\pi,l}}{\sum_{i=1}^{m_3} w_{\pi,i}\,a_{\pi,i}(\tau)\,V_{\pi,i}}\,\frac{c_{g,j}\,c_{h,k}\,c_{\pi,l}}{Q(x)} \qquad (9.63)$$
$$= \frac{\sum_{j=1}^{m_1}\sum_{k=1}^{m_2}\sum_{l=1}^{m_3} w_{g,j}\,w_{h,k}\,w_{\pi,l}\,a_{g,j}(\theta)\,a_{h,k}(\theta,\tau)\,a_{\pi,l}(\tau)\,V_{g,j}\,V_{h,k}\,V_{\pi,l}\,\frac{c_{g,j}\,c_{h,k}\,c_{\pi,l}}{Q(x)}}{\sum_{j=1}^{m_1}\sum_{k=1}^{m_2}\sum_{l=1}^{m_3} w_{g,j}\,w_{h,k}\,w_{\pi,l}\,a_{g,j}(\theta)\,a_{h,k}(\theta,\tau)\,a_{\pi,l}(\tau)\,V_{g,j}\,V_{h,k}\,V_{\pi,l}}. \qquad (9.64)$$
Therefore $F(\theta,\tau\mid x)$ has the SAM convex structure
$$F(\theta,\tau\mid x) = \frac{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta,\tau)\,V_{F,i}\,c_{F,i}}{\sum_{i=1}^{m} w_{F,i}\,a_{F,i}(\theta,\tau)\,V_{F,i}} = \sum_i p_{F,i}(\theta,\tau)\,c_{F,i}. \qquad (9.65)$$
The index $i$ is shorthand for the triple summation indices $(j,k,l)$. The parameters for the centroids $c_{F,i}$ and the convex coefficients $p_{F,i}$ are:
$$w_{F,i} = w_{g,j}\,w_{h,k}\,w_{\pi,l} \qquad (9.66)$$
$$a_{F,i}(\theta,\tau) = a_{g,j}(\theta)\,a_{h,k}(\theta,\tau)\,a_{\pi,l}(\tau) \qquad (9.67)$$
$$V_{F,i} = V_{g,j}\,V_{h,k}\,V_{\pi,l} \qquad (9.68)$$
and
$$c_{F,i} = \frac{c_{g,j}\,c_{h,k}\,c_{\pi,l}}{Q(x)}. \qquad (9.69)$$
∎

Corollary 9.1. Suppose a 2-rule fuzzy system $G(x\mid\theta)$ represents a likelihood $g(x\mid\theta)$ and an $m$-rule system $H(\theta)$ approximates the prior pdf $h(\theta)$. Then the fuzzy-based posterior (or "updated" system) $F(\theta\mid x)$ is a SAM fuzzy system with $2m$ rules.

Proof. Suppose a 2-rule fuzzy system $G(x\mid\theta)$ represents a likelihood $g(x\mid\theta)$:
$$G(x\mid\theta) = \sum_{j=1}^{2} p_{g,j}(\theta)\,c_{g,j} = \sum_{j=1}^{2} a_{g,j}(\theta)\,c_{g,j} \qquad (9.70)$$
where the if-part set functions have the form (from the Watkins Representation Theorem)
$$a_{g,1}(x\mid\theta) = \frac{g(x\mid\theta) - \inf g(x\mid\theta)}{\sup g(x\mid\theta) - \inf g(x\mid\theta)} \qquad (9.71)$$
$$a_{g,2}(x\mid\theta) = a^c_{g,1}(\theta) = 1 - a_{g,1}(x\mid\theta) \qquad (9.72)$$
$$= \frac{\sup g(x\mid\theta) - g(x\mid\theta)}{\sup g(x\mid\theta) - \inf g(x\mid\theta)} \qquad (9.73)$$
and the centroids are $c_{g,1} = \sup g$ and $c_{g,2} = \inf g$. And suppose that an $m$-rule fuzzy system $H(\theta)$ with equal weights $w_1 = \cdots = w_m$ and volumes $V_1 = \cdots = V_m$ approximates (or represents) the prior $h(\theta)$.
Then (9.36) becomes
$$F(\theta\mid x) = \frac{\sum_{j=1}^{m}\sum_{k=1}^{2} a_{g,k}(x\mid\theta)\,a_{h,j}(\theta)\,\frac{c_{g,k}\,c_{h,j}}{Q(x)}}{\sum_{j=1}^{m}\sum_{k=1}^{2} a_{g,k}(x\mid\theta)\,a_{h,j}(\theta)} \qquad (9.74)$$
$$= \frac{\sum_{j=1}^{m}\left[a_{g,1}(x\mid\theta)\,a_{h,j}(\theta)\,\frac{c_{g,1}\,c_{h,j}}{Q(x)} + a_{g,2}(x\mid\theta)\,a_{h,j}(\theta)\,\frac{c_{g,2}\,c_{h,j}}{Q(x)}\right]}{\sum_{j=1}^{m}\left[a_{g,1}(x\mid\theta)\,a_{h,j}(\theta) + a_{g,2}(x\mid\theta)\,a_{h,j}(\theta)\right]} \qquad (9.75)$$
$$= \frac{\sum_{j=1}^{m}\left[a_{g,1}(x\mid\theta)\,a_{h,j}(\theta)\,\frac{c_{g,1}\,c_{h,j}}{Q(x)} + (1 - a_{g,1}(x\mid\theta))\,a_{h,j}(\theta)\,\frac{c_{g,2}\,c_{h,j}}{Q(x)}\right]}{\sum_{j=1}^{m}\left[a_{g,1}(x\mid\theta)\,a_{h,j}(\theta) + (1 - a_{g,1}(x\mid\theta))\,a_{h,j}(\theta)\right]}. \qquad (9.76)$$
∎

The above results imply that the number of rules of a fuzzy system $F(\theta\mid x)$ after $n$ stages will be $m_1^n m_2 = 2^n m$. So the iterative fuzzy posterior approximator will in general suffer from exponential rule explosion. At least one practical special case avoids this exponential rule explosion and produces only linear or quadratic growth in fuzzy-posterior rules in iterative Bayesian inference. Suppose that we can keep track of past data in the Bayesian inference and that $g(x_1,\ldots,x_n\mid\theta) = g(\bar x_n\mid\theta)$. Then we can compute the likelihood $g(\bar x_n\mid\theta)$ from $g(\bar x_{n-1}\mid\theta)$ for any new data $x_n$. Then we can update the original prior $H(\theta)$ and keep the number of rules at $2m$ (or $m^2$) if the likelihood fuzzy system uses two rules (or $m$ rules).

Corollary 9.2. Posterior representation with fuzzy representations of $h(\theta)$ and $g(x\mid\theta)$. Suppose a 2-rule fuzzy system $G(x\mid\theta)$ represents a likelihood function $g(x\mid\theta)$ and a 2-rule system $H(\theta)$ represents the prior $h(\theta)$. Then the fuzzy-based posterior $F(\theta\mid x)$ is a SAM fuzzy system with 4 ($2\times 2$) rules.

Proof. Suppose a 2-rule fuzzy system $G(x\mid\theta)$ represents a likelihood $g(x\mid\theta)$ as in (9.70)-(9.73). The 2-rule fuzzy system $H(\theta)$ likewise represents the prior pdf $h(\theta)$:
$$H(\theta) = \sum_{k=1}^{2} p_{h,k}(\theta)\,c_{h,k} = \sum_{k=1}^{2} a_{h,k}(\theta)\,c_{h,k}. \qquad (9.77)$$
The Watkins Representation Theorem implies that the if-part set functions have the form
$$a_{h,1}(\theta) = \frac{h(\theta) - \inf h(\theta)}{\sup h(\theta) - \inf h(\theta)} \qquad (9.78)$$
$$a_{h,2}(\theta) = a^c_{h,1}(\theta) = 1 - a_{h,1}(\theta) \qquad (9.79)$$
$$= \frac{\sup h(\theta) - h(\theta)}{\sup h(\theta) - \inf h(\theta)} \qquad (9.80)$$
with centroids $c_{h,1} = \sup h$ and $c_{h,2} = \inf h$.
Then the SAM posterior $F(\theta\mid x)$ in (9.36) represents $f(\theta\mid x)$ with 4 rules:
$$F(\theta\mid x) = \frac{\sum_{j=1}^{2}\sum_{k=1}^{2} a_{g,j}(x\mid\theta)\,a_{h,k}(\theta)\,\frac{c_{g,j}\,c_{h,k}}{Q(x)}}{\sum_{j=1}^{2}\sum_{k=1}^{2} a_{g,j}(x\mid\theta)\,a_{h,k}(\theta)} \qquad (9.81)$$
$$= \sum_{j=1}^{2}\sum_{k=1}^{2} a_{g,j}(x\mid\theta)\,a_{h,k}(\theta)\,\frac{c_{g,j}\,c_{h,k}}{q(x)} \qquad (9.82)$$
$$= \sum_{i=1}^{4} a_{F,i}(\theta)\,c_{F,i} \qquad (9.83)$$
because $\sum_j a_{g,j}(x\mid\theta) = \sum_k a_{h,k}(\theta) = 1$ and $Q(x) = q(x)$ in (9.81). ∎

Figure 9.7 shows the if-part sets $a_{h,k}(\theta)$ of the 2-rule SAM $H(\theta)$ that represents the beta prior $h(\theta) \sim \beta(9,9)$ and the if-part sets $a_{g,j}(\theta)$ of the 2-rule SAM $G(x\mid\theta)$ that represents the binomial likelihood $g(20\mid\theta) \sim \mathrm{bin}(20,80)$. The resulting SAM posterior $F(\theta\mid 20)$ that represents $f(\theta\mid 20) \sim \beta(29,69)$ has four rules with if-part sets $a_{F,i}(\theta) = a_{g,j}(\theta)\,a_{h,k}(\theta)$. The next theorem gives the main result on the conjugacy structure of doubly and triply fuzzy systems.

Figure 9.7: Doubly fuzzy posterior representation. Top: two if-part sets $a_{g,j}(\theta)$ of the two-rule SAM likelihood representation $G(x\mid\theta) = g(20\mid\theta) \sim \mathrm{bin}(20,80)$ and two if-part sets $a_{h,k}(\theta)$ of the 2-rule SAM prior representation $H(\theta) = h(\theta) \sim \beta(9,9)$. Bottom: four if-part sets $a_{F,i}(\theta) = a_{g,j}(\theta)\,a_{h,k}(\theta)$ of the 4-rule SAM posterior representation $F(\theta\mid x) = f(\theta\mid x)$.

Theorem 9.3 [Semi-Conjugacy]:
(i) The if-part sets of a doubly fuzzy posterior approximator are conjugate to the if-part sets of the fuzzy prior approximator. The product fuzzy if-part set functions $a_{F,i}(\theta)$ in Theorem 9.2.i(b) have the same functional form as the if-part prior set functions $a_{h,k}$ if $a_{h,k}$ is conjugate to the if-part likelihood set function $a_{g,j}$.
(ii) The if-part sets of a triply fuzzy posterior approximator are conjugate to the if-part sets of the fuzzy prior approximator. The product fuzzy if-part set functions $a_{F,i}(\theta,\tau)$ in Theorem 9.2.ii(b) have the same functional form as the if-part prior set functions $a_{h,k}$ if $a_{h,k}$ is conjugate to the if-part likelihood set function $a_{g,j}$ and to the if-part hyperprior set function $a_{\pi,l}$.

Proof.
The product $a_{F,i}(\theta) = a_{g,j}(\theta)\,a_{h,k}(\theta)$ of two conjugate functions $a_{g,j}$ and $a_{h,k}$ will still have the same functional form as $a_{g,j}(\theta)$ and $a_{h,k}(\theta)$. The $n$ parameters $\alpha_1,\ldots,\alpha_n$ define the if-part likelihood set function: $a_{g,j}(\theta) = f(\alpha_1,\ldots,\alpha_n;\theta)$. The $n$ parameters $\beta_1,\ldots,\beta_n$ likewise define the if-part prior set function $a_{h,k}(\theta)$ with the same functional form: $a_{h,k}(\theta) = f(\beta_1,\ldots,\beta_n;\theta)$. Then $a_{F,i}(\theta)$ also has the same functional form $f$ given $n$ parameters $\gamma_1,\ldots,\gamma_n$: $a_{F,i}(\theta) = f(\gamma_1,\ldots,\gamma_n;\theta)$ where $\gamma_l = g_l(\alpha_1,\ldots,\alpha_n,\beta_1,\ldots,\beta_n)$ for $l = 1,\ldots,n$ for some functions $g_1,\ldots,g_n$ that do not depend on $\theta$. ∎

Corollary 9.3. Conjugacy of Gaussian if-part sets.
(i) Doubly fuzzy case. Suppose that the SAM-based prior $H(\theta)$ uses Gaussian if-part sets $a_{h,k}(\theta) = G(m_{h,k}, d_{h,k}, \kappa_{h,k}; \theta)$ and the SAM-based likelihood $G(x\mid\theta)$ also uses Gaussian if-part sets $a_{g,j}(\theta) = G(m_{g,j}, d_{g,j}, \kappa_{g,j}; \theta)$ where
$$G(m, d, \kappa; \theta) = \kappa\,e^{-(\theta - m)^2/d^2} \qquad (9.84)$$
for some positive constant $\kappa > 0$. Then $F(\theta\mid x)$ in (9.36) will have set functions $a_{F,i}(\theta)$ that are also Gaussian:
$$a_{F,i}(\theta) = G(m_{F,i}, d_{F,i}, \kappa_{F,i}; \theta). \qquad (9.85)$$
(ii) Triply fuzzy case. Suppose that the SAM-based prior $H(\theta,\tau)$ uses factorable (product) Gaussian if-part sets
$$a_{h,k}(\theta,\tau) = G(m_{h,k}, d_{h,k}, \kappa_{h,k}; \theta)\,G(m'_{h,k}, d'_{h,k}, \kappa'_{h,k}; \tau), \qquad (9.86)$$
the SAM-based likelihood $G(x\mid\theta)$ uses Gaussian if-part sets
$$a_{g,j}(\theta) = G(m_{g,j}, d_{g,j}, \kappa_{g,j}; \theta), \qquad (9.87)$$
and the SAM-based hyperprior $\Pi(\tau)$ also uses Gaussian if-part sets
$$a_{\pi,l}(\tau) = G(m_{\pi,l}, d_{\pi,l}, \kappa_{\pi,l}; \tau). \qquad (9.88)$$
Then $F(\theta,\tau\mid x)$ in (9.47) will have set functions $a_{F,i}(\theta,\tau)$ that are products of two Gaussian sets:
$$a_{F,i}(\theta,\tau) = G(m_{F,i}, d_{F,i}, \kappa_{F,i}; \theta)\,G(m'_{F,i}, d'_{F,i}, \kappa'_{F,i}; \tau). \qquad (9.89)$$

Proof. Gaussian if-part sets are self-conjugate because of their exponential structure.
(i) Doubly fuzzy case.
$$a_{F,i}(\theta) = a_{g,j}(\theta)\,a_{h,k}(\theta) \qquad (9.90)$$
$$= \kappa_{F,i}\,e^{-(\theta - m_{F,i})^2/d_{F,i}^2} \qquad (9.91)$$
$$= G(m_{F,i}, d_{F,i}, \kappa_{F,i}; \theta) \qquad (9.92)$$
where
$$m_{F,i} = \frac{d_{g,j}^2\,m_{h,k} + d_{h,k}^2\,m_{g,j}}{d_{g,j}^2 + d_{h,k}^2} \qquad (9.93)$$
$$d_{F,i}^2 = \frac{d_{g,j}^2\,d_{h,k}^2}{d_{g,j}^2 + d_{h,k}^2} \qquad (9.94)$$
$$\kappa_{F,i} = \kappa_{h,k}\,\kappa_{g,j}\,\exp\!\left\{-\frac{(m_{h,k} - m_{g,j})^2}{d_{g,j}^2 + d_{h,k}^2}\right\} \qquad (9.95)$$
for $j = 1,\ldots,m_1$, $k = 1,\ldots,m_2$, and $i = m_2(j-1) + k$.
(ii) Triply fuzzy case.
$$a_{F,i}(\theta,\tau) = a_{g,j}(\theta)\,a_{h,k}(\theta,\tau)\,a_{\pi,l}(\tau) \qquad (9.96)$$
$$= \kappa_{F,i}\,e^{-(\theta - m_{F,i})^2/d_{F,i}^2}\,\kappa'_{F,i}\,e^{-(\tau - m'_{F,i})^2/d'^2_{F,i}} \qquad (9.97)$$
$$= G(m_{F,i}, d_{F,i}, \kappa_{F,i}; \theta)\,G(m'_{F,i}, d'_{F,i}, \kappa'_{F,i}; \tau) \qquad (9.98)$$
where
$$m_{F,i} = \frac{d_{g,j}^2\,m_{h,k} + d_{h,k}^2\,m_{g,j}}{d_{g,j}^2 + d_{h,k}^2} \qquad (9.99)$$
$$d_{F,i}^2 = \frac{d_{g,j}^2\,d_{h,k}^2}{d_{g,j}^2 + d_{h,k}^2} \qquad (9.100)$$
$$\kappa_{F,i} = \kappa_{h,k}\,\kappa_{g,j}\,\exp\!\left\{-\frac{(m_{h,k} - m_{g,j})^2}{d_{g,j}^2 + d_{h,k}^2}\right\} \qquad (9.101)$$
$$m'_{F,i} = \frac{d_{\pi,l}^2\,m'_{h,k} + d'^2_{h,k}\,m_{\pi,l}}{d_{\pi,l}^2 + d'^2_{h,k}} \qquad (9.102)$$
$$d'^2_{F,i} = \frac{d_{\pi,l}^2\,d'^2_{h,k}}{d_{\pi,l}^2 + d'^2_{h,k}} \qquad (9.103)$$
$$\kappa'_{F,i} = \kappa'_{h,k}\,\kappa_{\pi,l}\,\exp\!\left\{-\frac{(m'_{h,k} - m_{\pi,l})^2}{d_{\pi,l}^2 + d'^2_{h,k}}\right\} \qquad (9.104)$$
for $j = 1,\ldots,m_1$, $k = 1,\ldots,m_2$, $l = 1,\ldots,m_3$, and $i = l + m_3(k-1) + m_2 m_3(j-1)$. ∎

Corollary 9.3 also shows that if the fuzzy approximator $H(\theta,\tau)$ uses product if-part set functions $a_h(\theta,\tau) = a_h(\theta)\,a_h(\tau)$ then the fuzzy posterior $F(\theta,\tau\mid x)$ also has product if-part sets $a_F(\theta,\tau) = a_F(\theta)\,a_F(\tau)$. This holds for higher-dimensional fuzzy approximators for Bayesian inference. Thus the corollaries below state only the results for the doubly fuzzy case.

Corollary 9.4. Semi-conjugacy of beta if-part sets. Suppose that the SAM-based prior $H(\theta)$ uses beta (or binomial) if-part sets $a_{h,k}(\theta) = B(m_{h,k}, d_{h,k}, \alpha_{h,k}, \beta_{h,k}, \kappa_{h,k}; \theta)$ and the SAM-based likelihood $G(x\mid\theta)$ also uses beta (or binomial) if-part sets $a_{g,j}(\theta) = B(m_{g,j}, d_{g,j}, \alpha_{g,j}, \beta_{g,j}, \kappa_{g,j}; \theta)$ where
$$B(m, d, \alpha, \beta, \kappa; \theta) = \kappa\left(\frac{\theta - m}{d}\right)^{\alpha}\left(1 - \frac{\theta - m}{d}\right)^{\beta} \qquad (9.105)$$
a F;i () = a g;j ()a h;k () (9.107) = F;i m h;k d h;k h;k 1 ( m h;k d h;k ) h;k m g;j d g;j g;j 1 ( m g;j d g;j ) g;j (9.108) = F;i m h;k d h;k h;k + g;j jk () 1 ( m h;k d h;k ) h;k + g;j jk () (9.109) if 0< m h;k d h;k < 1 and 0< m g;j d g;j < 1 or if 2 (m h;k ;m h;k +d h;k )\ (m g;j ;m g;j +d g;j ) where jk () = log ( m h;k d h;k ) ( m g;j d g;j ) (9.110) jk () = log (1 m h;k d h;k ) (1 m g;j d g;j ): (9.111) A special case occurs if m h;k = m g;j and d h;k = d g;j . Then a F;i has the beta conjugate form: a F;i () = F;i m h;k d h;k F;i 1 ( m h;k d h;k ) F;i (9.112) =B(m h;k ;d h;k ; F;i ; F;i ; F;i ;) (9.113) if 0< m h;k d h;k < 1. Here F;i = h;k + g;j , F;i = h;k + g;j , and F;i = h;k g;j . The if-part fuzzy sets of the posterior approximation in (9.109) have beta-like form but with exponents that also depend on . Suppose we repeat the updating of the prior-posterior. Then the nal posterior will still have the beta-like if-part sets of the 201 form a F;s () = F;s m h;k d h;k h;k + P i g;i ik () 1 ( m h;k d h;k ) h;k + P i g;i ik () (9.114) for 2D =\ i (m g;i ;m g;i +d g;i )\ (m h;k ;m h;k +d h;k ). Corollary 9.5. Semi-conjugacy of gamma if-part sets. Suppose that the SAM-based priorH() uses gamma (or Poisson) if-part setsa h;k () = G(m h;k ;d h;k ; h;k ; h;k ; h;k ;) and the SAM-based likelihood G(xj) also uses gamma (or Poisson) if-part sets a g;j () =G(m g;j ;d g;j ; g;j ; g;j ; g;j ;) where G(m;d;;;;) = m d e ( m d )= (9.115) if m d > 0 (or if > m) for some constant > 0. Then the posterior F(jx) in (9.36) will have set functions a F;i () of semi-gamma form a F;i () = F;i m h;k d h;k h;k + g;j log ( m h;k d h;k ) ( m g;j d g;j ) e ( g;j d g;j m h;k + h;k d h;k m g;j g;j d g;j + h;k d h;k )= g;j h;k d g;j d h;k g;j d g;j + h;k d h;k (9.116) Proof. 
a_{F,i}(\theta) = a_{g,j}(\theta)\, a_{h,k}(\theta)   (9.117)
= \alpha_{F,i} \left(\frac{\theta - m_{h,k}}{d_{h,k}}\right)^{\beta_{h,k}} e^{-\left(\frac{\theta - m_{h,k}}{d_{h,k}}\right)/\gamma_{h,k}} \left(\frac{\theta - m_{g,j}}{d_{g,j}}\right)^{\beta_{g,j}} e^{-\left(\frac{\theta - m_{g,j}}{d_{g,j}}\right)/\gamma_{g,j}}   (9.118)
= \alpha_{F,i} \left(\frac{\theta - m_{h,k}}{d_{h,k}}\right)^{\beta_{h,k} + \beta_{g,j}\log_{\frac{\theta - m_{h,k}}{d_{h,k}}}\left(\frac{\theta - m_{g,j}}{d_{g,j}}\right)} e^{-\left(\frac{\theta - m_{h,k}}{d_{h,k}}\right)/\gamma_{h,k} - \left(\frac{\theta - m_{g,j}}{d_{g,j}}\right)/\gamma_{g,j}}   (9.119)
= \alpha_{F,i} \left(\frac{\theta - m_{h,k}}{d_{h,k}}\right)^{\beta_{h,k} + \beta_{g,j}\log_{\frac{\theta - m_{h,k}}{d_{h,k}}}\left(\frac{\theta - m_{g,j}}{d_{g,j}}\right)} \exp\left\{-\left(\theta - \frac{\gamma_{g,j} d_{g,j} m_{h,k} + \gamma_{h,k} d_{h,k} m_{g,j}}{\gamma_{g,j} d_{g,j} + \gamma_{h,k} d_{h,k}}\right) \Big/ \frac{\gamma_{g,j}\gamma_{h,k} d_{g,j} d_{h,k}}{\gamma_{g,j} d_{g,j} + \gamma_{h,k} d_{h,k}}\right\}   (9.120)

if \theta > m_{h,k} and \theta > m_{g,j} (or \theta > \max\{m_{h,k}, m_{g,j}\}).

A special case occurs if m_{h,k} = m_{g,j} and d_{h,k} = d_{g,j}. Then a_{F,i} has gamma form

a_{F,i}(\theta) = \alpha_{F,i} \left(\frac{\theta - m_{h,k}}{d_{h,k}}\right)^{\beta_{F,i}} e^{-\left(\frac{\theta - m_{h,k}}{d_{h,k}}\right)/\gamma_{F,i}} = G(m_{h,k}, d_{h,k}, \alpha_{F,i}, \beta_{F,i}, \gamma_{F,i}; \theta)   (9.121)

if \theta > m_{h,k}. Here \beta_{F,i} = \beta_{h,k} + \beta_{g,j}, \gamma_{F,i} = \frac{\gamma_{g,j}\gamma_{h,k}}{\gamma_{g,j} + \gamma_{h,k}}, and \alpha_{F,i} = \alpha_{h,k}\alpha_{g,j}.

Corollary 9.6. Semi-conjugacy of Laplace if-part sets. Suppose that the SAM-based prior H(\theta) uses Laplace if-part sets a_{h,k}(\theta) = L(m_{h,k}, d_{h,k}; \theta) and the SAM-based likelihood G(x|\theta) also uses Laplace if-part sets a_{g,j}(\theta) = L(m_{g,j}, d_{g,j}; \theta) where

L(m, d; \theta) = e^{-\left|\frac{\theta - m}{d}\right|}.   (9.122)

Then F(\theta|x) in (9.36) will have set functions a_{F,i}(\theta) of the (semi-Laplace) form

a_{F,i}(\theta) = a_{g,j}(\theta)\, a_{h,k}(\theta) = e^{-\left|\frac{\theta - m_{h,k}}{d_{h,k}}\right| - \left|\frac{\theta - m_{g,j}}{d_{g,j}}\right|}.   (9.123)

A special case occurs if m_{h,k} = m_{g,j} and d_{h,k} = d_{g,j}. Then a_{F,i} is of Laplace form

a_{F,i}(\theta) = e^{-\left|\frac{\theta - m_{h,k}}{d_{h,k}/2}\right|}.   (9.124)

Such semi-conjugacy differs from outright conjugacy in a crucial respect: the parameters of semi-conjugate if-part sets increase with each iteration or Bayesian update. The conjugate Gaussian sets in Corollary 9.3 avoid this parameter growth while the semi-conjugate beta, gamma, and Laplace sets in Corollaries 9.4–9.6 incur it. The latter if-part sets do not depend on a fixed number of parameters such as centers and widths as in the Gaussian case. Only set functions with the same centers m_j and widths d_j (in the special cases) will result in set functions for the posterior approximation with the same fixed number of parameters.
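The Gaussian conjugacy of Corollary 9.3 is easy to check numerically. The sketch below (illustrative parameter values of my own choosing, not from the dissertation's simulations) multiplies a likelihood set function a_{g,j} by a prior set function a_{h,k} and confirms that the product equals the single Gaussian set function G(m_{F,i}, d_{F,i}, \alpha_{F,i}; \theta) with the combined parameters of (9.93)-(9.95):

```python
import numpy as np

def gauss_set(theta, m, d, alpha):
    """Gaussian if-part set function G(m, d, alpha; theta) = alpha * exp(-(theta - m)^2 / d^2)."""
    return alpha * np.exp(-((theta - m) ** 2) / d ** 2)

# Arbitrary illustrative parameters for one prior set a_{h,k} and one likelihood set a_{g,j}.
m_h, d_h, a_h = 1.0, 2.0, 0.8
m_g, d_g, a_g = 3.0, 1.5, 0.6

# Conjugate posterior-set parameters from (9.93)-(9.95).
m_F = (d_g**2 * m_h + d_h**2 * m_g) / (d_g**2 + d_h**2)
d_F = np.sqrt(d_g**2 * d_h**2 / (d_g**2 + d_h**2))
a_F = a_h * a_g * np.exp(-(m_h - m_g) ** 2 / (d_g**2 + d_h**2))

theta = np.linspace(-5.0, 8.0, 1001)
product = gauss_set(theta, m_g, d_g, a_g) * gauss_set(theta, m_h, d_h, a_h)
conjugate = gauss_set(theta, m_F, d_F, a_F)

assert np.allclose(product, conjugate)  # the product of two Gaussian sets is again a Gaussian set
```

The assertion passes because the product of two unnormalized Gaussians is an unnormalized Gaussian: the exponents add and complete the square, which is exactly what (9.93)-(9.95) record.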
Coping with this form of "parameter explosion" remains an open area of research in the use of fuzzy systems in iterative Bayesian inference.

Figure 9.8: Laplace semi-conjugacy: Plots show examples of semi-conjugate Laplace set functions for the doubly fuzzy posterior approximator F(\theta|x) \propto H(\theta)\, G(x|\theta). The approximator uses five Laplace set functions a_H for the prior approximator H(\theta) and five Laplace set functions a_G for the likelihood approximator G(x|\theta). Thus F(\theta|x) is a weighted sum of 25 Laplacian semi-conjugate set functions of the form a_{F,i}(\theta) = \exp\{-|\frac{\theta - m_{h,k}}{d_{h,k}}| - |\frac{\theta - m_{g,j}}{d_{g,j}}|\}. The plots show that the Laplace semi-conjugate function can have a variety of shapes depending on the location and dispersion parameters of the prior and likelihood set functions.

9.4 Conclusion

We have shown that additive fuzzy systems can uniformly approximate a Bayesian posterior even in the hierarchical case when the prior pdf depends on its own uncertain parameter with its own hyperprior. This gives a triply fuzzy uniform function approximation. That hyperprior can in turn have its own hyperprior. The result will be a quadruply fuzzy uniform function approximation and so on. This new theorem substantially extends the choice of priors and hyperpriors from well-known closed-form pdfs that obey conjugacy relations to arbitrary rule-based priors that depend on user knowledge or sample data.
An open research problem is whether semi-conjugate rules or other techniques can reduce the exponential rule explosion that both doubly and triply fuzzy Bayesian systems face in general Bayesian iterative inference.

Chapter 10

Conclusion and Future Directions

10.1 Conclusion

The prevalence of high-speed ubiquitous computing devices has led to a phenomenal growth in the amount of data available for algorithmic manipulation. One side effect of this growth is the commensurate growth in the amount of corrupted or incomplete data. Increased diversity in the source and structure of recorded data also means there is a higher risk of applying the wrong statistical model for analysis. There are costs associated with the inefficient use of corrupted data and the improper use of statistical models. These costs will only grow with our increasing reliance on algorithms in daily life. And yet the systematic study of data and model corruption lags behind other statistical signal processing research efforts. This dissertation addresses both problems: the statistical analysis of corrupted or incomplete data and the effects of model misspecification. The first part of the dissertation showed how to improve the speed of a general algorithm (the EM algorithm) for analyzing corrupted data. The speed improvement relies on a theoretical result (Theorem 3.1) about noise benefits in EM algorithms. This dissertation demonstrated how this result affects three important statistical applications: data clustering, sequential data analysis with HMMs, and supervised neural network training. The second part of the dissertation showed that the use of uniform approximators can limit the sensitivity of Bayesian inference schemes to model misspecification. This dissertation presented theorems showing that applying Bayes theorem to uniform approximators for model functions (prior pdfs and likelihood functions) produces uniform approximations for the posterior pdf.
So Bayes theorem preserves the approximation quality of approximate data models. The dissertation also demonstrated an efficient data-driven fuzzy approximation scheme for uniformly approximating these statistical model functions. The approximation method is particularly interesting because it can exactly reproduce known traditional closed-form statistical models. The next sections discuss future extensions to the results in this dissertation.

10.2 Future Work

The EM algorithm has many diverse applications. Noisy expectation maximization may help reduce the training time of time-intensive EM applications. Table 10.1 lists a few areas to explore. The subsections below describe some of these application areas in detail.

Application                     | Prior EM Work                                  | NEM Extension
Big Data Clustering             | Celeux & Govaert, Hofmann [47], [132], [197]   | Chapter 4, [226]
Feedforward NN Training         | Ng & McLachlan [220], Cook & Robinson [59]     | Chapter 6, [6]
HMM Training                    | Baum & Welch [16], [17], [298]                 | Chapter 5, [7]
Deep Neural Network Training    | None                                           | Extension based on Chapter 6
Genomics: Motif Identification  | Lawrence & Reilly [187], Bailey & Elkan [10]   | Proposed below
PET/SPECT                       | Vardi & Shepp [268]                            | Proposed below
MRI Segmentation                | Zhang et al. [320]                             | Proposed below

Table 10.1: EM algorithm applications with possible NEM extensions

10.2.1 NEM for Deep Learning

Deep learning [129] refers to the use of "deep" stacks of restricted Boltzmann machines (RBMs) [129], [130], [270] to perform machine learning tasks. An RBM is a stochastic neural network with a bipartite graph structure and an associated Gibbs energy function E(x, h|\Theta) for the whole ensemble of neurons. RBMs are simple bidirectional associative memories (BAMs) [167] with synchronous neuron update. The Gibbs energy functions are just the BAM energy or Lyapunov functions [167], [176], [177] for the network. RBMs derive their convergence properties from the general BAM convergence theorem [167], [176], [177].
Thus the convergence of any deep learning algorithm is a direct consequence of the general BAM convergence theorem. Chapter 6 showed that backpropagation training of feedforward neural networks is a GEM procedure. The GEM training formalism extends to the training of RBMs [6]. Thus we can apply the NEM noise benefit to speed up RBM training and deep learning. An RBM has a visible layer with I neurons and a hidden layer with J neurons. Let x_i and h_j denote the values of the i-th visible and j-th hidden neuron.

Figure 10.1: A restricted Boltzmann machine (RBM) with visible-to-hidden connection matrix W. The network is restricted because there are no connections between neurons in the same layer. The RBM is a BAM with forward connection matrix W and backward connection matrix W^T.

Figure 10.2: A Deep Neural Network consists of a stack of restricted Boltzmann machines (RBMs) or bidirectional associative memories (BAMs).

Let E(x, h|\Theta) be the energy function for the network. Then the joint probability density function of x and h is the Gibbs distribution:

p(x, h|\Theta) = \frac{\exp(-E(x, h|\Theta))}{Z(\Theta)}   (10.1)

where

Z(\Theta) = \sum_x \sum_h \exp(-E(x, h|\Theta)).   (10.2)

The Gibbs energy function E(x, h|\Theta) depends on the type of RBM. A Bernoulli(visible)-Bernoulli(hidden) RBM has logistic conditional PDFs at the hidden and visible layers and has the following Gibbs energy:

E(x, h|\Theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} x_i h_j - \sum_{i=1}^{I} b_i x_i - \sum_{j=1}^{J} a_j h_j   (10.3)

where w_{ij} is the weight of the connection between the i-th visible and j-th hidden neuron, b_i is the bias for the i-th visible neuron, and a_j is the bias for the j-th hidden neuron.
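To make (10.1)-(10.3) concrete, the sketch below computes the Bernoulli-Bernoulli energy and the Gibbs joint density for a toy RBM by brute-force enumeration of all 2^I \cdot 2^J binary states. The weights are arbitrary random values and the function names are mine; exact enumeration of Z(\Theta) is tractable only at this tiny scale, which is precisely why contrastive divergence approximates it in practice.

```python
import itertools
import numpy as np

def bb_energy(x, h, W, b, a):
    # Bernoulli-Bernoulli RBM energy (10.3): E(x, h) = -x^T W h - b^T x - a^T h
    return -x @ W @ h - b @ x - a @ h

rng = np.random.default_rng(0)
I, J = 3, 2                      # tiny visible and hidden layers
W = rng.normal(size=(I, J))      # visible-to-hidden weights w_ij
b = rng.normal(size=I)           # visible biases b_i
a = rng.normal(size=J)           # hidden biases a_j

# All 2^I * 2^J joint binary configurations (x, h).
states = [(np.array(x), np.array(h))
          for x in itertools.product([0, 1], repeat=I)
          for h in itertools.product([0, 1], repeat=J)]

# Partition function Z(Theta) of (10.2): sum of exp(-E) over every state.
Z = sum(np.exp(-bb_energy(x, h, W, b, a)) for x, h in states)

# Gibbs joint density (10.1) for any configuration.
def p_joint(x, h):
    return np.exp(-bb_energy(x, h, W, b, a)) / Z

total = sum(p_joint(x, h) for x, h in states)
```

Here `total` comes out to 1 up to floating-point error, which confirms that Z(\Theta) normalizes the Gibbs distribution over the joint state space.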
A Gaussian(visible)-Bernoulli(hidden) RBM has Gaussian conditional PDFs at the visible layer, logistic conditional PDFs at the hidden layer, and the energy function [130], [131]

E(x, h|\Theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} x_i h_j + \frac{1}{2}\sum_{i=1}^{I} (x_i - b_i)^2 - \sum_{j=1}^{J} a_j h_j.   (10.4)

Contrastive divergence (CD) [130] is an approximate gradient ascent algorithm for finding ML parameters for RBMs. This ML gradient ascent approach means that CD is an instance of generalized EM [6], just as backpropagation is an instance of GEM. So we can derive NEM conditions for RBM training too. The NEM theorem implies that noise benefits occur on average if

E_{x,h,n|\Theta^*}\{-E(x + n, h|\Theta) + E(x, h|\Theta)\} \ge 0.   (10.5)

This condition depends on the activation functions in the visible and hidden layers. The next two theorems specify the NEM conditions for two common RBM configurations: the Bernoulli-Bernoulli and the Gaussian-Bernoulli RBM.

Theorem 10.1. [Forbidden Hyperplane Noise-Benefit Condition]: The NEM positivity condition holds for Bernoulli-Bernoulli RBM training if

E_{x,h,n|\Theta^*}\{n^T(Wh + b)\} \ge 0.   (10.6)

Proof. The noise benefit for a Bernoulli(visible)-Bernoulli(hidden) RBM results if we apply the energy function from (10.3) to the expectation in (10.5) to get

E_{x,h,n|\Theta^*}\Big\{\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i\Big\} \ge 0.   (10.7)

The term in brackets is equivalent to

\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i = n^T(Wh + b).   (10.8)

So the noise-benefit sufficient condition becomes

E_{x,h,n|\Theta^*}\{n^T(Wh + b)\} \ge 0.   (10.9)

The sufficient condition above states a hyperplane separation condition that guarantees a noise benefit in the Bernoulli-Bernoulli RBM. The next theorem states a spherical separation condition that guarantees a noise benefit in the Gaussian-Bernoulli RBM.

Theorem 10.2. [Forbidden Sphere Noise-Benefit Condition]: The NEM positivity condition holds for Gaussian-Bernoulli RBM training if

E_{x,h,n|\Theta^*}\Big\{\frac{1}{2}\|n\|^2 - n^T(Wh + b - x)\Big\} \le 0.   (10.10)

Proof.
Putting the energy function in (10.4) into (10.5) gives the noise-benefit condition for a Gaussian(visible)-Bernoulli(hidden) RBM:

E_{x,h,n|\Theta^*}\Big\{\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i - \frac{1}{2}\sum_{i=1}^{I} n_i^2 - \sum_{i=1}^{I} n_i x_i\Big\} \ge 0.   (10.11)

The term in brackets equals the following matrix expression:

\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i - \frac{1}{2}\sum_{i=1}^{I} n_i^2 - \sum_{i=1}^{I} n_i x_i = n^T(Wh + b - x) - \frac{1}{2}\|n\|^2.   (10.12)

So the noise-benefit sufficient condition becomes

E_{x,h,n|\Theta^*}\Big\{\frac{1}{2}\|n\|^2 - n^T(Wh + b - x)\Big\} \le 0.   (10.13)

These noise-benefit conditions predict speed improvements for deep learning via the careful application of noise injection.

10.2.2 NEM for Genomics: DNA Motif Identification

An organism's genome contains all the instructions necessary to manufacture the proteins its cells need to function. The information resides in the coding DNA sequences (or genes) within the genome. Transcription is the nuclear process by which the cell copies protein code from a gene to a strand of messenger RNA (mRNA). mRNA takes this code outside the nucleus to ribosomes where protein synthesis occurs. The human genome contains between 30,000 and 40,000 such protein-coding genes. Most advanced cells need only a small portion of these genes to be active at any given time. So eukaryotic cells have adapted by keeping most genes switched off unless they receive a signal for expression [39]. One of the functions of non-coding (or upstream) DNA sequences is to control the initiation and rate of gene expression. The cell transcribes (or turns on) a specific gene when transcription factors (promoters and enhancers) bind to that gene's associated binding sites located upstream in the genome. The basic question is [10], [187], [252]: how do we identify transcription factor binding sequences for specific genes? The problem is more complicated because the recognition of a binding-site DNA sequence is often fuzzy; inexact promoter sequence matches can activate gene expression.
Sabatti and Lange [261] give an example of 11 experimentally verified admissible variations of a single promoter sequence (the CRP binding site) that activates a specific gene in E. coli. (There are also negative transcription factors that inhibit specific gene expression.) These fuzzy sequence patterns compose a single motif [314] for CRP binding. Motif search is a fertile area for EM methods [10], [187], [275] because of the problem's missing-information structure. The motif positions are missing and the exact motif sequence is not precise. The idea of a motif also extends from DNA sequences for genes to amino acid sequences for proteins. Bailey and Elkan [10]–[13] developed a mixture-model EM for motif identification which they called Multiple EM for Motif Elicitation (MEME). Likelihood models on DNA sequences assume sequence entries are independent discrete random variables that take values from the 4 possible nucleotides: {Adenine (A), Thymine (T), Guanine (G), Cytosine (C)}. MEME models the data as a mixture of two different discrete populations [10]: motif sequences and background (non-motif) sequences. An EM algorithm learns this mixture model which then informs the unsupervised sequence classification. Lawrence and Reilly [187] had introduced the first explicit EM model. But their EM algorithm does a supervised position search for fixed-length motifs in sequences that contain a single copy of the motif. Sabatti and Lange [261] also developed an MM generalization of Lawrence and Reilly's method that allows motifs to have variable-length sequences. These approaches have had some success. But they work slowly on the huge human genome data set. And they miss some binding sites. Perturbing these likelihood-based estimation methods with noise can alleviate some of these problems. The success of a randomized Gibbs sampling [104], [280] version of Lawrence and Reilly's supervised method [186] strongly suggests that the principled use of noise injection (e.g.
NEM) should improve EM-based approaches to motif identification. (A similar likelihood model applies for motif identification in biopolymer sequences like proteins; proteins just use a symbol lexicon of amino acids instead of nucleotides.)

10.2.3 NEM for PET & SPECT

We can write a NEM condition for PET reconstruction via EM. The Q-function for the PET-EM model from Section 2.5.3 was:

Q(\Theta|\Theta(t)) = \sum_j \sum_i \big(-p_{ij}\lambda_i + E_{Z|Y,\Theta(t)}[Z_{ij}] \ln(p_{ij}\lambda_i) - E_{Z|Y,\Theta(t)}[\ln(Z_{ij}!)]\big).   (10.14)

The noise-benefit condition for general exponential-family EM models is:

E_{Z,N|\theta^*}\Big[\ln \frac{f(Z + N|\theta)}{f(Z|\theta)}\Big] \ge 0.   (10.15)

The log-likelihood function for a Poisson model is:

\ln f(z|\lambda_z) = -\lambda_z + z \ln(\lambda_z) - \ln(z!).   (10.16)

The log-ratio in the expectation reduces to the following for each data sample z:

\ln \frac{f(z + N|\lambda_z)}{f(z|\lambda_z)} = N \ln \lambda_z + \ln(z!) - \ln[(z + N)!].   (10.17)

This simplifies to the following condition:

N \ge \frac{1}{\ln \lambda_z} \ln \frac{\Gamma(z + N + 1)}{\Gamma(z + 1)}.   (10.18)

The noise benefit holds if the inequality (10.18) holds only on average. Thus

E_N[N] \ge \frac{1}{\ln(\lambda_z)} E_N\Big[\ln \frac{\Gamma(z + N + 1)}{\Gamma(z + 1)}\Big].   (10.19)

The noise condition for the PET model replaces \lambda_z in (10.19) with p_{ij}\lambda_i in (2.116):

E_N[N] \ge \frac{1}{\ln(p_{ij}\lambda_i)} E_N\Big[\ln \frac{\Gamma(z + N + 1)}{\Gamma(z + 1)}\Big].   (10.20)

This improvement applies to PET using EM. The PET-EM approach was state-of-the-art almost 30 years ago when Shepp and Vardi [268] introduced the idea of ML reconstruction for emission tomography. There have been recent advances in statistical reconstruction for tomography. These advances include the use of MAP estimation, the introduction of regularization, the use of Gibbs priors, and hybrid methods that take frequency-domain details into account. The NEM theorem and its generalizations may accommodate some of these improvements.

10.2.4 NEM for Magnetic Resonance Image Segmentation

The Expectation-Maximization (EM) algorithm figures prominently in the segmentation of magnetic resonance (MR) images of brain and cardiac tissue [154], [320], [194], [75].
For example, a brain image segmentation problem takes an MR image of the brain and classifies voxels or parts of voxels as grey matter, white matter, or cerebrospinal fluid. There have been recent efforts [141], [142], [150] to develop robust automated methods for differentiating and measuring differences between water and fat contents in an MR image. MR is optimal for this task because MR can differentiate between fat and water-saturated tissue unlike most other medical imaging modalities. But water-fat segmentation is difficult even with MR sensing. This is because subcutaneous and visceral (within organs) adipose tissue have different distribution characteristics. Current water/fat segmentation uses post-processing methods based on semi-automated image registration techniques [150] or human annotation. These work well for identifying large adipose masses. But they are not well suited for identifying the smaller adipose masses that are characteristic of visceral fat. Accurate and automatic water/fat MRI segmentation would be an important tool for assessing patients' risk of type II diabetes, heart disease, and other obesity-related ailments. A simple EM segmentation application would use EM to fit the pixel intensity levels to different tissue classes [155], [144] under a Gaussian mixture model (GMM) assumption. The pixel-intensity classification model does not take spatial information into account. This limits the model to well-defined, low-noise applications [320]. A more general approach is to apply a spatially aware image segmentation method based on the hidden Markov random field (HMRF) model. The HMRF model is a non-causal generalization of the hidden Markov model. HMMs usually apply to signals with localized dependence in a time-like dimension while HMRFs model signals with localized dependence in their spatial dimensions. The traditional Baum-Welch algorithm cannot train HMRF models because of the non-causal nature of the signals involved.
But there are mean-field EM algorithms that can train HMRFs. There is prior work showing EM-based image segmentation [316], [317] on the HMRF model. There is also prior work showing how the HMRF model applies to image segmentation for MR images [154], [320], [75]. The EM algorithm on HMRF models is computationally expensive and can be slow. Thus a speed-up in the algorithm would be useful. We propose to generalize the Noisy Expectation Maximization result to such mean-field algorithms and apply their noisy variants to HMRF training. The goal would be to show faster convergence of the HMRF-NEM algorithm or to show that NEM improves classification/segmentation results for a given number of EM iterations.

10.2.5 Rule Explosion in Approximate Bayesian Inference

The Bayesian approximation results apply to generic uniform approximations. Chapters 8 and 9 examined the behavior of fuzzy approximators in Bayesian applications. Neural network pdf approximators may have interesting properties. Other uniform approximation methods may also be worth exploring. The main problem with the Bayesian approximation results is approximator explosion in the iterative Bayesian setting. This manifests as rule explosion when we use fuzzy function approximators. Semi-conjugacy (Section 9.3) is a restricted approach to controlling the rule explosion problem. Other approximation methods may lead to more robust strategies for reducing rule explosion in iterative Bayesian inference. Further work could also combine the idea of a noise benefit with iterative Bayesian inference. A noise benefit may speed up the convergence of Gibbs samplers.

Bibliography

[1] R. K. Adair, R. D. Astumian, and J. C. Weaver, "Detection of Weak Electric Fields by Sharks, Rays and Skates", Chaos: Focus Issue on the Constructive Role of Noise in Fluctuation Driven Transport and Stochastic Resonance, vol. 8, no. 3, pp. 576–587, Sep. 1998.
[2] J.
Aldrich, "R. A. Fisher and the making of maximum likelihood 1912–1922", Statistical Science, vol. 12, no. 3, pp. 162–176, 1997.
[3] K. Alligood, T. Sauer, and J. Yorke, Chaos: An Introduction to Dynamical Systems. Springer, 1997, isbn: 9780387946771.
[4] C. Ambroise, M. Dang, and G. Govaert, "Clustering of spatial data by the EM algorithm", Quantitative Geology and Geostatistics, vol. 9, pp. 493–504, 1997.
[5] G. An, "The effects of adding noise during backpropagation training on a generalization performance", Neural Computation, vol. 8, no. 3, pp. 643–674, 1996.
[6] K. Audhkhasi, O. Osoba, and B. Kosko, "Noise benefits in backpropagation and bidirectional pre-training", in International Joint Conference on Neural Networks (IJCNN) (to appear), IEEE, 2013.
[7] ——, "Noisy hidden Markov models for speech recognition", in International Joint Conference on Neural Networks (IJCNN) (to appear), IEEE, 2013.
[8] P. Bacchetti, "Estimating the incubation period of AIDS by comparing population infection and diagnosis patterns", Journal of the American Statistical Association, vol. 85, no. 412, pp. 1002–1008, 1990.
[9] M. Bagnoli and T. Bergstrom, "Log-concave Probability and Its Applications", Economic Theory, vol. 26, pp. 445–469, 2005.
[10] T. L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", in Proc. International Conference on Intelligent Systems for Molecular Biology, vol. 2, 1994, p. 28.
[11] ——, "The value of prior knowledge in discovering motifs with MEME", in Proc. International Conference on Intelligent Systems for Molecular Biology, vol. 3, 1995, pp. 21–29.
[12] ——, "Unsupervised learning of multiple motifs in biopolymers using expectation maximization", Machine Learning, vol. 21, no. 1-2, pp. 51–80, 1995.
[13] T. L. Bailey, N. Williams, C. Misleh, and W. W.
Li, "MEME: discovering and analyzing DNA and protein sequence motifs", Nucleic Acids Research, vol. 34, no. suppl 2, pp. W369–W373, 2006.
[14] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology", Bull. Amer. Math. Soc., vol. 73, no. 3, pp. 360–363, 1967.
[15] L. E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains", The Annals of Mathematical Statistics, vol. 37, no. 6, pp. 1554–1563, 1966.
[16] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains", The Annals of Mathematical Statistics, pp. 164–171, 1970.
[17] L. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains", The Annals of Mathematical Statistics, pp. 164–171, 1970.
[18] M. J. Bayarri and M. H. DeGroot, "Difficulties and ambiguities in the definition of a likelihood function", Journal of the Italian Statistical Society, vol. 1, no. 1, pp. 1–15, 1992.
[19] T. Bayes, "An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S.", Philosophical Transactions of the Royal Society of London, vol. 53, pp. 370–418, 1763.
[20] E. M. L. Beale and R. J. A. Little, "Missing values in multivariate analysis", Journal of the Royal Statistical Society, Series B (Methodological), pp. 129–145, 1975.
[21] M. P. Becker, I. Yang, and K. Lange, "EM algorithms without missing data", Statistical Methods in Medical Research, vol. 6, no. 1, pp. 38–54, 1997.
[22] Y. Bengio, "Learning deep architectures for AI", Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[23] R. Benzi, G. Parisi, A. Sutera, and A.
Vulpiani, "A Theory of Stochastic Resonance in Climatic Change", SIAM Journal on Applied Mathematics, vol. 43, no. 3, pp. 565–578, Jun. 1983.
[24] R. Benzi, A. Sutera, and A. Vulpiani, "The Mechanism of Stochastic Resonance", Journal of Physics A: Mathematical and General, vol. 14, no. 11, pp. L453–L457, 1999.
[25] J. Berger and R. Wolpert, The Likelihood Principle, ser. Lecture Notes, Monograph Series. Institute of Mathematical Statistics, 1988, isbn: 9780940600133. [Online]. Available: http://books.google.com/books?id=7fz8JGLmWbgC.
[26] J. Bernardo and A. Smith, Bayesian Theory, ser. Wiley Series in Probability and Statistics. Wiley, 2009, isbn: 9780470317716.
[27] J. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm", Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[28] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Volume I, 2nd ed. Prentice Hall, 2001.
[29] P. Billingsley, Probability and Measure, 3rd ed. John Wiley & Sons, 1995.
[30] C. M. Bishop, "Training with noise is equivalent to Tikhonov regularization", Neural Computation, vol. 7, no. 1, pp. 108–116, 1995.
[31] C. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics. Springer, 2006, isbn: 9780387310732.
[32] D. Böhning and B. G. Lindsay, "Monotonicity of quadratic-approximation algorithms", Annals of the Institute of Statistical Mathematics, vol. 40, no. 4, pp. 641–663, 1988.
[33] L. Bottou, "Stochastic gradient learning in neural networks", Proceedings of Neuro-Nîmes, vol. 91, p. 8, 1991.
[34] G. E. P. Box and G. C. Tiao, "Multiparameter problems from a Bayesian point of view", The Annals of Mathematical Statistics, vol. 36, no. 5, pp. 1468–1482, 1965.
[35] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[36] R. A. Boyles, "On the convergence of the EM algorithm", Journal of the Royal Statistical Society,
Series B (Methodological), vol. 45, no. 1, pp. 47–50, 1983, issn: 0035-9246.
[37] M. Brand, N. Oliver, and A. Pentland, "Coupled hidden Markov models for complex action recognition", in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, IEEE, 1997, pp. 994–999.
[38] J. J. Brey and A. Prados, "Stochastic Resonance in a One-Dimension Ising Model", Physics Letters A, vol. 216, pp. 240–246, Jun. 1996.
[39] W. M. Brown and P. M. Brown, Transcription, ser. Cell and Biomolecular Sciences. Taylor & Francis, 2004, isbn: 9780203166314.
[40] A. R. Bulsara and L. Gammaitoni, "Tuning in to Noise", Physics Today, pp. 39–45, Mar. 1996.
[41] A. R. Bulsara and A. Zador, "Threshold Detection of Wideband Signals: A Noise-Induced Maximum in the Mutual Information", Physical Review E, vol. 54, no. 3, pp. R2185–R2188, Sep. 1996.
[42] B. P. Carlin and T. A. Louis, Bayesian Methods for Data Analysis, 3rd ed. CRC Press, 2009.
[43] G. A. Carpenter and S. Grossberg, "A massively parallel architecture for a self-organizing neural pattern recognition machine", Computer Vision, Graphics, and Image Processing, vol. 37, no. 1, pp. 54–115, 1987.
[44] G. A. Carpenter, S. Grossberg, and J. H. Reynolds, "ARTMAP: supervised real-time learning and classification of nonstationary data by a self-organizing neural network", Neural Networks, vol. 4, no. 5, pp. 565–588, 1991.
[45] G. A. Carpenter, S. Grossberg, and D. B. Rosen, "Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system", Neural Networks, vol. 4, no. 6, pp. 759–771, 1991.
[46] G. Celeux and J. Diebolt, "The SEM algorithm: A Probabilistic Teacher Algorithm Derived from the EM Algorithm for the Mixture Problem", Computational Statistics Quarterly, vol. 2, no. 1, pp. 73–82, 1985.
[47] G. Celeux and G. Govaert, "A Classification EM Algorithm for Clustering and Two Stochastic Versions", Computational Statistics & Data Analysis, vol. 14, no. 3, pp. 315–332, 1992.
[48] V. Černý, "Thermodynamical approach to the Traveling Salesman Problem: an efficient simulation algorithm", Journal of Optimization Theory and Applications, vol. 45, no. 1, pp. 41–51, 1985.
[49] F. Chapeau-Blondeau and D. Rousseau, "Noise improvements in stochastic resonance: from signal amplification to optimal detection", Fluctuation and Noise Letters, vol. 2, no. 03, pp. L221–L233, 2002.
[50] ——, "Noise-Enhanced Performance for an Optimal Bayesian Estimator", IEEE Transactions on Signal Processing, vol. 52, no. 5, pp. 1327–1334, 2004, issn: 1053-587X.
[51] F. Chapeau-Blondeau, S. Blanchard, and D. Rousseau, "Fisher information and noise-aided power estimation from one-bit quantizers", Digital Signal Processing, vol. 18, no. 3, pp. 434–443, 2008.
[52] D. Chauveau, "A stochastic EM algorithm for mixtures with censored data", Journal of Statistical Planning and Inference, vol. 46, no. 1, pp. 1–25, 1995.
[53] H. Chen, P. Varshney, S. Kay, and J. Michels, "Noise Enhanced Nonparametric Detection", IEEE Transactions on Information Theory, vol. 55, no. 2, pp. 499–506, 2009, issn: 0018-9448.
[54] J. Cherng and M. Lo, "A hypergraph based clustering algorithm for spatial data sets", in Proceedings of the IEEE International Conference on Data Mining, 2001 (ICDM 2001), IEEE, 2001, pp. 83–90.
[55] Y. Chow, M. Dunham, O. Kimball, M. Krasner, G. Kubala, J. Makhoul, P. Price, S. Roucos, and R. Schwartz, "BYBLOS: The BBN continuous speech recognition system", in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '87, IEEE, vol. 12, 1987, pp. 89–92.
[56] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition", Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.
[57] J. J. Collins, C. C. Chow, A. C. Capela, and T. T. Imhoff, "Aperiodic Stochastic Resonance", Physical Review E, vol. 54, no. 5, pp. 5575–5584, Nov. 1996.
[58] J. J. Collins, C. C. Chow, and T. T.
Imhoff, "Aperiodic Stochastic Resonance in Excitable Systems", Physical Review E, vol. 52, no. 4, pp. R3321–R3324, Oct. 1995.
[59] G. D. Cook and A. J. Robinson, "Training MLPs via the expectation maximization algorithm", in Fourth International Conference on Artificial Neural Networks, IET, 1995, pp. 47–52.
[60] P. Cordo, J. T. Inglis, S. Vershueren, J. J. Collins, D. M. Merfeld, S. Rosenblum, S. Buckley, and F. Moss, "Noise in Human Muscle Spindles", Nature, vol. 383, pp. 769–770, Oct. 1996.
[61] T. M. Cover and J. A. Thomas, Elements of Information Theory, 1st ed. Wiley & Sons, New York, 1991.
[62] D. R. Cox, "Some problems connected with statistical inference", The Annals of Mathematical Statistics, vol. 29, no. 2, pp. 357–372, 1958.
[63] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, "Wavelet-based statistical signal processing using hidden Markov models", IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 886–902, 1998.
[64] R. C. Dahiya and J. Gurland, "Goodness of Fit Tests for the Gamma and Exponential Distributions", Technometrics, pp. 791–801, 1972.
[65] G. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted Boltzmann machine", Advances in Neural Information Processing Systems, vol. 23, pp. 469–477, 2010.
[66] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google News personalization: scalable online collaborative filtering", in Proceedings of the 16th International Conference on World Wide Web, ser. WWW '07, ACM, 2007, pp. 271–280, isbn: 978-1-59593-654-7. doi: 10.1145/1242572.1242610.
[67] A. R. De Pierro, "A modified expectation maximization algorithm for penalized likelihood estimation in emission tomography", IEEE Transactions on Medical Imaging, vol. 14, no. 1, pp. 132–137, 1995.
[68] M. H. DeGroot, Optimal Statistical Decisions. McGraw-Hill, 1970.
[69] J. P. Delmas, "An equivalence of the EM and ICE algorithm for exponential family", IEEE Transactions on Signal Processing, vol. 45, no.
10, pp. 2613–2615, 1997.
[70] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion)", Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, Jan. 1977.
[71] ——, "Iteratively reweighted least squares for linear regression when errors are normal/independent distributed", in Multivariate Analysis–V: Proceedings of the Fifth International Symposium on Multivariate Analysis, North-Holland, 1980, p. 35.
[72] T. Deselaers, S. Hasan, O. Bender, and H. Ney, "A deep learning approach to machine transliteration", in Proceedings of the 4th EACL Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2009, pp. 233–241.
[73] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering", in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003, pp. 89–98.
[74] J. A. Dickerson and B. Kosko, "Virtual worlds as fuzzy cognitive maps", Presence, vol. 3, no. 2, pp. 173–189, 1994.
[75] A. Diplaros, N. Vlassis, and T. Gevers, "A spatially constrained generative model and an EM algorithm for image segmentation", IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 798–808, 2007.
[76] P. Domingos and M. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss", Machine Learning, vol. 29, no. 2, pp. 103–130, 1997.
[77] J. L. Doob, "Probability and statistics", Transactions of the American Mathematical Society, vol. 36, no. 4, pp. 759–775, 1934.
[78] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley-Interscience, 2001.
[79] R. Durrett, Probability: Theory and Examples, 4th ed. Cambridge Univ. Press, 2010.
[80] S. R. Eddy, "Profile hidden Markov models", Bioinformatics, vol. 14, no. 9, pp. 755–763, 1998.
[81] S. R.
Eddy et al., "Multiple alignment using hidden Markov models", in Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, vol. 3, 1995, pp. 114–120.
[82] F. Y. Edgeworth, "On the probable errors of frequency-constants", Journal of the Royal Statistical Society, pp. 381–397, 499–512, 651–678, 1908.
[83] B. Efron, "Maximum likelihood and decision theory", The Annals of Statistics, pp. 340–356, 1982.
[84] B. Efron and D. Hinkley, "Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information", Biometrika, vol. 65, no. 3, pp. 457–483, 1978.
[85] B. Efron and R. Tibshirani, "Statistical data analysis in the computer age", Science, vol. 253, no. 5018, pp. 390–395, 1991.
[86] B. Efron, "Missing data, imputation, and the bootstrap", Journal of the American Statistical Association, vol. 89, no. 426, pp. 463–475, 1994.
[87] R. J. Elliott, L. Aggoun, and J. B. Moore, Hidden Markov Models: Estimation and Control. Springer, 1994, vol. 29.
[88] W. Feller, An Introduction to Probability Theory and Its Applications. John Wiley & Sons, 1966, vol. II.
[89] J. A. Fessler and A. O. Hero, "Complete-data spaces and generalized EM algorithms", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-93), IEEE, vol. 4, 1993, pp. 1–4.
[90] ——, "Space-Alternating Generalized Expectation-Maximization Algorithm", IEEE Transactions on Signal Processing, vol. 42, no. 10, pp. 2664–2677, 1994.
[91] ——, "Penalized maximum-likelihood image reconstruction using space-alternating generalized EM algorithms", IEEE Transactions on Image Processing, vol. 4, no. 10, pp. 1417–1429, 1995.
[92] R. A. Fisher, "On an absolute criterion for fitting frequency curves", Messenger of Mathematics, vol. 41, pp. 155–160, 1912.
[93] ——, "On the mathematical foundations of theoretical statistics", Philosophical Transactions of the Royal Society of London.
Series A, Containing Papers of a Mathematical or Physical Character, vol. 222, pp. 309–368, 1922.
[94] ——, "Theory of statistical estimation", in Mathematical Proceedings of the Cambridge Philosophical Society, Cambridge Univ. Press, vol. 22, 1925, pp. 700–725.
[95] ——, "Two New Properties of Mathematical Likelihood", Proceedings of the Royal Society of London. Series A, vol. 144, no. 852, pp. 285–307, 1934.
[96] D. B. Fogel, "An introduction to simulated evolutionary optimization", IEEE Transactions on Neural Networks, vol. 5, no. 1, pp. 3–14, 1994.
[97] G. B. Folland, Real Analysis: Modern Techniques and Their Applications, 2nd ed. Wiley-Interscience, 1999.
[98] A. Förster, M. Merget, and F. W. Schneider, "Stochastic Resonance in Chemistry. 2. The Peroxidase-Oxidase Reaction", Journal of Physical Chemistry, vol. 100, pp. 4442–4447, 1996.
[99] B. Franzke and B. Kosko, "Noise can speed convergence in Markov chains", Physical Review E, vol. 84, no. 4, p. 041112, 2011. doi: 10.1103/PhysRevE.84.041112.
[100] L. Gammaitoni, P. Hänggi, P. Jung, and F. Marchesoni, "Stochastic Resonance", Reviews of Modern Physics, vol. 70, no. 1, pp. 223–287, Jan. 1998.
[101] J. S. Garofolo, TIMIT: Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, 1993.
[102] P. H. Garthwaite, J. B. Kadane, and A. O'Hagan, "Statistical methods for eliciting probability distributions", Journal of the American Statistical Association, vol. 100, no. 470, pp. 680–701, 2005.
[103] S. Geman and C. Hwang, "Diffusions for global optimization", SIAM Journal on Control and Optimization, vol. 24, no. 5, pp. 1031–1043, 1986.
[104] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images", IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721–741, 1984.
[105] N. A. Gershenfeld, The Nature of Mathematical Modeling. Cambridge University Press, 1999, isbn: 9780521570954.
[106] F. Girosi and T.
Poggio, "Representation properties of networks: Kolmogorov's theorem is irrelevant", Neural Computation, vol. 1, no. 4, pp. 465–469, 1989.
[107] V. Goffin, C. Allauzen, E. Bocchieri, D. Hakkani-Tur, A. Ljolje, S. Parthasarathy, M. Rahim, G. Riccardi, and M. Saraclar, "The AT&T Watson speech recognizer", in Proceedings of ICASSP, 2005, pp. 1033–1036.
[108] P. J. Green, "Bayesian reconstructions from emission tomography data using a modified EM algorithm", IEEE Transactions on Medical Imaging, vol. 9, no. 1, pp. 84–93, 1990.
[109] ——, "On use of the EM for penalized likelihood estimation", Journal of the Royal Statistical Society, Series B (Methodological), pp. 443–452, 1990.
[110] S. Grossberg, Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control. D. Reidel Publishing Company, 1982.
[111] ——, "Competitive learning: from interactive activation to adaptive resonance", Cognitive Science, vol. 11, no. 1, pp. 23–63, 1987.
[112] M. R. Gupta and Y. Chen, "Theory and Use of the EM Algorithm", Foundations and Trends in Signal Processing, vol. 4, no. 3, pp. 223–296, 2010.
[113] B. Hajek, "Cooling schedules for optimal annealing", Mathematics of Operations Research, pp. 311–329, 1988.
[114] A. K. Halberstadt and J. R. Glass, "Heterogeneous acoustic measurements for phonetic classification", in Proceedings of EUROSPEECH, vol. 97, 1997, pp. 401–404.
[115] L. Hall, I. Ozyurt, and J. Bezdek, "Clustering with a genetically optimized approach", IEEE Transactions on Evolutionary Computation, vol. 3, no. 2, pp. 103–112, 1999.
[116] P. Hamel and D. Eck, "Learning features from music audio with deep belief networks", in 11th International Society for Music Information Retrieval Conference (ISMIR 2010), 2010.
[117] G. Hamerly and C. Elkan, "Alternatives to the k-Means Algorithm that Find Better Clusterings", in Proceedings of the Eleventh International Conference on Information and Knowledge Management, ACM, 2002, pp. 600–607.
[118] P. Hänggi, "Stochastic resonance in biology", ChemPhysChem, vol. 3, no. 3, pp. 285–290, 2002, issn: 1439-7641.
[119] J. A. Hartigan, "Direct clustering of a data matrix", Journal of the American Statistical Association, vol. 67, no. 337, pp. 123–129, 1972.
[120] H. O. Hartley, "Maximum likelihood estimation from incomplete data", Biometrics, pp. 174–194, 1958.
[121] H. O. Hartley and R. R. Hocking, "The analysis of incomplete data", Biometrics, pp. 783–823, 1971.
[122] E. Hartuv and R. Shamir, "A clustering algorithm based on graph connectivity", Information Processing Letters, vol. 76, no. 4, pp. 175–181, 2000.
[123] V. Hasselblad, "Estimation of Parameters for a Mixture of Normal Distributions", Technometrics, vol. 8, no. 3, pp. 431–444, 1966.
[124] R. J. Hathaway, "Another interpretation of the EM algorithm for mixture distributions", Statistics & Probability Letters, vol. 4, no. 2, pp. 53–56, 1986.
[125] T. Hebert and R. Leahy, "A generalized EM algorithm for 3-D Bayesian reconstruction from Poisson data using Gibbs priors", IEEE Transactions on Medical Imaging, vol. 8, no. 2, pp. 194–202, 1989.
[126] T. J. Hebert and R. Leahy, "Statistic-based MAP image-reconstruction from Poisson data using Gibbs priors", IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2290–2303, 1992.
[127] R. Hecht-Nielsen, "Kolmogorov's mapping neural network existence theorem", in Proc. IEEE International Conference on Neural Networks, vol. 3, 1987, pp. 11–14.
[128] G. E. Hinton. (2013). Training a deep autoencoder or a classifier on MNIST digits. [Online; http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html].
[129] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups", IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[130] G. E. Hinton, S. Osindero, and Y. W.
Teh, "A fast learning algorithm for deep belief nets", Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[131] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks", Science, vol. 313, no. 5786, pp. 504–507, 2006.
[132] T. Hofmann, "Probabilistic latent semantic analysis", in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1999, pp. 289–296.
[133] ——, "Unsupervised learning by probabilistic latent semantic analysis", Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001.
[134] ——, "Latent semantic models for collaborative filtering", ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 89–115, 2004.
[135] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Mathematical Statistics, 6th ed. Prentice Hall, 2005.
[136] R. V. Hogg and E. A. Tanis, Probability and Statistical Inference, 7th ed. Prentice Hall, 2006.
[137] M. J. J. Holt and S. Semnani, "Convergence of back-propagation in neural networks using a log-likelihood cost function", Electronics Letters, vol. 26, no. 23, pp. 1964–1965, 1990.
[138] F. Höppner, F. Klawonn, and R. Kruse, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. John Wiley, 1999.
[139] K. Hornik, M. Stinchcombe, and H. White, "Multilayer Feedforward Networks are Universal Approximators", Neural Networks, vol. 2, no. 5, pp. 359–366, 1989, issn: 0893-6080.
[140] K. Hornik, M. Stinchcombe, and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks", Neural Networks, vol. 3, no. 5, pp. 551–560, 1990.
[141] H. H. Hu and K. S. Nayak, "Quantification of absolute fat mass using an adipose tissue reference signal model", Journal of Magnetic Resonance Imaging, vol. 28, no. 6, pp. 1483–1491, 2008.
[142] H. H. Hu, K. S. Nayak, and M. I.
Goran, "Assessment of abdominal adipose tissue and organ fat content by magnetic resonance imaging", Obesity Reviews, vol. 12, no. 5, pp. e504–e515, 2011.
[143] X. Hu, H. Cammann, H.-A. Meyer, K. Miller, K. Jung, and C. Stephan, "Artificial neural networks and prostate cancer – tools for diagnosis and management", Nature Reviews Urology, 2013.
[144] A. Huang, R. Abugharbieh, R. Tam, and A. Traboulsee, "MRI brain extraction with combined expectation maximization and geodesic active contours", in IEEE International Symposium on Signal Processing and Information Technology, IEEE, 2006, pp. 107–111.
[145] H. M. Hudson and R. S. Larkin, "Accelerated Image Reconstruction using Ordered Subsets of Projection Data", IEEE Transactions on Medical Imaging, vol. 13, no. 4, pp. 601–609, 1994.
[146] E. Ippen, J. Lindner, and W. L. Ditto, "Chaotic Resonance: A Simulation", Journal of Statistical Physics, vol. 70, no. 1/2, pp. 437–450, Jan. 1993.
[147] M. W. Jacobson and J. A. Fessler, "An expanded theoretical treatment of iteration-dependent Majorize-Minimize algorithms", IEEE Transactions on Image Processing, vol. 16, no. 10, pp. 2411–2422, 2007.
[148] A. K. Jain, "Data clustering: 50 years beyond k-means", Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2009.
[149] Y. Jin, "Fuzzy Modeling of High-Dimensional Systems: Complexity Reduction and Interpretability Improvement", IEEE Transactions on Fuzzy Systems, vol. 8, no. 2, pp. 212–221, Apr. 2000, issn: 1063-6706.
[150] A. A. Joshi, H. H. Hu, R. M. Leahy, M. I. Goran, and K. S. Nayak, "Automatic intra-subject registration-based segmentation of abdominal fat from water–fat MRI", Journal of Magnetic Resonance Imaging, vol. 37, no. 2, pp. 423–430, 2013.
[151] B. H. Juang and L. R. Rabiner, "Hidden Markov models for speech recognition", Technometrics, vol. 33, no. 3, pp. 251–272, 1991.
[152] J. B. Kadane and L. J.
Wolfson, "Experiences in elicitation", Journal of the Royal Statistical Society: Series D (The Statistician), vol. 47, no. 1, pp. 3–19, 1998.
[153] S. Kakutani, "A generalization of Brouwer's fixed point theorem", Duke Mathematical Journal, vol. 8, no. 3, pp. 457–459, 1941.
[154] T. Kapur, W. Eric L. Grimson, R. Kikinis, and W. Wells, "Enhanced spatial priors for segmentation of magnetic resonance imagery", Medical Image Computing and Computer-Assisted Intervention 98, pp. 457–468, 1998.
[155] T. Kapur, W. Grimson, W. Wells, and R. Kikinis, "Segmentation of brain tissue from magnetic resonance images", Medical Image Analysis, vol. 1, no. 2, pp. 109–127, 1996.
[156] K. Karplus, C. Barrett, and R. Hughey, "Hidden Markov models for detecting remote protein homologies", Bioinformatics, vol. 14, no. 10, pp. 846–856, 1998.
[157] M. Kearns, Y. Mansour, and A. Ng, "An Information-theoretic Analysis of Hard and Soft Assignment Methods for Clustering", in Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1997, pp. 282–293.
[158] H. Kim and B. Kosko, "Fuzzy prediction and filtering in impulsive noise", Fuzzy Sets and Systems, vol. 77, no. 1, pp. 15–33, 1996.
[159] S. Kirkpatrick, C. Gelatt Jr., and M. Vecchi, "Optimization by simulated annealing", Science, vol. 220, no. 4598, pp. 671–680, 1983.
[160] T. Kohonen, "The self-organizing map", Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[161] ——, Self-Organizing Maps. Springer, 2001.
[162] S. Kong and B. Kosko, "Differential competitive learning for centroid estimation and phoneme recognition", IEEE Transactions on Neural Networks, vol. 2, no. 1, pp. 118–124, 1991.
[163] Y. Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering model", in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), ACM, 2008, pp. 426–434, isbn: 978-1-60558-193-4.
doi: 10.1145/1401890.1401944.
[164] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems", Computer, vol. 42, no. 8, pp. 30–37, 2009.
[165] B. Kosko, "Differential Hebbian learning", in AIP Conference Proceedings, vol. 151, 1986, pp. 277–282.
[166] ——, "Fuzzy Entropy and Conditioning", Information Sciences, vol. 40, pp. 165–174, 1986.
[167] ——, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice Hall, 1991.
[168] ——, "Stochastic competitive learning", IEEE Transactions on Neural Networks, vol. 2, no. 5, pp. 522–529, 1991.
[169] ——, "Fuzzy Systems as Universal Approximators", IEEE Transactions on Computers, vol. 43, no. 11, pp. 1329–1333, Nov. 1994.
[170] ——, "Optimal Fuzzy Rules Cover Extrema", International Journal of Intelligent Systems, vol. 10, no. 2, pp. 249–255, Feb. 1995.
[171] ——, Fuzzy Engineering. Prentice Hall, 1996.
[172] ——, "Probable Equality, Superpower Sets, and Superconditionals", International Journal of Intelligent Systems, vol. 19, pp. 1151–1171, Dec. 2004.
[173] ——, Noise. Viking, 2006, isbn: 0670034959.
[174] B. Kosko and S. Mitaim, "Robust stochastic resonance: signal detection and adaptation in impulsive noise", Physical Review E, vol. 64, no. 051110, Oct. 2001.
[175] ——, "Stochastic Resonance in Noisy Threshold Neurons", Neural Networks, vol. 16, no. 5-6, pp. 755–761, Jun. 2003.
[176] B. Kosko, "Adaptive bidirectional associative memories", Applied Optics, vol. 26, no. 23, pp. 4947–4960, 1987.
[177] ——, "Bidirectional associative memories", IEEE Transactions on Systems, Man, and Cybernetics, vol. 18, no. 1, pp. 49–60, 1988.
[178] B. Kosko, I. Lee, S. Mitaim, A. Patel, and M. M. Wilde, "Applications of forbidden interval theorems in stochastic resonance", in Applications of Nonlinear Dynamics, Springer, 2009, pp. 71–89.
[179] B. Kosko and S. Mitaim, "Robust stochastic resonance for simple threshold neurons", Physical Review E, vol. 70, no. 3, p.
031911, 2004.
[180] H. A. Kramers, "Brownian Motion in a Field of Force and the Diffusion Model of Chemical Reactions", Physica, vol. VII, no. 4, pp. 284–304, Apr. 1940.
[181] A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler, "Hidden Markov models in computational biology: applications to protein modeling", Journal of Molecular Biology, vol. 235, no. 5, pp. 1501–1531, 1994.
[182] V. Kůrková, "Kolmogorov's theorem is relevant", Neural Computation, vol. 3, no. 4, pp. 617–622, 1991.
[183] ——, "Kolmogorov's theorem and multilayer neural networks", Neural Networks, vol. 5, no. 3, pp. 501–506, 1992.
[184] S. Kullback and R. A. Leibler, "On information and sufficiency", The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[185] K. Lange, D. R. Hunter, and I. Yang, "Optimization transfer using surrogate objective functions", Journal of Computational and Graphical Statistics, vol. 9, no. 1, pp. 1–20, 2000.
[186] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton, "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment", Science, vol. 262, pp. 208–208, 1993.
[187] C. E. Lawrence and A. A. Reilly, "An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences", Proteins: Structure, Function, and Bioinformatics, vol. 7, no. 1, pp. 41–51, 1990.
[188] I. Lee, B. Kosko, and W. F. Anderson, "Modeling Gunshot Bruises in Soft Body Armor with Adaptive Fuzzy Systems", IEEE Transactions on Systems, Man, and Cybernetics, vol. 35, no. 6, pp. 1374–1390, Dec. 2005.
[189] S. E. Levinson, "Continuously variable duration hidden Markov models for automatic speech recognition", Computer Speech & Language, vol. 1, no. 1, pp. 29–45, 1986.
[190] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: item-to-item collaborative filtering", IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[191] R. J. A.
Little and D. B. Rubin, Statistical Analysis with Missing Data. Wiley, 2002, isbn: 9780471183860.
[192] C. Liu and D. B. Rubin, "The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence", Biometrika, vol. 81, no. 4, p. 633, 1994.
[193] C. Liu, D. B. Rubin, and Y. N. Wu, "Parameter expansion to accelerate EM: the PX-EM algorithm", Biometrika, vol. 85, no. 4, pp. 755–770, 1998.
[194] M. Lorenzo-Valdes, G. Sanchez-Ortiz, R. Mohiaddin, and D. Rueckert, "Segmentation of 4D cardiac MR images using a probabilistic atlas and the EM algorithm", Medical Image Computing and Computer-Assisted Intervention (MICCAI 2003), pp. 440–450, 2003.
[195] T. A. Louis, "Finding the observed information matrix when using the EM algorithm", Journal of the Royal Statistical Society, Series B (Methodological), pp. 226–233, 1982.
[196] M. A. Carreira-Perpiñán, "Gaussian mean shift is an EM algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, 2007.
[197] J. MacQueen, "Some methods for classification and analysis of multivariate observations", in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297.
[198] K. Matsuoka, "Noise injection into inputs in back-propagation learning", IEEE Transactions on Systems, Man and Cybernetics, vol. 22, no. 3, pp. 436–440, 1992.
[199] M. McDonnell, N. Stocks, C. Pearce, and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization. Cambridge University Press, 2008.
[200] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. John Wiley and Sons, 2007, isbn: 9780470191606.
[201] G. J. McLachlan and D. Peel, Finite Mixture Models, ser. Wiley Series in Probability and Statistics: Applied Probability and Statistics. Wiley, 2004, isbn: 9780471654063.
[202] B. McNamara, K. Wiesenfeld, and R. Roy, "Observation of Stochastic Resonance in a Ring Laser", Physical Review Letters, vol.
60, no. 25, pp. 2626–2629, Jun. 1988.
[203] X. L. Meng, "Optimization transfer using surrogate objective functions: discussion", Journal of Computational and Graphical Statistics, vol. 9, no. 1, pp. 35–43, 2000.
[204] X. L. Meng and D. B. Rubin, "Maximum Likelihood Estimation via the ECM algorithm: A general framework", Biometrika, vol. 80, no. 2, p. 267, 1993.
[205] X. L. Meng and D. van Dyk, "The EM algorithm – an old folk-song sung to a fast new tune", Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 59, no. 3, pp. 511–567, 1997.
[206] M. Minsky, "Steps toward Artificial Intelligence", Proceedings of the IRE, vol. 49, no. 1, pp. 8–30, 1961.
[207] B. Mirkin, Mathematical Classification and Clustering, ser. Mathematics and Its Applications. Springer, 1996, isbn: 9780792341598.
[208] S. Mitaim and B. Kosko, "Adaptive Stochastic Resonance", Proceedings of the IEEE: Special Issue on Intelligent Signal Processing, vol. 86, no. 11, pp. 2152–2183, Nov. 1998.
[209] ——, "Neural Fuzzy Agents for Profile Learning and Adaptive Object Matching", Presence: Special Issue on Autonomous Agents, Adaptive Behavior, and Distributed Simulations, vol. 7, no. 6, pp. 617–637, Dec. 1998.
[210] ——, "The Shape of Fuzzy Sets in Adaptive Function Approximation", IEEE Transactions on Fuzzy Systems, vol. 9, no. 4, pp. 637–656, Aug. 2001.
[211] A. Mohamed, G. Dahl, and G. E. Hinton, "Deep belief networks for phone recognition", in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.
[212] A. Mohamed, G. E. Dahl, and G. E. Hinton, "Acoustic modeling using deep belief networks", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[213] A. Mohamed, T. N. Sainath, G. Dahl, B. Ramabhadran, G. E. Hinton, and M. A.
Picheny, "Deep belief networks using discriminative features for phone recognition", in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2011, pp. 5060–5063.
[214] A. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition", in Proc. Interspeech, Citeseer, 2010, pp. 2846–2849.
[215] F. Moss, A. Bulsara, and M. Shlesinger, Eds., Journal of Statistical Physics, Special Issue on Stochastic Resonance in Physics and Biology (Proceedings of the NATO Advanced Research Workshop). Plenum Press, Jan. 1993, vol. 70, no. 1/2.
[216] J. R. Munkres, Topology, 2nd ed. Prentice Hall, 2000.
[217] V. Nair and G. E. Hinton, "3D object recognition with deep belief nets", Advances in Neural Information Processing Systems, vol. 22, pp. 1339–1347, 2009.
[218] R. E. Neapolitan, Learning Bayesian Networks. Prentice Hall, 2004.
[219] J. Neyman and E. S. Pearson, "On the problem of the most efficient tests of statistical hypotheses", Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 231, pp. 289–337, 1933.
[220] S. Ng and G. J. McLachlan, "Using the EM algorithm to train neural networks: misconceptions and a new algorithm for multiclass classification", IEEE Transactions on Neural Networks, vol. 15, no. 3, pp. 738–749, 2004.
[221] C. L. Nikias and M. Shao, Signal Processing with Alpha-Stable Distributions and Applications. John Wiley & Sons, 1995.
[222] D. Oakes, "Direct calculation of the information matrix via the EM", Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 61, no. 2, pp. 479–482, 1999.
[223] T. Orchard and M. A. Woodbury, "A missing information principle: theory and applications", in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1972, pp. 697–715.
[224] M. J. Osborne and A. Rubinstein, A Course in Game Theory.
MIT Press, 1994.
[225] O. Osoba and B. Kosko, "Corrigendum to 'Noise enhanced clustering and competitive learning algorithms' [Neural Netw. 37 (2013) 132–140]", Neural Networks, no. 0, 2013, issn: 0893-6080. doi: 10.1016/j.neunet.2013.05.004.
[226] ——, "Noise-Enhanced Clustering and Competitive Learning Algorithms", Neural Networks, vol. 37, no. 0, pp. 132–140, Jan. 2013, issn: 0893-6080. doi: 10.1016/j.neunet.2012.09.012.
[227] O. Osoba, S. Mitaim, and B. Kosko, "Adaptive fuzzy priors for Bayesian inference", in The 2009 International Joint Conference on Neural Networks (IJCNN), IEEE, 2009, pp. 2380–2387.
[228] ——, "Bayesian inference with adaptive fuzzy priors and likelihoods", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 5, pp. 1183–1197, Oct. 2011, issn: 1083-4419. doi: 10.1109/TSMCB.2011.2114879.
[229] ——, "Noise Benefits in the Expectation-Maximization Algorithm: NEM Theorems and Models", in The International Joint Conference on Neural Networks (IJCNN), IEEE, 2011, pp. 3178–3183.
[230] ——, "Triply fuzzy function approximation for Bayesian inference", in The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, 2011, pp. 3105–3111.
[231] ——, "The Noisy Expectation-Maximization Algorithm", in review, 2012.
[232] ——, "Triply fuzzy function approximation for hierarchical Bayesian inference", Fuzzy Optimization and Decision Making, pp. 1–28, 2012.
[233] E. Ott, Chaos in Dynamical Systems. Cambridge University Press, 2002, isbn: 9780521010849.
[234] A. B. Owen, Empirical Likelihood, ser. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2010, isbn: 9781584880714.
[235] A. B. Owen, "Empirical likelihood ratio confidence intervals for a single functional", Biometrika, vol. 75, no. 2, pp. 237–249, 1988. doi: 10.1093/biomet/75.2.237.
[236] A. Patel and B.
Kosko, "Noise Benefits in Quantizer-Array Correlation Detection and Watermark Decoding", IEEE Transactions on Signal Processing, vol. 59, no. 2, pp. 488–505, Feb. 2011, issn: 1053-587X.
[237] A. Patel, "Noise benefits in nonlinear signal processing", PhD thesis, University of Southern California, 2009.
[238] Y. Pawitan, In All Likelihood: Statistical Modelling and Inference Using Likelihood, ser. Oxford Science Publications. Oxford University Press, 2001, isbn: 9780198507659.
[239] H. W. Peers, "On confidence points and Bayesian probability points in the case of several parameters", Journal of the Royal Statistical Society, Series B (Methodological), pp. 9–16, 1965.
[240] ——, "Confidence properties of Bayesian interval estimates", Journal of the Royal Statistical Society, Series B (Methodological), pp. 535–544, 1968.
[241] B. Pellom and K. Hacioglu, "Recent improvements in the CU Sonic ASR system for noisy speech: the SPINE task", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), IEEE, vol. 1, 2003, pp. I-4.
[242] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit", in Proc. ASRU, 2011.
[243] J. Pratt, "F. Y. Edgeworth and R. A. Fisher on the Efficiency of Maximum Likelihood Estimation", The Annals of Statistics, vol. 4, no. 3, pp. 501–514, 1976.
[244] F. Proschan, "Theoretical Explanation of Observed Decreasing Failure Rate", Technometrics, pp. 375–383, 1963.
[245] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Prentice Hall: Englewood Cliffs, NJ, 1993.
[246] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[247] H. Raiffa and R. Schlaifer, Applied Statistical Decision Theory. John Wiley & Sons, 2000.
[248] R. A. Redner and H. F.
Walker, "Mixture Densities, Maximum Likelihood and the EM algorithm", SIAM Review, vol. 26, no. 2, pp. 195–239, 1984.
[249] M. Reilly and E. Lawlor, "A likelihood-based method of identifying contaminated lots of blood product", International Journal of Epidemiology, vol. 28, no. 4, pp. 787–792, 1999.
[250] I. Rish, "An empirical study of the naive Bayes classifier", in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, 2001, pp. 41–46.
[251] C. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed. Springer, 2004, isbn: 9780387212395.
[252] K. Robison, A. M. McGuire, and G. M. Church, "A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome", Journal of Molecular Biology, vol. 284, no. 2, pp. 241–254, 1998.
[253] S. Ross, A First Course in Probability, 7th ed. Prentice Hall, 2005.
[254] D. B. Rubin, "Inference and missing data", Biometrika, vol. 63, no. 3, pp. 581–592, 1976.
[255] ——, "Missing data, imputation, and the bootstrap: comment", Journal of the American Statistical Association, pp. 475–478, 1994.
[256] ——, "Multiple imputation after 18+ years", Journal of the American Statistical Association, vol. 91, no. 434, pp. 473–489, 1996.
[257] W. Rudin, Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, New York, 1976.
[258] ——, Functional Analysis. McGraw-Hill, 1977.
[259] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors", Nature, vol. 323, pp. 533–536, 1986.
[260] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, "The RWTH Aachen University open source speech recognition system", in Proc. Interspeech, 2009, pp. 2111–2114.
[261] C. Sabatti and K. Lange, "Genomewide motif identification using a dictionary model", Proceedings of the IEEE, vol. 90, no. 11, pp. 1803–1810, 2002.
[262] T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A.
Mohamed, "Making deep belief networks effective for large vocabulary continuous speech recognition", in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2011), IEEE, 2011, pp. 30–35.
[263] G. Samorodnitsky and M. S. Taqqu, Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance, ser. Stochastic Modeling. Chapman & Hall/CRC, 1994, isbn: 9780412051715.
[264] L. J. Savage, "The theory of statistical decision", Journal of the American Statistical Association, vol. 46, no. 253, pp. 55–67, 1951.
[265] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art", Psychological Methods, vol. 7, no. 2, p. 147, 2002.
[266] M. Segal and E. Weinstein, "The Cascade EM algorithm", Proceedings of the IEEE, vol. 76, no. 10, pp. 1388–1390, 1988.
[267] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks", in Proc. Interspeech, 2011, pp. 437–440.
[268] L. A. Shepp and Y. Vardi, "Maximum likelihood reconstruction for emission tomography", IEEE Transactions on Medical Imaging, vol. 1, no. 2, pp. 113–122, 1982.
[269] D. Shilane, J. Martikainen, S. Dudoit, and S. J. Ovaska, "A general framework for statistical performance comparison of evolutionary computation algorithms", Information Sciences, vol. 178, no. 14, pp. 2870–2879, 2008.
[270] P. Smolensky, "Information processing in dynamical systems: foundations of harmony theory", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, 1986, pp. 194–281.
[271] S. A. Solla, E. Levin, and M. Fleisher, "Accelerated learning in layered neural networks", Complex Systems, vol. 2, no. 6, pp. 625–639, 1988.
[272] H. Soltau, G. Saon, and B. Kingsbury, "The IBM Attila speech recognition toolkit", in IEEE Spoken Language Technology Workshop (SLT 2010), IEEE, 2010, pp. 97–102.
[273] M. Spivak, Calculus. Cambridge University Press, 2006, isbn: 9780521867443.
[274] M. H.
Stone, \The generalized Weierstrass approximation theorem", Mathe- matics Magazine, vol. 21, no. 5, pp. 237{254, 1948. [275] G. D. Stormo and G. W. Hartzell III, \Identifying protein-binding sites from unaligned dna fragments", Proceedings of the National Academy of Sciences, vol. 86, no. 4, pp. 1183{1187, 1989. [276] R. Sundberg, \Maximum likelihood theory for incomplete data from an expo- nential family", Scandinavian Journal of Statistics, pp. 49{58, 1974. [277] ||, \An iterative method for solution of the likelihood equations for incom- plete data from exponential families", Communication in Statistics-Simulation and Computation, vol. 5, no. 1, pp. 55{64, 1976. [278] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, \Gener- ating facial expressions with deep belief nets", Aective Computing, Emotion Modelling, Synthesis and Recognition, pp. 421{440, 2008. [279] M. T. Tan, G. Tian, and K. W. Ng, Bayesian Missing Data Problems: EM, Data Augmentation and Noniterative Computation. CRC Press, 2010. [280] M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, ser. Springer Series in Statistics. Springer, 1996, isbn: 9780387946887. [281] H. Teicher, \Identiability of mixtures", The Annals of Mathematical Statistics, vol. 32, no. 1, pp. 244{248, 1961. [282] ||, \Identiability of nite mixtures", The Annals of Mathematical Statistics, vol. 34, no. 4, pp. 1265{1269, 1963. [283] T. Terano, K. Asai, and M. Sugeno, Fuzzy Systems Theory and Its Applications. Academic Press, 1987. [284] Y. Z. Tsypkin, Foundations of the Theory of Learning Systems. Academic Press, 1973. 239 [285] A Van Ooyen and B Nienhuis, \Improving the convergence of the back- propagation algorithm", Neural Networks, vol. 5, no. 3, pp. 465{471, 1992. [286] Y Vardi, L. A. Shepp, and L. Kaufman, \A statistical model for positron emission tomography", Journal of the American Statistical Association, vol. 80, no. 389, pp. 
8{20, 1985. [287] J. von Neumann and O. Morgenstern, \Theory of games and economic behav- ior", Princeton University, Princeton, 1947. [288] A. Wald, \Tests of statistical hypotheses concerning several parameters when the number of observations is large", Transactions of the American Mathemat- ical Society, pp. 426{482, 1943. [289] ||, \Note on the consistency of the maximum likelihood estimate", The Annals of Mathematical Statistics, vol. 20, no. 4, pp. 595{601, 1949. [290] ||, Statistical Decision Functions. Wiley, 1950. [291] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, \Sphinx-4: a exible open source framework for speech recognition", 2004. [292] J. Wang, A. Dogandzic, and A. Nehorai, \Maximum likelihood estimation of compound-gaussian clutter and target parameters", IEEE Transactions on Signal Processing, vol. 54, no. 10, pp. 3884{3898, 2006. [293] F. A. Watkins, Fuzzy Engineering. Ph.D. Dissertation, Department of Electrical Engineering, UC Irvine, 1994. [294] ||, \The Representation Problem for Additive Fuzzy Systems", in Proceedings of the IEEE International Conference on Fuzzy Systems (IEEE FUZZ-95), vol. 1, Mar. 1995, pp. 117{122. [295] G. C. G. Wei and M. A. Tanner, \A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms", Journal of the American Statistical Association, vol. 85, no. 411, pp. 699{704, 1990. [296] B. L. Welch, \On comparisons between condence point procedures in the case of a single parameter", Journal of the Royal Statistical Society. Series B (Methodological), pp. 1{8, 1965. 240 [297] B. L. Welch and H. W. Peers, \On formulae for condence points based on integrals of weighted likelihoods", Journal of the Royal Statistical Society. Series B (Methodological), pp. 318{329, 1963. [298] L. R. Welch, \Hidden Markov models and the Baum-Welch algorithm", IEEE Information Theory Society Newsletter, vol. 53, no. 4, pp. 1{14, 2003. [299] P. J. 
Werbos, \Backpropagation Through Time: What It Does and How to Do It", Proceedings of the IEEE, vol. 78, no. 10, pp. 1550{1560, 1990. [300] P. Werbos, \Beyond regression: new tools for prediction and analysis in the behavioral sciences", PhD thesis, Harvard University, Boston, 1974. [301] J. G. Wilpon, L. R. Rabiner, C Lee, and E. R. Goldman, \Automatic recognition of keywords in unconstrained speech using hidden Markov models", Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 38, no. 11, pp. 1870{ 1878, 1990. [302] D. H. Wolpert and W. G. Macready, \No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67{82, 1997. [303] C. F. J. Wu, \On the Convergence Properties of the EM Algorithm", The Annals of Statistics, vol. 11, no. 1, pp. 95{103, 1983. [304] T. T. Wu and K. Lange, \The MM alternative to EM", Statistical Science, vol. 25, no. 4, pp. 492{505, 2010. [305] X. Wu, V. Kumar, R. J. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, et al., \Top 10 algorithms in data mining", Knowledge and Information Systems, vol. 14, no. 1, pp. 1{37, 2008. [306] L. Xu and M. I. Jordan, \On convergence properties of the EM algorithm for gaussian mixtures", Neural computation, vol. 8, no. 1, pp. 129{151, 1996. [307] R. Xu and D. Wunsch, \Survey of clustering algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645{678, 2005. [308] R. Xu and D. C. Wunsch, Clustering. IEEE Press & Wiley, 2009. [309] R. Xu and D. Wunsch, \Clustering algorithms in biomedical research: a review", IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120{154, 2010. 241 [310] J. Yamato, J. Ohya, and K. Ishii, \Recognizing human action in time-sequential images using hidden Markov model", in Computer Vision and Pattern Recog- nition, 1992. Proceedings CVPR'92., 1992 IEEE Computer Society Conference on, IEEE, 1992, pp. 379{385. [311] S. Young, G. Evermann, D. Kershaw, G. Moore, J. 
Odell, D. Ollason, V. Valtchev, and P. Woodland, \The HTK book", Cambridge University Engi- neering Department, vol. 3, 2002. [312] T. J. Ypma, \Historical development of the Newton-Raphson method", SIAM review, vol. 37, no. 4, pp. 531{551, 1995. [313] L. A. Zadeh, \Fuzzy sets", Information and Control, vol. 8, pp. 338{353, 1965. [314] F. Zambelli, G. Pesole, and G. Pavesi, \Motif discovery and transcription factor binding sites before and after the next-generation sequencing era", Briengs in Bioinformatics, vol. 14, no. 2, pp. 225{237, 2013. doi: 10.1093/bib/bbs016. [315] W. I. Zangwill, Nonlinear Programming: A Unied Approach. Prentice-Hall Englewood Clis, NJ, 1969. [316] J. Zhang, \The mean eld theory in EM procedures for Markov random elds", IEEE Transactions on Signal Processing, vol. 40, no. 10, pp. 2570{2583, 1992. [317] ||, \The mean eld theory in EM procedures for blind Markov random eld image restoration", IEEE Transactions on Image Processing, vol. 2, no. 1, pp. 27{40, 1993. [318] T. Zhang, R. Ramakrishnan, and M. Livny, \BIRCH: an ecient data clustering method for very large databases", in ACM SIGMOD Record, vol. 25, 1996, pp. 103{114. [319] Y. Zhang, M. Brady, and S. Smith, \Segmentation of Brain MR Images through a Hidden Markov Random Field Model and the Expectation-Maximization Algorithm", IEEE Transactions on Medical Imaging, vol. 20, no. 1, pp. 45{57, 2001. [320] ||, \Segmentation of brain MR images through a hidden Markov random eld model and the expectation-maximization algorithm", IEEE Transactions on Medical Imaging, vol. 20, no. 1, pp. 45{57, 2001. 242 [321] V. M. Zolotarev, One-Dimensional Stable Distributions, ser. Translations of Mathematical Monographs. American Mathematical Society, 1986, isbn: 9780821845196. 243
Abstract
This dissertation shows that careful injection of noise into sample data can substantially speed up Expectation-Maximization algorithms. Expectation-Maximization algorithms are a class of iterative algorithms for extracting maximum likelihood estimates from corrupted or incomplete data. The convergence speed-up is an example of a noise benefit or "stochastic resonance" in statistical signal processing. The dissertation derives sufficient conditions for such noise benefits and demonstrates the speed-up in several ubiquitous signal-processing algorithms: parameter estimation for mixture models, the k-means clustering algorithm, the Baum-Welch algorithm for training hidden Markov models, and backpropagation for training feedforward artificial neural networks. The dissertation also analyzes the effects of data and model corruption on the more general Bayesian inference estimation framework. The main finding is a theorem guaranteeing that uniform approximators for Bayesian model functions produce uniform approximators for the posterior pdf via Bayes' theorem. This result also applies to hierarchical and multidimensional Bayesian models.
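The abstract's central idea, injecting noise into the data before the E-step of an EM run, can be illustrated on the simplest EM workhorse: a two-component 1-D Gaussian mixture. The sketch below is a toy, not the dissertation's method: the function name, the crude initialization, and the 1/t "cooling" schedule for the noise level are illustrative assumptions, and the sketch does not check the sufficient conditions the dissertation derives for a guaranteed noise benefit.

```python
import math
import random

def em_gmm(data, iters=50, noise_scale=0.0, seed=0):
    """EM for a two-component 1-D Gaussian mixture.

    If noise_scale > 0, zero-mean Gaussian noise with a decaying
    standard deviation (an illustrative 1/t "cooling" schedule) is
    added to each sample before the E-step, a toy stand-in for the
    noisy-EM idea.
    """
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]  # crude init
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for t in range(iters):
        sd = noise_scale / (t + 1)  # decaying noise level
        xs = [x + rng.gauss(0.0, sd) for x in data] if sd > 0 else data
        # E-step: posterior responsibility of each component for each sample
        resp = []
        for x in xs:
            w = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = (w[0] + w[1]) or 1e-300  # guard against joint underflow
            resp.append([w[0] / s, w[1] / s])
        # M-step: re-estimate mixing weights, means, and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi
```

On well-separated synthetic data both the noiseless run (noise_scale=0) and a lightly noisy run recover the component means; whether the noisy run converges faster on average is exactly what the dissertation's sufficient conditions govern, and this sketch makes no such guarantee.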
Asset Metadata
Creator: Osoba, Osonde Adekorede (author)
Core Title: Noise benefits in expectation-maximization algorithms
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 07/22/2013
Defense Date: 06/18/2013
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: Bayesian statistics, expectation-maximization algorithm, maximum likelihood estimation, neural networks, noise benefits, OAI-PMH Harvest
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kosko, Bart A. (committee chair), Moore, James Elliott, II (committee member), Ortega, Antonio K. (committee member)
Creator Email: ope.osoba@gmail.com, osondeos@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-294341
Unique identifier: UC11294488
Identifier: etd-OsobaOsond-1807.pdf (filename), usctheses-c3-294341 (legacy record id)
Legacy Identifier: etd-OsobaOsond-1807.pdf
Dmrecord: 294341
Document Type: Dissertation
Rights: Osoba, Osonde Adekorede
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA