HIGH-CAPACITY FEEDBACK NEURAL NETWORKS by Olaoluwa Adekola Adigun A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) August 2022 Copyright 2022 Olaoluwa Adekola Adigun Acknowledgements I am deeply grateful to my advisor Professor Bart Kosko for his motivation, unwavering support, and patience in the course of my graduate studies. I am thankful for his commitment to my academic and personal development over the past few years. I could not imagine having a better advisor and mentor. I would also like to express my deep appreciation to my other committee members: Professor C.-C. Jay Kuo, Professor Edmond Jonckheere, and Professor James Moore. Their insightful suggestions and comments were invaluable to this dissertation. I would like to express my sincere gratitude to my family for their steadfast support and belief in me. I am forever indebted to my parents for their sacrifice and for laying a strong foundation for my educational pursuits. I would also like to thank my friends and colleagues for their support and immense contribution to the success of this journey. ii Table of Contents Acknowledgements ii List of Tables viii List of Figures xi Abstract xii 1 Preview of Dissertation Results 1 1.1 Bidirectional Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Blocking and High-Capacity Neural Classifiers . . . . . . . . . . . . . . . . 5 1.3 Noisy Recurrent Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . 9 2 Bidirectional Backpropagation (B-BP) Algorithm 12 2.1 A Review of Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.1 Unidirectional Backpropagation: Probabilistic Description . . . . . . 13 2.1.2 Backpropagation: Learning Laws . . . . . . . . . . . . . . . . . . . . 16 2.1.3 Expectation Maximization (EM) Algorithm . . . . . . . . . . . . . . 21 2.1.4 Backpropagation as Generalized EM Algorithm . . . . . . . . . . . . 24 2.1.5 A Brief History of Backpropagation . . . . . . . . . . . . . . . . . . 26 2.2 Bidirectional Neural Representation . . . . . . . . . . . . . . . . . . . . . . 27 2.2.1 Bidirectional Associative Memories Extend to Multilayer Networks . 29 2.2.2 Exact Bidirectional Representation of Bipolar Permutations . . . . . 30 2.3 Bidirectional Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.3.1 Bidirectional Backpropagation: Maximum Likelihood Estimation . . 41 2.4 Bayesian Bidirectional Backpropagation . . . . . . . . . . . . . . . . . . . . 46 2.5 Learning Noninvertible Functions with Bidirectional Backpropagation . . . 49 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 iii 3 Bidirectional Backpropagation Application: Deep Neural Classifiers 52 3.1 Neural Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.1.1 Implicit Bidirectionality of Neural Classifiers . . . . . . . . . . . . . 53 3.1.2 Bidirectional Backpropagation Training of Neural Classifiers . . . . . 57 3.2 Convolutional Neural Networks and Bidirectional Representation . . . . . . 59 3.2.1 Convolution Filters and Feedforward Networks . . . . . . . . . . . . 60 3.2.2 Convolution Filters and Bidirectional Networks . . . . . . . . . . . . 62 3.3 Bayesian Bidirectional Backpropagation Training . . . . . . . . . . . . . . . 63 3.4 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
65 3.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.4.2 Network Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4 Bidirectional Backpropagation Application: Generative Adversarial Net- works (GANs) 75 4.1 A Review of Generative Adversarial Networks . . . . . . . . . . . . . . . . . 75 4.1.1 Architecture of Generative Adversarial Networks . . . . . . . . . . . 76 4.1.2 Training Generative Adversarial Networks . . . . . . . . . . . . . . . 76 4.2 Generative Adversarial Network and Bidirectional Backpropagation . . . . . 78 4.2.1 Training Vanilla GANs with Bidirectional Backpropagation . . . . . 79 4.2.2 Training Wasserstein GANs with Bidirectional Backpropagation . . 81 4.3 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3.2 Performance metrics for GAN models . . . . . . . . . . . . . . . . . 81 4.3.3 Training Vanilla GANs with Bidirectional Backpropagation . . . . . 83 4.3.4 Training Deep-Convolutional GANs with Bidirectional Backpropagation 86 4.3.5 Training Wasserstein GANs with Bidirectional Backpropagation . . 88 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5 High Capacity Neural Classifiers 92 5.1 Logistic versus Softmax Output Neurons . . . . . . . . . . . . . . . . . . . . 92 5.1.1 Activation, Decision Rule, and Error Function . . . . . . . . . . . . . 93 5.1.2 Random Coding with Bipolar Codewords . . . . . . . . . . . . . . . 97 5.1.3 Backpropagation Invariance and Network Likelihood . . . . . . . . . 98 5.2 Blocking and Deep-sweep Training . . . . . . . . . . . . . . . . . . . . . . . 99 5.2.1 Training with Deep-sweep . . . . . . . . . . . . . . . . . . . . . . . 100 5.2.2 Deep-sweep with Bidirectional Backpropagation . . . . . . . . . . . . 102 iv 5.3 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.3.2 Logistic Coding and Deep-sweep Training . . . . . . . . . . . . . . . 106 5.3.3 Logistic Coding, Deep-sweep, and Bidirectional Backpropagation . . 111 5.3.4 Logistic Coding and Convolutional Neural Classifiers . . . . . . . . . 115 5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 6 Deeper Neural Networks: Non-Vanishing (NoVa) Hidden Units 120 6.1 The Search for a Better Hidden Neuron . . . . . . . . . . . . . . . . . . . . 120 6.1.1 Review of Old Hidden Neurons: Activation Function and Derivative 121 6.2 Nonvanishing (NoVa) Units . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.2.1 Generalized Nonvanishing (G-NoVa) Units . . . . . . . . . . . . . . 126 6.3 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.3.2 Network Description and Training Parameters . . . . . . . . . . . . 129 6.3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 130 6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7 Noise-Boosted Recurrent Backpropagation 134 7.1 Backpropagation and the Expectation-Maximization (EM) Algorithm . . . 
134 7.1.1 A Review of Backpropagation as Generalized EM . . . . . . . . . . 135 7.1.2 Noise-Boosting the EM Algorithm . . . . . . . . . . . . . . . . . . . 136 7.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 7.2.1 Long-Short-Term Memory (LSTM) Recurrent Neural Networks . . . 139 7.2.2 Gated Recurrent Unit (GRU) Recurrent Neural Networks . . . . . . 142 7.2.3 Recurrent Backpropagation Training . . . . . . . . . . . . . . . . . . 144 7.2.4 Recurrent Backpropagation and EM Connection . . . . . . . . . . . 146 7.3 NEM Noise Injection in Recurrent Network’s Output Neurons . . . . . . . . 147 7.3.1 NEM Noise injection in Recurrent Neural Classifiers . . . . . . . . . 148 7.3.2 NEM Noise injection in Recurrent Regression Networks . . . . . . . 151 7.4 NEM Noise Injection in the Hidden Layers of a Recurrent Neural Network . 156 7.4.1 NEM Noise Injection in the Hidden Units of Long Short-Term Memory (LSTM) RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.4.2 NEM Noise Injection in the Hidden Units of Gated-Recurrent-Unit (GRU) RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 7.5 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 7.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 7.5.2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 v 7.5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 8 Future Work 177 8.1 Big-K Simulation Databases . . . . . . . . . . . . . . . . . . . . . . . . . . 177 8.2 Improved Random Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 8.3 Probabilistic and Dynamical Analysis of Bidirectionality . . . . . . . . . . . 178 Bibliography 180 vi List of Tables 2.1 A Three-Bit Bipolar Permutation Function . . . . . . . . . . . . . . . . . . 29 2.2 Forward pass inference with a 3-bit bipolar permutation . . . . . . . . . . . 38 2.3 Backward pass inference with a 3-bit bipolar permutation . . . . . . . . . . 38 3.1 CIFAR-10: Bayesian bidirectional backpropagation performed best with multilayered perceptron models . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.2 CIFAR-10: Training with raw images versus PixelHop++ features . . . . . 69 3.3 CIFAR-10: Bayesian bidirectional backpropagation performed best with convolutional neural classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.4 EMNIST-47: Bayesian bidirectional backpropagation performed best with multilayered perceptron models . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.5 EMNIST: Training with raw images versus PixelHop++ features . . . . . . 73 4.1 MNIST: Bidirectional BP training with vanilla GANs . . . . . . . . . . . . 85 4.2 MNIST: Bidirectional start-point with vanilla GANs . . . . . . . . . . . . . 85 4.3 CIFAR-10: Bidirectional BP training with DCGANs . . . . . . . . . . . . . 86 4.4 CIFAR-10: Bidirectional start-point with DCGANs . . . . . . . . . . . . . . 86 4.5 MNIST: Bidirectional BP training with Wasserstein GANs . . . . . . . . . 88 4.6 MNIST: Bidirectional start-point with Wasserstein GANs . . . . . . . . . . 88 4.7 CIFAR-10: Bidirectional BP training with Wasserstein GANs . . . . . . . . 91 4.8 CIFAR-10: Bidirectional start-point with WGANs . . . . . . . . . . . . . . 91 5.1 Output logistic and softmax neurons . . . . . . . . . . . . . . . . . . . . . . 
97 5.2 Logistic activations outperformed softmax activation . . . . . . . . . . . . . 108 5.3 Random bipolar coding scheme with neural classifiers . . . . . . . . . . . . 108 5.4 Deep-sweep with ordinary backpropagation . . . . . . . . . . . . . . . . . . 111 5.5 Finding the best block size with deep-sweep training . . . . . . . . . . . . . 113 5.6 Logistic activation versus Softmax activation with bidirectional BP . . . . . 114 vii 5.7 Random logistic coding scheme with deep neural classifiers and bidirectional backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.8 Deep-sweep with bidirectional BP . . . . . . . . . . . . . . . . . . . . . . . . 115 5.9 Blocking deep neural classifiers with bidirectional backpropagation . . . . . 116 5.10 Effect of code length on the classification accuracy of neural classifiers . . . 118 5.11 Logistic output activation improved convolutional neural classifiers: . . . . . 118 6.1 Experimental Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.2 Finding the best hidden neuron with CIFAR-10 classification . . . . . . . . 129 6.3 Finding the best hidden neuron with CIFAR-100 classification . . . . . . . . 132 6.4 Finding the best hidden neuron with Caltech-256 classification . . . . . . . 132 7.1 NEM-benefit in classification accuracy with LSTM classifiers . . . . . . . . 171 7.2 NEM-speed-up with LSTM classifiers . . . . . . . . . . . . . . . . . . . . . . 171 7.3 NEM-benefit in classification accuracy with GRU classifiers . . . . . . . . . 172 7.4 NEM-speed-up with GRU classifiers . . . . . . . . . . . . . . . . . . . . . . 172 7.5 NEM-benefit on mean squared error with LSTM regression networks . . . . 174 7.6 NEM-speed-up with LSTM regression networks . . . . . . . . . . . . . . . . 174 7.7 NEM-benefit in mean squared error with GRU regression networks . . . . . 176 7.8 NEM-speed-up with GRU regression networks . . . . . . . . . . . . . . . . . 176 8.1 Big-K simulation databases . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 viii List of Figures 1.1 Bayesian Bidirectional Backpropagation . . . . . . . . . . . . . . . . . . . . 2 1.2 Running a neural classifier in reverse . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Unidirectional versus bidirectional error reduction . . . . . . . . . . . . . . 5 1.4 Blocking: Long versus deep networks . . . . . . . . . . . . . . . . . . . . . . 6 1.5 Softmax 1-in-K coding versus logistic coding . . . . . . . . . . . . . . . . . 7 1.6 NoVa (nonvanishing) hidden neuron . . . . . . . . . . . . . . . . . . . . . . 8 1.7 Additive NEM noise with recurrent backpropagation . . . . . . . . . . . . . 10 2.1 Feedforward network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Bidirectional representation of a three-bit bipolar permutation . . . . . . . 28 2.3 Bidirectional network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Exact bidirectional representation of bipolar permutations . . . . . . . . . . 35 2.5 Bidirectional BP allows fewer neurons than the bidirectional permutation theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.6 Bidirectional BP avoids overwriting . . . . . . . . . . . . . . . . . . . . . . . 41 2.7 Bayesian Bidirectional Backpropagation . . . . . . . . . . . . . . . . . . . . 48 2.8 Bidirectional representation of a noninvertible function . . . . . . . . . . . . 49 3.1 Running a neural classifier in reverse . . . . . . . . . . . . . . . . . . . . . . 
54 3.2 Running a convolutional neural classifier in reverse . . . . . . . . . . . . . . 55 3.3 Class centroids versus their projections . . . . . . . . . . . . . . . . . . . . 56 3.4 Convolution filters with feedforward networks . . . . . . . . . . . . . . . . . 61 3.5 Bidirectional approximation with convolutional networks . . . . . . . . . . . 61 3.6 Convolution operation with 2D input . . . . . . . . . . . . . . . . . . . . . . 62 3.7 Transposed convolution operation . . . . . . . . . . . . . . . . . . . . . . . . 62 3.8 Pooling operation with 2D convolution . . . . . . . . . . . . . . . . . . . . . 62 3.9 Padding operation with 2D convolution . . . . . . . . . . . . . . . . . . . . 63 3.10 Training with Bayesian bidirectional backpropagation . . . . . . . . . . . . 64 3.11 Experimental datasets for training neural classifiers with B-BP . . . . . . . 67 ix 3.12 Architecture of convolutional neural classfiers . . . . . . . . . . . . . . . . . 68 3.13 Backward-pass recall in deep classifiers . . . . . . . . . . . . . . . . . . . . . 70 3.14 CIFAR-10: Backward-pass recall with deep convolutional neural classifiers . 71 3.15 EMNIST47: Backward-pass recall in deep classifiers . . . . . . . . . . . . . 72 4.1 Architecture of a generative adversarial network . . . . . . . . . . . . . . . . 77 4.2 Datasets for GAN simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3 Vanilla GAN and MNIST dataset: Bidirectional BP reduces mode collapse 84 4.4 CIFAR-10: Bidirectional BP improved deep-convolutional GANs . . . . . . 87 4.5 MNIST: Bidirectional BP improved Wasserstein GANs . . . . . . . . . . . . 89 4.6 CIFAR-10: Bidirectional BP slightly improved Wasserstein GANs . . . . . . 90 5.1 High capacity networks with a number of classes K . . . . . . . . . . . . . . 93 5.2 Random logistic codewords . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.3 Blocking a deep neural network . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.4 Blocking with bidirectional backpropagation . . . . . . . . . . . . . . . . . . 102 5.5 Simulation data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6 Logistic activations outperformed softmax activation . . . . . . . . . . . . . 107 5.7 Random bipolar coding with neural classifiers . . . . . . . . . . . . . . . . . 109 5.8 Random bipolar coding and unidirectional BP . . . . . . . . . . . . . . . . . 110 5.9 Deep-sweep improved the performance of random logistic codewords . . . . 112 5.10 Random logistic codewords with BBP . . . . . . . . . . . . . . . . . . . . . 112 5.11 Deep-sweep, logistic coding, and bidirectional backpropagation . . . . . . . 115 5.12 Pre-training on Bigger-K improved logistic coding . . . . . . . . . . . . . . 117 6.1 Activation functions for multilayer neural networks . . . . . . . . . . . . . . 122 6.2 Derivative of common activation functions . . . . . . . . . . . . . . . . . . . 123 6.3 Activation function and derivative of NoVa units . . . . . . . . . . . . . . . 125 6.4 Activation function and derivative of G-NoVa units . . . . . . . . . . . . . . 126 6.5 Simulation data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.6 Benefit of using G-NoVa with ordinary backpropagation . . . . . . . . . . . 130 6.7 Benefit of G-NoVa with deep neural classifiers using bidirectional BP . . . . 131 7.1 Additive NEM-noise condition with neural classifiers . . . . . . . . . . . . . 150 7.2 Multiplicative NEM-noise condition with neural classifiers . . . . . . . . . . 
152 7.3 Additive NEM-noise condition with regression neural networks . . . . . . . 155 7.4 Multiplicative NEM-noise condition with regression networks . . . . . . . . 156 7.5 Image frames of a sport-action video . . . . . . . . . . . . . . . . . . . . . . 167 7.6 Image extraction with inception-v3 network . . . . . . . . . . . . . . . . . . 168 x 7.7 US dollar to Indian rupee time-series dataset . . . . . . . . . . . . . . . . . 169 7.8 Noise-benefit with LSTM classifiers . . . . . . . . . . . . . . . . . . . . . . . 170 7.9 Additive NEM-benefit with LSTM regression networks . . . . . . . . . . . . 173 7.10 Time-series prediction with LSTM regression networks . . . . . . . . . . . . 174 7.11 Additive NEM-benefit with GRU regression networks . . . . . . . . . . . . . 175 xi Abstract This dissertation shows that three new feedback-based architectures increase the pattern capacity and the performance of deep neural networks: (1) bidirectional backpropagation and its Bayesian extension, (2) blocking networks with random logistic output coding and the new NoVa (non-vanishing) hidden neurons, and (3) noise-boosted recurrent backpropagation for time-varying classification and regression. The first and central contribution of this thesis is the new bidirectional backpropagation (B-BP)algorithm. B-BPtrainsaclassifierorregressionnetworkbothforwardsandbackwards through the same network of synapses and neurons. This introduces a feedback structure into the network’s training and its probabilistic structure. B-BP tends to improve classification accuracy at little extra cost compared with ordinary forward-only BP. Ordinary BP ignores such backward training. So it ignores the hidden regressor in the backward direction when training a classifier. Bayesian B-BP further allows prior probabilities to shape the optimization of the network’s global posterior probability structure. B-BP also improves the classification performance of generative adversarial networks. The second main contribution extends classifier networks to bidirectional blocks of networks with random logistic coding at the terminal layers of the interior blocks. Output logistic neurons allow the choice of binary codewords for K patterns from any of the 2 K vertices of a binary hypercube. Output softmax neurons limit this choice to the K vertices of the simplex embedded in the hypercube. Random logistic coding allows the same network to store and recall far more patterns than simple softmax 1-in-K encoding. The new NoVa hidden neurons allow still deeper and more powerful networks per block before the problem xii of the vanishing gradient takes its toll. NoVa neurons tended to outperform and “live” longer than threshold-linear ReLU (rectified linear unit) hidden neurons in very deep classifier networks that train on a large number K of patterns. The third main contribution is the new noisy recurrent backpropagation algorithm for time-varying signals or patterns. This feedback architecture uses the recent reduction of the backpropagation algorithm to the Expectation-Maximization algorithm for iterative maximum-likelihood estimation. This noise boost makes the current training signal more probable as the system climbs the nearest hill of log-likelihood, The noise boost improves both classification and regression performance. Current neural classifiers tend to work with a small number K of patterns. The number K may be only in the hundreds. 
Future networks will have to work with much larger numbers of patterns in the hundreds of thousands or millions. This will require new neural architectures and learning algorithms that scale with such large K. The new techniques in this thesis offer a step in that direction. xiii Chapter 1 Preview of Dissertation Results This chapter gives an overview of the main results in the dissertation. The next chapters present these results in far greater mathematical and simulation detail. The three key results help neural networks scale to large K patterns and enhance their performance. The first and main contribution is the new bidirectional backpropagation algorithm and its Bayesian extension. This dissertation further introduces new methods that increase the pattern capacity of a deep neural classifier along its width and depth. The final contribution is the noisy recurrent backpropagation (RBP) algorithm for time-varying signals. This injects the beneficial noisy Expectation-Maximization (NEM) noise samples in recurrent neural networks during the training phase. 1.1 Bidirectional Backpropagation The new bidirectional backpropagation (B-BP) algorithm trains a neural network in the forward and backward directions through the same web of synapses. Signals pass forward through weight matrices and neurons. They pass backwards through the transpose weight matrices and the same neurons. The forward pass produces the likelihood of an observed output y while the backward pass produces the dual likelihood of an observed input x. Neural classifiers have an inherent but often ignored bidirectional structure. This arises because the identity input neurons define a hidden regressor when the network runs in the backward direction. Figure 1.1 shows the bidirectional structure of a deep neural classifier and its hidden regressor in the backward direction. The figure also shows that we can put a prior probability structure on the network weights or other parameters and advance to Bayesian B-BP or B 3 . Then the B-BP algorithm corresponds to the default case of a uniform prior because then maximizing the network’s joint posterior equates to maximizing its joint likelihood. 1 Bayesian B-BP: Input Output Feedforward Feedback Feedforward Input layer Output layer Hidden layers Forward likelihood Backward likelihood Feedback Prior Classification Regression Figure 1.1: Bayesian Bidirectional Backpropagation: The network maximizes the joint forward and backward posterior probability. The network diagram shows the simple but practical case of a Laplacian or Lasso-like prior on the input weights. The identity neurons at the input field give rise to a vector normal likelihood and thus a squared-error error function in the backward direction. The softmax neurons at the output field give rise to a multinomial likelihood and thus a cross-entropy error function in the forward direction. So the network acts as a classifier in the forward direction and a regressor in the backward direction. Figure 1.2 shows that running an ordinary neural classifier produces only visual noise at the input layer. But running it backwards after bidirectional training produces the proper image that the network expects to see given its training and given the current stimulus at the input layer. B-BP takes account of the extra directional probabilistic information that unidirectional BP ignores and does so at little extra computational cost. This backward expectation acts as an inherent type of attentive network focus on input activity. 
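The backward recall in Figure 1.2 is easy to state in code. The following is a minimal NumPy sketch rather than the dissertation's implementation: it assumes a single hidden layer of sigmoid neurons, hypothetical layer sizes, and untrained random weights, and it only shows how the same two weight matrices carry the forward classification pass and the transposed backward pass that produces the network's expected input image.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: a 3072-D CIFAR-10 image vector in, 10 classes out,
# one hidden layer of 512 sigmoid neurons (the dissertation uses deeper nets).
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(512, 3072))   # input  -> hidden weights
W = rng.normal(scale=0.01, size=(10, 512))     # hidden -> output weights

def forward(x):
    """Forward pass: classify an input image vector x."""
    h = sigmoid(U @ x)          # hidden activation
    return softmax(W @ h)       # class probabilities a_y

def backward(y):
    """Backward pass: run the same synapses in reverse through their transposes."""
    h = sigmoid(W.T @ y)        # feedback hidden activation
    return U.T @ h              # identity input neurons give the expected image

x = rng.random(3072)            # stand-in for a CIFAR-10 image vector
a_y = forward(x)
e_k = np.eye(10)[a_y.argmax()]  # class-label unit bit vector
x_hat = backward(e_k)           # what the network "expects to see" for that class

With random weights the recalled x_hat is noise; the point of Figure 1.2 is that after B-BP training the backward call returns an image near the class centroid, while after forward-only BP it still returns visual noise.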
Ordinary backpropagation (BP) is unidirectional. It runs in either the forward direction or in theory in the backward direction. In practice BP runs only in the forward direction. Ordinaryforwardtrainingiterativelymaximizestheforwardlikelihoodp f (y|x, Θ)fornetwork weights or parameters Θ. It can instead maximize the backward likelihood p b (x|y, Θ): Θ BP f = arg max Θ p f (y|x, Θ) | {z } forward = arg max Θ log p f (y|x, Θ) = arg min Θ E f (Θ) (1.1) Θ BP b = arg max Θ p b (x|y, Θ) | {z } backward = arg max Θ log p b (x|y, Θ) = arg min Θ E b (Θ) (1.2) where Θ BP f and Θ BP b solve the respective forward-BP and backward-BP maximizations. The forward error E f (Θ) and backward error E b (Θ) are the respective negative log-likelihoods 2 Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (a) Target Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (b) Backward recall with unidirectional BP Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (c) Backward recall with bidirectional BP Figure 1.2: Running a neural classifier in reverse: Unidirectional BP recall versus bidirectional BP recall after training on the CIFAR-10 image dataset. The backward recall mapped a class-label unit-bit-vector output basis vector e k back through the transposed matrices of the synaptic weights to the input layer. The classifier used 5 hidden layers and swept forward and backward through those layers. (a) The 10 target images are the sample class centroids of the 10 image pattern classes. (b) The 10 noise images show the backward recall of the class-label output vectors e k when the network trained with ordinary unidirectional BP. (c) Backward-pass prediction of the class-label output vectors e k with B-BP. The feedback signals closely matched the corresponding sample class centroids and gave the network a type of attentive focus. E f (Θ)∝−log p f (y|x, Θ) and E b (Θ) ∝−log p b (x|y, Θ). B-BP finds the parameter vector Θ BBP that jointly maximizes the directional likelihoods 3 or log-likelihoods: Θ BBP = arg max Θ p f (y|x, Θ) | {z } forward p b (x|y, Θ) | {z } backward (1.3) = arg max Θ logp f (y|x, Θ) + logp b (x|y, Θ) (1.4) = arg min Θ E f (Θ) +E b (Θ). (1.5) The bidirectional classifier network in Figure 1.1 uses a multinomial forward likelihood and a vector-normal backward likelihood. So a forward pass corresponds to one roll of a K-sided die. Then the forward error is just the cross-entropy since the error is the negative log-likelihood of the one-shot multinomial. The backward error is likewise the squared error of the regression that arises from taking the negative log-likelihood of the vector-normal joint probability density. This backward squared-error leads to the non-noise recalled patterns in Figure 1.2. Figure 1.3 shows how unidirectional BP overwrites the training in the opposite direction: Either the cross-entropy goes down and squared-error goes up or vice versa. But B-BP reduces both errors because it maximizes the joint log-likelihood. Bayesian bidirectional backpropagation B 3 maximizes the network’s joint posterior p(Θ|y, x) for a given prior probability over the parameters: Θ B 3 = arg max Θ p(Θ|y, x) = arg max Θ log p(Θ|y, x) (1.6) = arg max Θ log p(Θ|y, x) | {z } forward + log p(Θ|y, x) | {z } backward (1.7) = arg max Θ log p f (y|x, Θ) + log h f (Θ|x) | {z } forward + log p b (x|y, Θ) + log h b (Θ|y) | {z } backward (1.8) where h f (Θ|x) is the forward prior and h b (Θ|y) is the backward prior. 
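A minimal PyTorch sketch of one joint training step under (1.5) follows. It is illustrative only: the single hidden layer, the layer sizes, the learning rate, the batch, and the helper name b_bp_step are hypothetical stand-ins, not the dissertation's code. The forward pass pays the cross-entropy E_f, the backward pass through the transposed weights pays the squared error E_b, and gradient descent minimizes their sum; an L1 penalty on the input weights would add the Lasso-like Laplacian prior of Bayesian B-BP.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
I, H, K = 3072, 512, 10                             # hypothetical layer sizes
U = torch.nn.Parameter(0.01 * torch.randn(H, I))    # input  -> hidden weights
W = torch.nn.Parameter(0.01 * torch.randn(K, H))    # hidden -> output weights
opt = torch.optim.SGD([U, W], lr=0.1)

def b_bp_step(x, label):
    """One bidirectional-BP step: minimize E_f(Theta) + E_b(Theta)."""
    # Forward pass: softmax classifier, so E_f is the cross-entropy.
    h_f = torch.sigmoid(F.linear(x, U))
    logits = F.linear(h_f, W)
    E_f = F.cross_entropy(logits, label)

    # Backward pass: the class-label bit vector runs back through the
    # transposed weights; identity input neurons give a squared error E_b.
    y = F.one_hot(label, K).float()
    h_b = torch.sigmoid(F.linear(y, W.t()))
    x_hat = F.linear(h_b, U.t())
    E_b = F.mse_loss(x_hat, x)

    loss = E_f + E_b    # joint negative log-likelihood of (1.5)
    # A Laplacian (Lasso-like) prior on U would add, e.g., 1e-4 * U.abs().sum().
    opt.zero_grad()
    loss.backward()
    opt.step()
    return E_f.item(), E_b.item()

# One step on a random batch that stands in for CIFAR-10 images.
x = torch.rand(32, I)
label = torch.randint(0, K, (32,))
print(b_bp_step(x, label))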
The bidirectional network in Figure 1.1 used a Laplacian prior in its B 3 training and so gave a type of Lasso penalized regression in the backward direction. The Laplace prior applied here to just the first and densest thicket of weights from the input layer to the first set of hidden neurons. It gave a sparse representation because the implied l 1 optimization zeroed many of these synaptic weights. We found that B-BP improved classifier accuracy and at little extra computational cost compared with ordinary unidirectional BP. We also found that transforming input images with the new PixelHop++ [43] method further improved the performance of B-BP with image classification. B-BP also improved regression accuracy and the performance of generative adversarial networks. 4 0 20 40 60 80 100 Training epoch 0 2 4 6 8 Error Unidirectional BP (Forward) Forward errorE f Backward errorE b (a) Forward direction 0 20 40 60 80 100 Training epoch 0 20 40 60 Error Unidirectional BP (Backward) Forward errorE f Backward errorE b (b) Backward direction 0 20 40 60 80 100 Training epoch 0 3 6 9 12 Error Bidirectional BP Forward errorE f Backward errorE b (c) Bidirectional Figure 1.3: Unidirectional versus bidirectional error reduction: The plots show the directional errors when training a 5-layer neural classifier on the CIFAR-10 dataset over 100 epochs. Unidirectional BP overwrites in one direction as it trains in the other direction. Bidirectional BP avoids overwriting because it jointly minimizes the directional errors. The forward pass runs as a classifier and the backward pass runs as a regressor. The forward path uses a multinomial likelihood and thus a cross-entropy as its error functionE f . The backward path uses a vector-normal likelihood and thus a mean-squared error function E b . (a) Unidirectional BP training in the forward direction reduced the forward error E f but overwrote the regression learning in the backward direction. (b) Unidirectional BP training in the backward direction reduced the backward error E b but overwrote the classification learning in the forward direction. (c) Bidirectional BP training minimized the joint error E f +E b . B-BP reduced both directional errors as it correspondingly increased both the forward and backward network likelihood probabilities. 1.2 Blocking and High-Capacity Neural Classifiers Figure 1.4 shows a blocking network that is long in blocks as well as deep within each block network. This new modular architecture requires a new coding scheme because backpropagation training needs to know the desired neuron values at the terminal layers of the interior or 5 Input Hidden Block ( 2 ) Input Block ( 1 ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) ( 1 ) ( 1 ) ( 10 ) ( 9 ) Hidden Block ( 3 ) ( 3 ) ( 3 ) ( 6 ) ( 7 ) ( 8 ) Output Block ( 4 ) ( 4 ) ( 4 ) ( 2 ) ( 2 ) ⋮ ⋮ ⋮ ( 2 | , ) ( 5 | 3 , ) ( 8 | , ) ( | 9 , ) ( | , , Ɵ ) Figure 1.4: Blocking: Long versus deep networks. A long blocked network consists of several deep-network blocks or modules. The hidden blocks have terminal layers with known neuron values while their hidden layers have unknown neuron values. Blocking can break up a very deep one-block network into several shallower blocks. The network in the diagram starts with 10 hidden layers. It shows the architecture for BP training with blocking. This optimizes the complete likelihood of the network. Blocking here breaks a one-block 10-layer network into four smaller blocks with two hidden blocks. 
Ordinary or bidirectional backpropagation trains the blocked network via the multiplication rule of probability (for neurons with no skip-layer synapses) by maximizing the product of the block likelihoods p(y, h|x, Θ) =p(y|h 9 , Θ) p(h 8 |h 6 , Θ) p(h 5 |h 3 , Θ) p(h 2 |x, Θ). Blocks can train separately for appropriate training codewords. Then final global or deep-sweep training can train the entire blocked network. hidden blocks. This in turn requires new output or terminal neurons as the number K of patterns increases. Current classifiers use softmax output neurons and 1-in- K coding. So they pick classifier codewords from theK vertices of the simplex embedded in a Boolean hypercube of dimension K. That leaves 2 K −K cube vertices that can also serve as codewords as in Figure 1.5. The number of these remaining codewords quickly dwarfs the K unit bit-vector codewords as K increases. An output layer of K logistic sigmoid neurons can code for any of the 2 K possible outcomes. So the probability structure of the output logistic layer is that of flipping K independent but different coins or a product of Bernoulli probabilities. That contrasts with an output softmax’s probability structure of one roll of a K-sided die or a one-shot multinomial. The softmax output is always a probability distribution or a point in embedded simplex. The logistic output is instead a finite fuzzy set or a point in the power set of all fuzzy subsets of K elements. The problem is that these codewords need not be orthogonal and so logistic neurons require some form of random or pseudo-random coding to achieve approximate or quasi-orthogonality. Logistic coding picks binary codewords that represent class patterns at the output layer of neural classifiers or neural blocks. This scheme increases the pattern capacity of neural classifier because again logistic coding picks K pattern codewords from the 2 M vertices of the M-dimensional binary or bipolar hypercube where M is the number of output neurons. Figure 1.5 shows how the logistic coding increases the pattern capacity. Random logistic 6 vs. [ 1 , 0 , 0 ] [ 0 , 1 , 0 ] [ 0 , 0 , 1 ] [ 1 , 0 , 0 ] [ 0 , 1 , 0 ] [ 0 , 0 , 1 ] [ 0 , 1 , 1 ] [ 1 , 1 , 0 ] [ 1 , 0 , 1 ] [ 1 , 1 , 1 ] [ 0 , 0 , 0 ] 1-in- coding Logistic coding Simplex Hypercube (a) Softmax vs. Logistic Random logistic coding = 100 = 60 = 20 1-in- coding = 100 (b) Random logistic coding Figure 1.5: Softmax 1-in-K coding versus logistic coding: Logistic coding gives Big-K capacity because it uses all 2 K vertices of the unit hypercube. 1-in-K coding has a limited capacity because it uses only the K vertices of the embedded probability simplex. (a) 1-in-K coding with a maximum capacity of 3 classes because the code length M = 3. The simplex has only 3 vertices in this case. (b) The codewords generated from the random logistic coding scheme with sampling probability p = 0.5, M≤ 100, and K = 100. The scheme found these sets of codewords C ∗ with the smallest orthogonality measure. The scheme searched over 10,000 iterations. This figure shows the grayscale image of the codewords. The black pixels denote 0 and the white pixels denote 1. This plot shows the best code for M∈{20, 60, 100} and the unit basis-vector codewords from{0, 1} 100 . coding randomly picks K codewords from the 2 M possible codewords using the approximate orthogonality of random codewords in high dimension. Figure 1.5 also shows a set of random logistic codewords with K = 100 and reduced codelength M∈{20, 60}. 
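The random search for quasi-orthogonal logistic codewords admits a short sketch. The version below is an assumption-laden illustration rather than the dissertation's code: it scores each candidate codebook by the largest absolute pairwise correlation of the bipolar versions of its codewords, which is one plausible orthogonality measure, and keeps the best codebook found over the search.

import numpy as np

def random_logistic_codebook(K=100, M=60, p=0.5, iters=10_000, seed=0):
    """Search for K quasi-orthogonal binary codewords of length M.

    The score used here (largest absolute pairwise correlation of the
    bipolar +/-1 codewords) is an assumed orthogonality measure; the
    dissertation's exact measure may differ. It searches iters random
    codebooks, as in the 10,000-iteration search behind Figure 1.5(b).
    """
    rng = np.random.default_rng(seed)
    best_code, best_score = None, np.inf
    for _ in range(iters):
        C = (rng.random((K, M)) < p).astype(float)   # K random binary codewords
        B = 2.0 * C - 1.0                            # bipolar version in {-1, +1}
        G = B @ B.T / M                              # normalized Gram matrix
        np.fill_diagonal(G, 0.0)
        score = np.abs(G).max()                      # worst pairwise overlap
        if score < best_score:
            best_code, best_score = C, score
    return best_code, best_score

C, score = random_logistic_codebook()
print(C.shape, round(score, 3))   # (100, 60) and the smallest overlap found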
Random logistic coding combines with blocking to further increase the capacity of a deep neural classifier. Blocking means breaking down a very deep neural classifier into contiguous small blocks (sub-networks). Blocking maximizes the complete likelihood 7 −3 −2 −1 0 1 2 3 x −2 0 2 a(x) α = 1.0,β = 5.0 α = 0.1,β = 3.0 α = 0.5,β = 3.0 (a) NoVa activation 1 5 9 13 17 21 Number of hidden layers 0 5 10 15 20 Classification accuracy Logistic ReLU NoVa (b) NoVa benefit with deep classifiers Figure 1.6: NoVa (nonvanishing) hidden neurons for deep networks. The NoVa activation function perturbs the logistic sigmoid so that its derivative is not zero. This helps avoid the problem of vanishing gradients while maintaining much of the function-approximation power of logistic neurons. (a) Graphs of three NoVa activations. (b) Classification accuracy of NoVa deep networks compared with ReLU and logistic networks in deep neural classifiers trained on the CIFAR-100 dataset. NoVa neurons perform well and tend to “live” long after ReLU and logistic neurons “die”. p(y, h J ,..., h 1 |x, Θ) of a deep network instead of the output likelihoodp(y|x, Θ). This follows from the multiplication theorem of probability [117], [129] applied to the total likelihood of the contiguous blocks: p(y, h J ,..., h 1 |x, Θ) =p(y,|h J ,..., h 1 , x, Θ) J Y j=2 p(h j |h j−1 ,..., h 1 , Θ) p(h 1 |x, Θ) (1.9) where the terms y, h J , . . . , h 1 are the visible layers. Unidirectional BP training with blocking finds the network parameters Θ ds that maximize the complete likelihood: Θ ds = arg max Θ p(y, h J ,..., h 1 |x, Θ) (1.10) = arg max Θ p(y,|h J ,..., h 1 , x, Θ) J Y j=2 p(h j |h j−1 ,..., h 1 , Θ) p(h 1 |x, Θ) (1.11) = arg max Θ logp(y,|h J , Θ) + J X j=2 logp(h j |h j−1 , Θ) + logp(h 1 |x, Θ) (1.12) where the logistic codewords define the targets h J ,...,h 1 of the hidden blocks. Figure 1.4 shows the modular structure for BP training with blocking. Blocking also extends to B-BP training with bidirectional blocks. We also introduced a new hidden neuron that scales well in very deep networks. We call it the NoVa or nonvanishing neural activation because its derivative does not equal zero in general. 8 The NoVa neuron is a type of perturbed logistic neuron. It is more broadly a family of such perturbed sigmoidal functions. NoVa neurons mitigate the vexing problems of “vanishing gradients” [112], [163], [207] and “dying” neurons [178] in deep neural networks. Networks of NoVa hidden neurons also have good function-approximation power because of their logistic-like nonlinearity [56]. They are suitable for gradient-based learning because NoVa neurons are continuously differentiable [281]. A NoVa activation has the form of an additively perturbed logistic sigmoid: a(x) =cx +σ(bx) (1.13) =cx + 1 1 +e −bx (1.14) where x is the input. Taking the derivative gives da(x) dx =c + bσ(bx)[1−σ(bx)]. (1.15) So the derivative does not vanish in general if c> 0 because the logistic sigmoid σ takes values in the unit interval [0, 1]. The additive perturbation term cx saves the logistic in this sense. The term cx also resembles the threshold-linear form max(0,x) of the ReLU or rectified linear unit. Figure 1.6 compares the classification accuracy of ever-deeper NoVa networks with logistic and ReLU networks. Logistic networks quickly “die” as the number of hidden layers grows for these networks trained on the CIFAR-100 image data set. ReLU networks last longer but their classifier performance also falls off steeply for deeper networks. 
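Equations (1.13)-(1.15) translate directly into a few lines of code. The sketch below is only a numerical illustration with example coefficients b and c (the specific values are assumptions, not the dissertation's settings); it shows that the NoVa derivative never falls below c and so cannot vanish.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nova(x, b=3.0, c=0.5):
    """NoVa activation a(x) = c*x + sigmoid(b*x): a perturbed logistic sigmoid."""
    return c * x + sigmoid(b * x)

def nova_grad(x, b=3.0, c=0.5):
    """Derivative c + b*s*(1 - s) with s = sigmoid(b*x): bounded below by c."""
    s = sigmoid(b * x)
    return c + b * s * (1.0 - s)

x = np.linspace(-3.0, 3.0, 7)
print(nova(x))
print(nova_grad(x))    # stays at or above c, so the gradient cannot vanish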
NoVa networks perform well and continue to “live”. Chapter 6 shows that the doubly perturbed G-NoVa or Generalized NoVa performs even better. The extra computational cost involved is trivial. 1.3 Noisy Recurrent Backpropagation The noisy recurrent backpropagation (N-RBP) algorithm is the noise-enhanced form of the RBP algorithm that trains deep neural networks on time-varying signals. RBP itself is a time-varying form of unidirectional backpropagation [114], [261]. Noisy RBP injects a special new type of noise into training signals. The noise stems from the noise that benefits the Expectation-Maximization algorithm for iteratively climbing the nearest hill of log-likelihood [61]. A recent result shows that backpropagation is a special case of the EM algorithm. So the “NEM” noise that speeds EM convergence on average can also speed up backpropagation training on average [200], [202], [203]. We show that a form of NEM noise also benefits RBP for both classification and time-series regression. 9 (a) Additive NEM noise: Classifier (b) Additive NEM noise: Regressor Figure 1.7: Beneficial NEM noise for a recurrent backpropagation classifier and regressor. The NEM inequality leads to different beneficial noise for the two networks and speeds their backpropagation training on average. (a) Classifier NEM RBP noise lies below the NEM hyperplane in noise space. Here the output activation was a y = [0.6, 0.3, 0.1] and the target was y = [1, 0, 0]. (b) Regression NEM RBP noise lies inside the sphere that arises from the NEM inequality. The output activation was a y = [1.0, 2.0, 1.0] and the target was y = [2.0, 3.0, 2.0]. NEM noise differs from the simple dither or blind white noise that improves some nonlinear signal processing techniques in stochastic resonance [39], [85], [152], [206], [210], [211], [265]. NEM noise is instead just that noise perturbation that makes the current training signal more probable on average. 10 Figure 1.7 shows how beneficial NEM RBP noise differs for a classifier and a regressor. The helpful additive noise samples for the RBP video classifier lie below the NEM hyperplane in noise space. The NEM hyperplane arises from an inequality. The NEM noise itself injects or adds to the output neurons of the recurrent neural network. The beneficial NEM noise for the RBP regressor lies inside a sphere that arises from the NEM condition. We also show that forms of NEM noise can speed B-BP convergence and improve its accuracy and similarly for generative adversarial networks. 11 Chapter 2 Bidirectional Backpropagation (B-BP) Algorithm The new bidirectional backpropagation algorithm extends ordinary unidirectional backprop- agation from ordinary unidirectional training to bidirectional training of deep multilayer neural networks. B-BP maximizes both the forward likelihood p f (y|x, Θ) and the backward likelihood p b (x|y, Θ)becauseitmaximizesthenetwork’stotalbidirectionallikelihoodp f (y|x, Θ)p b (x|y, Θ). So it maximizes the network’s total log-likelihood L: L = logp f (y|x, Θ) + logp b (x|y, Θ). (2.1) It equivalentlyminimizes the network’s total or joint error functionsE f (Θ)+E b (Θ) because each error function at a layer equals the negative log-likelihood of the layer probability. Bayesian bidirectional backpropagation (B 3 ) extends likelihood B-BP to the general case of maximizing the network’s bidirectional posterior probability that includes prior probabilities on parameters such as some or all of the weights. 
The default weight prior of a uniform probability reduces B^3 back to likelihood B-BP. The next sections develop this mathematical framework.

2.1 A Review of Backpropagation

The backpropagation (BP) algorithm trains a neural network to approximate some mapping from the input space X to the output space Y [34], [133], [164], [223], [262]. The mapping is a simple function for an ideal classifier. Applications of BP include speech recognition [23], [41], [57], [58], [63], [64], [74], [91], [92], [179], [252], [271], computer vision [49], [118], [132], [144], [175], [204], [251], [259], natural language processing [86], [88], [109], [134], [186], [234], [237], recommendation systems [73], [194], [255], [268], [275], bioinformatics [44], [45], [228], and fraud detection [142].

The BP algorithm is itself a special case of generalized expectation-maximization (GEM) for maximum-likelihood estimation with latent or hidden parameters [20]. The reduction of BP to GEM follows from the key gradient identity ∇ log p(y|x, Θ) = ∇Q(Θ|Θ_n) that we re-derive below in (2.89). The BP noise benefit follows in turn from the EM-noise-boost results in [200]. These EM results show that injecting noise helps the maximum-likelihood system bound faster up the nearest hill of probability on average if the noise satisfies a positivity condition that involves a likelihood ratio. The next section reviews the probabilistic framework for BP.

2.1.1 Unidirectional Backpropagation: Probabilistic Description

We now formulate unidirectional BP with the following notation:

Θ : network parameters or weights
x ∈ X ⊆ R^I : input vector with dimension I
y ∈ Y ⊆ R^K : target vector with dimension K
a^y : prediction of y given x (output activation)
p_f(y|x, Θ) : forward likelihood pdf of y given x and Θ
p(Θ|x) : forward prior pdf of Θ given x
p(Θ|y, x) : forward posterior pdf of Θ given y and x
Θ_ML : maximum-likelihood estimate of Θ      (2.2)

The maximum-likelihood estimation (MLE) form of unidirectional BP seeks to maximize the forward or output-layer likelihood of the network. BP training has a probabilistic structure based on maximum-likelihood estimation. BP trains a network by iteratively minimizing some error or performance measure E(Θ) that depends on the difference between the network's actual or observed output value a^y = N(x) and its desired or target value y. The parameter vector Θ describes the network's current configuration of synaptic weights and neuronal coefficients. There is some probability p(y|x, Θ) that a neural network N with parameters Θ will emit output y given input x. The probability p(y|x, Θ) defines the network's output-layer likelihood function.

The error E(Θ) of a network is proportional to the negative of the network's log-likelihood function L: E(Θ) ∝ −L = −log p(y|x, Θ). Then minimizing the network error E(Θ) maximizes the network log-likelihood log p(y|x, Θ) and vice versa. So BP performs maximum-likelihood estimation [34]:

Θ* = Θ_ML = arg min_Θ E(Θ) = arg max_Θ log p(y|x, Θ).      (2.3)

We next present the likelihood structure for three different neural networks. The networks differ in the structure of their terminal-layer neurons and in the function that these networks perform. We consider three likelihood structures based on the output activations of the networks. The two main neural models are classifiers and regressors. The third type of network is the logistic network. A classifier network maps the input space R^I to a length-K probability vector in [0, 1]^K.
The output neurons of a classification network typically use a softmax or Gibbs activation function. So an output neuron has the form of an exponential divided by K exponentials. The target vector y is a unit bit vector. It has a 1 in the k th slot and 0s elsewhere. The probability that the k th scalar component y k equals 1 is p(y k = 1|x, Θ) =a y k . (2.4) The target y defines a one-shot multinomial or categorical random variable. So the (output-layer) likelihood p class (y|x,Θ) of a classifier network is the multinomial product p class (y|x,Θ) = K Y k=1 a y k y k . (2.5) The log-likelihood L class is the negative cross-entropy: L class = log p class (y|x, Θ) (2.6) = log K Y k=1 a y k y k (2.7) = K X k=1 y k log a y k . (2.8) Maximizing the log-likelihood L class with respect to Θ minimizes the output cross-entropy E class : E class =− K X k=1 y k log a y k . (2.9) 14 A regression network maps the input vector space R I to the output spaceR K . I is the number of input neurons. K is the number of output neurons. A regression network uses identity activation functions at the output layer. The target vector y is a Gaussian random vector. It has a mean vector a y and has an identity or white covariance matrix I (which can generalize to a nonwhite covariance matrix): y∼N (y|a y , I). (2.10) The likelihood p reg (y|x, Θ) of a regression network is p reg (y|x, Θ) = 1 (2π) K/2 exp − ||y− a y || 2 2 (2.11) where||·|| is the Euclidean norm. Then the log-likelihood L reg of the regression network is a constant plus the squared-error: L reg = logp reg (y|x, Θ) (2.12) = log(2π) − K 2 − ||y− a y || 2 2 (2.13) = log(2π) − K 2 − 1 2 K X k=1 y k −a y k 2 . (2.14) So maximizing L reg with respect to Θ minimizes the squared-error E reg of regression: E reg = 1 2 K X k=1 y k −a y k 2 . (2.15) A logistic network N maps the input space R I to the unit hypercube [0, 1] K if K is the number of output logistic neurons. Bipolar logistic output neurons map inputs to the bipolar hypercube [−1, 1] K by shifting and scaling the logistic activation functions. The target vector y consists of K independent Bernoulli variables. The probability that the k th output neuron y k equals 1 is just the ordinary Bernoulli probability of getting heads after one flip of a coin: p log (y k = 1|x, Θ) = (a y k ) y k (1−a y k ) 1−y k . (2.16) The likelihoodp log (y|x,Θ) of a logistic network equals the product of the K independent 15 Bernoulli variables. So it equals the probability of flipping K independent coins: p log (y|x,Θ) = K Y k=1 (a y k ) y k (1−a y k ) 1−y k . (2.17) Then the log-likelihood L log of a logistic network equals the negative of the double cross entropy: L log = log p log (y|x,, Θ) (2.18) = K X k=1 y k log (a y k ) + K X k=1 (1−y k ) log (1−a y k ). (2.19) So maximizing the log-likelihood L log minimizes the error function E log or double cross entropy: E log =− K X k=1 y k log (a y k )− K X k=1 (1−y k ) log (1−a y k ). (2.20) The BP algorithm uses gradient descent or its variants to update a network’s weights and other parameters. The BP learning laws remain invariant for all three networks (regression, classification, and logistic) because taking the gradient of their network likelihood gives the same partial derivatives for updating their weights [155]. The next section presents the BP learning laws for the four-layer network in Figure 2.1. 
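The three output-layer error functions (2.9), (2.15), and (2.20) reduce to a few lines of NumPy. The sketch below only evaluates the three errors on small hand-picked activation and target vectors; it is a numerical illustration, not part of the dissertation's simulations.

import numpy as np

def cross_entropy(y, a):
    """Classifier error E_class = -sum_k y_k log a_k (softmax output, 1-in-K target)."""
    return -np.sum(y * np.log(a))

def squared_error(y, a):
    """Regression error E_reg = 0.5 * sum_k (y_k - a_k)^2 (identity output)."""
    return 0.5 * np.sum((y - a) ** 2)

def double_cross_entropy(y, a):
    """Logistic error E_log = -sum_k [y_k log a_k + (1 - y_k) log(1 - a_k)]."""
    return -np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Small numerical examples with K = 3 output neurons.
print(cross_entropy(np.array([1, 0, 0]), np.array([0.6, 0.3, 0.1])))         # ~0.511
print(squared_error(np.array([2.0, 3.0, 2.0]), np.array([1.0, 2.0, 1.0])))   # 1.5
print(double_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))  # ~0.685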
2.1.2 Backpropagation: Learning Laws This section presents the BP learning laws for the network N in Figure 2.1N has four layers including two hidden layers Suppose that the neural network is a neural classifier with softmax activation and sigmoidal hidden neurons. All results apply to any finite number of hidden layers. The weight matrix U connects the input layer to the first hidden layer h 1 . The weight matrix V connects h 1 to the second hidden layer h 2 . The weight matrix W connects h 2 to the output layer. The input layer uses identity activation so the activation of the i th input neuron is a x i =x i (2.21) where x i is the i th component of the input vector. 16 Input Layer Output Layer Hidden Layers Forward Pass 1 2 Figure 2.1: Feedforward network: The forward pass of an input vector x over a four-layer neural network. The nodes represent artificial neurons. The blue nodes are the input neurons, the red nodes are the hidden neurons, and the green nodes are the output neurons. The edges U, V and W represent synaptic weights between neurons. Forward pass propagates x through the hidden layers to the output layer. The output activation a y =N(x) is the prediction of output vector y given the input vector x. The inner-product input o h 1 m of the m th neuron on the hidden layer h 1 is o h 1 m = I X i=1 u mi a x i (2.22) where u mi is the synaptic weight that connects the i th input neuron to the m th hidden neuron on layer h 1 . The sigmoidal activation a h 1 m of the m th neuron on layer h 1 is a h 1 m = 1 1 +e −(o h 1 m ) . (2.23) The inner-product input o h 2 j of the j th neuron on the hidden layer h 2 is o h 2 j = M X m=1 v jm a h 1 m (2.24) where v jm is the synaptic weight that connects the m th hidden neuron on layer h 1 to the 17 j th hidden neuron on layer h 2 . The corresponding sigmoidal activation a h 2 j of thej th hidden neuron on layer h 2 is a h 2 j = 1 1 +e −(o h 2 j ) . (2.25) The inner-product input o y k of the k th neuron on the output layer is o y k = J X j=1 w kj a h 2 j (2.26) where w kj is the synaptic weight that connects the j th hidden neuron on layer h 2 to the k th output neuron. The softmax or Gibbs activation of the k th output neuron is a y k = e (o y k ) P K l=1 e (o y l ) . (2.27) The BP algorithm recursively computes the derivatives of the log-likelihood starting with the output layer and progressing to the input layer. This recursive computation relies on calculus chain rule. The chain rule expands the partial derivative of the log-likelihood 18 L class with respect to the synaptic weight u kj as ∂L class ∂w kj = ∂L class ∂o y k ∂o y k ∂w kj (2.28) = K X l=1 ∂L class ∂a y l ∂a y l ∂o y k ∂o y k ∂w kj (2.29) = y k a y k e (o y k ) P K s=1 e o y s −e (o y k ) e (o y k ) P K s=1 e (o y s ) 2 − K X l=1 l̸=k y l a y l e (o y l ) e (o y k ) P K s=1 e (o y s ) 2 a h 2 j (2.30) = y k a y k e (o y k ) ( P K s=1 e o y s −e (o y k ) ) P K s=1 e (o y s ) 2 − K X l=1 l̸=k y l a y l e (o y l ) e (o y k ) P K s=1 e (o y s ) 2 a h 2 j (2.31) = y k a y k a y k (1−a y k )− K X l=1 l̸=k y l a y l a y l a y k a h 2 j (2.32) = y k (1−a y k )−a y k K X l=1 l̸=k y l a h 2 j (2.33) = y k −a y k K X l=1 y l a h 2 j (2.34) = (y k −a y k )a h 2 j (2.35) where∃s∈{1, 2,...,K} such that y s = 1 and∀l̸=s, y l = 0 because softmax neurons uses 1-in-K encoding. 
The partial derivative of the log-likelihood L class with respect to the synaptic weight v jm expands as ∂L class ∂v jm = ∂L class ∂a h 2 j ∂a h 2 j ∂o h 2 j ∂o h 2 j ∂v jm (2.36) = K X k=1 ∂L class ∂o k ∂o k ∂a h 2 j ∂a h 2 j ∂o h 2 j ∂o h 2 j ∂v jm (2.37) = K X k=1 (y k −a y k )w kj a h 2 j (1−a h 2 j )a h 1 m . (2.38) The partial derivative of the log-likelihood L class with respect to the synaptic weight u mi 19 expands as ∂L ∂u mi = ∂L ∂a h 2 j ∂a h 2 j ∂o h 2 j ∂o h 2 j ∂a h 1 m ∂a h 1 m ∂o h 1 m ∂o h 1 m ∂u mi (2.39) = J X j=1 K X k=1 ∂L ∂o k ∂o k ∂a h 2 j ∂a h 2 j ∂o h 2 j ∂o h 2 j ∂a h 1 m ∂a h 1 m ∂o h 1 m ∂o h 1 m ∂u mi (2.40) = J X j=1 K X k=1 (y k −a y k )w kj a h 2 j (1−a h 2 j )v jm a h 1 m (1−a h 1 m )a x i . (2.41) The learning laws follow from using gradient ascent or any of its variants to update the parameters. The update rule with gradient ascent after t training epochs is θ (t+1) =θ (t) +η ∂L class ∂θ θ=θ (t) (2.42) where η is the learning rate and θ∈{w kj ,v jm ,u mi }. The learning laws are the same for L reg , L class , and L log [155]. This is so because ∂L class ∂θ = ∂L reg ∂θ = ∂L log ∂θ (2.43) for all θ if the three networks use the same activation at their hidden layers. Regression networks use identity output activation a y k =o y k with the log-likelihood L reg in (2.14). So the derivative of L reg with respect to the output weight w kj is ∂L reg ∂w kj = ∂L reg ∂a y k ∂a y k ∂o y k ∂o y k w kj (2.44) = (y k −a y k )a h 2 j . (2.45) Logistic networks use logistic output activation a y k = 1 1 +e −(o y k ) (2.46) with the log-likelihood L log in (2.19). The derivative of L log with respect to the output 20 weight w kj is ∂L log ∂w kj = ∂L log ∂a y k ∂a y k ∂o y k ∂o y k w kj (2.47) = y k a y k − 1−y k 1−a y k a y k (1−a y k )a h 2 j (2.48) = (y k −a y k ) a y k (1−a y k ) a y k (1−a y k )a h 2 j (2.49) = (y k −a y k )a h 2 j . (2.50) Equations (2.35), (2.45), and (2.50) show that the derivatives of the three networks (classifi- cation, regression, and logistic) at the output layer are equal. The derivative of the weights at the hidden layers are also the same. So the learning rules for these three networks remain unchanged. We call this backpropagation invariance. 2.1.3 Expectation Maximization (EM) Algorithm Expectation Maximization (EM) algorithm is an iterative method for finding ML or MAP estimates for the parameter of a statistical model in situations where we have missing data and/or latent variables [61], [105], [185], [196]. EM alternates between two key steps: the Expectation (E) step and the Maximization (M) step. E-step infers the missing values given the parameters and the M-step optimizes the parameters given the inferred values [196]. The applications of EM algorithm include: speech recognition [19], [135], [214], genome sequencing [24], [161], data clustering [16], [40], radar denoising [256], and medical imaging [232], [278] We now formulate EM algorithm using the following notation: θ∈ Θ⊆R d : unknown d-dimensional parameter X : complete data space Y : observed data space Z : latent data space y∈Y : observed data with pdf f(y|θ) z∈Z : latent data with pdf f(z|y,θ) (y,z)∈ (Y,Z) : complete data random variable with pdf f(y,z|θ) θ ∗ : maximum likelihood estimation of θ θ t : an estimate for θ ∗ at iteration t and the goal of the EM algorithm is to find the MLE estimate θ ∗ for the likelihood f(y|θ). 21 But instead of using f(y|θ) EM relies on f(y,z|θ) to find the θ ∗ . 
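A concrete if standard instance of this setup is a two-component Gaussian mixture: the observed data y are the samples, the latent data z are the unknown component labels, and θ holds the component means. The short NumPy sketch below is an outside illustration of the E and M steps just described, not an example taken from the dissertation; the mixture parameters and data are invented for the demonstration.

import numpy as np

# Illustrative two-component 1-D Gaussian mixture with unit variances and
# fixed equal mixing weights; only the means are estimated.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
mu = np.array([-1.0, 1.0])     # initial estimate theta_0
pi = np.array([0.5, 0.5])      # fixed mixing weights

for t in range(50):
    # E-step: posterior responsibilities f(z | y, theta_t).
    lik = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2) * pi
    r = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximize the surrogate Q(theta | theta_t) over the means.
    mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)

print(mu)   # climbs the likelihood hill toward the true means (-2, 3)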
The log-likelihood logf(y|θ) equals the sum of the surrogate log-likelihood and the cross-entropy. This following equations simplify the log-likelihood: logf(y|θ) = log f(y,θ) h(θ) (2.51) = log f(y,θ) h(θ) f(y,z,θ) f(y,z,θ) (2.52) = log f(y,z|θ) f(z|y,θ) (2.53) = log f(y,z|θ)− log f(z|y,θ). (2.54) The log-likelihood further simplifies as follows: logf(y|θ) =E Z|y,θt [log f(y|θ)] (2.55) =E Z|y,θt [log f(y,z|θ)− logf(z|y,θ)] (2.56) =E Z|y,θt [log f(y,z|θ)] | {z } surrogate log-likelihood +E Z|y,θt [− log f(z|y,θ)] | {z } cross-entropy (2.57) =Q(θ|θ k ) +H(θ|θ k ) (2.58) where Q(θ|θ t ) is the surrogate log-likelihood and H(θ|θ t ) is the cross-entropy. EM algorithm iteratively optimizes the Q-function because any θ t+1 that improves the function will also improve the log-likelihood. This is the EM ascent property [61]. This property guarantees that maximizing the Q-function is equivalent to maximizing the log-likelihood. We now show the EM ascent property. Let θ t+1 = arg max θ Q(θ|θ t ). We have log f(y|θ t+1 )− log f(y|θ t ) =E Z|y,θt [log f(y,z|θ t+1 )]−E Z|y,θt [log f(y,z|θ t )] (2.59) =Q(θ t+1 |θ t )−Q(θ t |θ t ) +H(θ t+1 |θ t )−H(θ t |θ t ) (2.60) =Q(θ t+1 |θ t )−Q(θ t |θ t ) | {z } ≥0 −E Z|y,θt log f(z|y,θ t+1 ) f(z|y,θ t ) (2.61) ≥−E Z|y,θt log f(z|y,θ t+1 ) f(z|y,θ t ) . (2.62) Then Jensen’s inequality applies to (2.62). Jensen’s inequality states that E[g(X)] ≤ 22 g(E[(X)]) if g is a concave function [4], [99]. So we have E Z|y,θt log f(z|y,θ t+1 ) f(z|y,θ t ) ≤ log E Z|y,θt f(z|y,θ t+1 ) f(z|y,θ t ) (2.63) ≤ log Z θ f(z|y,θ t+1 ) f(z|y,θ t ) f(z|y,θ t ) dz (2.64) ≤ log (1) = 0. (2.65) So equations (2.62) and (2.65) simplify as follows: log f(y|θ t+1 )− log f(y|θ t )≥H(θ t+1 |θ t )−H(θ t |θ t ) (2.66) ≥−E Z|y,θt log f(z|y,θ t+1 ) f(z|y,θ t ) (2.67) ≥ 0 . (2.68) So the θ t+1 that increases the Q-function can only increase the log-likelihood. The optimal solution θ ∗ that maximizes Q(θ|θ t ) also maximizes log f(y|θ). This is the EM ascent property. The EM algorithm simplifies to two main steps . The EM algorithm steps after t iterations are E-step: Compute the surrogate Q(θ|θ t ) (2.69) M-step: θ t+1 = arg max θ Q(θ|θ t ) (2.70) and the EM ascent property guarantees that the EM solution maximizes logf(y|θ). Osobaetal. derivedthenoisyExpectation-Maximization(NEM)theoremthatguarantees noise benefit on average with the EM algorithm [200], [203]. The Noisy Expectation- Maximization (NEM) theorem states the general sufficient condition for a noise benefit in the EM algorithm. The theorem shows that the noise injection can only shorten the EM algorithm’s walk up the nearest hill of log-likelihood on average if the noise satisfies the NEM positivity condition. It holds for additive or multiplicative or any other type of measurable noise injection [200]. The noise is just the noise that makes the current signal more probable. The basic NEM theorem for additive noise states that a noise benefit holds on average at each iteration t if the following positivity condition holds [203]: E x,h,n|Θ ∗ h log p(x + n, h|Θ t ) p(x, h|Θ t ) i ≥ 0. (2.71) 23 Then the EM noise benefit Q(Θ t |Θ ∗ )≤Q N (Θ t |Θ ∗ ) (2.72) holds on average at iteration t: E x,N|Θt h Q(Θ ∗ |Θ ∗ )−Q N (Θ t |Θ ∗ ) i ≤E x|Θt h Q(Θ ∗ |Θ ∗ )−Q(Θ t |Θ ∗ ) i (2.73) where Θ ∗ denotes the maximum-likelihood vector of parameters, Q N (Θ t |Θ ∗ ) =E h|y,n,Θ ∗[logp(x + n, h|Θ t )] (2.74) Q(Θ t |Θ ∗ ) =E h|y,Θ ∗[logp(x, h|Θ t )]. 
(2.75) The intuition behind the NEM sufficient condition is that NEM noise is just that added noise n that makes the current signal x more probable on average: p(x + n|Θ)≥p(x|Θ). Rearranging and taking logarithms and expectations gives the NEM sufficient condition (7.4). The noise-boosted likelihood is closer on average at each iteration to the maximum-likelihood outcome than is the noiseless likelihood [7], [200]. Kullback-Leibler divergence measures the closeness. 2.1.4 Backpropagation as Generalized EM Algorithm We now restate the recent BP-as-GEM theorem [20] and then sketch its proof. The theorem states that the backpropagation update equation for a differentiable likelihood function p(y|x, Θ) at epoch i: Θ (i+1) = Θ (i) +η∇ Θ log p(y|x, Θ) Θ=Θ (i) (2.76) equals the GEM update equation at epoch i Θ (i+1) = Θ (i) +η∇ Θ Q(Θ|Θ (i) ) Θ=Θ (i) (2.77) if GEM uses the differentiable Q-function Q(Θ|Θ (i) ) =E h|y,x,Θ (i) h log p(h, y|x, Θ) i . (2.78) The EM algorithm takes the expectation of log p(y|x, Θ) with respect to the hidden posterior p(h|y, x, Θ (i) ). The “EM trick” rewrites the conditional probability p(h|y, x, Θ) = p(h,y|x,Θ) p(y|x,Θ) as p(y|x, Θ) = p(h,y|x,Θ) p(h|y,x,Θ) . Then taking hidden-posterior expectations of the 24 log-likelihood log p(y|x, Θ) gives log p(y|x, Θ) =E h|y,x,Θ (i) h log p(y|x, Θ) i (2.79) =E h|y,x,Θ (i) log p(h, y|x, Θ) p(h|y, x, Θ) (2.80) =Q(Θ|Θ (i) ) +H(Θ|Θ (i) ) (2.81) where Q(Θ|Θ (i) ) is the EM surrogate likelihood and the cross entropy is H(Θ|Θ (i) ) =−E h|y,x,Θ (i)[logp(h|y, x, Θ) . (2.82) Taking gradients gives ∇ logp(y|x, Θ) =∇Q Θ|Θ (i) +∇H Θ|Θ (i) . (2.83) The entropy inequality H(Θ (i) |Θ (i) ) ≤ H(Θ|Θ (i) ) holds for all Θ because Jensen’s inequality and the concavity of the logarithm imply that Shannon entropy H(Θ (i) |Θ (i) ) minimizes the cross entropy H(Θ|Θ (i) ). We have H(Θ (i) |Θ (i) )−H(Θ|Θ (i) ) =−E h|y,x,Θ (i)[logp(h|y, x, Θ (i) )− log p(h|y, x, Θ)] (2.84) =E h|y,x,Θ (i) " log p(h|y, x, Θ) p(h|y, x, Θ (i) ) # (2.85) ≤ log E h|y,x,Θ (i) " p(h|y, x, Θ) p(h|y, x, Θ (i) ) # (2.86) ≤ log Z p(h|y, x, Θ) p(h|y, x, Θ (i) ) p(h|y, x, Θ (i) ) dh (2.87) ≤ log (1) = 0 (2.88) where (2.86) follows from applying the Jensen’s inequality. Hence∇H(Θ (i) |Θ (i) ) = 0. This gives the BP-as-GEM master equation: ∇ logp(y|x, Θ) =∇Q Θ|Θ (i) . (2.89) So the BP and GEM gradients are identical at each iteration i. We remark that the master gradient equation (2.89) holds at each hidden layer of the neural network. This allows NEM-noise injection in hidden layers as well as in output layers [155]. Suppose the network has J hidden layers h J ,..., h 1 . The numbering starts in the forward direction with the first hidden layer h 1 after the input (identity) layer x. Then 25 the total network likelihood is the probability density p(y, h J ,..., h 1 ,|x, Θ (i) ). The “chain rule” or multiplication theorem of probability factors this likelihood into a product of layer likelihoods: p(y, h J ,..., h 1 ,|x, Θ (i) ) =p(y|h J ,..., h 1 , x, Θ (i) ) ×p(h J |h J−1 ,..., h 1 , x, Θ (i) )···p(h 2 |h 1 , x, Θ (i) )p(h 1 |x, Θ (i) ) (2.90) where we assume the input-data priorp(x) = 1 for simplicity. Taking logarithms at iteration i gives the total network log-likelihood L(x) as the sum of the layer log-likelihoods: L(x) =L(y|x) +L(h J |x) +··· +L(h 1 |x) (2.91) where L(h J |x) = logp(h J |h J−1 ,..., h 1 , x, Θ (i) ). Then the above BP-as-GEM argument (2.79)-(2.89) applies at layer J if BP invariance holds at that layer. 
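The master equation (2.89) is what licenses the NEM-noise injection mentioned above at the output and hidden layers. The following small NumPy check (a hypothetical illustration under our own choice of a scalar Gaussian likelihood, not from the dissertation) shows the NEM intuition numerically: noise that nudges a sample toward the likelihood mode makes the sample more probable, so the log-likelihood ratio in the positivity condition is nonnegative, while blind zero-mean noise need not satisfy it.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_lik(x, mu=0.0, sigma=1.0):
    """Log of a scalar Gaussian density up to an additive constant."""
    return -0.5 * ((x - mu) / sigma) ** 2

x = rng.normal(0.0, 1.0, 10_000)          # current "signals"
blind = rng.normal(0.0, 0.5, x.shape)      # ordinary zero-mean noise
toward_mode = 0.5 * (0.0 - x)              # noise that moves x toward the mode mu = 0

# Log-likelihood ratios log p(x + n | theta) - log p(x | theta)
ratio_blind = log_lik(x + blind) - log_lik(x)
ratio_nem = log_lik(x + toward_mode) - log_lik(x)

print("blind noise, mean ratio:", ratio_blind.mean())       # can be negative
print("NEM-style noise, mean ratio:", ratio_nem.mean())     # nonnegative
print("NEM-style noise, min ratio:", ratio_nem.min())       # >= 0 samplewise here
```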
2.1.5 A Brief History of Backpropagation

Rumelhart et al. [223] introduced the backpropagation algorithm in 1986 and popularized it [70], [221], [222] in the Parallel Distributed Processing (PDP) edited volumes in the late 1980s [151]. Minsky and Papert had earlier identified the limitations of the perceptron algorithm in 1969 [189]. The PDP researchers considered backpropagation (also known as the generalized delta rule) to be the solution to the limitations of the perceptron algorithm and a victory for connectionism, the term that many computer and cognitive scientists use to refer to neural network theory and application [151].

The history and novelty of backpropagation remain controversial because of the similarity between backpropagation and earlier methods that predate the 1980s. Publications from the 1960s [37], [38], [139] derived a continuous form of backpropagation and applied it to control systems [89], [191], [229]. These methods applied steepest descent in the parameter space of the systems. Linnainmaa introduced an efficient error backpropagation method, the general automatic differentiation (AD) method, and ran it on computers in 1970 [174], although he did not apply the AD method to neural networks. Most recent machine learning software packages (such as Google's TensorFlow and Meta's PyTorch) still rely on the framework of the AD method to train neural networks with the backpropagation algorithm [28], [35], [208], [247].

Werbos derived the backpropagation algorithm as "dynamic feedback" for neural networks [127], [151] in his 1974 Harvard dissertation. He clearly described this method as a process for training artificial neural networks through the back-propagation of errors. Werbos further applied the method to economic forecasting and other problems [260], [263]. White [264] showed in 1989 that the backpropagation method reduces to the stochastic approximation method from the 1950s, albeit with a new and computationally efficient implementation. This further heightened the doubt that surrounds the claim that backpropagation was a new method in 1986 when Rumelhart et al. [223] introduced the term.

The emergence of powerful and cheap processors such as graphical processing units (GPUs) and tensor processing units (TPUs) has contributed immensely to the success of backpropagation in the 21st century. These processors made it possible to collect very large datasets and to build the very deep neural network models that require such datasets to train. These two recent hardware breakthroughs catalyzed the emergence and success of modern deep learning [30], [258].

2.2 Bidirectional Neural Representation

What is a feedback signal that arrives at the input layer of a neural network? Grossberg answered this question with his adaptive resonance theory or ART: The feedback signal is an expectation [95]–[97]. The neural network expects to see this feedback signal or pattern given the current input signal that stimulated the network and given the pattern associations that it has learned. So any synaptic learning should depend on the match or mismatch between the input signal and the feedback expectation [98]. Grossberg gave this ART answer in the special case of a two-layer neural network. The two layers defined the input and output fields of neurons. The two layers can also define two stacked contiguous layers of neurons in a larger network with multiple such stacked layers [29].
The topological point is that the neural signals flow bidirectionally between the two layers. There are no hidden or intervening layers. The synapses of the forward flow differ in general from the synapses of the backward flow. A bidirectional associative memory or BAM results if the two synaptic webs are the same [146]–[149]. A BAM is in this sense a minimal two-layer neural network. The input signal passes forward through a synaptic matrix M. Then the backward signal passes through the transpose M T of the same matrix M. This minimal BAM structure holds for stacked contiguous layers that reverberate through the same weight matrix M [92], [249]. The basic BAM stability theorem results if the network uses the transpose M T for the backward pass because then the forward and backward network Lyapunov functions are equal. The BAM theorem states that every real rectangular matrix M is bidirectionally stable for threshold or sigmoidal neurons: Any input stimulation quickly leads to an equilibrium or resonating bidirectional fixed point of a fixed input vector and a fixed output vector. This 27 Forward Pass: Backward Pass: ℎ ℎ Figure 2.2: Bidirectional representation of the three-bit bipolar permutation function in Table 2.1: This shows a four-layer bidirectional associative memory (BAM) network that exactly represents the permutation function and its inverse. The forward or left-to-right direction encodes the permutation mapping. The backward or right-to-left direction encodes the inverse mapping. This set of weights and thresholds used a single hidden layer with four threshold neurons. BAM stability holds for the wide class of Cohen-Grossberg neuron nonlinearities [52], [146], [149]. It still holds even if the synaptic weights simultaneously change in accord with a Hebbian or competitive learning law [146]. It also holds for time-lags and for many other recent extensions of the basic neuron models [15], [31], [181], [254]. Simple fixed-point stability need not hold in general when the BAM contains one or more hidden layers of nonlinear units. It does hold for the three-layer BAM network in Figure 2.2 because it represents the invertible permutation mapping in Table 2.1. Simulations showed that such multilayered feedback BAMs still often converged to fixed points. 28 Table 2.1: A three-bit bipolar permutation function that the three-layer bidirectional associative memory (BAM) network in Figure 2.2 represents. The function maps an input vector x to its corresponding output vector y. The inverse simply maps from y back to the corresponding x Input x Output y [+ + +] [−− +] [+ +−] [− + +] [+− +] [+− +] [+−−] [+ + +] [− + +] [− +−] [− +−] [−−−] [−− +] [+−−] [−−−] [+ +−] 2.2.1 Bidirectional Associative Memories Extend to Multilayer Networks The ART and BAM concepts apply in the more general case when the network has any number of hidden or intervening neural layers between the input and output layers. This section considers just such generalized or extended ART-BAM networks for supervised learning. TheextendedBAM’sfeedbackstructuredependsontheneuralnetwork’s reverse mapping N T :R K →R I from output vectors y∈R K back to vectors x in the input pattern spaceR I . Suppose that the vector input pattern x∈R I stimulates the multilayered neural network N. It produces the vector output y =N(x)∈R K . Then the feedback signal x ′ is the signal N T (y) that the network N produces when it maps back from the output y =N(x) through its web of synapses to the input layer: x ′ =N T (y) =N T (N(x)). 
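A minimal NumPy sketch of this round trip for a small untrained network (illustrative only: the layer sizes, random weights, and logistic hidden neurons are arbitrary choices, not the dissertation's networks) shows the forward map N and the feedback map N^T that reuses the transposed weights:

```python
import numpy as np

rng = np.random.default_rng(2)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# A small network N: R^4 -> R^3 with one hidden layer of 5 logistic neurons.
W = rng.normal(size=(5, 4))   # input -> hidden synaptic matrix
U = rng.normal(size=(3, 5))   # hidden -> output synaptic matrix

def forward(x):
    """Forward pass y = N(x) through W then U."""
    return U @ logistic(W @ x)

def backward(y):
    """Backward pass N^T(y) through the transposes U^T then W^T."""
    return W.T @ logistic(U.T @ y)

x = rng.normal(size=4)
y = forward(x)                 # what the input x causes at the output layer
x_feedback = backward(y)       # feedback expectation x' = N^T(N(x))
print(y, x_feedback)
```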
We use the notation N T (y) to denote this feedback signal. The backward direction uses only the transpose matrices M T of the synaptic matrices M that the network N uses in the forward direction. So these deep networks define generalized BAM networks with complete unidirectional sweeps from front to back and vice versa. The transpose notation N T also makes clear that the feedback or backward signal N T (y) differs from the set-theoretic inverse or pullback N −1 (y). The pullback mapping N −1 : 2 R K → 2 R I always exists. It maps sets B in the output space to sets A in the input space: A = N −1 (B) ={x∈ R I : N(x)∈ B} if B⊂ R K . So the set-theoretic inverse N −1 (y) =N −1 ({y}) =A partitions the input pattern space R I into the two sets A and A c . All pattern vectors x in A map to y. All other inputs map elsewhere. The point inverse N −1 :R K →R I exists only in the rare bijective case that the network N :R I →R K is both one-to-one and onto. The three-layer BAM network in Figure 2.2 29 Input Layer Output Layer Hidden Layers Forward Pass 1 2 Backward Pass Figure 2.3: Bidirectional network: This is a four-layer network N that runs bidirectional inferences. The nodes represent artificial neurons. The blue nodes are the input neurons, the red nodes are the hidden neurons, and the green nodes are the output neurons. The edges U, V and W represent synaptic weights between neurons. The forward pass feeds the input vector x to the input layer and spills out a yf = N(x) at the output layer. This is the prediction of input vector y given x. The backward pass feeds the y at the output layer and spills out a xb =N T (y) at the input layer. This is the prediction of x given y. Backward pass inference uses the transpose of the weights for the forward pass inference. is just such a rare case of bijective network because it exactly represents the permutation mapping in Table 2.1. The bidirectional learning algorithms below do not require that the networks N have a point inverse N −1 . 2.2.2 Exact Bidirectional Representation of Bipolar Permutations This section proves that there exists multilayered neural networks that can exactly bidi- rectionally represent some invertible functions. We first define the network variables. The proof uses threshold neurons while the B-BP algorithms use soft-threshold logistic sigmoids for hidden neurons and uses identity activations for input and output neurons. 30 A bidirectional neural network is a multilayered network N : X→ Y that maps the input space X to the output space Y and conversely through the same set of weights. The backward pass uses the transpose matrices of the weight matrices that the forward pass uses. Such a network is a bidirectional associative memory or BAM [146], [148]. Figure 2.3 shows a multilayered bidirectional network. The forward pass sends input vector x through weight matrix W from the input layer to the hidden layer and then passes on through matrix U to the output layer. The backward pass sends the output y from the output layer back through the hidden layer to the input layer. Let I,J, and K denote the respective number of input, hidden, and output neurons. Then the I×J matrix W connects the input layer to the hidden. The J×K matrix U connects the hidden layer to the output layer. 2.2.2.1 Forward Pass The input neurons use bipolar threshold activation: a xf i (x i ) = −1 if x i ≤ 0 1 otherwise . 
(2.92) The hidden-neuron input o hf j has the affine form o hf j = I X i=1 w ji a xf i (x i ) +b h j (2.93) where weightw ji connects thei th input neuron to thej th hidden neuron,a xf i is the activation of the i th input neuron, and b h j is the bias term of the j th hidden neuron. The activation a hf j of the j th hidden neuron is a binary threshold: a hf j (o hf j ) = 0 if o hf j ≤ 0 1 otherwise . (2.94) The B-BP algorithm in the next section uses soft-threshold bipolar logistic functions for the hidden activations because such sigmoid functions are differentiable. The proof below also modifies the hidden thresholds to take on binary values in (2.108) and to fire with a slightly different condition. The input o yf k to the k th output neuron from the hidden layer is also affine: o yf k = J X j=1 u kj a hf j +b y k (2.95) 31 where weight u kj connects the j th hidden neuron to the k th output neuron . Term b y k is the additive bias of the k th output neuron. The output activation vector a yf gives the predicted outcome or target on the forward pass. The k th output neuron has bipolar threshold activation a yf k : a yf k (o yf k ) = −1 if o yf k ≤ 0 1 otherwise . (2.96) The forward pass of an input bipolar vector x from Table 2.1 through the network in Figure ?? gives an output activation vector a y that equals the table’s corresponding target vector y. 2.2.2.2 Backward Pass The backward pass feeds y from the output layer back through the hidden layer to the input layer. The output activation a yb k on the backward pass is a yb k (o yb k ) = −1 if o yb k ≤ 0 1 otherwise . (2.97) Then the backward-pass input o hb j to the j th hidden neuron is o hb j = K X k=1 u kj y k +b h j (2.98) where y k is the output of the k th output neuron. The backward-pass activation of the j th hidden neuron a hb j is a hb j (o hb j ) = 0 if o hb j ≤ 0 1 otherwise . (2.99) The backward-pass input o xb i to the i th input neuron is o xb i = J X j=1 w ji a hb j +b x i (2.100) whereb x i is the bias for thei th input neuron. The input-layer activation a x gives the predicted value for the backward pass. The i th input neuron has bipolar activation a xb i (o xb i ) = −1 if o xb i ≤ 0 1 otherwise . (2.101) 32 We can now state and prove the bidirectional representation theorem for bipolar per- mutations. The theorem also applies to binary permutations because the input and output neurons have bipolar threshold activations. Theorem 2.1. [Exact Bidirectional Representation of Bipolar Permutations]: Suppose that the invertible function f :{−1, 1} n →{−1, 1} n is a permutation. Then there exists a three-layer bidirectional neural network N :{−1, 1} n →{−1, 1} n that exactly repre- sents f in the sense that N(x) =f(x) and N −1 (x) =f −1 (x) for all x. Proof: The proof strategy picks weight matrices W and U so that exactly one hidden neuron fires on both the forward and the backward pass. Figure 2.4 illustrates the proof technique for the special case of a three-bit bipolar permutation. So we structure the network such that any input vector x fires only one hidden neuron on the forward pass and such that the output vector y =N(x) fires only the same hidden neuron on the backward pass. The bipolar permutation f is a bijective map of the bipolar hypercube{−1, 1} n into itself. The bipolar hypercube contains the 2 n input bipolar column vectors x 1 , x 2 ,..., x 2 n. It likewise contains the 2 n output bipolar vectors y 1 , y 2 ,..., y 2 n. The network will use 2 n corresponding hidden threshold neurons. So J = 2 n . 
Matrix W connects the input layer to the hidden layer. Matrix U connects the hidden layer to output layer. Define W so that each row lists all 2 n bipolar input vectors and define U so that each column lists all 2 n transposed bipolar output vectors: W = . . . . . . . . . . . . . . . x 1 x 2 . . . . . . x 2 n . . . . . . . . . . . . . . . U = ... y 1 T ... ... y 2 T ... ... ... ... ... ... ... ... y 2 n T ... We now show that this arrangement fires only one hidden neuron and that the forward pass of any input vector x j gives the corresponding output vector y j . Assume that every neuron has zero bias. Forward Pass 33 Pick a bipolar input vector x m for the forward pass. Then the input activation vector a xf (x m ) = [a xf 1 (x m ),...,a xf n (x m )] equals the input bipolar vector x m because the input activations (2.101) are bipolar threshold functions with zero threshold. So a xf equals x m because the vector space is bipolar{−1, 1} n . The hidden layer input o hf is the same as (2.93). It has the matrix-vector form o hf = W T a xf (2.102) = W T x m (2.103) = o hf 1 , o hf 2 ,..., o hf j ,..., o hf 2 n T (2.104) = x T 1 x m , x T 2 x m ,..., x T j x m ,..., x T 2 nx m T (2.105) from the definition of W since o hf j is the inner product of x j and x m . The input o hf j to the j th neuron of the hidden layer obeys o hf j = n when j = m and o hf j < n when j̸= m. This holds because the vectors x j are bipolar with scalar components in{−1, 1}. The magnitude of a bipolar vector in{−1, 1} n is √ n. The inner product x T j x m is maximum when both vectors have the same direction. This occurs when j = m. The inner product is otherwise less than n. Figure 2.4 shows an example of bidirectional neural network that fires the fifth hidden neuron only. The weights for the network in Figure (2.4) are: W = 1 1 1 1 −1 −1 −1 −1 1 1 −1 −1 1 1 −1 −1 1 −1 1 −1 1 −1 1 −1 (2.106) U = −1 −1 1 1 −1 −1 1 1 −1 1 −1 1 1 −1 −1 1 1 1 1 1 −1 −1 −1 −1 (2.107) Now comes the key step in the proof. Define the hidden activation a hf j as a binary (not bipolar) threshold function where n is the threshold value: a hf j (o hf j ) = 1 if o hf j ≥ n, 0 otherwise . (2.108) Then the hidden layer activation a hf is the unit bit vector [0, 0, ..., 1, ..., 0] T wherea hf j = 1 when j = m and where a hf j = 0 when j̸= m. This holds because all 2 n bipolar vectors x m in{−1, 1} n are distinct. So exactly one of these 2 n vectors achieves the maximum 34 3 3 3 3 3 3 3 3 0 0 0 0 0 0 Input Layer Hidden Layer Output Layer Figure 2.4: Bidirectional network structure for the proof of Theorem 2.1. The input and output layers have n threshold neurons while the hidden layer has 2 n neurons with threshold values of n. The 8 fan-in 3-vectors of weights in W from the input to the hidden layer list the 2 3 elements of the bipolar cube{−1, 1} 3 and thus the 8 vectors in the input column of Table 2.1. The 8 fan-in 3-vectors of weights in U from the output to the hidden layer list the 8 bipolar vectors in the output column of Table 2.1. The threshold value for the sixth and highlighted hidden neuron is 3. Passing the sixth input vector [−1, 1,−1] through W leads to the vector of thresholded hidden units of [0, 0, 0, 0, 0, 1, 0, 0]. Passing this 8-bit vector through U produces after thresholding the sixth output vector [−1,−1,−1] in Table 2.1. Then passing this output vector back through the transpose of U produces the same unit bit vector of thresholded hidden-unit values. 
Then passing this vector back through the transpose of W produces the original bipolar vector [−1, 1,−1]. inner-product value n = x T m x m . See Figure 2.4 for details in the case of representing the three-bit bipolar permutation in Table 2.1. 35 The input vector o yf to the output layer is o yf = U T a hf (2.109) = J X j=1 y j a hf j (2.110) = y m (2.111) where a hf j is the activation of the j th hidden neuron. The activation a yf of the output layer is: a yf (o yf j ) = 1 if o yf j ≥ 0 −1 otherwise . (2.112) The output layer activation leaves o yf unchanged because o yf equals y m and because y m is a vector in{−1, 1} n . So a yf = y m . (2.113) So the forward pass of an input vector x m through the network yields the desired corre- sponding output vector y m where y m =f(x m ) for bipolar permutation map f. Backward Pass Consider next the backward pass over N. The backward pass propagates the output vector y m from the output layer to the input layer through the hidden layer. The output activation vector a yb (y m ) = [a yb 1 (y m ),...,a yb n (y m )] equals the output bipolar vector y m because the output activations (2.97) are bipolar threshold functions with zero threshold. So a yb equals y m because the vector space is bipolar{−1, 1} n . The hidden layer input o hb has the form (2.98) and so o hb = U a yb (2.114) = U y m (2.115) = o hb 1 , o hb 2 ,..., o hb j ,..., o hb 2 n T (2.116) = y T 1 y m , y T 2 y m ,..., y T j y m ,..., y T 2 ny m T (2.117) from the definition of U since o hb j is the inner product of y j and y m . The input o hb j of the j th neuron in the hidden layer equals the inner product of y j and y m . So o hb j = n when j = m and o hb j < n when j̸= m. This holds because again the magnitude of a bipolar vector in{−1, 1} n is √ n. The inner product o hb j is maximum 36 when vectors y m and y j lie in the same direction. The activation a hb for the hidden layer has the same components as (2.108). So the hidden-layer activation a hb again equals the unit bit vecgtor [0, 0, ..., 1, ..., 0] T wherea hb j = 1 when j = m anda hb j = 0 when j̸= m. Then the input vector o xb for the input layer is o xb = W a hb (2.118) = J X j=1 x j a hb (2.119) = x m . (2.120) The i th input neuron has a threshold activation that is the same as a xb i (o xb i ) = 1 if o xb i ≥ 0 −1 otherwise (2.121) where o xb i is the input of i th neuron in the input layer. This activation leaves o xb unchanged because o xb equals x m and because the vector x m lies in{−1, 1} n . So a xb = o xb (2.122) = x m . (2.123) So the backward pass of any target vector y m yields the desired input vector x m where f −1 (y m ) = x m . This completes the backward pass and the proof. ■ We now apply this proof strategy to the three-bit bipolar permutation in Table 2.1. The three-layer network in this case uses matrix W in (2.106) connects the input layer to the hidden layer and matrix U in (2.107) connects the hidden layer to the output. This network has 2 3 = 8 hidden neurons. Table 2.2 shows the forward pass of the input vectors from Table 2.1. The table shows that the forward passexactly represents the mapping from input space X to output Y. Table 2.3 shows the backward pass on the output vectors from Table 2.1. This table shows that the backward pass exactly represents the mapping from output space Y to input space X. The next section presents the new bidirectional backpropagation algorithm. 
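Before turning to that algorithm, the construction in the proof is easy to check numerically. The following NumPy sketch (an illustrative verification script, not part of the dissertation) builds W from the eight input vectors and U from the corresponding output vectors of the three-bit permutation in Table 2.1 and then confirms the forward and backward threshold passes that Tables 2.2 and 2.3 tabulate.

```python
import numpy as np

# The three-bit bipolar permutation f of Table 2.1 as (input, output) pairs.
pairs = [
    (( 1,  1,  1), (-1, -1,  1)), (( 1,  1, -1), (-1,  1,  1)),
    (( 1, -1,  1), ( 1, -1,  1)), (( 1, -1, -1), ( 1,  1,  1)),
    ((-1,  1,  1), (-1,  1, -1)), ((-1,  1, -1), (-1, -1, -1)),
    ((-1, -1,  1), ( 1, -1, -1)), ((-1, -1, -1), ( 1,  1, -1)),
]
n = 3
X = np.array([p[0] for p in pairs])   # 2^n x n matrix of input vectors
Y = np.array([p[1] for p in pairs])   # 2^n x n matrix of output vectors

W = X.T    # columns of W are the 2^n input vectors (input -> hidden weights)
U = Y      # rows of U are the 2^n output vectors (hidden -> output weights)

def forward(x):
    a_h = (W.T @ x >= n).astype(int)        # binary hidden threshold with threshold n
    return np.where(U.T @ a_h >= 0, 1, -1)  # bipolar output threshold

def backward(y):
    a_h = (U @ y >= n).astype(int)          # the same single hidden neuron fires
    return np.where(W @ a_h >= 0, 1, -1)    # bipolar input threshold

assert all((forward(np.array(x)) == np.array(y)).all() for x, y in pairs)
assert all((backward(np.array(y)) == np.array(x)).all() for x, y in pairs)
print("forward and backward passes represent f and its inverse exactly")
```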
37 Table 2.2: Forward pass inference of the three-bit bipolar permutation function on Table 2.1 with the bidirectional network from Figure 2.4 with respective weights W and U from (2.106) and (2.107). The signal flow for the forward pass is a xf → o hf → a hf → o yf → a yf . a xf o hf a hf o yf a yf [ 1, 1, 1] [ 3, 1, 1,−1, 1,−1,−1,−3] [1,0,0,0,0,0,0,0] [−1,−1, 1] [−1,−1, 1] [ 1, 1,−1] [ 1, 3,−1, 1,−1, 1,−3,−1] [0,1,0,0,0,0,0,0] [−1, 1, 1] [−1, 1, 1] [ 1,−1, 1] [ 1,−1, 3, 1,−1,−3, 1,−1] [0,0,1,0,0,0,0,0] [ 1,−1, 1] [ 1,−1, 1] [ 1,−1,−1] [−1, 1, 1, 3,−3,−1,−1, 1] [0,0,0,1,0,0,0,0] [ 1, 1, 1] [ 1, 1, 1] [−1, 1, 1] [ 1,−1,−1,−3, 3, 1, 1,−1] [0,0,0,0,1,0,0,0] [−1, 1,−1] [−1, 1,−1] [−1, 1,−1] [−1, 1,−3,−1, 1, 3,−1, 1] [0,0,0,0,0,1,0,0] [−1,−1,−1] [−1,−1,−1] [−1,−1, 1] [−1,−3, 1,−1, 1,−1, 3, 1] [0,0,0,0,0,0,1,0] [ 1,−1,−1] [ 1,−1,−1] [−1,−1,−1] [−3,−1,−1, 1,−1, 1, 1, 3] [0,0,0,0,0,0,0,1] [ 1, 1,−1] [ 1, 1,−1] Table 2.3: Backward pass inference of the three-bit bipolar permutation function on Table 2.1 with the bidirectional network from Figure 2.4 with respective weights W and U from (2.106) and (2.107). The signal flow for the backward pass is a yb → o hb → a hb → o xb → a xb . a yb o hb a hb o xb a xb [−1,−1, 1] [ 3, 1, 1,−1, 1,−1,−1,−3] [1,0,0,0,0,0,0,0] [ 1, 1, 1] [ 1, 1, 1] [−1, 1, 1] [ 1, 3,−1, 1,−1, 1,−3,−1] [0,1,0,0,0,0,0,0] [ 1, 1,−1] [ 1, 1,−1] [ 1,−1, 1] [ 1,−1, 3, 1,−1,−3, 1,−1] [0,0,1,0,0,0,0,0] [ 1,−1, 1] [ 1,−1, 1] [ 1, 1, 1] [−1, 1, 1, 3,−3,−1,−1, 1] [0,0,0,1,0,0,0,0] [ 1,−1,−1] [ 1,−1,−1] [−1, 1,−1] [ 1,−1,−1,−3, 3, 1, 1,−1] [0,0,0,0,1,0,0,0] [−1, 1, 1] [−1, 1, 1] [−1,−1,−1] [−1, 1,−3,−1, 1, 3,−1, 1] [0,0,0,0,0,1,0,0] [−1, 1,−1] [−1, 1,−1] [ 1,−1,−1] [−1,−3, 1,−1, 1,−1, 3, 1] [0,0,0,0,0,0,1,0] [−1,−1, 1] [−1,−1, 1] [ 1, 1,−1] [−3,−1,−1, 1,−1, 1, 1, 3] [0,0,0,0,0,0,0,1] [−1,−1,−1] [−1,−1,−1] 2.3 Bidirectional Backpropagation The newbidirectional backpropagation (B-BP) algorithm extends the ART and BAM concepts to supervised learning in multilayer networks [5], [8]. B-BP trains the neural network in both the forward and backward directions using the same weights. Using the same weights entails using the matrix transposes M T in the backward direction for any synaptic weight matricesM that the networkN uses in the forward direction. So this bidirectional processing throughthebackward-passnetworkN T convertsfeedforwardnetworksN intoBAMnetworks. The B-BP algorithm shows how to extend ordinary unidirectional BP [223], [262] without overwriting in either direction. 38 2 3 4 5 6 7 8 9 10 Dimensionn 200 400 600 800 1000 Hidden neuronsN h Theorem 2.1 Bidirectional BP Figure2.5: Thenumberofhiddenneuronsforthebidirectionalrepresentationofbipolarpermutations. The number grows exponentially with n by Theorem 2.1. The number grows linearly with n for the B-BP training. The added computational cost of B-BP is slight compared with ordinary or forward- only BP. This holds because the time complexity for BP in either direction is O(n) for n input-output training samples. The complexity in the B-BP case remains O(n) because O(n) +O(n) =O(n). Figure 2.2 shows a three-layer BAM network after B-BP training. All neurons are threshold on-off neurons with zero threshold. The feedback network exactly represents the same three-bit bipolar permutation mapping and its inverse from Table 2.1. Each network maps a three-bit vector x∈{−1, 1} 3 to a three-bit vector y∈{−1, 1} 3 . Each network’s backward direction maps y back to the same input x. So the input x = [−1, 1,−1] maps to y = [−1,−1,−1] and conversely. 
So the vector pair (x, y) is a BAM fixed point of each network. This holds for all 8 vector pairs in Table 2.1 for each BAM network. There are more than 40, 320 such three-bit permutation mappings and associated inverse mappings because there are 2 3 ! or 8! ways to permute the 8 input vectors in Table 2.1. Theorem 2.1 showed that a three-layer threshold BAM network can always exactly represent an n-bit permutation and its inverse if the network uses 2 n hidden threshold neurons. That architecture would require 8 hidden neurons in this case. So B-BP algorithm reduced this exponential number of hidden neurons inn to a linear number of hidden neurons. Figure 2.5 shows this trend using a set of bipolar permutations with n∈{2,..., 10}. The main problem with unidirectional learning is overwriting. Learning in one direction tends to overwrite or undo learning in the other direction. This may explain why users have applied BP almost exclusively in the forward direction for classifiers or regressors. Running simple BP in the backward direction will only degrade the prior learning in the forward direction. 39 B-BP solves the overwriting problem with two performance measures. Each direction gets its own performance measure based on whether that direction performs classification or regression or some other task. B-BP sums the two error or log-likelihood performance measures. The overall forward-and-backward sweeps minimize the network’s total log- likelihood. Figure 2.6 shows how the two performance measures overcome the problem of overwriting for the 3-layer BAM in Figure 2.2. Figures 2.6a and 2.6b show that learning in one direction overwrites learning in the other direction with the same performance measure. Figure 2.6c shows rapid learning in both directions because it used the sum of the two correct performance measures. The performance measures were correct in the sense that they preserved the BP learning equations for a given choice of output-neuron structure and directional functionality. This is backpropagation invariance. The BP invariance requires that the log-likelihoods must correspond to the functions that the forward and backward passes perform. A deep classifier gives a canonical example. It uses a cross-entropy performance measure for its forward pass of classification. The classifier implicitly uses a squared-error performance measure for its backward pass of regression. The overwhelming number of classifiers in practice simply ignore this backward-pass structure. B-BP’s sum of performance measures stems from the bidirectional network’s total log- likelihood structure. The forward direction has the likelihood function p(y|x, Θ) for input vector x and output vector y and all network parameters Θ. The backward direction has the converse likelihood function p(x|y, Θ). This likelihood structure holds in general at the layer level so long as the layer neurons and performance measure obey BP invariance. The back-and-forth structure of bidirectionality gives the joint or total likelihood function as the product p(y|x, Θ)p(x|y, Θ). So the network log-likelihood L is just the sum L = logp(y|x, Θ) + logp(x|y, Θ). The negative log-likelihoods define the forward and backward error functions or performance measures. Then the gradient ∇L = ∇ logp(y|x, Θ) + ∇ logp(x|y, Θ) leads to the B-BP algorithm and its noise-boost by way of the generalized EM algorithm. 
The forward direction of the classifier uses the cross-entropy between the desired output target distribution and the actualK softmax outputs. The cross-entropy is just the negative of the logarithm of a one-trial multinomial probability density p(y|x, Θ). So the statistical structure of the forward pass corresponds to rolling a K-sided die. The backward pass uses the squared-error at the input. This arises from taking the logarithm of a vector normal distribution p(x|y, Θ). It also implies that the backward learning estimates the k th local pattern-class centroid because the centroid minimizes the squared-error. The key point is that BP invariance must hold for proper bidirectional learning: The network can perform classification or regression or any other function that leaves the 40 0 20 40 60 80 100 Training epoch 0.0 0.2 0.4 0.6 0.8 Logistic cross-entropy Undirectional BP (f :X−→Y) Forward error Ef(Θ) Backward error Eb(Θ) (a) Forward error 0 20 40 60 80 100 Training epoch 0.0 0.2 0.4 0.6 0.8 Logistic cross-entropy Undirectional BP (f −1 :Y −→X) Forward error Ef(Θ) Backward error Eb(Θ) (b) Backward error 0 20 40 60 80 100 Training epoch 0.0 0.2 0.4 0.6 0.8 Logistic cross-entropy Bidirectional BP (f −1 :Y −→X) Forward error Ef(Θ) Backward error Eb(Θ) (c) Joint error: Forward and Backward Figure 2.6: Bidirectional BP avoids overwriting: Backpropagation overwrites in one direction as it trains in the other direction if it uses a single performance measure or error function. Bidirectional backpropagation avoids overwriting because it uses the sum of two error functions that match the terminal neuron structure and functionality of each direction. The forward and backward performance measures were each a double cross entropy in equation (2.20) because the input and output neurons were logistic (later rounded-off to threshold neurons). (a) Unidirectional BP training on the forward pass reduces the forward error but overwrites the learning of the inverse map in the backward direction. (b) Unidirectional BP training on the backward pass reduces the backward training error but overwrites the learning of the permutation map in the forward direction. (c) Bidirectional BP reduces both errors because it trains with the sum of the forward and backward cross entropies. There was no over-writing in either direction. backpropagation likelihood structure invariant. The bidirectional network itself can perform different functions in different directions. 2.3.1 Bidirectional Backpropagation: Maximum Likelihood Estimation B-BP is also a form of maximum likelihood estimation. The forward pass maps the input x to the output N(x). The backward pass maps the output y back through the network to N T (y) given the current network parameters Θ. The algorithm jointly maximizes the forward likelihood p f (y|x, Θ) and the backward likelihood p b (x|y, Θ). So B-BP finds the 41 weight or parameter vector Θ ∗ that maximizes the joint network likelihood: Θ ∗ = arg max Θ p f (y|x, Θ)p b (x|y, Θ). (2.124) This likelihood structure holds more generally in terms of the forward and backward layer likelihoods per (2.90). This joint optimization is the same as the joint optimization of the directional (and layer) log-likelihoods: Θ ∗ = arg max Θ logp f (y|x, Θ) + logp b (x|y, Θ) (2.125) because the logarithm is monotone increasing. The B-BP algorithm also uses gradient descent or any of its variants to iteratively update the parameters. 
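The following PyTorch sketch shows one such joint gradient step for the mixed case described above and detailed in the next section: a two-hidden-weight network whose forward pass is a softmax classifier and whose backward pass reuses the transposed weights as a regressor. This is an illustrative implementation under our own simplifying assumptions (no biases, random placeholder data, layer sizes and backward scale chosen arbitrarily), not the dissertation's code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
I, H, K = 8, 16, 4                          # input, hidden, output sizes (placeholders)
W = torch.randn(H, I, requires_grad=True)   # input -> hidden weights
U = torch.randn(K, H, requires_grad=True)   # hidden -> output weights

def forward(x):       # classifier direction: log-softmax output probabilities
    return F.log_softmax(F.linear(torch.sigmoid(F.linear(x, W)), U), dim=-1)

def backward(y):      # regressor direction: same weights, transposed
    return F.linear(torch.sigmoid(F.linear(y, U.t())), W.t())

x = torch.randn(32, I)                      # a toy batch of input patterns
y = torch.randint(0, K, (32,))              # class labels
y_onehot = F.one_hot(y, K).float()

opt = torch.optim.SGD([W, U], lr=0.01)
for _ in range(100):
    opt.zero_grad()
    E_f = F.nll_loss(forward(x), y)              # forward cross entropy
    E_b = F.mse_loss(backward(y_onehot), x)      # backward squared error
    (E_f + 1.0 * E_b).backward()                 # joint error E = E_f + alpha * E_b
    opt.step()
```
The single call to `backward()` on the summed loss is the point: both directional errors update the same two weight matrices, so neither direction overwrites the other.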
The next section presents four different B-BP structures based on the corresponding likelihood structure of the network. The four structures are double regression, double classification, double logistic, and the mixed case of classification and regression. Their function and network structure dictates their log-likelihoods in accord with BP invariance. BP invariance ensures that all the different network structures still use the same BP updates on a given directional training pass. 2.3.1.1 Bidirectional Backpropagation: Double Regression A bidirectional network is a double regression network if it performs regression in both directions. So the input and output layers both use identity activations. The output y is a Gaussian random vector with mean a yf and with identity or white covariance matrix I. The input x is a also Gaussian random vector with mean a xb and identity or white covariance matrix I: p f (y|x, Θ) = 1 (2π) K/2 exp ( − ||y− a yf || 2 2 ) (2.126) and p b (x|y, Θ) = 1 (2π) I/2 exp ( − ||x− a xb || 2 2 ) (2.127) 42 where I is the dimension of the input layer and K is the dimension of the output layer. The forward log-likelihood L f (Θ) and backward log-likelihood L b (Θ) are: L f (Θ) = logp(y|x, Θ) = log(2π) − K 2 − ||y− a yf || 2 2 (2.128) and L b (Θ) = logp(x|y, Θ) = log(2π) − I 2 − ||x− a xb || 2 2 . (2.129) The error functions are the squared-error E f (Θ) for the forward pass and sqaured error E b (Θ) for the backward pass. B-BP minimizes the joint error E(Θ): E(Θ) =E f (Θ) +E b (Θ) = 1 2 K X k=1 y k −a yf k 2 + 1 2 I X i=1 x i −a xb i 2 . (2.130) Then [8] gives the detailed B-BP updates rules for double regression. 2.3.1.2 Bidirectional Backpropagation: Double Classification A double classifier is a neural network that acts as a classifier in both the forward and backward directions. So both the input and output layers use softmax or Gibbs activations. The output y with activation a yf k defines the multinomial probability p f (y k = 1|x, Θ). The input x with a xb i defines the dual multinomial probability p b (x i = 1|y, Θ): p f (y|x, Θ) = K Y k=1 a yf k y k (2.131) and p b (x|y, Θ) = I Y i=1 a xb i x i . (2.132) Then the log-likelihoods L f and L b are negative cross entropies: L f (Θ) = logp f (y|x, Θ) = K X k=1 y k log a yf k (2.133) and L b (Θ) = logp f (x|y, Θ) = I X i=1 x i log a xb i . (2.134) 43 The forward error E f (Θ) is the cross-entropy E f (Θ). The backward error E b (Θ) is likewise the cross-entropy E b (Θ). B-BP for double classification minimizes this joint error E(Θ): E(Θ) =E f (Θ) +E b (Θ) =− K X k=1 y k log a yf k − I X i=1 x i log a xb i . (2.135) [8] give the update rules for B-BP with double classification. Training in each direction applies ordinary BP updates because of BP invariance. Training occurs in one direction at a time. 2.3.1.3 Bidirectional Backpropagation: Double Logistic Networks A double logistic network has logistic neurons at both the input and output layers. Figure 2.2 shows such double logistic networks where the logistic sigmoids are so steep as to give threshold neurons. The probability of the logistic-vector output y is a product of K independent Bernoulli probabilities. The k th output activation a yf k is the probability p f (y k = 1|x, Θ). The probability of the logistic-vector input x is a product ofI independent Bernoulli probabilities. So the i th input activation a xb i is the probability p b (x i = 1|y, Θ). 
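As a compact reference (an illustrative NumPy sketch with placeholder activations, not from the dissertation), the joint errors (2.130) and (2.135) can be written directly as functions of the forward and backward activations; the double-logistic error (2.140) below follows the same pattern with two Bernoulli cross entropies.

```python
import numpy as np

def double_regression_error(x, y, a_yf, a_xb):
    """Joint error (2.130): forward squared error plus backward squared error."""
    return 0.5 * np.sum((y - a_yf) ** 2) + 0.5 * np.sum((x - a_xb) ** 2)

def double_classification_error(x, y, a_yf, a_xb, eps=1e-12):
    """Joint error (2.135): forward plus backward cross entropy.
    x and y are 1-in-I and 1-in-K coded; a_yf and a_xb are softmax activations."""
    return -np.sum(y * np.log(a_yf + eps)) - np.sum(x * np.log(a_xb + eps))

# Example with placeholder activations for I = 4 input and K = 3 output neurons.
x = np.array([0.0, 1.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 1.0])
a_yf = np.array([0.2, 0.1, 0.7])          # forward softmax output
a_xb = np.array([0.1, 0.6, 0.2, 0.1])     # backward softmax activation at the input layer
print(double_classification_error(x, y, a_yf, a_xb))
```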
Logistic neurons can also map to the bipolar interval [−1, 1] by simple scaling and translation of the usual binary logistic function. We trained the bidirectional threshold network in Figure 2.2 as a double- logistic network with steep logistic sigmoid activations. Then we replaced the steep logistics after B-BP training with threshold functions. The forward likelihood p f (y|x, Θ) and the backward likelihood p b (x|y, Θ) have the product-Bernoulli form p f (y|x, Θ) = K Y k=1 a yf k y k 1−a yf k 1−y k (2.136) and p b (x|y, Θ) = I Y i=1 a xb i x i 1−a xb i 1−x i . (2.137) The log-likelihoods L f (Θ) and L b (Θ) define double cross entropies: L f (Θ) = logp f (y|x, Θ) = K X k=1 y k log a yf k + (1−y k ) log 1−a yf k (2.138) 44 Algorithm 2.1: Bidirectional Backpropagation Algorithm Data: x ={x (i) } N i=1 , y ={y (i) } N i=1 , number of iterations T, backward error scale α, and learning rate η Result: Θ BBP : Bidirectional BP estimate of weight Θ 1 • Initialize the network parameter Θ (0) ; 2 for t = 0,....,T− 1 do 3 for m = 1,....,M do 4 • Forward inference: a yf ←N(x (m) ); 5 • Backward inference: a xb ←N −1 (y (m) ); 6 • Compute forward error: E f (Θ) ; 7 • Compute backward error: E b (Θ); 8 • Compute the joint error: E(Θ)←E f (Θ) +αE b (Θ) 9 • Update the weights: Θ (t+1) ← Θ (t) −η∇ Θ E(Θ) Θ=Θ (t) 10 Θ BBP ← Θ (T ) ; Backward error scale α>0. and L b (Θ) = logp f (x|y, Θ) = I X i=1 x i log a xb i + (1−x i ) log 1−a xb i . (2.139) Then the joint error function E(Θ) for a double logistic network is the sum E(Θ) =E f (Θ) +E b (Θ) =− K X k=1 y k log a yf k + (1−y k ) log 1−a yf k − I X i=1 x i log a xb i + (1−x i ) log 1−a xb i . (2.140) The BP update rules for double-logistic training are the same as for double classification because of BP invariance. Algorithm 2.1 shows the pseudocode for running the B-BP algorithm. It is applicable to the four likelihood structures. 2.3.1.4 Mixed Case: B-BP Classification and Regression The mixed bidirectional case occurs when the forward pass is a classifier and the backward pass is a regressor. This is the implied structure of the modern neural classifier used only in the forward direction because it maps identity neurons to output softmax neurons. The 45 forward pass through the network N estimates the class membership of a given input pattern x. So the K output neurons have softmax or Gibbs activations. The backward pass throughN T estimates the class centroid (in the common least-squares case) given an output class-label unit bit vector or other length-K probability vector y. So the input neurons use identity or linear activations. The forward likelihood is a multinomial random variable as in the case of double classification. The backward likelihood is a Gaussian random vector as in the case of double regression. BP invariance holds for these likelihoods given the corresponding softmax output neurons and identity input neurons. The total mixed error E(Θ) sums the forward cross-entropy E f (Θ) and the backward squared-error E b (Θ): E(Θ) =E f (Θ) +E b (Θ) =− K X k=1 y k log a yf k + 1 2 I X i=1 x i −a xb i 2 . (2.141) The original B-BP sources [5], [8] give the gradient update rules for B-BP in the mixed case as well for double regression and double classification. BP invariance greatly simplifies the forward and backward updates because then all network types use the same BP updates. 2.4 Bayesian Bidirectional Backpropagation Bayesian training of a neural network is a form of maximum aposteriori (MAP) estimation. 
It maximizes the posterior f(Θ|x): f(Θ|x) = g(x|Θ)h(Θ) R g(x|Θ)h(Θ) dΘ ∝g(x|Θ)h(Θ) (2.142) where g(x|Θ) is the likelihood and h(Θ) is the prior density of the now random vector of parameters Θ. The MAP estimate Θ MAP equivalently maximizes the log-posterior and thus maximizes the sum of the log-likelihood and the log-prior: Θ MAP = arg max Θ f(Θ|x) (2.143) = arg max Θ g(x|Θ)h(Θ) (2.144) = arg max Θ logg(x|Θ) + logh(Θ) (2.145) because the unconditional total-probability denominator term R g(x|Θ)h(Θ)dΘ is not a function of Θ. The bidirectional backpropagation algorithm uses separate directional posteriors (and 46 can extend to any finite number of directions). The forward posterior p f (Θ|x) has the usual network form from Bayes theorem: p f (Θ|x) = g f (x|Θ)h f (Θ) R g f (x|Θ)h f (Θ) dΘ ∝g f (x|Θ)h f (Θ) (2.146) where g f (x|Θ) is the forward likelihood and h f (Θ) is the forward prior. The backward posterior p b (Θ|y) has the like form p b (Θ|y) = g b (y|Θ)h b (Θ) R g b (y|Θ)h b (Θ) dΘ ∝g b (y|Θ)h b (Θ) (2.147) where g b (y|Θ) is the backward likelihood and h b (Θ) is the backward prior. The bidirectional network’s total or joint bidirectional posterior combines the forward posterior p f (Θ|x) and the backward posterior p b (Θ|y). The conjunctive “and” gives rise to the posterior product p f (Θ|x)p b (Θ|y) because to first order the two directional passes are independent of each other. This gives the joint posterior as this product p f (Θ|x)p b (Θ|y) (this generalizes to k directions as just the product of the k directional posteriors). Then the bidirectional MAP estimate Θ BMAP maximizes the joint bidirectional posterior: Θ BMAP = arg max Θ p f (Θ|x)p b (Θ|y) (2.148) = arg max Θ g f (x|Θ)h f (Θ)g b (y|Θ)h b (Θ). (2.149) Maximizing the log-posterior gives the same MAP estimate: Θ BMAP = arg max Θ logg f (x|Θ) + logh f (Θ) + logg b (y|Θ) + logh b (Θ) (2.150) = arg max Θ logg f (x|Θ) + logg b (y|Θ) + logh f (Θ) + logh b (Θ) = arg max Θ logg f (x|Θ) + logg b (y|Θ) + logh(Θ) (2.151) where h(Θ) =h f (Θ)h b (Θ). The original bidirectional backpropagation (B-BP) algorithm [5], [7], [8], [153] is a form of maximum likelihood estimation. B-BP training endows a multilayered neural network with a form of bidirectional inference and proceeds as the joint gradient optimization of the forward likelihood p f (y|x, Θ) and the backward likelihood p b (x|y, Θ). So the B-BP 47 Bayesian B-BP: Input Output Feedforward Feedback Feedforward Input layer Output layer Hidden layers Forward likelihood Backward likelihood Feedback Prior Classification Regression Figure 2.7: Bayesian Bidirectional Backpropagation: The network maximizes the joint forward and backward posterior probability. The network diagram shows the simple but practical case of a Laplacian or Lasso-like prior on the input weights. The identity neurons at the input field give rise to a vector normal likelihood and thus a squared-error error function in the backward direction. The softmax neurons at the output field give rise to a multinomial likelihood and thus a cross-entropy error function in the forward direction. So the network acts as classifier in the forward direction and a regressor in the backward direction. maximum-likelihood estimate Θ BBP has the form Θ BBP = arg max Θ p f (y|x, Θ) p b (x|y, Θ) (2.152) = arg max Θ logp f (y|x, Θ) + logp b (x|y, Θ). (2.153) Suppose the default case that both the forward prior h f (Θ) and the backward prior h b (Θ) are uniform probability densities. 
Then the Bayesian B-BP estimate Θ BMAP reduces to the original maximum-likelihood B-BP estimate Θ BBP : Θ BMAP = Θ BBP . (2.154) 48 −3 −2 −1 0 1 2 3 x −1.0 −0.5 0.0 0.5 1.0 f(x) = sin(x) Forward Pass Target Prediction −1.0 −0.5 0.0 0.5 1.0 y −2 −1 0 1 2 Centroid({ sin −1 y} ) Backward Pass Target Prediction (a) Forward squared-error and backward squared-error −3 −2 −1 0 1 2 3 x −1.0 −0.5 0.0 0.5 1.0 f(x) = sin(x) Forward Pass Target Prediction −1.0 −0.5 0.0 0.5 1.0 y −2 −1 0 1 2 Median({ sin −1 y} ) Backward Pass Target Prediction (b) Forward squared-error and backward absolute error Figure 2.8: Bidirectional BP double regression learning a non-invertible function: The forward pass approximation and the backward pass approximation of f(x) =sin(x) over x∈ [−π,π] with double regression networks. The networks used 2 hidden layers with 100 neurons per hidden layer. The hidden layers used logistic sigmoid neurons. (a) This network used forward squared-error and backward squared error. The backward pass maps to the centroid of the set-based pullback f −1 (y). The centroids were−π/2 , π/2, and 0. (b) This network used forward squared-error and backward absolute error. The backward pass maps to the median of the set-based pullback f −1 (y). The two medians were−π/2, π/2, and 0. 2.5 LearningNoninvertibleFunctionswithBidirectionalBack- propagation Let f : X → Y be a noninvertible function. The inverse mapping f −1 is not a point well-defined point function. Instead it is a set-based pullback or preimage f −1 such that f −1 ({y}) ={x∈R :f(x) =y}⊂R . (2.155) An inverse mapping can return more that one response to a single query in this case. The backward pass of a bidirectional network captures the set-based pullback of the 49 target function. A neural network cannot output multiple answers to a single query because its parameter are fixed during the inference stage. Instead the backward pass maps to a representative of the set pullback. The definition of the representative depends on the choice of the backward error or likelihood structure. The backward maps to the centroid of the set-based pullback if the backward pass uses squared-error or a Gaussian likelihood. This is so because centroids minimize squared-error. It maps back to the median of the set-based pullback in the case of absolute error or Laplacian likelihood because medians minimize absolute error. Figure 2.8 shows the backward pass of a bidirectional network that trained on the noninvertible function f(x) = sin(x) where x∈ [−π,π]. The range of y is [-1, 1] in this case. The pre-image or set-based pullback f −1 in this case is f −1 ({y}) = sin −1 (y) = {x,π−x}, if y> 0 {−π, 0,π}, if y == 0 {−π−x,x}, if y< 0 (2.156) where− π 2 ≤x≤ π 2 . So the centroidC of the set-based pullback is C(f −1 ({y})) =C(sin −1 (y)) = π 2 , if y> 0 0, if y == 0 − π 2 , if y< 0 . (2.157) So the backward pass converges to this centroid in the case of backward squared-error. The median is equal to the centroid in this case. So the backward pass also converges to the centroid with backward absolute error. Figure 2.8 compares the forward pass and backward of a double regression models. It shows that the model converges to the centroid in the case of squared-error and median in the case of absolute error. 2.6 Conclusion Unidirectional backpropagation learning extends to bidirectional backpropagation learning if the algorithm uses the appropriate joint error function for both forward and backward passes. 
A bidirectional multilayered threshold network can exactly represent permutation mappings if the hidden layer contains an exponential number of hidden threshold neurons. Bidirectional BP extends to Bayesian bidirectional backpropagation, which introduces a prior term in addition to the forward and backward likelihoods. A noninvertible mapping requires the bidirectional network to map back to a representative of the set-based pullback.

Chapter 3
Bidirectional Backpropagation Application: Deep Neural Classifiers

Bidirectional backpropagation (B-BP) allows a deep classifier to run backwards and thereby improve the classifier's accuracy at little extra cost. This holds because B-BP reveals a hidden regressor in the reverse direction when neural signals flow backward from the output layer to the identity neurons at the input layer. B-BP takes into account the layer probability structure at both the output softmax classification layer and the identity input layer. The output layer defines a multinomial probability. The output probability corresponds to the roll of a K-sided die. So the output error function is cross entropy since the output error is just the negative log-likelihood of the output probability. But the input layer defines to first order a vector normal probability because the input neurons are identity or linear neurons as in a regression. So the input error function is just ordinary squared error. Unidirectional BP ignores this backward error function while B-BP jointly minimizes it along with the forward cross entropy. This bidirectional framework also extends to convolutional architectures.

B-BP further extends to Bayesian bidirectional backpropagation. This general case shows how BP training maximizes the joint or bidirectional posterior probability of the network. Ordinary B-BP just maximizes the total or bidirectional likelihood probability of the network. Bayesian B-BP takes account of the prior probability structure of parameters such as the weights between hidden layers. Bayesian B-BP reduces to likelihood B-BP in the default case when such weight priors are uniform or otherwise uninformative. The next sections present these different bidirectional architectures and then show how they perform on standard image test sets.

3.1 Neural Classifiers

Neural classifiers are feedforward neural networks that map from the input space R^I to one of K possible classes. These feedforward networks map images or videos or other patterns to K softmax output neurons. The input neurons are most often identity functions because they act as data registers. Modern neural classifiers tend to grow deeper with more pattern classes [34], [133], [164], [193].

Classifier networks map patterns to probability vectors. A softmax output activation has the ratio form of an exponential divided by a sum of K such exponentials. So the output activation a^y = N(x) defines a probability vector of length K. The network output y is a point in the (K-1)-dimensional simplex S^{K-1} ⊂ R^K. Supervised learning for the classifier network uses 1-in-K encoding for the K standard basis or unit bit vectors e_1, ..., e_K. Input pattern x ∈ C_k ⊂ R^I should map to the corresponding k-th unit bit vector e_k ∈ S^{K-1} if C_k is the k-th input pattern class. An ideal classifier emits N(x) = e_k if and only if x ∈ C_k. Then the K set-theoretic pullbacks N^{-1}(e_k) of the K basis vectors e_k partition the input pattern space R^I into the K desired pattern classes C_k: R^I = ∪_{k=1}^{K} N^{-1}(e_k) = ∪_{k=1}^{K} C_k.
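A short NumPy sketch (illustrative only, with placeholder output scores) makes the output encoding concrete: the softmax activation maps any vector of output-layer inner products to a point on the simplex S^{K-1}, the 1-in-K targets are the unit bit vectors e_k, and the induced decision rule assigns the input to the class with the largest softmax output.

```python
import numpy as np

def softmax(o):
    """Softmax: each exponential divided by the sum of the K exponentials."""
    e = np.exp(o - o.max())          # subtract the max for numerical stability
    return e / e.sum()

K = 4
o = np.array([1.5, -0.2, 0.3, 2.1])  # placeholder output-layer inner products
a_y = softmax(o)

print(a_y, a_y.sum())                # a point in the simplex: entries sum to 1
print(np.eye(K)[a_y.argmax()])       # nearest 1-in-K unit bit vector e_k
```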
3.1.1 Implicit Bidirectionality of Neural Classifiers A classifier network defines a bidirectional feedback system despite its feedforward mapping of patterns to probabilities. Most users simply ignore the backward pass and thus ignore the overall bidirectional structure. The reverse network N T :S K−1 →R I can always map probability descriptions y back to patterns in the input space through the reverse mapping N T by using the transpose of all synaptic weight matrices. This holds even when the classifier encodes time-varying patterns in simple recurrent loops among their hidden layers [114]. The reverse pass through N T can thereby answer why-type causal questions: Why did this observed output occur? What type of input pattern caused this result? Pattern answers to these why-type questions are versions of Grossberg’s network feedback expectations [95]–[97]. The images in Figure 3.1 show such answers or expectations from a deep network with 5 hidden layers. This network operates in the bidirectional associative memory (BAM) mode [146]–[149]. The bidirectional network trained on CIFAR-10 images. The convolutional classifier in Figure 3.2 shows similar reverse-pass images N T (y) orN T (e k ) from a deep convolutional network in BAM mode. Simple unidirectional backpropagation training produces the uninformative reverse-pass images in Figures 3.1b and 3.2b. Proper bidirectional training with the B-BP algorithm below produces reverse-pass images that closely match the sample centroids of the pattern classes. TheforwardpassthroughthenetworkN answerscomparativelysimplerwhat-if questions: 53 Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (a) Target Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (b) Backward recall with unidirectional BP Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (c) Backward recall with bidirectional BP Figure 3.1: Running a neural classifier in reverse: Unidirectional BP versus bidirectional BP recall after training on the CIFAR-10 image dataset. The backward recall mapped a class-label unit-bit-vector output basis vector e k back through the transposed matrices of the synaptic weights to the input layer. The classifier used 5 hidden layers and swept forward and backward through those layers. (a) The 10 target images are the sample class centroids of the 10 image pattern classes. (b) The 10 noise images show the backward recall of the class-label output vectors e k when the network trained with ordinary unidirectional BP. (c) Backward-pass prediction of the class-label output vectors e k with B-BP. The feedback signals closely matched the corresponding sample class centroids and gave the network a type of attentive focus. What happens if this input stimulates the system? What effect will this input cause? Bidirectional processing can help even here as many of the simulations show. Modern unidirectional classifiers N simply ignore the reverse mapping information housed in the transpose matrices of the reverse pass N T . The equilibrium forward pass through the network N differs in general from the first forward pass through N. 
Modern feedforward 54 Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (a) Target Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (b) Backward recall with unidirectional BP Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (c) Backward recall with bidirectional BP Figure3.2: Runningaconvolutionalneuralclassifierinreverse: UnidirectionalBPversusbidirectional BP recall after training on the CIFAR-10 image dataset. The backward pass of the class-label unit-bit vector output basis vectors e k for multilayered convolutional neural classifiers. The classifier used four convolution layers and two fully connected layers. The network swept forward and backward through those layers. The convolution filters acted as bidirectional feature extractors. (a) The 10 target images are the sample class centroids of the 10 image pattern classes. (b) The 10 noise images show the backward recall of the class-label output vectors e k when the network trained with ordinary unidirectional BP. (c) Backward-pass prediction of the class-label output vectors e k with B-BP. The feedback signals closely matched the corresponding sample class centroids and gave the network a type of attentive focus. classifiers simply take this first forward pass as the network’s final output. This implied bidirectional associative memory (BAM) [146]–[149] feedback structure requires only that output vectors y∈S K−1 pass back through the transpose matrices of all the synaptic matrices that the forward pass used to produce a given output from a given 55 Digit: 0 Digit 1 Digit: 2 Digit: 3 Digit: 4 Digit: 5 Digit: 6 Digit: 7 Digit: 8 Digit: 9 (a) MNIST: Class centroids Digit: 0 Digit 1 Digit: 2 Digit: 3 Digit: 4 Digit: 5 Digit: 6 Digit: 7 Digit: 8 Digit: 9 (b) MNIST: Projections of class centroids Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (c) CIFAR-10: Class centroids Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (d) CIFAR-10: Projections of class centroids Figure 3.3: Class centroids versus their projections: The class centroids of the MNIST handwritten digit dataset look like a member of their respective classes but this is not true for the class centroids of the CIFAR-10 image dataset. The respective class projections of the class centroids look like samples from their respective classes for the MNIST and CIFAR-10 image datasets. (a) The class centroids of the 10 pattern classes that make up the MNIST dataset. (b) The respective class projections of the class centroids of the 10 pattern classes that make up the MNIST dataset. (c) The class centroids of the 10 pattern classes that make up the CIFAR-10 dataset. (d) The respective class projections of the class centroids of the 10 pattern classes that make up the CIFAR-10 dataset. input. This backward pass also requires a related backward pass through any windows or masks in convolutional classifiers [78], [157], [166]. Then even a convolutional classifier can produce a feedback signal x ′ =N T (y) at the input layer (and at any hidden layer). The last section shows how the simulations passed output information backward through the convolutional filters of deep convolutional networks. The feedback signal x ′ looks like a handwritten signal if the classifier network trains on the MNIST digit data. It looks like some other image if it trains on the CIFAR image test set. Figure 3.3 compares the class centroids of these datasets. 
So we can train the network in the backward direction N T if we compute some form of error based on the triggering input pattern x∈C k and based on what the network expects to see x ′ =N T (y) =N T (N(x)) in the Grossberg sense of adaptive resonance theory (ART) [95]–[97]. This insight leads to the new bidirectional supervised learning method below. Algorithm 3.1 shows the pseudocode of the new method. We also point out that the backward direction in such classifiers performs a form of statistical regression. This holds because the input neurons are identity units. It does not 56 hold if the input neurons have logistic or other nonlinear form. The regression structure leads to an optimal mean-squared structure in the backward direction as we explain below. The regression structure in the backward direction dictates both a different training regimen and a different noise-injection algorithm from the forward classifier. The backward tendency toward mean-squared estimation implies that such classifiers estimate the K class centroids c k in the backward direction. So the feedback signal x ′ = N T (y) shows what the classifier network expects to “see” or perceive at the input field given the current pattern stimulus x. It shows what the network expects the local class centroid to look like. That explains why the backward-pass signals in Figures 3.1c and 3.2c so closely match the respective sample class centroids in Figures 3.1a and 3.2a. 3.1.2 Bidirectional Backpropagation Training of Neural Classifiers ThissubsectionextendstheB-BPalgorithmtothetrainingofneuralclassifiersasbidirectional neural networks. The mapping fromY toX or vice versa is not a well-defined point function in most cases. So the k th unit basis vector y need not map back to a unique value. It maps instead to a pre-image set f −1 (y) as discussed above. The reverse pass through N T gives a point x ′ in the input pattern spaceR n . Then [8] shows that the regression backward pass over the network N tends to map the centroid of the pre-image in a classification-regression bidirectional network. The backward pass of a classifier network maps the class labels or output K-length probability vectors back to the input pattern space. The output space Y is the (K− 1)- dimensional simplex S K−1 of K-length probability vectors for a classifier network. This holds because the network maps vector pattern inputs to K softmax neurons vectors and uses 1-in-K encoding. So the backward pass for bidirectional representation maps any given y to a centroidal measure of its class. Let m th sample{x (m) , y (m) } be a pair of sample input and target vectors that belongs to the k th class C k . B-BP trains the network along the forward pass N(x (m) ) with y (m) as the target. The forward errorE f is the cross entropy betweenN(x (m) ) and y (m) . Backward-pass training can in principle use any x ′ . The simplest case uses the backward- emitted vector x ′ =N T (y) after the input x has emitted the output vector y in the forward direction: y =N(x). Simulations show that somewhat better performance tends to result if the backward pass replaces y with the supervised class label k or the unit vector e k if the stimulating input vector x belongs to the k th input class. The backward pass can also use the k th class population centroid c k of input pattern 57 class C k or its sample-mean estimate ˆ c k : ˆ c k = 1 N k X x (m) ∈C k x (m) (3.1) where N k is the number of input sample patterns x (m) from class C k . 
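A minimal NumPy sketch of the sample class centroid of Eq. (3.1) follows; the array shapes and toy data are illustrative assumptions, not the dissertation's preprocessing code.

```python
import numpy as np

def class_centroids(X, labels, num_classes):
    """Sample class centroids of Eq. (3.1): the mean input pattern of each class."""
    return np.stack([X[labels == k].mean(axis=0) for k in range(num_classes)])

# Toy example: 100 flattened 28x28 "images", 10 per class.
X = np.random.default_rng(0).random((100, 784))
labels = np.repeat(np.arange(10), 10)
centroids = class_centroids(X, labels, num_classes=10)   # shape (10, 784)
```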
Figure 3.3 shows the sample class centroids for the MNIST handwritten digit and the CIFAR-10 image datasets. Then the weak and strong laws of large numbers state that the sample centroid ˆ c k converges to the population or true class centroid c k in probability or with probability one if the sample vectors are random samples (independent and identically distributed) with respective finite covariances or finite means [115]. This ergodic result still holds in the mean-squared sense when the sample vectors are correlated if the random vectors are wide-sense stationary and if the finite scalar covariances are asymptotically zero [100]. This correlated result also holds in probability since convergence in mean-square implies convergence in probability. Using the centroid estimate ˆ c k makes sense in the backward direction because locally the population centroid c k minimizes the squared error of regression. It likewise minimizes the total mean-squared error of vector quantization [145]. The K-means clustering algorithm is the standard way to estimate K centroids from training data [128]. This algorithm is a form of competitive learning and sometimes called a self-organizing map [143]. K-means clustering enjoys a NEM noise-boost because the algorithm is itself a special case of the EM algorithm for tuning a Gaussian mixture model if the class membership functions are binary indicator functions [199], [201]. The projection ˜ x (m) of the sample class centroid ˆ c k to the training samples in C k obeys ˜ x k = arg min x (m) ∈C k ||x (m) − ˆ c k || 2 (3.2) where x (i) ∈C k . The sample centroid need not lie in the pattern class C k . We can compute the backward vector directly if it does by taking the difference between the backward pass x ′ and the sample class centroid. We otherwise take the difference between x ′ and that training vector in C k that lies closest to the sample class centroid. Then the backward error E b equals the squared error between x ′ =N T (y (m) ) and the projection ˜ x k . Figure 3.3 shows the respective class projections of the sample class centroids for the MNIST handwritten digit and the CIFAR-10 image datasets. 58 Algorithm 3.1: Bidirectional Backpropagation Training of a Neural Classifier Data: M input data vectors{x (1) ,..., x (M) }, M target label 1-in-K vectors {y (1) ,..., y (M) }, number of training epochs T, learning rate η, and α backward error scale. Result: Trained network parameter Θ BBP 1 • Initialize the network parameter Θ (0) ; 2 • Compute the class centroid c k for all the K classes using (3.1); 3 • Compute the projection ˜ x k of the class centroid c k for all the K classes using (3.2); while epoch t : 0→T− 1 do 4 while training sample index m : 1→M do 5 • Compute the forward inference: a yf(m) ←N(x (m) ) 6 • Compute the backward inference: a xb(m) ←N T (y (m) ) 7 • Determine the true class of x (m) : c m ← arg max k y (m) k 8 • Compute the forward error E f (Θ): E f (Θ)←−y (m) · log a yf(m) 9 • Compute the backward error E b (Θ): E b (Θ)← a xb(m) − ˜ x cm 2 10 • Back-propagate the error functions and update the network parameter: Θ (t+1) ← Θ (t) −η∇ Θ E f (Θ) +α E b (Θ) Θ=Θ (t) 11 Θ BBP ← Θ (T ) 3.2 Convolutional Neural Networks and Bidirectional Repre- sentation Convolutional neural networks are deep neural networks that use convolutional layers or masks to extract features. This type of neural network employs a specialized mathematical operation calledconvolution [69], [89]. Convolution is a formof transformation thatpreserves ordering. 
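As a brief aside before the convolutional case: the per-sample update of Algorithm 3.1 above combines a forward cross-entropy error with a backward squared error toward the class projection of Eq. (3.2). The PyTorch sketch below is a minimal single-hidden-layer illustration; the layer sizes, the scale α, and the choice of PyTorch itself are assumptions, not the dissertation's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
I, H, K = 784, 512, 10                      # input size, hidden width, number of classes
W1 = nn.Parameter(0.01 * torch.randn(H, I))
W2 = nn.Parameter(0.01 * torch.randn(K, H))
opt = torch.optim.SGD([W1, W2], lr=1e-3)
alpha = 1.0                                 # backward error scale

def N_forward(x):                           # N(x): ReLU hidden layer, softmax output
    return torch.softmax(W2 @ torch.relu(W1 @ x), dim=0)

def N_backward(y):                          # N^T(y): transposed weights, identity input layer
    return W1.t() @ torch.relu(W2.t() @ y)

# One B-BP step on a single (x, y) pair with class projection x_tilde from Eq. (3.2).
x, x_tilde = torch.rand(I), torch.rand(I)
y = torch.eye(K)[3]                         # 1-in-K target e_k

a_yf, a_xb = N_forward(x), N_backward(y)
E_f = -(y * torch.log(a_yf + 1e-12)).sum()  # forward cross-entropy error
E_b = ((a_xb - x_tilde) ** 2).sum()         # backward squared error
loss = E_f + alpha * E_b                    # joint B-BP error of Algorithm 3.1

opt.zero_grad()
loss.backward()
opt.step()
```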
CNN derives its origin from a biological experiment in 1959. The experiment found two types of cells in the visual primary cortex. The cells are simple cell (S-cell) and complex cell (C-cell) [121], [124]. The experiment also shows that a model made up of cascading C-cells and S-cells is useful for pattern recognition tasks [122], [123]. Fukushima 59 later extended this idea to neocognitron in the 1970s [77], [124] Neocognitron is a hierarchical multilayered neural network that is capable of robust visual pattern recognition [79], [81], [83], [84]. It uses S-cells to extract local features and the C-cells compensates for the deformation of features such as local shifts[26]. Neocognitron takes advantage of the local features integration in pattern recognition task [26], [82]. These local features propagate to the higher layer for the classification of patterns [80]. The idea of local feature integration inspired other models such as the convolutional network [165]. Convolutional networks have shared-weight architecture of convolutional kernels or filters that slide along input features and extract local features [173], [277]. The filters provide trans- lation equivariant responses known as feature maps [276]. The success of convolutional net- works in machine learning tasks stems from three concepts associated convolutional operation [89]. They are parameter sharing, sparse interaction, and equivariant representation. CNNs have become the standard feedforward neural network models for large-scale image recognition [50], [108], [157], [158], [162], [167], [168], [183], [239], [240], [282]. Other applica- tions include video analysis [21], [22], [42], [65], [130], [138], [230], [243], natural language processing [1]–[3], [53], [54], [93], [120], [137], [198], [205], [213], [231], [238], [242], time-series forecasting [32], [36], [55], anomaly detection [160], [219], [225], image super-resolution [67], [132], [140], [169], [233], [257], and so on. 3.2.1 Convolution Filters and Feedforward Networks Figure 3.4 shows the architecture of LeNet-5 [72] convolutional network. The network has two convolutional (conv.) layers and two fully connected (fc) layers. The convolutional layers of the convolutional network use kernels to extract output feature maps through discrete convolution given the input feature maps [69]. Pooling operation is another building block in convolutional networks. This operation reduces the size of the feature maps by summarizing the subregions with functions such as taking the average or maximum value [69]. Pooling is similar to discrete convolution in some sense. It is a form of convolution that replaces the linear combination due to convolutional kernels with some other function (average or maximum). The final layer of the network has 10 output neurons with softmax activation. The BP training of a convolutional network finds the kernels and synaptic weights that minimizes the loss or error function at the output layer. The convolutional network acts as feedforward network in this case. 60 3 32 32 Input 6 28 28 Conv. 1 6 14 14 Pool 1 16 10 10 Conv. 2 16 5 5 Pool 2 1 120 FC1 1 84 FC2 10 Softmax Figure 3.4: Convolution filters with feedforward networks: The input size is 32× 32× 3. The convolution layer Conv. 1 has 6 filters of size 3× 3 each. The convolution layer Conv. 2 has 16 filters of size 5× 5 each. 
A synaptic weight of size 400× 120 connects the pooling layer pool 2 to the fully connected layer FC1 and a weight of size 120× 84 connects FC1 to the other fully connected layer FC2. A synaptic weight of size 84× 10 connects FC2 to the output softmax layer. 3 32 32 Input 6 28 28 Conv.1 6 14 14 Resize 1 16 10 Conv.2 16 5 5 Resize 2 1 120 FC1 1 84 FC2 10 Softmax Figure 3.5: Bidirectional approximation with convolutional networks: The input size is 32× 32× 3. The convolution layer Conv. 1 is with 6 filters of size 3× 3. The convolution layer Conv. 2 is with 16 filters of size 5× 5. The resize 1 module connects the convolution layers. A synaptic weight of size 400× 120 connects resize 2 to the fully connected FC1 and a weight of size 120× 84 connects FC1 to the fully connected FC2. A synaptic weight of size 84× 10 connects FC2 to the output softmax layer. The resize modules run the pooling operation along the forward pass and run the padding operation along the backward pass. 61 1 1 0 3 0 2 1 0 0 1 0 0 0 1 1 1 2 1 2 1 0 2 1 2 3 I ⊛ 1 0 0 0 0 1 1 0 0 K = 1 2 3 4 I⊛K 1 0 0 0 0 1 1 1 0 ×1 ×0 ×0 ×0 ×0 ×1 ×1 ×1 ×0 Figure 3.6: Convolution operation of input image I with kernel K using 2× 2 stride without padding. The size of I is 5× 5 and the size of K is 3× 3. The output feature maps I⊛ K from this operation has a size 2× 2. 1 2 3 4 J ⊛ T 1 0 0 0 0 1 1 1 0 K = = 1 0 2 0 0 0 0 1 0 2 4 1 6 2 0 0 0 3 0 4 3 3 4 4 0 J⊛ T K 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 2 0 0 0 0 0 0 2 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 0 0 3 3 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 4 0 0 4 4 0 Figure 3.7: Transposed convolution operation on input J with kernel K using zero padding and stride 2× 2. The size of J is 2× 2 and the size of K is 3× 3. The feature maps J⊛ T K after the transpose convolution has a shape 5× 5. 1 3 2 3 0 4 3 1 0 2 3 1 2 0 3 2 Input Max. pooling = ===========⇒ stride : 2×2 pool window : 2×2 4 3 2 3 Output Figure 3.8: Pooling operation on 4× 4 input using a stride of 2× 2 and pooling window size of 2× 2. 3.2.2 Convolution Filters and Bidirectional Networks Figure 3.5 shows the architecture of a bidirectional convolutional networks for bidirectional approximation. The blue path represents the forward pass and the red path represents the backward pass. The convolutional layers use bidirectional kernels. The kernels at the convolutional layers acts as bidirectional feature extractors. The kernels run convolution 62 4 3 3 2 Input Padding = ==========⇒ window size: 2×2 4 0 3 0 0 0 0 0 3 0 2 0 0 0 0 0 Output Figure 3.9: Padding operation on a 2× 2 input using a window size of 2× 2. operation along the forward pass and transposed convolution operation along the backward pass. Transposed convolution is a transformation along the opposite direction of normal convolution. One might use such a transformation as the decoding layer of a convolutional auto-encoder or to project feature maps to a higher-dimensional space [69]. Figure 3.6 shows an example of a 2D convolution operation with a 5× 5 input and 3× 3 kernel. The stride for the convolution operation 2×2. Figure 3.7 shows an example of a 2D transposed convolution operation with a 2× 2 and 3× 3 kernel. The stride size 1 for the transposed convolution is also 2× 2. The resize component in Figure 3.5 replaces the pooling operation from the unidirectional case (see Figure 3.4). The resize component 2 runs pooling operation along the forward pass and padding (spatial interpolation) along the backward pass. 
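The bidirectional kernels and the resize module just described can be sketched with standard convolution primitives. The PyTorch snippet below is an illustration under stated assumptions (PyTorch itself, the channel counts, and the exact sizes are not given in the text): the same kernel K runs a strided convolution on the forward pass and a transposed convolution on the backward pass, while the resize module pools forward and zero-pads backward in the style of Figures 3.8 and 3.9.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
stride = 2                                   # footnote 1: same stride in both directions
K = torch.randn(6, 3, 3, 3)                  # 6 bidirectional filters, 3 input channels, 3x3 kernels

x = torch.randn(1, 3, 32, 32)
feat = F.conv2d(x, K, stride=stride)                   # forward: convolution -> (1, 6, 15, 15)
x_back = F.conv_transpose2d(feat, K, stride=stride)    # backward: transposed convolution -> (1, 3, 31, 31)
# Note: the reconstructed size can differ from the input size by the stride remainder.

# Resize module: pooling forward, zero padding (Figure 3.9 style) backward.
pooled = F.max_pool2d(feat, kernel_size=2, stride=2)   # forward: 2x2 max pooling
padded = torch.zeros(pooled.size(0), pooled.size(1),
                     2 * pooled.size(2), 2 * pooled.size(3))
padded[:, :, ::2, ::2] = pooled                        # backward: insert zeros between entries

print(feat.shape, x_back.shape, pooled.shape, padded.shape)
```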
Figure 3.8 shows an example of the pooling operation over the resize component. It takes an input of size 4× 4 using a stride of 2× 2 and pooling window size of 2× 2 for the pooling operation. Figure 3.9 shows the corresponding padding operation over the resize component along the opposite direction. 3.3 Bayesian Bidirectional Backpropagation Training This section defines the classifier-regressor probability structure for the Bayesian bidirectional backpropagation (B 3 ) training of the neural classifier in Figure 3.10. The B 3 algorithm finds the maximizing value Θ ∗ with gradient ascent on the given bidirectional posterior. . The classifier’s forward likelihood p f (y|x, Θ) is multinomial for output target vector y. Its backward likelihood p b (x|y, Θ) is vector normal or Gaussian. The backward prior h(Θ) 1 The stride size for convolutional and transpose convolution with bidirectional representation must be the same. 2 The pooling window size and the padding window size of the resize module must be the same. The stride must be the same along both directions. 63 Bayesian B-BP: Input Output Feedforward Feedback Feedforward Input layer Output layer Hidden layers Forward likelihood Backward likelihood Feedback Prior Classification Regression Figure 3.10: Training with Bayesian bidirectional backpropagation: The network maximizes the joint forward and backward posterior probability. The network diagram shows the simple but practical case of a Laplacian or Lasso-like prior on the input weights. The identity neurons at the input field give rise to a vector normal likelihood and thus a squared-error error function in the backward direction. The softmax neurons at the output field give rise to a multinomial likelihood and thus a cross-entropy error function in the forward direction. So the network acts as classifier in the forward direction and a regressor in the backward direction. is Laplacian or a Lasso-like l 1 penalty. Then Θ ∗ = arg max Θ p f (Θ|y, x) p b (Θ|x, y) (3.3) = arg max Θ logp f (y|x, Θ) + logp b (x|y, Θ) | {z } log-likelihoods + logh(Θ|x) + logh(Θ|y) | {z } log-priors (3.4) = arg max Θ logp f (y|x, Θ) + logp b (x|y, Θ) + logh(Θ) (3.5) = arg min Θ − logp f (y|x, Θ)− logp b (x|y, Θ)− logh(Θ) (3.6) = arg min Θ E f (Θ) +E b (Θ) +λ||Θ|| 1 +w (3.7) = arg min Θ E f (Θ) +E b (Θ) +λ||Θ|| 1 (3.8) 64 because − log p f (y|x, Θ) =− K X k=1 y k log a yf k =E f (Θ) (3.9) − log p b (x|y, Θ) = log 1 2π I 2 exp − ||a xb −x|| 2 2 2 (3.10) =− log(2π) I 2 + 1 2 ||a xb − x|| 2 2 (3.11) =u +E b (Θ) (3.12) − logh(Θ) =λ||Θ|| 1 +w (3.13) where u and w do not depend on Θ. This simplifies to Θ ∗ = arg min Θ E f (Θ) +E b (Θ) +λ||Θ|| 1 (3.14) = arg min Θ − y T log a yf + 1 2 ||a xb − x|| 2 2 +λ||Θ|| 1 (3.15) = arg min Θ − K X k=1 y k log a yf k + 1 2 I X i=1 |a xb i −x i | 2 +λ||Θ|| 1 . (3.16) Algorithm 3.2 lists the pseudocode for the Bayesian bidirectional backpropagation with a Laplacian or Lasso-like prior. The pseudocode describes the Bayesian error function E(Θ) =−y T log a yf +α||a xb − x|| 2 +λ||Θ|| 1 (3.17) for a classifier-regressor bidirectional network. The forward pass uses the one-shot multino- mial or categorical distribution. So its negative logarithm gives the error as a cross-entropy. The backward pass uses the multivariate Gaussian probability. So its error function is just the squared-error. The forward pass includes the regularizing Laplacian prior and its corresponding l 1 or absolute error. 
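Here is a compact PyTorch reading of the Bayesian error function E(Θ) in Eq. (3.17); the function name, the small logarithm offset, and the default α and λ values are assumptions added for illustration.

```python
import torch

def bayesian_bbp_error(a_yf, y, a_xb, x_tilde, params, alpha=1.0, lam=1e-4):
    """E(Theta) of Eq. (3.17): forward cross entropy + backward squared error
    + the Lasso-like l1 penalty that the Laplacian prior induces."""
    E_f = -(y * torch.log(a_yf + 1e-12)).sum()      # multinomial forward likelihood
    E_b = alpha * ((a_xb - x_tilde) ** 2).sum()     # Gaussian backward likelihood
    R = lam * sum(p.abs().sum() for p in params)    # Laplacian prior on the weights
    return E_f + E_b + R
```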
3.4 Simulation and Results This classification simulation trained and tested multilayer perceptron models and convo- lutional neural classifiers on the CIFAR-10. The section presents the results from using unidirectional BP, B-BP (using Algorithm 3.1), and Bayesian B-BP (using Algorithm 3.2). It shows that the Bayesian B-BP-trained classifiers outperformed B-BP and unidirectional BP. 65 Algorithm 3.2: Bayesian Bidirectional Backpropagation Training of a Neural Classifier Data: M input data vectors{x (1) ,..., x (M) }, M target label 1-in-K vectors {y (1) ,..., y (M) }, number of training epochs T, learning rate η, penalty parameter λ, and backward loss coefficient α. Result: Trained network parameter Θ BBP 1 • Initialize the network parameter Θ (0) ; 2 • Compute the class centroid c k for all the K classes using (3.1); 3 • Compute the projection ˜ x k of the class centroid c k for all the K classes using (3.2); while epoch t : 0→T− 1 do 4 while training sample index m : 1→M do 5 • Compute the forward inference: a yf(m) ←N(x (m) ) 6 • Compute the backward inference: a xb(m) ←N T (y (m) ) 7 • Determine the true class of x (m) : c m ← arg max k y (m) k 8 • Compute the forward error E f (Θ): E f (Θ)←−y (m) · log a yf(m) 9 • Compute the backward error E b (Θ): E b (Θ)←α a xb(m) − ˜ x cm 2 10 • Compute the penalty term R Θ =λ||Θ|| 1 11 • Back-propagate the error functions and update the network parameter: Θ (t+1) ← Θ (t) −η∇ Θ E f (Θ) +E b (Θ) +R Θ Θ=Θ (t) 12 Θ BBB ← Θ (T ) 3.4.1 Datasets This simulation used the CIFAR-10 image [156] and the extended MNIST balanced image[25], [51] datasets. The CIFAR-10 image set is a collection of images from 10 classes (K = 10). The 10 pattern classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. This dataset is balanced with 6,000 color images per class. Each class consists of 5,000 training samples and 1,000 testing samples. The size of each image is 32× 32× 3. The EMNIST balanced dataset is the extension of the MNIST handwritten digits. This 66 Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (a) CIFAR-10 dataset (b) EMNIST balanced dataset Figure 3.11: Datasets for training neural classifiers with B-BP: These are samples images from the simulation datasets for training neural classifiers with bidirectional BP under subsection 3.4.3. (a) The sample images from the CIFAR-10 dataset The 10 samples show one image per class. (b) The sample images from the EMNIST balanced dataset. The 47 samples show one image per class. extension is made up of 47 balanced classes (K = 47) with 10 digit classes and 37 letter classes. The letter classes merges certain uppercase and lowercase characters. The dataset is balanced because it contains equal number of samples per class. Each class is made up 2,400 training samples and 400 testing samples. The size of each image is 28×28×1. Figure 3.11 shows sample images from the CIFAR-10 image and the EMNIST balanced image datasets. 3.4.2 Network Description The deep neural classifiers trained on the CIFAR-10 and EMNIST balanced image datasets. The multilayer perceptron classifiers used identity neurons in the input layer and 512 ReLU neurons in each hidden layer. They used either 10 or 47 softmax neurons in the output layer depending on the dataset. We trained some of the classifiers with ordinary unidirectional BP as a baseline. We trained the other classifiers with B-BP either without or with a Bayesian prior. A dropout value of 0.1 for the hidden-layers reduced overfitting in the models. 
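A minimal PyTorch sketch of the multilayer perceptron classifiers just described (identity inputs, 512-ReLU hidden layers, dropout 0.1, softmax outputs) is given below. The constructor name and the use of nn.Sequential are assumptions; a practical implementation might also keep logits and use a cross-entropy loss rather than an explicit softmax layer.

```python
import torch.nn as nn

def make_classifier(in_dim=3072, num_hidden=5, width=512, num_classes=10, p_drop=0.1):
    """MLP classifier: identity inputs, ReLU hidden layers with dropout, softmax output."""
    layers, d = [], in_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(d, width), nn.ReLU(), nn.Dropout(p_drop)]
        d = width
    layers += [nn.Linear(d, num_classes), nn.Softmax(dim=-1)]
    return nn.Sequential(*layers)

net = make_classifier(num_hidden=5, num_classes=10)   # e.g. the 5-hidden-layer CIFAR-10 model
```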
67 3 32 32 Input 16 32 32 Conv. 1 16 16 16 Resize 1 32 16 16 Conv. 2 32 8 8 Resize 2 64 8 8 Conv. 3 64 4 4 Resize 3 1024 FC1 500 FC2 10 Softmax (a) Convolutional neural classifier: Model 1 3 32 32 Input 32 32 32 Conv.1 32 16 16 Resize1 64 16 16 Conv.2 128 16 16 Conv.3 128 8 8 Resize2 256 8 8 Conv. 4 512 8 8 Conv. 5 512 4 4 Resize3 8192 FC1 4096 FC2 4096 FC3 1000 FC4 10 Softmax (b) Convolutional neural classifier: Model 2 Figure 3.12: The architecture of two convolutional neural classifiers: This shows the architecture of the CNN models under this section. The input layer used identity activation, the hidden neurons (convolution and fully connected layers) used rectified linear units (ReLUs), and the output layer used softmax activation. (a) CNN model 1 used three convolution layers and two fully connected layers. (b) CNN model 2 used five convolution layers and four fully connected layers. This simulation also considered two convolutional neural classifiers. The first convolu- tional neural classifier (CNN model 1) used three convolutional layers with 16 filters, 32 filters, and 64 filters. This model used two fully connected layers with 1024 neurons and 500 neurons. Figure 3.12a shows the architecture of CNN model 1. The second convolutional neural classifier (CNN model 2) used five convolutional layers with 32 filters, 64 filters, 128 filters, 256 filters, and 512 filters. This model used four fully connected layers with 8192 68 neurons, 4096 neurons, 4096 neurons, and 1000 neurons. Figure 3.12b shows the architecture of CNN model 2. The input layer used identity neurons and the hidden layers (including the convolutional and fully connected layers) used rectified linear units (ReLUs), and the output layer used softmax activation in both cases. 3.4.3 Results and Discussion Table 3.1: Classification accuracy: Training deep neural classifiers with unidirectional BP, bidirec- tional BP, and Bayesian B-BP algorithms on CIFAR-10 image dataset No. of Hidden Layers Classification accuracy Unidirectional BP B-BP Bayesian B-BP 3 layers 57.26% 57.65% 57.97% 5 layers 57.40% 58.24% 58.43% 7 layers 57.06% 57.91% 58.42% 13 layers 55.21% 56.66% 57.80% Table 3.2: Training with raw images versus PixelHop+[43] features: The PixelHop++ features further improved the performance of deep neural classifiers that trained on CIFAR-10 image dataset with B-BP and Bayesian B-BP. These classifiers used non-convolution hidden layers. Training No. of Hidden Layers Classification accuracy Raw Image PixelHop++ Unidirectional BP 3 layers 57.65% 64.11% 5 layers 58.24% 64.79% 7 layers 57.91% 63.95% 13 layers 56.66% 64.34% Bidirectional BP 3 layers 57.97% 65.12% 5 layers 58.43% 64.90% 7 layers 58.42% 64.40% 13 layers 57.80% 64.29% Table 3.1 shows that the B-BP training outperformed the unidirectional BP training and the Bayesian B-BP performed best. This trend also holds with the convolutional neural classifiers. Table 3.2 shows that the best model came from the Bayesian B-BP training. 
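For concreteness, a sketch of CNN model 1 from Figure 3.12a appears below. The 3×3 same-padded kernels and the exact wiring of the fully connected block are assumptions, since the text gives only the filter counts and layer widths; the pooling layers stand in for the forward direction of the resize modules.

```python
import torch.nn as nn

# CNN model 1 (Figure 3.12a): three conv layers with 16, 32, and 64 filters,
# 2x2 resizes, then fully connected layers of 1024 and 500 neurons.
cnn_model_1 = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 16 x 16 x 16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 x 8 x 8
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64 x 4 x 4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 1024), nn.ReLU(),
    nn.Linear(1024, 500), nn.ReLU(),
    nn.Linear(500, 10), nn.Softmax(dim=-1),
)
```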
69 Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (a) Target: Centroidal images Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (b) Backward pass: Unidirectional BP Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (c) Backward pass: Bidirectional BP Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (d) Backward pass: Bayesian Bidirectional BP Figure 3.13: Backward-pass recall in deep classifiers trained on the CIFAR-10 image dataset: The unidirectionally trained classifier produced only visual noise at the input layer on the backward pass while the bidirectionally trained classifier produced a good centroidal estimate of the pattern class. The classifiers used 7 hidden layers with 512 ReLU neurons in each hidden layer and 10 output softmax neurons. The 3072 input neurons had identity activations and so defined a regression layer on the backward pass. Unidirectional backpropagation caused overwriting in the backward direction during training while bidirectional backpropagation did not. Bidirectional training and its Bayesian form had better classification accuracy. Table 3.3: Classification accuracy: Training deep convolutional neural classifiers with unidirectional BP, bidirectional BP, and Bayesian B-BP algorithms on CIFAR-10 image dataset Model Classification accuracy Unidirectional BP B-BP Bayesian B-BP CNN model 1 68.40% 69.28% 71.48% CNN model 2 81.62% 82.31% 83.19% This simulation compared the effect of PixelHop++ features on the performance of the B-BP methods. PixelHop++ [43] is a form of successive subspace feature transformation that extracts label-guided features. This method runs a form of principal component analysis (PCA) that finds the best basis vectors for representing the feature . This form of PCA uses the information from the label or target to guide the choice of basis vectors. Table 3.2 shows that the Bayesian B-BP outperformed the ordinary B-BP. PixelHop++ 70 Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (a) Target: Centroidal images Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (b) Backward pass: Unidirectional BP Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (c) Backward pass: Bidirectional BP Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (d) Backward pass: Bayesian Bidirectional BP Figure 3.14: Backward-pass recall with deep convolutional classifiers trained on the CIFAR-10 image dataset: The unidirectionally trained classifier produced only visual noise at the input layer on the backward pass while the bidirectionally trained classifier produced a good centroidal estimate of the pattern class. The classifiers used 5 convolutional layers with 3 fully connected layers. The convolutional layers used 32, 64, 128, 256, and 512 convolution filters. The fully connected layers used 4096 neurons, 4096 neurons, and 1000 neurons. The network used 10 output softmax neurons. The 3072 (32×32×3) input neurons had identity activations and so defined a regression layer on the backward pass. Unidirectional backpropagation caused overwriting in the backward direction during training while bidirectional backpropagation did not. Bidirectional and Bayesian training also had better classification accuracy. further improved the performance of the two B-BP methods. Table 3.3 shows that the Bayesian B-BP also outperformed the ordinary B-BP and unidirectional BP with convo- lutional neural classifiers. Figure 3.14 shows the backward of the convolutional neural classifiers. 
The B-BP and Bayesian B-BP endowed neural classifiers with bidirectional representation. Figure 3.13 shows that the backward pass with B-BP and Bayesian B-BP recalled the centroidal images that stored along the backward path during the training phase. The unidirectional BP failed to recall any useful information along the backward path. The Bayesian B-BP benefit also held for the classification of the EMNIST balanced dataset. Table 3.4 shows the benefit of the Bayesian B-BP with the multilayered perceptron classifiers that trained on the EMNIST balanced image dataset. Figure 3.15 compares the recall of these classifiers along the backward pass. Table 3.5 shows the benefit of the PixelHop++ feature extraction with the B-BP training. 71 (a) Target: Centroidal images (b) Backward pass: Unidirectional BP (c) Backward pass: Bidirectional BP (d) Backward pass: Bayesian Bidirectional BP Figure 3.15: Backward-pass recall in deep classifiers trained on the EMNIST-47 image dataset: The unidirectionally trained classifier produced only visual noise at the input layer on the backward pass while the bidirectionally trained classifier produced a good centroidal estimate of the pattern class. The classifiers used 5 hidden layers with 512 ReLU neurons in each hidden layer and 47 output softmax neurons. The 784 input neurons had identity activations and so defined a regression layer on the backward pass. Unidirectional backpropagation caused overwriting in the backward direction during training while bidirectional backpropagation did not. Bidirectional and Bayesian training also had better classification accuracy. 72 Table 3.4: Classification accuracy: Training deep neural classifiers with unidirectional BP, bidirec- tional BP, and Bayesian B-BP algorithms on the EMNIST balanced image dataset. These classifiers used non-convolution hidden layers. No. of Hidden Layers Classification accuracy Unidirectional BP B-BP Bayesian B-BP 3 layers 85.89% 86.06% 86.23% 5 layers 84.93% 85.23% 85.92% 7 layers 84.49% 85.11% 85.58% 11 layers 82.65% 84.61% 84.82% Table 3.5: Training with raw images versus PixelHop+[43] features: The PixelHop++ features further improved the performance of deep neural classifiers that trained on EMNIST balanced image dataset with B-BP and Bayesian B-BP. These classifiers used non-convolution hidden layers. Training No. of Hidden Layers Classification accuracy Raw Image PixelHop++ Unidirectional BP 3 layers 85.89% 86.13% 5 layers 84.93% 85.87% 7 layers 84.49% 85.49% 11 layers 82.65% 85.35% Bidirectional BP 3 layers 86.06% 86.64% 5 layers 85.23% 86.51% 7 layers 85.11% 86.17% 11 layers 84.61% 85.57% 3.5 Conclusion The new bidirectional backpropagation endows a neural classifier with bidirectional rep- resentation. This is not the case with the ordinary unidirectional backpropagation. The Bayesian extension of the bidirectional backpropgation gave the best classifier performance compared with ordinary bidirectional backpropagation and unidirectional backpropagation. The features extracted with PixelHop++ further improved the classifier performance of these bidirectional methods with multilayer perceptron models. The bidirectional benefits also extended to convolutional neural classifiers. The introduction of a prior term extend the bidirectional backpropagation to the Bayesian 73 bidirectional backpropagation. This methods applies to training neural classifiers. The Laplacian prior gave the best classifier performance compared with Gaussian and uniform prior. 
74 Chapter 4 Bidirectional Backpropagation Application: Generative Adversarial Networks (GANs) Bidirectional backpropagation (B-BP) can train generative adversarial networks (GANs) and improve their performance compared with training with ordinary unidirectional backpropa- gation. The best B-BP training for GANs ran the discriminator network as a bidirectional network and left the generator network as a feedforward network. B-BP training improved the performance of GANs and their inception metrics. B-BP training also tended to remove the stuttering-type problem of “mode collapse” from vanilla GANs and approximated the performance of a Wasserstein GAN. B-BP training further improved the training of Wasserstein GANs. The next sections show how B-BP applies to GANs and how it performs on standard image test sets. 4.1 A Review of Generative Adversarial Networks An adversarial network consists of two or more neural networks that try to trick each other. They use feedback among the neural networks and sometimes within neural networks. The standard generative adversarial network (GAN) consists of two competing neural networks. One network generates patterns to trick or fool the other network that tries to tell whether a generated pattern is real or fake. The generator network G acts as a type of art forger while the discriminator network D acts as a type of art expert or detective. There exists different variants of GAN. These variants result from modifying the vanilla (original) GAN by Goodfellow et al. [90]. This type of modification includes one or more 75 of the following: redefining the structure of the sub-networks, recasting the probabilistic description or similarity measure, and changing the training method. Some of the variants include deep convolutional GAN [215], Wasserstein GAN [103], conditional GAN [190], super-resolution GAN [169], and generative recurrent adversarial network [125]. Applications of GANs include video generation [246], [250], [253], text-to-image [60], [274], image super-resolution [169], [257], speech conversion [119], person re-identification [176], [212], object transfiguration [267], [279], object detection [71], [131], [171], [172], music generation [101], [192], [272], medical image segmentation [59], [269], [270], image translation [126], [141], [280], and video super-resolution [47], [195]. 4.1.1 Architecture of Generative Adversarial Networks The structure of GANs changes between the training phase and the inference phase. The inference phase involves using the G to transform the latent variable z. We do not need the D during this phase The training phase involves adjusting the parameter of the sub-networks (D and G) such that the generated samples G(z) and the real samples x are similar. The two sub-networks run during the training phase. Figure 4.1 shows the full architecture of a GAN. The G and D are simple multilayer feedforward networks with no convolution layer in this case. The sub-network G converts the latent variable z to generated samplesG(z). The sub-networkD measures the similarity between G(z) andx. The D only works during the training phase while G works during the training phase and the inference phase. The D measures how well G is doing and then uses the corresponding error function or similarity measure to update the weights. 4.1.2 Training Generative Adversarial Networks GAN training helps the discriminator D get better at detecting real data from fake or synthetic data. 
It trains the generator G so that D finds it harder to distinguish the real data from G’s fake or generated data. Goodfellow et al.[90] showed that some forms of this adversarial process reduce to a two-person minimax game between D and G in terms of a value function V (G,D) min G max D V (D,G) =E x∼pr (x) [log D(x)] +E z∼pz (z) [log (1−D(G(z))] (4.1) if p r is the probability density of the real data x and if p z is the probability density of the latent or hidden variable z. GAN training alternates between training D and G. It trains D for a constant G and then trainsG for a constantD. The probabilityp g is the probability density of the generated dataG(z) thatG produces. Optimizing the value function in (4.1) is the same as minimizing 76 Goal: ~ Real data Generator Discriminator Real samples Generated samples Generated samples Latent variable Real samples Figure 4.1: Architecture of a generative adversarial network: A generative adversarial network (GAN) is made up of two subnetworks namely generator G and discriminator D. G transforms the latent variable z into generated samples G(z). D tries to separate G(z) from the real samples x. Training of GANs means finding the best network parameter that makes it hard to distinguish the real samples from the generated samples. the Jensen-Shannon (JS) divergence betweenp r andp g [90]. This is the same as maximizing the product of the likelihood functions p(y = 1|x) and p(y = 0|G(z)) with respect to D and minimizing the likelihood function p(y = 0|G(z)) with respect to G. The likelihood functions are Bernoulli probability densities. Minimizing the JS divergence can lead to a vanishing gradient as the discriminator D gets better or confident at distinguishing real data from fake data [18]. The JS divergence can also correlate poorly with the quality of the generated samples [17]. It also often leads to mode collapse: Training converges to the generation of a single image [187]. The vanilla GAN trains to minimize the JS distance We found that bidirectionality greatly reduced the instances and severity of mode collapse in regular vanilla GANs. Arjovsky [17] proposed the Wasserstein GAN (WGAN) that minimizes the Earth-Mover distance between p r and p g . The Kanotorovich-Rubinstein duality [248] shows that we can define the Earth-Mover distance as W (p r , p g ) = sup ∥f∥ L≤1 E x∼pr [f(x)]−E ˜ x∼pg [f(˜ x)] (4.2) where the supremum runs over the set of all 1-Lipschitz functionsf :χ→ R. Then Gulrajani [102] proposed adding a gradient penalty term that enforces the Lipschitz constraint. This 77 gives the modified value function L(p g ,p r ) as L(p r , p g ) =E x∼pr [D(x)]−E z∼pz [D(G(z))] +λE ˆ x∼p ˆ x [(∥∇ ˆ x D(ˆ x)∥ 2 −1) 2 ] (4.3) where ˆ x is a convex sum of x∼p r and G(z) with z∼p z . This approach performed better than using weight clipping [18] to enforce the Lipschitz constraint. Salimans [226] proposed heuristics such as feature matching, minibatch discrimination, and batch normalization to improve the performance and convergence of GAN training. Algorithm 4.1 uses (4.3) for the B-BP training of a WGAN. 4.2 Generative Adversarial Network and Bidirectional Back- propagation Bidirectional GAN training uses a new backward training pass for the discriminator D network. The bidirectional training uses the appropriate joint forward-backward performance measure so that training in one direction does not overwrite training in the reverse direction. 
Such training does not apply to the usual unidirectional backpropagation training of the generatorG as we explain below. The forward pass of the discriminatorD tries to distinguish the pattern x from the generated pattern G(z). The backward pass helps the set-inverse map D T better distinguish the two. We cast the backward pass as multidimensional regression. Then the backward-pass training becomes maximum-likelihood estimation for a conditional Gaussian probability density. This holds because maximizing the likelihood L b of the backward pass minimizes the backward error E b when E b is squared error [8], [20], [200]. B-BP did not train the generator G because we found that B-BP did not outperform unidirectional BP for this task. We show briefly that this holds because of a constraint inequality: Training the generatorG on the backward pass to maximize a likelihood function L G b can only harm the GAN’s performance on average. The argument for this negative result is that B-BP constrains the training of G more than the unidirectional BP does. GAN training seeks solutions G ∗ and D ∗ that optimize the value function V (D,G) that measures the similarity between the real samples and the fake or generated samples: min G max D V (D,G). (4.4) Consider the optimal solution D ∗ G for a given G: D ∗ G = arg max D V (D,G) (4.5) 78 and define C(G) =V (D ∗ G , G). The best generator G ∗ for the value function obeys G ∗ = arg min G C(G). (4.6) The definition of arg min implies that C(G ∗ )≤C(G) (4.7) for all G. Let G ∗ b be that constrained G that further partially or totally maximizes the backward-pass likelihood L G b . Then C(G ∗ )≤C(G ∗ b ) (4.8) since the additional B-BP constraint only shrinks the search space for the minimum. 4.2.1 Training Vanilla GANs with Bidirectional Backpropagation Each bidirectional training iteration of a vanilla GAN involves three steps. The first step trains the discriminator D to maximize the backward-pass likelihood L D b . This likelihood measures the mismatch between the real samples and the generated samples on the backward pass. The likelihood was a multivariate Gaussian probability density function. The backward error E D b measures the mismatch between a generated sample G(z (m) ) and a real or authentic sample x (m) : E D b = G(z (m) )−D T (a (m) r ) 2 + x (m) −D T (a (m) g ) 2 (4.9) where a (m) r =D(x (m) ) and a (m) g =D(G(z (m) )). The term D T denotes the backward pass over the discriminator network D. The second step maximizes the forward-pass likelihood L D f for the discriminatorD. This maximization is the same as minimizing the forward-pass error E D f because the error just equals the negative log-likelihood: E D f =− log D(x (m) ) + log 1−G(D(z (m) )) . (4.10) The forward-pass error is the cross entropy because the network acts as binary classifier with a Bernoulli probability density since the samples are either real or fake. The third step trains the generator network G. The goal is to make G(z) resemblex. So 79 Algorithm 4.1: Bidirectional Backpropagation Training of a Vanilla GAN Data: M real samples x (1) ,..., x (M) , number of training epochs T, start point for backward training t ∗ such that (t ∗ <T ), learning rate η, and the parameter for the probability distribution p z (z) of the latent variable. 
Result: Trained parameter of the subnetworks Θ BP g and Θ BBP d 1 • Initialize the parameter of the subnetworks Θ (0) g and Θ (0) d ; 2 while epoch t : 0→T− 1 do 3 while training sample index m : 1→M do 4 • Pick a sample z (m) at random from p z (z): 5 if epoch t≥t ∗ then 6 • Compute the backward error E D b using (4.9); 7 • Update the weights of D for the backward pass with gradient descent: ˜ Θ (t) d = Θ (t) d −η∇ Θ d E D b Θ d =Θ (t) d 8 else 9 • Leave the weights of D unchanged: ˜ Θ (t) d = Θ (t) d 10 • Compute the forward error E D f using (4.10); 11 • Update the weights of D for the forward pass with gradient descent: Θ (t+1) d = ˜ Θ (t) d −η∇ Θ d E D f Θ d = ˜ Θ (t) d • Compute the error E G using (4.11); 12 • Update the weights of G with gradient descent: Θ (t+1) g = Θ (t) g −η∇ Θg E G Θg =Θ (t) g 13 Θ BBP d ← Θ (T ) d ; 14 Θ BP g ← Θ (T ) g this step trains G to minimize the error function E G : E G =− log D(G(z (m) )). (4.11) E G measures the mismatch between the real and generated samples on the forward pass. Algorithm 4.1 lists the steps for the B-BP training of a vanilla GAN. 80 4.2.2 Training Wasserstein GANs with Bidirectional Backpropagation The B-BP algorithm extends to training a Wasserstein GAN. The measure of similarity is now the earth-mover distance betweenp r andp g in (4.3). The algorithm involves three steps for each training iteration. The first two steps train D with B-BP. The third step trains G with unidirectional BP. Algorithm 4.2 lists the steps for the B-BP training of a Wasserstein GAN. 4.3 Simulation and Results This section presents the simulation results that show the benefit of training GANs with the B-BP algorithm. The simulation compared the performance of Algorithms 4.1 and 4.2 with their respective unidirectional BP versions. The B-BP-trained GANs outperformed the unidirectional BP-trained GANs. The B-BP training works best if it only applies to the D and not the G. This simulation tested this new training approach on the vanilla GAN [90], Wasserstein GAN [103], and deep convolutional GAN (DCGAN) [215]. 4.3.1 Datasets This simulation used the MNIST handwritten digit [168] and CIFAR-10 image [156] data sets. The MNIST digit dataset is a database of handwritten digits from 10 classes. These classes are the 10 numerical digits 0 through 9. Each image has a size 28× 28× 1. This dataset is balanced with 7,000 samples per class. Each class consists of 6,000 training samples and 1,000 testing samples. The CIFAR-10 image set is a collection of images from 10 classes. The 10 pattern classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. This dataset is balanced with 6,000 color images per class. Each class consists of 5,000 training samples and 1,000 testing samples. The size of each image is 32× 32× 3. Figure 4.2 shows sample images from the MNIST digit and CIFAR-10 image data sets. 4.3.2 Performance metrics for GAN models This simulation used two metrics namely the inceptionscoreIS(G) [27], [227] andFr´ echet inceptiondistanceFID [110], [182] to evaluate the performance of the trained GAN models. 
IS(G) is based on the exponentiated average Kullback-Leibler divergence: IS(G) = exp E x∼pg D KL p(y|x)∥p(y) (4.12) 81 Algorithm 4.2: Bidirectional Backpropagation Training of a Wasserstein GAN Data: M real samples x (1) ,..., x (M) , number of training epochs T, start point for backward training t ∗ such that (t ∗ <T ), learning rate η, gradient penalty coefficient λ, and the parameter for the probability distribution p z (z) of the latent variable. Result: Trained parameter of the subnetworks Θ BP g and Θ BBP d 1 • Initialize the parameter of the subnetworks Θ (0) g and Θ (0) d ; 2 while epoch t : 0→T− 1 do 3 while training sample index m : 1→M do 4 • Pick a sample z (m) at random from p z (z); 5 • Select a sample ϵ from the uniform distribution U[0, 1]; 6 • Compute the perturbation ˆ x (m) of x (m) : ˆ x (m) =ϵx (m) + (1−ϵ)G(z (m) ) 7 if epoch t≥t ∗ then 8 • Compute the backward error E D b using (4.9); 9 • Update the weights of D for the backward pass with gradient descent: ˜ Θ (t) d = Θ (t) d −η∇ Θ d E D b Θ d =Θ (t) d 10 else 11 • Leave the weights of D unchanged: ˜ Θ (t) d = Θ (t) d 12 • Compute the forward error E D f =L(p r ,p g ) using (4.3); 13 • Update the weights of D for the forward pass with gradient descent: Θ (t+1) d = ˜ Θ (t) d −η∇ Θ d E D f Θ d = ˜ Θ (t) d • Compute the error E G using (4.11); 14 • Update the weights of G with gradient descent: Θ (t) g = Θ (t) g −η∇ Θg E G Θg =Θ (t) g 15 Θ BBP d ← Θ (T ) d ; 16 Θ BP g ← Θ (T ) g 82 Digit: 0 Digit 1 Digit: 2 Digit: 3 Digit: 4 Digit: 5 Digit: 6 Digit: 7 Digit: 8 Digit: 9 (a) MNIST Handwritten Digit Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck (b) CIFAR-10 images Figure 4.2: Datasets for GAN simulation: These are samples images from the simulation datasets for training GAN models under § 4.3. (a) The sample images from the MNIST handwritten digit dataset. The 10 samples show one image per class. (b) The sample images from the CIFAR-10 image dataset. The 10 samples show one image per class. for the generatorG. The Kullback-Leibler divergence or relative entropyD KL p(y|x)∥p(y) measures the pseudo-distance between the conditional probability density p(y|x) and the corresponding unconditional marginal density p(y). A higher inception score IS(G) tends to describe generated images of better quality. So a larger inception score corresponds to better GAN performance. FID metric is the square of the Wasserstein distance between two multidimensional Gaussian distributions. This involves using the pool-3 layer of an Inception-v3 [240] model that pre-trained on the ImageNet [224] dataset to extract features F r from the real samples and F g from the generated samples. The dimension of the extracted pool-3 features is 2,048. The FID computes as follows: FID =||µ r −µ g || 2 + Tr(Σ r + Σ g − 2(Σ r Σ g ) 1 2 ) (4.13) with the assumption that F r ∼N (µ r , Σ r ) and F g ∼N (µ g , Σ g ). A lower FID corresponds to better similarity between the real and generated samples. 4.3.3 Training Vanilla GANs with Bidirectional Backpropagation This subsection describes the simulation results for the B-BP training of the vanilla GAN and its convolutional variant DCGAN. The vanilla GAN’s generator G and discriminator D neural networks were multilayered perceptrons. The generator network G mapped the space of the latent variable z to the space of generated samples. The discriminator network D mapped these samples to a single output logistic neuron. The vanilla GANs in this simulation only trained on the MNIST digit dataset. 
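As a side note on the metrics of the previous subsection, the FID of Eq. (4.13) reduces to a few matrix operations once the pool-3 features are extracted. The sketch below assumes NumPy/SciPy and pre-extracted feature matrices; it is not the evaluation code used in the dissertation, and the toy feature dimensions are arbitrary.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(F_r, F_g):
    """Frechet inception distance of Eq. (4.13): rows of F_r and F_g are
    pool-3 feature vectors of real and generated samples."""
    mu_r, mu_g = F_r.mean(axis=0), F_g.mean(axis=0)
    S_r = np.cov(F_r, rowvar=False)
    S_g = np.cov(F_g, rowvar=False)
    covmean = sqrtm(S_r @ S_g).real            # drop tiny imaginary round-off
    return np.sum((mu_r - mu_g) ** 2) + np.trace(S_r + S_g - 2.0 * covmean)

rng = np.random.default_rng(0)
print(fid(rng.standard_normal((256, 64)), rng.standard_normal((256, 64))))
```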
The generator network G used 100 input neurons with identity-function activations. It had a single hidden layer with 128 neurons with logistic sigmoid activations. The output layer contained 784 logistic neurons. The latent variable z modeled as a bipolar uniform 83 0 20 40 60 80 100 Training epoch 1 2 3 4 5 6 7 Inception score (IS) Unidirectional BP B-BP (D only) B-BP (G only) B-BP (D & G) (a) Vanilla GAN: Inception Score 0 20 40 60 80 100 Training epoch 0 20 40 60 80 100 Fr´echet Inception Distance Unidirectional BP B-BP (D only) B-BP (G only) B-BP (D & G) (b) Vanilla GAN: Fr´ echet Inception Distance (c) Vanilla GAN: Unidirectional BP (d) Vanilla GAN: Bidirectional BP Figure 4.3: Bidirectional BP reduces mode collapse: Bidirectional BP improves the quality of the generated images with vanilla GAN. The vanilla GANs trained over 100 epochs with 100,000 iterations per epoch on the MNIST handwritten digit dataset. The unidirectional BP-trained vanilla GANs suffered from mode collapse problem. This is a situation where the model only generates samples from a single class. Training the D with the bidirectional BP algorithm and G with the unidirectional BP performed best. (a) The inception scores for generated samples from vanilla GANs. The mode collapse problem caused the IS of the unidirectional BP-trained vanilla GAN to drop after some epochs. (b) The Fr´ echet inception distances for the samples from vanilla GANs. The mode collapse problem caused the FID of the unidirectional BP-trained vanilla GAN to increase after some epochs. (c) The generated samples from the vanilla GAN trained with the unidirectional BP. These samples reflect the mode collapse problem. (d) The generated samples from the vanilla GAN trained with the B-BP from Algorithm 4.1 with the bidirectional start point t ∗ = 10. random variable z∼U[−1, 1]. The discriminator network D used 784 input neurons with identity-function activations. 84 Table 4.1: This compares the inception score IS(G) and Fr´ echet inception distance FID of vanilla GANs with B-BP-trained generator G and / or discriminator G. The GANs trained on the MNIST handwritten digit dataset. The combination of a unidirectional BP-trained G and a B-BP-trained D performed best (Algorithm 4.1). Training method IS(G) FID Unidirectional BP Bidirectional BP D and G — 3.918 30.777 G D 6.978 6.883 D G 1.000 105.705 — D and G 5.136 18.178 Table 4.2: This shows the effect of the bidirectional start point t ∗ on the inception score IS(G) and Fr´ echet inception distance FID of bidirectional BP-trained vanilla GANs using Algorithm 4.1. The GANs trained on the MNIST handwritten digit dataset over 100 epochs. Bidirectional start t ∗ IS(G) FID 0 6.592 8.590 10 6.978 6.883 20 6.584 8.573 50 6.723 8.086 100 3.918 30.777 It had a single hidden layer of 128 logistic neurons. The discriminator had a single logistic neuron in its output layer The B-BP training runs the D as a bidirectional network. Let W d 1 denote the weight matrix that connects the input layer of D to its hidden layer. Let W d 2 connect the hidden layer of D to its output layer. The forward pass D mapped samples from p r andp g through W d 1 and W d 2 to the output layer. The backward pass D T mapped samples from the unit- interval space [0,1] through the transpose of W d 2 and the transpose of W d 1 to the input space. Using the matrix transpose this way for the backward pass defines a bidirectional associative memory or BAM [146], [148]. 
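The discriminator's bidirectional structure just described is small enough to write out directly. The NumPy sketch below (initialization and variable names are illustrative assumptions) runs D forward through W_d1 and W_d2, runs the BAM backward pass through their transposes, and forms the backward error of Eq. (4.9).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Discriminator D: 784 identity inputs -> 128 logistic hidden -> 1 logistic output.
W_d1 = 0.01 * rng.standard_normal((128, 784))
W_d2 = 0.01 * rng.standard_normal((1, 128))

def D(x):                      # forward pass through W_d1 then W_d2
    return sigmoid(W_d2 @ sigmoid(W_d1 @ x))

def D_T(a):                    # backward BAM pass through the transposed matrices
    return W_d1.T @ sigmoid(W_d2.T @ a)

x = rng.random(784)            # a real MNIST-sized sample
g = rng.random(784)            # a generated sample G(z)
a_r, a_g = D(x), D(g)

# Backward discriminator error of Eq. (4.9): swap the targets of the two passes.
E_D_b = np.sum((g - D_T(a_r)) ** 2) + np.sum((x - D_T(a_g)) ** 2)
print(a_r.item(), E_D_b)
```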
Every real rectangular matrix is bidirectionally stable for most known nonlinear neural units [148]. The vanilla GANs trained for over 100 epochs with 10,000 iterations per epoch. Figure 4.3 shows that B-BP significantly improved the performance of the vanilla GAN compared with ordinary BP training. B-BP training gave a 20.3% increase in the inception score and a 21.7% decrease in the Fr´ echet inception distance for produced samples with B-BP compared with produced samples from unidirectional BP training. Figure 4.3 shows some of the generated samples from the vanilla GAN after training on MNIST data with unidirectional BP and B-BP. It shows that B-BP substantially reduced mode collapse in the simulations. 85 Table 4.3: This compares the inception scoreIS(G) and Fr´ echet inception distanceFID of DCGANs with B-BP-trained generatorG and / or discriminatorG. The GANs trained on the CIFAR-10 image dataset. The combination of a unidirectional BP-trained G and a B-BP-trained D (Algorithm 4.1) performed best. Training method IS(G) FID Unidirectional BP Bidirectional BP D and G — 6.139 15.354 G D 6.481 11.539 D G 6.042 13.485 — D and G 6.217 11.208 Table 4.4: This shows the effect of the bidirectional start point t ∗ on the inception score IS(G) and Fr´ echet inception distance FID of bidirectional BP-trained DCGANs using Algorithm 4.1. The GANs trained on the CIFAR-10 image dataset over 100 epochs. Bidirectional start t ∗ IS(G) FID 0 6.469 12.638 10 6.481 11.539 20 6.336 13.635 50 6.256 16.635 100 6.139 15.354 Table 4.1 shows that training the D with B-BP and G with unidirectional BP outperformed other forms of B-BP training with vanilla GANs. Table 4.2 shows how the start point t ∗ of bidirectional training affects B-BP performance for the vanilla GAN. The best starting value for t ∗ was 10. 4.3.4 Training Deep-Convolutional GANs with Bidirectional Backpropa- gation The B-BP benefit also extends to deep-convolutional GANs. This variant of GAN is a convolutional version of the vanilla GAN. The D and G of a deep-convolutional GAN are convolutional neural networks. The subnetworks had 3 convolutional layers each and no max-pooling layer [11]. It used the same error function as did the vanilla GAN for MNIST data. The forward pass convolved the input data with masks W ∗ i . The backward pass used the transposed masks W ∗ i T (in true bidirectional fashion) to convolve signals. There was no pooling at the convolutional layers for either the forward or backward pass. The latent variable z modeled as a Gaussian random variable z∼N (µ z = 0,σ z = 1.0). The discriminator used three convolutional layers with leaky rectified linear units and a fully connected layer between its input and output layers. The generator G also used three 86 0 20 40 60 80 100 Training epoch 1 2 3 4 5 6 Inception score (IS) Unidirectional GAN B-BP (D only) B-BP (G only) B-BP (D & G) (a) DCGAN: Inception Score (IS) 0 20 40 60 80 100 Training epoch 10 1 10 2 Fr´echet Inception Distance Unidirectional GAN B-BP (D only) B-BP (G only) B-BP (D & G) (b) DCGAN: Fr´ echet Inception Distance (FID) (c) DCGAN: Unidirectional BP (d) DCGAN: Bidirectional BP Figure 4.4: Bidirectional BP improved deep-convolutional GANs: The B-BP algorithm improved the inception and the Fr´ echet inception distance. The DCGANs trained with unidirectional BP and B-BP after 100 epochs. B-BP training outperformed the ordinary unidirectional BP Bidirectional BP training of DCGAN improved the quality of generated images. 
The DCGAN models trained on the CIFAR-10 image dataset. (a). The inception scores for the generated images from the DCGANs trained with unidirectional BP and B-BP. (b). The Fr´ echet inception distances for the generated images from the DCGANs trained with unidirectional BP and B-BP. (c). The generated samples from the DCGAN trained with unidirectional BP. (d). The generated samples from the DCGAN trained with the B-BP from Algorithm 4.1 with the bidirectional start point t ∗ = 0. convolutional layers with rectified linear activations and a fully connected layer between its input and output layers. The generator G used 128 input neurons with identity-function activations. The deep-convolutional GAN trained for over 100 epochs with 100,000 iterations per epoch. Figure 4.4 shows that B-BP training slightly improved the performance of 87 Table 4.5: This compares the inception score IS(G) and Fr´ echet inception distance FID of Wasserstein GANs with B-BP-trained subnetworks (disciminator D and generator G). The GANs trained on the MNIST handwritten digit dataset. Training method IS(G) FID Unidirectional BP Bidirectional BP D and G — 7.747 1.307 G D 8.913 0.369 D G 7.725 1.737 — D and G 8.643 0.870 Table 4.6: This shows the effect of the bidirectional start point t ∗ on the inception score IS(G) and Fr´ echet inception distance FID of B-BP-trained Wasserstein GANs using Algorithm 4.2. The Wasserstein GANs trained on the MNIST handwritten digit dataset. Bidirectional start t ∗ IS(G) FID 0 8.830 0.529 10 8.779 0.493 20 8.913 0.369 50 8.785 0.601 100 7.747 1.307 DCGANs. It also shows some typical generated samples from the deep-convolutional GAN after it trained on the CIFAR-10 color-image data with unidirectional BP and B-BP. B-BP increased the inception score by 4.75% and reduced the Fr´ echet inception distance by 24.5% after training on the CIFAR-10 dataset. Table 4.3 shows that training the D with B-BP and G with unidirectional BP outperformed other forms of B-BP training with DCGANs. Table 4.4 shows that B-BP performed best when t ∗ = 0. 4.3.5 Training Wasserstein GANs with Bidirectional Backpropagation This subsection compares the unidirectional BP training with B-BP training of the Wasser- stein GAN (WGAN) on the MNIST and CIFAR-10 datasets. The WGAN had the same network structure as the above vanilla GAN trained on the MNIST data set ( subsec- tion 4.3.3). The WGANs trained over 100 epochs with 10,000 iterations per epoch and varied the starting point t ∗ for the backward training of the discriminator D. Figure 4.5 shows that the B-BP increased the WGAN’s IS(G) by 15.05% and reduce the WGAN’sFID by 71.8%. Table 4.5 shows that the best WGAN used a B-BP-trained D and a unidirectional BP-trained G. Table 4.6 shows that B-BP performed best when t ∗ = 20. The WGAN also trained on the CIFAR-10 data set and used convolutional layers for 88 0 20 40 60 80 100 Training epoch 1 3 5 7 9 Inception score (IS) Unidirectional BP B-BP (D only) B-BP (G only) B-BP (D & G) (a) WGAN: Inception Score 0 20 40 60 80 100 Training epoch 10 0 10 1 10 2 Fr´echet Inception Distance Unidirectional BP B-BP (D only) B-BP (G only) B-BP (D & G) (b) WGAN: Fr´ echet Inception Distance (c) WGAN samples: Unidirectional BP (d) WGAN samples: Bidirectional BP Figure 4.5: Bidirectional BP improved Wasserstein GANs: The B-BP algorithm outperformed the ordinary unidirectional BP training of WGANs on MNIST handwritten digit dataset. B-BP improved the IS and FID. (a) The inception scores for the generated samples from WGANs. 
(b) The Fréchet inception distances for the generated samples from WGANs. (c) The generated samples from the WGAN trained with the unidirectional BP. (d) The generated samples from the WGAN trained with the B-BP from Algorithm 4.2 with the bidirectional start point t∗ = 20. the discriminator network D and the generator network G. The generator G used three convolutional layers with rectified linear units and a fully connected layer between its input and output layers. The discriminator D also used three convolutional layers with leaky rectified linear units and a fully connected layer. The WGAN trained over 100 epochs with 200,000 iterations per epoch.

The B-BP training showed a slight improvement with the WGANs that use convolutional networks as D and G. Figure 4.6 shows that the B-BP increased the WGAN's IS(G) by 2.86% and reduced the WGAN's FID by 2.51%. Table 4.7 shows that the best WGAN used a B-BP-trained D and a unidirectional BP-trained G. Table 4.8 shows that B-BP performed best when t∗ = 0.

Figure 4.6: Bidirectional BP slightly improved Wasserstein GANs: The B-BP algorithm outperformed the ordinary unidirectional BP training of WGANs on the CIFAR-10 dataset. The WGANs trained over 100 epochs with 200,000 iterations per epoch. B-BP slightly improved the IS and FID. (a) The inception scores for the generated samples from WGANs trained with unidirectional BP, B-BP (D only), B-BP (G only), and B-BP (D & G). (b) The Fréchet inception distances for the samples from the same WGANs. (c) The generated samples from the WGAN trained with the unidirectional BP. (d) The generated samples from the WGAN trained with the B-BP from Algorithm 4.2 with the bidirectional start point t∗ = 0.

Table 4.7: This shows the inception score IS(G) and Fréchet inception distance FID of Wasserstein GANs with B-BP-trained subnetworks (discriminator D and generator G). The WGANs trained on the CIFAR-10 image dataset.

    Unidirectional BP    Bidirectional BP    IS(G)    FID
    D and G              —                   6.326    5.707
    G                    D                   6.507    5.564
    D                    G                   6.360    5.988
    —                    D and G             6.403    6.064

Table 4.8: This shows the effect of the bidirectional start point t∗ on the inception score IS(G) and Fréchet inception distance FID of B-BP-trained Wasserstein GANs with Algorithm 4.2. The Wasserstein GANs trained on the CIFAR-10 image dataset over 100 epochs.

    Bidirectional start t∗    IS(G)    FID
    0                         6.507    5.564
    10                        6.454    5.743
    20                        6.479    5.689
    50                        6.470    5.723
    100                       6.432    5.707

4.4 Conclusion

Training a GAN's discriminator network with bidirectional backpropagation improved the quality of the generated samples. B-BP outperformed standard unidirectional BP for both the vanilla GANs and the Wasserstein GANs. B-BP training substantially improved the inception scores of the vanilla GANs and largely removed the problem with mode collapse. So it makes sense to use a B-BP-trained vanilla GAN if one uses a vanilla GAN. The B-BP gains were less pronounced for the deep-convolutional GAN on the CIFAR-10 image data. They were more substantial when using the JS distance (Algorithm 4.1) than the Wasserstein distance (Algorithm 4.2).

Chapter 5

High Capacity Neural Classifiers

This chapter extends neural classifier networks to bidirectional blocks of networks.
The terminal layers of the interior blocks use logistic neurons and a new form of random coding. So logistic output neurons replace the softmax output neurons in most modern neural classifiers. The output logistic neurons allow the choice of binary codewords for K patterns from any of the 2 K vertices of a binary or bipolar hypercube. Output softmax neurons limit this choice to the K vertices of the simplex embedded in the hypercube. Random logistic coding gives codewords that are approximately orthogonal. This allows the same network to store and recall far more patterns than simple softmax 1-in-K encoding and will help classifier networks scale to much larger number K of training patterns. Deep-sweep and bidirectional training of the entire network further improves network performance. The next section presents this new blocking framework and shows how it performs with several image test sets. 5.1 Logistic versus Softmax Output Neurons This section shows that a network with logistic output neurons and random coding can store the same number K of patterns as a softmax classifier but with a smaller number M of output neurons. The logistic network’s classification accuracy falls as M becomes much smaller than K. This implies that a properly coded logistic network can store far more patterns with similar accuracy than a softmax network can with the same number of outputs. The randomly encoded logistic blocks lead to still more efficient deep networks. Almost all deep classifiers map input patterns to K output softmax neurons. So they code the K pattern classes with K unit bit vectors and thus with 1-in-K coding. The softmax output layer has the likelihood structure of a one-shot multinomial probability or 92 vs. [ 1 , 0 , 0 ] [ 0 , 1 , 0 ] [ 0 , 0 , 1 ] [ 1 , 0 , 0 ] [ 0 , 1 , 0 ] [ 0 , 0 , 1 ] [ 0 , 1 , 1 ] [ 1 , 1 , 0 ] [ 1 , 0 , 1 ] [ 1 , 1 , 1 ] [ 0 , 0 , 0 ] 1-in- coding Logistic coding Simplex Hypercube Figure 5.1: High capacity networks with a number of classesK: Logistic coding gives bigK capacity because it uses all the vertices. 1-in-K coding has a limited capacity because it only uses the vertices on the simplex. This coding has a maximum capacity of 3 classes because the code length M = 3. The simplex has only 3 vertices in this case. The logistic coding has a maximum capacity of 8 classes because the code length M = 3. This scheme uses all the vertices (2 3 = 8). the single roll of K-sided die and thus its log-likelihood is the negative of the cross entropy [9], [155]. This softmax structure produces an output probability vector and so restricts its coding options to the K unit bit vectors of the K-dimensional unit hypercube [0, 1] K . Logistic output coding can use any of the 2 K binary vertices of the hypercube [0, 1] K . This allows far fewer output logistic neurons to accurately code for the K pattern classes. The logistic layer’s likelihood is that of a product of Bernoulli probabilities. Its log-likelihood has a double cross-entropy structure [9], [155]. The softmax and logistic networks coincide when K = 2. Figure 5.2 shows sample random logistic coding vectors of lengths M = 20, 60, and 100 for logistic networks that encode K = 100 pattern classes. The next subsection describes the activation, decision rule and loss function for softmax and logistic output neurons. 5.1.1 Activation, Decision Rule, and Error Function Input x passes through a classifier network N and gives o y =N (a x ) where o y is the input to the output layer. 
The output activation $\mathbf{a}^y$ equals $f(\mathbf{o}^y)$ where $f$ is a monotonic and differentiable function. Softmax or Gibbs activation functions [34], [89] remain the most used output activation for neural classifiers. This chapter explores instead binary and bipolar output logistic activations. Logistic output activations give a choice of $2^M$ codewords at the vertices of the unit cube $[0,1]^M$ to code for the $K$ patterns as opposed to the softmax choice of just the $M = K$ vertices of the embedded probability simplex.

Figure 5.2: Random logistic codewords: This shows the bipolar codewords generated from the random coding method in Algorithm 5.1 with $p = 0.5$, $M \le 100$, and $K = 100$. The algorithm found the set of codewords $C^*$ with the smallest mean $\mu_c$ of the inter-codeword similarity measure $d_{kl}$. This search ran over 10,000 iterations. The black pixels denote the bit value 1 and the white pixels denote the bit value $-1$. The 1-in-$K$ coding shows the 100 equidistant unit basis-vector codewords from the bipolar Boolean cube $\{-1,1\}^{100}$ with $M = 100$. The random logistic coding image shows 3 sets of codewords from using the algorithm. It shows the best random codesets $C^*$ with $M \in \{20, 60, 100\}$.

Codeword $\mathbf{c}_k$ is an $M$-dimensional vector that represents the $k$th class. $M$ is the codeword length. Each target $\mathbf{y}$ is one of the $K$ unique codewords $\{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_K\}$. The decision rule for classifying $\mathbf{x}$ maps the output activation $\mathbf{a}^y$ to the class with the closest codeword:

$$C(\mathbf{x}) = \arg\min_k \sum_{l=1}^{M} \left| c_{kl} - a^y_l \right| \qquad (5.1)$$

where $C(\mathbf{x})$ is the predicted class for input $\mathbf{x}$, $a^y_l$ is the $l$th component of the output activation, and $c_{kl}$ is the $l$th component of the $k$th codeword $\mathbf{c}_k$. We now describe the output activations and their respective layer-likelihood structure.

5.1.1.1 Softmax or Gibbs Activation

This activation maps the neuron's input $\mathbf{o}^y$ to a probability distribution over the predicted output classes [89], [155]. The activation $a^y_l$ of the $l$th output neuron has the multi-class Bayesian form

$$a^y_l = \frac{\exp(o^y_l)}{\sum_{k=1}^{K} \exp(o^y_k)} \qquad (5.2)$$

where $o^y_l$ is the input of the $l$th output neuron. A single such logistic function defines the Bayesian posterior in terms of the log-posterior odds for a simple two-class classification [34]. The softmax activation in (5.2) uses the $K$ binary basis vectors from the Boolean cube $\{0,1\}^K$ as the codewords. The codeword length $M$ equals the number $K$ of classes in this case: $M = K$. For example the codewords for a 3-class dataset with softmax activation are $[1,0,0]$, $[0,1,0]$, and $[0,0,1]$. The predicted class $C(\mathbf{x})$ for an input $\mathbf{x}$ by a network with a softmax activation follows from the generalized decision rule in equation (5.1).

The error function $E_s$ for the softmax layer is the cross entropy [9] since it equals the negative of the log-likelihood for a multinomial layer likelihood: a single roll of the network's implied $K$-sided die. So

$$E_s = -\sum_{l=1}^{K} y_l \log(a^y_l) = -\log \prod_{l=1}^{K} \left(a^y_l\right)^{y_l} \qquad (5.3)$$

where $y_l$ is the $l$th component of the target. The softmax decision rule follows from (5.1). The rule simplifies for the unit bit basis vectors as the codewords. Let $\sum_{l=1}^{K} \left| c_{kl} - a^y_l \right| = D^{(k)}$ and $d^{(k)}_l = |c_{kl} - a^y_l|$ where $D^{(k)}$ is the distance between $\mathbf{a}^y$ and $\mathbf{c}_k$. Then

$$C(\mathbf{x}) = \arg\min_k \sum_{l=1}^{K} \left| c_{kl} - a^y_l \right| = \arg\min_k D^{(k)} = \arg\max_k a^y_k \qquad (5.4)$$

because $M = K$. So $C(\mathbf{x}) = m$ implies that $D^{(m)} \le D^{(k)}$ for $k \in \{1, 2, \ldots, K\}$. The decision rule simplifies as in (5.4) because

$$c_{kl} = \begin{cases} 1 & \text{if } l = k \\ 0 & \text{otherwise} \end{cases} \qquad (5.5)$$

and $0 \le a^y_l \le 1$ for $l \in \{1, 2, \ldots, K\}$.
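A minimal NumPy sketch may help make the softmax layer concrete. It implements the activation (5.2), the cross-entropy error (5.3), and the closest-codeword decision rule (5.1), and checks that the rule reduces to the argmax rule (5.4) for 1-in-K codewords. The toy input values and the choice K = 3 are illustrative and not taken from the simulations in this chapter.

```python
import numpy as np

def softmax(o):
    """Softmax activation (5.2): map output-layer inputs o to a probability vector."""
    e = np.exp(o - o.max())              # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, a):
    """Cross-entropy error (5.3) for a 1-in-K target y and softmax activation a."""
    return -np.sum(y * np.log(a + 1e-12))

def classify(a, codewords):
    """Closest-codeword decision rule (5.1): pick the class whose codeword
    minimizes the sum of absolute coordinate differences."""
    return int(np.argmin(np.abs(codewords - a).sum(axis=1)))

K = 3
codewords = np.eye(K)                    # 1-in-K unit bit-vector codewords (5.5)
o = np.array([1.2, 0.3, -0.8])           # toy output-layer inputs
a = softmax(o)

# For unit bit-vector codewords the closest-codeword rule reduces to argmax (5.4).
assert classify(a, codewords) == int(np.argmax(a))
print(classify(a, codewords), cross_entropy(codewords[0], a))
```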
5.1.1.2 Binary Logistic Activation

The binary logistic activation $\mathbf{a}^y$ maps the input $\mathbf{o}^y$ to a vector in the unit hypercube $[0,1]^M$:

$$a^y_l = \frac{1}{1 + \exp(-o^y_l)} \qquad (5.6)$$

is the activation of the $l$th output neuron where $o^y_l$ is the input of the $l$th output neuron. The codewords are vectors from $\{0,1\}^M$ where $\log_2 K \le M$. The decision rule for binary logistic activation follows from (5.1).

The binary logistic output activation allows us to pick codewords with a small code length. The necessary condition on $M$ is that $\log_2 K \le M$. This reduces the number of output neurons and the size of neural classifiers for highly multiclass datasets. For example the codewords for a 3-class dataset with binary logistic activation are $[1,0]$, $[0,1]$, and $[1,1]$. $K$ equals 3 and $M$ equals 2 in this case. The logistic output activation can also impose the equidistant condition on the codewords by picking the basis vectors from the Boolean cube $\{0,1\}^M$ as the codewords with $M = K$.

The error function $E_{\log}$ for the binary logistic output activation is the double cross entropy. This equals the negative of the log-likelihood of a product of independent Bernoulli probabilities. The double cross-entropy is

$$E_{\log} = -\sum_{l=1}^{M} \Big[ y_l \log(a^y_l) + (1 - y_l)\log(1 - a^y_l) \Big] = -\log \prod_{l=1}^{M} \left(a^y_l\right)^{y_l} \left(1 - a^y_l\right)^{1 - y_l}. \qquad (5.7)$$

The term $a^y_l$ denotes the activation of the $l$th output neuron and $y_l$ is the $l$th argument of the target vector with $y_l \in \{0,1\}$.

5.1.1.3 Bipolar Logistic Activation

A bipolar logistic activation maps $\mathbf{o}^y$ to a vector in $[-1,1]^M$. The activation $a^y_l$ of the $l$th output neuron has the form

$$a^y_l = \frac{2}{1 + \exp(-o^y_l)} - 1 = \frac{1 - \exp(-o^y_l)}{1 + \exp(-o^y_l)} \qquad (5.8)$$

where $o^y_l$ is the input into the $l$th output neuron. The codewords are $K$ bipolar vectors from $\{-1,1\}^M$ such that $\log_2 K \le M$. The decision rule in this case follows from (5.1). The corresponding error function $E_{bip}$ is the double cross entropy. This requires a linear transformation of $a^y_l$ and $y_l$ as follows:

$$\tilde a^y_k = \tfrac{1}{2}\left(1 + a^y_k\right) \qquad (5.9)$$
$$\tilde y_k = \tfrac{1}{2}\left(1 + y_k\right). \qquad (5.10)$$

The bipolar logistic activation uses the transformed double cross-entropy. This equals the negative of the log-likelihood $\mathrm{NLL}_{bip}$ of the transformed terms with independent Bernoulli probabilities:

$$\mathrm{NLL}_{bip} = -\sum_{l=1}^{M} \Big[ \tilde y_l \log(\tilde a^y_l) + (1 - \tilde y_l)\log(1 - \tilde a^y_l) \Big] \qquad (5.11)$$
$$= -\tfrac{1}{2}\sum_{l=1}^{M} \Big[ (1 + y_l)\log(1 + a^y_l) + (1 - y_l)\log(1 - a^y_l) - 2\log 2 \Big] \qquad (5.12)$$
$$= -\log \prod_{l=1}^{M} \left(\tilde a^y_l\right)^{\tilde y_l} \left(1 - \tilde a^y_l\right)^{1 - \tilde y_l}. \qquad (5.13)$$

Table 5.1: Output logistic and softmax neurons. This shows the likelihood and activation functions for logistic and softmax neurons. The bipolar logistic requires the transform of $a^y_l$ to $\tilde a^y_l = 0.5(a^y_l + 1.0)$.

                        Softmax                                            Binary Logistic                                           Bipolar Logistic
    Activation $a^y_l$  $\frac{\exp(o^y_l)}{\sum_{k=1}^{K}\exp(o^y_k)}$    $\frac{1}{1+\exp(-o^y_l)}$                                $\frac{1-\exp(-o^y_l)}{1+\exp(-o^y_l)}$
    Likelihood          $\prod_{l=1}^{K}(a^y_l)^{y_l}$                     $\prod_{l=1}^{M}(a^y_l)^{y_l}(1-a^y_l)^{1-y_l}$           $\prod_{l=1}^{M}(\tilde a^y_l)^{\tilde y_l}(1-\tilde a^y_l)^{1-\tilde y_l}$

Training seeks the best parameter $\Theta^*$ that minimizes the error function. So we can drop the constant terms in $\mathrm{NLL}_{bip}$. The modified error $E_{bip}$ has the form

$$E_{bip} = -\sum_{l=1}^{M} \Big[ (1 + y_l)\log(1 + a^y_l) + (1 - y_l)\log(1 - a^y_l) \Big]. \qquad (5.14)$$

Table 5.1 summarizes the properties of softmax and logistic output activations.

5.1.2 Random Coding with Bipolar Codewords

We now present the method for picking $K$ random bipolar codewords from $\{-1,1\}^M$ with $\log_2 K \le M < K$. The bipolar Boolean cube $\{-1,1\}^M$ contains $2^M$ possible codewords since the bipolar unit cube $[-1,1]^M$ has $2^M$ vertices.
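Before the formal similarity criterion (5.15) and the pseudocode in Algorithm 5.1 below, a small NumPy sketch previews the idea: sample K bipolar codewords of length M coordinate by coordinate and measure how nearly orthogonal they are on average. The sampling probability, seed, and sizes here are illustrative assumptions rather than values from the simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_codewords(K, M, p=0.5):
    """Draw K length-M bipolar codewords with P(coordinate = +1) = p."""
    return np.where(rng.random((K, M)) < p, 1.0, -1.0)

def mean_similarity(C):
    """Mean inter-codeword dot product over all unique pairs of codewords."""
    G = C @ C.T                                # Gram matrix of codeword dot products
    iu = np.triu_indices(len(C), k=1)          # indices of the unique codeword pairs
    return G[iu].mean()

C = sample_codewords(K=100, M=60, p=0.5)
# The mean pairwise dot product sits near 0: random bipolar codewords with p = 0.5
# are roughly orthogonal even though the code length M is well below K.
print(mean_similarity(C))
```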
It is computationally expensive to pick M =K for a dataset with big values of K such as 10,000 or more [62], [104]. Our goal is to find an efficient way to pick K codewords with log 2 K≤M <K. Let code C be a K×M matrix such that the k th row c k is the k th codeword and d kl be the similarity measure between c k and c l . We have d kl = c k · c l . There are 1 2 (K(K− 1)) unique pairs of codewords. The mean µ c of the inter-codeword similarity measure has the normalized correlation form µ c = 2 K(K− 1) K X k=1 K X l>k c k · c l . (5.15) This random coding method uses µ c to guide the search. The method finds the best code C ∗ with the minimum similarity mean µ ∗ c for a fixed M. Algorithm 5.1 shows the pseudocode for this method. A high value of µ c implies that most of the codewords are not orthogonal while a low value of µ c implies that most of the codewords are orthogonal. Figure 5.2 shows examples of codewords from Algorithm 5.1. 97 Algorithm 5.1: Random Bipolar Logistic Coding Data: Code length M, number of codewords K, number of search iterations T, and sampling probability p. Result: A set C ∗ of K unique codewords of length M each 1 • Initialize µ ∗ = 0.0; 2 for t=1,...,T do 3 for k=1,...,K do 4 if k == 1 then 5 • Form a (1×M) vector C (t) by picking M samples with replacement from{−1, 1} with the probability p; 6 else 7 • search_status = True 8 while search_status == True do 9 • Form a (1×M) vector c k by picking M samples with replacement from{−1, 1} with the probability p; 10 • Compute the intercodeword orthogonality between the (t− 1) codewords in C (t−1) and c k : d k = c T k C (t−1) if max(d k )<M then 11 • C (t) ⇐ Stack c k and C (t−1) to form a (k×M) matrix: C (t) ⇐ C (t) , c k • search_status = False 12 • Compute the mean µ (t) c for code C (t) using (5.15); 13 if (t == 1) or (µ ∗ c <µ (t) c ) then 14 • Update the best code C ∗ and best mean µ c ∗ : C ∗ ⇐ C (t) 5.1.3 Backpropagation Invariance and Network Likelihood The backpropagation (BP) learning laws remain invariant at a softmax or logistic layer if the error functions have the appropriate respective cross-entropy or double-cross-entropy form. The learning laws are invariant for softmax and binary logistic activations because taking the derivative of their network likelihood gives the same partial derivatives [20], [154]. Suppose a network has J hidden layers h 1 , h 2 ,......, h J . The term h j denotes the j th hidden layer after the input (identity) layer. The complete likelihood is the probability density 98 Input Block ( ) ℎ ( 1 ) ( 1 ) ( ) ( 2 ) ℎ ( 2 ) ( 2 ) ℎ ( 5 ) ( 5 ) ( 3 ) ℎ ( 3 ) ( 3 ) ( 4 ) ℎ ( 4 ) ( 4 ) ℎ ( 1 ) ( 1 ) Hidden Block ( ) ℎ ( 2 ) ( 2 ) Hidden Block ( ) ℎ ( 3 ) ( 3 ) Hidden Block ( ) ℎ ( 4 ) ( 4 ) Output Block ( ) ℎ ( 5 ) ( 5 ) Input Figure 5.3: Blocking a deep neural network: This shows the modular architecture of a long block neural network. The deep-sweep training method in Algorithm 5.2 used blocking to break a deep neural network into small multiple blocks. The network had an input block N (1) , three hidden blocks {N (2) ,N (3) ,N (4) }, and output block N (5) . Each block had three layers in the simplest case. The terms a y(1) ,....,a y(4) represent the activations for the visible hidden layers and a y(5) is the output activation. The terms a h(1) ,....,a h(5) represent the activations of the non-visible hidden layers. The deep-sweep method used two stages: pre-training and fine-tuning. The pre-training stage trained the blocks separately. 
It used supervised training for each block by using the block error E (b) between the output activation a y(b) and the target y. The fine-tuning stage began after the pre-training and also used supervised learning. It stacked all the blocks together and used an identity matrix I to connect contiguous blocks. Fine tuning optimized the weights with respect to the joint error E ds . p(y, h J ,....., h 1 |x, Θ). The chain rule or multiplication theorem of probability factors the likelihood into a product of the layer likelihoods: p(y, h J ,....., h 1 |x, Θ) =p(y|h J ,....., h 1 , x, Θ) p(h 1 |x, Θ) J Y j=2 p(h j |h j−1 ,....., h 1 , x, Θ) (5.16) where we assume that p(x) = 1 for simplicity [9], [99], [170]. So the complete log-likelihood L(y, h|Θ) is L(y, h|Θ) = logp(y, h J ,....., h 1 |x, Θ) = L(y|x, Θ) + P J j=1 L(h j |x, Θ) where L(h j |x, Θ) = logp(h j |h j−1 ,....., h 1 , x, Θ). The output layer has log-likelihood L(y|x, Θ) = logp(y|h J ,....., h 1 , x, Θ). The next section presents a method that trains on the complete likelihood of a network instead of using only the output layer likelihood. 5.2 Blocking and Deep-sweep Training Deep-sweep training optimizes a network with respect to the network’s complete likelihood in (5.16). This method performs blocking on deep networks by breaking the network down into a contiguous multiple small networks or blocks. Figure 5.3 shows the architecture of a deep neural network with the deep-sweep training method. The figure shows the small blocks that make up the deep neural network. N (1) is the input block, N (B) is the output block, and the others are hidden blocks. The layer of connection between two blocks is treated as a visible hidden layer. We need the number of blocksB≥ 2 to use the deep-sweep 99 method. Let the term L (b) denote the number of layers for block N (b) . L (b) must be greater than 1 because each block has at least an input layer and an output layer. The term Θ b represents the weights of N (b) . 5.2.1 Training with Deep-sweep The deep-sweep training method trains a neural network in two stages. The first stage is the pre-training and the second stage is fine-tuning. The pre-training stage trains the blocks separately as supervised learning tasks. N (1) maps x into the corresponding range of the output activation. We have a y(b) = N (b) (y), if b∈{2,....,B} N (b) (x), otherwise (5.17) where y is the target, o y(b) is the input to the output layer ofN (b) , and x is the input vector. The error function E (b) measures the error between the target y and activation a y(b) . The error function E (b) of N (b) for b∈{1, 2, 3,..,B} with a bipolar logistic activation is: E (b) =− M X l=1 (1 +y l ) log 1 +a y(b) l + (1−y l ) log 1−a y(b) l (5.18) where a y(b) l is the l th component of the output activation of N (b) . The fine-tuning stage follows the pre-training stage. It involves stacking the blocks and a deep-sweep across the entire networkN from the input layer to the output layer. Figure 5.3 shows the stacked blocks where x is the input through N (1) and the output activation ˜ a y(B) comes from the output ofN. We have: ˜ o y(b) = N (b) ˜ a y(b−1) , if b∈{2,....,B} N (b) (x), otherwise (5.19) and ˜ a y(b) =f ˜ o y(b) . The deep-sweep errorE (b) ds for the fine-tuning stage is different from the errorE (b) . E (b) ds is the deep-sweep error between ˜ a y(b) and the target y. 
So the corresponding deep-sweep error for a network with bipolar logistic activation is: E (b) ds =− M X l=1 (1 +y l ) log 1 + ˜ a y(b) l + (1−y l ) log 1− ˜ a y(b) l (5.20) for b∈{1, 2,...,B} where ˜ a y(b) l is the l th component of the activation ˜ a y(b) . The update rule at this stage differs from ordinary BP. Ordinary BP trains network parameters with a single 100 Algorithm 5.2: Deep-sweep with Bipolar Codewords Data: R input vectors{x 1 ,..., x R } and the corresponding R target vectors {y 1 ,..., y R }, learning rate η, batch size M, training epochs T, iterations per epoch R, number of blocks B, size of blocks{L (1) ,....,L (B) }, and start of fine-tuning stage t o . Result: Trained network parameter Θ (ds) 1 , Θ (ds) 2 , ... , Θ (ds) B ; 1 • Initialize network parameter Θ (0) 1 , Θ (0) 2 , ... , Θ (0) B ; 2 for t=1,...,T do 3 for r=1,...,R do 4 if t≤t o then 5 for b=1,...,B do 6 • Compute the output activation a y(b) r using (5.8); 7 • Compute the pre-training error E (b) r between a y(b) r and y r using (5.20); 8 • Back-propagate error E (b) r and update parameter Θ b with gradient descent or its variant; Θ (t+1) b = Θ (t) b −η∇ Θ b E (b) Θ b =Θ (t) b 9 else 10 • Stack the B blocks into a single deep network: 11 for b=1,...,B do 12 • Compute the output activation ˜ a y(b) r for input x r using (5.8) and (5.19); 13 • Compute the joint deep-sweep error E ds using equation (5.21); 14 • Back-propagate the error and update the weights with gradient descent or any of its variants; Θ (t+1) b = Θ (t) b −η∇ Θ b E ds Θ b =Θ (t) b error function at the output layer since the algorithm does not directly know the correct output value of a hidden layer. But we do know the correct output layer of an interior block since it just equals the random codeword. So the deep-sweep method updates the weights with respect to errors at the output layer of the blocks. The joint deep-sweep error E ds is E ds =− B X b=1 M X l=1 (1 +y l ) log 1 + ˜ a y(b) l + (1−y l ) log 1− ˜ a y(b) l = B X b=1 E (b) ds (5.21) and the update rule for any parameter Θ b follows from the derivative of this joint error. Algorithm 5.2 shows the pseudocode for this method. 101 5.2.2 Deep-sweep with Bidirectional Backpropagation Input Output Hidden Block ( ) Hidden Block ( ) Input Block ( ) ℎ ( 1 ) ℎ ( 2 ) ℎ ( 2 ) ℎ ( 3 ) ℎ ( 4 ) ℎ ( 9 ) ℎ ( 8 ) Hidden Block ( ) Output Block ( ) ℎ ( 4 ) ℎ ( 5 ) ℎ ( 6 ) ℎ ( 6 ) ℎ ( 7 ) ℎ ( 8 ) ℎ ( 1 ) ( 1 ) ℎ ( 5 ) ( 5 ) ℎ ( 2 ) ( 2 ) ℎ ( 3 ) ( 3 ) ℎ ( 4 ) ( 4 ) Figure 5.4: Blocking with bidirectional backpropagation: This shows the modular architecture of a deep block neural network training with the bidirectional backpropagation method. The deep-sweep training method in Algorithm 5.2 used blocking to break a deep neural network into small multiple blocks. The network had an input block N (1) , three hidden blocks{N (2) ,N (3) ,N (4) }, and output block N (5) . Each block had three layers in the simplest case. The terms a y(1) ,....,a y(4) represent the activations for the visible hidden layers (red neurons) and a y(5) is the activation of the output layer (brown neurons). The terms a h(1) ,....,a h(5) (green neurons) represent the activations of the non-visible hidden layers. Algorithm 5.2 uses bidirectional backpropagation method to train the blocks in two stages: pre-training and fine-tuning. The pre-training stage trained the blocks separately as bidirectional associative memory (BAM) networks. 
It used supervised training for each block by using the joint error from the forward pass and the backward pass. For the input block, the forward error $E_f$ measures the error between the forward pass $N^{(j)}(\mathbf{a}^x)$ and the target $\mathbf{y}$, and the backward error $E_b$ is the error between the backward pass $N^{(j)T}(\mathbf{y})$ and the input activation $\mathbf{a}^x$. For the hidden and output blocks, $E_f$ is the error between the forward pass $N^{(j)}(\mathbf{y})$ and the target $\mathbf{y}$, and $E_b$ is the error between the backward pass $N^{(j)T}(\mathbf{y})$ and the target $\mathbf{y}$. Then it updates the weights $\Theta_b$ with respect to the joint error of $N^{(j)}$. The fine-tuning stage began after the pre-training and also used supervised learning. It stacked all the blocks together and used an identity matrix $I$ to connect contiguous blocks. Fine-tuning optimized the weights with respect to the joint error $E_{ds}$. The joint error in this case is the sum of the joint errors across all the blocks and the optimization is joint.

The B-BP algorithm endows a multilayered neural network with a bidirectional representation between the input space $X$ and the output space $Y$ [5], [8]. B-BP is also a form of maximum likelihood estimation. The forward pass maps the input $\mathbf{x}$ to $N(\mathbf{x})$ and the backward pass maps the label $\mathbf{y}$ to $N^T(\mathbf{y})$. This algorithm maximizes the forward likelihood $p_f(\mathbf{y}|\mathbf{x},\Theta)$ and the backward likelihood $p_b(\mathbf{x}|\mathbf{y},\Theta)$ with respect to $\Theta$. B-BP finds the weights $\Theta^*$ that maximize the joint likelihood:

$$\Theta_{BBP} = \arg\max_\Theta \; p_f(\mathbf{y}|\mathbf{x},\Theta)\, p_b(\mathbf{x}|\mathbf{y},\Theta). \qquad (5.22)$$

This joint optimization of the likelihoods simplifies to the following:

$$\Theta_{BBP} = \arg\max_\Theta \; \log p_f(\mathbf{y}|\mathbf{x},\Theta) + \log p_b(\mathbf{x}|\mathbf{y},\Theta) \qquad (5.23)$$

because the logarithm is a monotonic function.

The combination of deep-sweep and B-BP training jointly optimizes the complete forward and backward likelihoods. The chain rule or multiplication theorem of probability factors the complete forward likelihood $p_f(\mathbf{y},\mathbf{h}_J,\ldots,\mathbf{h}_1|\mathbf{x},\Theta)$ into a product of likelihoods:

$$p_f(\mathbf{y},\mathbf{h}_J,\ldots,\mathbf{h}_1|\mathbf{x},\Theta) = p(\mathbf{y}|\mathbf{h}_J,\ldots,\mathbf{h}_1,\mathbf{x},\Theta)\; p(\mathbf{h}_1|\mathbf{x},\Theta) \times \prod_{j=2}^{J} p(\mathbf{h}_j|\mathbf{h}_{j-1},\ldots,\mathbf{h}_1,\mathbf{x},\Theta) \qquad (5.24)$$

where $J$ is the number of blocks. The complete backward likelihood $p_b(\mathbf{x},\mathbf{h}_1,\ldots,\mathbf{h}_J|\mathbf{y},\Theta)$ similarly factors into a product of likelihoods:

$$p_b(\mathbf{x},\mathbf{h}_1,\ldots,\mathbf{h}_J|\mathbf{y},\Theta) = p(\mathbf{x}|\mathbf{h}_1,\ldots,\mathbf{h}_J,\mathbf{y},\Theta)\; p(\mathbf{h}_J|\mathbf{y},\Theta) \times \prod_{j=1}^{J-1} p(\mathbf{h}_j|\mathbf{h}_{j+1},\ldots,\mathbf{h}_J,\mathbf{y},\Theta). \qquad (5.25)$$

The deep-sweep and B-BP training solution $\Theta^{BBP}_{ds}$ simplifies as follows:

$$\Theta^{BBP}_{ds} = \arg\max_\Theta \; p_f(\mathbf{y},\mathbf{h}_J,\ldots,\mathbf{h}_1|\mathbf{x},\Theta)\, p_b(\mathbf{x},\mathbf{h}_1,\ldots,\mathbf{h}_J|\mathbf{y},\Theta). \qquad (5.26)$$

Figure 5.4 shows the modular architecture for training a network with B-BP and deep-sweep. This involves breaking a deep network down into small contiguous bidirectional blocks. The deep-sweep training with B-BP runs in two stages: the pre-training and the fine-tuning. The forward pass of $\mathbf{x}$ over the $j$th block $N^{(j)}$ during the pre-training stage is

$$\mathbf{a}^{y_f}_j = \begin{cases} N^{(j)}(\mathbf{x}) & \text{if } j = 1 \\ N^{(j)}(\mathbf{y}) & \text{otherwise} \end{cases} \qquad (5.27)$$

where $\mathbf{a}^{y_f}_j$ is the output-layer activation of the forward pass over the $j$th block. The backward pass of $\mathbf{y}$ over $N^{(j)}$ during the pre-training stage is

$$\mathbf{a}^{x_b}_j = N^{(j)T}(\mathbf{y}) \qquad (5.28)$$

where $\mathbf{a}^{x_b}_j$ is the input-layer activation of the backward pass over the $j$th block. The forward pass of $\mathbf{x}$ over the $j$th block $N^{(j)}$ during the fine-tuning stage is

$$\mathbf{a}^{y_f}_j = \begin{cases} N^{(j)}(\mathbf{x}) & \text{if } j = 1 \\ N^{(j)}\big(\mathbf{a}^{y_f}_{(j-1)}\big) & \text{otherwise.} \end{cases} \qquad (5.29)$$

The backward pass of $\mathbf{y}$ over $N^{(j)}$ during the fine-tuning stage is

$$\mathbf{a}^{x_b}_j = N^{(j)T}(\mathbf{y}). \qquad (5.30)$$

The forward error $E^{(j)}_f$ for the $j$th block during the pre-training is $E^{(j)}_f = E_{jf}(\mathbf{a}^{y_f}_j, \mathbf{y}).$
(5.31) where E jf is the error function due to the output activation of the j th block along the forward pass. The backward error E (j) b for the j th block during the pre-training is: E (j) b = E jb (a xb j , x), if j == 1 E jb (a xb j , y), otherwise (5.32) where E jb is the error function with respect to the input activation of the j th block along the backward error. Identity activation uses squared error, binary logistic uses the double cross-entropy error in (5.7), and bipolar logistic uses the transformed double-cross entropy error in (5.14). The pre-training computes joint error E BBP (j) per block. We have E BBP (j) =E (j) f +αE (j) b (5.33) where α> 0. The joint error E BBP ds for the fine-tuning stage sums the error function over index j. E BBP ds = J X j=1 E BBP (j) = J X j=1 E (j) f +αE (j) b (5.34) where J is the number of blocks. Algorithm 5.3 gives the pseudocode for running the deep-sweep training with the B-BP algorithm. 5.3 Simulation and Results This simulation compared the performance of the output activations. Output logistic activations outperformed the softmax activation. This section reports the performance of the random logistic coding method in Algorithm 5.1. The classification accuracy of neural classifiers decreased as µ c increased with a fixed M and log 2 ≤ M < K. The result also shows that the accuracy with bipolar codewords and M = 0.4K is comparable with the accuracy from using the softmax activation with K-dimensional codewords (basis vectors). 104 Algorithm 5.3: Deep-sweep Training with Bidirectional BP Data: R input vectors{x 1 ,..., x R } and the corresponding R target vectors {y 1 ,..., y R } , learning rate η, batch size M, training epochs T, iterations per epoch R, number of blocks J, size of blocks{L (1) ,....,L (J) }, start of fine-tuning stage t o , and and backward error scale α. Result: Trained network parameter Θ (ds) 1 , Θ (ds) 2 , ... , Θ (ds) J ; 1 • Initialize network parameter Θ (0) 1 , Θ (0) 2 , ... , Θ (0) J ; 2 for t=1,...,T do 3 for r=1,...,R do 4 if t≤t o then 5 for j=1,...,J do 6 • Compute the output activation along the forward pass using (5.27); 7 • Compute the input activation along the backward pass using (5.28); 8 • Compute the pre-training joint error E BBP (j) using (5.33); 9 • Back-propagate error E BBP (j) and update parameter Θ j with gradient descent or its variant; Θ (t+1) j = Θ (t) b −η∇ Θ j E BBP (j) Θ j =Θ (t) j 10 else 11 • Stack the J blocks into a single deep network: 12 for j=1,...,J do 13 • Compute the output activation along the forward pass using (5.29); 14 • Compute the input activation along the backward pass using (5.30); 15 • Compute the joint deep-sweep error E BBP ds using equation (5.34); 16 • Back-propagate the error and update the weights with gradient descent or any of its variants; Θ (t+1) j = Θ (t) j −η∇ Θ j E BBP ds Θ j =Θ (t) j 5.3.1 Dataset This classification experiments used the CIFAR-100 and Caltech-256 image datasets. 5.3.1.1 CIFAR-100 CIFAR-100 is a set of 60,000 color images from 100 pattern classes with 600 images per class. The 100 classes divide into 20 superclasses. Each superclass consists of 5 classes [156]. Each image had dimension 32×32×3. 105 (a) CIFAR-100 (b) Caltech-256 Figure 5.5: Simulation data sets: Sample images from the simulation data sets. (a) shows sample images from the CIFAR-100 image set. The 100 samples show one image per class. (b) shows sample images from the Caltech-256 image set. The 256 samples show one image per class. 
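For readers who want to reproduce the data setup, the following sketch loads CIFAR-100 with torchvision and flattens each 32×32×3 image into the 3,072-dimensional input vectors that the non-convolutional classifiers below use. The torchvision dataset class and its standard 50,000/10,000 train/test split are assumptions about the tooling, not part of the dissertation's own pipeline.

```python
import torchvision
import torchvision.transforms as T

# Flatten each 3x32x32 image to a 3,072-dimensional vector (3 * 32 * 32 = 3072).
transform = T.Compose([T.ToTensor(), T.Lambda(lambda x: x.view(-1))])

train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100(root="./data", train=False,
                                         download=True, transform=transform)

# 50,000 training and 10,000 test images; each sample is a length-3072 vector.
print(len(train_set), len(test_set), train_set[0][0].shape)
```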
5.3.1.2 Caltech-256 This dataset had 30,607 images from 256 pattern classes. Each class had between 31 and 80 images. The 256 classes consisted of the two superclasses animate and inanimate. The animate superclass contained 69 pattern classes. The inanimate superclass contained 187 pattern classes [94]. We removed the cluttered images and reduced the size of the dataset to 29,780 images. We resized each image to 100×100×3. 5.3.2 Logistic Coding and Deep-sweep Training This subsection reports the results from the performance of non-convolutional neural classi- fiers. The ordinary deep-sweep training is a form of the ordinary unidirectional BP. 5.3.2.1 Network Description We trained several deep neural classifiers on the CIFAR-100 and Caltech-256 datasets. The classifiers used 3,072 input neurons and K = 100 if they trained on the CIFAR-100 data. All the classifiers we trained on the CIFAR-100 had 512 neurons per hidden layer. The hidden neurons used ReLU activations of the form a(x) = max(0,x) although logistic hidden units also performed well in blocks. We trained some classifiers with the ordinary BP [127], [156] and then further trained others with the deep-sweep method. We used dropout pruning 106 Softmax Binary logistic Bipolar logistic 28 29 30 31 32 33 Classification accuracy (a) CIFAR-100 Softmax Binary logistic Bipolar logistic 15.5 16.0 16.5 17.0 17.5 Classification accuracy (b) Caltech-256 Figure 5.6: Logistic activations outperformed softmax activations for the same number K of output neurons. We compared the classifier accuracy of networks that used output softmax, binary logistic, and bipolar logistic neurons. Pattern coding used the K binary and unit-length basis vectors from the Boolean{0, 1} K as the codewords for softmax or binary logistic outputs. Coding used K bipolar basis vectors from the bipolar cube{−1, 1} K as the codewords for bipolar logistic outputs. Ordinary unidirectional backpropagation trained the networks. (a) shows the classification accuracy of the neural classifiers trained on the CIFAR-100 dataset with K = 100 where each model used 5-hidden layers with 512 neurons each. (c) shows the classification accuracy of the neural classifiers trained on the Caltech-256 dataset with K = 256 where each model used 7-hidden layers with 1,024 neurons each. method for the hidden layers [236]. A dropout value of 0.9 for the non-visible hidden layers reduced overfitting. We did not use a dropout with the visible hidden layers. The neural classifiers differed when trained on the Caltech-256 dataset. We used 30,000 neurons at the input layer and K equals 256 of the deep classifiers trained on this dataset. All the models trained on Caltech-256 used 1,024 neurons per hidden layer with the ReLU activation. We varied the value of code length M for the models with the bipolar logistic activation such that log 2 256≤ M≤ 256. We trained some classifiers with the ordinary BP and others with the deep-sweep method. The deep neural classifiers used 30,000 input neurons and M output neurons. Dropout pruned all the non-visible hidden layers with a dropout value of 0.8. We did not use a dropout with the visible hidden layers. 5.3.2.2 Results and Discussion Table 5.6 compares the effect of the output activations on the classification accuracy of deep neural classifiers. It shows that the logistic activations outperformed the softmax activation. We used theK-dimensional basis vectors as the codewords. 
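The network description above translates into roughly the following PyTorch sketch: a stack of 512-unit ReLU hidden layers with dropout, M bipolar logistic output neurons, and the transformed double cross-entropy error (5.14). The layer count, code length M, and dropout setting are illustrative assumptions; note also that PyTorch's dropout argument is a drop probability, while the "dropout value of 0.9" in the text may denote a keep probability.

```python
import torch
import torch.nn as nn

class BipolarLogisticClassifier(nn.Module):
    """Sketch of the non-convolutional CIFAR-100 classifiers described above:
    3,072 inputs, ReLU hidden layers of width 512 with dropout, and M bipolar
    logistic output neurons 2*sigmoid(x) - 1. Sizes here are illustrative."""
    def __init__(self, in_dim=3072, hidden=512, n_hidden=5, M=40, drop_p=0.1):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_hidden):
            # drop_p is PyTorch's drop probability (assumed value).
            layers += [nn.Linear(d, hidden), nn.ReLU(), nn.Dropout(drop_p)]
            d = hidden
        layers += [nn.Linear(d, M)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return 2.0 * torch.sigmoid(self.net(x)) - 1.0   # bipolar logistic output (5.8)

def bipolar_double_cross_entropy(a, y, eps=1e-7):
    """Transformed double cross-entropy (5.14) for activations a and targets y in {-1, +1}."""
    term = (1 + y) * torch.log(1 + a + eps) + (1 - y) * torch.log(1 - a + eps)
    return -term.sum(dim=1).mean()

model = BipolarLogisticClassifier()
x = torch.randn(8, 3072)                                # toy flattened images
y = (torch.rand(8, 40) < 0.5).float() * 2 - 1           # toy bipolar codeword targets
loss = bipolar_double_cross_entropy(model(x), y)
loss.backward()
```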
Figure 5.6 shows the result from training neural classifiers with different configurations. The figure shows that the logistic activation outperformed the softmax in all the cases we tested. We used the random coding method from Algorithm 5.1 to search for bipolar logistic codewords. We varied the value of M and searched over 10,000 iterations for the best 107 Table 5.2: Output logistic activations outperformed softmax activation for the same number of output neurons. We used K binary basis vectors from the Boolean{0, 1} K as the codewords with softmax or binary logistic activations. We used K vertices of the bipolar hypercube{−1, 1} K as the codewords for bipolar logistic outputs. Ordinary backpropagation trained the classifiers. K = 100 for the CIFAR-100 dataset and K = 256 for the Caltech-256 dataset. Dataset Hidden Layers Classification accuracy Softmax Binary Logistic Bipolar Logistic CIFAR-10 3 layers 30.38% 32.92% 32.65% 5 layers 29.04% 32.47% 32.19% 7 layers 27.80% 29.64% 29.89% 9 layers 26.58% 26.77% 27.47% Caltech-256 3 layers 17.01% 19.43% 19.06% 5 layers 15.82% 18.59% 17.93% 7 layers 15.19% 18.08% 17.61% 9 layers 15.84% 17.16% 17.25% Table 5.3: Random bipolar logistic coding scheme with neural classifiers: The classifiers trained with random bipolar codewords from Algorithm 5.1 and used 5 hidden layers per model. We used code length M = 30 with the CIFAR-100 dataset and code length M = 80 with the Caltech-256 dataset. We used probability p to pick M samples with replacement from{−1, 1} when choosing the codewords. The mean µ c of the similarity measure decreased as p increased from 0 to 0.5. The classification accuracy increased as the value µ c decreased for a fixed value of M. Dataset Probability p Mean µ c Accuracy CIFAR-100 0.1000 16.9915 16.12% 0.1250 14.6145 17.90% 0.1875 9.9794 20.27% 0.2125 8.2638 21.08% 0.2750 5.6162 23.52% 0.3500 4.4537 24.13% 0.5000 4.1923 24.86% Caltech-256 0.1125 45.969 9.392% 0.1875 28.867 10.18% 0.2125 24.382 10.42% 0.2250 22.205 11.10% 0.2500 18.293 12.53% 0.3500 8.7057 14.24% 0.5000 7.0074 15.97% code C ∗ with the minimum mean µ ∗ c . Figure 5.2 displays different sets of bipolar random codewords from Algorithm 5.1 with p = 0.5 and K = 100. The codewords came from the 108 16.99 12.49 7.79 4.19 Mean μ c of codeword similarity 16 18 20 22 24 26 Classification accuracy (a) CIFAR-100: Code length M = 30 4 6 8 10 12 14 16 Mean μ c of codeword similarity 16 18 20 22 24 Classification accuracy M = 30, K = 100 (b) CIFAR-100: Code length M = 30 45.97 24.38 12.19 7.01 Mean μ c of codeword similarity 9 11 13 15 17 Classification accuracy (c) Caltech-256: Code length M = 80 5 10 15 20 25 30 35 40 45 50 Mean μ c of codeword similarity 10 12 14 16 Classification accuracy M = 80, K = 256 (d) Caltech-256: Code length M = 80 Figure 5.7: Random bipolar coding with neural classifiers: Classification accuracy fell with an increase in the mean µ c of inter-codeword similarity measure for a fixed code length M. The trained neural classifiers used 5 hidden layers with 512 neurons each and had code length M = 30 on the CIFAR-100 dataset. The trained neural classifiers used 5 hidden layers with 1,024 neurons each and had code length M = 80 on the Caltech-256 dataset. The random coding method in Algorithm 5.1 picked the codewords. We compared the effect of µ c on the classification accuracy. (a) shows the accuracy when training the classifiers with the codewords from Algorithm 5.1. 
(b) shows that the accuracy decreased with an increase in µ c for a fixed code length M = 30. (c) shows the accuracy when training the classifiers with the codewords from Algorithm 5.1. (d) shows that the accuracy decreased with an increase in µ c for a fixed code length M = 80. bipolar Boolean cube{−1, 1} M . It shows the respective bipolar codewords for code length 20, 60, and 100 using algorithm 5.1. It also shows the bipolar basis vector with K = 100 from{−1, 1} 100 . Table 5.3 shows that decreasing the mean µ c of code C increases the classification accuracy of the classifiers trained with the codewords. This is true when the length M of codewords is such that M <K. We also found the best set of codewords with p = 0.5. Figure 5.7 also supports this. Table 5.10 shows that logistic networks can achieve high accuracy with small values of M. The table shows that the random codewords can achieve a comparable classification 109 M = 10 M = 40 M = 100 20 22 24 26 28 Classification accuracy 5 layers 7 layers 9 layers (a) CIFAR-100 0 20 40 60 80 100 Training Epoch 0 5 10 15 20 25 30 Classification accuracy M = 10 M = 40 M = 100 (b) CIFAR-100 M = 10 M = 80 M = 200 8 10 12 14 16 Classification accuracy 5 layers 7 layers 9 layers (c) Caltech-256 0 20 40 60 80 100 Training Epoch 0 4 8 12 16 Classification accuracy M = 10 M = 80 M = 200 (d) Caltech-256 Figure 5.8: Random bipolar coding and ordinary BP: Algorithm 5.1 picked K codewords from {−1, 1} M . The marginal increase in classification accuracy with an increase in the code length M decreased as M approached K. (a) shows the classification accuracy of the deep neural classifiers trained with the random bipolar coding (Algorithm 5.1). (b) shows the classification accuracy of the neural classifiers with 5 hidden layers. The accuracy increased by 8.31% with an increase from M = 10 to M = 40 and the accuracy increased by 0.61% with an increase from M = 40 to M = 100. (c) shows the classification accuracy of the deep neural classifiers trained with codewords generated with random bipolar coding. (d) shows the classification accuracy of neural classifiers with 5 hidden layers. The accuracy increased by 4.92% with an increase from M = 10 to M = 80 and the accuracy increased by 0.40% with an increase from M = 80 to M = 200. accuracy with a small code lengthM relative to the accuracy from training with the softmax output activation using K binary basis vectors from{0, 1} K as the codewords. It took M = 40 = 0.4K to get between 88%− 90% of the classification accuracy from using the softmax activation with M =K = 100 on the CIFAR-100 dataset. It took M = 80< 0.32K to get between 84%− 101% of the classification accuracy from using the softmax output activation (withM =K = 256) on the Caltech-256 dataset. The random codes withM = 80 outperformed the softmax activation with M = 256 for neural classifiers with 5 or 7 hidden layers. Figure 5.8 shows that the marginal increase in classification accuracy with an increase in the code length M decreases as M approaches K. Table 5.4 shows the benefit of training deep neural classifiers with the deep-sweep method 110 Table 5.4: Deep-sweep versus ordinary backpropagation with basis vectors as the codewords: We compared the effect of the algorithms on the classification accuracy of the classifiers. We used the bipolar basis vectors from{−1, 1} K as the codewords. Deep-sweep method outperformed the ordinary BP with deep neural classifiers. 
The deep-sweep benefit increased with an increase in the depth of the classifiers. No. of No. of Layers per Deep-sweep Classification Accuracy Layers Blocks Block CIFAR-100 Caltech 256 5 layers 1 5 No 32.65 % 19.06 % 2 3 Yes 30.62 % 18.59% 7 layers 1 7 No 32.19 % 17.93% 2 4 Yes 32.69 % 20.11 % 3 3 Yes 27.95 % 16.25 % 9 layers 1 9 No 29.89 % 17.61% 2 5 Yes 32.20 % 19.75 % 11 layers 1 11 No 27.47 % 17.25 % 2 6 Yes 30.68 % 18.77 % 13 layers 1 13 No 25.55 % 16.40 % 3 5 Yes 30.76 % 18.47 % in Algorithm 5.2. The deep-sweep training method reduces both the vanishing-gradient and slow-start problem. Simulations showed that the deep-sweep method improved the classification accuracy of deep neural classifiers. The deep-sweep benefit increases as the depth of the classifier increases. Table 5.5 shows the relationship between the accuracy and the block size with the deep-sweep method. The relationship follows an inverted U-shape with a fixed number of blocks B. We also compared the effect of using the deep-sweep method and Algorithm 5.1 to pick the codewords. Figure 5.9 shows that the deep-sweep and random coding method with M = 40 = 0.4K outperformed training with the 100 basis vectors as the codewords (with softmax output activation) without the deep-sweep. We used the CIFAR-100 dataset with K = 100 in this case. We also found the same trend with the models we trained on the Caltech-256 dataset. The combination of the deep-sweep and random coding method with M = 80< 0.32K outperformed training with basis vectors from{0, 1} K as the codewords (with softmax output activation) with the ordinary BP. 5.3.3 Logistic Coding, Deep-sweep, and Bidirectional Backpropagation This subsection reports the results from the performance of non-convolutional neural classi- fiers. The networks trained with the bidirectional BP algorithm. 111 0 20 40 60 80 100 Code Length M 18 20 22 24 26 Classification accuracy Baseline 1 block (No deep-sweep) 2 blocks (With deep-sweep) (a) CIFAR-100: Unidirectional BP 0 50 100 150 200 250 Code Length M 11 12 13 14 15 16 17 Classification accuracy Baseline 1 block (No deep-sweep) 2 blocks (With deep-sweep) (b) Caltech-256: Unidirectional BP Figure 5.9: Deep-sweep training with the random bipolar code search and (M <K) outperformed the baseline. The baseline is training with the combination of ordinary BP and softmax activation with the binary basis vectors from{0, 1} K . (a) shows the performance of deep neural classifiers with 9 hidden layers and trained with the ordinary BP (no deep-sweep). It also show the performance of a 2-block network with 5 hidden layers per block and trained with the deep-sweep method. (b) shows the performance of deep neural classifiers with 11 hidden layers and the ordinary BP (no deep-sweep). It also show the performance of a 2-block network with 6 hidden layers per block and trained with the deep-sweep method. 0.1 0.2 0.3 0.4 0.5 Sampling probability p 12 15 18 21 24 Classification accuracy Binary logistic Bipolar logistic (a) Sampling probability p 20 40 60 80 Code length M 12 15 18 21 24 27 Classification accuracy 3 hidden layers 5 hidden layers (b) Code length M Figure 5.10: Random logistic codewords with BBP: We trained deep neural classifiers with logistic codewords from Algorithm 5.1. This shows the effect of sampling probability p on the performance of the models we trained on the random logistic codewords. We also compared the effect of code length M <K on the performance of the model. We used 5 hidden neurons. (a). 
The accuracy of the model increases as p increase from 0.0 to 0.5 (b). The accuracy increases as the code length M increases over log 2 K≤M <K. 5.3.3.1 Network Description We trained several deep neural classifiers on the CIFAR-100 dataset. The classifiers used 3,072 input neurons and K = 100 if they trained on the CIFAR-100 data. The classifiers that trained on the CIFAR-100 dataset used 512 neurons per hidden layer. The non-visible hidden neurons used ReLU activations of the form a(x) = max(0,x). The visible hidden 112 Table 5.5: Finding the best block size with the deep-sweep algorithm. We trained deep neural classifiers with the bipolar basis vectors from {−1, 1} K as the codewords. The relationship between the classification accuracy and the block size with a fixed number of blocks B follows an inverted U-shape. No. of blocks No. of layers Classification Accuracy per block CIFAR-100 Caltech 256 2 blocks 3 layers 30.62 % 18.59 % 4 layers 32.69 % 20.11 % 5 layers 32.20 % 19.75 % 6 layers 30.68 % 18.77 % 3 blocks 3 layers 27.95 % 16.25 % 4 layers 31.84 % 18.76 % 5 layers 30.76 % 18.47 % 4 blocks 3 layers 27.94% 13.96 % 4 layers 28.45% 16.97% 5 layers 25.80% 16.64 % 5 blocks 3 layers 26.28 % 11.79 % 4 layers 29.69 % 15.06 % 5 layers 23.15 % 14.47 % layer and output layer used the same type of activation (softmax, binary logistic, or bipolar logistic). We trained some of the classifiers with B-BP [6], [8] and then further enhanced the performance by using B-BP with deep-sweep. We also studied the effect of different coding methods (1-in-K and logistic) on the performance of deep-neural classifiers. We used dropout pruning method for the hidden layers [236]. We used a dropout value of 0.9 for the non-visible hidden layers to reduce overfitting. 5.3.3.2 Results and Discussion Figure 5.4 shows the modular architecture for training a deep block neural network with B-BP and deep-sweep. The figure shows a network with 5 blocks. Each block j has a forward log-likelihood L (j) f and a backward log-likelihood L (j) b . The deep-sweep training involves two stages: pre-training and fine-tuning. The pre-training stage trains each block j as a BAM with B-BP. This involves the joint optimization of L (j) f and L (j) b with respect to the block parameter Θ (j) [8]. We stacked all the B blocks together by using an identity matrix I to connect contiguous blocks after the pre-training. The fine-tuning stage jointly optimizes the complete foward likelihood L f and the complete backward likelihood L b . Table 5.6 shows that B-BP with logistic activations outperformed using B-BP with softmax output activation. Figure 5.10 shows the relationship the sampling probability p for 113 Table 5.6: Logistic activation outperformed softmax. We also found that the output logistic activations outperformed softmax activations for the same number of output neurons. We used K binary basis vectors from the Boolean{0, 1} K as the codewords with softmax or binary logistic activations. We used K bipolar basis vectors from the bipolar cube{−1, 1} K as the codewords for bipolar logistic outputs. No. of Hidden Layers Classification accuracy Softmax Binary Logistic Bipolar Logistic 3 layers 29.85% 32.04% 32.14% 5 layers 28.40% 31.88% 32.04% 7 layers 28.35% 29.95% 29.29% 9 layers 25.78% 28.88% 28.08% picking the random codewords using Algorithm 5.1 and the B-BP training performance. The accuracy increases as p from 0.0 to 0.5. Table 5.7 also shows the same trend. 
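Earlier in this subsection the pre-training stage trains each block as a bidirectional associative memory with B-BP. A minimal PyTorch sketch of one such block and its joint error may help: the backward pass reuses the transposed forward weights, and one gradient step minimizes the joint block error E_f + αE_b of (5.33). The layer sizes, the value of α, the squared-error stand-in for the activation-matched block errors, and the toy data are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalBlock(nn.Module):
    """Sketch of one bidirectional block: the backward pass runs the same
    weights in reverse (transposed), as in the B-BP block pre-training."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim)
        self.W2 = nn.Linear(hid_dim, out_dim)

    def forward(self, x):                        # forward pass N(x)
        return torch.tanh(self.W2(F.relu(self.W1(x))))

    def backward_pass(self, y):                  # backward pass N^T(y) with transposed weights
        h = F.relu(F.linear(y, self.W2.weight.t()))
        return F.linear(h, self.W1.weight.t())

block = BidirectionalBlock(in_dim=3072, hid_dim=512, out_dim=40)
x = torch.randn(8, 3072)
y = (torch.rand(8, 40) < 0.5).float() * 2 - 1    # toy bipolar codeword targets

alpha = 1.0                                      # backward-error weight (assumed value)
E_f = F.mse_loss(block(x), y)                    # stand-in forward error
E_b = F.mse_loss(block.backward_pass(y), x)      # stand-in backward error
(E_f + alpha * E_b).backward()                   # joint block error as in (5.33)
```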
The random binary logistic codewords are from the linear transformation of the bipolar codewords (using Algorithm 5.1) from{−1, 1} M to{0, 1} M . It also compared the effect of code length M on the performance of B-BP. Table 5.11 shows that the deep-sweep approach further improved the performance of random logistic codewords with the B-BP algorithm. Table 5.8 shows the benefit of the deep-sweep training with the B-BP method. The blocking allows signal to flow freely and the benefit is more pronounced with very deep neural classifiers. Table 5.9 shows the best block size for different configurations of deep neural classifiers. Table 5.7: Random logistic coding scheme with deep neural classifiers and B-BP training: We set M = 40 and varied the value of p from 0.1 to 0.5. We trained the neural classifiers with 5 hidden layers and B-BP. The performance of the codewords from Algorithm 5.1 increases as the sampling probability p increases. Probability p Mean µ c Classification Accuracy Binary Logistic Bipolar Logistic 0.1000 16.9915 14.14% 13.83% 0.1125 15.9067 14.77% 14.41% 0.1250 14.6145 15.54% 15.68% 0.1625 11.8335 17.23% 16.93% 0.2000 9.1051 17.45% 17.32% 0.2125 8.2638 18.94% 18.50% 0.2250 7.7931 20.65% 19.86% 0.2750 6.4323 21.20% 21.02% 0.3000 5.0485 21.74% 21.77% 0.4000 4.2364 23.47% 23.34% 114 Table 5.8: Deep-sweep with bidirectional BP: Deep-sweep training improved the performance of B-BP training with deep neural classifiers and basis vectors as the codewords. We compared the effect of Algorithm 5.3 on the classification accuracy of deep neural classifiers. We used the bipolar basis vectors as the codewords. B-BP with deep-sweep outperformed B-BP without deep-sweep. The deep-sweep benefit increased with an increase in the depth of the classifiers. No. of No. of Layers per Deep-sweep Classification Accuracy Softmax Binary Bipolar Layers Blocks Block Logistic Logistic 5 layers 1 5 No 29.84% 31.84% 31.63% 2 3 Yes 29.27% 28.92% 28.26% 7 layers 1 7 No 26.74% 31.09% 30.58% 2 4 Yes 30.44% 31.35% 31.05% 3 3 Yes 28.62% 28.00% 28.78% 9 layers 1 9 No 26.34% 28.20% 28.64% 2 5 Yes 28.65% 31.34% 31.44% 11 layers 1 11 No 24.68% 25.69% 26.69% 2 6 Yes 27.38% 30.90% 29.95% 13 layers 1 13 No 22.48% 23.14% 23.52% 3 5 Yes 28.57% 30.94% 29.53% 10 20 30 40 50 60 70 80 90 Code-length M 18 20 22 24 26 Classification accuracy Random binary and deep-sweep (M <K) Bipolar logistic w/out deep-swwep (M =K) Softmax w/out deep-swwep (M =K) (a) Binary coding: Deep-sweep and B-BP 10 20 30 40 50 60 70 80 90 Code-length M 18 20 22 24 26 Classification accuracy Random bipolar and deep-sweep (M <K) Bipolar logistic w/out deep-swwep (M =K) Softmax w/out deep-swwep (M =K) (b) Bipolar coding: Deep-sweep and B-BP Figure 5.11: Deep-sweep, logistic coding, and bidirectional backpropagation: Combining B-BP with the deep-sweep method improves the performance of a small capacity deep neural classifiers with M <K. Deep-sweep and B-BP reduces the capacity of deep neural classifiers without suffering a drop in performance. We used 11 hidden layers with the classifiers. 5.3.4 Logistic Coding and Convolutional Neural Classifiers This subsection reports the performance of logistic codewords with convolutional neural classifiers. 115 Table 5.9: Finding the best block size with logistic codewords, deep-sweep, and B-BP algorithm. We trained deep neural classifiers with basis vectors . We used {−1, 1} K as the codewords with bipolar logistic and{0, 1} K as the codewords with binary logistic. 
The relationship between the classification accuracy and the block size with a fixed number of blocks B follows an inverted U-shape. No. of blocks No. of layers Classification Accuracy per block Binary Logistic Bipolar Logistic 2 blocks 3 layers 28.10% 28.47% 4 layers 31.35% 31.03% 5 layers 31.34% 31.44% 6 layers 30.10% 29.95% 3 blocks 3 layers 28.78% 28.00% 4 layers 31.56% 30.16% 5 layers 30.94% 29.53% 6 layers 29.23% 27.66% 4 blocks 3 layers 28.29% 27.20% 4 layers 28.77% 27.70% 5 layers 29.34% 26.97% 5 blocks 3 layers 26.96% 26.12% 4 layers 29.05 % 26.18 % 5 layers 27.50% 24.14 % 5.3.4.1 Network Description We used ResNet-50 model architecture [107]. This model is made up of 50 layers. We used a pre-trained version of the ResNet-50 model. The pre-trained version trained on ImageNet [224] dataset with 1,000 classes. We used the convolutional filters from the pre-trained model to extract features. We resized the image from the CIFAR-10 and Caltech-256 data sets so as to match the input size (224×224×3) of the ReNet-50 model. We added two fully connected layers between the convolutional layers and the output layer. The first fully connected layers used 2,048 rectified linear unit (ReLU) neurons. The second fully connected layers used 1,024 ReLU neurons. We trained the fully connected layers to adapt to the CIFAR-100 and Caltech-256 data sets. We also considered the ResNet-50 without pre-training. This involved training the convolutional filters and the fully connected layers from scratch with a random weight initialization. 5.3.4.2 Result and Discussion Table 5.11 shows that logistic output neurons with 1-in-K coding outperformed softmax output neurons on the two test cases. 116 10 30 50 70 90 CodelengthM 64 67 70 73 76 Classification Accuracy CIFAR-100 (With pre-training) Softmax & 1-in- K Logistic & 1-in- K Random logistic (a) CIFAR-100: With pre-training 10 30 50 70 90 CodelengthM 40 50 60 70 Classification Accuracy CIFAR-100 (Without pre-training) Softmax & 1-in- K Logistic & 1-in- K Random logistic (b) CIFAR-100: No pre-training 10 50 100 150 200 250 CodelengthM 65 70 75 80 85 Classification Accuracy Caltech-256 (With pre-training) Softmax & 1-in- K Logistic & 1-in- K Random logistic (c) Caltech-256: With pre-training 10 50 100 150 200 250 CodelengthM 30 35 40 45 50 Classification Accuracy Caltech-256 (Without pre-training) Softmax & 1-in- K Logistic & 1-in- K Random logistic (d) Caltech-256: No pre-training Figure 5.12: Pre-training on Bigger-K improved logistic coding: This shows the performance of ResNet-50 models trained on CIFAR-100 and Caltech-256 data sets. The pre-trained models trained on the ImageNet dataset with K = 1, 000. Then we fine-tuned these models to K = 100 for the CIFAR-100 and K = 256 for the Caltech-256. The random logistic codewords with code length M < K outperform the using 1-in-K coding with M = K. (a) Logistic codewords with M≥ 30 outperformed the 1-in-K codewords. The models trained on the CIFAR-100. (b) Logistic codewords with M <K performed worse that the 1-in-K codewords. The models trained from scratch on the CIFAR-100. (c) Logistic codewords with M≥ 50 outperformed the 1-in-K codewords. The models trained on the Caltech-256 (d) Logistic codewords with M <K performed worse that the 1-in-K codewords. The models trained from scratch on the Caltech-256. Figure 5.12 shows that convolutional neural classifiers with pre-trained initialization improved the performance of logistic codewords. 
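A hedged sketch of the transfer setup described above, assuming a recent torchvision with packaged ImageNet weights: the pretrained ResNet-50 convolutional filters stay frozen as a feature extractor, and the 1,000-way ImageNet head is replaced by the two ReLU fully connected layers and M binary logistic outputs. The code length M and the freezing choice are illustrative readings of the text, not a definitive reproduction.

```python
import torch
import torch.nn as nn
from torchvision import models

M = 80                                               # illustrative code length
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for p in backbone.parameters():                      # freeze the pretrained filters
    p.requires_grad = False

backbone.fc = nn.Sequential(                         # replace the 1000-way ImageNet head
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, M), nn.Sigmoid(),                # binary logistic output neurons
)

x = torch.randn(4, 3, 224, 224)                      # images resized to 224x224x3
print(backbone(x).shape)                             # (4, M) logistic activations
```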
The pre-trained models trained on a bigger dataset with a bigger number of classes. The logistic codewords with M≥ 30 outperformed the using the 1-in-K codeword with the CIFAR-100 dataset. The trend also held in the case of the Caltech-256 dataset. Training the models from scratch did not provide such benefit. The logistic codewords with M < K did not outperform the 1-in-K with the models that trained from with a 117 Table 5.10: Using the bipolar codewords with small codeword length and logistic outputs gave a classifier accuracy comparable to that of using softmax outputs and K binary basis vectors from {0, 1} K . The deep neural classifiers trained with bipolar codewords from Algorithm 5.1 on the CIFAR- 100 and Caltech-256 datasets. We compared the performance of these classifiers to the accuracy of the models trained withK-basis vectors and softmax activations (from Table 5.6). Training models on the CIFAR-100 dataset with bipolar codewords of length M = 40 = 0.4K achieved between 88%− 90% of the accuracy obtained from using 100 binary basis vectors and softmax outputs. Training models on Caltech-256 dataset with bipolar codewords of length M = 80 = 0.3125K achieved between 84%-101% of the accuracy obtained from using the 256 binary basis vectors and softmax outputs (from Table 5.6). It outperformed softmax activations in some cases with the Caltech-256 dataset. Dataset Code Length M Best mean µ c Classification Accuracy 3 Hidden 5 Hidden 7 Hidden Layers Layers Layers CIFAR-100 8 2.1156 16.68% 17.55% 17.44% 10 2.3733 19.63% 20.28% 19.57% 20 3.3774 24.24% 23.43% 23.08% 30 4.1923 25.80% 24.86% 23.78% 40 4.8158 26.98% 25.86% 24.82% 50 5.3830 27.61% 26.33% 25.03% 60 5.9079 27.63% 26.52% 25.57% 70 6.3766 27.78% 26.82% 25.23% 80 6.8505 27.62% 26.41% 25.14% 90 7.2509 27.86% 26.84% 25.26% 100 7.6444 27.80% 26.47% 25.01% Caltech-256 10 2.4287 9.78% 11.05% 11.03% 20 3.4642 12.56% 13.78% 13.70% 50 5.5320 13.82% 15.63% 14.99% 80 7.0074 14.28% 15.97% 15.43% 100 7.8234 14.13% 15.94% 15.42% 150 9.5934 14.36% 16.22% 15.39% 200 11.101 14.36% 16.25% 16.06% 250 12.401 14.06% 16.36% 16.70% Table 5.11: Logistic output activation improved convolutional neural classifiers. The codewords are the K unit-bit vectors from{0, 1} K . Dataset Activation Classification Accuracy CIFAR-10 Softmax 71.44% Binary logistic 72.30% CIFAR-10 Softmax 76.48% Binary logistic 78.08% 118 random initialization. This held for both the CIFAR-100 and Caltech-256 data sets. 5.4 Conclusion Modern deep classifier networks do not scale up well for large numbers K of learned patterns. A key part of the problem is the naive use of 1-in-K encoding to represent the K patterns as unit bit vectors. These unit bit vectors are orthogonal basis vectors because they are just the K vertices of the probability simplex embedded in the K-dimensional unit hypercube [0, 1] K . But such encoding ignores the remaining 2 K −K bit-vector vertices of the hypercube. Randomly sampling from these remaining vectors tends to produce approximately orthogonal code words. Logistic output neurons naturally encode these code words. This allows both higher-capacity or “Big K” classifiers and the application of the new bidirectional backpropagation algorithm. Future work can improve these results by exploring new schemes for approximately uncorrelated codewords and other types of hidden neurons. 
Logistic output neurons with random coding allow a given deep neural classifier to encode and accurately detect more patterns than a network with the same number of softmax output neurons. The logistic output layer of a neural block uses length-M code words with log 2 K≤M <K. Algorithm 5.1 gives a simple way to randomly pick K reasonably separated bipolar codewords with a small code length M. Many other algorithms may work as well or better. Each block has so few hidden layers that there was no problem of vanishing gradients. The network instead achieved depth by adding more blocks. Deep-sweep training further outperformed ordinary backpropagation with deep neural classifiers. The benefit of logistic coding and deep-sweep training also extends to the new bidirectional backpropagation algorithm. 119 Chapter 6 Deeper Neural Networks: Non-Vanishing (NoVa) Hidden Units This chapter introduces the new NoVa or nonvanishing hidden neuron for deep neural networks. The NoVa neuron perturbs the logistic sigmoid neuron so that its derivative does not vanish. This additive perturbation helps mitigate the vexing problem of “vanishing gradients” in deep networks that train with some form of backpropagation. Simulations show that NoVa hidden neurons tend to give better classification accuracy than ReLU (rectified linear unit) hidden neurons in very deep classifiers trained on a large number K of patterns. The next sections demonstrate how NoVa neurons perform in deep classifiers and extend them to the more generalized perturbed neurons that we call G-NoVa neurons. 6.1 The Search for a Better Hidden Neuron What it is the best neural activation function a :R→R to use in the hidden layers of a deep neural network? This chapter presents the new hybrid nonvanishing or NoVa neuron as a candidate. The current answer or commonest hidden neuron appears to be the threshold-linear neuron [150] that dates from the 1960s (see Equation (2) in both Fukushima papers [75] and [76]): a(x) = ReLU(x) = max{0,x} = 0 x≤ 0 x x> 0 (6.1) 120 and now called the ReLU (rectified linear unit) neuron [87], [157]. Goodfellow [89] even states that the ReLU is the “default recommendation” for the choice of hidden neuron in a deep feedforward classifier. Figure 6.1a shows the ramp-shaped graph of a ReLU neuron. The default answer in the 1980s and 1990s was the logistic sigmoid σ in Figure 1(b): a(x) =σ(bx) = 1 1 +e −bx (6.2) for some steepness constant b> 0. A large b turns the logistic’s soft threshold into a hard threshold as in a classical on-off or threshold neuron. So the logistic activation seemed to have it both ways: It could easily model a threshold neuron and yet it was smoothly differentiable as all modern learning algorithms required. The logistic is also bounded because it lies in the unit interval [0, 1]. So it naturally defines a probability and indeed describes the posterior for two-class Bayesian decisions. A bipolar logistic results from the scaled and translated binary logistic 2σ(bx)− 1 and lies in the bipolar interval [−1, 1]. The logistic activation is differentiable. It has a simple closed-form and nonnegative derivative a ′ ≥ 0: a ′ (x) = da dx =bσ(bx)[1−σ(bx)] (6.3) for steepness parameter b> 0. It is just this product form (6.3) that has led so many neural engineers to abandon the logistic neuron as a viable hidden unit in deep networks. The derivative quickly approaches zero as the neuron saturates to either its upper bound 1 or its lower bound 0. 
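A short numerical check makes this saturation concrete. The chain rule multiplies one factor of the form (6.3) per hidden logistic layer, so the gradient through a stack of saturated logistic layers shrinks roughly geometrically. The steepness and depth below are illustrative choices only.

```python
import numpy as np

def logistic(x, b=1.0):
    return 1.0 / (1.0 + np.exp(-b * x))

def logistic_deriv(x, b=1.0):
    s = logistic(x, b)
    return b * s * (1.0 - s)      # eq. (6.3): at most b/4 and near 0 once saturated

for x in [0.0, 2.0, 4.0, 8.0]:
    print(f"x = {x:4.1f}   sigma'(x) = {logistic_deriv(x):.6f}")

# Chain-rule factor through 10 saturated logistic layers (illustrative):
print(logistic_deriv(4.0) ** 10)   # about 3e-18: the gradient has effectively vanished
```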
Steep logistic sigmoid quickly saturate because they approximate on-off thresholds so well. This saturation leads to a “vanishing gradient” in the many learning algorithms that use the chain rule in computing the gradient of an error function. The learning gradient tends to vanish quickly as the number of hidden logistic layers grows. There has been a number of attempts at fixing the flaws of ReLU and logistic sigmoid neurons. We now review some of the main hidden neurons in deep learning including the modifications of ReLU and logistic sigmoid. 6.1.1 Review of Old Hidden Neurons: Activation Function and Derivative This subsection reviews the main hidden activations in use. The neurons work for both ordi- nary unidirectional backpropagaiton and for the more general bidirectional backpropagation training algorithms. The term a(x) denotes the activation of input x and a ′ (x) denotes the corresponding derivation. This notation holds for the rest of this chapter. 121 −6 −4 −2 0 2 4 6 x 0 2 4 6 a(x) (a) Rectified Linear Unit (ReLU) −6 −4 −2 0 2 4 6 x 0.0 0.2 0.4 0.6 0.8 1.0 a(x) c = 0.5 c = 1.0 c = 2.0 (b) Logistic Sigmoid −6 −4 −2 0 2 4 6 x 0 2 4 6 8 10 a(x) b = 0.5 b = 1.0 b = 1.5 (c) Swish −6 −4 −2 0 2 4 6 x −2 0 2 4 6 a(x) c = 0.2 c = 0.5 (d) Leaky Rectified Linear Unit Figure 6.1: Activation functions for multilayer neural networks: threshold linear or rectified linear unit ReLU, logistic sigmoid, leaky rectified linear unit (LReLU), and swish. 6.1.1.1 Logistic Sigmoid The sigmoidal logistic activation acts as a smooth or soft threshold and so has served as a model neuron for decades. The logistic endows a neural network with proven nonlinearity approximation power [56] and has a simple closed-form derivative. But it suffers from the problem of vanishing gradient [106] precisely because of the form of its derivative. The logistic activation function involves a ratio of exponentials and in fact describes two-hypothesis Bayesian classification as in simple logistic regression (compared with the more general softmax output neuron that describes multi-class Bayesian classification or so-called multinomial regression). The logistic has the ratio form a(x) =σ(cx) = 1 1 +e −cx (6.4) 122 −6 −4 −2 0 2 4 6 x 0.0 0.2 0.4 0.6 0.8 1.0 a ′ (x) (a) Rectified Linear Unit −6 −4 −2 0 2 4 6 x 0.0 0.2 0.4 0.6 a ′ (x) c = 0.5 c = 1.0 c = 2.0 (b) Logistic Sigmoid −6 −4 −2 0 2 4 6 x 0.0 0.4 0.8 1.2 1.6 a ′ (x) b = 0.5 b = 1.0 b = 1.5 (c) Swish −6 −4 −2 0 2 4 6 x 0.2 0.4 0.6 0.8 1.0 a ′ (x) c = 0.2 c = 0.5 (d) Leaky Rectified Linear Unit Figure 6.2: Derivative of activation functions for multilayer neural networks: threshold linear or rectified linear unit ReLU, logistic sigmoid, Swish, nonvanishing (NoVa) logistic, and the leaky rectified linear unit (LReLU). with derivative a ′ (x) = dσ(cx) dx =cσ(cx)(1−σ(cx)) = ce −cx 1 +e −cx 2 . (6.5) So the derivative vanishes for extreme values or their machine-word equivalents: a ′ (x) = 0 if σ = 0 or σ = 1. Figures 6.1b and 6.2b show the respective activation and derivative for a logistic sigmoid neuron. The logistic activation is smooth everywhere and indeed is a diffeomorphism but its gradient “vanishes” for deep neural networks. 123 6.1.1.2 Swish The swish activation is a scaled logistic [216]: a(x) =xσ(bx) = x 1 +e −(bx) (6.6) and the corresponding derivative is a ′ (x) = d xσ(bx) dx =σ(bx) 1 + (bx)(1−σ(bx)) = 1 +e −(bx) 1 + (bx) 1 +e −(bx) 2 . (6.7) Figures 6.1c and 6.2c show the respective activation and derivative of a swish neuron. 
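The closed form in (6.7) is easy to verify numerically against a central finite difference; the test points and step size below are arbitrary.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, b=1.0):
    return x * sigma(b * x)                       # eq. (6.6)

def swish_deriv(x, b=1.0):
    s = sigma(b * x)
    return s * (1.0 + b * x * (1.0 - s))          # closed form of eq. (6.7)

x = np.linspace(-6.0, 6.0, 13)
h = 1e-6
numeric = (swish(x + h) - swish(x - h)) / (2 * h)  # central finite difference
print(np.max(np.abs(numeric - swish_deriv(x))))    # the closed form matches the finite difference
```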
6.1.1.3 Leaky and Ordinary Rectified Linear Units The leaky ReLU or LReLU activation modifies the threshold-linear structure of the ordinary ReLU activation. It uses the identity function a(x) =x on the positive domain and scales the negative domain by c≥ 0: a(x) =LReLU(x) = cx, x≤ 0 x, x> 0 (6.8) with derivative a ′ (x) = d LReLU(x) dx = c, x< 0 1, x> 0 undefined, x == 0 . (6.9) The leaky ReLU uses c> 0 [180]. The leaky ReLU’s nonlinear approximation power is low because this function is the identity function over the positive domain and is scaled linear over its negative domain. The derivative of the leaky ReLU is not defined at x = 0. Figures 6.1d and 6.2d show the respective activation function and derivative of a leaky ReLU neuron. Settingc = 0 gives the non-leaky or ordinary rectified linear unit or ReLU. This threshold- linear unit truncates or rectifies the negative domain by setting the function equal to zero there [10], [87], [197]. The derivative of ReLU equals zero over the negative domain. Deep neural networks with ReLU suffers from dying neurons [10], [177], [245]. Figures 6.1a and 6.2a show the respective activation function and derivative of a non-leaky ReLU neuron. ReLU activation applies to tasks in speech recognition [180], computer vision, and other areas. 124 −3 −2 −1 0 1 2 3 x −2 0 2 a(x) c = 1.0,b = 5.0 c = 0.1,b = 3.0 c = 0.5,b = 3.0 (a) NoVa neuron: Activation function −3 −2 −1 0 1 2 3 x 1 2 3 4 a ′ (x) c = 1.0,b = 5.0 c = 0.1,b = 3.0 c = 0.5,b = 3.0 (b) NoVa neuron: Derivative Figure 6.3: Activation function and derivative of NoVa units: This presents the representation of the nonvanishing (NoVa) units. (a) The activation function of NoVa units. (b) The derivative of NoVa units. 6.2 Nonvanishing (NoVa) Units We now introduce the nonvanishing or NoVa unit as a generalization of both the logistic and ReLU neurons. The NoVa neuron is a family of parameterized neurons. A simple example would be any scaled sum of a ReLU and a logistic. We instead focus on the NoVa neuron as a sum of a scaled linear or identity neuron and a logistic: a(x) =cx +σ(bx) = 1 +cx(1 +e −bx ) 1 +e −bx (6.10) for scalar input x and where σ is the logistic sigmoid in (6.2). We assume that c > 0 and again that b > 0. Figure 1(c) shows the hybrid nature of three NoVa curves. Simulations found that the particular choice a(x) = 0.5x +σ(3x) gave the best classification accuracy on deep multilayer classifiers trained on a large number K of patterns. The crucial property of the NoVa neuron is that its derivative cannot vanish if c> 0: a ′ (x) =c +bσ(bx) 1−σ(bx) =c + b e −(bx) 1 +e −(bx) 2 ≥ c> 0. (6.11) This property justifies the name “nonvanishing”. It also shows that the NoVa neuron is a type of perturbed logistic neuron where the constantc> 0 controls the degree of perturbation. An ordinary logistic σ results when there is no perturbation and thus when c = 0. Figure 6.3 shows the activation function and derivative of NoVa units. 125 −6 −4 −2 0 2 4 6 x −3 0 3 6 9 a(x) c = 0.1,b = 1.0 c = 0.3,b = 1.0 c = 0.5,b = 2.0 (a) G-NoVa neuron: Activation function −6 −4 −2 0 2 4 6 x 0.0 0.6 1.2 1.8 2.4 a ′ (x) c = 0.1,b = 1.0 c = 0.3,b = 1.0 c = 0.5,b = 2.0 (b) G-NoVa neuron: Derivative Figure 6.4: Activation function and derivative of G-NoVa units: This presents the representation of the generalized nonvanishing (G-NoVa) units. (a) The activation function of G-NoVa units. (b) The derivative of G-NoVa units. 
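Before turning to the generalized unit, a minimal sketch of the NoVa activation (6.10) and its derivative (6.11) confirms the lower bound a′(x) ≥ c. The values c = 0.5 and b = 3 match the best-performing choice reported above.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def nova(x, c=0.5, b=3.0):
    """NoVa activation a(x) = c*x + sigma(b*x), eq. (6.10)."""
    return c * x + sigma(b * x)

def nova_deriv(x, c=0.5, b=3.0):
    """NoVa derivative a'(x) = c + b*sigma(b*x)*(1 - sigma(b*x)) >= c, eq. (6.11)."""
    s = sigma(b * x)
    return c + b * s * (1.0 - s)

x = np.linspace(-10.0, 10.0, 2001)
print(nova_deriv(x).min())         # never falls below c = 0.5: the gradient cannot vanish
print(nova_deriv(x, c=0.0).min())  # c = 0 recovers the ordinary logistic slope, which does vanish
```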
6.2.1 Generalized Nonvanishing (G-NoVa) Units The generalized nonvanishing (G-NoVa) unit is the generalized extension of the NoVa unit The new G-NoVa activation function is an additively and multiiplicatively perturbed logistic sigmoid: a(x) =cx +xσ(bx) = cx(1 +x +e −bx ) 1 +e −bx (6.12) with derivative a ′ (x) =c +σ(bx) 1 +bx(1−σ(bx)) (6.13) =c + 1 +e −(bx) 1 + (bx) 1 +e −(bx) 2 (6.14) withc≥ 0 andb≥ 0. G-NoVa simplifies to swish function with c = 0 andb> 0. It simplifies to a linear function with c≥ 0 and b = 0. Figures 6.4a and 6.4b show the respective activation and derivative of G-NoVa neurons. 6.3 Simulation and Results This simulation compared the performance of the six types of hidden neurons (ReLU, logistic sigmoid, leaky ReLU, swish, NoVa, and G-NoVa) on the deep neural classifiers that trained on three data sets. These data sets are the CIFAR-10 image set and the much larger (“Big-K”) image sets CIFAR-100 and Caltech-256. The simulation compared the neurons with the ordinary backpropagation (BP) training and the new bidirectional backpropagation 126 (a) CIFAR-10 samples (b) CIFAR-100 samples (c) Caltech-256 samples Figure 6.5: Samples images from the data sets. (b) shows sample images from CIFAR-10 image set. The 10 samples show one image per class. (c) shows sample images from CIFAR- 100 image set. The 100 samples show one image per class. (d) shows sample images from Caltech-256 image set. The 256 samples show one image per class. (B-BP) training. G-NoVa networks tended to perform best in classification accuracy while logistic networks performed worst for very deep networks trained on large-K image sets. This held for both 127 Table 6.1: Experimental Dataset: The summarized description of the CIFAR-10, CIFAR-100, and Caltech-256 datasets. Dataset Training Set Testing Set Number of Classes CIFAR-10 50,000 10,000 10 CIFAR-100 50,000 10,000 100 Caltech-256 23,824 5,956 256 ordinary BP training and B-BP training. The logistic networks suffered quickly from the predicted vanishing gradient [113] for neural classifiers with only a few hidden layers. ReLU networks did better but also died [10], [177], [245]. A surprising finding was that hidden layers with simple identity hidden neurons a(x) =x often performed quite well and tended to easily outperform ReLU networks in deeper classifiers. NoVa networks performed best in the deepest networks. The neural classifiers consisted of several layers. All input layers used identity neurons as data registers. All output layers used K softmax classifier neurons. The hidden neurons were either all ReLU (or identity) or all logistic or all NoVa. NoVa networks still performed best for the deeper classifiers. The next section describes the dataset for the experiments. 6.3.1 Dataset The simulated deep classifiers used three image datasets. The first was the CIFAR-10 dataset and the second its extension to CIFAR-100. The third was the Caltech-256 image dataset. Table 6.1 shows the sample distributions of these image datasets 6.3.1.1 CIFAR-10 The CIFAR-10 test set consists of 60,000 color images from 10 categories (K = 10). Each image has size 32× 32× 3. The 10 pattern categories are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck [156]. Each class consists of 5,000 training samples and 1,000 testing samples. Figure 6.5a shows sample images with one image per class. 6.3.1.2 CIFAR-100 CIFAR-100 dataset is a set of 60,000 color images with image size 32× 32× 3. The images are from 100 pattern classes with 600 images per class. 
Each class is made up of 500 training images and 100 testing images. Figure 6.5b shows sample images with one image per class. 128 Table 6.2: Classification accuracy: Training deep neural classifiers with unidirectional BP algorithm and bidirectional BP algorithm on CIFAR-10 image dataset. This shows the performance with different hidden activations. The G-NoVa units performed best. Training method Activation 3 Layers 11 Layers 21 Layers Unidirectional BP Sigmoid 41.11% 10.00% 10.00% Linear 35.24% 34.57% 34.87% ReLU 50.36% 46.82% 26.74% Leaky ReLU 49.61% 47.88% 29.22% Swish 48.54% 17.04% 10.00% NoVa 48.08% 48.16% 46.05% G-NoVa 51.80% 52.59% 51.55% Bidirectional BP Sigmoid 33.81% 10.00% 10.00% Linear 42.09% 41.96% 41.96% ReLU 57.67% 54.71% 34.89% Leaky ReLU 57.52% 56.99% 38.83% Swish 55.30% 27.54% 10.00% NoVa 55.63% 58.91% 53.38% G-NoVa 57.79% 58.91% 57.59% 6.3.1.3 Caltech-256 This dataset has 30,607 images from 256 pattern classes. Each class has 80 or more images. The 256 classes consist of the two superclasses animate and inanimate. The animate superclass contains 69 pattern classes. The inanimate superclass contains 187 pattern classes [94]. We removed the cluttered images and reduced the size of the dataset to 29,780 images. This removal reduced this dataset to 23,824 training images and 5,956 test images. The images had different dimensions. We resized each image to 100 ×100×3. Figure 6.5c shows sample images with one image per class. 6.3.2 Network Description and Training Parameters The deep neural classifiers trained on CIFAR-10, CIFAR-100, and Caltech-256 with ordinary and bidirectional backpropagation. Each classifier network used 500 neurons per hidden layer and used softmax output activations. We varied the size of the hidden layers and the type of hidden activation. The depth or number of hidden layers varied as the values in{1, 3, 5, 7,..., 21}. The competing hidden activations were logistic, linear, ReLU, leaky ReLU, swish, NoVa, and G-NoVa. 129 1 5 9 13 17 21 Number of hidden layers 0 10 20 30 40 50 60 Classification accuracy Logistic Linear ReLU L-ReLU Swish NoVa G-NoVa (a) Performance: CIFAR-10 1 5 9 13 17 21 Number of hidden layers 0 5 10 15 20 25 Classification accuracy Logistic Linear ReLU L-ReLU Swish NoVa G-NoVa (b) Performance: CIFAR-100 1 5 9 13 17 21 Number of hidden layers 0 5 10 15 20 Classification accuracy Logistic Linear ReLU L-ReLU Swish NoVa G-NoVa (c) Performance: Caltech-256 Figure 6.6: Benefit of G-NoVa with deep neural classifiers using unidirectional or ordinary BP: The neural classifiers trained over 100 epochs with 500 neurons per hidden layer. G-NoVa a(x) = cx+xσ(bx) withc = 0.5 andb = 1.0 outperformed other activation functions whereσ denotes logistic sigmoid. The G-NoVa benefit increases as the depth of the neural classifier increases. Unidirectional BP trained by minimizing the cross entropy at the output layer of softmax neurons. B-BP used the same activation and error function at the output layer but minimized the squared error in the backward direction because the terminal input neurons have identity activations and so define a backward regressor. The neural classifiers trained over 100 epochs with stochastic gradient descent with learning rate α = 0.001 and batch size B = 64. 6.3.3 Results and Discussion Figure 6.7 shows the classification accuracy curves from the simulation on the three datasets using B-BP training algorithm. G-NoVa also outperformed other hidden activation functions in this case. 
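The following sketch shows the classifier family just described: an identity input register, a configurable number of 500-neuron hidden layers with a selectable hidden activation, and K output neurons whose softmax is folded into the cross-entropy loss during training. It assumes a PyTorch setup, which this dissertation does not prescribe, and shows only the G-NoVa case with the reported c = 0.5 and b = 1.0.

```python
import torch
import torch.nn as nn

class GNoVa(nn.Module):
    """Generalized nonvanishing unit: a(x) = c*x + x*sigmoid(b*x), eq. (6.12)."""
    def __init__(self, c=0.5, b=1.0):
        super().__init__()
        self.c, self.b = c, b
    def forward(self, x):
        return self.c * x + x * torch.sigmoid(self.b * x)

def make_classifier(in_dim, num_classes, hidden_layers, width=500, act=GNoVa):
    layers, d = [], in_dim
    for _ in range(hidden_layers):
        layers += [nn.Linear(d, width), act()]   # one 500-neuron hidden layer
        d = width
    layers += [nn.Linear(d, num_classes)]        # softmax supplied by the cross-entropy loss
    return nn.Sequential(*layers)

# Example: a 21-hidden-layer CIFAR-10 classifier (input 32*32*3 = 3072, K = 10).
net = make_classifier(3072, 10, hidden_layers=21)
logits = net(torch.randn(64, 3072))              # batch of 64 flattened images
print(logits.shape)                              # torch.Size([64, 10])
```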
The models that used G-NoVa significantly outperformed leaky ReLU and other activations. G-NoVa benefit grows as the number of hidden layers increases. The G-NoVa benefit also tends to increase with big- K dataset images such as CIFAR-100 and Caltech-256 where we have K = 100 and K = 256 respectively. Tables 6.2 - 6.4 also show the same trend. 130 1 5 9 13 17 21 Number of hidden layers 0 10 20 30 40 50 60 Classification accuracy Logistic Linear ReLU L-ReLU Swish NoVa G-NoVa (a) Performance: CIFAR-10 1 5 9 13 17 21 Number of hidden layers 0 5 10 15 20 25 30 Classification accuracy Logistic Linear ReLU L-ReLU Swish NoVa G-NoVa (b) Performance: CIFAR-100 1 5 9 13 17 21 Number of hidden layers 0 5 10 15 20 25 Classification accuracy Logistic Linear ReLU L-ReLU Swish NoVa G-NoVa (c) Performance: Caltech-256 Figure 6.7: Benefit of G-NoVa with deep neural classifiers using bidirectional BP: The deep neural classifiers trained over 100 epochs using 500 neurons per hidden layer. G-NoVa outperformed other activation functions. G-NoVa a(x) = cx +xσ(bx) with c = 0.5 and b = 1.0 outperformed other activation functions where σ denotes logistic sigmoid. The benefit of using G-NoVa increases as the depth of neural classifier network increases. Figure 6.6 shows the classification-accuracy curves from the simulation on the three datasets: CIFAR-10, CIFAR-100, and Caltech-256. The neural classifiers trained with ordinary unidirectional BP. The models that used G-NoVa significantly outperformed leaky ReLU and other activations. The benefit of G-NoVa grew as the depth of the network increased. The consistent performance of G-NoVa is similar to what we noticed with the linear activation. The G-NoVa benefit tended to increase with big- K dataset images such as CIFAR-100 and Caltech-256 with respective K = 100 and K = 256. Tables 6.2 - 6.4 show the same winning trend for G-NoVa networks. 6.4 Conclusion The new generalized nonvanishing (GNoVa) logistic hidden activation is a generalized perturbed logistic sigmoid whose derivative does not vanish. Detailed simulations showed 131 Table 6.3: Classification accuracy: Training deep neural classifiers with unidirectional BP algorithm and bidirectional BP algorithm on CIFAR-100 image dataset. This shows the performance with different hidden activations. The G-NoVa units performed best. Training method Activation 3 Layers 11 Layers 21 Layers Unidirectional BP Sigmoid 11.46% 1.00% 1.00% Linear 14.25% 13.86% 14.07% ReLU 20.60% 12.19% 2.00% Leaky ReLU 20.29% 15.82% 1.75% Swish 21.01% 2.36% 1.00% NoVa 19.60% 19.86% 18.25% G-NoVa 24.86% 25.93% 24.41% Bidirectional BP Sigmoid 6.31% 1.00% 1.00% Linear 19.37% 18.97% 17.99% ReLU 27.06% 17.07% 2.17% Leaky ReLU 26.90% 22.07% 3.13% Swish 26.13% 1.00% 1.00% NoVa 25.01% 25.48% 18.21% G-NoVa 29.08% 31.46% 28.69% Table 6.4: Classification accuracy: Training deep neural classifiers with unidirectional BP algorithm and bidirectional BP algorithm on Caltech-256 image dataset. This shows the performance with different hidden activations. The G-NoVa units performed best. 
Training method Activation 3 Layers 11 Layers 21 Layers Unidirectional BP Sigmoid 8.46% 2.72% 2.72% Linear 12.59% 12.29% 11.85% ReLU 15.98% 11.82% 7.22% Leaky ReLU 16.17% 15.68% 12.73% Swish 16.32% 15.78% 12.53% NoVa 14.71% 15.14% 13.62% G-NoVa 17.33% 17.90% 16.75% Bidirectional BP Sigmoid 8.38% 2.72% 2.72% Linear 16.62% 15.26% 13.70% ReLU 20.63% 13.80% 6.09% Leaky ReLU 20.53% 17.53% 6.98% Swish 19.98% 8.68% 2.72% NoVa 18.23% 19.51% 14.74% G-NoVa 23.81% 24.17% 21.21% 132 that it outperformed the simpler NoVa hidden neuron as well as the popular ReLU activation and its main variants. This classifier performance held up for very deep networks and for both ordinary unidirectional as well as the newer bidirectional forms of backpropagation gradient training. 133 Chapter 7 Noise-Boosted Recurrent Backpropagation This chapter shows how noise can improve recurrent backpropagation (RBP) that trains on time-varying signals or patterns. The noise boost speeds up training and improves performance for both neural classifiers and regressors. The noise itself is not the ordinary dither or blind white noise. Such blind or annealing noise has appeared for decades in neural learning algorithms [33], [217], [218]. Minsky even mentioned noise boosting in his 1961 AI overview on the challenges of random hill climbing [188]. The noise is instead the same type of noise that boosts the Expectation Maximization (EM) algorithm as it iteratively climbs the nearest hill of log-likelihood [20], [61], [154], [203]. This is just the noise perturbation that makes the current training signal more probable on average. It applies to the BP algorithm because the BP algorithm is itself is a special case of the EM algorithm [20], [154]. The next sections show how such noisy EM or “NEM” noise improves RBP at little extra cost. Simulations demonstrate this RBP-NEM benefit on video classification and on time-series regression of financial currency data. 7.1 BackpropagationandtheExpectation-Maximization(EM) Algorithm Backpropagation (BP) is the benchmark algorithm for training deep neural networks [133], [164], [223]. EM is the standard algorithm for finding maximum-likelihood parameters in the case of missing data or hidden parameters [61]. BP finds the network weights and parameters that minimize some error or other performance measure. It equivalently maximizes the 134 network likelihood and this permits the EM connection. The Noisy EM (NEM) Theorem [199], [200], [203] gives a sufficient condition for noise to speed up the average convergence of the EM algorithm. It holds for additive or multiplicative or any other type of noise injection [200]. The NEM Theorem ensures that NEM noise improves training with the EM algorithm by ensuring that at each iteration NEM takes a larger step on average up the nearest hill of probability or log-likelihood. The BP algorithm is a form of maximum likelihood estimation [34] and a special case of generalized EM algorithm. So NEM noise also improves the convergence of BP [20]. This remaining part of this section states the recent result that backpropagation is a special case of the generalized Expectation-Maximization (EM) algorithm and that careful noise injection can always speed up EM convergence. It also states the companion result that carefully chosen noise can speed convergence of the EM on average and thereby noise-boost BP. 
7.1.1 A Review of Backpropagation as Generalized EM BP-as-EM theorem casts BP in terms of the maximum likelihood structure that explicitly relies on hidden parameters. Both classification and regression have the same BP (EM) learning laws but they differ in the type of neurons in their output layer. The output neurons of a classifier network are softmax or Gibbs neurons. The output neurons of a regression network are identity neurons. The output layer of a classifier network assumes a multinomial distribution. But the regression case assumes that the output training vectors are the vector population means of a multidimensional normal probability density with an identity population covariance matrix. Theorem 7.1 restate the recent theorem that BP is a special case of the generalized EM (GEM) algorithm [20]. Theorem 7.1. [Backpropagation as Generalized Expectation Maximization]: The backpropagation update equation for a differentiable likelihood function p(y|x, Θ) at epoch n Θ n+1 = Θ n +η▽ Θ log p(y|x, Θ) Θ=Θ n (7.1) equals the GEM update equation at epoch n Θ n+1 = Θ n +η▽ Θ Q(Θ|Θ n ) Θ=Θ n (7.2) where GEM uses the differentiable Q-function Q(Θ|Θ n ) =E h|y,x,Θ n h log p(h, y|x, Θ) i . (7.3) 135 The master gradient equation∇ logp(y|x, Θ n ) =∇Q(Θ n |Θ n ) from (7.1) and (7.2) does require comment to see the connection to backpropagation. The right-hand side∇Q(Θ n |Θ n ) describes the gradient step of the GEM algorithm. The gradient∇ logp(y|x, Θ n ) on the left- hand-side depends on the network’s probability density p(y|x, Θ n ). It thus depends on the network structure. This structure depends on the output layer configuration (classification or regression) and the architecture of the hidden units. Classification typically picks the network probability p(y|x, Θ n ) as a one-shot multinomial or categorical distribution forK output patterns or categories. So a pass through the network corresponds to the roll of a K-sided die. This structure arises in part from the 1-in-K encoding of theK target patterns or videos asK unit bit vectors. It also arises from the use of softmax output neurons and a cross-entropy performance measure. Taking the logarithm of the one-roll multinomial probabilities gives the log-likelihood as the negative cross-entropy. So minimizing the cross-entropy maximizes the network likelihood. A direct calculation shows that the gradient of this cross-entropy gives back precisely the backpropagation learning laws [155]. The same backpropagation learning laws result for a regression network if the network’s probability p(y|x, Θ n ) is a multidimensional Gaussian probability and the output neurons have identity activation functions. Therefore its log-likelihood is proportional to the negative summed squared error if the covariance matrix is diagonal. Then minimizing the summed squared error again maximizes the network likelihood. Different network structures will give back the same invariant backpropagation learning laws if the user picks the appropriate output activation functions and performance measures. 7.1.2 Noise-Boosting the EM Algorithm The Noisy EM Theorem shows that noise injection will speed up the EM algorithm on average if the noise obeys the NEM positivity condition [200], [203]. The Noisy EM Theorem for additive noise states that a NEM noise benefit holds on average at each iteration n if the following positivity condition holds: E x,h,N|Θ ∗ log p(x + N, h|Θ n ) p(x, h|Θ n ) ≥ 0. 
(7.4) Then the surrogate-likelihood EM noise benefit Q(Θ n |Θ ∗ )≤Q N (Θ n |Θ ∗ ) (7.5) 136 holds on average at iteration n: E x,N|Θ n h Q(Θ n |Θ ∗ )−Q N (Θ n |Θ ∗ ) i ≤E x|Θ n h Q(Θ ∗ |Θ ∗ )−Q(Θ n |Θ ∗ ) i (7.6) where Θ ∗ denotes the maximum-likelihood vector of parameters. The NEM positivity condition (7.4) has a simple form for Gaussian mixture models [19] and for classification and regression networks [20]. The intuition behind the NEM sufficient condition (7.4) is that some noise realizations n make a signal x more probable: f(x + n|Θ)≥f(x|Θ). (7.7) Taking logarithms gives the key log-likelihood ratio: log f(x + n|Θ) f(x|Θ) ≥ 0. (7.8) Taking expectations gives a NEM-like positivity condition. The proof of the NEM Theorem uses Kullback-Liebler divergence to show that the noise-boosted likelihood is closer on average at each iteration to the optimal likelihood function than is the noiseless likelihood [200]. The pointwise NEM inequality (7.7) also helps explain the observed NEM-based accuracy improvement in the simulated recurrent neural networks below. Consider the case of a classifier neural network N :R n → [0, 1] n with the usual 1-in-K encoding for the network’s K output neurons with softmax or Gibbs activation. We assign each output training vector y to exactly one of the K unit-bit basis vectors [1, 0, 0,..., 0], [0, 1, 0,..., 0], ..., [0, 0,..., 1] based on the class membership of the corresponding input pattern vector x. So we assign the input vector x from the k th pattern class to the k th output basis vector that has a 1 in the k th slot and has 0s elsewhere. The K softmax output neurons in (7.59) produce a discrete length-K probability vector a y ∈ [0, 1] K . The softmax output neurons give rise to an output probability p(y|x, Θ n ) = Q K k=1 p j (y j |x, Θ n ) as a one-shot multinomial probability density in accord with BP invariance. Then (7.7) gives on average a NEM accuracy boost p j (y j +n j |x, Θ n )≥p j (y j |x, Θ n ). (7.9) An input x from category j leads to a correct classification if and only if the corresponding output activation is such that a y j ≥ a y i for all i where again the non-negative softmax activation a y k sum to unity for each input vector x. So the NEM-boost (7.9) from a probability-increasing noise realizationN j =n j can only increase the chance that the output 137 activation a (y)NEM k wins the output competition for input x on average. The normalization effect of the softmax units likewise can only decrease the other K− 1 activations on average. A NEM-based accuracy argument also follows from the likelihood structure of the NEM sufficient condition in [155] We also observe that the NEM positivity inequality in (7.4) is not vacuous. This holds because the expectation in (7.4) conditions on the converged parameter vector Θ ∗ rather than on the current estimated parameter vector Θ n . Vacuity would result in the usual case of averaging a log-likelihood ratio. Take the expectation of the log-likelihood ratio log f(x|Θ) g(x|Θ) with respect to the probability density functiong(x|Θ) to giveE g h log f(x|Θ) g(x|Θ) i . Then Jensen’s inequality and the concavity of the logarithm imply that E g log f(x|Θ) g(x|Θ) ≤ logE g f(x|Θ) g(x|Θ) (7.10) = log Z x f(x|Θ) g(x|Θ) g(x|Θ) dx (7.11) = log Z x f(x|Θ) dx (7.12) = log 1 (7.13) = 0. (7.14) So E g h log f(x|θ) g(x|θ) i ≤ 0 holds. 
But the expectation in (7.4) does not in general lead to this cancellation of probability densities because the integrating density in (7.4) depends on the optimal maximum-likelihood parameter Θ ∗ rather than on just Θ n [200]. So density cancellation occurs only when the NEM algorithm has converged to a local likelihood maximum because then Θ n = Θ ∗ . The NEM theorem simplifies for a classifier network with K softmax output neurons. Then the additive noise must lie above the defining NEM hyperplane [20]. A similar NEM result holds for regression except that then the noise-benefit region defines a hypersphere. NEM noise can also inject into the hidden neurons. § 7.3 and § 7.4 show the respective extension of these NEM results for recurrent classifiers and regression models 7.2 Recurrent Neural Networks A recurrent neural network (RNN) is a neural network whose directed graph contains cycles or feedback. This feedback allows an RNN to sustain an internal memory state that can hold and reveal information about recent inputs. So the RNN output depends not only on the input but on the recent output and on the internal memory state. This internal dynamical structure makes RNNs suitable for time-varying tasks such as speech processing 138 [92], [164], video recognition [66], [273], and stock price prediction [164]. RBP is the most common method for training RNNs [114], [261]. 7.2.1 Long-Short-Term Memory (LSTM) Recurrent Neural Networks The popular Long Short-Term Memory (LSTM) network is a deep RNN that can process time-varying data [114]. LSTM RNNs have outperformed many other methods in areas such as natural language processing, speech recognition, and handwriting recognition [12]–[14]. A single LSTM unit consists of I input neurons, J hidden units, and K output neurons. LSTM uses control gates at the input, the hidden, and the output layers. An input gate controls the input. A forget gate controls the hidden cells. An output gate controls the output [114]. The LSTM network captures the time dependency of the input data where the time index t takes values in {1, 2, ....., T}. 139 The parameters for an LSTM network are as follows: h a x(t) i i = a x(t) ∈R I : input activation at time t h a γ(t) j i = a γ(t) ∈ (0, 1) J : input gate activation at time t h a ψ(t) j i = a ψ(t) ∈ (0, 1) J : forget gate activation at time t h a ω(t) j i = a ω(t) ∈ (0, 1) J : output gate activation at time t h a c(t) j i = a c(t) ∈ (−1, 1) J : hidden unit activation at time t h a s(t) j i = a s(t) ∈ (−1, 1) J : internal cell activation at time t h a h(t) j i = a h(t) ∈ (−1, 1) J : hidden state at time t h a y(t) k i = a y(t) ∈R K : output activation at time t h w x ji i = W x ∈R J×I : connects the input x (t) to a c(t) h w γ ji i = W γ ∈R J×I : connects the input x (t) to the a γ(t) h w ψ ji i = W ψ ∈R J×I : connects the input x (t) to the a ψ(t) h w ω ji i = W ω ∈R J×I : connects the input x (t) to the a ω(t) h v h jl i = V h ∈R J×J : connects a h(t−1) to a s(t) h v γ jl i = V γ ∈R J×J : connects a h(t−1) to a γ(t) h v ψ jl i = V ψ ∈R J×J : connects a h(t−1) to a ψ(t) h v ω jl i = V ω ∈R J×J : connects a h(t−1) to a ω(t) h u y jk i = U y ∈R J×K : connects a h(t) to a y(t) h b x j i = b x ∈R J : bias for the hidden cell h b γ j i = b γ ∈R J : bias for the input gate h b ψ j i = b ψ ∈R J : bias for the forget gate h b ω j i = b ω ∈R J : bias for the output gate h b y k i = b y ∈R K : bias for the output unit The initial hidden state a h(0) = 0 and initial internal cell s (0) = 0. 
140 7.2.1.1 LSTM: Forward Pass This forward pass propagates the time-varying input x (t) T t=1 over the network from the input layer through the gates and hidden units to the output layer. The choice of output activation depends on the network architecture of the output layer. The output neurons uses softmax activation if the network is a classifier while the output activation for a regression model is an identity function [34], [68]. The input uses an identity neuron so the activation a x(t) i of the i th input neuron at time t is: a x(t) i =x (t) i . (7.15) The activation a c(t) j of the hidden unit follows from the following equations o c(t) j = I X i=1 w x ji a x(t) i + J X l=1 v h jl a h(t−1) j +b x j (7.16) a c(t) j = tanh o c(t) j = exp o c(t) j − exp −o c(t) j exp o c(t) j + exp −o c(t) j . (7.17) The following equations show the respective activations a γ(t) j , a ϕ (t) j , and a ω(t) j of the input, the forget, and the output gates: o γ(t) j = I X i=1 w γ ji a x(t) i + J X l=1 v γ jl a h(t−1) j +b γ j (7.18) a γ(t) j =σ o γ(t) j = 1 1 + exp −o γ(t) j (7.19) o ϕ (t) j = I X i=1 w ϕ ji a x(t) i + J X l=1 v ϕ jl a h(t−1) j +b ϕ j (7.20) a ϕ (t) j =σ o ϕ (t) j = 1 1 + exp −o ϕ (t) j (7.21) o ω(t) j = I X i=1 w ω ji a x(t) i + J X l=1 v ω jl a h(t−1) j +b ω j (7.22) a ω(t) j =σ o ω(t) j = 1 1 + exp −o ω(t) j . (7.23) 141 The activation a s(t) j of the internal cell comes from the following equations s (t) j = a c(t) j a γ(t) j + s (t−1) j a ϕ (t) j (7.24) a s(t) j = tanh s (t) j = exp s (t) j − exp −s (t) j exp s (t) j + exp −s (t) j . (7.25) The j th hidden state a h(t) j of the j th at time t is a h(t) j =a s(t) j a ω(t) j (7.26) and the input o k y(t) of the k th output neuron at time t is o y(t) k = J X j=1 u x kj a h(t) j +b y k . (7.27) The output activation depends on the structure of the network. The output neurons use softmax activations for a classification network [20], [34]. So the k th output neuron a y(t) k at time t has the softmax form a y(t) k = exp(o y(t) k P K l=1 exp o y(t) l . (7.28) The softmax ratio structure produces a length-K output probability vector. So the softmax output’s normalization structure enforces a winner-take-all structure a y(t) k = 1 if any output neuron achieves a maximal activation. The output layer uses identity activation for a regression network. The activation is a y(t) k =o y(t) k . (7.29) 7.2.2 Gated Recurrent Unit (GRU) Recurrent Neural Networks An alternative RNN architecture is the Gated Recurrent Unit (GRU) network. A GRU network resembles an LSTM network because it also uses a gating mechanism. But a GRU network does so with fewer parameters than does an LSTM network [136]. This makes GRU training faster than LSTM training. A GRU network merges the output and forget gates into a single gate. It also merges the hidden cell and memory unit into one unit. GRU performance is similar to that of LSTM in areas such as music modeling and speech recognition [48]. We develop noise-boosted versions of both LSTM and GRU RNNs. The simplest cases inject NEM noise additively or multiplicatively into only the output neurons. More complex versions inject NEM noise into the hidden units as well as into the output neurons. The GRU network captures the time dependency of the input data where the time 142 index t takes values in {1, 2, ....., T}. 
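Before listing the GRU parameters, the LSTM forward pass (7.15)-(7.27) just described can be summarized in a short NumPy sketch. The weight-dictionary layout and the toy dimensions are illustrative only; the output weights are stored here as a K-by-J array so that a single matrix product gives the K output inputs of (7.27).

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, s_prev, P):
    """One LSTM time step following (7.15)-(7.27)."""
    a_c   = np.tanh(P["Wx"] @ x + P["Vh"] @ h_prev + P["bx"])   # hidden unit, (7.16)-(7.17)
    a_gam = sigma(P["Wg"] @ x + P["Vg"] @ h_prev + P["bg"])     # input gate,  (7.18)-(7.19)
    a_phi = sigma(P["Wf"] @ x + P["Vf"] @ h_prev + P["bf"])     # forget gate, (7.20)-(7.21)
    a_om  = sigma(P["Wo"] @ x + P["Vo"] @ h_prev + P["bo"])     # output gate, (7.22)-(7.23)
    s = a_c * a_gam + s_prev * a_phi                            # internal cell, (7.24)
    h = np.tanh(s) * a_om                                       # hidden state, (7.25)-(7.26)
    o_y = P["U"] @ h + P["by"]                                  # output input, (7.27)
    return h, s, o_y

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()                                          # classifier output, (7.28)

# Toy run: I = 4 inputs, J = 3 hidden units, K = 2 outputs, T = 5 time slices.
rng = np.random.default_rng(0)
I, J, K, T = 4, 3, 2, 5
shapes = {"Wx": (J, I), "Wg": (J, I), "Wf": (J, I), "Wo": (J, I),
          "Vh": (J, J), "Vg": (J, J), "Vf": (J, J), "Vo": (J, J),
          "U": (K, J), "bx": J, "bg": J, "bf": J, "bo": J, "by": K}
P = {k: 0.1 * rng.standard_normal(s) for k, s in shapes.items()}
h, s = np.zeros(J), np.zeros(J)                                 # a_h(0) = 0 and s(0) = 0
for t in range(T):
    h, s, o_y = lstm_step(rng.standard_normal(I), h, s, P)
print(softmax(o_y))                                             # length-K probability vector
```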
The parameters for a GRU network are as follows: h a x(t) i i = a x(t) ∈R I : input activation at time t h a z(t) j i = a z(t) ∈ (0, 1) J : update gate activation at time t h a r(t) j i = a r(t) ∈ (0, 1) J : reset gate activation at time t h a c(t) j i = a c(t) ∈ (−1, 1) J : hidden unit activation at time t h a h(t) j i = a h(t) ∈ (−1, 1) J : hidden memory at time t h a y(t) k i = a y(t) ∈R K : output activation at time t h w x ji i = W x ∈R J×I : connects the input x (t) to a c(t) h w z ji i = W z ∈R J×I : connects the input x (t) to the a z(t) h w r ji i = W r ∈R J×I : connects the input x (t) to the a r(t) h v h jl i = V h ∈R J×J : connects a h(t−1) to a c(t) h v z jl i = V z ∈R J×J : connects a h(t−1) to a z(t) h v r jl i = V r ∈R J×J : connects a h(t−1) to a r(t) h u y jk i = U y ∈R J×K : connects a h(t) to a y(t) h b x j i = b x ∈R J : bias for the hidden unit h b z j i = b z ∈R J : bias for the update gate h b r j i = b r ∈R J : bias for the reset gate h b y k i = b y ∈R K : bias for the output unit The initial hidden state a h(0) = 0. Let us consider the forward pass of input vector x across an GRU-RNN. 7.2.2.1 GRU: Forward Pass Forward pass over an GRU RNN is the propagation of input x (t) from the input layer to the output layer through the hidden units. The input uses an identity activation so a x(t) i =x (t) i where x (t) i is the input of the i th input neuron at time t. The following equations show the respective activations a z(t) j and a r(t) j of the update and 143 the output gates o z(t) j = I X i=1 w z ji a x(t) i + J X l=1 v z jl a h(t−1) l +b z j (7.30) a z(t) j =σ o z(t) j = 1 1 + exp −o z(t) j (7.31) o r(t) j = I X i=1 w r ji a x(t) i + J X l=1 v r jl a h(t−1) l +b r j (7.32) a r(t) j =σ o r(t) j = 1 1 + exp −o r(t) j . (7.33) The activation a c(t) j of the hidden unit follows from the following equations o c(t) j = I X i=1 w x ji a x(t) i + J X l=1 v h jl a h(t−1) j +b x j (7.34) a c(t) j = tanh o c(t) j = exp o c(t) j − exp −o c(t) j exp o c(t) j + exp −o c(t) j . (7.35) The j th hidden state a h(t) j of the j th at time t is: a h(t) j = a c(t) j a z(t) j + 1−a z(t) j )a h(t−1) j (7.36) and (7.27) shows the input o k y(t) of the k th output neuron at time t is. The corresponding output activation is the same as (7.28) for a GRU classifier. Regression network uses identity activation at the output layer from (7.29). 7.2.3 Recurrent Backpropagation Training The RBP trains RNNs so as to iteratively minimize some error functionE(Θ). BP invariance implies that this is the same as iteratively maximizing the network log-likelihood L. The time-sampled version of the error function E(Θ) sums the error function’s corresponding time slices: E(Θ) = T X t=to E (t) (7.37) where E (t) is the output error at time t and t o ≤t≤T. RBP uses gradient descent or any of its variants to iteratively minimize E(Θ) with respect to the weights or parameters of the RNN [6]. The output-layer time-slice error E (t) equals the cross entropy for a classifier 144 network [34] and equals the squared error for a regression network at time t. The network’s total output log-likelihood L(Θ) takes the logarithm of the output conditional probability p(y|x, Θ) that itself factors in terms of both the time slices and the K output neurons where the output delay t o ∈{1,....,T}. 
E(Θ) = T X t=to E (t) (7.38) =− T X t=to K X k=1 y (t) k log a y(t) k (7.39) =− T X t=to log K Y k=1 p k (y (t) =y (t) k |x, Θ) (7.40) =− T X t=to log p(y (t) |x, Θ) (7.41) =− log T Y t=to p(y (t) |x, Θ) (7.42) =− log p(y|x, Θ) (7.43) =−L(Θ) (7.44) SominimizingtheerrorE(Θ)maximizesthelog-likelihoodL(Θ)forclassificationorregression for any other type of recurrent network whose global probability structure obeys BP invariance. BP invariance ensures that the gradient∇ Θ L gives back the same BP learning laws at a given layer. The output-layer probability density function p(y (t) |x, Θ) for classification is a categorical or one-shot multinomial probability density: y (t) ∼ Multinomial(a y(t) ). So the recurrent classifier’s total conditional probability at the output layer is p(y|x, Θ) = T Y t=to K Y k=1 (a y(t) k ) y (t) k . (7.45) The probability density function of y (t) for regression is the vector normal density: y (t) ∼ N (y (t) |a y(t) , Θ) with population mean vector a y(t) and identity covariance matrix I. So the 145 recurrent regression model’s total output conditional probability at the output layer is p(y|x, Θ) = T Y t=to N (y (t) |a y(t) , Θ) (7.46) = T Y t=1 1 (2π) K 2 p det(I) exp − ||a y(t) −y (t) || 2 2 . (7.47) This layer likelihood can also approximate the likelihood of a hidden layer of rectified-linear- unit or ReLU neurons because such activations have the truncated-identity form max(0,x). The likelihood function of a layer of logistic-sigmoid neurons has the from of a double cross entropy. 7.2.4 Recurrent Backpropagation and EM Connection Recurrentbackpropagationisatime-varyingformofBP.Itisaniterativemaximumlikelihood estimation that runs over time. The RBP solution Θ ∗ simplifies as follows: Θ ∗ = arg max Θ p(y|x, Θ) (7.48) = arg max Θ T Y t=1 p(y (t) |x, Θ) (7.49) = arg max Θ log T Y t=1 p(y (t) |x, Θ) (7.50) = arg max Θ T X t=1 L(Θ) (t) (7.51) where L(Θ) (t) = logp(y (t) |x, Θ) is the log-likelihood at time t. The update rule for RBP training after epoch n is Θ n+1 = Θ n +η∇ Θ L(Θ) Θ=Θ n (7.52) = Θ n +η∇ Θ T X t=1 L(Θ) (t) Θ=Θ n (7.53) = Θ n +η T X t=1 ∇ Θ L(Θ) (t) Θ=Θ n (7.54) where η is the learning rate. The RBP algorithm reduces to a time-varying form of EM. Theorem 7.1 applies to the 146 L(Θ) (t) so we have: ∇ Θ logp(y (t) |x, Θ) Θ=Θ n =∇ Θ L(Θ) (t) Θ=Θ n =∇ Θ Q(Θ|Θ n ) (t) Θ=Θ n (7.55) where the surrogate log-likelihood Q(Θ|Θ n ) (t) after epoch n is Q(Θ|Θ n ) (t) =E h|y (t) ,x,Θ n logp(h, y (t) |x, Θ) . (7.56) So the RBP update law (7.52) simplifies as follow: Θ n+1 = Θ n +η T X t=1 ∇ Θ L(Θ) (t) Θ=Θ n (7.57) = Θ n +η T X t=1 ∇ Θ Q(Θ|Θ (t) ) (t) Θ=Θ n . (7.58) This shows that RBP equals the time-varying form of EM. So the beneficial NEM noise injection applies to RBP algorithm. 7.3 NEM Noise Injection in Recurrent Network’s Output Neurons The sufficient condition for a NEM-noise benefit in a recurrent network depends on the output log-likelihood L . So it depends both on the error function and the structure of the output neurons. Noise injection in hidden layers treats such layers as modified output layers in the recursive training [155]. So the NEM sufficient condition for a given layer also depends on that layer’s log-likelihood and thus in turn on that layer’s error function and the activation structure of the neurons in that layer. This section derives the NEM noise-benefit conditions for noise injection into the output neurons of a recurrent classifier network and a recurrent regression network. 
The benefit condition for additive noise injection leads at once to a similar condition for multiplicative noise injection. This result generalizes to any measurable combination of the signal and noise because of the corresponding general result for noise-boosting [200]. 147 7.3.1 NEM Noise injection in Recurrent Neural Classifiers Recurrent neural classifiers also use Gibbs or softmax activation functions for their K output neurons: a y(t) k = exp o y(t) k P K l=1 exp o y(t) l (7.59) wherea y(t) k is the activation of thek th output neuron ando y(t) k is the input to thek th output neuron at time t. The classifier scheme also uses 1-in-K encoding for the time-varying patterns from the K pattern classes. So the ideal classifier maps a time-varying pattern from the k th pattern class to the k basis vector e k where the unit bit vector e k has a 1 in the k th slot and has 0s elsewhere. Let the time-varying signal or input x be a sequence x (t) T t=1 with T many ordered samples or time slices. The recurrent classifier’s error function E(Θ) sums the output cross entropy over the T time slices for a given input x: E(Θ) =− T X t=to K X k=1 y (t) k loga y(t) k (7.60) BP invariance requires that the recurrent classifier’s output conditional probability p(y|x, Θ) is a time-factored one-shot multinomial probability p(y|x, Θ) = Q T t=to Q K k=1 (a y(t) k ) y (t) k . Then the corresponding output log-likelihoodL is just the negative of the cross entropyE(Θ) from (7.38) - (7.44) and thus L(Θ) =−E(Θ). So RBP iteratively maximizes the log-likelihood function for a RNN classifier. 7.3.1.1 Additive NEM Noise Injection Theorem 7.2 presents the NEM theorem for an RNN classifier [6]. The theorem gives a sufficient condition for injecting beneficial additive noise in the classifier’s output neurons. Theorem 7.2. [NEM Noise Benefit for an RNN Classifier with Additive Noise Injection in the Output Neurons]: The NEM noise-benefit positivity condition (7.4) holds for the maximum-likelihood training of a classifier recurrent neural network with additive noise injection into its K output softmax neurons if the following hyperplane condition holds: E y,h,n|x,Θ ∗ h T X t=to (n (t) ) T log a y(t) i ≥ 0 (7.61) 148 where the additive noise n (t) injects into the output neurons and a y(t) is the output activation. Proof: Equations (7.38) - (7.44) show that the RNN classifier’s output cross entropy E(Θ) equals the sum of the negative log-likelihood functions over time t∈{t o , t o + 1,...,T} where t o ∈{1,....,T}. So p(y|x, Θ) = exp −E(Θ) . This shows again that minimizing the cross entropy E(Θ) maximizes the log-likelihood L(Θ). The NEM sufficient condition (7.4) simplifies to the product of exponentiated output activations over time t and each with K output neurons. Then additive noise injection gives p(y + n, h|x, Θ) p(y, h|x, Θ) = p(y + n, h|x, Θ) p(h|x, Θ) p(h|x, Θ) p(y, h|x, Θ) (7.62) = p(y + n|h, x, Θ) p(y|h, x, Θ) (7.63) = Q T t=to p(y (t) + n (t) |h, x, Θ) Q T t=to p(y (t) |h, x, Θ) (7.64) = T Y t=to Q K k=1 (a y(t) k ) n (t) k +y (t) k Q K k=1 (a y(t) k ) y (t) k (7.65) = T Y t=to K Y k=1 (a y(t) k ) n (t) k . (7.66) So the NEM positivity condition in (7.4) becomes E y,h,n|x,Θ ∗ log T Y t=to K Y k=1 (a y(t) k ) n (t) k ≥ 0. (7.67) This simplifies to the inequality E y,h,n|x,Θ ∗ T X t=to K X k=1 n (t) k log(a y(t) k ) ≥ 0 . (7.68) The inner sum is the inner product of n (t) with log a y(t) . Then we can rewrite (7.68) as the hyperplane inequality E y,h,n|x,Θ ∗ h T X t=to n (t) T log a y(t) i ≥ 0. 
(7.69) ■ Figure 7.1 shows the additive noise samples that inject in the output neurons of a classifier. The noise samples below the NEM hyperplane satisfies the NEM positivity condition in (7.4) with respect to a y = [0.6, 0.3, 0.1] and y = [1, 0, 0]. The additive NEM noise samples in this 149 Figure 7.1: Additive NEM-noise with neural classifiers: The additive NEM noise injection in the output neurons of a recurrent neural classifier. The noise samples below the NEM hyperplane satisfy the positivity condition. The output activation a y(t) = [0.6, 0.3, 0.1] and the target y (t) = [1, 0, 0]. The NEM noise comes from picking random samples from a Gaussian distribution with mean 0, identity covariance I, and satisfy the inequality in (7.61). case are the random samples from a Gaussian distribution 1 with mean 0, identity covariance I, and satisfy the corresponding inequality condition in (7.61) with respect to a y (t) and y (t) . The definition of the additive NEM condition is not restricted to only Gaussian distribution samples. The NEM noise can also inject multiplicatively into the output neurons of the recurrent classifier. 7.3.1.2 Multiplicative NEM Noise Injection Theorem 7.2 simplifies to the following corollary with the multiplicative NEM noise injection in the output neurons of recurrent neural classifiers. Corollary 7.1. Multiplicative NEM-Noise Injection into the Output Neurons of a Recurrent Classifier The NEM positivity condition (7.4) holds for the maximum-likelihood training of a recurrent classifier neural network with multiplicative noise injection in the output softmax neurons if the following condition holds: 1 This mean 0 is a vector of ones and the covariance I is an identity matrix. 150 E y,h,n|x,Θ ∗ h T X t=to ((n (t) − 1)◦ y (t) ) T log a y(t) i ≥ 0 (7.70) where n (t) is the NEM noise that multiplies the output neuron with the output activation a y(t) and◦ is the Hadamard product. The proof in this multiplicative case tracks the proof for additive NEM noise in a classifier’s output neurons from Theorem 7.2. The proof replaces p(y + n, h|x, Θ) with p(y◦ n, h|x, Θ) in (7.62) - (7.66) where y◦ n denotes the pairwise multiplication of the signal vector y and the noise vector n. The result gives the equivalent NEM positivity condition in (7.70). Figure 7.2 shows the multiplicative noise samples that inject in the output neurons of a neural classifier. The noise samples on left side of the NEM hyperplane satisfy the NEM positivity condition in (7.4) with respect to a y(t) = [0.6, 0.3, 0.1] and y (t) = [1, 0, 0]. The multiplicative NEM noise samples in this case are the random samples from a Gaussian distribution 2 with mean 1, identity covariance I, and satisfy the corresponding inequality condition in (7.70) with respect to a y and y. The definition of the multiplicative NEM condition is not restricted to only Gaussian distribution samples. 7.3.2 NEM Noise injection in Recurrent Regression Networks Regression networks approximate functions that take values in real vector spaces. A time- varying pattern defines a curve in such a space. Its sampled version defines a point in a finite product of such vector spaces. The regression approximation entails that the network’s output need not define a length- K probability vector in general. Each output neuron should be free to equal any real number. So regression networks use identity activation for output neurons. The output activations can also be linear or affine. 
The output activation a y(t) for a recurrent regression network is the identity activation because it equals its own input: a y(t) k =o y(t) k . (7.71) The error function is the squared error E(Θ) over all the time slices: E(Θ) = T X t=to ||y (t) − a y(t) || 2 . (7.72) 2 This mean 1 is a vector of ones and the covariance I is an identity matrix. 151 Figure 7.2: Multiplicative NEM-noise with neural classifiers: The multiplicative NEM noise injection in the output of a recurrent neural classifier. The noise samples on the left side of the NEM hyperplane satisfy the NEM positivity condition in (7.4). The output activation a y(t) = [0.6, 0.3, 0.1] and the target y (t) = [1, 0, 0]. The NEM noise comes from picking random samples from a Gaussian distribution with mean 1, identity covariance I, and satisfy the inequality in (7.70). The output log-likelihood L of the RNN regressor is L(Θ) = log T Y t=to p(y (t) |x, Θ) (7.73) = log T Y t=to 1 (2π) K 2 exp − ||y (t) − a y(t) || 2 2 ! (7.74) = log T Y t=to K Y k=1 1 √ 2π exp − |y (t) k −a y(t) k | 2 2 ! (7.75) =− T X t=to K 2 log 2π + 1 2 ||y (t) − a y(t) || 2 (7.76) RBP trains the RNN regression model by iteratively maximizing the log-likelihood L(Θ) in (7.76). BP invariance requires that this likelihood maximization occur successively at each hidden layer as well as at the output layer. This remaining part of this subsection shows how to inject beneficial additive NEM noise into the output neurons of a RNN regression model. 152 7.3.2.1 Additive NEM Noise Injection Theorem 7.3 present the NEM theorem for a regression RNN. The theorem gives a sufficient condition for injecting beneficial additive noise in the output neurons of the regression RNN. Theorem 7.3. [RBP Noise Benefit for a Regression RNN with Additive Noise Injection into the Output Neurons.] The NEM positivity condition (7.4) holds for the maximum-likelihood training of a regression recurrent neural network with Gaussian target vector y (t) ∼N (y (t) |a y(t) , I) and with additive noise injection into the output identity neurons if the following inequality condition holds: E y,h,n|x,Θ ∗ h T X t=to ||y (t) + n (t) − a y(t) || 2 i −E y,h,n|x,Θ ∗ h T X t=to ||y (t) − a y(t) || 2 i ≤ 0 (7.77) where the noise n (t) injects additively into the output neurons and where a y(t) is the output activation. Proof: The error function E(Θ) is the squared error for a regression network. This proof first shows that minimizing the squared error E(Θ) is the same as maximizing the likelihood function L(Θ): E(Θ) = T X t=to E (t) (7.78) =− T X t=to − ||y (t) − a y(t) || 2 2 (7.79) =− T X t=to log h exp − ||y (t) − a y(t) || 2 2 i (7.80) =−(T−t o ) log(2π) − K 2 + (T−t o ) log(2π) − K 2 − log h T Y t=to p(y = y (t) |x, Θ) i (7.81) =−L(Θ) + (T−t o ) log(2π) − K 2 (7.82) where the network’s output probability density functionp(y = y (t) |x, Θ) is the vector normal densityN (y (t) |a (t) , Θ). So minimizing E(Θ) is the same as maximizing L(Θ) with respect 153 to Θ. Injecting additive noise into the output neurons gives p(y + n, h|x, Θ) p(y, h|x, Θ) = p(y + n, h|x, Θ) p(h|x, Θ) p(h|x, Θ) p(y, h|x, Θ) (7.83) = p(y + n|h, x, Θ) p(y|h, x, Θ) (7.84) = T Y t=to p(y (t) + n (t) |h, x, Θ) p(y (t) |h, x, Θ) (7.85) = T Y t=to exp 1 2 ||(y (t) − a y(t) || 2 exp 1 2 ||(y (t) + n (t) − a y(t) || 2 . (7.86) Then the NEM positivity condition in (7.4) reduces to E y,h,n|x,Θ ∗ " log T Y t=to exp( 1 2 ||y (t) − a y(t) || 2 ) exp( 1 2 ||y (t) + n (t) − a y(t) || 2 ) # ≥ 0. 
(7.87) This positivity condition reduces to the following hyperspherical inequality E y,h,n|x,Θ ∗ T X t=to ||y (t) + n (t) − a y(t) || 2 −E y,h,n|x,Θ ∗ T X t=to ||y (t) − a y(t) || 2 ≤ 0. ■ Figure 7.3 shows the additive noise samples that inject in the output neurons of a neural regression network. The noise samples inside the NEM sphere satisfy the NEM positivity condition in (7.4) with respect to a y(t) = [1.0, 2.0, 1.0] and y (t) = [2.0, 3.0, 2.0]. The additive NEM noise sample in this case are the random samples from a Gaussian distribution with mean 0, identity covariance I, and satisfy the corresponding inequality condition in (7.77) with respect to a y(t) and y (t) . The definition of the additive NEM condition is not restricted to only Gaussian distribution samples. The NEM noise can also inject multiplicatively into the output neurons of the recurrent regression model. 7.3.2.2 Multiplicative NEM Noise Injection Theorem 7.3 simplifies to the following corollary with multiplicative NEM noise into the output neurons of a recurrent regression network. The NEM noise samples injects over time t in the output neurons of the network. Corollary 7.2. Multiplicative NEM Noise Injection into the Output Neurons of a Regression RNN. 154 Figure 7.3: Additive NEM-noise with regression networks: The additive NEM noise injection in the output of a recurrent neural regression model. The output activation a y = [1.0, 2.0, 1.0] and the target y = [2.0, 3.0, 2.0]. The noise samples inside the NEM sphere satisfy the positivity condition. This additive NEM noise comes from picking the random samples from a Gaussian distribution with mean 0, identity covariance I, and satisfy the inequality in (7.77). The NEM positivity condition (7.4) holds for the maximum-likelihood training of a regression recurrent neural network with Gaussian target vector y (t) ∼N (y (t) |a y(t) , I) and multiplicative noise injection into the output neurons if the following inequality condition holds: E y,h,n|x,Θ ∗ h T X t=to 2y (t) + 2n (t) − a y(t) T n (t) i ≤ 0 (7.88) where n (t) is the multiplicative noise injected into the output neurons and a y(t) is the output activation at time t. The proof tracks the proof for injecting additive NEM noise in a regression network from 7.3. The proof replaces p(y + n, h|x, Θ) with p(y◦ n, h|x, Θ) in (7.83) -(7.86). The result reduces the NEM positivity condition to the inequality in (7.88). Figure 7.4 shows the multiplicative noise samples that inject in the output neurons of a regression recurrent neural network. The noise samples inside the NEM ellipsoid satisfies the inequality condition in (7.88) with respect to a y(t) = [1.0, 2.0, 1.0] and y (t) = [2.0, 3.0, 2.0]. The multiplicative 155 Figure 7.4: Multiplicative NEM-noise with regression networks: The multiplicative NEM noise injection in the output of a regression recurrent neural network. The output prediction or activation a y(t) = [1.0, 2.0, 1.0] and the target y (t) = [2.0, 3.0, 2.0]. The noise samples inside the NEM ellipsoid satisfy the positivity condition. This additive NEM noise comes from picking the random samples from a Gaussian distribution with mean 1, identity covariance I, and satisfy the inequality in (7.88). NEM noise sample in this case are the random samples from the Gaussian distribution with mean 1, identity covariance I, and satisfy the corresponding inequality in (7.88) with respect to a y and y. The definition of the multiplicative NEM condition is not restricted to Gaussian distribution samples. 
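A minimal sketch shows how the output-layer conditions above can be screened in practice: draw candidate Gaussian noise for each time slice and keep only draws on the beneficial side of the hyperplane in (7.61) for a classifier or inside the hypersphere of (7.77) for a regressor. Screening each time slice separately is a stricter, sufficient version of the time-summed conditions, and the unit noise covariance follows the figures above; both are illustrative simplifications of the expectations in the theorems.

```python
import numpy as np

def nem_noise_classifier(a_y, scale=1.0, rng=None):
    """Additive NEM noise for softmax outputs a_y (T x K, strictly positive):
    keep Gaussian draws with n(t) . log a_y(t) >= 0, per slice, as in (7.61)."""
    rng = rng or np.random.default_rng(0)
    log_a = np.log(a_y)
    noise = np.zeros_like(a_y)
    for t in range(a_y.shape[0]):
        while True:
            n = scale * rng.standard_normal(a_y.shape[1])
            if n @ log_a[t] >= 0.0:            # beneficial side of the NEM hyperplane
                noise[t] = n
                break
    return noise

def nem_noise_regressor(y, a_y, scale=1.0, rng=None):
    """Additive NEM noise for identity outputs: keep Gaussian draws with
    ||y(t) + n - a_y(t)||^2 <= ||y(t) - a_y(t)||^2, per slice, as in (7.77)."""
    rng = rng or np.random.default_rng(0)
    noise = np.zeros_like(y)
    for t in range(y.shape[0]):
        base = np.sum((y[t] - a_y[t]) ** 2)
        while True:
            n = scale * rng.standard_normal(y.shape[1])
            if np.sum((y[t] + n - a_y[t]) ** 2) <= base:   # inside the NEM hypersphere
                noise[t] = n
                break
    return noise

# Toy check with the activations used in Figures 7.1 and 7.3 (one time slice each).
print(nem_noise_classifier(np.array([[0.6, 0.3, 0.1]])))
print(nem_noise_regressor(np.array([[2.0, 3.0, 2.0]]), np.array([[1.0, 2.0, 1.0]])))
```

The rejection loops terminate quickly because roughly half of the unit-variance Gaussian draws satisfy each condition; the accepted noise then adds to the output target before the RBP gradient step.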
7.4 NEM Noise Injection in the Hidden Layers of a Recurrent Neural Network This subsection shows next how to inject beneficial NEM noise into the hidden units of a RNN. This noise injection applies to both the LSTM and GRU RNNs. The proof technique extends from the NEM noise injection for the ordinary BP training [155]. The RBP algorithm trains RNNs by iteratively maximizing the log-likelihood function L(Θ). Theorem 7.1 above states that the backpropagation is a form of generalized expectation maximization.Then maximizing the log-likelihood logp(y|x, Θ) maximizes the differentiable Q-function Q(Θ|Θ n ) =E p(z|y,x,Θ n ) [logp(z, y|x, Θ)] (7.89) 156 where z denotes the hidden or latent variables. RBP depends on the timet. The RBP log-likelihoodL(Θ) equals the sum of time-indexed log-likelihood functions. Then (7.44) shows that the log-likelihood L(Θ) for RBP and the update rule for the n th training epoch with RBP is Θ n+1 = Θ n +η∇ Θ L(Θ) Θ=Θ n (7.90) = Θ n +∇ Θ T X t=to L(Θ) (t) Θ=Θ n (7.91) = Θ n + T X t=to ∇ Θ logp(y (t) |x, Θ) Θ=Θ n (7.92) = Θ n + T X t=to ∇ Θ Q(Θ|Θ n ) (t) Θ=Θ n (7.93) because Theorem 7.1 shows that ∇ Θ logp(y (t) |x, Θ) Θ=Θ n =∇ Θ Q(Θ|Θ n ) (t) Θ=Θ n . (7.94) The surrogate likelihood Q-function for time t has the form Q(Θ|Θ n ) (t) = X z (t) p(z (t) |y (t) , x,Θ n ) logp(z (t) , y (t) |x, Θ) (7.95) where z (t) is the latent variable for the hidden units at time t. The computational cost of computing p(z (t) |y (t) , x, Θ n ) is high because it requires T× 2 J samples over the entire time window T. We can use Monte Carlo importance sampling to approximate Q(Θ|Θ n ) (t) as in [155]. This approximation assumes Bernoulli random variable for latent variable z (t) . This means that z (t) j can take up either 0 or 1 and the conditional probabilities are p(z (t) j = 1|x, Θ) =a z(t) j (7.96) p(z (t) j = 0|x, Θ) = 1−a z(t) j (7.97) where a z(t) j is the activation of j th neuron in the hidden unit z (t) . The neurons in each single unit are independent so the joint probability density function factors into product of marginals p(z (t) |x, Θ n ) = J Y j=1 p(z (t) j |x, Θ n ) = J Y j=1 (a z(t) j ) z (t) j (1−a z(t) j ) 1−z (t) j (7.98) 157 where z (t) j ∈{0, 1}. The conditional probability p(z (t) |y (t) , x, Θ n ) for the hidden unit z (t) is as follows: p(z (t) |y (t) , x, Θ n ) = p(z (t) , y (t) |x, Θ n ) p(y (t) |x, Θ n ) (7.99) = p(y (t) |z (t) , x, Θ n )p(z (t) |x, Θ n ) P z (t)p(y (t) |z (t) , x, Θ n )p(z (t) |x, Θ n ) . (7.100) Monte Carlo can approximate p(z (t) |x, Θ n ) with M independent and identically distributed Bernoulli samples. Then the approximation converges almost surely to the desired probability density function p(z (t) |x, Θ n ) in accord with the strong law of large numbers because the approximation is just a sample mean or a random sample [116]: p(z (t) |x, Θ n )≈ 1 M M X m=1 δ K (z (t) − z m(t) ) (7.101) where δ K is the J-dimensional Kronecker delta and where z m(t) is the m th sample at time t. We approximate p(z (t) |y (t) , x, Θ n ) by using the Monte Carlo estimator of p(z (t) |x, Θ n ) in (7.100). The importance-sampling approximation for the conditional probability density function is p(z (t) |y (t) , x, Θ n )≈ P M m=1 δ K (z (t) − z m(t) )p(y (t) |z (t) , x, Θ n ) P M m 1 =1 p(y (t) |z m 1 (t) , x, Θ n ) (7.102) = M X m=1 δ K (z (t) − z m(t) )γ m(t) (7.103) where the parameter γ m(t) is γ m(t) = p(y (t) |z (t) , x, Θ n ) P M m=1 p(y (t) |z m(t) , x, Θ n ) (7.104) and z m(t) is the m th samples for latent variable z at time t. 
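A minimal NumPy sketch of this importance-sampling step, under the Bernoulli assumption in (7.96)-(7.98): draw M binary hidden samples from the activations and weight each sample by its output likelihood as in (7.102)-(7.104). The function name and the callable log_lik_y_given_z are illustrative assumptions, not the dissertation's code.

```python
import numpy as np

def sample_hidden_and_weights(a_z, log_lik_y_given_z, M=32, rng=None):
    """Monte Carlo importance sampling for p(z^(t) | y^(t), x, Theta_n).

    a_z : shape (J,) vector of hidden Bernoulli activations a_j^z(t) at time t.
    log_lik_y_given_z : callable z -> log p(y^(t) | z, x, Theta_n) for binary z.
    Returns M binary samples z^m(t) and their normalized weights gamma^m(t).
    """
    rng = np.random.default_rng() if rng is None else rng
    z_samples = (rng.random((M, a_z.size)) < a_z).astype(float)  # z_j ~ Bernoulli(a_j)
    log_w = np.array([log_lik_y_given_z(z) for z in z_samples])  # log p(y|z^m, x)
    log_w -= log_w.max()                                         # avoid underflow
    gamma = np.exp(log_w)
    gamma /= gamma.sum()                                         # gamma^m(t) as in (7.104)
    return z_samples, gamma

# Toy usage with a Gaussian output model: log p(y|z) = -0.5 * ||y - W z||^2.
rng = np.random.default_rng(1)
J, K = 8, 3
W = rng.standard_normal((K, J))                 # toy output weights (assumption)
a_z = rng.uniform(0.1, 0.9, J)                  # toy hidden activations
y = rng.standard_normal(K)
z_m, gamma_m = sample_hidden_and_weights(
    a_z, lambda z: -0.5 * np.sum((y - W @ z) ** 2), M=64, rng=rng)
```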
The term γ m(t) measures the relative importance of sample z m(t) with respect to other samples. Let λ(x, y (t) , z (t) ) be as follows: λ(x, y (t) , z (t) ) = logp(z (t) |x, Θ) + logp(y (t) |z (t) , x, Θ). (7.105) 158 The Monte Carlo approximation for Q(Θ|Θ n ) (t) is Q(Θ|Θ n ) (t) ≈ X z (t) M X m=1 γ m(t) δ K (z (t) − z m(t) )λ(x, y (t) , z (t) ) (7.106) ≈ M X m=1 γ m(t) logp(z m(t) |x, Θ) + logp(y (t) |z m(t) , x, Θ) . (7.107) Let λ N (x, y (t) , z (t) ) be as follows: λ N (x, y (t) , z (t) ) = logp(y (t) |z (t) , x, Θ) + logp(z (t) + n (t) |x, Θ). (7.108) The Monte Carlo approximation for the noisy surrogate likelihood Q N (Θ|Θ n ) is Q N (Θ|Θ n ) (t) ≈ X z (t) M X m=1 γ m(t) δ K (z (t) − z m(t) )λ N (x, y (t) , z (t) ) (7.109) ≈ M X m=1 γ m(t) logp(z m(t) + n (t) |x, Θ) + logp(y (t) |z m(t) , x, Θ) . (7.110) The log-likelihood logp(y|x, Θ) factors as follows: logp(y|x, Θ) =Q(Θ|Θ n ) +H(Θ|Θ n ) for the entropy term H(Θ|Θ n ) =−E z|y,x,Θ n logp(z|y, x, Θ) . Then the log-likelihood L(Θ) for the RNN is L(Θ) = T X t=to logp(y (t) |x, Θ) (7.111) = T X t=to Q(Θ|Θ n ) (t) + T X t=to H(Θ|Θ n ) (t) (7.112) =Q(Θ|Θ n ) +H(Θ|Θ n ) (7.113) where Q(Θ|Θ n ) sums the time-indexed surrogate likelihoods Q(Θ|Θ n ) (t) over t o ≤ t≤ T. The next subsection presents the sufficient conditions for NEM noise injection in the hidden units of LSTM and GRU recurrent networks. 7.4.1 NEM Noise Injection in the Hidden Units of Long Short-Term Memory (LSTM) RNNs The subsection shows how to inject NEM noise into the hidden units of a recurrent network. An LSTM-RNN uses three sigmoidal gates: the input, forget, and output gates. The hidden variable z (t) denotes the gates. The terms i (t) , f (t) , and o (t) represent the respective activations for the input, the forget and the output gate. These gates are statistically 159 independent. The terms z (t) i , z (t) f , and z (t) o are the respective latent variables for the input, forget, and output gates. The conditional probability density function p(z (t) |x, Θ) is p(z (t) |x, Θ) =p(z (t) i , z (t) f , z (t) o |x, Θ) =p(z (t) i |x, Θ)p(z (t) f |x, Θ)p(z (t) o |x, Θ) (7.114) because of the statistical independence. The terms n m(t) i , n m(t) f , and n m(t) o are the respective noise injections in the input, forget, and output gates. Then the noise-injected likelihood has the form p(z (t) + n (t) |x, Θ) =p(z (t) i + n (t) i , z (t) f + n (t) i , z (t) o + n (t) o |x, Θ) (7.115) =p(z (t) i + n (t) i |x, Θ) p(z (t) f + n (t) f |x, Θ) p(z (t) o + n (t) o |x, Θ). (7.116) Putting (7.114) in (7.107) gives the surrogate likelihood Q(Θ|Θ n ) (t) for an LSTM-RNN. Putting (7.116) in (7.110) gives the noisy surrogate likelihood Q N (Θ|Θ n ) (t) for the LSTM- RNN. These results lead to the next theorem. 7.4.1.1 Additive NEM Noise Injection Theorem 7.4 presents the additive NEM theorem for an LSTM-RNN. The theorem gives a sufficient condition for injecting beneficial additive noise in the hidden neurons of the LSTM-RNN. Theorem 7.4. [Additive NEM Noise Injection into the Hidden Neurons of an LSTM-RNN]: A NEM noise benefit holds for the iterative maximum-likelihood training of an LSTM-RNN with additive noise injection into the hidden units (gates) if the following positivity condition holds: E z,n|y,x,Θ ∗ " log p(z + n|y, x, Θ) p(z|y, x, Θ) # ≥ 0. 
(7.117) This inequality reduces to E z,n|y,x,Θ ∗ " T X t=to n (t) i T log i (t) 1− i (t) + n (t) f T log f (t) 1− f (t) + n (t) o T log o (t) 1− o (t) # ≥ 0 (7.118) where the noises n (t) i , n (t) f , and n (t) o inject additively into the respective input, forget, and output gates at time t. The terms i (t) , f (t) , and o (t) are the activations for the respective 160 input, forget, and output gates. Proof. The inequality (7.117) comes from the noisy EM Theorem [200], [203] and implies that the following holds on average: p(z (t) + n (t) |y, x, Θ)≥p(z (t) |y, x, Θ) (7.119) where the noises n (t) i , n (t) f , and n (t) o inject into the respective input, forget, and output gates at timet. The conditional probability function p(z (t) + n (t) |y, x, Θ) factors in (7.116) just as p(z (t) |y, x, Θ) factors in (7.114). Then dividing by the noiseless term and taking logarithms gives log p(z (t) + n (t) |y, x, Θ) p(z (t) |y, x, Θ) = log p(z (t) i + n (t) i |y, x, Θ) p(z (t) i |y, x, Θ) + log p(z (t) f + n (t) f |y, x, Θ) p(z (t) f |y, x, Θ) + log p(z (t) o + n (t) o |y, x, Θ) p(z (t) o |y, x, Θ) (7.120) LSTM uses sigmoid activations for the gates. The Monte Carlo approximation above assumes that a Bernoulli distribution describes the latent variables z (t) i , z (t) f , and z (t) o . This gives log p(z (t) i + n (t) i |y, x, Θ) p(z (t) i |y, x, Θ) = log T Y t=to J Y j=1 (i (t) j ) z (t) i j +n (t) i j (1−i (t) j ) 1−z (t) i j −n (t) i j (i (t) j ) z (t) i j (1−i (t) j ) 1−z (t) i j (7.121) = log T Y t=to J Y j=1 (i (t) j ) n (t) i j (1−i (t) j ) n i(t) j (7.122) = T X t=to J X j=1 n (t) i j log(i (t) j )− log(1−i (t) j ) (7.123) = T X t=to n (t) i T log i (t) 1− i (t) (7.124) for the input gates. The same approach also extends to forget and output gates and we have the following: log p(z (t) f + n (t) f |y, x, Θ) p(z (t) f |y, x, Θ) ! = T X t=to n (t) f T log f (t) 1− f (t) (7.125) log p(z (t) o + n (t) o |y, x, Θ) p(z (t) o |y, x, Θ) ! = T X t=to n (t) o T log o (t) 1− o (t) (7.126) 161 because they also use sigmoid activations. Taking the appropriate NEM averages gives E z,n|y,x,Θ ∗ " log p(z + n|y, x, Θ) p(z|y, x, Θ) # =E z,n|y,x,Θ ∗ " T X t=to n (t) i T log i (t) 1− i (t) + n (t) f T log f (t) 1− f (t) + n (t) o T log o (t) 1− o (t) # ≥ 0 (7.127) from (7.117), (7.120), (7.124), (7.125), and (7.126). ■ The NEM noise can also inject multiplicatively into the hidden units (gates) of an LSTM- RNN. 7.4.1.2 Multiplicative NEM Noise Injection Theorem 7.4 simplifies to the following corollary with multiplicative NEM noise into the gates of the LSTM-RNN. Corollary 7.3. Multiplicative NEM Noise Injection into the Hidden Neurons of an LSTM-RNN. A NEM noise benefit holds for iterative maximum-likelihood training of an LSTM-RNN with multiplicative noise injection into the hidden units (gates) if the following positivity condition holds: E z,n|y,x,Θ ∗ " log p(z◦ n|y, x, Θ) p(z|y, x, Θ) # ≥ 0. (7.128) This inequality reduces to E z,n|y,x,Θ ∗ " T X t=to n (t) i ◦ z (t) i − z (t) i T log i (t) 1− i (t) + n (t) f ◦ z (t) f − z (t) f T log f (t) 1− f (t) + n (t) o ◦ z (t) o − z (t) o T log o (t) 1− o (t) !# ≥ 0 (7.129) where◦ is the Hadamard product, the noises n (t) i , n (t) f , and n (t) o inject multiplicatively into the respective input, forget, output gates at time t. The terms i (t) , f (t) , and o (t) are the activations for the respective input, forget, and output gates . The proof for corollary 7.3 follows from replacing the additive noise with a multiplicative version in Theorem 7.4. 
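As a concrete illustration of the additive gate condition (7.118), the following NumPy sketch screens one candidate noise draw with a per-time-step version of the test, in the same spirit as the per-step checks in Algorithm 7.1 later in this chapter. The helper name, the noise scale, the toy activations, and the clipping guard away from 0 and 1 are assumptions for illustration and not the dissertation's code.

```python
import numpy as np

def lstm_gate_nem_ok(n_i, n_f, n_o, i_t, f_t, o_t, eps=1e-7):
    """Additive NEM gate test of Theorem 7.4 at one time step:
    n_i . logit(i) + n_f . logit(f) + n_o . logit(o) >= 0,
    where logit(a) = log(a / (1 - a)) because the gates are sigmoidal."""
    def logit(a):
        a = np.clip(a, eps, 1.0 - eps)          # numerical guard against log(0)
        return np.log(a / (1.0 - a))
    score = n_i @ logit(i_t) + n_f @ logit(f_t) + n_o @ logit(o_t)
    return score >= 0.0

# Rejection-style use: inject only noise draws that pass the test.
rng = np.random.default_rng(2)
i_t, f_t, o_t = rng.uniform(0.1, 0.9, size=(3, 16))      # toy gate activations
n_i, n_f, n_o = 0.1 * rng.standard_normal(size=(3, 16))  # candidate additive noise
if lstm_gate_nem_ok(n_i, n_f, n_o, i_t, f_t, o_t):
    i_t, f_t, o_t = i_t + n_i, f_t + n_f, o_t + n_o      # inject the NEM noise
```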
The proof of Corollary 7.3 replaces the additive terms z^(t) + n^(t) with the multiplicative terms z^(t) ∘ n^(t) in (7.120)-(7.126). The result gives the positivity condition in (7.129).

7.4.2 NEM Noise Injection in the Hidden Units of Gated-Recurrent-Unit (GRU) RNNs

Noise injection in the hidden units of a GRU-RNN has the same form as in the case of an LSTM-RNN. The GRU gating mechanism differs from that of the LSTM because the GRU uses two gates (input and output) instead of three. Then the likelihood p(z^(t) | y, x, Θ) has the factored form

p(z^(t) | y, x, Θ) = p(z_i^(t), z_o^(t) | y, x, Θ) = p(z_i^(t) | y, x, Θ) p(z_o^(t) | y, x, Θ). (7.130)

The noisy likelihood p(z^(t) + n^(t) | y, x, Θ) has the related factored form

p(z^(t) + n^(t) | y, x, Θ) = p(z_i^(t) + n_i^(t), z_o^(t) + n_o^(t) | y, x, Θ) (7.131)
= p(z_i^(t) + n_i^(t) | y, x, Θ) p(z_o^(t) + n_o^(t) | y, x, Θ). (7.132)

7.4.2.1 Additive NEM Noise Injection

Theorem 7.5 presents the additive NEM theorem for a GRU-RNN. The theorem gives a sufficient condition for injecting beneficial additive noise into the hidden neurons of the GRU-RNN.

Theorem 7.5. [Injecting Additive NEM Noise into the Hidden Neurons of a GRU-RNN]: A NEM noise benefit holds for the iterative maximum-likelihood training of a GRU-RNN with additive noise injection into the hidden units (gates) if the following positivity condition holds:

E_{z,n|y,x,Θ*} [ log ( p(z + n | y, x, Θ) / p(z | y, x, Θ) ) ] ≥ 0. (7.133)

The inequality reduces to

E_{z,n|y,x,Θ*} [ Σ_{t=t_o}^{T} ( n_i^(t)ᵀ log( i^(t) / (1 − i^(t)) ) + n_o^(t)ᵀ log( o^(t) / (1 − o^(t)) ) ) ] ≥ 0 (7.134)

where the noises n_i^(t) and n_o^(t) inject additively into the respective input and output gates. The terms i^(t) and o^(t) are the activations of the respective input and output gates at time t.

Proof. The NEM condition (7.133) entails that the following holds on average:

p(z^(t) + n^(t) | y, x, Θ) ≥ p(z^(t) | y, x, Θ). (7.135)

The above likelihood factorizations give

p(z_i^(t) + n_i^(t) | y, x, Θ) p(z_o^(t) + n_o^(t) | y, x, Θ) ≥ p(z_i^(t) | y, x, Θ) p(z_o^(t) | y, x, Θ). (7.136)

The proof now proceeds as in the above case of the LSTM-RNN because the sigmoid gate activations correspond to a Bernoulli probability structure. Then (7.124) and (7.126) give the corresponding expressions for the input and output gates. So the log of the likelihood ratio becomes

log ( p(z^(t) + n^(t) | y, x, Θ) / p(z^(t) | y, x, Θ) ) = log ( p(z_i^(t) + n_i^(t) | y, x, Θ) / p(z_i^(t) | y, x, Θ) ) + log ( p(z_o^(t) + n_o^(t) | y, x, Θ) / p(z_o^(t) | y, x, Θ) ) (7.137)
= Σ_{t=t_o}^{T} ( n_i^(t)ᵀ log( i^(t) / (1 − i^(t)) ) + n_o^(t)ᵀ log( o^(t) / (1 − o^(t)) ) ). (7.138)

So the NEM positivity condition reduces to

E_{z,n|y,x,Θ*} [ log ( p(z + n | y, x, Θ) / p(z | y, x, Θ) ) ] = E_{z,n|y,x,Θ*} [ Σ_{t=t_o}^{T} ( n_i^(t)ᵀ log( i^(t) / (1 − i^(t)) ) + n_o^(t)ᵀ log( o^(t) / (1 − o^(t)) ) ) ] ≥ 0 (7.139)

from (7.133) and (7.138). ■

The NEM noise can also inject multiplicatively into the hidden units (gates) of a GRU-RNN.

7.4.2.2 Multiplicative NEM Noise Injection

Theorem 7.5 simplifies to the following corollary with multiplicative NEM noise injection into the gates of a GRU-RNN.

Corollary 7.4. Multiplicative NEM Noise Injection into the Hidden Neurons of a GRU-RNN.
A NEM noise benefit holds for the iterative maximum-likelihood training of a GRU-RNN with multiplicative noise injection into the hidden units (gates) if the following positivity condition holds: E z,n|y,x,Θ ∗ " log p(z◦ n|y, x, Θ) p(z|y, x, Θ) # ≥ 0 (7.140) 164 and this simplifies to E z,n|y,x,Θ ∗ " T X t=to n (t) i ◦ z (t) i − z (t) i T log i (t) 1− i (t) + n (t) o ◦ z (t) o − z (t) o T log o (t) 1− o (t) # ≥ 0 (7.141) where◦ is the Hadamard product, n (t) i and n (t) o are the respective multiplicative noise injec- tions into the input and output gates at time t. The terms i (t) and o (t) are the respective activations for input and output gates . Algorithm 7.1 shows the pseudocode for injecting the additive NEM noise into the hidden and output neurons of a GRU classifier. This algorithm also extends to other modifications such as multiplicative noise, LSTM-RNN, and recurrent regression networks with appropriate modifications. 7.5 Simulation and Results This simulation includes the NEM noise algorithms for RBP training for recurrent neural classifiers and regression models. The recurrent classifiers had either a long-short-term- memory or gated-recurrent-unit structure. The same held for the recurrent regression models. 7.5.1 Datasets This simulation considered video classification with recurrent neural classifiers and time-series prediction with recurrent regression model. This subsection describes the data sets for these two sets of experiments. 7.5.1.1 Classification The recurrent neural classifiers classifiers trained on the standard UCF-11 sports-action YouTube video dataset [220], [235]. The dataset consists of 11 categories of sport-action videos namely basketball shooting, trampoline jumping, golf-swing, swinging, cycling, diving, soccer juggling, tennis swinging, volleyball spiking, horse riding, and walking. We extracted 30,000 training samples from the UCF-11 video dataset. We used 5:1 data splits for training and testing. So we used 5,000 video-clip samples for testing the network after training. Each training instance had 7 image frames sampled uniformly from 165 Algorithm 7.1: Noise-Boosted Recurrent Backpropagation: Training a GRU recurrent neural classifier with additive NEM noise injection into the output and hidden units. 
Data: M time-dependent input sequences {x_1, ..., x_M}, the corresponding M target 1-in-K label sequences {y_1, ..., y_M}, and the number of training epochs R.
Result: Trained parameter Θ_RBP.
• Initialize the network parameter Θ^(0).
• For each epoch r = 0, ..., R − 1:
  • For each training sample m = 1, ..., M:
    • Compute the forward pass of the input data x_m = {x_m^(t)} for t = 1, ..., T using (7.15) and (7.30)-(7.36).
    • Compute the output activation vectors a_y^m = {a_y^m(t)} for t = t_o, ..., T using (7.59).
    • For t = t_o, ..., T:
      • Generate an output noise vector n^(t) and test for positivity using Theorem 7.2.
      • If n^(t)ᵀ log(a_y^m(t)) ≥ 0 then add the NEM noise: y_m^(t) ← y_m^(t) + n^(t); else do nothing.
    • Compute the cross entropy between {a_y^m(t)} and y_m = {y_m^(t)} using (7.60).
    • Back-propagate the error.
    • For t = 1, ..., T:
      • Generate hidden noise vectors n_i^(t) and n_o^(t).
      • If n_i^(t)ᵀ log( i_m^(t) / (1 − i_m^(t)) ) ≥ 0 then add the NEM noise: i_m^(t) ← i_m^(t) + n_i^(t); else do nothing.
      • If n_o^(t)ᵀ log( o_m^(t) / (1 − o_m^(t)) ) ≥ 0 then add the NEM noise: o_m^(t) ← o_m^(t) + n_o^(t); else do nothing.
    • Update the network parameter Θ using gradient descent or one of its variants.
• Θ_RBP ← Θ^(R).

a sports-action video at the rate of 1.4 frames per second. Each image frame in the dataset had 299×299×3 pixels. So the size of each input was 299×299×3×7. Convolutional filters in the RNN extracted features from the images and reduced the dimension of the input pattern space.

Figure 7.5: Image frames of a sports-action video: This shows sampled image frames of a diver from the UCF-11 YouTube sports database. Sampling gave these diving images at a rate of 7 image frames in 5 seconds. The diving videos were the 3rd of 11 video pattern categories. So the target output vector for this sampled video clip was the 1-in-K-coded bit vector [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0].

Figure 7.6 shows the structure of the convolutional layers. The convolutional layers of a trained inception-v3 network [241] extracted the features. It reduced the dimension of each image to 2048×1. This reduced the dimension of each input to 2048×1×7. We fed the extracted features into the input neurons of the RNN classifier and tried different numbers of hidden neurons.

7.5.1.2 Regression

The recurrent regression networks trained on the dollar-rupee exchange rates from January 1980 to August 2017. The dataset gives the daily average exchange rate of the US dollar to the Indian rupee over 9,697 business days. The exchange-rate data came from the currency web site https://www.investing.com/currencies/usd-inr-historical-data. Figure 7.7 shows the dollar-rupee time-series data over this period of 9,697 days. The RNNs trained on the exchange-rate data from day 0 until day 8,000—from January 1980 to March 2011. We tested the RNN regression models on the exchange-rate data from day 8,001 until day 9,697—from March 2011 to August 2017. The trained RNN regression models predicted the dollar-rupee exchange rate for the i-th day d^(i) given data from the 4 previous days. The inputs were the open price, high price, low price, and average price from the 4 previous days d^(i−1), d^(i−2), d^(i−3), and d^(i−4).

7.5.2 Performance Metrics

The performance metric for the recurrent neural classifiers was classification accuracy. The recurrent regression models used the mean squared error. The speed-up benefit metric compared the number of training epochs it took one model to outperform the other.
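Before turning to the experimental results, the following condensed NumPy sketch restates the per-sample NEM screening steps of Algorithm 7.1 on raw activations. It is an illustrative assumption rather than the dissertation's code: the function name, the noise scale, and the toy shapes are made up, and the forward pass, the cross-entropy backpropagation, and the parameter update from the algorithm are omitted.

```python
import numpy as np

def inject_nem_noise(a_y, i_gates, o_gates, y_targets, scale=0.1, eps=1e-7, rng=None):
    """Per-sample NEM screening steps of Algorithm 7.1 on raw activations.
    a_y       : (T, K) output softmax activations a^y(t).
    i_gates   : (T, J) input-gate activations i^(t); o_gates likewise.
    y_targets : (T, K) 1-in-K target vectors y^(t).
    Returns the possibly noise-boosted targets and gate activations."""
    rng = np.random.default_rng() if rng is None else rng
    logit = lambda a: np.log(np.clip(a, eps, 1 - eps)) - np.log(np.clip(1 - a, eps, 1 - eps))
    for t in range(a_y.shape[0]):
        n = scale * rng.standard_normal(a_y.shape[1])
        if n @ np.log(np.clip(a_y[t], eps, 1.0)) >= 0.0:   # output test: n . log a^y(t) >= 0
            y_targets[t] = y_targets[t] + n
        n_i = scale * rng.standard_normal(i_gates.shape[1])
        n_o = scale * rng.standard_normal(o_gates.shape[1])
        if n_i @ logit(i_gates[t]) >= 0.0:                  # hidden input-gate test
            i_gates[t] = i_gates[t] + n_i
        if n_o @ logit(o_gates[t]) >= 0.0:                  # hidden output-gate test
            o_gates[t] = o_gates[t] + n_o
    return y_targets, i_gates, o_gates

# Toy shapes: 7 frames, 11 classes, 32 hidden units (illustrative only).
rng = np.random.default_rng(3)
T, K, J = 7, 11, 32
a_y = rng.dirichlet(np.ones(K), size=T)            # softmax-like output activations
y = np.eye(K)[rng.integers(0, K, T)].astype(float) # 1-in-K targets
i_g, o_g = rng.uniform(0.1, 0.9, (T, J)), rng.uniform(0.1, 0.9, (T, J))
y_b, i_b, o_b = inject_nem_noise(a_y, i_g, o_g, y, rng=rng)
```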
The speed-up benefit works as follows. Consider a model M_1 with performance metric P_1 that outperforms a baseline model M_B with performance metric P_B. The word "outperforms" means P_1 > P_B in the case of classification accuracy and P_1 < P_B in the case of regression because the regression metric is the mean squared error.

Figure 7.6: Image extraction with inception-v3 network: This shows the architecture of feature extraction from image frames with the convolutional layers of a trained inception-v3 network [241]. The convolutional layers consisted of convolutional masks, pooling layers, and inception modules. Inception modules arranged the convolutional masks and pooling layers to extract features from images. (a) The signal-flow chart for using the convolutional layers of the inception-v3 network to extract features and to reduce the dimension of images. The process converted an input image of size 299×299×3 to the reduced size 2048×1. (b) The architecture of inception module 1. (c) The architecture of inception module 2. (d) The architecture of inception module 3.

Let N_1B be the number of training epochs it took M_1 to reach P_B and let N_B1 be the number of training epochs for M_B to reach P_1. Let N_1 and N_B be the respective numbers of training epochs for M_1 to reach P_1 and for M_B to reach P_B.

Figure 7.7: US dollar Indian rupee time-series dataset: This is the time-series data for the average daily exchange rate of the US dollar to the Indian rupee over the 9,697 business days from January 1980 to August 2017. The recurrent neural networks trained on the data from day 0 to day 8,000—from January 1980 to March 2011. We tested the RNN models on the exchange-rate data from day 8,001 to day 9,697—from March 2011 to August 2017.

The mathematical form of the speed-up benefit S_B is

S_B = (N_B − N_1B) / N_B   if M_1 is better than M_B
S_B = (N_B1 − N_B) / N_1   otherwise. (7.142)

This means that S_B > 0 if M_1 is better than M_B and S_B < 0 otherwise.

7.5.3 Experimental Results

This subsection reports the experimental results from the training of recurrent neural classifiers and recurrent regression networks with different forms of RBP. The experiments compared RBP training without noise, RBP training with blind noise, and RBP training with NEM noise. The experiments also compared noise injection in the output layer only, in the hidden layers only, and in both the output and hidden neurons. The best performance came from NEM noise injection in both the hidden and output units.

7.5.3.1 LSTM: Recurrent Neural Classifiers

This simulation trained recurrent LSTM classifiers on the UCF-11 sports-video dataset. Table 7.1 shows the classification accuracy for different noise-boosted RBP training regimens.
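Before turning to those results, the following short sketch illustrates how the speed-up metric in (7.142) is computed. The epoch counts in the example are hypothetical and are not measured values from the tables that follow.

```python
def speed_up_benefit(better: bool, N_B: int, N_1B: int, N_B1: int, N_1: int) -> float:
    """Speed-up benefit S_B from (7.142).
    better : True if model M_1 outperforms the baseline M_B.
    N_B    : epochs for M_B to reach its own final performance P_B.
    N_1B   : epochs for M_1 to reach the baseline performance P_B.
    N_B1   : epochs for M_B to reach M_1's performance P_1.
    N_1    : epochs for M_1 to reach its own final performance P_1."""
    if better:
        return (N_B - N_1B) / N_B    # positive when M_1 reaches P_B early
    return (N_B1 - N_B) / N_1        # negative when M_1 lags the baseline

# Hypothetical example: the boosted model reaches the baseline accuracy after
# 70 of the baseline's 100 epochs, so S_B = 0.30, i.e. a 30% speed-up.
print(speed_up_benefit(better=True, N_B=100, N_1B=70, N_B1=100, N_1=100))
```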
Injecting additive blind noise had little effect on the RNN’s classification accuracy. It slightly increased the classification accuracy in some cases and it reduced it in others. Total blind-noise injection in the hidden and output neurons decreased the classification accuracy 169 0 20 40 60 80 100 Training Epoch 0 20 40 60 80 100 Classification Accuracy Noise injection (Output layer) No Noise Blind Noise NEM Noise (a) 0 20 40 60 80 100 Training Epoch 0 20 40 60 80 100 Classification Accuracy Noise injection (Hidden layer) No noise Blind noise NEM noise (b) 0 20 40 60 80 100 Training Epoch 0 20 40 60 80 100 Classification Accuracy Noise injection (Hidden & Output layers) No Noise Blind Noise NEM Noise (c) Figure 7.8: Noise-benefit with LSTM classifiers: Noise-boosted accuracy performance of LSTM-RNN classifiers trained on the YouTube sports-video UCF-11 test set. The best accuracy resulted from additive NEM-noise injection in both the output and hidden neurons. (a) Additive noise injection in only the output layer. (b) Noise injection in only the hidden units. (c) Noise injection in both the output neurons and hidden neurons. by 1.15%. This accuracy change used noiseless RBP as the base line. The noiseless case had a classification accuracy of 85.82% with LSTM-RNN. Figure 7.8 shows that additive NEM noise improved the classification accuracy in all cases compared with noiseless RBP and blind-noise injection. Additive NEM noise also outperformed multiplicative NEM noise in all cases. The best performance occurred with total additive NEM-noise injection in both the output neurons and in all hidden units. This improved the performance to 93.65% classification accuracy. This was an increase of 7.83% over the noiseless classification accuracy. The extra computation involved in the NEM-noise injection was slight. 170 Table 7.1: NEM-benefit in classification accuracy of LSTM classifiers: This shows the classification accuracy of LSTM on the YouTube sports-video dataset UCF-11 for injected blind noise and NEM noise. All training was for 100 epochs. Noise Type Classification accuracy Output Hidden Output & Hidden Baseline (No noise) 85.82% Additive and blind 85.62% 84.10% 84.67% Additive and NEM 92.72% 87.45% 93.65% Multiplicative and blind 85.43% 86.93% 86.98% Multiplicative and NEM 87.30% 87.26% 87.78% Table 7.2: NEM-speed-up with LSTM classifiers: This shows the training speed-up of LSTM classifiers trained on the YouTube sports-video dataset UCF-11 for injected blind noise and NEM noise. Noise Type Speed-up benefit Output Hidden Output & Hidden Additive and blind -6% -9% -7% Additive and NEM 37% 17% 41% Multiplicative and blind -13% 2% 10% Multiplicative and NEM 16% 17% 20% Table 7.2 shows that additive NEM noise injection sped up training more than did multiplicative noise. Both gave more far more pronounced speed-ups than did blind-noise injection. The best speed-up occurred with total additive NEM-noise injection in both the output neurons and in all hidden units. 7.5.3.2 GRU: Recurrent Neural Classfiers This subsection presents the performance of trained and noise-boosted recurrent GRU classifiers on the UCF-11 sports-video dataset. Table 7.3 shows the classification accuracy for different RBP training regimens. Additive injection of NEM noise in both the hidden and output neurons increased the classification accuracy by 3.02%. Multiplicative NEM noise injection in both the hidden and output neurons increased the accuracy by 2.25%. 
Injecting blind noise or dither only slightly improved the classification accuracy. Table 7.4 shows that NEM noise injection also sped up the training of the GRU RBP classifier. The best speed-up came from additive NEM-noise injection in both the hidden and output neurons. 171 Table 7.3: NEM-benefit in classification accuracy of GRU classifiers: This shows the classification accuracy of GRU classifiers on the YouTube sports-video dataset UCF-11 for injected blind noise and NEM noise. All training was for 100 epochs. Noise Type Classification accuracy Output Hidden Output & Hidden Baseline (No noise) 92.10% Additive and Blind 93.83% 93.62% 94.93% Additive and NEM 94.92 % 94.52% 95.12% Multiplicative and Blind 93.03% 90.55% 92.17% Multiplicative and NEM 94.23% 94.10% 94.35% Table 7.4: NEM-speed-up with GRU classifiers: Training speed-up due to noise injection in recurrent GRU classifiers trained on the YouTube sports-video dataset UCF-11. Noise Type Speed-up benefit Output Hidden Output & Hidden Additive and Blind 26% 11% 16% Additive and NEM 26 % 15% 28% Multiplicative and Blind 10% -10% 8% Multiplicative and NEM 20% 18% 21% 7.5.3.3 LSTM: Recurrent Regression Networks Simulations showed that NEM-noise boosted LSTM RBP with NEM noise outperformed ordinary noiseless RBP and outperformed RBP with injected blind noise. Figure 7.9 and Table 7.5 show these squared-error results for the additive noise injection. The same relationships held for additive and multiplicative noise injection. The baseline squared error was the the average squared error of the noiseless LSTM-RNN that had trained for 100 epochs. Blind noise also reduced the error but not as much as NEM-noise injection did. The best exchange-rate prediction in terms of the averaged squared error came from injecting additive NEM noise in both the hidden and output neurons. This also gave the fastest convergence in training. It took fewer training epochs for the NEM-noise-boosted RNN to reach the baseline squared-error value of 2.087. Table 7.6 shows the training speed-up from noise injection. RBP training with blind noise outperformed RBP without noise. This result held for noise injection in the output layer only, in the hidden layer only, and in both the hidden and output layers. Figure 7.10 shows that the best result was for additive NEM injection in both the the hidden and output layers. This blind-noise benefit may reflect some form of a stochastic-resonance noise benefit for threshold-like systems [152], [184], [209]. Tables 7.5 and 7.6 show that blind noise injection helped slightly but not as much as NEM-noise 172 0 20 40 60 80 100 Training epoch 10 0 10 1 10 2 10 3 Mean squared error Noise injection (Output layer) No noise Blind noise NEM noise (a) 0 20 40 60 80 100 Training epoch 10 0 10 1 10 2 10 3 Mean squared error Noise injection (Hidden layer) No noise Blind noise NEM noise (b) 0 20 40 60 80 100 Training epoch 10 0 10 1 10 2 10 3 Mean squared error Noise injection (Hidden & Output layers) No noise Blind noise NEM noise (c) Figure 7.9: Additive NEM-benefit with LSTM regression networks: This shows noise-boosted prediction performance of LSTM-RNN regression models trained on the dollar-rupee exchange-rate dataset. The models usedAdditive noise samples. The best results used NEM-noise injection in both the output layer and hidden units. (a) Noise injection in only the output layer. (b) Noise injection only in the hidden units. (c) Noise injection in both the the output neurons and hidden neurons. injection helped. 
The tables also show that the best LSTM-RBP performance was with NEM noise injection in both the hidden and output layers. This held both for reducing the average squared error and for reducing the number of training epochs that the system took to reach the baseline noiseless squared-error value. 7.5.3.4 GRU: Recurrent Regression Networks This part shows how injecting NEM noise improved the predictive accuracy of GRU-RNN regression models. Figure 7.11 shows that injecting NEM noise in the GRU-RNN regression 173 8000 8500 9000 9500 Day 40 50 60 70 Price (Indian rupee) NEM noise injection Target No noise Output layer Hidden layer Output & Hidden layers Figure 7.10: Time-series prediction with LSTM regression networks: LSTM RNN predictions of the dollar-rupee exchange rate for additive NEM-noise injection in the output neurons, in the hidden neurons, and in both output and hidden neurons. Table 7.5: NEM-benefit with the mean squared error of LSTM regression networks: Predictive accuracy of noise-injected LSTM-RNN regression models that trained on the dollar-rupee exchange- rate dataset. Each entry shows the average squared error after training for 100 epochs. Noise Type Mean squared error Output Hidden Output & Hidden Baseline (No noise) 2.087 Additive and Blind 1.689 2.465 2.988 Additive and NEM 1.490 1.636 1.059 Multiplicative and Blind 2.310 1.756 2.359 Multiplicative and NEM 1.170 1.191 1.147 Table 7.6: NEM-speed-up with LSTM regression networks: This shows the training speed-up due to noise injection in LSTM-RNN regression models trained on the dollar-rupee exchange-rate dataset. Noise Type Speed-up benefit Output Hidden Output & Hidden Additive and NEM -24% -12% -24% Additive and NEM 20% 36% 38% Multiplicative and Blind -11 % -24 % -11% Multiplicative and NEM 31% 30% 29% outperformed injecting blind noise and outperformed noiseless RBP. The RBP with blind noise outperformed RBP without noise. Figure 7.11 shows that NEM-noise injection in both the GRU-RNN’s hidden and output 174 0 20 40 60 80 100 Training epoch 10 −1 10 0 10 1 10 2 10 3 Mean squared error Noise injection (Output layer) No noise Blind noise NEM noise (a) 0 20 40 60 80 100 Training epoch 10 −1 10 0 10 1 10 2 10 3 Mean squared error Noise injection (Hidden layer) No noise Blind noise NEM noise (b) 0 20 40 60 80 100 Training epoch 10 −1 10 0 10 1 10 2 10 3 Mean squared error Noise injection (Hidden & Output layers) No noise Blind noise NEM noise (c) Figure 7.11: Additive NEM-benefit with GRU regression networks: This shows the noise-boosted prediction performance of GRU-RNN regression models trained on the dollar-rupee exchange-rate dataset. The best predictions in terms of least average squared error resulted from the Additive NEM-noise injection in both the output and hidden neurons. (a) Additive noise injection in only the output layer. (b) Noise injection in only the hidden units. (c) Noise injection in both the output neurons and hidden neurons. neurons outperformed injecting NEM noise in either its hidden neurons or output neurons separately. Injecting NEM noise outperformed injecting blind noise. But injecting blind noise did substantially better than injecting no noise at all. Tables 7.7 and 7.8 give the resulting squared-error values of the predictions. Table 7.7 shows the noise benefits in terms of the average squared errors of the predictions. Additively injecting NEM noise in both the hidden and output neurons gave the best squared-error performance for prediction. 
But multiplicative NEM-noise injection gave the best training speed-up relative to the baseline squared-error value of 1.508 that noiseless RBP reached after 100 training epochs. This was the only case where multiplicative NEM-noise injection outperformed additive NEM-noise injection.

Table 7.7: NEM-benefit in mean squared error of GRU regression networks: This shows the predictive accuracy of noise-injected GRU-RNN regression models that trained on the dollar-rupee exchange-rate dataset. Each entry shows the average squared error after training for 100 epochs.

Noise Type                  Output    Hidden    Output & Hidden
Baseline (No noise)                   1.508
Additive and Blind          1.566     1.463     1.461
Additive and NEM            1.425     1.362     1.280
Multiplicative and Blind    1.558     1.483     1.462
Multiplicative and NEM      1.422     1.161     1.151

Table 7.8: NEM-speed-up with GRU regression networks: Training speed-up due to noise injection in GRU-RNN regression models trained on the dollar-rupee exchange-rate dataset. All training was for 100 epochs. Multiplicative NEM-noise injected into the hidden and output neurons gave the best performance.

Noise Type                  Output    Hidden    Output & Hidden
Additive and Blind          -10%      10%       4%
Additive and NEM            28%       27%       32%
Multiplicative and Blind    -10%      18%       22%
Multiplicative and NEM      18%       36%       37%

7.6 Conclusion

Careful noise injection improved the convergence and the accuracy of recurrent backpropagation for both classification and regression with LSTM and GRU recurrent neural networks. Simulations confirmed the noise benefits that the corresponding noise-benefit theorems predicted would hold on average. These results extend the recent noise-boosting of the simple backpropagation algorithm. The noise-boosting itself depends on the facts that the backpropagation algorithm is a special case of the generalized Expectation-Maximization algorithm and that noise can always boost the EM algorithm on average. This NEM formulation of recurrent backpropagation shows that recurrent backpropagation is ultimately a statistical model.

Chapter 8 Future Work

This final section proposes three key ways that future research can profitably extend the results of this dissertation. The goal remains Big-K scaling: find new neural architectures and algorithms that learn and recall an ever larger number K of patterns. The three proposed areas would (1) adapt the bidirectional block structure to larger test sets, (2) find better coding schemes that produce quasi-orthogonal codewords in high dimensions, and (3) analyze these systems both as probabilistic models and as proper nonlinear dynamical systems.

8.1 Big-K Simulation Databases

There are surprisingly few large database test sets. Most current neural classifiers work with a pattern number K that is no more than 1,000 and often at most 100. Good examples are the popular CIFAR-10 and CIFAR-100 image test sets that we have used throughout this dissertation. The benefits of random logistic coding should grow exponentially with K because the technique uses quasi-orthogonal codewords. But countervailing costs may well set in for large K. So balanced cost-benefit analyses will require extensive simulations over a wide range of network parameters for different large numbers of training and test patterns. The paucity of large publicly available datasets appears due to both lack of user demand and the computational costs of hosting such test sets. Table 8.1 shows examples of some publicly available Big-K datasets.
Near-term work on Big-K scalability may require forming noisy or rotation-scale variations or other synthetic extensions of these and similar test sets.

Table 8.1: Big-K simulation databases: Examples of publicly available Big-K databases.

Dataset                   Number of Samples    Number of Classes K
JFT-300M [46], [111]      300M                 18,291
Open Images [159]         9.2M                 20,000
Tiny-Images [244]         79M                  76,000
Tencent-ML [266]          18M                  11,000
ImageNet-21K [144]        14.2M                21,841

8.2 Improved Random Coding

The shift from 1-in-K encoding with output softmax neurons to logistic coding and logistic output neurons depends on some form of random coding. The unit bit-vector codewords of softmax outputs are always orthogonal. But picking codewords from all 2^K of the Boolean cube's vertices tends to produce codewords that are at best approximately orthogonal. Such random codewords do tend to be orthogonal as the cube dimension K increases, but the computational cost of such random search also grows rapidly. So logistic coding will require better schemes for picking codewords for very large-K databases. These schemes may be deterministic or fast and may well involve different geometric or information-theoretic performance measures.

8.3 Probabilistic and Dynamical Analysis of Bidirectionality

The bidirectional framework involves both a probabilistic component and a feedback component. This suggests analyzing and experimenting with bidirectional neural structures in two different ways. The first way is probabilistic: the forward and backward flow of neural signals involves changes in the network likelihood or posterior structure. A good example is the use of different weight priors in Bayesian bidirectional backpropagation. Here the number of potential priors is literally infinite. And such priors can apply to many different subsets and combinations of the network parameters. Priors can also apply at a higher level to different subsets of blocks in long-blocked networks. Probabilistic analysis of these alternatives may also give insight into network robustness to changes in the database or to random shocks to network subsystems. The benefits and harms of noise injection also involve probabilistic analysis. Different types of NoVa or other hidden neurons will produce different layer probability structures and thereby produce different constraints on helpful noise boosts. The layer probability structure may not have a closed form and so analysis will likely involve some form of Monte Carlo simulation.

Feedback offers another way to analyze bidirectional structures. The architectures and algorithms in this dissertation have used feedback in a sequential and limited manner. Most reported classification and regression results have been from one-shot passes through the network. But bidirectional structures define proper dynamical systems if run continuously forwards and backwards. Their equilibrium behavior can be quite complex. It can range from simple fixed-point attractors to various forms of chaos or aperiodicity. This dynamical structure opens the door to constructive uses of feedback to control the learning and recall of patterns. These uses may involve techniques from the rich and growing fields of optimal control or agent-based simulation.

Bibliography

[1] O. Abdel-Hamid, L. Deng, and D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition," in Interspeech, Citeseer, vol. 11, ISCA, 2013, pp. 3366–3370.

[2] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G.
Penn, and D. Yu, “Con- volutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014. doi: 10.1109/TASLP.2014.2339736. [3] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition,” in 2012 IEEE international conference on Acoustics, speech and signal processing (ICASSP), IEEE, 2012, pp. 4277–4280. doi: 10.1109/ICASSP.2012.6288864. [4] S. Abramovich, G. Jameson, and G. Sinnamon, “Refining jensen’s inequality,” Bulletin mathématique de la Société des Sciences Mathématiques de Roumanie, pp. 3–14, 2004. [5] O.AdigunandB.Kosko,“Bidirectionalrepresentationandbackpropagationlearning,” in The 2016 International joint conference on advances in big data analytics(ABDA), CSREA, 2016, pp. 3–6. [6] O. Adigun and B. Kosko, “Using noise to speed up video classification with recurrent backpropagation,” in International Joint Conference on Neural Networks, IEEE, 2017, pp. 108–115. doi: 10.1109/IJCNN.2017.7965843. [7] O. Adigun and B. Kosko, “Training generative adversarial networks with bidirectional backpropagation,” in 2018 17th IEEE international conference on machine learning and applications (ICMLA), IEEE, 2018, pp. 1178–1185. doi: 10.1109/ICMLA.2018. 00190. [8] O. Adigun and B. Kosko, “Bidirectional backpropagation,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 5, pp. 1982–1994, 2019. doi: 10.1109/TSMC.2019.2916096. 180 [9] O. Adigun and B. Kosko, “Noise-boosted bidirectional backpropagation and adver- sarial learning,” Neural Networks, vol. 120, pp. 9–31, 2019. doi: 10.1016/j.neunet. 2019.09.016. [10] A. F. Agarap, “Deep learning using rectified linear units (relu),” arXiv preprint arXiv:1803.08375, 2018. [11] R. Alec, M. Luke, and C. Soumith, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, [12] G. Alex, M. Abdel-Rahman, and H. Geoffrey, “Speech recognition with deep recurrent neural networks,” Proceedings of International Conference on Acoustic, Speech and Signal Processing, pp. 6645–6649, 2013. doi: 10.1109/ICASSP.2013.6638947. [13] G. Alex, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 855–868, 2009. doi: 10.1109/TPAMI.2008.137. [14] G. Alex, F. Santiago, L. Marcus, B. Horst, and J. Schmidhuber, “Unconstrained on-line handwriting recognition with recurrent neural networks,” Proceedings of the 20th International Conference on Neural Information Processing System, pp. 577–584, 2007. [15] M. S. Ali, J Yogambigai, S Saravanan, and S Elakkia, “Stochastic stability of neutral- type markovian-jumping bam neural networks with time varying delays,” Journal of Computational and Applied Mathematics, vol. 349, pp. 142–156, 2019. doi: 10.1016/ j.cam.2018.09.035. [16] C. Ambroise, M. Dang, and G. Govaert, “Clustering of spatial data by the em algorithm,” in geoENV I—Geostatistics for environmental applications, Springer, 1997, pp. 493–504. [17] M.Arjovsky,S.Chintala,andL.Bottou,“Wassersteingan,”arXiv preprint arXiv:1701.07875, 2017. [18] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” In International Conference on Learning Representations, 2017. [19] K. Audhkhasi, O. Osoba, and B. 
Kosko, “Noisy hidden markov models for speech recognition,” in The 2013 international joint conference on neural networks (IJCNN), IEEE, 2013, pp. 1–6. doi: 10.1109/IJCNN.2013.6707088. 181 [20] K. Audhkhasi, O. Osoba, and B. Kosko, “Noise-enhanced convolutional neural net- works,” Neural Networks, vol. 78, pp. 15–23, 2016. doi: 10.1016/j.neunet.2015. 09.014. [21] M. Babaee, D. T. Dinh, and G. Rigoll, “A deep convolutional neural network for video sequence background subtraction,” Pattern Recognition, vol. 76, pp. 635–649, 2018. doi: 10.1016/j.patcog.2017.09.040. [22] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Sequential deep learning for human action recognition,” in International workshop on human behavior understanding, Springer, 2011, pp. 29–39. doi: 10.1007/978-3-642-25446-8. [23] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2016, pp. 4945– 4949. doi: 10.1109/ICASSP.2016.7472618. [24] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine learning, vol. 21, no. 1, pp. 51–80, 1995. doi: 10.1007/BF00993379. [25] A. Baldominos, Y. Saez, and P. Isasi, “A survey of handwritten character recognition with mnist and emnist,” Applied Sciences, vol. 9, no. 15, p. 3169, 2019. [26] E.BarnardandD.Casasent,“Shiftinvarianceandtheneocognitron,” Neural Networks, vol. 3, no. 4, pp. 403–410, 1990. doi: 10.1016/0893-6080(90)90023-E. [27] S.BarrattandR.Sharma,“Anoteontheinceptionscore,”arXiv preprint arXiv:1801.01973, 2018. [28] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, “Automatic differentiation in machine learning: A survey,” Journal of Marchine Learning Research, vol. 18, pp. 1–43, 2018. [29] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. doi: 10.1561/2200000006. [30] J. Bergstra, F. Bastien, O. Breuleux, et al., “Theano: Deep learning on gpus with python,” in NIPS 2011, BigLearning Workshop, Granada, Spain, Citeseer, vol. 3, 2011. [31] S. Bhatia and R. Golman, “Bidirectional constraint satisfaction in rational strategic decision making,” Journal of Mathematical Psychology, vol. 88, pp. 48–57, 2019. [32] M.Binkowski,G.Marti,andP.Donnat,“Autoregressiveconvolutionalneuralnetworks for asynchronous time series,” in International Conference on Machine Learning, vol. 80, PMLR, 2018, pp. 579–588. 182 [33] C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,” Neural computation, vol. 7, no. 1, pp. 108–116, 1995. doi: 10.1162/neco.1995.7.1.108. [34] C. M. Bishop, Pattern recognition and machine learning. springer, 2006. [35] J. Bolte and E. Pauwels, “A mathematical model for automatic differentiation in machine learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 10809–10819, 2020. [36] A. Borovykh, S. Bohte, and C. W. Oosterlee, “Conditional time series forecasting with convolutional neural networks,” arXiv preprint arXiv:1703.04691, 2017. [37] A. E. Bryson, “A gradient method for optimizing multi-stage allocation processes,” in Proc. Harvard Univ. Symposium on digital computers and their applications, vol. 72, 1961, p. 22. [38] A. E. Bryson and W. F. Denham, “A steepest-ascent method for solving optimum programming problems,” 1962. [39] A. R. Bulsara and L. 
Gammaitoni, “Tuning in to noise,” Physics today, vol. 49, no. 3, pp. 39–47, 1996. [40] G. Celeux and G. Govaert, “A classification em algorithm for clustering and two stochastic versions,” Computational statistics & Data analysis, vol. 14, no. 3, pp. 315– 332, 1992. [41] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 4960–4964. doi: 10.1109/ICASSP.2016.7472621. [42] F.-C. Chen and M. R. Jahanshahi, “Nb-cnn: Deep learning-based crack detection using convolutional neural network and naïve bayes data fusion,” IEEE Transactions on Industrial Electronics, vol. 65, no. 5, pp. 4392–4400, 2017. doi: 10.1109/TIE. 2017.2764844. [43] Y. Chen, M. Rouhsedaghat, S. You, R. Rao, and C.-C. J. Kuo, “Pixelhop++: A small successive-subspace-learning-based (ssl-based) model for image classification,” in 2020 IEEE International Conference on Image Processing (ICIP), IEEE, 2020, pp. 3294–3298. doi: 10.1109/ICIP40778.2020.9191012. [44] D. Chicco, P. Sadowski, and P. Baldi, “Deep autoencoder neural networks for gene ontology annotation predictions,” in Proceedings of the 5th ACM conference on bioinformatics, computational biology, and health informatics, ACM, 2014, pp. 533– 540. doi: 10.1145/2649387.2649442. 183 [45] E. Choi, A. Schuetz, W. F. Stewart, and J. Sun, “Using recurrent neural network models for early detection of heart failure onset,” Journal of the American Medical Informatics Association, vol. 24, no. 2, pp. 361–370, 2017. doi: 10.1093/jamia/ ocw112. [46] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258. [47] M. Chu, Y. Xie, L. Leal-Taixé, and N. Thuerey, “Temporally coherent gans for video super-resolution (tecogan),” arXiv preprint arXiv:1811.09393, vol. 1, no. 2, p. 3, 2018. [48] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recur- rent neural networks on sequence modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014. [49] D. CireAan, U. Meier, J. Masci, and J. Schmidhuber, “Multi-column deep neural network for traffic sign classification,” Neural networks, vol. 32, pp. 333–338, 2012. doi: 10.1016/j.neunet.2012.02.023. [50] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in Proceed- ings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI 2011 , IJCAI/AAAI, 2011, pp. 1237–1242. doi: 10.5591/978-1-57735-516-8/IJCAI11- 210. [51] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik, “Emnist: Extending mnist to handwritten letters,” in 2017 international joint conference on neural networks (IJCNN), IEEE, 2017, pp. 2921–2926. doi: 10.1109/IJCNN.2017.7966217. [52] M. A. Cohen and S. Grossberg, “Absolute stability of global pattern formation and parallel memory storage by competitive neural networks,” IEEE transactions on systems, man, and cybernetics, vol. 13, no. 5, pp. 815–826, 1983. doi: 10.1109/TSMC. 1983.6313075. [53] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deepneuralnetworkswithmultitasklearning,”in Proceedings of the 25th international conference on Machine learning, 2008, pp. 160–167. [54] R. Collobert, J. Weston, L. 
Abstract
This dissertation shows that three new feedback-based architectures increase the pattern capacity and the performance of deep neural networks: (1) bidirectional backpropagation and its Bayesian extension, (2) blocking networks with random logistic output coding and the new NoVa (non-vanishing) hidden neurons, and (3) noise-boosted recurrent backpropagation for time-varying classification and regression.
The first and central contribution of this thesis is the new bidirectional backpropagation (B-BP) algorithm. B-BP trains a classifier or regression network both forwards and backwards through the same network of synapses and neurons. This introduces a feedback structure into the network’s training and into its probabilistic structure. B-BP tends to improve classification accuracy at little extra cost compared with ordinary forward-only BP. Ordinary BP ignores such backward training and so ignores the hidden regressor in the backward direction when training a classifier. Bayesian B-BP further allows prior probabilities to shape the optimization of the network’s global posterior probability structure. B-BP also improves the classification performance of generative adversarial networks.
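A minimal sketch may help fix the idea of training both directions through the same synapses. The PyTorch sketch below is only illustrative: the single hidden layer, tanh units, softmax forward classifier, identity-output backward regressor, separate backward biases, and equal weighting of the two directional errors are assumptions for this example, not the exact networks or settings used in the dissertation.

```python
# Hypothetical B-BP sketch: the forward pass classifies and the backward
# pass regresses back through the SAME weight matrices (transposed).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_in, n_hid, n_out = 16, 32, 4                          # illustrative sizes
W1 = (0.1 * torch.randn(n_hid, n_in)).requires_grad_()
b1 = torch.zeros(n_hid, requires_grad=True)             # forward hidden bias
W2 = (0.1 * torch.randn(n_out, n_hid)).requires_grad_()
b2 = torch.zeros(n_out, requires_grad=True)             # forward output bias
c1 = torch.zeros(n_hid, requires_grad=True)             # backward hidden bias
c2 = torch.zeros(n_in, requires_grad=True)              # backward output bias
opt = torch.optim.SGD([W1, b1, W2, b2, c1, c2], lr=0.05)

x = torch.randn(8, n_in)                                # toy input batch
t = torch.randint(0, n_out, (8,))                       # toy class labels
t_onehot = F.one_hot(t, n_out).float()

for step in range(200):
    opt.zero_grad()
    # Forward direction: input -> hidden -> class probabilities.
    h_f = torch.tanh(x @ W1.t() + b1)
    forward_loss = F.cross_entropy(h_f @ W2.t() + b2, t)
    # Backward direction: target code -> hidden -> reconstructed input,
    # reusing W2 and W1 so both directions train the shared synapses.
    h_b = torch.tanh(t_onehot @ W2 + c1)
    backward_loss = F.mse_loss(h_b @ W1 + c2, x)
    # B-BP sums the directional errors before the gradient update.
    (forward_loss + backward_loss).backward()
    opt.step()
```

Both error terms backpropagate into the shared matrices W1 and W2, so the backward regression shapes the same synapses that the forward classifier uses.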
The second main contribution extends classifier networks to bidirectional blocks of networks with random logistic coding at the terminal layers of the interior blocks. Output logistic neurons allow the choice of binary codewords for K patterns from any of the 2^K vertices of a binary hypercube. Output softmax neurons limit this choice to the K vertices of the simplex embedded in the hypercube. Random logistic coding allows the same network to store and recall far more patterns than simple softmax 1-in-K encoding. The new NoVa hidden neurons allow still deeper and more powerful networks per block before the problem of the vanishing gradient takes its toll. NoVa neurons tended to outperform and “live” longer than threshold-linear ReLU (rectified linear unit) hidden neurons in very deep classifier networks that train on a large number K of patterns.
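A small coding sketch illustrates the capacity difference between the two output codes. The class count K, codeword length L, bit-flip noise, and nearest-codeword decoder below are assumptions chosen only for illustration; the point is that far fewer than K logistic output neurons can give K well-separated binary codewords, whereas softmax 1-in-K coding needs K output neurons.

```python
# Hypothetical comparison of softmax 1-in-K coding with random logistic
# (binary) output coding: L logistic output neurons admit 2**L codewords,
# so far fewer than K neurons can code K classes with good separation.
import numpy as np

rng = np.random.default_rng(0)
K, L = 200, 32                                     # 200 classes, 32 output bits

softmax_codes = np.eye(K)                          # K simplex vertices
logistic_codes = rng.integers(0, 2, size=(K, L))   # K of the 2**L cube vertices

def decode(bits, codes):
    """Return the class whose codeword is nearest in Hamming distance."""
    return int(np.argmin(np.abs(codes - bits).sum(axis=1)))

# Simulate a thresholded network output for class 7 with 3 flipped bits.
noisy = logistic_codes[7].copy()
flips = rng.choice(L, size=3, replace=False)
noisy[flips] = 1 - noisy[flips]
print("decoded class:", decode(noisy, logistic_codes))   # typically 7

# Minimum pairwise Hamming distance gauges codeword separation.
diff = (logistic_codes[:, None, :] != logistic_codes[None, :, :]).sum(axis=2)
np.fill_diagonal(diff, L + 1)
print("output neurons: softmax", K, "vs logistic", L)
print("min Hamming distance among random codewords:", diff.min())
```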
The third main contribution is the new noisy recurrent backpropagation algorithm for time-varying signals or patterns. This feedback architecture uses the recent reduction of the backpropagation algorithm to the Expectation-Maximization algorithm for iterative maximum-likelihood estimation. Careful noise injection then gives a noise boost: it makes the current training signal more probable as the system climbs the nearest hill of log-likelihood. The noise boost improves both classification and regression performance.
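The sketch below illustrates the noise-screening idea in a simplified form: candidate noise perturbs the output-layer training signal of a small recurrent classifier and is kept only when it does not make the current training signal less probable under the present network. The RNN size, the noise scale, and this particular screening test are assumptions for illustration and are not the dissertation’s exact noisy recurrent backpropagation algorithm.

```python
# Hypothetical noise-boost sketch for a recurrent classifier: keep only the
# noise samples that do not lower the log-likelihood of the current training
# signal, then backpropagate through the noisy target.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 3)                     # 3 illustrative classes
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.05)

x = torch.randn(4, 10, 8)                         # (batch, time, features)
t = torch.randint(0, 3, (4,))                     # one label per sequence
t_onehot = F.one_hot(t, 3).float()

for step in range(100):
    opt.zero_grad()
    h_seq, _ = rnn(x)
    log_probs = F.log_softmax(head(h_seq[:, -1, :]), dim=1)

    # Candidate noise on the 1-in-K training signal.
    noise = 0.1 * torch.randn_like(t_onehot)
    # Screen: keep noise only if sum_k n_k log a_k >= 0, so the noisy
    # cross entropy -(t + n) . log a never exceeds the noiseless one.
    keep = ((noise * log_probs).sum(dim=1, keepdim=True) >= 0).float()
    noisy_t = t_onehot + keep * noise

    loss = -(noisy_t * log_probs).sum(dim=1).mean()
    loss.backward()
    opt.step()
```

The acceptance test keeps only noise whose cross-entropy term does not exceed that of the noiseless signal, in keeping with the hill-climbing description above of making the current training signal more probable.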
Current neural classifiers tend to work with a small number K of patterns. The number K may be only in the hundreds. Future networks will have to work with much larger numbers of patterns in the hundreds of thousands or millions. This will require new neural architectures and learning algorithms that scale with such large K. The new techniques in this thesis offer a step in that direction.