EMPIRICAL STUDY OF INFORMATIONAL REGULARIZATIONS IN
LEARNING USEFUL AND INTERPRETABLE REPRESENTATIONS
by
Dong Guo
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2019
Copyright 2019 Dong Guo
Acknowledgement
I want to thank my PhD advisor, Professor Cauligi S. Raghavendra. He gave me great freedom in conducting my research, and his experience and insight in machine learning applications helped me greatly throughout my PhD study.
I also want to thank Professor Fei Sha. Working with him on the Babel program in 2015 deepened my understanding of machine learning theory and improved my experimental skills. I also want to thank the members of my qualifying and dissertation committees: Professor Aiichiro Nakano, Professor Yan Liu, Professor Viktor Prasanna and Professor Iraj Ershaghi. Their feedback on my thesis topic was critical and instructive.
I want to thank Wenzhe Li, Ayush Jaiswal, Kuan Liu, Zhiyun Lu and Alireza Bagheri Garakani for our collaboration and numerous discussions on the deep generative model and kernel speech model projects.
I want to thank the Center for Interactive Smart Oilfield Technologies (CiSoft), which supported me with a research assistantship, and the Computer Science Department at the University of Southern California, which supported me with a teaching assistantship, during the first and second halves of my PhD study.
Contents
Acknowledgement 2
List of Tables 5
List of Figures 6
1 Introduction 10
1.1 Representation Learning for Machine Learning . . . . . . . . . . . . 11
1.1.1 Learning to extract representations . . . . . . . . . . . . . . 13
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Applications and Contributions . . . . . . . . . . . . . . . . . . . . 18
2 Information Bottleneck Principle and Applications 20
2.1 Information Bottleneck Principles . . . . . . . . . . . . . . . . . . . 22
2.2 Application on Supervised Learning . . . . . . . . . . . . . . . . . . 25
3 Information Regularization in Supervised Classification Models 30
3.1 Background: Entropy in Acoustic Classification Models . . . . . . . 30
3.1.1 Comparing large scale kernel models and DNN . . . . . . . . 31
3.1.2 Tradeoff between perplexity and entropy . . . . . . . . . . . 35
3.1.3 ERLL: a new model selection criterion . . . . . . . . . . . . 37
3.2 Interpreting ERLL . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Tuning Supervised Models with Information Regularization . . . . . 40
3.3.1 Experimental results . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Trade-offs between accuracy and confidence . . . . . . . . . 42
3.4 Contributions and Discussions . . . . . . . . . . . . . . . . . . . . . 45
4 Information Bottleneck Layer and Concept Discovery 50
4.1 Representation of primitives of variations . . . . . . . . . . . . . . . 51
4.1.1 Density estimation using latent variable model . . . . . . . . 51
4.1.2 Deep generative model and variational Auto-Encoder . . . . 58
4.1.3 Information theoretic interpretation of VAE . . . . . . . . . 62
5 Unsupervised Learning of Latent Factor of Variations 67
5.1 Learning to Disentangle Factors of Variation . . . . . . . . . . . . . 68
5.2 Factors of Variation from Groups of Latent Representations . . . . 72
5.2.1 Factorizing the prior distribution of representation . . . . . . 75
5.2.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . 77
5.3 biVAE: Variational Factorization of Prior Model . . . . . . . . . . . 83
5.3.1 Experimental results . . . . . . . . . . . . . . . . . . . . . . 86
5.4 Contributions and Discussions . . . . . . . . . . . . . . . . . . . . . 89
Reference List 92
A Derivations and Experiments for Supervised Models 104
A.1 Kernels and random features approximation . . . . . . . . . . . . . 104
A.2 Experiment details . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
B Derivations and Experiments for Unsupervised Models 109
B.1 Approximate inference of the factorization prior model using Gibbs sampling . . . . . . . . . . . 109
B.2 Objective function of biVAE model . . . . . . . . . . . . . . . . . . 112
List of Tables
3.1 Comparison in perplexity (ppx) and accuracy (acc: %) on BN-50
validation dataset. The kernel model used 100K random features . . 35
3.2 Regularized perplexity is a better model selection criterion . . . . . . 36
3.3 Comparison in perplexity (ppx), accuracy (acc: %), and ERLL.
Kernel models used 50K dimensional kernel features, and 1000 units
in the bottleneck layer. . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Comparison in accuracies of densenets with and without entropy
regularization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Testing datasets make entropy regularization more useful. . . . . . . 45
5.1 The summary of the models evaluated in this chapter. The models are identified through the property of the covariance of the random variable z . . 77
5.2 Classification accuracy on both source and target domains for the Multi-PIE dataset. The source and target domains are separated by IDs. Better accuracy in the T column means better domain adaptation performance. . . . . . . . . . . . . . 82
B.1 Architecture of VAE encoder and decoders . . . . . . . . . . . . . . 114
List of Figures
1.1 Machine learning models learn from numerical features of data sam-
ples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 3D nonlinear Swiss Roll (left) and 2D linear manifold (right). . . . 13
1.3 Two machine learning approaches: classifier y = f(z) (top) and
feature extractor z = g(x) followed by classifier y = f(z) (bot-
tom). The solid lines mean inference, and dashed lines mean back-
propagating model objectives to update model parameters. . . . . . 14
1.4 Simple discriminative and generative neural networks . . . . . . . . 15
2.1 Learning representation Z in discriminative model Y|X. . . . . . . 24
2.2 Representation learning in neural network classification models: each layer Z_i summarizes some but not all of the information present in the previous layer, Z_{i-1} or X, while retaining as much information relevant to output Y as possible [5]. . . . . . . . . . . . 25
3.1 Kernel model using random approximation can be seen as a shallow
neural network [73]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Linear factorization of model parameters θ_c reduces model complexity and improves generalization performance [73]. . . . . . . . . . . 32
3.3 Training DNN acoustic models to minimize perplexity is not suffi-
cient to arrive at the best WER – after the typical early-stopping
point where the perplexity is lowest, continuing to train to increase
the perplexity but decrease the entropy leads to the best WER. . . 36
3.4 Similar to training DNN acoustic models, as in figure 3.3, train-
ing kernel models also has tradeoff between perplexity and entropy
where the lowest perplexity does not correspond to the lowest WER. 37
3.5 WER is almost linear in the regularized perplexity . . . . . . . . . . 37
3.6 In a supervised NN model, the representation z and the classifier p_φ(y|z) are learned to approximate the intractable predictor p_θ(y|z). . . . . . . . . 38
3.7 Both classification error and prediction entropy have significant gap
on training and testing data. . . . . . . . . . . . . . . . . . . . . . . 44
3.8 The testing dataset taught the optimizer how confident the model should be. 45
4.1 Two representative probabilistic graphical latent variable models for density estimation . . . . . . . . . . . . . . . . . . . . 53
4.2 Neural networks naturally split observations and latent variables into different layers. . . . . . . . . . . . . . . . . . . . . 55
4.3 Generating image of specific properties, by mapping from attributes
to images with decoder. The attributes in this example are class c
of object in the image, view angle v and transformation θ [2]. . . . . 58
4.4 Abstract diagrams of probabilistic decoder and encoder models. . . 58
4.5 Model architecture of Variational Auto-Encoder [64]. . . . . . . . . 60
5.1 VAE and β-VAE hard-coded structure and prior distribution of the intermediate layer . . . . . . . . . . . . . . . . . . . . 70
5.2 Qualitative evaluation of the reconstruction and generation capability of β-VAE models. In each subplot, the top two rows are images decoded from μ_φ(z|x), and the bottom two rows are images decoded from z ∼ P_θ(z). . . 71
5.3 The representation of x is split into a subset Y that encodes label information and another subset Z that encodes other variations. Optimizing supervised and adversarial objectives encourages {Y, Z} to be informative of x and encourages Y and Z to be independent. The black, blue and red arrows represent encoding operations, decoding operations and loss functions. . . . . . . . 73
5.4 Graphical representation of the model with unknown block structure on the prior. α, σ_d, σ_g, σ_C are the hyperparameters. {s_d}, {g_dl}, {C_nl} are the parameters of the prior model. {z_i} is the latent representation, and {x_i} are our observations. . . . . . . . . . . . 75
5.5 Samples from three generative models of MNIST dataset. Most of
them look like real digits. . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6 Generated samples from four different models on the Multi-PIE dataset. Each model has 3 subplots. The first factor is identity, the second factor is brightness and the third factor is view angle. . . . . . . 80
5.7 Visual demonstration of high level ideas in factor-VAE model. . . . 84
5.8 The representations of X in concept space and feature space are
both inferred with deep neural network and re-parameterization
trick. The conditional distribution of feature given concepts is also
modeled with deep neural network. . . . . . . . . . . . . . . . . . . 85
5.9 Reconstruction and randomly sampled images using biVAE . . . . . 87
5.10 Reconstruction and randomly sampled images using biVAE . . . . . 88
5.11 biVAE outperforms VAE in 17 of 18 attributes. . . . . . . . . . . . . 88
5.12 biVAE discovered ten clusters in the MNIST dataset. The left column shows 10 cluster centers. The right columns show the variation inside each cluster. The variations inside each cluster are caused by the noise added to z ∼ P_θ(z|c). . . . . . . . . . . . 91
A.1 By perturbing the image with a small amount of noise, the model misclassified a human as a beaver. (Left): original image; (Middle): perturbation noise; (Right): perturbed image, recognized by the classifier as a beaver . . . . . . . . 107
Chapter 1
Introduction
After decades of development in theory and models, machine learning (ML) has in recent years been successfully applied to solve many academic and real-world problems [109, 94, 18]. The reasons for this include the rise of computational power, the availability of large amounts of data, and the invention of new deep learning (DL) algorithms [11, 13]. Deep learning has achieved excellent performance on many tasks in computer vision [67, 110, 45, 39], automatic speech recognition (ASR) [11, 50, 86, 106] and natural language processing (NLP) [83, 114, 129], and has made it possible to solve many tasks which used to be very challenging, e.g. generating images and captions [6, 61, 126] and playing the game of Go [109]. The advantages of DL models are largely attributed to their ability to learn informative representations that discover latent structure in unstructured data [13], through the design of suitable architectures [57, 54] and activation functions [87, 37]. Characterizing and understanding the properties of deep learning representations is thus an important research topic [4]. Currently, most works in this direction emphasize qualitative visualization [7, 9, 71, 34], and the analyses are done after models are trained. In this dissertation, we study the properties of representations during the model learning stage. We emphasize representations of deep generative models, integrate task-dependent regularization and auxiliary problems, and study quantitative evaluation of learned representations.
1.1 Representation Learning for Machine Learning
Machine learning methods, which build models of datasets and learn to optimize the models' performance on specific tasks, have been widely applied to tasks where an analytical mathematical formulation or explicit programming is not available [16, 84]. For example, logistic regression (LR) on web data was used in spam filtering [59] and ads click prediction [81], convolutional neural networks (CNN) were used in image classification and object detection [67, 110, 115, 45], and recurrent neural networks (RNN) were used in modeling, understanding and generating natural language [113, 68, 21]. As shown in Figure 1.1, building a machine learning model generally involves the following steps: (1) transform the raw attributes of input samples (e.g. the values of all pixels in an image, the list of all words in a sentence, etc.) into quantitative values, usually vectors, which are referred to as features or representations and denoted as X in the graph; (2) choose the model suitable for the specific task (e.g. a classifier for predicting labels of samples, or clustering analysis for grouping unlabeled samples, etc.); (3) quantify the objective of the task (e.g. cross-entropy loss or hinge loss for a classification task, residual sum of squares for k-means clustering, etc.) and learn the model parameters by optimizing it over the dataset.
Figure 1.1: Machine learning models learn from numerical features of data samples.

High quality features of data attributes are the foundation of good machine learning models. We usually expect the features to have the following properties. For ease of description, we use x, z and y to indicate data, features and tasks, respectively.
1. Accuracy: Good features capture and quantify useful information from input samples that is relevant to the task. From an information theoretic viewpoint [119], we expect to maximize the correlation (e.g. the mutual information I(z, y)) between features and predictions.
2. Efficiency: Good features are compact in space and can be calculated quickly. The efficiency of representations makes it possible to extend ML models to large scale or real-time data. Moreover, constraining the dimensionality of features is useful in discovering the dominant directions of variation in a dataset, and in improving robustness to noise in data samples.
3. Simplicity: Many high dimensional complex data are generated and transformed from simpler manifolds in lower-dimensional subspaces, e.g. Figure 1.2. Feature extraction that transforms data attributes back to the low-dimensional manifold helps discover the latent structure of the dataset and makes many follow-up tasks simpler.
Figure 1.2: 3D nonlinear Swiss Roll (left) and 2D linear manifold (right).
4. Interpretability: From datasets, humans can extract factors of variation and concepts. If different factors are distributed to subsets of features, and feature instances are grouped into concepts, the features will be easier to adapt to new tasks, and will be potentially helpful in image and language understanding tasks.
5. Invariance: Machine learning models are optimized on a finite training dataset and deployed on new testing samples. Good features of data samples are not affected by transformations irrelevant to the model objective. This makes a model perform well on new testing cases.
1.1.1 Learning to extract representations
Most traditional machine learning algorithms consist of two stages: feature extraction using manually defined functions, then prediction using these features as input. Designing and tuning the feature extractor (e.g. choosing suitable nonlinear kernels, choosing a suitable dimensionality reduction approach, etc.) requires thorough understanding of the dataset and extensive human effort.
Figure 1.3: Two machine learning approaches: classifier y = f(z) (top) and feature
extractor z = g(x) followed by classifier y = f(z) (bottom). The solid lines mean
inference, and dashed lines mean back-propagating model objectives to update model
parameters.
With deep learning models, or more generally with differentiable machines, it became popular to learn not only a model from features to final prediction, but also a model from raw input signals to intermediate features. This approach is usually called representation learning. We use a supervised classification model as an example to explain the difference between the two machine learning approaches in Figure 1.3. In the first one, the features z are usually calculated with some manually defined functions. Minimizing the classification loss L(f(z), y) only affects the parameters of the classifier f(z). In the second one, i.e. the representation learning approach, the features z are calculated from data x with differentiable functions g(x). The feedback from the model objective L(f(g(x)), y) to z can be further propagated to the feature extractor. Consequently, the feature extractor's parameters are adapted to both the task and the dataset.
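To make the distinction concrete, the following minimal PyTorch-style sketch (an illustration with assumed toy dimensions, not the models studied later) contrasts the two setups: with manually defined features the loss only updates the classifier f, while in the representation learning setup the loss is back-propagated through a learnable extractor g as well.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 100)          # a toy batch of raw inputs
y = torch.randint(0, 10, (32,))   # toy class labels

# Approach 1: manually defined (frozen) features, only the classifier f is learned.
fixed_features = torch.tanh(x @ torch.randn(100, 50))  # stand-in for a hand-crafted map
f = nn.Linear(50, 10)
loss1 = nn.functional.cross_entropy(f(fixed_features), y)
loss1.backward()                  # gradients reach f only

# Approach 2: representation learning, g and f are trained jointly.
g = nn.Sequential(nn.Linear(100, 50), nn.ReLU())
f2 = nn.Linear(50, 10)
loss2 = nn.functional.cross_entropy(f2(g(x)), y)
loss2.backward()                  # gradients reach both f2 and g
```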
The representation learning method is usually implemented with deep neural networks. One example is the deep layers in discriminative neural networks. For example, in the multi-layer fully connected neural network (FCNN) classifier y|x in Figure 1.4a, through consecutive non-linear operations, the raw data attribute x is mapped to hidden layers z_1, z_2 and z_3. Finally a simple linear classifier f(z_3) attempts to match the label y. The parameters of the operations mapping x to {z_i} and predicting y from z_3 are learned jointly via back-propagation to minimize the classification loss. All hidden layers z_i are informative of the input x. Compared to the shallower layers, the deeper layers are better at encoding the complex nonlinearity and abstract concepts from the input. The deepest hidden layer is fed to the final classifier, and is often referred to as the representation of x in the context of supervised learning.

Figure 1.4: Simple discriminative and generative neural networks. (a) Example of a simple neural network for classification. (b) Example of a simple neural network for encoding and reconstruction.
Another example is the middle layer in generative neural networks of encoder-decoder architecture, as in Figure 1.4b. The first part of the model, mapping x to z, is called the encoder, and the next part, mapping z back to x, is called the decoder. Usually both the encoder and the decoder stack multiple nonlinear layers to be expressive. The encoding and decoding parameters are learned jointly to optimize the reconstruction quality. This implies that (1) z contains enough information about x and (2) the information stored in the layer z can be processed to reconstruct the data. For the purpose of robustness and efficiency, z is a bottleneck layer of compact dimensionality. This layer is also often referred to as the representation of x in the context of generative models and unsupervised learning.
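As a concrete illustration of such an encoder-decoder bottleneck (a minimal sketch with assumed toy dimensions, not an architecture used in this dissertation):

```python
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=16):
        super().__init__()
        # encoder: x -> z (compact bottleneck)
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck_dim))
        # decoder: z -> reconstruction of x
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = BottleneckAutoencoder()
x = torch.rand(8, 784)                       # toy batch
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)      # reconstruction objective
loss.backward()                              # trains encoder and decoder jointly
```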
1.2 Motivation
Deep neural networks are usually treated as black boxes between input data and prediction because the parameters in intermediate layers are learned from data without human intervention. There is a trade-off between effectiveness and interpretability. The representations are learned to adapt to the data and task, and thus usually achieve better accuracy than manually defined features; on the other hand, since the parameters' values are learned from data, they are less interpretable than manually defined feature extractors. Understanding deep representations is thus an important research topic. Currently, most works toward understanding neural representations use visualization and attribution methods:
1. Visualizing distributions or interactions between the representation vectors of data samples is useful to qualitatively test whether representations encode semantic information. For example, the nonlinear dimensionality reduction method t-SNE [122] is widely used to project the high dimensional representation vectors of a dataset to points in a 2D plot. Good representations usually group samples of the same class close to each other and produce clusters corresponding to classes in the plot. Word analogy is another famous example of representation visualization. In the NLP domain, words are usually represented with unique embedding vectors. For example, [83] reported that the differences between pairs of embedding vectors from {(women, men), (girl, boy), (aunt, uncle)} have similar values. The fact that embed(girl) + (embed(uncle) − embed(aunt)) ≈ embed(boy) is evidence that representations of words contain useful semantics (a small computational sketch of this analogy test appears after this list).
2. In the computer vision domain, visualizing the patterns that activate hidden neurons can qualitatively explain how deep networks process raw attributes into high level concepts, e.g. object categories. For example, in deep CNN models for visual object classification, the neurons in lower layers are sensitive to various textures, and the neurons in deeper layers are sensitive to parts of objects [10, 88]. This implies that in lower layers, the activations of neurons detect textures in images. In deeper layers, after assembling and processing signals from lower layers, the activations of neurons can detect whether parts of objects from specific classes exist in the image, and are a very useful representation for the image classification task.
3. Attribution methods attempt to quantify the contribution of specific neurons (including raw signals in the lowest layer of the neural network) to the final predictions. When applied to NLP models, they usually produce scores for individual words to quantify how much each word is in favor of or against the final prediction; when applied to image classification models, they usually identify image pixels or regions in favor of or against the final prediction. By manually checking these scores, we can qualitatively judge how well models capture the useful parts of the raw data. Although these methods can quantify relevance between attributes and predictions, there are no ground truth relevance scores, so for the task of model interpretation, this is still a qualitative approach.
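The word-analogy test mentioned in item 1 can be checked directly on any set of embedding vectors; the sketch below uses toy vectors constructed to have the additive structure reported for real embeddings (with trained embeddings one would simply look up the learned vectors instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: a shared "gender" offset plus word-group base vectors,
# mimicking the additive structure reported for real word embeddings.
gender = rng.normal(size=50)
base_child, base_sibling = rng.normal(size=50), rng.normal(size=50)
embed = {
    "boy":   base_child,
    "girl":  base_child + gender,
    "uncle": base_sibling,
    "aunt":  base_sibling + gender,
}

# Analogy: girl + (uncle - aunt) should land near boy.
query = embed["girl"] + (embed["uncle"] - embed["aunt"])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

nearest = max(embed, key=lambda w: cosine(query, embed[w]))
print(nearest)  # expected: "boy"
```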
1.3 Applications and Contributions
Most existing works on understanding deep models attempt to qualitatively analyze properties of representation vectors, or to analyze correlations between hidden neurons and observable patterns [7, 9, 71, 34]. These methods work after deep models are trained and attempt to explain why a specific model works well or not. They do not affect the learning of model parameters, so the projection from input data to final prediction is still a black box. Recently, deep generative models [2, 64, 39, 6] became a hot research topic, and learning to generate datasets from the true distribution potentially creates new ways toward understanding deep learning and deep representations. Meanwhile, theoretical advances such as the information bottleneck method [3, 130, 118, 119, 108] suggested a theoretical foundation for understanding deep learning and ways of regularizing deep models.
In this dissertation, we explored approaches to learning useful and interpretable representations using information theoretic methods, in supervised learning and unsupervised learning applications. For each of these two topics, we first review related works from an information theoretic viewpoint, then propose new methods using information theoretic principles, and finally evaluate them empirically.
The contributions of this dissertation are:
Supervised learning application  We reviewed the information bottleneck principle for deep learning, and used it to analyze and enhance supervised models in their real world applications. Specifically,
1. Starting from the information bottleneck principle, we derived a theoretical explanation for the effectiveness of entropy regularized log-likelihood (ERLL) for model selection, which we observed in a large scale acoustic model.
2. Empirically, we studied the advantages and disadvantages of using ERLL for training supervised models. We demonstrated improving the accuracy of an already very accurate model by carefully tuning information regularization during model learning.
Unsupervised learning application
1. The Variational Auto-Encoder (VAE) is one of the most important deep generative models. Using the minimum description length (MDL) principle, we provided a cleaner explanation of VAE's inefficiency in learning informative representations, and demonstrated using MDL to analyze variants of VAE models.
2. Learning representations that can disentangle the factors of variation in a dataset is beneficial for model interpretation and for model adaptation to new applications. We proposed a new framework for factor disentanglement that simultaneously learns a concept representation and a data representation. It is based on an information theoretic foundation and integrates probabilistic graphical models into neural networks. Compared to existing approaches, it requires much less human effort to define factors in datasets. We also introduced new quantitative evaluation metrics to evaluate a model's capability of concept discovery. Empirically, we applied our model to face image generation and demonstrated that it achieves better overall performance in data generation, representation interpretability and concept discovery.
Chapter 2
Information Bottleneck Principle
and Applications
Deep learning models are good at learning from data. Since 2006, deep learning has advanced the state-of-the-art in various domains and become a dominant method in machine learning research. In the computer vision domain, deep belief networks (DBN) [52] were demonstrated to learn hidden representations that are good at encoding data distributions and can outperform shallow models on image classification tasks. Later, convolutional neural networks (CNN) such as AlexNet [67] and their variants (e.g. VGGNet [110], GoogLeNet [115], ResNet [45] and DenseNet [56]) kept renewing the records in image classification and object detection. In the natural language processing (NLP) domain, context models [83, 82, 38], recursive neural networks [111] and CNNs [62] were implemented to learn embeddings of single words or word sequences for the tasks of semantic analysis, sentiment analysis, sentence compression and generation, etc. The success of word embedding methods was followed by a revival of the Recurrent Neural Network (RNN) [40], which can explicitly model the contextual dependence between tokens in long sequences, and the Long Short-Term Memory model (LSTM) [54], which can model longer sequences by introducing gates to the RNN. In the domain of speech recognition, deep neural networks (DNNs) have significantly advanced the state-of-the-art in automatic speech recognition (ASR) [11, 50, 86, 106].
Despite the success of deep learning models on various benchmark datasets and tasks, understanding the mechanism behind their good performance is still challenging. One direction toward understanding deep learning models is analyzing the
properties of intermediate layers, i.e. the representation. Empirical studies have
shown that the representation layers, usually of much lower dimensions than raw
input signals, can extract useful and interpretable information from input signals:
1. High quality representation layers are critical to exploiting the expressive power of deep networks. Theoretically, three-layer neural nets with sigmoidal hidden units can approximate any posterior function to arbitrary accuracy [36]. However, in practice, for a long time it was very hard to learn good network parameters, and neural networks were not able to outperform other classification models [87, 54], e.g. SVM [16]. The pioneering deep learning model, the DBN [52], demonstrated that layer-wise pre-training can significantly improve deep neural networks' performance. The key innovation is that the representation layers learned in the pre-training stage can model the distribution of the input dataset well. It opened the door to learning better prediction models by learning better representations [12].
2. Many deep networks trained on large scale datasets have good generalization capabilities. For example, the VGG convolutional neural network [110] pre-trained on the ImageNet dataset (ImageNet Large-Scale Visual Recognition Challenge, ILSVRC [98]), used as a feature extractor and combined with a linear SVM classifier, generalized well to other image classification tasks. One heuristic explanation for this generalization capability is that hidden neurons in the intermediate layers of a supervised neural network encode semantics aligned with labeled concepts [12, 10]. Extensive experimental studies supported this hypothesis. For example, in the vision domain, [10] showed that in main-stream deep image classification models there are units corresponding to various visual concepts, and in the NLP domain, [71] demonstrated that some hidden neurons in an LSTM [54] language model correspond to sentiment intensification and negation.
3. Many successful deep models used the methodology of extracting only the useful information from a large amount of input information; the representative works are models with attention mechanisms [75, 68, 127, 85]. For example, the attention mechanism was implemented to discover the patch useful for a specific prediction task from the whole image in the vision domain [127, 85], and to discover informative keywords [75, 46] from long sentences or texts in the NLP domain, and it significantly improved performance in many applications, e.g. image classification [85], machine translation [75] and question answering [22, 68].
The above examples show that learning representations capable of characterizing the data distribution, discovering latent concepts, and extracting the most useful information is important for learning good deep models. Theoretically, the information bottleneck (IB) principle [118, 108] is a framework that explains these properties of representations. In this chapter, we will review this principle and its application in regularizing machine learning models.
2.1 Information Bottleneck Principles
In many machine learning models, we have an input random variable X and
an output random variable Y, and we wish to use X to predict Y. For example,
in an image classification task, X represents images of objects, and Y represents a collection of candidate object names. Throughout this dissertation, we will refer to the model's random variables with upper case X and Y, and refer to data instances with lower case x and y. We assume the observed set of N instances {x_n, y_n}_{n=1,...,N} are independently and identically distributed according to an unknown data distribution P_D(X, Y), as in Figure 2.1a. We usually approximate P_D with empirical distributions (X is usually high dimensional and continuous, so an empirical estimate of P_D(x) like eq. (2.1) is not a good approximation; we are not focusing on the estimation of P_D(x)):
P_D(\bar{y}) = \sum_{n=1}^{N} \delta(y_n, \bar{y}) / N    (2.1)

\int_{P_D(x,y)} f(x,y)\, dx\, dy = \sum_{n=1}^{N} f(x_n, y_n) / N    (2.2)
Extracting the subset of information in X that is relevant to Y is useful in building good machine learning models. For example, to classify an image into one of two classes, vehicle or airplane, the binary features "containing wheels or not" and "containing wings or not" are informative features. They are good enough to make accurate predictions, and will perform well on testing images of new vehicles or airplanes that were never seen in the training dataset, i.e. they are robust to over-fitting. Note that extracting the subset relevant to Y means compression of the information in X.
Extracting useful information as a representation of X is naturally implemented in neural networks, and we will adopt the terminology of neural network models. We use Z to refer to random variables that characterize X. We call the mapping from X to Z, p(z|x), the encoding stage, call Z the bottleneck layer, and call the mapping from Z to Y, p(y|z), the decoding stage, as shown in Figure 2.1b.
[119, 108] formulated representation learning for machine learning as an information theoretic optimization problem. They used the mutual information I_θ(Z;Y) to
quantify the information retained in Z about Y, and the mutual information I_θ(Z;X) to quantify how much Z compresses X. The task of learning a Z that is informative of Y while discarding irrelevant information from X is formulated as optimizing the Lagrangian:

\max_{\theta} \; I_\theta(Y;Z) - \beta\, I_\theta(X;Z)    (2.3)

This optimization problem does not have a closed form solution. However, it inspired new algorithms in both supervised and unsupervised learning, by bounding and approximating the mutual information. We will review applications of the information bottleneck in this chapter, and present our own works in future chapters.

Figure 2.1: Learning the representation Z in the discriminative model Y|X. (a) Joint distribution p_D(x,y) of the labeled dataset D = {(x,y)}. (b) We wish to predict Y from X through p(y|x). (c) The encoder p_θ(z|x) extracts the representation z from the input x, and the decoder p_φ(y|z) predicts y from z. P_D and p_θ(z|x) induce another decoding distribution p_θ(y|z) = \int p_D(x,y)\, p_\theta(z|x)\, dx \,/\, \int p_D(x)\, p_\theta(z|x)\, dx.
2.2 Application on Supervised Learning
In deep neural networks (e.g. Figure 2.2) with L intermediate layers between X and Y, each intermediate layer Z_i, i = 1, ..., L, processes input only from the previous layer Z_{i-1} or X. Due to the Markovian properties through the layers, the following derivations work for all Z layers, and also for arbitrary subsets of the L layers.
Figure 2.2: Representation learning in neural network classification models, depicted as the Markov chain X → Z_1 → Z_2 → Z_3 → Y with transitions p(z_1|x), p(z_2|z_1), p(z_3|z_2), p(y|z_3): each layer Z_i summarizes some but not all of the information present in the previous layer, Z_{i-1} or X, while retaining as much information relevant to the output Y as possible [5].
maximizing I(Z;Y) to minimize cross-entropy  The cross entropy is the most widely used objective in training classification models. It can also be derived from maximizing I_θ(Z;Y) in the information bottleneck principle, by decomposing the mutual information into a sum of a cross-entropy term and a KL-divergence between P_θ(y|z) and P_φ(y|z):
\int p_\theta(y,z) \log \frac{p_\theta(y|z)}{p_\theta(y)}\, dy\, dz
= \int P_\theta(y,z) \log \frac{P_\theta(y|z)\, P_\phi(y|z)}{P_\theta(y)\, P_\phi(y|z)}\, dy\, dz
= H(P_D(y)) + \int P_\theta(y,z) \log P_\phi(y|z)\, dy\, dz + \int P_\theta(y,z) \log \frac{P_\theta(y|z)}{P_\phi(y|z)}\, dy\, dz
= H(P_D(y)) + E_{P_D(x,y)}\big[ P_\theta(z|x) \log P_\phi(y|z) \big] + E_{P_\theta(z)}\big[ KL[P_\theta(y|z) \,\|\, P_\phi(y|z)] \big]    (2.4)
\geq E_{P_D(x,y)}\big[ P_\theta(z|x) \log P_\phi(y|z) \big]    (2.5)

The cross entropy E_{P_D(x,y)}\big[ P_\theta(z|x) \log P_\phi(y|z) \big] is a lower bound of the mutual information. Suppose all optimization problems can be solved perfectly. Then maximizing it w.r.t. θ and φ maximizes the mutual information:
1. Given fixed parameters θ, maximizing the cross entropy (w.r.t. φ) leads to

E_{P_\theta(z)}\big[ KL[P_\theta(y|z) \,\|\, P_\phi(y|z)] \big] = 0,

which implies that the classification model P_φ(y|z) is an accurate estimate of the intractable ground truth distribution P_θ(y|z).

2. Given fixed parameters φ satisfying P_θ(y|z) = P_φ(y|z), maximizing the cross entropy (w.r.t. θ) increases I_θ(Y;Z), because both E_{P_D(x,y)}\big[ P_\theta(z|x) \log P_\phi(y|z) \big] and E_{P_\theta(z)}\big[ KL[P_\theta(y|z) \,\|\, P_\phi(y|z)] \big] are increased.
By running these two steps iteratively until convergence, we can learn θ* that maximizes the mutual information I_θ(Y;Z). Meanwhile, we can learn φ* that matches the true distribution P_{θ*}(y|z) (in practice, we are not able to optimize the parameters perfectly, so P_{φ*}(y|z) is very similar, but not identical, to P_{θ*}(y|z)). In practice, when learning neural network models, θ and φ are usually optimized simultaneously and numerically.

minimizing I(Z;X) to regularize model  The constraint term, −β I_θ(X;Z), in the information bottleneck formulation (2.3) was used to design information regularizations on Z. The intractable I_θ(X;Z) can be upper bounded as follows:
I_\theta(X;Z) = \int p_\theta(x,z) \log \frac{p_\theta(z|x)}{p_\theta(z)}\, dx\, dz
= \int p_D(x)\, p_\theta(z|x) \log \frac{p_\theta(z|x)}{q(z)}\, dx\, dz + \int p_\theta(z) \log \frac{q(z)}{p_\theta(z)}\, dz
= E_{p_D(x)} KL\big[ p_\theta(z|x) \,\|\, q(z) \big] - KL\big[ p_\theta(z) \,\|\, q(z) \big]
\leq E_{p_D(x)} KL\big[ p_\theta(z|x) \,\|\, q(z) \big]
\approx \frac{1}{N} \sum_{n=1}^{N} KL\big[ p_\theta(z|x_n) \,\|\, q(z) \big]    (2.6)

where q(z) is an auxiliary distribution on the random layer Z. Approximating I_θ(X;Z) with E_{p_D(x)} KL[ p_θ(z|x) || q(z) ] and minimizing it requires the distribution of z, p_θ(z|x), to be close to a prior distribution q(z).
The Deep Variational Bottleneck model [3] proposed to model p_θ(z|x) and q(z) with Gaussian distributions. The information bottleneck objective function is thus

\arg\max_{\theta,\phi} \sum_{n=1}^{N} \Big[ P_\theta(z|x_n) \log P_\phi(y_n|z) \Big] - \beta\, KL\big[ p_\theta(z|x_n) \,\|\, q(z) \big]    (2.7)

Without loss of generality, define p_\theta(z|x) = \mathcal{N}(\mu(x;\theta), \sigma^2(x;\theta)) and assume q(z) = \mathcal{N}(0, I). This simplifies KL[ p_\theta(z|x) \| q(z) ] to \frac{1}{2} \sum_{d=1}^{D} \big[ -\log\sigma_d^2 + \sigma_d^2 + \mu_d^2 - 1 \big], whose derivatives can be easily calculated in back-propagation. The learning algorithm is as follows, and we will implement it at the end of chapter 3.
Algorithm 1: Learning a classifier with a Gaussian variational bottleneck.
  initialize θ, φ;
  while θ, φ not converged do
    for each pair of training samples {x_n, y_n} do
      Calculate μ(x_n; θ), σ²(x_n; θ);
      Sample z_ℓ ∼ N(μ, σ²) for ℓ = 1, ..., L;
      Calculate L_1 = (1/L) Σ_{ℓ=1}^{L} log P_φ(y_n | z_ℓ);
      Calculate L_2 = (1/2) Σ_{d=1}^{D} [ −log σ_d² + σ_d² + μ_d² − 1 ];
      θ, φ ← minimizer(−L_1 + β L_2; θ, φ);
The Gaussian distribution assumption is very restrictive, as it requires that no nonlinear activation be applied on Z. Thus, [3] applied it only on the last intermediate layer Z_L. [1] relaxed this constraint: it proposed a log-uniform distribution as the prior model q(z) when rectified linear unit (ReLU) [87] activations are applied on Z, and a log-normal prior model q(z) when Softplus activations are applied on Z.
Both [3] and [1] compared classification models with and without information bottleneck layers on image classification tasks. Improvements on simple datasets, e.g. MNIST and cluttered MNIST, were reported. However, the information bottleneck models failed to improve the state-of-the-art models: [3] failed to improve ResNet on the ImageNet dataset, and while [1] declared that the information bottleneck improved the benchmark Cifar-10 classification accuracy, it was comparing against a very weak baseline model.
In the next chapter, we will derive a new way of bounding the information bottleneck objective, and demonstrate using first order entropy to tune classification models. We will experiment with both vision and speech datasets, and compare with very competitive baseline models.
Chapter 3
Information Regularization in
Supervised Classification Models
In this chapter, we conduct an empirical study of information regularization, and use the information bottleneck principle to explain the experimental results. At the beginning of this chapter, we briefly review our work on a large scale kernel speech model [73], where we found that first order entropy is a good metric for acoustic model selection. We then apply the information bottleneck principle to explain this observation. Finally, we study how to tune entropy to learn good supervised models.
3.1 Background: Entropy in Acoustic Classification Models
One way toward understanding data representations is comparing the representations of the same data in different models [8]. In contrast to deep learning, whose mechanism is a black box, kernel methods have a solid theoretical foundation. Moreover, kernel methods are powerful for modeling highly nonlinear data [105] and have achieved performance comparable to deep learning [74] in many applications. Thus, by comparing how kernel and deep learning models are similar to and different from each other, we may discover some reasons that explain the advantages of deep neural networks (DNN) in specific applications. In this section, we report empirical results of scaling up kernel methods on a big dataset toward the objective of matching the performance of DNNs. We chose the ASR task and trained both kernel and neural network models for frame-level acoustic modeling, focusing on a comparison of their speech recognition performance. We found that the DNN model can predict acoustic states with both good accuracy and high confidence.
3.1.1 Comparing large scale kernel models and DNN
The challenge in comparing kernel and neural network models on large scale
datasets is scaling up the kernel models. The computational complexity of exact
kernel methods depends quadratically on the number of training examples at train-
ing time and linearly at testing time. Hence, scaling up kernel methods has been
a long-standing and actively studied problem [19, 30, 91, 120, 25]. Exploiting
structures of the kernel matrix can scale kernel methods to 2 million to 50 million
training samples [112].
In theory, kernel methods provide a feature mapping to an infinite dimensional
space. But for any practical problem the dimensionality is bounded above by
the number of training samples. Approximating kernels with finite-dimensional
features has been recognized as a promising way of scaling up kernel methods. The
most relevant approach for our work is the observation [92] that inner products
between features derived from random projections can be used to approximate
translation-invariant kernels [15, 105, 92]. Follow-up work on using those random
features (“weighted random kitchen sinks” [93]) is a major inspiration for our work.
There has been a growing interest in using random projections to approximate
different kernels [60, 43, 69, 124].
For acoustic modeling, we can plug the random feature vector \hat\phi(x) (converted from frame-level acoustic features) into a multinomial logistic regression model. Specifically, our model is a special instance of the weighted sum of random kitchen sinks [93]:

p(y = c \mid x) = \frac{\exp\big(\theta_c^{\mathsf T} \hat\phi(x)\big)}{\sum_{c'} \exp\big(\theta_{c'}^{\mathsf T} \hat\phi(x)\big)}    (3.1)
Figure 3.1: Kernel model using random approximation can be seen as a shallow neural
network [73].
Figure 3.2: Linear factorization of the model parameters θ_c reduces model complexity and improves generalization performance [73].
We conducted extensive empirical studies comparing kernel methods to deep
neural networks (DNNs) on typical ASR tasks. We trained both DNNs and
kernel-based multinomial logistic regression models to predict context-dependent
HMM state labels from acoustic feature vectors. The acoustic features are 360-
dimensional real-valued dense vectors, and are a standard speaker-adapted repre-
sentation used by IBM [66]. The state labels are obtained via forced alignment
using a GMM/HMM system.
In [73] we tested these models on three datasets. The first two are the IARPA Babel Program Cantonese (IARPA-babel101-v0.4c) and Bengali (IARPA-babel103b-v0.4b) limited language packs, each pack containing a 20-hour training set and a 20-hour test set. The third dataset is a 50-hour subset of Broadcast News (BN-50) [65, 100]. In this dataset, each sample is in one of 5,000 phonetic states. Among these 50 hours of audio, 45 hours are used for training and 5 hours are a held-out set, and there is a test set of 2 hours of audio. This is a well-studied benchmark task in the ASR community due to both its convenience and its relevance to developing core ASR technology. In this dissertation, we will focus on the BN-50 dataset. Results and technical details on the first two language packs can be found in [73, 80].
For kernel-based models, we used Gaussian kernels to extract the approximate features \hat\phi. Kernel models have 3 hyperparameters: the bandwidth of the Gaussian kernel, the number of random projections, and the step size of the (convex) optimization procedure (adjusting it has a similar effect to early stopping). As a rule of thumb, the kernel bandwidth ranges from 0.3–5 times the median of the pairwise distances in the data (with 1 times the median working well). The acoustic state accuracy is stable at 25,000 random features, and further increasing the dimension of the random features has diminishing returns in accuracy at a high cost in time efficiency. When parameters are optimized with stochastic gradient descent (SGD), using a large initial learning rate is preferred, as long as it does not cause exploding gradients. Finally, imposing regularization (e.g. l2 regularization) on the learnable parameters is important to avoid over-fitting the kernel model.
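As an illustration of the model in (3.1), the following NumPy sketch builds Gaussian-kernel random Fourier features with the median-distance bandwidth heuristic described above; the data, feature dimension and number of classes are toy values, and the softmax weights are left untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 360))           # toy "acoustic" feature vectors

# Median heuristic for the Gaussian kernel bandwidth.
idx = rng.integers(0, len(X), size=(1000, 2))
bandwidth = np.median(np.linalg.norm(X[idx[:, 0]] - X[idx[:, 1]], axis=1))

# Random Fourier features approximating k(x, x') = exp(-||x - x'||^2 / (2 b^2)).
D = 2000                                  # number of random features (toy value)
W = rng.normal(scale=1.0 / bandwidth, size=(X.shape[1], D))
b = rng.uniform(0, 2 * np.pi, size=D)
phi_hat = np.sqrt(2.0 / D) * np.cos(X @ W + b)   # \hat{phi}(x)

# Multinomial logistic regression on top of the random features (eq. 3.1).
C = 10                                    # toy number of phonetic states
theta = rng.normal(scale=0.01, size=(D, C))
logits = phi_hat @ theta
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)         # p(y = c | x)
```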
For the DNN, we first pre-trained it as stacked restricted Boltzmann machines [52], then fine-tuned it using SGD. The best model has 4 hidden layers, with 2000 hidden units per layer. We also used another DNN model from IBM's Attila package, which was layer-wise discriminatively trained [107, 66]. It contains five hidden layers, each of which contains 1024 units with logistic nonlinearities.
We refer to the kernel based model as kernel, our DNN model as DNN-rbm, and IBM's model as DNN-ibm. All layers in both DNN models are fully connected, and all intermediate layers are activated with the sigmoid function. The outputs of both types of models have 5000 softmax units, corresponding to the context-dependent HMM states clustered using decision trees.
We used three metrics to evaluate the acoustic models. The first two are direct measures of phonetic state classifier performance; a small computational sketch of them appears after this list. The third one is an indirect measure of classifier performance, but is the metric we care most about in the ASR pipeline.
1. Perplexity Given examples {(x_i, y_i), i = 1, ..., m}, the perplexity is defined as

ppx = \exp\Big\{ -\frac{1}{m} \sum_{i=1}^{m} \log p(y_i \mid x_i) \Big\}
2. Accuracy The classification accuracy is defined as

acc = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\big[\, y_i = \arg\max_{y \in \{1,2,\cdots,C\}} p(y \mid x_i) \,\big]
Table 3.1: Comparison in perplexity (ppx) and accuracy (acc: %) on BN-50 validation
dataset. The kernel model used 100K random features
Model ppx acc TER
DNN-ibm 7.4 50.8 16.7
DNN-rbm 6.7 52.7 16.6
kernel 7.3 51.2 18.6
3. Token Error Rate (TER) We feed the predictions of acoustic models,
which are real-valued probabilities, to the rest of the ASR pipeline and cal-
culate the misalignment between the decoder’s outputs and the ground-truth
transcriptions. For BN-50, the error is the word-error-rate (WER).
Table 3.1 contrasts the best perplexity and accuracy attained by the various acoustic models on the held-out set. Note that the cross-entropy error (i.e., the logarithm of the perplexity) is the training criterion of these models. Thus, ppx correlates well with classification accuracy. Moreover, the performances of these 3 models are close to each other in ppx and acc. However, the TER of the kernel model is noticeably worse than that of the two DNN models.
3.1.2 Tradeoff between perplexity and entropy
One possible explanation is that the perplexity might be an inadequate proxy for TER. As the predictions are probabilities to be combined with language models, we can capture the characteristics of the predictions using the entropy, as it considers the posterior probabilities assigned to all labels, while the perplexity (as a training criterion) focuses only on the posterior probability assigned to the correct state label. The entropy measures the degree of confusion in the predictions, which could have interplayed with the language models.
Figure 3.3 plots the progression of several DNN models in perplexity and entropy (each cyan colored line corresponds to a model's training course). We also plot with colored markers the WERs evaluated at the end of every four epochs.

Figure 3.3: Training DNN acoustic models to minimize perplexity is not sufficient to arrive at the best WER – after the typical early-stopping point where the perplexity is lowest, continuing to train to increase the perplexity but decrease the entropy leads to the best WER.

Table 3.2: Regularized perplexity is a better model selection criterion (entries are the WERs of the models selected by each criterion).
Model    perplexity    regularized perplexity    Oracle
rbm      16.6          16.1                      15.8
kernel   18.6          17.5                      17.5
Clearly, in the beginning of the training, both the entropy and the perplexity
decrease, which also corresponds to an improving WER. Note that, using per-
plexity for early-stopping — a common practice in training multinomial logistic
regression model — will result in models that are designated by the blue colored
points on the leftmost of the plot. However, those models have sub-optimal WERs
as continuing the training to have an increased perplexity but in exchange for a
decreased entropy results in models with better WERs (the red colored points).
We observe a similar tradeoff in training kernel-based acoustic models, as shown in Figure 3.4. Similarly, the WER depends jointly on the perplexity and the entropy, and the best perplexity or entropy does not result in the best WER. Note that when decoding, we tune the scaling of the acoustic scores. Thus, balancing perplexity and entropy cannot be trivially achieved by scaling the inputs to the softmax.
Figure 3.4: Similar to training DNN acoustic models, as in figure 3.3, training kernel
models also has tradeoff between perplexity and entropy where the lowest perplexity
does not correspond to the lowest WER.
Figure 3.5: WER (y-axis) is almost linear in the regularized perplexity (x-axis), for both the kernel and the DNN models.
3.1.3 ERLL: a new model selection criterion
Figures 3.3 and 3.4 suggest that we should select the best acoustic model using both perplexity and entropy. Figure 3.5 shows that it is possible to predict WER from the entropy-regularized log-likelihood (ERLL), defined as the negative log-likelihood plus the entropy, namely
-\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} I(k = y_n) \log P(y = k \mid x_n) \;-\; \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} P(y = k \mid x_n) \log P(y = k \mid x_n)
= -\frac{1}{N} \sum_{n} \sum_{k=1}^{K} \big[\, I(k = y_n) + P(y = k \mid x_n) \,\big] \log P(y = k \mid x_n)    (3.2)
which has an almost linear relationship with the WER.
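The ERLL of equation (3.2) can be computed directly from predicted probabilities; a minimal self-contained NumPy sketch (toy inputs, for illustration only, not the evaluation code used for the reported numbers):

```python
import numpy as np

def erll(probs, labels):
    """Entropy-regularized log-likelihood (eq. 3.2): NLL plus mean prediction entropy."""
    m = len(labels)
    nll = -np.log(probs[np.arange(m), labels]).mean()             # cross-entropy term
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()     # entropy term
    return nll + ent

probs = np.array([[0.9, 0.05, 0.05],
                  [0.4, 0.4, 0.2]])
labels = np.array([0, 1])
print(erll(probs, labels))
```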
Table 3.2 illustrates the advantage of using this regularized perplexity on the held-out set to select models – for both kernel and DNN acoustic models, their WERs are improved, and the improvement on the kernel model is substantial (a 1% absolute WER reduction).
3.2 Interpreting ERLL
In the previous section, we reported that ERLL is a good acoustic classification model selection metric. In this section, we use the information bottleneck principle to explain the advantage of ERLL as a model selection metric.
Figure 3.6: In a supervised NN model, the representation z and the classifier p_φ(y|z) are learned to approximate the intractable predictor p_θ(y|z). (a) Joint distribution p_D(x,y) of the labeled dataset D = {(x,y)}. (b) The parametric encoder p_θ(z|x) extracts the representation z from the input x; the predictor p_θ(y|z) is intractable to calculate. (c) The intractable predictor p_θ(y|z) is approximated with p_φ(y|z).
In chapter 2, we have shown that the decomposition of I_θ(Z;Y) into

E_{P_D(x,y)}\big[ P_\theta(z|x) \log P_\phi(y|z) \big] + E_{P_\theta(z)}\big[ KL[P_\theta(y|z) \,\|\, P_\phi(y|z)] \big]

explains why the cross-entropy E_{P_D(x,y)}\big[ P_\theta(z|x) \log P_\phi(y|z) \big] is a good objective for learning the representation parameters θ and the classification parameters φ. Now we use another decomposition of the mutual information, I_\theta = H_D(Y) - H_\theta(Y|Z), to explain why ERLL is a useful model selection metric:
H_D(Y) - H_\theta(Y|Z)
= H_D(Y) + \int P_\theta(y,z) \log P_\theta(y|z)\, dy\, dz
= H_D(Y)
  + \beta \int P_\theta(y,z) \log \frac{P_\theta(y|z)}{P_\phi(y|z)}\, dy\, dz
  + \beta \int P_\theta(y,z) \log P_\phi(y|z)\, dy\, dz
  + (1-\beta) \int P_\theta(y,z) \log P_\theta(y|z)\, dy\, dz
= H_D(Y)
  + \beta \int P_\theta(z)\, KL\big[ P_\theta(y|z) \,\|\, P_\phi(y|z) \big]\, dz
  + \beta \int P_\theta(y,z) \log P_\phi(y|z)\, dy\, dz
  - (1-\beta) \int P_\theta(z)\, ENT\big[ P_\theta(y|z) \big]\, dz    (3.3)
Note that in the special case β = 1, ignoring the model independent entropy H_D(Y), the decomposition (3.3) reduces to the decomposition (2.4).
Suppose the optimization problems can be solved perfectly. Similar to the analysis of decomposition (2.4), if we learn {θ, φ} by maximizing the cross entropy

\int P_\theta(y,z) \log P_\phi(y|z)\, dy\, dz,

the optimal classifier parameters φ satisfy P_\theta(y|z) = P_\phi(y|z), which implies

KL\big[ P_\theta(y|z) \,\|\, P_\phi(y|z) \big] = 0,

and

\int P_\theta(z)\, ENT\big[ P_\theta(y|z) \big]\, dz = \int P_\theta(z)\, ENT\big[ P_\phi(y|z) \big]\, dz    (3.4)
Thus the entropy regularized log-likelihood differs from the mutual information by a model independent constant, and the ERLL

-\beta\, E_{P_D(x,y)} E_{P_\theta(z|x)} \log P_\phi(y|z) + (1-\beta)\, E_{P_D(x,y)} E_{P_\theta(z|x)}\, ENT\big[ P_\theta(y|z) \big]    (3.5)

is a good metric measuring how well the mutual information I_θ(Y;Z) is maximized. When β = 1/2, equation (3.5) is the metric we used in acoustic model selection.
Please note that (1) θ contains the parameters for extracting the representation z and the parameters for the label inference y|z, as illustrated in Figure 3.6; (2) in Section 3.1, while ERLL was used for model evaluation, it was not used as the objective for training the model.
3.3 Tuning Supervised Models with Information
Regularization
3.3.1 Experimental results
We have shown that entropy regularized log-likelihood is a good model selection metric if models were trained by minimizing cross-entropy. A natural question is whether ERLL is a good model training criterion, i.e. whether it can reduce ERLL while retaining performance in perplexity and state accuracy.
We studied ERLL as a learning objective on two benchmark datasets from the speech and vision domains. The speech dataset is BN-50, used in Section 3.1. The vision dataset is Cifar (https://www.cs.toronto.edu/~kriz/cifar.html). On the speech dataset, we tune kernel models with entropy regularization and compare with deep networks. On the vision dataset, we choose DenseNet as the classification model and tune it with entropy regularization. Our deep acoustic model on BN-50 and DenseNet on Cifar perform close to the state-of-the-art. Thus it would be a significant contribution if entropy regularization could improve them.
We will call models trained by minimizing ERLL "ERLL models", and models trained by minimizing cross-entropy "CE models".
acoustic model  When training the kernel ERLL model, we use the stochastic gradient descent (SGD) optimizer [16], initialized with a large learning rate (more details can be found in Section A.2). We evaluate the ERLL on the held-out dataset after each quarter epoch of training data is used. If the ERLL on the validation data does not decrease after two consecutive evaluations, we reduce the learning rate by 50%. We experimented with other optimization algorithms (RMSProp [29], Adam [63] and AdaGrad [35]), but none of them could match SGD. The learning rate decay scheduling hyper-parameters were selected according to held-out evaluation (see Section A.2).
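A minimal sketch of the learning-rate schedule described above (evaluate the held-out ERLL every quarter epoch and halve the learning rate when it has not decreased for two consecutive evaluations); the training and evaluation callables are placeholders, not the actual experiment code.

```python
def train_with_erll_schedule(train_quarter_epoch, heldout_erll, lr=1.0,
                             num_quarter_epochs=40):
    """Placeholder callables: train_quarter_epoch(lr) updates the model in place,
    heldout_erll() returns the current held-out ERLL."""
    best = float("inf")
    stale = 0                     # consecutive evaluations without improvement
    for _ in range(num_quarter_epochs):
        train_quarter_epoch(lr)   # train on one quarter of the training data
        erll = heldout_erll()
        if erll < best:
            best, stale = erll, 0
        else:
            stale += 1
            if stale >= 2:        # no decrease after two evaluations
                lr *= 0.5         # halve the learning rate
                stale = 0
    return lr

# Toy usage with dummy callables:
final_lr = train_with_erll_schedule(lambda lr: None, lambda: 1.0, lr=0.08)
```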
Table 3.3: Comparison in perplexity (ppx), accuracy (acc: %), and ERLL. Kernel models used 50K dimensional kernel features, and 1000 units in the bottleneck layer.
Model CE ACC ENT ERLL
DNN-CE 1.90 52.7 2.10 3.99
DNN-ERLL 1.96 53.1 1.26 3.22
Kernel-CE 2.06 49.9 1.89 3.95
Kernel-ERLL 2.42 50.2 0.95 3.37
Table 3.4: Comparison of the accuracies of DenseNets with and without entropy regularization.
Model DenseNet-BC DenseNet-BC + β ENT
Reported Err: % 22.27 NA
Implementation Err: % 22.49 22.84
Both the DNN and the kernel acoustic ERLL models outperformed the CE models in classification accuracy and ERLL, as shown in Table 3.3. The DNN trained by optimizing ERLL outperformed all other DNN models and all kernel models.
cifar densenet models The DenseNet models on Cifar-100 are compared only
on classification error. The optimal error reported on Cifar-100 is 22.27%, and in
our implementation, we were able to tune it to 22.49% error, very close to 22.27%.
If DenseNet is trained with ERLL instead of CE, the error increased significantly.
3.3.2 Trade-offs between accuracy and confidence
trade-offs between accuracy and confidence  The empirical studies showed the risk of using ERLL as a model learning objective. Intuitively, minimizing the entropy regularization term encourages models to be more confident, even when they are making mistakes. Quantitatively, if p_φ(y|z) is a bad estimate of p_θ(y|z), then

\int P_\theta(z)\, ENT\big[ P_\phi(y|z) \big]\, dz

is not a good approximation of

\int P_\theta(z)\, ENT\big[ P_\theta(y|z) \big]\, dz,

and minimizing ERLL may learn to underestimate ENT(p_φ(y|z)) to compensate for the loss in the cross-entropy term.
It implies that entropyENT [p
φ
(y|z)] is an informative indicator of over-fitting.
In a good model,p
φ
(y|z) is a good approximation ofp
θ
(y|z), which generalizes well
on both training and testing data. So ifENT [p
φ
(y|z)] is very different on training
data and testing data, the parameters are over-fitted. Figure 3.7a and Figure 3.7b
showed the gap between predictions on training data and testing data. In the
left column, there is an obvious gap between training and testing accuracies, and
training accuracy is close to 100%, no matter model were trained with or without
entropy regularization. The gaps in error indicate over-fitting, but testing label is
unavailablewhentrainingandselectingmodels. Theentropiesintherightcolumns
can be calculated without labels, and the gap in training and testing confidence
can qualitatively tell the over-fitting.
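As a minimal sketch of how this label-free indicator can be monitored, the snippet below computes the average prediction entropy of softmax outputs on any set of inputs; the logits arrays are assumed to come from the trained classifier, and all names are illustrative.

import numpy as np

def mean_prediction_entropy(logits):
    # Average entropy of the softmax predictions; no labels are required.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return entropy.mean()

# A large positive gap suggests over-confidence on the training data:
# gap = mean_prediction_entropy(test_logits) - mean_prediction_entropy(train_logits)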
regularizing entropy toward a reasonable target value Instead of minimizing the entropy ENT[p_φ(y|z)] without limit and encouraging ever more confident predictions, we should minimize ENT[p_φ(y|z)] only until it is smaller than a threshold value, i.e. encourage confident predictions only while the entropy is still larger than a threshold level. However, how confident the model prediction should be is another black box. Instead of an extensive search over the expected confidence, i.e. the target entropy, we can let the heldout and testing datasets teach the model how confident it should be on the training dataset. As shown in the right columns of Figure 3.7a and Figure 3.7b, if the entropy on training data is significantly smaller than the entropy on testing data, the model is becoming over-confident on the training data.

Figure 3.7: Both classification error and prediction entropy have a significant gap between training and testing data. (a) Training and testing error (left) and entropy (right), DenseNet model on Cifar-100, trained by optimizing cross entropy. (b) Training and testing error (left) and entropy (right), DenseNet model on Cifar-100, trained by optimizing ERLL.

To control the gap in confidence level, we proposed to simultaneously minimize the entropy on the training and testing datasets:

\min_{\theta,\phi}\; -\beta\, E_{P_D(x,y)} E_{P_\theta(z|x)} \log P_\phi(y|z) \;+\; (1-\beta)\, E_{P_D(x)} E_{P_\theta(z|x)} \mathrm{ENT}[P_\theta(y|z)]  \qquad (3.6)
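A minimal PyTorch-style sketch of objective (3.6) is given below, assuming a classifier that outputs logits, a labeled training batch, and an unlabeled batch drawn from the union of training and testing inputs. The variable names and the β value are illustrative, not the exact implementation used in this work.

import torch
import torch.nn.functional as F

def erll_objective(model, x_labeled, y_labeled, x_all, beta=0.9):
    # First term of (3.6): cross-entropy over labeled training samples.
    ce = F.cross_entropy(model(x_labeled), y_labeled)
    # Second term of (3.6): prediction entropy over all samples, labeled or not.
    log_p = F.log_softmax(model(x_all), dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1).mean()
    return beta * ce + (1.0 - beta) * entropy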
Table 3.5: Testing datasets make entropy regularization more useful.

Model                    DenseNet-BC   DenseNet-BC + β ENT   DenseNet-BC + β ENT + unlabeled
Reported Err (%)         22.27         NA                    NA
Implementation Err (%)   22.49         22.84                 20.67
The only difference between (3.6) and (3.4) is the subscript of the second term, from E_{P_D(x,y)} to E_{P_D(x)}. The former means averaging over all labeled training samples {X, Y}, while the latter means averaging over all samples {X}, both labeled and unlabeled, from the training and testing datasets respectively. Table 3.5 shows that this simple modification significantly improved classification accuracy and outperformed the state-of-the-art by a large margin. Comparing Figure 3.8 with Figure 3.7a, models trained by minimizing ERLL (3.6) have smaller entropy; comparing Figure 3.8 with Figure 3.7b, models trained by minimizing ERLL (3.6) over both training and testing datasets avoid over-confidence on training data and reduce overfitting.

Figure 3.8: The testing dataset taught the optimizer how confident the model should be.
3.4 Contributions and Discussions
main contributions
1. By comparing deep and kernel models on large scale speech datasets, we discovered the advantage of deep neural networks in reducing perplexity and prediction entropy simultaneously, and identified ERLL as a promising model selection metric in real world acoustic recognition tasks.

2. Using the information bottleneck principle, we explained why ERLL is a good model selection metric.

3. Empirically, we studied the performance of ERLL as a model learning objective. We demonstrated that, in specific applications, by constraining the expected entropy to a reasonable value, we can achieve better classification performance than traditional models learned by minimizing the cross entropy loss.
Figure 3.9: (a) Model with linear bottleneck. (b) Model with information bottleneck.
discussions When tuning the acoustic and vision models, we made some non-trivial observations:

1. Weight decay on model parameters outperformed variational information bottleneck regularization: [3] claimed that the information bottleneck is a good model regularizer and demonstrated it on the simple MNIST classification task. We implemented this regularization on the linear bottleneck layer (algorithm 1) when training the acoustic model on the BN-50 dataset, as shown in Figure 3.9b. We found that the linear bottleneck model (3.9a) without weight decay suffered from serious over-fitting; the variational bottleneck model (3.9b) reduced over-fitting significantly, yet was still slightly worse than the linear bottleneck model (3.9a) with L2-norm weight decay (i.e. the results in Table 3.3). In fact, [3] suffered from a similar problem: it failed to improve competitive models (ResNet [45]) on a large scale dataset (ImageNet [98]). Finding large scale real world applications where information bottleneck constraints improve performance is important for further developing the information bottleneck principle.
(a) On the BN-50 dataset, entropy regularization decreased robustness to adversarial attack. Optimizing ERLL on the mixed dataset could improve the ERLL model's robustness, but it was still worse than the CE model. (b) On the Cifar dataset, entropy regularization slightly decreased robustness to adversarial attack, while entropy regularization on the mixed dataset was most robust on the hard-to-fool samples.
2. Neural networks have been shown to be prone to adversarial attacks, which make the model give a different prediction under very slight perturbations of the input signal [116, 90, 97, 99]. Empirically, we observed a correlation between prediction confidence and adversarial robustness: over-confident models obtained by minimizing ERLL (ERLL models) are more prone to adversarial attack, i.e. less robust than models learned by minimizing cross-entropy (CE models). Tuning the entropy toward reasonable values, e.g. via entropy minimization on both training and testing datasets, achieved better robustness than the ERLL model, and on a subset of samples it is even more robust than the CE model. It would be interesting to design algorithms that use entropy to help learn more robust models [72].
Chapter 4
Information Bottleneck Layer and
Concept Discovery
Besides interpreting the representations of supervised models that predict discrete labels, another direction toward exploring and understanding deep representations is learning to generate data. There are two main-stream approaches to deep unsupervised learning. One is casting the unsupervised learning task into a supervised learning task, by learning to reconstruct the input data or transformed input data. This method is usually implemented with auto-encoders, consisting of an encoder that transforms inputs into hidden representations and a decoder that reconstructs the input from the latent representation [125, 14]. The other approach is fitting the distribution of the dataset. For example, deep belief nets [52] and the deep Boltzmann machine [101] stacked multiple Boltzmann machines to model the data distribution. Later, convolutional and recurrent neural networks, re-parameterization tricks and adversarial losses were introduced to fit more complicated datasets [64, 39, 6]. We refer to the models of the second approach as deep generative models (DGM) and will focus on them in the second half of the dissertation.
Compared to supervised deep features, generative features, due to the lack of feedback from labels, cannot align hidden units with manually defined concepts as supervised models do. It therefore becomes interesting to understand how deep networks, when trained without explicit labels, extract concepts from raw data. In this chapter, we review works on generative representation learning from the probabilistic and information theoretic viewpoints.
4.1 Representation of primitives of variations
4.1.1 Density estimation using latent variable model
density estimation Suppose we have D = {x_1, x_2, ..., x_N}, a finite set of independent and identically distributed samples drawn from some distribution P with an unknown density f. We want to build models that quantify P or f by fitting the empirical distribution of these samples. There are parametric and non-parametric approaches.
In non-parametric learning, we usually call this task density estimation, and a representative method is kernel density estimation. It uses kernels to take weighted local density estimates at each observation x_i, and then aggregates them to yield an overall density:

\hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\Big(\frac{x - x_i}{h}\Big)

where h is the smoothing parameter (bandwidth) that controls the size of the neighborhood around x. This method makes a very simple assumption: new samples are more likely to be observed in regions close to existing observations. It is suitable for observations that are low dimensional and whose features are manually defined and interpretable, e.g. in the signal processing domain. However, it is not suitable for high dimensional data, e.g. images and texts, because the feature space is much larger than the number of observations.
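As a minimal illustration of the estimator above, the following NumPy sketch implements one-dimensional kernel density estimation with a Gaussian kernel; the synthetic data and bandwidth are arbitrary choices for demonstration.

import numpy as np

def kde(x_query, samples, h):
    # f_hat(x) = 1/(N h) * sum_i K((x - x_i) / h) with a standard Gaussian kernel K.
    u = (x_query[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * h)

samples = np.concatenate([np.random.randn(500) - 2.0, np.random.randn(500) + 2.0])
grid = np.linspace(-6.0, 6.0, 200)
density = kde(grid, samples, h=0.3)     # estimated density on a grid of query points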
In parametric learning, we usually make more assumptions about the properties or the generation process of the data, and formulate the distribution of the data as P_θ(x), where θ are the model parameters. We learn θ by fitting the empirical distribution of D. The simplest application is maximum likelihood estimation (MLE) of the distribution parameters,

\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log P_\theta(x_i),

assuming x ∼ P_θ(x). It is also applied to supervised learning as the foundation of probabilistic regression and logistic regression models, i.e.

W^* = \arg\max_W \sum_{i=1}^{N} \log P(y_i \mid W^T x_i),

where P is a Gaussian distribution for the regression model and a logistic distribution for the classification model.
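As a small illustration (with synthetic data, not part of the experiments in this work), the closed-form Gaussian MLE and an MLE-trained logistic classifier can be written as follows.

import numpy as np
from sklearn.linear_model import LogisticRegression

# MLE of a univariate Gaussian: the sample mean and (biased) sample variance
# maximize sum_i log N(x_i | mu, sigma^2).
x = 2.0 * np.random.randn(1000) + 5.0
mu_mle = x.mean()
var_mle = ((x - mu_mle) ** 2).mean()

# Logistic regression fits W by maximizing sum_i log P(y_i | W^T x_i)
# under a logistic (Bernoulli) likelihood.
X = np.random.randn(200, 3)
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(int)
clf = LogisticRegression().fit(X, y)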
Parametric probabilistic models usually require fewer training samples than non-parametric models, because the model's assumptions constrain the distribution of data in feature space. However, they suffer if the assumptions deviate from the real distributions. It is very challenging to use one distribution to characterize all variations in a dataset, because the data may be high dimensional and noisy, and multiple feature dimensions may be correlated in complicated ways.

latent variable models Latent variable models assume there exist lower dimensional, unobserved variables underlying the high dimensional observations. The latent variables follow simpler distributions and are more robust to noise in the observations. For example,

1. Principal component analysis (PCA) and similar methods such as singular value decomposition (SVD) are suitable for continuous data following a Gaussian distribution. PCA attempts to best explain the variance in the data using a few uncorrelated orthogonal basis vectors W, whose columns are called components, and uses the orthogonal transform z = x^T W to convert an observation vector x into a latent vector z with linearly uncorrelated dimensions. By keeping only the components with large variance in the data, it reduces the dimensionality of the data and can filter out noise effectively.
2. Non-negative matrix factorization (NMF) is suitable for observations consisting of non-negative values, e.g. text using word counts as features [33]. NMF attempts to reconstruct the high dimensional observations from a non-negative combination of non-negative basis vectors:

\min_{U,V} \|X - VU\|_F \quad \text{s.t. } U \geq 0,\ V \geq 0,

where X ∈ R_+^{D×N}, V ∈ R_+^{D×C}. NMF has an inherent clustering property, and each column of V is a cluster centroid. When NMF is applied to a text corpus represented with bag-of-words features, V_k is the word frequency vector that explains the k-th cluster. These word frequencies are interpretable by humans and are usually called topics. (A minimal sketch of both decompositions is given after this list.)
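Both decompositions are available in standard libraries; the sketch below (with synthetic data, for illustration only) extracts PCA components from continuous data and NMF topics from non-negative count data.

import numpy as np
from sklearn.decomposition import PCA, NMF

# PCA: project 50-dimensional continuous observations onto 5 uncorrelated components.
X_cont = np.random.randn(300, 50)
z = PCA(n_components=5).fit_transform(X_cont)      # latent representations z

# NMF: factorize non-negative bag-of-words counts into 5 "topics".
X_counts = np.random.poisson(1.0, size=(300, 50)).astype(float)
nmf = NMF(n_components=5, init="nndsvda", max_iter=500)
doc_topic = nmf.fit_transform(X_counts)            # per-document topic weights
topics = nmf.components_                           # per-topic word frequencies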
Figure 4.1: Two representative probabilistic graphical latent variable models for density estimation. (a) Probabilistic Matrix Factorization [102]. (b) Latent Dirichlet Allocation [17].
The parametric formulation P_θ(x) allows easy sampling of new data, and latent variable models allow discovery of the latent structure of a dataset. Probabilistic graphical models (PGM) integrate the advantages of both methods. They allow flexible settings of latent structures and of distributions over random variables. For example,

1. Figure 4.1a is an example of a probabilistic factor-based model [102], widely applied to collaborative filtering [103]. The hidden variables U ∈ R^{D×N} and V ∈ R^{D×M} are latent variables representing N users and M items, and the similarity between U_i and V_j determines R_ij, how much user i may rate item j. The hyperparameters {σ_V, σ_U} allow regularizing the uncertainty in user and item properties, and σ_d allows specifying the magnitude of the observation noise. It can be considered an extension of SVD, and in the next chapter we will use a similar architecture to model a factor-analysis prior.

2. Figure 4.1b is an example of a probabilistic topic model, widely applied to text data. It assumes that each document has a distribution over semantic topics θ_d, and that every word's topic z_w and ID w_w in that document are generated following that distribution. With Dirichlet priors α, β and the hierarchically structured latent variables {θ_d} and {z_w}, it can fit the distribution of words in corpus texts very well.
Probabilistic graphical models mostly work on raw, manually defined features that are easy to measure, e.g. users' ratings of items or counts of words in documents. All prior and conditional distributions and all random variables are manually defined, and thus interpretable. This framework has two major drawbacks. One is lack of efficiency: inference of latent variables usually requires many numerical and iterative calculations, and a modification of the model may change the inference algorithm significantly. The other is lack of expressiveness: manually defined latent variables and their connections with observations may not be able to capture complex latent structure accurately. These problems will be partly solved with variational Auto-Encoder (VAE) models.
Figure 4.2: Neural networks naturally split observations and latent variables into different layers. (a) Building blocks of deep learning: Denoising Auto-Encoder (left) and Restricted Boltzmann Machine (right). (b) Deep belief network.
deep generative models To fit a data distribution, the deep learning approach aims at mapping raw features into latent representations that (1) are good at discovering the latent structure of the data and (2) can be mapped back to observations. Figure 4.2a and Figure 4.2b show classical generative neural networks:
1. Generative building blocks of deep learning in Figure 4.2a:

• The denoising auto-encoder (DAE) [125, 14] is a directed model that learns to reconstruct x from perturbed input samples x̂ = x + ε, through a bottleneck layer:

\arg\min_{w_1, w_2}\; P\big(x \mid f(w_2^T \cdot g(w_1^T \hat{x}))\big),

where f(·), g(·) represent activation functions. The noise in the input signal encourages the bottleneck layer h to filter out the noise and extract only information about the intrinsic properties of x. If g(·) is a linear activation function, the DAE can be considered as PCA with the orthogonality constraints relaxed.

• The Restricted Boltzmann Machine (RBM) [51, 52] is an undirected model that defines a joint distribution over the visible layer x and the hidden layer h,

P(x, h) = \frac{1}{Z} \exp\big(-E(x, h)\big), \quad Z = \sum_{x \in X} \sum_{h \in H} \exp\big(-E(x, h)\big).

E(x, h) = a^T x + b^T h + x^T W h is a parametric function that associates a scalar energy (a measure of compatibility) with each configuration of the variables x, h [70].

Because there are no intra-layer connections in an RBM, block Gibbs sampling of P(x|h) and P(h|x) is straightforward, and the Markov chain P(x|h), P(h|x) converges to p(x, h). The DAE does not define P(x, h) explicitly. However, [14] and [130] demonstrated that the Markov chain x^{(t+1)} ∼ P(x | f(w_2^T h^{(t)})), h^{(t+1)} ∼ P(g(w_1^T x^{(t+1)})) also converges to p_D(x) if it is ergodic.
2. The deep belief network (DBN) in Figure 4.2b stacks an RBM P(h, h_1) on top of a multi-layered neural network NN.

• The RBM module generates samples of the latent variables (h, h_1), and the NN part below the RBM translates the representation h_1 into an observation x. Specifically, block Gibbs sampling of the RBM generates (h, h_1) ∼ P_RBM(h, h_1), and then the h_1 samples are propagated downward to generate x ∼ P_NN(x | h_1).

• The model parameters are trained to maximize the probability of observing the training data, P(x) = ∫_h P_RBM(h, h_1) P_NN(x | h_1).

• In the RBM P(h, x), the relation between the latent representation h and the observation x has to be quantified with a bilinear energy function, which is a strong restriction. In a DBN, the relation between h and x can be arbitrarily complicated because of the expressiveness of the NN.

This model also works for modeling labeled datasets. For example, in Figure 4.2b, a vector of labels is part of the RBM's visible layer. Fixing the label vector and sampling the RBM can generate x samples of the specified class. This is an example demonstrating the flexibility of neural network architectures: one layer z could connect with multiple modalities v_1, v_2, ..., v_K, and information from multiple modalities could interact through the shared layer z.^1
DBN was an early attempt that took advantage of deep networks' expressiveness to generate data samples. With the development of new deep learning techniques, e.g. convolutional and deconvolutional modules [67, 115], batch normalization [57] and new numerical optimization algorithms [35, 128, 63], deep neural networks are now widely used as decoders that map from latent representations to observations [61, 126, 2, 55]. Figure 4.3 shows an example of generating images with specified properties [2].

^1 This method is used by many generative models to manipulate the properties of generated samples.
Figure 4.3: Generating images of specific properties by mapping from attributes to images with a decoder. The attributes in this example are the class c of the object in the image, the view angle v and the transformation θ [2].
4.1.2 Deep generative model and variational Auto-Encoder
We have shown that probabilistic latent variable models, which consist of manually defined conditional distributions, lack expressiveness, while deep generative models like DBN are inefficient at sampling data. The variational autoencoder (VAE) provides a framework that integrates the advantages of the two approaches.
Figure 4.4: Abstract diagrams of probabilistic decoder and encoder models. (a) Latent variable model for density estimation; P_θ(z|x) is the true posterior distribution of the latent variables. (b) The joint distribution p(x, z) modeled as an encoder model; P_φ(z|x) is an approximate posterior.
Figure 4.4a is an abstraction of latent variable models: P_θ(x) = ∫_z P_θ(z) P_θ(x|z). In Bayesian statistics, the generation of a data sample x ∼ P_θ(x) can be decomposed into two stages: (1) generation of a latent variable z ∼ p_θ(z) under the prior distribution, and (2) generation of the observable x ∼ p_θ(x|z). Learning the parameters θ by maximizing the likelihood of all observations, log p_θ(D), requires integrating over the latent variable z and is thus intractable. One widely used solution [16] is introducing a variational posterior distribution q_φ(z|x), as shown in Figure 4.4b, and estimating an approximation to the data likelihood:

\log p_\theta(x) = \int q_\phi(z|x) \log \frac{p_\theta(x,z)\, q_\phi(z|x)}{p_\theta(z|x)\, q_\phi(z|x)}\, dz
= E_{q_\phi(z|x)} \log p_\theta(x|z) + KL[q_\phi(z|x)\,\|\,p_\theta(z|x)] - KL[q_\phi(z|x)\,\|\,p_\theta(z)]  \qquad (4.1)
\geq E_{q_\phi(z|x)} \log p_\theta(x|z) - KL[q_\phi(z|x)\,\|\,p_\theta(z)]  \qquad (4.2)

Equation (4.2) is called the Evidence Lower BOund (ELBO). Intuitively, the first and second terms of the ELBO are interpreted as a reconstruction loss and a regularization on the latent representation z, and this interpretation explains the auto-encoding terminology [23]. The generative parameters θ are learned jointly with φ by maximizing the ELBO:

\max_\theta \max_\phi\; E_{p_D(x)} \Big[ E_{q_\phi(z|x)} \log p_\theta(x|z) - KL[q_\phi(z|x)\,\|\,p_\theta(z)] \Big]  \qquad (4.3)
Suppose {θ*, φ*} is a global optimal solution to (4.3). The optimal φ* should minimize the gap between log p_{θ*}(D) and the ELBO. This implies that

E_{p_D(x)}\, KL[q_{\phi^*}(z|x)\,\|\,p_{\theta^*}(z|x)] = 0,  \qquad (4.4)

i.e. q_{φ*}(z|x) = p_{θ*}(z|x) for all x ∈ D. Thus the term E_{p_D(x)} E_{q_{φ*}(z|x)} log p_{θ*}(x|z), which equals E_{p_D(x)} E_{p_{θ*}(z|x)} log p_{θ*}(x|z), is an unbiased estimate of the log-likelihood.
Usually p_θ(z) is parameterized with a simple, manually defined distribution, e.g. p(z) = N(0, I). In traditional variational inference, the variational distribution q_φ(z|x) is preferred to be conjugate with p_θ(x|z) to enable analytical estimation of the lower bound.
The VAE model [64] introduced the re-parameterization trick to relax this constraint. As shown in Figure 4.5, an encoder is used to predict the sufficient statistics of the distribution p(z|x), and z is sampled as a differentiable function of the sufficient statistics and random noise. Finally, these z samples are used to sample x. (A minimal sampling sketch for both the continuous and discrete cases follows the list below.)
Figure 4.5: Model architecture of the Variational Auto-Encoder [64].
1. In the vanilla VAE model [64], p_φ(z|x) is a Gaussian distribution with diagonal covariance matrix: z ∼ N(μ_φ(x), diag(σ²_φ(x))), where φ are the parameters of the encoder functions μ_φ and σ_φ. Sampling z is re-parameterized as

z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon  \qquad (4.5)

where ε ∼ N(0, I). A z generated with this method is differentiable w.r.t. μ(x), σ(x) and x. If p_θ(z) is a Gaussian prior distribution, the KL divergence term in equation (4.3) can be calculated analytically.
2. If the latent variable is discrete, suppose every node p_φ(z_d|x) follows a categorical distribution, where the probability vector [p(z_d = 0|x), ..., p(z_d = K|x)] is predicted by the encoder parameterized with φ. [77, 58] implemented the Gumbel-Max trick [44] and used a vector as a sample of z_d ∼ p_φ(z_d|x):

z_{d,k} = \frac{\exp\big((\log p_{\phi;k} + g_k)/\tau\big)}{\sum_i \exp\big((\log p_{\phi;i} + g_i)/\tau\big)}  \qquad (4.6)

where p_{φ;k} represents the probability of this latent node being category k, g_k is random Gumbel noise, and τ is a temperature hyperparameter. A z_d generated with this method is also differentiable w.r.t. p_{φ;k}(x) and x. If p_θ(z) is a defined categorical distribution, the KL divergence term in equation (4.3) can also be calculated analytically.
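The sketch below shows both sampling schemes in PyTorch: Gaussian re-parameterization as in (4.5) and Gumbel-Softmax sampling as in (4.6). The encoder outputs (μ, log σ², or category logits) are assumed to be produced elsewhere; the shapes and temperature are illustrative.

import torch
import torch.nn.functional as F

def sample_gaussian(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I), differentiable w.r.t. mu and logvar.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def sample_gumbel_softmax(logits, tau=1.0):
    # Differentiable relaxation of a categorical sample.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

mu, logvar = torch.zeros(8, 4), torch.zeros(8, 4)
z_cont = sample_gaussian(mu, logvar)                          # continuous latent sample
z_disc = sample_gumbel_softmax(torch.zeros(8, 10), tau=0.5)   # relaxed discrete sample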
Neural networks are suitable for fitting the encoder and decoder functions [64], because they are expressive and differentiable. Moreover, because a neural network layer has no intra-layer connections, conditional independence between the dimensions of z in p_φ(z|x) is a reasonable assumption.
Algorithm 2: End-to-end learning of VAE parameters φ, θ.
  initialize θ, φ;
  while θ, φ not converged do
    for each training sample x do
      Encoder (φ): calculate the distribution p_φ(z|x);
      Re-parameterization: sample ẑ according to equation (4.5) or (4.6);
      Decoder (θ): map ẑ to x̂;
      Calculate ℓ = log p_θ(x|ẑ) − KL[q_φ(z|x) || p_θ(z)];
      θ, φ ← minimizer(−ℓ);
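A minimal PyTorch sketch of one such training step for the Gaussian case is shown below. The tiny fully connected encoder/decoder and the Bernoulli observation likelihood are illustrative assumptions, not the architectures evaluated later in this dissertation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # equation (4.5)
        x_logits = self.dec(z)
        # Negative ELBO: reconstruction term plus analytic KL(N(mu, sigma^2) || N(0, I)).
        recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (recon + kl) / x.size(0)

model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)          # stand-in for a batch of (soft-)binarized images
loss = model(x)
opt.zero_grad(); loss.backward(); opt.step()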
With equations (4.3) and (4.4), we have explained that, if the ELBO is optimized accurately, it is an accurate estimate of the log-likelihood log P_θ(D). As a byproduct, the latent variable z following the auxiliary distribution P_φ(z|x) is considered a good representation of x that is applicable to many follow-up tasks.
However, z ∼ P_φ(z|x) is learned to maximize a lower bound of the likelihood log P_θ(D); z is not guaranteed to perform well on a different task: encoding information about the data x. In fact, [23] reported that, if the conditional distribution P_θ(x|z) is sufficiently expressive, the latent code z ∼ P_φ(z|x) ignores information about x. [23] analyzed the ELBO from an information theoretic perspective to explain the inefficiency of the latent z in VAE, and in the last step attributed that inefficiency to imperfect optimization of the θ, φ parameters. In the following section, we follow a similar direction to analyze it, but give a more precise explanation.
4.1.3 Information theoretic interpretation of VAE
minimum description length principle In chapter 2 we explained that, in supervised learning, the information bottleneck principle encourages the latent layer z to compress x while extracting sufficient information content relevant to y. In unsupervised learning, a widely used information theoretic learning objective is minimum description length (MDL) [76], which encourages the model to maximally compress x when communicating the information content of x. [41, 23] derived VAE's ELBO as Bits-Back Coding [53] under the MDL principle [76]. To be self-contained, we rephrase the Bits-Back argument of [53, 121] and its relation to VAE [41, 23] in the next paragraph.
In the context of information theory, given distributions P_θ(z), P_θ(x|z) and P_φ(z|x), to communicate an observation x, the sender creates a message in three steps: (1) sample z ∼ P_φ(z|x), (2) communicate z with encoding precision ε_z, (3) communicate x given z with encoding precision ε_x. According to Shannon's source coding theorem, the total message length is

L = L(z) + L(x|z) = -\int p_\phi(z|x) \log\big[p_\theta(z)\, \epsilon_z^{D_z}\big]\, dz \;-\; \int p_\phi(z|x) \log\big[p_\theta(x|z)\, \epsilon_x^{D_x}\big]\, dz  \qquad (4.7)
where D_z and D_x are the dimensionalities of z and x respectively. Among these bits of information, a subset describes the distribution P_φ(z|x), not x:

L(z;\phi) = -\int p_\phi(z|x) \log\big[p_\phi(z|x)\, \epsilon_z^{D_z}\big]\, dz  \qquad (4.8)

Subtracting part (4.8) from (4.7) gives the description length of the observation x:

L(x) = E_{p_\phi(z|x)} \log\frac{p_\phi(z|x)}{p_\theta(z)} \;-\; E_{p_\phi(z|x)} \log p_\theta(x|z) \;-\; D_x \log \epsilon_x  \qquad (4.9)
inefficiency of VAE at representing the input signal Ignoring the data independent term in equation (4.9), the code length equals the negative ELBO of VAE, and maximizing the ELBO w.r.t. {θ, φ} can be interpreted as designing an efficient coding scheme. [23] showed that the gap between the model's description length L(x) and the optimal code length is^2

E_{p_D(x)} L(x) - H_D(x) \geq E_{p_D(x)} KL\big[p_\phi(z|x)\,\|\,p_\theta(z)\big]  \qquad (4.10)

^2 This gap is the same as the gap between log P_θ(x) and the ELBO.

It declares that, because no method is able to completely close the gap between the approximate posterior P_φ(z|x) and the true posterior P_θ(z|x), the z learned by minimizing the code length L(x), or equivalently maximizing the ELBO, is an inefficient representation. However, this explanation is vague. Indeed, we can formulate the gap in another way to obtain a very clear explanation:
E_{p_D(x)} KL\big[P_\phi(z|x)\,\|\,P_\theta(z)\big] - E_{p_D(x)} E_{p_\phi(z|x)} \log P_\theta(x|z) - H_D(x)
\;=\; E_{P_\phi(z)} KL\big[P_\phi(x|z)\,\|\,P_\theta(x|z)\big] + KL\big[P_\phi(z)\,\|\,P_\theta(z)\big]  \qquad (4.11)
1. If the decoder network P_θ(x|z) is sufficiently expressive, it can approximate any function P_φ(x|z) = P_D(x) P_φ(z|x) / E_{P_D(x)}[p_φ(z|x)].

2. If the prior distribution P_θ(z) is simple, e.g. a white Gaussian distribution, it is easy to find some parameter φ* s.t. E_{P_D(x)}[P_{φ*}(z|x)] = P_θ(z). Then, taking advantage of the decoder model's expressiveness, there exists θ* s.t. P_{θ*}(x|z) = P_{φ*}(x|z).

3. Thus, in order for the parameters θ*, φ* to reduce the gap (4.11) to 0, the only requirement on the encoding distribution is that

E_{P_D(x)}\big[P_\phi(z|x)\big] = P_\theta(z)  \qquad (4.12)

z ∼ P_φ(z|x) need not capture the latent structure of x. An extreme, useless but valid case is P_φ(z|x) = P_θ(z).
In summary, the ELBO of VAE, whether derived from variational inference or from MDL, is vulnerable to learning a useless encoder P_φ(z|x), and the problem becomes more serious when the decoder P_θ(x|z) is more expressive and θ is better optimized. It should be noted, though, that the optimal parameters {φ*, θ*}, and consequently the latent variables z given x, are not necessarily unique^3. While some {φ*, θ*} may simultaneously maximize the ELBO (and minimize MDL) and produce z that captures structure in the datum x, some other {φ̄*, θ̄*} maximize the ELBO (and minimize MDL) but produce a useless z.
learning to represent input signals in VAE A direct solution to the problem of useless z is optimizing an objective that also encourages z to be informative of x. Similar to maximizing I(Z; Y) in supervised learning, [3] and [130] proposed to maximize the mutual information between X and Z in the encoding model:

I_{p_\phi(x,z)} = \int p_D(x)\, p_\phi(z|x) \log \frac{p_\phi(x|z)}{p_D(x)}
= H_D(x) + E_{p_\phi(z)} KL[p_\phi(x|z)\,\|\,p_\theta(x|z)] + E_{p_D(x)} E_{p_\phi(z|x)} \log p_\theta(x|z)
\geq E_{p_D(x)} E_{p_\phi(z|x)} \log p_\theta(x|z)  \qquad (4.13)
Optimizing (4.13) induces P_φ(x|z) = P_θ(x|z). To make P_θ(x,z) = P_D(x) P_φ(z|x), [130] requires P_φ(z) = P_θ(z). The unconstrained objective function with this soft constraint is

\max_{\theta,\phi}\; E_{p_D(x)} E_{p_\phi(z|x)} \log p_\theta(x|z) - \beta\, \mathrm{dist}\big(P_\theta(z), P_\phi(z)\big)  \qquad (4.14)

where the distance between distributions is quantified as the maximum mean discrepancy (MMD) or an adversarial divergence. This model is called infoVAE.
^3 For example, if P_{θ*}(x|z) is a deep neural network, every intermediate hidden layer z_i could be treated as a solution.
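For reference, a minimal sketch of an RBF-kernel MMD estimator between encoder samples z_q and prior samples z_p is given below; this is only one possible choice of dist(·,·) in (4.14), and the bandwidth is an illustrative assumption.

import torch

def rbf_mmd2(z_q, z_p, sigma=1.0):
    # Biased estimate of MMD^2 between two sample sets of shape (n, d).
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2.0 * sigma ** 2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2.0 * k(z_q, z_p).mean()

# Illustrative use inside an infoVAE-style objective:
# z_q = encoder samples for a batch, z_p = torch.randn_like(z_q) from the prior,
# penalty = beta * rbf_mmd2(z_q, z_p).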
While the constraint P_φ(z) ≈ P_θ(z) and the property of the optimal parameters, P_φ(x|z) = P_θ(x|z), look the same as in VAE, the decomposition of the mutual information I_φ(x, z),

I_{p_\phi(x,z)}(x,z) := \int p_\phi(x,z) \log \frac{p_\phi(x,z)}{p_D(x)\, p_\phi(z)}
= E_{p_D(x)} KL[p_\phi(z|x)\,\|\,p_\theta(z)] - KL[p_\phi(z)\,\|\,p_\theta(z)]  \qquad (4.15)

reveals the difference between infoVAE and VAE: VAE minimizes the total description length of z and x|z, but infoVAE minimizes the description length of x|z while maximizing the description length of the latent representation z under some constraints.
The infoVAE model increases the information in z about x. However, maximizing the description length L(z) may harm the task of sampling data, because a large E_{p_D(x)} KL[p_φ(z|x) || p_θ(z)] reduces the log-likelihood of observations, according to equation (4.1). In chapter 5, we will propose and experiment with another approach. We increase the complexity of the prior model, to encourage fewer bits of information to describe x given z and more bits of information to describe z, while keeping the total description length unchanged. We will demonstrate that our model can simultaneously extract an informative representation z and sample high quality data x.
Chapter 5
Unsupervised Learning of Latent
Factor of Variations
Representation learning aims to learn good representations of the data that make the underlying machine learning problems, such as classification and regression, easier to solve. A highly desired property for representation learning is to disentangle the factors of variation in the dataset [11, 12]. For instance, the natural image of a person is the composition of factors such as human identity, pose, illumination, etc. Successful separation of these factors naturally gives rise to some important properties like representation invariance [26]. Given the task of recognizing different persons from images, fundamentally we only need to examine the part of the representation which accounts for the identity factor and can simply ignore the other parts^1. This property is also referred to as discovering concepts from datasets in some works [47, 49, 131], closely related to the interpretability of the representation.
The questions we need to answer are: how to define the disentangling effect for the representation, and how to quantify the disentangling property as (part of) the model learning objective? These questions are especially challenging when building generative models of unlabeled data. Take VAE [64] as an example. In chapter 4, we have shown that, by maximizing the likelihood of the observed dataset or minimizing the model's description length of the data, the learned representation z is not even guaranteed to contain any information about x. Even if we use infoVAE [130] to learn z, z only needs to extract information from x, but does not have to organize the information in an efficient and interpretable manner.

^1 This is the intuition underlying lossyVAE [23].
Intuitively, we think that disentanglement of the factors of variation means decomposing the feature space of the data into a concatenation of multiple feature subspaces. If we could divide the hidden units of z into groups and let each group represent a specific factor, then variation of one group of hidden units would only affect one factor in the data. We can imagine that hidden units belonging to the same group tend to have high correlation, while units corresponding to different factors may not. There are several ways to measure such correlations mathematically, including covariance, mutual information, total correlation, etc. In this chapter, we first review works that learn to discover factors of variation via regularization of the correlation between latent variables. We then provide a new way to discover the correlation between latent variables from an unlabeled dataset. Finally, we reformulate disentanglement as a task of simultaneously learning concept representations and data representations, and empirically demonstrate the advantage of this framework.
5.1 Learning to Disentangle Factors of Variation
Early attempts at disentangling factors [123, 42, 20] mostly rely on bilinear models and component analysis. With the development of deep learning methods, more effort has been turned to the disentangling capability of latent neural representations. For example, various higher-order Boltzmann machines [96, 117, 27, 31, 95] have been proposed to incorporate multiplicative interactions between groups of latent factors. Recently, deep auto-encoders were implemented to explicitly learn representations corresponding to various factors of variation in a dataset [24], by manually splitting the middle layer into multiple blocks and enforcing different blocks to predict different types of labels. Theoretically, [26] gave an analysis of disentangled representation from the perspective of group theory, by leveraging the symmetry property.
These models are limited in the following ways. First, most of them rely on label information: the labels tell the learner the number and types of factors of variation a priori, and the labels cause the unique properties of different subsets of the representation, e.g. Figure 5.1a. But in many cases multiple types of labels are not available. Second, generalizing them to more than two factors is sometimes difficult: for methods based on multiplicative interactions, we need a 4-way tensor to model three factors, which is computationally challenging. The auto-encoder model in [24] can only discover up to two factors by design. Moreover, all these models need to specify the exact block structure, i.e., the number of blocks and the size of each block in the latent representation layer. This limits the power of these models.
β-VAE for discovery of concepts The β-VAE model [47][48] was proposed to discover latent factors of variation by enforcing the latent variables in z to be mutually independent. It encourages mutual independence between latent variables by increasing the weight of the KL divergence term in VAE's ELBO:

ELBO = E_{P_D(x)} E_{P_\phi(z|x)} \log P_\theta(x|z) - \beta\, KL\big[P_\phi(z|x)\,\|\,P_\theta(z)\big]  \qquad (5.1)

where a large value of β > 1 prefers P_φ(z|x) to be close to P_θ(z). [47] declares that, if the units of z are independent in the prior distribution P_θ(z), e.g. when P_θ(z) = N(0, I) as in the vanilla VAE model, the correlation between the units of z ∼ P_φ(z|x) will be weak. Moreover, the decoder P_θ(x|z) will be trained to map z, whose units are weakly correlated, to observations. It was demonstrated that (1) on simple synthetic datasets, the units of the representation z are independent and well correlated with ground-truth factors of variation, and (2) on complex datasets, the decoder can generate samples that qualitatively look like training data, and varying the value of one unit in z affects only one (or a few) visual factors, as in Figures 10-14 of [48].

Figure 5.1: VAE and β-VAE hardcode the structure and prior distribution of the intermediate layer. (a) Early works relied on label information to split factors of variation into subsets of representations, e.g. [24]. (b) The β-VAE treats each latent variable as one factor of variation.
This model is problematic if it is used for learning representations. According to the discussion of description length and the uselessness of z in VAE, β-VAE focuses on mapping samples z ∼ P_θ(z) to samples x ∈ D, at the cost of too short a description length of z and too little information in z about specific x samples. In the extreme case, when β ≫ 1, P_φ(z|x) ≈ P_θ(z) for all x samples, which means the encoder in β-VAE does not retain information from the input x.
Figure 5.2: Qualitative evaluation of the reconstruction and generation capability of β-VAE models. (a) Random samples from the CelebA testing dataset. (b) β = 1. (c) β = 16. (d) β = 64. In each subplot, the top two rows are images decoded from μ_φ(z|x), and the bottom two rows are images decoded from z ∼ P_θ(z).
Figure 5.2 demonstrates the price β-VAE pays for a large β, on the CelebA dataset^2. When β is big, e.g. β = 64 in Figure 5.2d, the images decoded from random samples z from the prior look closest to true face images. However, the reconstruction quality is significantly worse than that of models with smaller β values, implying that P_φ(z|x) is a bad representation of x.
Another flaw of the model is the strong constraint that the units of z are mutually independent. It implies that each unit of z corresponds to one factor of variation. This over-simplifies the structure of the data. Some factors are complicated and cannot be characterized by one unit. Moreover, some factors may have hierarchical relations.

^2 http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
While β-VAE cannot learn a good representation of x in theory, it also has practical problems in empirical implementation. One challenge is setting the dimensionality of z. If dim(z) is large, the reconstruction quality is good but the interpretability of z is bad: there may be too many independent units compared to the actual number of factors of variation, and the decoder parameters are prone to over-fitting. If dim(z) is small, it is easier to pick hidden units z_i that are relevant to one specific factor; however, the reconstruction quality is bad, and z as a representation loses much information about the input x. Another challenge is the lack of a quantitative evaluation method: it is hard to compare two models on which one is better at disentanglement, and there is no reliable metric for model tuning and selection.
5.2 Factors of Variation from Groups of Latent
Representations
β-VAE focuses on learning a decoder that is able to map z consisting of mutually independent units, e.g. z ∼ N(0, I), to data samples of good quality. Later works focused on learning a representation z ∼ P_φ(z|x) that has the disentanglement property. A widely used assumption is that subsets of the latent representation correspond to factors of variation, and different subsets are not correlated. Take a dataset of human faces for example: the factors of variation may include human identity, facial expression, picture brightness, view angle, with or without glasses, etc. The correlation between hidden units within the same group is high, while the correlation between groups is small.
Figure 5.3: The representation of x is split into a subset Y that encodes label information and another subset Z that encodes other variations. Optimizing supervised and adversarial objectives encourages {Y, Z} to be informative of x and encourages Y and Z to be independent. The black, blue and red arrows represent encoding operations, decoding operations and loss functions. (a) Z is independent of Y, thus predicting labels from the representation Z should be inaccurate [28]. (b) Z should include all messages irrelevant to the label and Y should include no message irrelevant to the label (top); Y should include all messages relevant to the label (bottom) [79].
Various techniques have been developed to learn informative and independent subsets of the representation, using discriminative analysis and adversarial losses. Figure 5.3 explains two example models [28, 79]. Both models split the latent representation into two subsets: one subset, Y, is expected to encode label information, and the other subset, Z, encodes the remaining information. In this setting, disentanglement means extracting and separating label information from the other variations in the data.
[28] induces independence via predictability minimization [104]. Z is trained to be bad at predicting labels (by minimizing the accuracy of the optimal classifier that predicts the label from Z), while Y is trained to be good at predicting labels. Moreover, Z and Y together should reconstruct an x̃ that (1) is not distinguishable from a true sample and (2) retains label information. This implies that {Y, Z} is informative of x. The shortcoming of this model is that Y may encode information irrelevant to the labels, and learning it requires labels.
[79] further reduces the dependence between Y and Z. Suppose X_1, X_2 belong to one class, and X_3, X_4 belong to another class. The disentanglement property requires that:

• Y should not encode information that is irrelevant to the labels, and Z should encode all information that is irrelevant to the labels. Thus, Y_1 and Y_2 should encode the same message, and both {Y_1, Z_1} and {Y_2, Z_1} should be able to reconstruct X_1, as shown in the top diagram of Figure 5.3.

• Y should encode all information that is relevant to the labels, and Z should not encode information that is relevant to the labels. Thus, {Y_3, Z_2} should be able to reconstruct a sample of the same class as X_4.
Models [79][28] split the representation into a label-message subset and an other-message subset. They can be extended to multiple subsets easily, at the cost of a quadratically increasing number of discriminative modules and a potentially much harder optimization. Empirically, they demonstrated advantages in generating samples with controlled properties and better performance on classification tasks. However, the good performance on these metrics is largely because they used labels at the training stage. Compared to the auto-encoder of [24], the major contribution is the integration of the latest deep learning techniques, e.g. adversarial learning. They are not able to achieve the goal we are more interested in: discovering the latent structure of the data from x only.
Figure 5.4: Graphical representation of the model with unknown block structure on the prior. α, σ_d, σ_g, σ_C are the hyperparameters. {s_d}, {g_dl}, {C_nl} are the parameters of the prior model. {z_i} is the latent representation, and {x_i} are our observations.
5.2.1 Factorizing the prior distribution of representation
One shortcoming of the disentanglement models [79][28] is that they need to manually split the representation layer into subsets, and specify the factor for each subset, before learning the model parameters. This is impractical when the dataset is unlabeled and the factors are unknown. So the question is: is it possible to learn to split the representation into subsets, such that each subset corresponds to one factor?
Our solution is reformulating the prior distribution P_θ(z) with a variable clustering prior [89] on p(z) ∈ R^D. We call it factor-VAE. This prior acts as another generative process, which models the correlation within each cluster and assigns a cluster membership to each dimension of the latent representation. After incorporating this prior, the new probabilistic model is depicted in Figure 5.4, and the corresponding generative process is as follows (a minimal sampling sketch is given after the list):

1. Draw π ∼ Dir(α), where π ∈ R^L and L is the number of latent factors.
2. For each latent dimension d from 1 to D, where D is the dimension of the latent representation z:
   (a) Draw the dimension membership s_d ∼ Multinomial(π).
3. For each dimension d = 1...D and block l = 1...L:
   (a) Draw the factor loading variable g_dl ∼ N(0, σ_g²).
4. For each sample n = 1...N:
   (a) For each block/cluster l = 1...L:
      i. Draw the latent factor C_nl ∼ N(0, σ_C).
   (b) For each dimension d = 1...D:
      i. Draw the latent representation z_nd ∼ N(g_{d,s_d} C_{n,s_d}, σ_d²).
   (c) Draw each observation x_n ∼ P_θ(x | z_n).
Here, α, σ_g, σ_d, σ_C are hyperparameters that we fix beforehand^3. {C_nl} are the local parameters that depend on z_n, while {g_{d,s_d}} are the global parameters that control the membership of each dimension. Based on this variable clustering prior model, the covariance between z_nd and z_nd' becomes

\mathrm{cov}(z_{nd}, z_{nd'} \mid \{g_{dl}\}, \{s_d\}) =
\begin{cases} \sigma_C^2\, g_{d,s_d}\, g_{d',s_{d'}} + \sigma_d^2\, \delta_{dd'} & \text{if } s_d = s_{d'} \\ 0 & \text{otherwise} \end{cases}  \qquad (5.2)

This explicitly defines the block-diagonal covariance structure and enables the model to automatically learn the structure (size) of each subset of the representation. Besides the change of the prior on p(z), everything else stays the same. As before, we model p_θ(x|z) using a Gaussian distribution whose parameters are generated by a deep decoder, and the variational distribution q_φ(z|x) is also modeled by a Gaussian distribution. The block structure (size) will be learned adaptively from the training data.

^3 We could also learn these parameters with a Bayesian method. But in this work, we treat them as hyperparameters and select the optimal ones using cross validation.

Table 5.1: The summary of the models evaluated in this chapter. The models are identified through the property of the covariance of the random variable z.

             VAE              factor-VAE     labeled-VAE
Prior        Normal-Diagonal  Normal-Factor  Normal-Diagonal
Supervision  Unsupervised     Unsupervised   Partly-Supervised
5.2.2 Experimental results
We assess the effectiveness of our methods both qualitatively and quantitatively. We evaluate our methods on two simple image datasets, MNIST^4 and Multi-PIE^5. In the MNIST dataset, the factors include digit class and writing style. In the Multi-PIE dataset, the factors include person identity, view angle and brightness.

Baseline and Proposed Models Table 5.1 shows a summary of the baseline and proposed models. They are combinations of different types of priors and training data. The first model, VAE, is the baseline from [64], which assumes an N(0, I) prior on p(z). After training, we cluster the units of the layer z into L groups. The second model, factor-VAE, with the probabilistic graphical prior distribution, was explained in section 5.2.1. The third model, labeled-VAE, has the same architecture as VAE, but half of the latent nodes are manually assigned to the factor digit id, and these latent nodes are trained to both reconstruct images and predict the digit id. In the VAE and labeled-VAE models, we only learn the parameters of the posterior distribution P_φ(z|x), while in the factor-VAE model we learn both the prior P_θ(z) and the posterior P_φ(z|x) distribution parameters.
^4 http://yann.lecun.com/exdb/mnist/
^5 http://www.multipie.org/
Generating Samples Qualitatively, we evaluate the visual quality and the disentanglement property. Specifically, we want to know: (1) Is the decoder capable of generating good quality samples from z ∼ P_θ(z)? (2) Can we manipulate one or a subset of the factors of variation by changing a subset of the representation z?
We generate samples following this procedure: we first generate random representation vectors for each subset of the representation z according to the prior distribution, then combine and concatenate the samples from all subsets of the representation, and finally map the representation to x with the decoder P_θ(x|z). In this way, we are able to generate images from combinations of factors. Take the MNIST model for example:
1. Figure 5.5a shows images generated from VAE, and the same applies to labeled-VAE in Figure 5.5c. z^(1), the first subset of z, corresponds to the factor digit id, and z^(2), the second subset of z, corresponds to the factor style. We randomly sampled 10 instances of z^(1) and 20 instances of z^(2). Combining and concatenating them creates 200 instances of z, which are decoded into 200 images. This procedure also applies to Figure 5.5c.

2. Figure 5.5b shows images generated from factor-VAE. It has a 2-dimensional representation of the factors, C, and a latent representation of the images, z. We randomly sampled 10 instances of C_1 and 20 instances of C_2. Combining and concatenating them creates 200 instances of C and then 200 instances of z, which are decoded into 200 images.

Figure 5.5 demonstrates the disentanglement property of the factor-VAE model. In Figure 5.5b, the images in most columns have the same digit class, and different rows have different styles (orientation and thickness). In contrast, in Figure 5.5a, the images in every column and every row show changes of digit id. Even in labeled-VAE, the digit id of the images in each row is not unique, as in Figure 5.5c. This is because, although labeled-VAE encodes all label-relevant information in one subset of z, it cannot prevent the other subset of z from also encoding some label-relevant information.

Figure 5.5: Samples from three generative models of the MNIST dataset: (a) model VAE, (b) model factor-VAE, (c) supervised model. Most of them look like real digits.
In the case of Multi-PIE, generated samples are shown in Figure 5.6. There are three factors of variation, {people identity, brightness, view angle}, so we cannot visualize all factors in a single 2-dimensional grid of plots. Instead, for every model, we present three subplots, each containing 3 × 9 examples. When we generate the i-th subplot to demonstrate the i-th factor of variation, we set the units from the other groups, z^(j≠i), to their mean values in this dataset.
Qualitatively, for each of the VAE and factor-VAE models, in each of the three subplots, the variation of one factor is more obvious than the variations of the other factors. However, in both models, each subset of the representation is not able to completely remove information about the other factors. Visually, the images generated by factor-VAE have more details that characterize the factors of variation. We will quantitatively compare these models in depth through domain adaptation performance.

Figure 5.6: Generated samples on the Multi-PIE dataset: (a) model VAE, (b) model factor-VAE. Each model has 3 subplots. The first factor is identity, the second factor is brightness and the third factor is view angle.
Domain Adaptation We design a domain adaptation experiment to quantitatively evaluate our models in terms of their effectiveness at disentangling different factors. Suppose we have two domains sharing the same input and label space X × Y. The training data come from the source domain. Our goal is to learn a classifier f: X → Y that performs well on the target domain, whose data have different characteristics from those of the source domain. This scenario is quite common in real life. Take face pose classification (Y = {front, left, right}) for instance: a well-trained pose classifier may encounter face images of new persons at the testing stage.
To simulate this domain adaptation setup, we equally and randomly split the 68 people in Multi-PIE into two groups. One is taken as the source domain, for which the pose labels of the face images are given for training a classifier, while the other is the target domain to which we apply the classifier. We use a linear SVM as our classifier and tune its hyper-parameters by 5-fold cross-validation on the source domain. The cross-validation classification accuracy approximates the generalization ability of the classifier. One would expect a similar classification accuracy on the test data if they were drawn i.i.d. from the same distribution underlying the source domain. However, due to the mismatch between the source domain and the target domain, in general there will be a significant performance drop from the cross-validation accuracy to the actual classification accuracy on the target domain.
Our hunch is that if we can effectively disentangle the pose factor from the others for the face data, the performance drop should be smaller when we use the corresponding hidden representations than when we use the other hidden representations. To this end, we compare the classification accuracies on the target domain and the performance drops using different hidden representations.
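A minimal scikit-learn sketch of this evaluation protocol is shown below. The representation arrays are assumed to be exported from the trained encoders (one subset of hidden units per factor), and the hyper-parameter grid is illustrative.

from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def source_target_accuracy(z_source, y_source, z_target, y_target):
    # 5-fold cross-validation accuracy on the source domain (hyper-parameter tuning),
    # then classification accuracy of the selected model on the target domain.
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    search.fit(z_source, y_source)
    return search.best_score_, search.best_estimator_.score(z_target, y_target)

# z_* : hidden units of one factor (e.g. the view subset) for source/target people;
# y_* : the corresponding pose labels. Both are assumed to be precomputed.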
Table 5.2 shows the results of our domain adaptation task. The "S" and "T" columns of each model (VAE, labeled-VAE, factor-VAE) correspond to the 5-fold cross validation accuracy on the source domain and the classification accuracy on the target domain, respectively. We report the results obtained using the hidden units of the view, brightness, and identity factors respectively, as well as all units. Besides the absolute accuracies, we are also interested in the gap in accuracy between the "S" and "T" domains (between columns in Table 5.2), and in the gap in accuracy between a subset of the representation and the complete representation (between rows in Table 5.2).
Take view classification as an example. We can see that the results of the view factor, whether the cross-validation results on the source domain or those on the target domain, are much better than those of the brightness or identity factors. This reinforces that the models have a disentangling effect. In particular, the view factor successfully entails the discriminative information for face pose classification.
Interestingly, the cross-validation results using all the units are much better than those of any single type of factor, especially better than those of the view factor. This implies that while one subset of the representation encodes most information about one factor, some other information about that factor is encoded in other subsets of the representation. It shows that in order to disentangle the view factor from the others, the models have to lose a bit of the discriminative information about the face poses. Nonetheless, note the significant performance drops from "S" to "T" when we use all the hidden units of factor-VAE and labeled-VAE.
Table 5.2: Classification accuracy on both source and target domains for the Multi-PIE dataset. The source and target domains are separated by IDs. Better accuracy in the T column means better domain adaptation performance.

              VAE            factor-VAE     label-VAE
              S      T       S      T       S      T
View Angle    0.724  0.713   0.756  0.734   0.915  0.887
Brightness    0.512  0.491   0.557  0.537   0.512  0.487
Identity      0.550  0.510   0.582  0.574   0.532  0.515
All           0.815  0.804   0.857  0.785   0.899  0.799
5.3 biVAE: Variational Factorization of the Prior Model

The factor-VAE model in section 5.2 attempted to train VAE models with the disentanglement property, by partitioning the representation z into subsets corresponding to different factors of variation. The way of partitioning z into subsets is learned from the data.
A probabilistic graphical prior P_θ(z) is used to learn to partition z into subsets. It uses a latent variable C in the concept space to control the distribution of z. Each unit of C specifies one factor of variation, and the values of all C units determine the distribution of z in the feature space. The parameters of the prior model are learned to correlate subsets of z with each unit of C. The method is explained in Figure 5.7^6.
From the view of minimizing KL[P_φ(z|x) || P_θ(z)], because P_θ(z) = ∫ P_θ(z|C) P_θ(C) dC is more expressive, P_φ(z|x) is less likely to become over-simplified and lose information about x (e.g. the bad reconstructions of β-VAE with large β values, shown in Figure 5.2d). From the view of learning disentangled representations, the distribution of z is conditioned on the configuration C in the concept space, and imposing a disentanglement regularization on C is less likely to harm the informativeness of z.
Figure 5.7: Visual demonstration of the high level ideas in the factor-VAE model. (a) A sample has two representations: a representation C in concept space and a representation z in feature space. VAE only learns z, while factor-VAE learns z and C jointly. (b) Two ways of mapping x into z in representation space: VAE models [64][47] expect z to follow a single high dimensional Gaussian distribution; factor-VAE expects z to follow a mixture of Gaussians, each component corresponding to one configuration of the factors C.

^6 Note that in Figure 5.7b, single or mixtures of Gaussian components are used to explain the high level idea only; it is not a precise visualization of the mathematical model.

However, the prior distribution P_θ(C, z) is very complicated. It requires iterative calculation to infer the value of C given x. Empirically, we found that when it is applied to a more complex dataset, e.g. CelebA or Cifar, it is extremely hard to optimize the model parameters to convergence.
To increase the efficiency and stability of the factor-VAE model, we use the reparameterization trick [64] from VAE and approximate the variational inference of C with a deep neural network. There is one more encoder P_φ(C|X) and one more decoder P_θ(Z|C), so we call this model biVAE. The model structure is illustrated in Figure 5.8, where dashed lines indicate inference of the latent representations and solid lines indicate the generative process^7:
1. Draw c ∼ P_θ(C), where C ∈ R^L and L is the number of latent factors. C could consist of continuous or discrete variables, depending on the properties of the specific dataset. The units of C are mutually independent.
2. Infer P_θ(Z|C), draw z samples, and add noise ε_d. The magnitude of the noise ε_d is a hyper-parameter.
3. Decode and sample x ∼ P_θ(x | z + ε_d).

^7 This generative process is much simpler than that of the VAE with a graphical model prior.

Figure 5.8: The representations of X in concept space and feature space are both inferred with deep neural networks and the re-parameterization trick. The conditional distribution of features given concepts is also modeled with a deep neural network.

To learn the parameters of biVAE, we require the parameters to satisfy three properties: (1) large I_φ(C, Z; X), which means the latent representation is informative of X; (2) small KL[P_φ(c) || P_θ(c)], which means the units of the learned C are mutually independent; (3) small KL[P_φ(z|x) || P_θ(z|c)], which means the feature representation z of x is consistent with the learned concepts.
The objective function of biVAE is:

L_{θ,φ} = E_{P_D(x)} E_{P_φ(z|x)} [ log P_θ(x|z) ] − β_1 KL[ P_φ(c) || P_θ(c) ] − β_2 E_{P_φ(c|x)} KL[ P_φ(z|x) || P_θ(z|c) ]        (5.3)
The expressions of the two KL divergence terms and the training algorithm can be found in the appendix, section B.2.
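As a reading aid, the sketch below shows how the three terms of equation (5.3) could be assembled for one mini-batch in PyTorch, under the Gaussian parameterization of appendix B.2. The module names (enc_c, enc_z, dec_zc, dec_x) are hypothetical, the first KL term is approximated here by the per-sample KL[P_φ(c|x) || P_θ(c)] of equation (B.1), and the reconstruction term assumes pixel values in [0, 1); this is an illustrative sketch, not the exact implementation used in the experiments.

import torch
import torch.nn.functional as F

def bivae_loss(x, enc_c, enc_z, dec_zc, dec_x, beta1=1.0, beta2=1.0, sigma_d=0.1):
    # enc_c(x) -> (mu_c, logvar_c) parameterizes q(c|x)
    # enc_z(x) -> (mu_z, logvar_z) parameterizes q(z|x)
    # dec_zc(c) -> mean of p(z|c), whose std is the fixed hyper-parameter sigma_d
    # dec_x(z) -> logits of p(x|z)
    mu_c, logvar_c = enc_c(x)
    mu_z, logvar_z = enc_z(x)

    # re-parameterization trick for both latent representations
    c = mu_c + torch.exp(0.5 * logvar_c) * torch.randn_like(mu_c)
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)

    # reconstruction term E_{q(z|x)}[log P_theta(x|z)]
    recon = -F.binary_cross_entropy_with_logits(dec_x(z), x, reduction="sum")

    # per-sample KL[q(c|x) || N(0, I)], standing in for the first KL term of (5.3)
    kl_c = -0.5 * torch.sum(1.0 + logvar_c - mu_c.pow(2) - logvar_c.exp())

    # KL[q(z|x) || p(z|c)] between diagonal Gaussians, with p(z|c) = N(f_z(c), sigma_d^2)
    mu_zc = dec_zc(c)
    kl_zc = 0.5 * torch.sum(
        (logvar_z.exp() + (mu_z - mu_zc).pow(2)) / sigma_d ** 2
        - 1.0 - logvar_z + 2.0 * torch.log(torch.tensor(sigma_d))
    )

    # negative lower bound per sample (to be minimized)
    return (-recon + beta1 * kl_c + beta2 * kl_zc) / x.shape[0]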
5.3.1 Experimental results
We experimented with the celebA dataset and evaluated three properties of the biVAE model: (1) informativeness of z about x, (2) the disentanglement property of P_θ(C,Z), and (3) completeness and correctness in concept discovery. We use the celebA dataset because each image is labeled with 40 binary attributes, which makes quantitative evaluation of disentanglement easier. We chose 18 of the 40 attributes that can be predicted from visual properties: Bald, Bangs, Black Hair, Blond Hair, Brown Hair, Bushy Eyebrows, Eyeglasses, Gray Hair, Heavy Makeup, Male, Mouth Slightly Open, Mustache, Pale Skin, Receding Hairline, Smiling, Straight Hair, Wavy Hair, and Wearing Hat; each attribute of an image is labeled either 0 or 1. In the biVAE model for celebA, we used N(0, I) as the prior distribution of the concepts C.
informativeness Figure 5.9a shows that biVAE is very good at reconstructing images. This is because we regularize z with KL[P_φ(z|x) || P_θ(z|c)], where P_θ(z|c) is a sample-specific, high-precision prior distribution. Unlike VAE and β-VAE, this does not force P_φ(z|x) to be close to a single global prior, and therefore does not harm the informativeness of z.
disentanglement Figure 5.10 qualitatively demonstrates the correlation between units of C and factors of variation. We start with one point C̄ in the latent concept space. At each step, we pick one unit of C̄ and sample that unit from N(0, 1) while keeping the other units fixed. Each row of Figure 5.10 shows how the change of one unit of C affects the factors in the face image. Because no labels were used to train the model, the units of C are not perfectly associated with unique factors. However, we can tell that in each row very few of the 18 factors are affected by one unit. For example:
1. In row 1, the gender of the person is obviously changed by one unit of C, and the view angle is slightly changed. Other important factors, e.g. expression, face color, and background, are not affected. The eyes look different, but this difference can be attributed to the change of gender.
2. In row 3, the colors of the face, hair, and background are obviously changed. Other important factors, e.g. expression, gender, and shape of the face, are not affected.
3. In row 4, the view angle and face color are changed. Other important factors, e.g. expression, gender, and hair color, are not affected.
(a) biVAE can reconstruct images very well. (top) original images; (bottom) reconstructed images.
(b) biVAE can generate images that are clearer than those of β-VAE.
Figure 5.9: Reconstructions and randomly sampled images using biVAE.
completeness and correctness Designing quantitative evaluation methods is important for pushing research on the disentanglement property forward. Given a large set of labeled attributes, we propose to test the completeness and correctness of models using classification models. We randomly sample many images from a model to create a dataset D̄ and then apply classifiers to this dataset. If the distribution of predicted labels on D̄ is the same as, or very close to, the distribution of predicted labels on the real dataset D, we consider this model able to capture all factors of variation and the distribution of these factors in the dataset.

Figure 5.10: Latent traversals of biVAE on celebA: each row shows the effect of varying one unit of C while keeping the other units fixed.

Figure 5.11: biVAE outperforms VAE in 17 of 18 attributes.
The experimental procedure is as follows. For each of the 18 binary attributes, we make predictions on D, D̄_VAE, and D̄_biVAE, and obtain three numbers P, P_VAE, and P_biVAE. They represent the percentage of class-1 samples in D, D̄_VAE, and D̄_biVAE, according to the classifier. Figure 5.11 shows the differences |P − P_VAE| and |P − P_biVAE| for each of the 18 attributes. We can tell that, except for one attribute, |P_biVAE − P| is significantly smaller than or similar to |P_VAE − P|. Moreover, the absolute difference |P_biVAE − P| is small, mostly much smaller than 10%. This means biVAE is better than VAE at capturing the distribution of factors in the dataset.
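The completeness and correctness test above reduces to comparing the class-1 rate of a fixed attribute classifier on real versus generated images. A minimal sketch, assuming a pretrained per-attribute classifier with a scikit-learn-style predict method; all names here are illustrative placeholders.

import numpy as np

def class1_rate(clf, images):
    # fraction of samples the attribute classifier labels as class 1
    return float(np.mean(clf.predict(images)))

def completeness_gap(clf, real_images, generated_images):
    # |P - P_model|: a small gap for every attribute suggests the generative
    # model captures both the factor of variation and its distribution
    return abs(class1_rate(clf, real_images) - class1_rate(clf, generated_images))

# usage sketch: one gap per attribute, e.g. for VAE and biVAE samples
# gaps_vae   = [completeness_gap(c, D_real, D_vae)   for c in attribute_classifiers]
# gaps_bivae = [completeness_gap(c, D_real, D_bivae) for c in attribute_classifiers]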
5.4 Contributions and Discussions
contribution In this chapter, we made three major contributions:
1. We reviewed generative models that disentangle factors of variation in a dataset. From an information-theoretic view, we re-interpreted the β-VAE model and pointed out its shortcomings in concept discovery and informative representation learning.
2. We proposed factor-VAE, which has a probabilistic graphical prior distribution that can model the factorization of the representation z. It automatically learns to partition z into subsets of units, each subset correlating with one factor of variation. Qualitatively, on random sample generation, it is capable of controlling a specific factor of variation by varying the corresponding subset of the representation. Quantitatively, on transfer learning, the subsets of z ∼ P_φ(z|x) corresponding to specific factors are robust to variations of other factors.
3. We proposed to formulate unsupervised disentanglement as learning two representations, one encoding latent concepts and another encoding detailed information of the input signals. Along this line, we built biVAE, an end-to-end model. We developed a quantitative method to demonstrate that it is good at disentangling latent factors of variation.
discussions For biVAE, the following directions are worth studying in the future.
1. Experiment with various prior models. One advantage of biVAE over factor-VAE is its flexibility. We can easily configure the prior distribution P_θ(C) without incurring extra inference cost. On the celebA dataset, we used a continuous C. On datasets with discrete intrinsic structure, C can be modeled as a discrete random variable (see the sketch after this list).
For example, if we set C to be a discrete variable following a uniform categorical distribution over 10 classes and apply it to the MNIST dataset, it discovers the ten digit identities as a factor of variation and is able to cluster images of the same identity together, as shown in Figure 5.12.
In the future, it would be interesting to test other prior distributions on more datasets. For example, a collection of binary discrete C units can be considered an ECOC [32] coding of latent categorical information, and a collection of discrete C units distributed over multiple layers can represent hierarchical structure in concept space.
2. It is also interesting to integrate more techniques to upgrade biVAE models. One example is using an adversarial loss to make reconstructed and sampled images look more like real images. Another idea worth trying is to maximize KL[P_φ(c|x) || P_θ(c)] − KL[P_φ(c) || P_θ(c)], while carefully tuning the model to prevent the value of KL[P_φ(c|x) || P_θ(c)] from exploding. This objective looks counterintuitive, since VAE models always minimize KL[P_φ(c|x) || P_θ(c)]. However, according to equation (4.15), KL[P_φ(c|x) || P_θ(c)] prefers c to be informative of the variations in x.
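For the discrete-concept variant mentioned in item 1 (the sketch referenced there), one standard way to keep inference of a categorical C differentiable is a Gumbel-softmax relaxation [58]. The code below is a minimal PyTorch sketch assuming a 10-way uniform categorical prior, as in the MNIST example; it illustrates the idea only and is not the configuration used in the reported experiments.

import torch
import torch.nn.functional as F

def sample_concepts(logits, tau=1.0, hard=False):
    # differentiable (relaxed one-hot) sample of a categorical concept C
    # from the concept-encoder logits of q(c|x)
    return F.gumbel_softmax(logits, tau=tau, hard=hard)

def kl_to_uniform_categorical(logits):
    # KL[q(c|x) || Uniform(K)] for a categorical posterior, summed over the batch
    log_q = F.log_softmax(logits, dim=-1)
    k = torch.tensor(float(logits.shape[-1]))
    return torch.sum(log_q.exp() * (log_q + torch.log(k)))

# usage sketch with a 10-way concept variable, e.g. digit identity on MNIST
logits = torch.randn(4, 10)            # stand-in for concept-encoder outputs
c = sample_concepts(logits, tau=0.5)   # relaxed one-hot concept codes
kl = kl_to_uniform_categorical(logits)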
Figure 5.12: biVAE discovered ten clusters in the MNIST dataset. The left column shows the 10 cluster centers. The right columns show the variation inside each cluster. The variation inside each cluster is caused by the noise ε_d added to z ∼ P_θ(z|c).
Reference List
[1] A. Achille and S. Soatto. Information dropout: Learning optimal represen-
tations through noisy computation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, PP(99):1–1, 2018.
[2] A.Dosovitskiy, J.T.Springenberg, and T.Brox. Learning to generate chairs
with convolutional neural networks. In IEEE International Conference on
Computer Vision and Pattern Recognition (CVPR), 2015.
[3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational
information bottleneck. CoRR, abs/1612.00410, 2016.
[4] M. Ancona, E. Ceolini, A. C. Öztireli, and M. H. Gross. A unified view
of gradient-based attribution methods for deep neural networks. CoRR,
abs/1711.06104, 2017.
[5] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. In ICLR, 2018.
[6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial
networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th
International Conference on Machine Learning, volume 70 of Proceedings of
Machine Learning Research, pages 214–223, International Convention Cen-
tre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[7] L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. "what is
relevant in a text document?": An interpretable machine learning approach.
PLOS ONE, 2017.
[8] L. J. Ba and R. Caurana. Do deep nets really need to be deep? CoRR,
abs/1312.6184, 2013.
[9] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and
W. Samek. On pixel-wise explanations for non-linear classifier decisions by
layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 07 2015.
[10] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection:
Quantifying interpretability of deep visual representations. In Computer
Vision and Pattern Recognition, 2017.
[11] Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in
Machine Learning, 2, 2009.
[12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review
and new perspectives. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 35(8):1798–1828, 2013.
[13] Y. Bengio, A. C. Courville, and P. Vincent. Representation Learning: a
Review and New Perspectives. 35(8):1798–1828, 2013.
[14] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-
encoders as generative models. In Advances in neural information processing
systems, 2013.
[15] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semi-
groups. Springer, 1984.
[16] C. M. Bishop. Pattern Recognition and Machine Learning (Information Sci-
ence and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
[17] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. the Journal of
machine Learning research, 3:993–1022, 2003.
[18] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal,
L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and
K. Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316,
2016.
[19] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large Scale
Kernel Machines. MIT Press, Cambridge, MA., 2007.
[20] C. F. Cadieu and B. A. Olshausen. Learning intermediate-level repre-
sentations of form and motion from natural movies. Neural computation,
24(4):827–866, 2012.
[21] D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading wikipedia to answer
open-domain questions. CoRR, abs/1704.00051, 2017.
[22] D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading wikipedia to answer
open-domain questions. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics, ACL 2017, Vancouver, Canada,
July 30 - August 4, Volume 1: Long Papers, pages 1870–1879, 2017.
[23] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schul-
man, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. CoRR,
abs/1611.02731, 2016.
[24] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen. Discovering
hiddenfactorsofvariationindeepnetworks. arXiv preprint arXiv:1412.6583,
2014.
[25] K. L. Clarkson. Coresets, Sparse Greedy Approximation, and the Frank-
Wolfe Algorithm. ACM Trans. Algorithms, 6(4):63:1–63:30, 2010.
[26] T. Cohen and M. Welling. Learning the irreducible representations of com-
mutative lie groups. arXiv preprint arXiv:1402.4437, 2014.
[27] A. C. Courville, J. Bergstra, and Y. Bengio. A spike and slab restricted
boltzmann machine. In International Conference on Artificial Intelligence
and Statistics, pages 233–241, 2011.
[28] A. Creswell, A. A. Bharath, and B. Sengupta. Conditional autoencoders
with adversarial information factorization. CoRR, abs/1711.05175, 2017.
[29] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. Rmsprop and
equilibrated adaptive learning rates for non-convex optimization. CoRR,
abs/1502.04390, 2015.
[30] D. DeCoste and B. Schölkopf. Training Invariant Support Vector Machines.
Mach. Learn., 46:161–190, 2002.
[31] G.Desjardins, A.Courville, andY.Bengio. Disentanglingfactorsofvariation
via generative entangling. arXiv preprint arXiv:1210.5474, 2012.
[32] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via
error-correcting output codes. J. Artif. Intell. Res., 2:263–286, 1995.
[33] C.H.Q.DingandX.He. Ontheequivalenceofnonnegativematrixfactoriza-
tion and spectral clustering. In Proceedings of the 2005 SIAM International
Conference on Data Mining, SDM 2005, Newport Beach, CA, USA, April
21-23, 2005, pages 606–610, 2005.
[34] Y. Ding, Y. Liu, H. Luan, and M. Sun. Visualizing and understanding neural
machine translation. In ACL, 2017.
[35] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for
online learning and stochastic optimization. Journal of Machine Learning
Research, 12:2121–2159, 2011.
[36] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New
York, 2 edition, 2001.
[37] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks.
In G. Gordon, D. Dunson, and M. Dudík, editors, Proceedings of the Four-
teenth International Conference on Artificial Intelligence and Statistics, vol-
ume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort
Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
[38] Y. Goldberg and O. Levy. word2vec explained: deriving mikolov et al.’s
negative-sampling word-embedding method. CoRR, abs/1402.3722, 2014.
[39] I.Goodfellow,J.Pouget-Abadie,M.Mirza,B.Xu,D.Warde-Farley,S.Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 27, pages 2672–2680.
Curran Associates, Inc., 2014.
[40] A.Graves, M.Liwicki, S.Fernandez, R.Bertolami, H.Bunke, andJ.Schmid-
huber. A novel connectionist system for unconstrained handwriting recogni-
tion. IEEE Trans. Pattern Anal. Mach. Intell., 31(5):855–868, 2009.
[41] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep
autoregressive networks. In E. P. Xing and T. Jebara, editors, Proceedings
of the 31st International Conference on Machine Learning, volume 32 of
Proceedings of Machine Learning Research, pages 1242–1250, Bejing, China,
22–24 Jun 2014. PMLR.
[42] D. B. Grimes and R. P. Rao. Bilinear sparse coding for invariant vision.
Neural computation, 17(1):47–73, 2005.
[43] R. Hamid, Y. Xiao, A. Gittens, and D. DeCoste. Compact Random Feature
Maps. pages 19 – 27, 2014.
[44] T. Hazan and T. S. Jaakkola. On the partition function and random max-
imum a-posteriori perturbations. In Proceedings of the 29th International
Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK,
June 26 - July 1, 2012, 2012.
[45] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image
recognition. CoRR, abs/1512.03385, 2015.
[46] K.M.Hermann, T.Kociský, E.Grefenstette, L.Espeholt, W.Kay, M.Suley-
man, and P. Blunsom. Teaching machines to read and comprehend. In
Advances in Neural Information Processing Systems 28: Annual Conference
on Neural Information Processing Systems 2015, December 7-12, 2015, Mon-
treal, Quebec, Canada, pages 1693–1701, 2015.
[47] I. Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed,
andA.Lerchner. Earlyvisualconceptlearningwithunsuperviseddeeplearn-
ing. CoRR, abs/1606.05579, 2016.
[48] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[49] I. Higgins, N. Sonnerat, L. Matthey, A. Pal, C. Burgess, M. Botvinick,
D. Hassabis, and A. Lerchner. SCAN: learning abstract hierarchical compo-
sitional visual concepts. CoRR, abs/1707.03389, 2017.
[50] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,
V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep Neural
Networks for Acoustic Modeling in Speech Recognition: The Shared Views
of Four Research Groups. Signal Processing Magazine, IEEE, 29(6):82–97,
2012.
[51] G.E.Hinton. TrainingProductsofExpertsbyMinimizingContrastiveDiver-
gence. Neural computation, 14(8):1771–1800, 2002.
[52] G. E. Hinton, S. Osindero, and Y.-W. Teh. A Fast Learning Algorithm for
Deep Belief Nets. Neual Comp., 18(7):1527–1554, 2006.
[53] G. E. Hinton and D. van Camp. Keeping the neural networks simple by
minimizing the description length of the weights. In Proceedings of the Sixth
Annual ACM Conference on Computational Learning Theory, COLT 1993,
Santa Cruz, CA, USA, July 26-28, 1993., pages 5–13, 1993.
[54] S. Hochreiter and J. Schmidhuber. Long short-term memory. 9:1735–80, 12
1997.
[55] S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semi-
supervised semantic segmentation. CoRR, abs/1506.04924, 2015.
[56] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional
networks. CoRR, abs/1608.06993, 2016.
[57] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. pages 448–456, 2015.
[58] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-
softmax. CoRR, abs/1611.01144, 2016.
[59] N. Jindal and B. Liu. Review spam detection. In Proceedings of the 16th
International Conference on World Wide Web, WWW ’07, pages 1189–1190,
New York, NY, USA, 2007. ACM.
[60] P. Kar and H. Karnick. Random Feature Maps for Dot Product Kernels.
2012.
[61] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating
image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676,
Apr. 2017.
[62] Y. Kim. Convolutional neural networks for sentence classification. CoRR,
abs/1408.5882, 2014.
[63] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.
CoRR, abs/1412.6980, 2014.
[64] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv
preprint arXiv:1312.6114, 2013.
[65] B. Kingsbury. Lattice-based Optimization of Sequence Classification Crite-
ria for Neural-network Acoustic Modeling. In Acoustics, Speech and Signal
Processing, 2009. ICASSP 2009. IEEE International Conference on, pages
3761–3764. IEEE, 2009.
[66] B. Kingsbury, J. Cui, X. Cui, M. J. F. Gales, K. Knill, J. Mamou, L. Mangu,
D. Nolden, M. Picheny, B. Ramabhadran, R. Schlüter, A. Sethy, and P. C.
Woodland. A High-performance Cantonese Keyword Search System. pages
8277–8281, 2013.
[67] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with
deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and
K. Weinberger, editors, Advances in Neural Information Processing Systems
25, pages 1097–1105. Curran Associates, Inc., 2012.
[68] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska,
I. Gulrajani, and R. Socher. Ask me anything: Dynamic memory networks
for natural language processing. CoRR, abs/1506.07285, 2015.
[69] Q. V. Le, T. Sarlós, and A. J. Smola. Fastfood: Approximating Kernel
Expansions in Loglinear Time. 2013.
[70] Y. LeCun, S. Chopra, R. Hadsell, F. J. Huang, and et al. A tutorial
on energy-based learning. In PREDICTING STRUCTURED DATA. MIT
Press, 2006.
[71] J. Li, X. Chen, E. H. Hovy, and D. Jurafsky. Visualizing and understanding
neural models in nlp. In HLT-NAACL, 2016.
[72] B. Liang, H. Li, M. Su, X. Li, W. Shi, and X. Wang. Detecting adver-
sarial examples in deep networks with adaptive noise reduction. CoRR,
abs/1705.08378, 2017.
[73] Z. Lu, D. Guo, A. B. Garakani, K. Liu, A. May, A. Bellet, L. Fan, M. Collins,
B. Kingsbury, M. Picheny, and F. Sha. A comparison between deep neural
nets and kernel acoustic models for speech recognition. In 2016 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing, ICASSP
2016, Shanghai, China, March 20-25, 2016, pages 5070–5074, 2016.
[74] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins,
B. Kingsbury, M. Picheny, and F. Sha. How to Scale Up Kernel Methods
to Be As Good As Deep Neural Nets, 2014. http://arxiv.org/abs/1411.
4000.
[75] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-
based neural machine translation. In Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing, pages 1412–1421.
Association for Computational Linguistics, 2015.
[76] D. J. C. MacKay. Information theory, inference, and learning algorithms.
2003.
[77] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A
continuous relaxation of discrete random variables. CoRR, abs/1611.00712,
2016.
[78] B. M. Marlin and K. P. Murphy. Sparse gaussian graphical models with
unknown block structure. In Proceedings of the 26th Annual International
Conference on Machine Learning, pages 705–712. ACM, 2009.
[79] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and
Y. LeCun. Disentangling factors of variation in deep representation using
adversarial training. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon,
and R. Garnett, editors, Advances in Neural Information Processing Systems
29, pages 5040–5048. Curran Associates, Inc., 2016.
[80] A. May, A. B. Garakani, Z. Lu, D. Guo, K. Liu, A. Bellet, L. Fan, M. Collins,
D. J. Hsu, B. Kingsbury, M. Picheny, and F. Sha. Kernel approximation
methods for speech recognition. CoRR, abs/1701.03577, 2017.
[81] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie,
T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg,
A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: A view
from the trenches. In Proceedings of the 19th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD’13, pages1222–
1230, New York, NY, USA, 2013. ACM.
[82] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. CoRR, abs/1301.3781, 2013.
[83] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In C. J. C.
Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, edi-
tors, Advances in Neural Information Processing Systems 26, pages 3111–
3119. Curran Associates, Inc., 2013.
[84] T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA,
1 edition, 1997.
[85] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of
visual attention. In Advances in Neural Information Processing Systems 27:
Annual Conference on Neural Information Processing Systems 2014, Decem-
ber 8-13 2014, Montreal, Quebec, Canada, pages 2204–2212, 2014.
[86] A.-r. Mohamed, G. Dahl, , and G. Hinton. Acoustic Modeling Using Deep
Belief Networks. IEEE Transactions on Audio, Speech, and Language Pro-
cessing, 20(1):14–22, 2012.
[87] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltz-
mann machines. In Proceedings of the 27th International Conference on
International Conference on Machine Learning, ICML’10, pages 807–814,
USA, 2010. Omnipress.
[88] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and
A. Mordvintsev. The building blocks of interpretability. Distill, 2018.
https://distill.pub/2018/building-blocks.
[89] K. Palla, Z. Ghahramani, and D. A. Knowles. A nonparametric variable
clustering model. In Advances in Neural Information Processing Systems,
pages 2987–2995, 2012.
[90] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and
A. Swami. The limitations of deep learning in adversarial settings. In IEEE
European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken,
Germany, March 21-24, 2016, pages 372–387, 2016.
[91] J. C. Platt. Fast Training of Support Vector Machines using Sequential
Minimal Optimization. In Advances in Kernel Methods - Support Vector
Learning. MIT Press, 1998.
[92] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184, 2007.
[93] A. Rahimi and B. Recht. Weighted Sums of Random Kitchen Sinks: Replac-
ing Minimization with Randomization in Learning. In Advances in Neural
Information Processing Systems 21, pages 1313–1320, 2009.
[94] A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn,
B. Hedayatnia, M. Cheng, A. Nagar, E. King, K. Bland, A. Wartick, Y. Pan,
H. Song, S. Jayadevan, G. Hwang, and A. Pettigrue. Conversational AI: the
science behind the alexa prize. CoRR, abs/1801.03604, 2018.
[95] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of
variation with manifold interaction. In Proceedings of the 31st International
Conference on Machine Learning (ICML-14), pages 1431–1439, 2014.
[96] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling
factors of variation for facial expression recognition. In Computer Vision–
ECCV 2012, pages 808–822. Springer, 2012.
[97] L. Robinson and B. Graham. Confusing deep convolution networks by rela-
belling. CoRR, abs/1510.06925, 2015.
[98] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet
large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–
252, Dec. 2015.
[99] S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet. Adversarial manipulation of
deep representations. CoRR, abs/1511.05122, 2015.
[100] T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and
A.-r. Mohamed. Making Deep Belief Networks Effective for Large Vocabu-
lary Continuous Speech Recognition. In Automatic Speech Recognition and
Understanding (ASRU), 2011 IEEE Workshop on, pages 30–35. IEEE, 2011.
[101] R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In D. van Dyk
and M. Welling, editors, Proceedings of the Twelth International Conference
on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine
Learning Research, pages 448–455, Hilton Clearwater Beach Resort, Clear-
water Beach, Florida USA, 16–18 Apr 2009. PMLR.
[102] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Pro-
ceedings of the 20th International Conference on Neural Information Pro-
cessing Systems, NIPS’07, pages 1257–1264, USA, 2007. Curran Associates
Inc.
[103] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen. The adaptive
web. chapter Collaborative Filtering Recommender Systems, pages 291–324.
Springer-Verlag, Berlin, Heidelberg, 2007.
[104] J. Schmidhuber. Learning factorial codes by predictability minimization.
Neural Computation, 4(6):863–879, 1992.
[105] B. Schölkopf and A. Smola. Learning with kernels. MIT Press, 2002.
[106] F. Seide, G. Li, X. Chen, and D. Yu. Feature Engineering in Context-
dependent Deep Neural Networks for Conversational Speech Transcription.
In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE
Workshop on, pages 24–29, 2011.
[107] F. Seide, G. Li, and D. Yu. Conversational Speech Transcription Using
Context-Dependent Deep Neural Networks. In Proc. of Interspeech, pages
437–440, 2011.
[108] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural net-
works via information. CoRR, abs/1703.00810, 2017.
[109] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche,
J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Diele-
man, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap,
M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the
game of go with deep neural networks and tree search. Nature, 529:484–503,
2016.
[110] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[111] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-
supervised recursive autoencoders for predicting sentiment distributions. In
Proceedings of the 2011 Conference on Empirical Methods in Natural Lan-
guage Processing, EMNLP.
[112] S. Sonnenburg and V. Franc. COFFIN: A Computational Framework for
Linear SVMs. pages 999–1006, Haifa, Israel, 2010.
[113] S. Sukhbaatar, a. szlam, J. Weston, and R. Fergus. End-to-end memory
networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and
R. Garnett, editors, Advances in Neural Information Processing Systems 28,
pages 2440–2448. Curran Associates, Inc., 2015.
[114] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with
neural networks. In Proceedings of the 27th International Conference on
Neural Information Processing Systems - Volume 2, NIPS’14, pages 3104–
3112, Cambridge, MA, USA, 2014. MIT Press.
[115] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR,
abs/1409.4842, 2014.
[116] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Good-
fellow, and R. Fergus. Intriguing properties of neural networks. CoRR,
abs/1312.6199, 2013.
[117] Y. Tang, R. Salakhutdinov, and G. Hinton. Tensor analyzers. In Proceedings
of The 30th International Conference on Machine Learning, pages 163–171,
2013.
[118] N. Tishby, F. C. N. Pereira, and W. Bialek. The information bottleneck
method. CoRR, physics/0004057, 2000.
[119] N. Tishby and N. Zaslavsky. Deep learning and the information bottle-
neck principle. In 2015 IEEE Information Theory Workshop, ITW 2015,
Jerusalem, Israel, April 26 - May 1, 2015, pages 1–5, 2015.
[120] I. W. Tsang, J. T. Kwok, and P. Cheung. Core Vector Machines: Fast SVM
Training on Very Large Data Sets. Journal of Machine Learning Research,
6:363–392, 2005.
[121] H.Valpola. Bayesian Ensemble Learning for Nonlinear Factor Analysis. Acta
polytechnica Scandinavica. Finnish Academies of Technology, 2000.
[122] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of
Machine Learning Research, 9:2579–2605, 2008.
[123] M. A. O. Vasilescu and D. Terzopoulos. Multilinear independent components
analysis. In Computer Vision and Pattern Recognition, 2005. CVPR 2005.
IEEE Computer Society Conference on, volume 1, pages 547–553. IEEE,
2005.
[124] A. Vedaldi and A. Zisserman. Efficient Additive Kernels via Explicit Feature
Maps. IEEE Trans. on Pattern Anal. & Mach. Intell., 34(3):480–492, 2012.
[125] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and
composing robust features with denoising autoencoders. pages 1096–1103,
2008.
[126] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural
image caption generator. 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 3156–3164, 2015.
[127] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel,
and Y. Bengio. Show, attend and tell: Neural image caption generation with
visual attention. In F. Bach and D. Blei, editors, Proceedings of the 32nd
International Conference on Machine Learning, volume 37 of Proceedings of
Machine Learning Research, pages 2048–2057, Lille, France, 07–09 Jul 2015.
PMLR.
[128] M. D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR,
abs/1212.5701, 2012.
[129] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks
for text classification. In Proceedings of the 28th International Conference on
Neural Information Processing Systems - Volume 1, NIPS’15, pages 649–657,
Cambridge, MA, USA, 2015. MIT Press.
[130] S. Zhao, J. Song, and S. Ermon. Infovae: Information maximizing variational
autoencoders. CoRR, abs/1706.02262, 2017.
[131] F. Zhou, B. Wu, and Z. Li. Deep meta-learning: Learning to learn in the
concept space. CoRR, abs/1802.03596, 2018.
Appendix A
Derivations and Experiments for
Supervised Models
A.1 Kernels and random features approximation
Given a pair of data points x and z, a positive definite kernel function k(·,·): R^d × R^d → R defines an inner product between the images of the two data points under a (nonlinear) mapping φ(·): R^d → R^M,

k(x, z) = φ(x)^T φ(z)        (A.1)

where the dimensionality M of the resulting mapping φ(x) can be infinite.

Kernel methods avoid inference in R^M. Instead, they rely on the kernel matrix over the training samples. When M is far greater than N, the number of training samples, this trick provides a nice computational advantage. However, when N is exceedingly large, this quadratic complexity in N becomes impractical.
[92] leverage a classical result in harmonic analysis and provide a fast way to approximate k(·,·) with finite-dimensional features:

Theorem 1. (Bochner's theorem, adapted from [92]) A continuous kernel k(x, z) = k(x − z) is positive definite if and only if k(δ) is the Fourier transform of a non-negative measure.
More specifically, for shift-invariant kernels such as Gaussian RBF and Laplacian kernels,

k_rbf = e^{−||x−z||_2^2 / (2σ^2)},   k_lap = e^{−||x−z||_1 / σ}        (A.2)

the theorem implies that the kernel function can be expanded with a harmonic basis, namely

k(x − z) = ∫_{R^d} p(ω) e^{jω^T (x−z)} dω = E_ω[ e^{jω^T x} e^{−jω^T z} ]        (A.3)

where p(ω) is the density of a d-dimensional probability distribution. The expectation is computed on complex-valued functions of x and z. For real-valued kernel functions, however, they can be simplified to cosine and sine functions, see below.
For Gaussian RBF and Laplacian kernels, the corresponding densities are Gaussian and Cauchy distributions:

p_rbf(ω) = N(0, (1/σ) I),   p_lap(ω) = ∏_d 1 / (π(1 + σ^2 ω_d^2))        (A.4)
This motivates a sampling-based approach to approximating the kernel function. Concretely, we draw {ω_1, ω_2, ..., ω_D} from the distribution p(ω) and use the sample mean to approximate

k(x, z) ≈ (1/D) Σ_{i=1}^{D} φ_i(x) φ_i(z) = φ̂(x)^T φ̂(z)        (A.5)

The random feature vector φ̂ is thus composed of scaled cosines of random projections

φ̂_i(x) = √(2/D) cos(ω_i^T x + b_i)        (A.6)

where b_i is a random variable, uniformly sampled from [0, 2π].
A key advantage of using approximate features over standard kernel methods is scalability to large datasets. Learning with a representation φ̂(·) ∈ R^D is relatively efficient provided that D is far less than the number of training samples. For example, in our experiments (section 3.1), we have 7 million to 16 million training samples, while D ≈ 25,000 often leads to good performance.
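A minimal NumPy sketch of the random feature map in equations (A.5)–(A.6) for the Gaussian RBF kernel, drawing ω from a Gaussian with standard deviation 1/σ (consistent with the RBF kernel in (A.2)); the sizes, bandwidth, and sanity check are illustrative only.

import numpy as np

def rff_map(X, D, sigma, rng):
    # random Fourier features approximating the Gaussian RBF kernel:
    # returns an (n, D) matrix phi such that phi @ phi.T ~= K_rbf(X, X)
    d = X.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(d, D))  # omega drawn from p_rbf
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)           # b ~ Uniform[0, 2*pi]
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
phi = rff_map(X, D=2000, sigma=1.0, rng=rng)

# sanity check against the exact kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / 2.0)
K_approx = phi @ phi.T   # close to K_exact for large D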
A.2 Experiment details
Optimizing acoustic model parameters
1. Weusestochasticgradientdescent(SGD)algorithm[16]withadaptivelearn-
ing rate to optimize model.
2. The initial learning rate is set as large as possible. If learning rate is too
large, gradient and parameters will explode to infinite numbers. If learning
rate is too small, model may get stuck at local optima. In practice, we
try multiple learning rates, and choose the largest one that do not cause
parameter explosion.
3. The learning rate is decreased adaptively during training process. Every half
epoch, we evaluate on heldout dataset, and check if ERLL on heldout dataset
decreases. If ERLL does not decrease after m times of evaluation, decrease
learning rate to lrate←lrate×β. m,β are optimization hyperparameters.
And we found m = 2,β = 0.5 usually perform well.
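The adaptive decay rule in item 3 amounts to a decay-on-plateau schedule. The sketch below is illustrative only; train_half_epoch and eval_erll are placeholder callbacks standing in for the actual SGD updates and the held-out ERLL evaluation.

def train_with_plateau_decay(train_half_epoch, eval_erll, lrate, m=2, beta=0.5, n_evals=40):
    # every half epoch, evaluate held-out ERLL and multiply the learning rate
    # by beta after m consecutive evaluations without improvement
    best = float("inf")
    bad = 0
    for _ in range(n_evals):
        train_half_epoch(lrate)
        erll = eval_erll()
        if erll < best:
            best, bad = erll, 0
        else:
            bad += 1
            if bad >= m:
                lrate *= beta     # lrate <- lrate * beta
                bad = 0
    return lrate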
Optimizing densenet parameters
1. We used the stochastic gradient descent (SGD) algorithm [16] to optimize the densenet. We train the model for 350 epochs. In epochs 1–150, the learning rate is 0.1; in epochs 151–225, it is 0.01; in epochs 226–300, it is 0.001; and in epochs 301–350, it is 1e-4.
2. When training with a mixture of training and testing data, we randomly mix 50000 training samples and 10000 testing samples. In every minibatch, the samples from the training set contribute to both the cross entropy and the entropy term, and the samples from the testing set contribute only to the entropy term (a sketch of this mixed-batch loss follows this list).
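A minimal PyTorch sketch of the mixed-batch objective in item 2: labeled training samples contribute both cross entropy and the entropy term, while unlabeled test samples contribute only the entropy term. The weight alpha on the entropy term is an assumed placeholder (its sign and value are the regularization choices discussed in chapter 3).

import torch
import torch.nn.functional as F

def mixed_batch_loss(model, x_train, y_train, x_test, alpha=0.1):
    # labeled samples: cross entropy + entropy; unlabeled test samples: entropy only
    logits_tr = model(x_train)
    logits_te = model(x_test)

    ce = F.cross_entropy(logits_tr, y_train)

    def mean_entropy(logits):
        # average predictive entropy of the softmax outputs
        log_p = F.log_softmax(logits, dim=-1)
        return -(log_p.exp() * log_p).sum(dim=-1).mean()

    ent = mean_entropy(torch.cat([logits_tr, logits_te], dim=0))
    return ce + alpha * ent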
Adversarial attack We run optimization-based attacks [116, 90, 97, 99] on classification models after the models' parameters have been learned. These attacks directly run an ADAM optimizer [63] on image pixels to find a minimal perturbation that changes the model's prediction. Figure A.1 gives one adversarial attack example. In this example, the average l1-norm of the perturbation is 0.005472 per pixel, while the pixel scale is [0, 1).
Figure A.1: By perturbing the image with a small amount of noise, the model misclassifies a human as a beaver. (Left) original image; (Middle) perturbation noise; (Right) perturbed image, recognized by the classifier as a beaver.
Algorithm 3: Generating an adversarial perturbation such that the classifier recognizes the perturbed image as the target class.
input: image x, target ȳ
input: classifier P(y|x)
initialize ε_x
while argmax_y P(y | x + ε_x) ≠ ȳ do
    L = cross-entropy(P(y | x + ε_x), ȳ)
    ε_x ← minimizer(L; ε_x)
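A minimal PyTorch sketch of the targeted, optimization-based attack in Algorithm 3, running Adam directly on the perturbation; the step count, learning rate, and l1 penalty weight are illustrative assumptions rather than the exact settings used for Figure A.1.

import torch
import torch.nn.functional as F

def targeted_attack(model, x, target, steps=500, lr=0.01, l1_weight=0.01):
    # find a small perturbation eps such that argmax model(x + eps) == target;
    # x is a single image tensor with pixel values in [0, 1)
    eps = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([eps], lr=lr)
    target_t = torch.tensor([target])

    for _ in range(steps):
        x_adv = (x + eps).clamp(0.0, 1.0)          # keep pixels in the valid range
        logits = model(x_adv.unsqueeze(0))
        # cross entropy towards the target class plus an l1 penalty on the perturbation
        loss = F.cross_entropy(logits, target_t) + l1_weight * eps.abs().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if logits.argmax(dim=1).item() == target:  # prediction flipped to the target
            break
    return (x + eps.detach()).clamp(0.0, 1.0), eps.detach()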
Appendix B
Derivations and Experiments for
Unsupervised Models
terminology
• feature space: the set of all z, where z is the intermediate layer of the neural network. In a VAE, z is the bottleneck layer between the encoder and decoder.
• concept space: the set of all c, where c encodes abstract information about the data. For example, each unit of c describes one factor of variation in the dataset. Taking Multi-PIE in section 5.2.2 as an example, c could be a 3-dimensional vector, with each dimension recording information about identity, view angle, or brightness.
• D: dimensionality of the latent feature representation z
• L: dimensionality of the latent concept representation c
B.1 Approximate inference of the factorization prior model using Gibbs sampling
The learning algorithm for the factor-VAE model is more complicated than that of VAE, because inference of the latent variables in the graphical prior model requires iterative sampling steps. We use a Gibbs sampler [78] to compute the expected values of G and C, where G and C represent the collections {g_kl} and {c_k}. p(z|G,C) can be obtained in closed form, and it is used as the input to the KL divergence term in the variational likelihood. The learning for the remaining part is the same as in VAE. The sampling for the prior model is given below.
Sample G  p(g_kl | C, G, η, Z, σ_g) ∼ N(μ*_{g_kl}, Λ*_{g_kl}), where

(μ*_{g_kl}, Λ*_{g_kl}) =
    ( μ_g, σ_g^{-2} )                                                                                          if c_kl = 0
    ( ( μ_g/σ_g^2 + (1/σ_d^2) Σ_{n=1}^{N} η_nl z_nk ) Λ_{g_kl}^{-1},  (1/σ_d^2)( σ_d^2/σ_g^2 + Σ_{n=1}^{N} η_nl^2 ) )   if c_kl = 1

Sample η  p(η_nl | C, G, σ_η) ∼ N(μ*_{η_l}, Λ*_{η_l}), where

(μ*_{η_l}, Λ*_{η_l}) = ( Λ*_{η_l}^{-1} ( μ_η/σ_η^2 + Σ_{k=1}^{K} g_kl c_kl z_nk / σ_d^2 ),  1/σ_η^2 + Σ_{k=1}^{K} g_kl^2 c_kl^2 / σ_d^2 )

Sample C

p(c_kl = 1 | Z, X, G) ∝ Λ_{g_kl}^{-1/2} exp( Λ*_{g_kl} μ*_{g_kl}^2 / 2 ) P(c_kl = 1 | α)
Algorithm 4: Sampling-based method for learning the factor-VAE model.
initialize α, σ_k, σ_g, σ_η;
initialize π, {c_d}, {g_dk} by drawing from Gaussian distributions;
for epoch = 1 ... Max_epoch (one pass over the whole training set) do
    for each mini-batch D do
        Run forward propagation on D and compute Z;
        Run the Gibbs sampler to compute
            p(η_nl | {z_nk}, {c_k}, {g_kl});  p({g_kl} | {z_nk}, {c_k}, {η_nl});
            p({c_k} | {z_nk}, {g_kl}, θ, {η_nl});  p(π | α, {c_k})
        and update the g, c variables using averaged samples;
        Compute p(z | {g*_kl}, {c*_k}) and substitute it into the global objective function;
        Learn the model parameters θ and φ using SGD;
B.2 Objective function of biVAE model
In the following derivation, we use p and q in place of P_θ and P_φ to make the equations cleaner. The variational inference objective for one sample x is

∫ q(c, z | x) log [ p(c, z, x) / q(c, z | x) ] dc dz
  = ∫ q(c|x) q(z|x) log [ p(c) p(z|c) p(x|z) / ( q(c|x) q(z|x) ) ] dc dz
  = − KL[ q(c|x) || p(c) ]                          (B.1)
    + ∫ q(z|x) log p(x|z) dz                        (B.2)
    − ∫ q(z|x) log q(z|x) dz                        (B.3)
    + ∫ q(z|x) q(c|x) log p(z|c) dc dz              (B.4)

Assume both q(z|x) and q(c|x) are Gaussian. The variational distributions and generative distributions are parameterized as:

q(z|x) = N(z; μ_z(x), σ_z^2(x))        (B.5)
p(z|c) = N(z; f_z(c), σ_d^2)           (B.6)
q(c|x) = N(c; μ_c(x), σ_c^2(x))        (B.7)
p(c)   = N(0, I)                       (B.8)
Terms in the Variational Lower Bound The first three terms (B.1, B.2, B.3) are:

1. −KL[ q(c|x) || p(c) ] = Σ_{d=1}^{K} [ 0.5 + log σ_{c,d}(x) − ( σ_{c,d}^2(x) + μ_{c,d}^2(x) ) / 2 ]

2. ∫ q(z|x) log p(x|z) dz is approximated via the re-parameterization trick, as in a standard VAE.

3. −∫ q(z|x) log q(z|x) dz = Σ_{d=1}^{D} [ log √(2πe) + log σ_{z,d}(x) ]
The last term (B.4), which characterizes the correlation between c and z, is

∫ q(z|x) q(c|x) log p(z|c) dc dz                                                                               (B.9)
= ∫_c q(c|x) [ ∫_z q(z|x) log p(z|c) dz ] dc                                                                   (B.10)
= ∫_c q(c|x) [ Σ_{d=1}^{D} E_{N(z_d; μ_{z,d}(x), σ_{z,d}^2(x))} ( −z_d^2 − μ_{z,d}(c)^2 + 2 z_d μ_{z,d}(c) ) / (2σ_d^2) ] dc      (B.11)
= ∫_c q(c|x) [ Σ_{d=1}^{D} ( −μ_{z,d}^2(x) − σ_{z,d}^2(x) − μ_{z,d}(c)^2 + 2 μ_{z,d}(x) μ_{z,d}(c) ) / (2σ_d^2) ] dc             (B.12)
= Σ_{d=1}^{D} ( −μ_{z,d}^2(x) − σ_{z,d}^2(x) − E_{q(c|x)}[μ_{z,d}(c)^2] + 2 μ_{z,d}(x) E_{q(c|x)}[μ_{z,d}(c)] ) / (2σ_d^2)        (B.13)

In the implementation, E_{q(c|x)}[μ_{z,d}^2(c)] and E_{q(c|x)}[μ_{z,d}(c)] are replaced with empirical estimates, i.e. (1/M) Σ_{i=1}^{M} μ_{z,d}^2(c^(i)) and (1/M) Σ_{i=1}^{M} μ_{z,d}(c^(i)) with c^(i) ∼ q(c|x). In the simplest case of setting M = 1, the numerator simplifies to

−μ_{z,d}^2(x) − σ_{z,d}^2(x) − μ_{z,d}(c)^2 + 2 μ_{z,d}(x) μ_{z,d}(c)
= −( μ_{z,d}(x) − μ_{z,d}(c) )^2 − σ_{z,d}^2(x)        (B.14)
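Since the M = 1 simplification above is purely algebraic, it can be verified numerically. The small NumPy check below uses arbitrary values and confirms that the single-sample numerator equals −(μ_{z,d}(x) − μ_{z,d}(c))^2 − σ_{z,d}^2(x).

import numpy as np

rng = np.random.default_rng(0)
mu_zx = rng.normal(size=5)               # mu_{z,d}(x) for D = 5 dimensions
var_zx = rng.uniform(0.1, 1.0, size=5)   # sigma^2_{z,d}(x)
mu_zc = rng.normal(size=5)               # mu_{z,d}(c) from one sample c ~ q(c|x), i.e. M = 1

# numerator of (B.13) with expectations replaced by the single-sample estimate
lhs = -mu_zx**2 - var_zx - mu_zc**2 + 2.0 * mu_zx * mu_zc
# simplified form (B.14)
rhs = -(mu_zx - mu_zc)**2 - var_zx

assert np.allclose(lhs, rhs)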
experiment details
1. All VAE models (VAE, β-VAE, factor-VAE, biVAE) are trained with the ADAM [63] optimizer. The learning rate is 0.0002 for all models.
2. Because pixels in the celebA dataset range in [0, 1), we set P_θ(x|z) to be a logistic distribution.
operation comments
Input 32x32x3
5x5 conv 32 BatchNormalization
5x5 conv 64 stride 2 BatchNormalization ReLU
5x5 conv 128 stride 2 BatchNormalization ReLU
5x5 conv 256 stride 2 BatchNormalization ReLU
5x5 conv 64 stride 2 BatchNormalization ReLU
4096 linear 1024 dropout
1024 linear 512 dropout
512 linear (512, 512) predict μ(x),σ(x)
4x4 deconv 1024 BatchNormalization ReLU
4x4 deconv 512 BatchNormalization ReLU
4x4 deconv 256 BatchNormalization ReLU
4x4 deconv 128 BatchNormalization ReLU
4x4 deconv 64 BatchNormalization ReLU
3x3 conv 3, 3x3 conv 3 reconstruction
Table B.1: Architecture of VAE encoder and decoders
3. The architectures of the VAE encoder and decoder are shown in Table B.1. We tried using residual modules in the VAE, but this did not change the performance of the models.
Abstract
This dissertation studied the application of information regularization in two domains: supervised acoustic models and unsupervised variational generative models. In the first case, we applied the information bottleneck principle to explain the observation that entropy regularized perplexity (ERP) is a good acoustic model selection criterion, and empirically studied ways of using ERP to improve supervised deep learning models on various datasets. In the second case, we applied the minimum description length principle to interpret the variational auto-encoder's (VAE) performance in learning disentangled representations, and developed a method that introduces hierarchical structure into the VAE's representation layer to help it learn more meaningful concepts from data.
Asset Metadata
Creator
Guo, Dong
(author)
Core Title
Empirical study of informational regularizations in learning useful and interpretable representations
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
04/28/2019
Defense Date
07/17/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
acoustic model,entropy regularized perplexity,generative model,information bottleneck,OAI-PMH Harvest,variational autoencoder
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Raghavendra, Cauligi (
committee chair
), Nakano, Aiichiro (
committee member
), Prasanna, Viktor K. (
committee member
)
Creator Email
dguo.fb@gmail.com,dongguo@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-158849
Unique identifier
UC11660203
Identifier
etd-GuoDong-7334.pdf (filename),usctheses-c89-158849 (legacy record id)
Legacy Identifier
etd-GuoDong-7334.pdf
Dmrecord
158849
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Guo, Dong
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA