A COMPUTATIONAL FRAMEWORK
FOR DIVERSITY IN ENSEMBLES
OF HUMANS AND MACHINE SYSTEMS
by
Kartik Audhkhasi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2014
Copyright 2014 Kartik Audhkhasi
Contents

List of Tables  viii
List of Figures  xvi
Abstract  xvii
Acknowledgements  xix

1 Introduction  1

2 Modeling a Diverse Ensemble: Globally-Variant Locally-Constant (GVLC) Model  6
  2.1 Introduction  7
  2.2 Prior Work  9
  2.3 The Globally-Variant Locally-Constant Model  16
    2.3.1 ML Parameter Estimation using the EM algorithm  17
    2.3.2 A Bayesian Version of the Proposed Model  22
    2.3.3 Inference of the Hidden Reference Label  25
  2.4 Experiments and Results  26
    2.4.1 Classification Experiments on Synthetic Data  26
    2.4.2 Classification Experiments on UCI Databases  28
    2.4.3 Emotion Classification from Speech  32
    2.4.4 An Insight into GVLC Model's Benefit  36
  2.5 Conclusions and Future Work Directions  39

3 Analysis of Ensemble Diversity: Generalized Ambiguity Decomposition (GAD)  41
  3.1 Introduction  42
  3.2 Generalized Ambiguity Decomposition (GAD) Theorem  45
  3.3 GAD for Common Loss Functions  53
    3.3.1 Squared Error Loss  54
    3.3.2 Absolute Error Loss  54
    3.3.3 Logistic Loss  56
    3.3.4 Exponential Loss  57
    3.3.5 Hinge Loss  57
  3.4 Simulation Experiments on the GAD Theorem for Common Loss Functions  59
    3.4.1 Behavior of Weighted Expert Loss and Diversity in GAD  59
    3.4.2 Accuracy of GAD-Motivated Approximation of Ensemble Loss  61
    3.4.3 Comparison with Loss Function Approximation Used in Gradient Boosting  62
  3.5 Experiments on GAD with Standard Machine Learning Tasks  65
  3.6 Conclusion and Future Work  70

4 Analysis of Ensemble Diversity: Automatic Speech Recognition  73
  4.1 Introduction  74
  4.2 A Vector Space Model for ROVER WER  77
  4.3 Ambiguity Decomposition for ROVER WER  83
  4.4 Experimental Setup  87
    4.4.1 ASR System Training in Kaldi  87
    4.4.2 ASR Confidence Estimation  89
    4.4.3 The 2012 US Presidential Debate Data Set  92
  4.5 Results and Analysis  93
    4.5.1 ASR System Confidence Estimation  93
    4.5.2 ROVER WER  95
    4.5.3 Analysis of Diversity-ROVER WER Link  97
  4.6 A Unified Discriminative Approach for Jointly Training Diverse ASR Systems  100
  4.7 Conclusion and Future Work  103

5 Diversity in Ensemble Design: Diverse Maximum Entropy Models  111
  5.1 Introduction  111
  5.2 Training a ∂MaxEnt Ensemble  113
    5.2.1 KL Divergence Diversity  114
    5.2.2 Posterior Cross-Correlation (PCC) Diversity  116
    5.2.3 Sequential Training of a ∂MaxEnt Ensemble  117
  5.3 Experiments and Results  118
  5.4 Conclusion and Scope for Future Work  121

6 Diversity in Design: Noisy Backpropagation and Deep Bidirectional Pre-training  122
  6.1 Noise Benefits in Backpropagation  123
  6.2 Backpropagation as Maximum Likelihood Estimation  125
  6.3 EM Algorithm for Neural Network ML Estimation  132
  6.4 The Noisy Expectation-Maximization Theorem  136
    6.4.1 NEM Theorem  136
  6.5 Noise Benefits in Neural Network ML Estimation  137
  6.6 Noise Benefits in Classification Accuracy  139
  6.7 Training BAMs or Restricted Boltzmann Machines  141
    6.7.1 ML Training for RBMs using Contrastive Divergence  142
    6.7.2 ML Training for RBMs using the EM algorithm  143
  6.8 Noise Benefits in RBM ML Estimation  144
  6.9 Simulation Results  147
  6.10 Conclusions  147

Bibliography  148
List of Tables

2.1 This table shows the number of parameters for simple plurality, the models presented in [41], [158], [140] and the proposed GVLC model. Simple plurality and the model in [41], [158] do not involve a K-class classifier.  22
2.2 This table shows a summary of different databases from the UCI repository used in the experiments.  27
2.3 This table shows the classification and inference accuracies for the Magic Gamma Telescope database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  29
2.4 This table shows the classification and inference accuracies for the Pima Indians database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  29
2.5 This table shows the classification and inference accuracies for the Abalone database (binary) using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  30
2.6 This table shows the classification and inference accuracies for the Abalone database (3-class) using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  30
2.7 This table shows the classification and inference accuracies for the Yeast database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  31
2.8 This table shows the classification and inference accuracies for the Handwritten Pen Digits database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  31
2.9 This table shows the emotion classification and inference accuracies for the EMA database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  33
2.10 This table shows the valence classification and inference accuracies for the EMA database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  34
2.11 This table shows the activation classification and inference accuracies for the EMA database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  34
2.12 This table shows the valence classification and inference accuracies for the SEMAINE database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  36
2.13 This table shows the activation classification and inference accuracies for the SEMAINE database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  37
2.14 This table shows the intensity classification and inference accuracies for the SEMAINE database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.  37
3.1 This table gives a description of the various UCI Machine Learning Repository data sets used for experiments on GAD. All data sets with {−1,1} as the target set were used for training binary classifiers. Others were used for training regressors. Abalone was used for binary classification as well by first thresholding the target variable at 10.  67
3.2 This table shows the relative absolute error E_x for various UCI data sets between the ensemble loss and approximation x, which is one of GAD, WGT, and GB corresponding to GAD, the weighted sum of expert losses, and the gradient boosting upper-bound on total loss. The first three losses used an ensemble of three classifiers: logistic regression, linear support vector machine, and Fisher's linear discriminant analysis classifiers. The last two losses used three regressors obtained by minimizing the squared error, absolute error, and Huber loss functions. The GAD approximation has significantly lower error than the other approximations for all cases except the Wine quality data set for the smooth absolute loss.  68
3.3 This table shows the relative absolute error E_x for various UCI data sets between the ensemble loss and approximation x, which is one of GAD, WGT, and GB corresponding to GAD, the weighted sum of expert losses, and the gradient boosting upper-bound on total loss. The first three losses used logistic regression classifiers trained on three data sets created by sampling with replacement (bagged training sets). The last two losses used a linear regressor obtained by minimizing squared error and trained on 3 bagged training sets. The GAD approximation has significantly lower error than the other approximations for all cases except the Wine quality data set for the smooth absolute loss.  69
4.1 This table gives a list of key variables and their descriptions used in Lemmas 4.1-4.2 and Theorems 4.1-4.3.  77
4.2 Table 4.2 summarizes the training steps for the three ASR systems used in this chapter for each of the WSJ, HUB4, and ICSI data sets.  88
4.3 Table 4.3 summarizes the testing set 1-best WERs for various ASR systems. The systems trained on the HUB4 data set provide the lowest WERs. M3 models perform the best out of the three models trained for each data set except WSJ for speaker JL.  88
4.4 Table 4.4 summarizes the testing set normalized cross-entropy (NCE) for ASR system confidence estimation. Higher values of NCE indicate better ASR confidence estimates. Perfect ASR confidence estimates give an NCE of 1.  94
4.5 Table 4.5 summarizes the testing set equal error rate (EER) for ASR system confidence estimation. Lower values of EER indicate better ASR confidence estimates. Perfect ASR confidence estimates give an EER of 0.  95
4.6 Table 4.6 summarizes the testing set WERs for various ASR systems after 10-best ROVER under various conditions. '+' denotes fusion of N-best lists across training data sets and/or systems. E.g., the M1 + M2 + M3 row for the WSJ data set indicates that I fused the top-3 hypotheses from the M1, M2, and M3 models before performing ROVER. Oracle confidence-based ROVER performs appreciably better than unweighted ROVER while the CRF confidence-based ROVER gives a minor improvement.  105
4.7 Table 4.7 shows the median per-utterance correlation coefficients between the ROVER WER and its optimal approximation in (4.48). This table also contains the 90% bootstrap confidence intervals for the correlation coefficients. I observe that all correlation coefficients are close to 0.9 and significant at the 10% level.  108
4.8 Table 4.8 shows the median per-utterance γ in (4.48) estimated using least-squares regression between the ROVER WER and its optimal approximation. * indicates that the γ is significantly higher than the corresponding γ for α = 1 using Wilcoxon's signed rank test at the 10% significance level. # indicates that the γ is significantly higher than the corresponding γ for α = 0.85 using CRF confidence-weighted ROVER. I observe that the oracle confidence ROVER is most sensitive to diversity in the N-best list due to its significantly higher γ, followed by the CRF confidence ROVER and the word frequency-based ROVER.  109
5.1 This table shows the NER F1 scores for 5 MaxEnt and ∂MaxEnt models using KL/PCC diversity. α_d and α_e denote the best values of α tuned on the development and evaluation set respectively. Values in bold indicate a statistically significant improvement over the MaxEnt ensemble at the 5% level using McNemar's test.  119
5.2 This table shows weighted F1 scores for emotion classification with 5 models on the IEMOCAP database.  120
List of Figures

2.1 This figure shows the Bayesian network for the model presented in [41], [158]. Shaded and unshaded nodes represent observed and unobserved random variables respectively. The latent true label y generates R noisy labels y^1,...,y^R using the conditional probability matrices A^1,...,A^R.  10
2.2 This figure shows the model presented by Raykar et al. [140]. A MaxEnt classifier explicitly maps the feature vector x to the true hidden label y.  12
2.3 These figures show some prototypical and non-prototypical shapes of three digits (4, 9 and 1) from the Handwritten Pen Digits database [1] in the first and second rows respectively.  13
2.4 This plot shows the variation of error probability for three classifiers (logistic regression, Naive Bayes and J48) on the Yeast database with one of the features. I divided the data set into 4 clusters using K-means and computed the error rates over these clusters. The feature histogram is scaled by 2.5 for illustration purposes.  14
2.5 This figure shows the proposed globally-variant locally-constant (GVLC) model. z is the hidden variable in the GMM which generates the feature vector x.  16
2.6 This figure shows the Bayesian version of the proposed GVLC model. All unmentioned PDFs are the same as in Figure 2.5. Each MaxEnt weight vector w_k is assumed to be drawn from N(w_k; 0, σ²I). A^j_PLU denotes the global reliability matrix of the j-th expert computed using the simple plurality labels as a proxy for the true labels. The k-th column of the reliability matrix A^j is assumed to be drawn independently from a Dirichlet distribution with parameter vector α times the k-th column of A^j_PLU. This is denoted by P(A^j) = Dir(A; αA^j_PLU).  23
2.7 Inference accuracies of the various models using the synthetic binary classification database. The performance of simple plurality and the algorithm by Smyth et al. was 79.6%. The vertical line corresponds to the true number of Gaussians (4).  27
3.1 This figure shows common regression and classification loss functions for target variables Y = 0 and −1 respectively. Smoothed versions of absolute error and hinge loss are shown with ε = 0.1.  54
3.2 The top plot in each figure shows the median actual ensemble loss, its GAD approximation and weighted expert loss across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the squared error loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2. Y = 1 is the correct label. I also show the median diversity term for the same setup in the bottom plot.  61
3.3 The top plot in each figure shows the median actual ensemble loss, its GAD approximation and weighted expert loss across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the smooth absolute error loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2 and ε = 0.5. Y = 1 is the correct label. I also show the median diversity term for the same setup in the bottom plot.  62
3.4 The top plot in each figure shows the median actual ensemble loss, its GAD approximation and weighted expert loss across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the logistic loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2. Y = 1 is the correct label. I also show the median diversity term for the same setup in the bottom plot.  63
3.5 The top plot in each figure shows the median actual ensemble loss, its GAD approximation and weighted expert loss across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the exponential loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2. Y = 1 is the correct label. I also show the median diversity term for the same setup in the bottom plot.  64
3.6 The top plot in each figure shows the median actual ensemble loss, its GAD approximation and weighted expert loss across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the smooth hinge loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2 and ε = 0.5. Y = 1 is the correct label. I also show the median diversity term for the same setup in the bottom plot.  65
3.7 This figure shows the median absolute approximation error and error bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the smooth absolute error loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2, ε = 0.5 and Y = 1 as the correct label.  66
3.8 This figure shows the median absolute approximation error and error bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the logistic loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2 and Y = 1 as the correct label.  67
3.9 The figure shows the median absolute approximation error and error bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the exponential loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2 and Y = 1 as the correct label.  70
3.10 This figure shows the median absolute approximation error and error bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the smooth hinge loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 2, ε = 0.5, and Y = 1 as the correct label.  71
3.11 This figure shows the median absolute approximation error for GAD and the gradient boosting upper bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts for the smooth hinge loss function as a function of expected ensemble prediction μ_f. I used σ²_f = 0.1 and Y = 1 as the correct label.  72
4.1 This figure illustrates the ambiguity decomposition for ROVER WER presented in Theorem 4.2. I consider recognition of a single word out of a vocabulary of K = 3 words and M = 3 ASR systems. Each of the ASR systems predicts a different word in the given cohort set. The three axes constitute the Euclidean vector space arising due to the 1-in-3 encoding of words. The average ROVER prediction h_avg is [1/3, 1/3, 1/3] and the approximate WER is 1/3. Theorem 4.2 decomposes this into a difference of the average WER of the 3 systems computed from the average squared-length of the finely-dotted lines (2/3) and the diversity of the ensemble computed from the average squared-length of the thick-dotted lines (1/3).  85
4.2 This figure shows the probability density function (pdf) and cumulative distribution function (cdf) for the error in the bounds in Theorem 4.1. I observe that the bounds hold for an appreciably high fraction (≥ 93%) of the decoded test files over all system combination variations in Table 4.6. The fraction of the few bound violations is denoted by the height of the curves in the two bottom figures at the point of intersection with the black 0 line. These violations occur because Theorem 4.1 assumes word errors to be IID, which is not the case in practice.  106
4.3 This figure shows the scatter plot between the approximate ROVER WER and the difference of the average N-best WER and N-best diversity for all decoded test files over all system combination variations in Table 4.6. Theorem 4.2 says that the approximate ROVER WER equals the difference of the average N-best WER and the N-best diversity. The above plots illustrate this because all the points lie on the 45° line.  107
4.4 This figure shows the scatter plot between ROVER WER and average N-best WER of different 10-best lists combined using the oracle confidence-weighted ROVER algorithm. I have averaged the WERs across all test utterances and ignored any impact of N-best diversity in this plot. I observe a correlation coefficient of 0.718 and root mean-squared error (RMSE) of 7.562 between average N-best WER and the ROVER WER. Hence in general, ROVER WER decreases as the ASR systems being combined become more accurate.  108
4.5 This figure shows the scatter plot between ROVER WER and diversity of different 10-best lists combined using the oracle confidence-weighted ROVER algorithm. I have averaged the WERs across all test utterances and ignored any average N-best WER in this plot. I observe a correlation coefficient of −0.538 between N-best diversity and the ROVER WER. Hence in general, ROVER WER decreases as the ASR systems being combined become more diverse.  109
4.6 This figure shows the scatter plot between ROVER WER and an optimal linear combination of average N-best WER and diversity of different 10-best lists combined using the oracle confidence-weighted ROVER algorithm. I have averaged the WERs across all test utterances and computed the coefficient γ = 0.49 using least-squares linear regression. I observe a correlation coefficient of 0.991 and RMSE of 0.784 between the optimal linear combination and the ROVER WER. Thus this optimal linear combination predicts the ROVER WER better than the average N-best WER and diversity considered individually.  110
5.1 F1 score on the NER evaluation set for PCC-∂MaxEnt and bagged ensembles of increasing size. α_d was tuned on the development set and α_e on the evaluation set.  119
5.2 This figure shows the F1 score and average log-likelihood for the NER development set with increasing diversity weight α. Five ∂MaxEnt models were used with PCC diversity on bagged data.  120
6.1 This figure shows the percent median reduction in per-iteration cross entropy for the NEM-backpropagation (NEM-BP) training relative to the noiseless EM-BP training of a 10-class classification neural network trained on 1000 images from the MNIST data set. I observe a reduction in cross entropy of 18% for the training and the testing set at the optimal noise standard deviation of 0.42. The neural network used three logistic (sigmoidal) hidden layers with 40 neurons each. The input layer used 784 logistic neurons and the output layer used 10 neurons with Gibbs activation function. The bottom figure shows the training set cross entropy as iterations proceed for EM-BP and NEM-BP training using the optimal noise variance of 0.42. The knee-point of the NEM-BP curve at iteration 4 achieves the same cross entropy as does the noiseless EM-BP at iteration 15.  125
6.2 This figure shows the percent median reduction in per-iteration cross entropy for the EM-backpropagation training with blind noise (Blind-BP) relative to the noiseless EM-BP training of a 10-class classification neural network trained on 1000 images from the MNIST data set. I observe a marginal reduction in cross entropy of 1.7% for the training and the testing set at the optimal noise standard deviation of 0.54. The neural network used three logistic (sigmoidal) hidden layers with 40 neurons each. The input layer used 784 logistic neurons and the output layer used 10 neurons with Gibbs activation function. The bottom figure shows the training set cross entropy as iterations proceed for EM-BP and Blind-BP training using the optimal noise variance of 0.54. Both the blind noise EM-BP and the noiseless EM-BP give similar cross-entropies for all iterations.  126
6.3 This figure shows the percent median reduction in per-iteration classification error rate for the NEM-backpropagation (NEM-BP) training relative to the noiseless EM-BP training of a 10-class classification neural network trained on 1000 images from the MNIST data set. I observe a reduction in classification error rate of 15% for the training and around 10% for the testing set at the optimal noise standard deviation of 0.42. The neural network used three logistic (sigmoidal) hidden layers with 40 neurons each. The input layer used 784 logistic neurons and the output layer used 10 neurons with Gibbs activation function. The bottom figure shows the training set classification error rate as iterations proceed for EM-BP and NEM-BP training using the optimal noise variance of 0.42. The knee-point of the NEM-BP curve at iteration 4 achieves the same classification error rate as does the noiseless EM-BP at iteration 11.  127
6.4 This figure shows the percent median reduction in per-iteration classification error rate for the EM-backpropagation training with blind noise (Blind-BP) relative to the noiseless EM-BP training of a 10-class classification neural network trained on 1000 images from the MNIST data set. I observe a minor reduction in classification error rate of 1% for the training and the testing set at the optimal noise standard deviation of 0.28. The neural network used three logistic (sigmoidal) hidden layers with 40 neurons each. The input layer used 784 logistic neurons and the output layer used 10 neurons with Gibbs activation function. The bottom figure shows the training set classification error rate as iterations proceed for EM-BP and Blind-BP training using the optimal noise variance of 0.28. Both the curves show similar classification error rates for all iterations.  128
6.5 This figure shows the percent median reduction in per-iteration squared reconstruction error for the training with NEM noise relative to the noiseless training of a BAM on 1000 images from the MNIST data set. I observe a reduction of 16% in the training set squared reconstruction error at the optimal noise variance of 1024. The BAM used one logistic (sigmoidal) hidden layer with 40 neurons and an input layer with 784 logistic neurons. The bottom figure shows the training set squared reconstruction error as iterations proceed for NEM and noiseless training using the optimal noise variance of 1024.  129
6.6 This figure shows the percent median reduction in per-iteration squared reconstruction error for the training with blind noise relative to the noiseless training of a BAM on 1000 images from the MNIST data set. I observe no significant difference in the per-iteration squared reconstruction error for the two cases. The BAM used one logistic (sigmoidal) hidden layer with 40 neurons and an input layer with 784 logistic neurons.  130
6.7 Noise benefit region for a neural network with Bernoulli (logistic) output neurons: Noise speeds up maximum-likelihood parameter estimation of the neural network with Bernoulli output neurons if the noise lies above a hyperplane that passes through the origin of the noise space. The activation signal a_t of the output layer controls the normal to the hyperplane. The hyperplane changes as learning proceeds because the parameters and hidden layer neuron activations change. I used independent and identically distributed (i.i.d.) Gaussian noise with mean 0, variance 3, and (3,1,1) as the normal to the hyperplane.  131
6.8 Noise benefit region for a neural network with Gaussian output neurons: Noise speeds up maximum-likelihood parameter estimation of the neural network with Gaussian output neurons if the noise lies inside a hypersphere. The activation signal a_t of the output layer and the target signal t control the center and radius of this hypersphere. This hypersphere changes as learning proceeds because the parameters and hidden-layer neuron activations change. I used i.i.d. Gaussian noise with mean 0, variance 3, and center t − a_t = (1,1,1).  132
Abstract
My Ph.D. thesis presents a computational framework for diversity in ensembles or collections
of humans and machine systems used for signal and information processing.
Machine system ensembles have outperformed single systems across many pattern
recognition tasks ranging from automatic speech recognition to online recommendation.
Likewise, ensembles are central to computing with humans, for example, in
crowdsourcing-based data tagging and annotation in human behavioral signal processing.
This widespread but largely heuristic use of ensembles is motivated by their
robustness to the ambiguity in production, representation, and processing of real-world
information. Researchers widely accept diversity or complementarity of the individual
humans and machine systems as a key ingredient in ensemble design. I present a
computational framework for this diversity by addressing three important problems:
modeling, analysis, and design.

I first propose the Globally-Variant Locally-Constant (GVLC) model for the labeling
behavior of a diverse ensemble. The GVLC model captures the data-dependent
reliability and diverse behavior of an ensemble through a latent state-dependent noisy
channel. I present an Expectation-Maximization (EM) algorithm for finding the maximum
a posteriori (MAP) estimates of the GVLC model's parameters in the absence
of the latent true label of interest. I next present the Generalized Ambiguity Decomposition
(GAD) theorem that defines ensemble diversity for a broad class of loss
functions used in statistical learning and relates this diversity to ensemble performance.
I show an application of the GAD theorem by theoretically and empirically
linking the diversity of an automatic speech recognition system ensemble with the
word error rate of the fused hypothesis sentence. The final part of my thesis presents
algorithms to design a diverse ensemble of machine systems, ranging from maximum
entropy models to sequence classifiers (this appears in Section 4.6 of Chapter 4). I also
prove that introducing diversity in the training data through careful noise addition
speeds up the maximum likelihood training of Restricted Boltzmann Machines (or the
equivalent Bidirectional Associative Memories) and feed-forward neural networks.
Acknowledgements
First and foremost, I dedicate this thesis to my parents (Mr. G. S. Audhkhasi and
Mrs. Punam Audhkhasi) and younger brother (Romil Audhkhasi). They have been
a constant source of support and strength throughout my academic life. I would also
like to thank my fiancee, Radhika Sood, for her understanding and patience during
the last phase of my Ph.D.
I thank my Ph.D. advisor, Prof. Shrikanth (Shri) Narayanan of the University of
Southern California (USC). Shri manages all his mentees exceptionally well and gives
them lots of opportunities to grow intellectually. This thesis would not have been
possible without the freedom which he granted me to explore new research directions.
I would also like to thank other members of the defense committee (Prof. Antonio
Ortega and Prof. Fei Sha), and the qualifying exam committee (Prof. Antonio Ortega,
Prof. Fei Sha, Prof. Alex Dimakis, Prof. Stefan Schaal, and Dr. Abhinav Sethy) for
their guidance and feedback.
A significant part of my thesis is the result of a long collaboration with Dr. Ab-
hinav Sethy and Dr. Bhuvana Ramabhadran from the IBM T. J. Watson Research
Center. I thank them for all that I have learned by collaborating with them. My sin-
cere thanks also go to Prof. Panayiotis (Panos) Georgiou of USC. An overwhelming
majority of my automatic speech recognition (ASR) knowledge comes from working
with Panos.
I express my gratitude to fellow members of the Signal Analysis and Interpreta-
tion Lab (SAIL) at USC. I enjoyed several stimulating discussions and collaborations
with (both past and present) SAILers, and especially cherish the fun we had during
numerous conference trips. My roommates and friends at USC also made the Ph.D.
process easier for me by being part of my family away from home. My gratitude also
goes to numerous collaborators on research projects and papers.
Last, but certainly not the least, I especially thank Prof. Bart Kosko of USC for
his research and academic mentorship. Prof. Kosko not only encouraged me to think
about bold research problems and become a better scholar, but also took time out to
guide me about the requisite details.
Chapter 1
Introduction
“You do not understand anything until you learn it more than one way.”
– Prof. Marvin Minsky (Professor of EECS, MIT)
The ubiquitous nature of the internet has spawned several software services that aim
to improve the quality of life of an increasingly mobile and networked society. Exam-
ples of such services include search engines (such as Google and Bing), online shopping
websites (such as Amazon.com), automatic speech recognition-based personal assis-
tants (such as Apple’s Siri), social media websites (such as Facebook and Twitter),
and online movie and entertainment services (such as Netflix). Each such service
has millions of human users interacting with them on a daily basis. Cutting-edge
cognitive computing systems such as IBM’s Watson seek to make human interaction
with vast amounts of unstructured information even more seamless. Such cognitive
computing systems have the potential to impact key societal problems in healthcare,
finance, and commerce.
Each of the above systems and services uses tools from the broad field of signal
and information processing (SIP). These modern SIP systems share the following two
salient characteristics that distinguish them from conventional SIP systems:
1. Human-centered nature: Modern SIP systems intricately involve humans
during all stages of the SIP pipeline - information generation, processing, and
consumption. For example, an automatic sentiment or emotion analysis system
uses speech or natural text generated by humans as input. The system then
learns from the online or offline feedback provided by humans and presents
the predictions to the human user. The human user then utilizes these output
predictions for any desired task, such as decision making.
2. Use of big data: Modern SIP systems operate on massive and ever-expanding
data sets generated by many humans. For instance, most online commerce web-
sites such as Amazon.com and Netflix contain many human-generated reviews
of thousands of products across different categories. An automatic product rec-
ommendation system must learn from the big data set of all these reviews and
then attempt to suggest products to a new customer.
Both the above facets of modern SIP systems deeply influence the underlying the-
ory, algorithms, and system design. First, the presence of humans at every stage of
the SIP pipeline naturally leads to ambiguity in the representation and processing
of information. The task of designing an automatic system for detecting emotions
from human speech best illustrates this situation. A typical automatic emotion
recognition system classifies the input speech into canonical emotional classes such
as angry, happy, sad, or neutral. However, natural human emotions are a continuum
and cannot be quantized into discrete categories. The production and perception of
these human emotions also depends on several contextual factors such as the words
spoken by the speaker and the topic of the utterance. It is desirable to incorporate
all these human-centered factors while designing an automatic emotion recognition
system. Second, the use of big data in modern SIP systems also has significantly
influenced the design of theory, algorithms, and systems. Big data is characterized
by volume, velocity, variety, and veracity of the information being processed. Volume
and velocity make processing time an important consideration for SIP. Variety implies
that the data often comes from multiple disparate sources. For instance, automatic
user authentication systems often use diverse information sources such as speech and
video. Finally, big data often does not have readily available ground truth informa-
tion for training automatic SIP systems. Large-scale automatic image classification
provides an example, where it is not uncommon for an appreciable fraction of the
training image data to contain only noisy object class labels at best.
Ensembles or collections of multiple humans and machine systems have emerged as
a popular approach to deal with this human-centered and big data nature of modern
SIP, as exemplified above. We note that an ensemble of multiple humans is inevitable
while working with human-centered data. The earlier example of automatic emotion
recognition from speech provides one such situation where the training audio dataset
is often labeled by multiple human annotators for various canonical emotion classes.
Using an ensemble of humans encourages the resulting set of emotion class labels to
be more representative of the true underlying emotion. Ensembles of multiple humans
are also important in many big data problems such as automatic online recommenda-
tion where multiple customers rate a given product for its quality. Human ensembles
also enable cheap and fast annotation of big data through crowd-sourcing services such
as Amazon Mechanical Turk. Finally, ensembles of multiple machine systems have
achieved state-of-the-art performance in many well-known pattern recognition tasks
such as automatic speech recognition, speaker identification, online recommendation,
text classification, and object classification from images.
Diversity or complementarity of the human or machine system ensemble is a cru-
cial ingredient of ensemble SIP (ESIP) systems. Intuitively, a trivial ensemble where
all constituent humans or machine systems process the data in a near-identical fashion
will not give an appreciable benefit over a single human or machine system. The cost
of training and using such a minimally-diverse ensemble invariably exceeds any re-
sulting performance or processing benefits. A more concrete motivation for diversity
in ESIP comes from significant performance gains observed upon using ensembles
in several SIP applications highlighted above - automatic speech recognition, on-
line recommendation, text and image classification. Multiple annotators processing
human-generated data such as audio also exhibit diverse behavior and reliability. A
systematic computational study of this underlying diversity has often been ignored in
prior research. Thus, this dissertation presents a computational framework for diver-
sity in an ensemble of humans and machine systems used for SIP. It studies ensemble
diversity by addressing three important questions - modeling, analysis, and design.
The modeling question aims to find a realistic statistical model for SIP when
using a diverse ensemble. Such a model can provide a principled framework for
fusion of outputs from the constituent members of an ensemble and can also provide
insights about their reliability and behavior. I propose the Globally-Variant Locally-
Constant (GVLC) model as a possible candidate. The GVLC model is a single-input
multiple-output (SIMO) state-dependent Bayesian network that mimics the data-
dependent behavior of the ensemble. I present an expectation-maximization (EM)
algorithm to perform maximum a posteriori (MAP) parameter estimation for the
GVLC model in the absence of data such as the ground truth value of the signal or
label of interest. Application of the GVLC model to emotion recognition from speech
and several standard pattern classification tasks shows performance improvements
over other prior statistical models. The GVLC model also provides new insights
about the diversity and reliability of the individual ensemble members.
The next segment of this dissertation addresses the analysis problem that aims
to mathematically define ensemble diversity and relate it to ensemble performance.
The relation between ensemble diversity and the quality of the fused output has often
been observed by several researchers working in SIP and related fields. However, a
general theoretical understanding of this link is lacking in prior literature. Such a
theoretical framework can provide insights about the observed performance benefits
of machine system fusion across several SIP tasks. More importantly, it can also
motivate principled algorithms for training a diverse ensemble of machine systems
instead of existing ad-hoc approaches. I present the Generalized Ambiguity Decompo-
sition (GAD) theorem that provides this theoretical framework for the broad class of
second-order differentiable loss functions used in SIP and machine learning. The GAD
theorem gives a data- and loss-function-dependent definition of ensemble diversity, and
presents an upper-bound on ensemble performance in terms of this diversity. I report
several experiments on simulated and real-world pattern recognition data sets that
show the accuracy and behavior of this result for ensembles of popular classifiers and
regressors. This dissertation also analyses the impact of diversity on performance
in an ensemble of sequence classification systems such as those used in automatic
speech recognition, human motion activity detection, financial time series prediction,
and gene sequence classification. I theoretically relate ensemble diversity and fusion
performance for such ensembles, and then present detailed experiments on ensembles
of automatic speech recognition systems as a case study to illustrate the theory.
The final part of this dissertation addresses the design question by incorporating
diversity considerations during the design of automatic SIP systems. I use the GAD
theorem to motivate a discriminative training algorithm for an ensemble of diverse
maximum entropy (∂MaxEnt) models, which are state-of-the-art systems used in several
speech and natural language processing tasks. I also present the Diverse Minimum
Bayes Risk (DMBR) framework for training an ensemble of diverse sequence classification
systems (this appears in Section 4.6 of Chapter 4). I finally prove that diversifying
the training data through careful noise addition speeds up both unsupervised pre-training
and discriminative training of feed-forward neural networks. Both shallow and deep
neural networks are the subject of much recent research due to their state-of-the-art
performance in several applications such as automatic speech recognition and object
recognition from images. I derive a geometric sufficient condition that selects additive
noise to guarantee a training speed-up. The noise adds to the training data samples.
The resulting noise benefit is
more pronounced in case of limited training data since the carefully-selected noise
samples have the effect of creating additional training data samples.
Chapter 2
Modeling a Diverse Ensemble:
Globally-Variant Locally-Constant
(GVLC) Model
Researchers have shown that fusion of labels from multiple experts – humans or ma-
chine classifiers – improves the accuracy and generalizability of the overall pattern
recognition system. Simple plurality is a popular technique for performing this fusion,
but it gives equal importance to labels from all experts, who may not be equally reliable
and consistent across data sets. Estimation of expert reliability without knowing
the reference labels is however a challenging problem. Most previous works deal with
these challenges by modeling expert reliability as constant over the entire data (fea-
ture) space. This chapter presents a model based on the consideration that in dealing
with real-world data, expert reliability is variable over the complete feature space
but constant over local clusters of homogeneous instances. I hence call this model
the Globally-Variant Locally-Constant (GVLC) model. The GVLC model jointly
learns a classifier and expert reliability parameters without assuming the knowledge
of reference labels using the Expectation-Maximization algorithm. Classification ex-
periments on simulated data, data from the UCI Machine Learning Repository and
two emotional speech classification datasets show the benefits of the proposed model.
Using a metric based on the Jensen-Shannon divergence, I empirically show that the
proposed model gives greater benefit for data sets where expert reliability is highly
variable over the feature space.
2.1 Introduction
Conventional supervised pattern classification assumes the availability of a reference
label for each training instance based on two implicit assumptions. First, the set of
classes is assumed to be unambiguous and crisply defined. This may not hold in many
real-world scenarios due to the inherent non-categorical and ambiguous nature of the
phenomena of interest in both their manifestation and human processing. A classic
example is emotion recognition from speech. It is well-known that the expression and
perception of natural human emotions is complex and characterized by heterogene-
ity [117]. For example, while the emotional content of a person’s speech may appear
predominantly angry, it may have different shades of anger. Furthermore, it may
contain acoustic characteristics of neutrality and sadness as well. Thus discretization
of the emotional description into one of a few categories such as angry or sad is only
an approximation to the underlying continuum. The second common assumption be-
hind the availability of a reference label is that the labeling process is reliable, i.e., the
correct class label has been assigned to each instance. This assumption is often unre-
alistic since even expert labelers typically do not possess complete knowledge of the
classes. For example, a human expert labeling emotional speech is biased by his prior
experience about the acoustic characteristics of various emotions, which may not be
in consonance with the speech clips being appraised (labeled). Machine classifiers also
generate labels for unseen data instances using the instance-to-label mapping learned
from a training corpus, which may not generalize to the test data. This results in
classification errors. Finally, in some situations, obtaining the true label can be ex-
pensive, time-consuming, or even dangerous. For example, in the medical domain,
labeling a tissue as malignant or otherwise can be done through biopsy, which is not
only an expensive procedure but invasive as well [140]. Supervised training of spoken
language systems (e.g. automatic speech recognizers) typically requires professional
transcription of large amounts of speech data, which is both time-consuming and
expensive [159].
A simple strategy for dealing with the above issues is to get each instance labeled
by an ensemble of potentially imperfect experts (a generic term for a human labeler
or machine classifier). This is followed by simple plurality fusion, wherein the class
label with most votes is deemed the reference. Consider R experts and a label set
with K classes, denoted by {1,...,K}. Let y^j denote the label assigned by the j-th
expert and y_k^j its 1-in-K encoding, i.e., y_k^j = 1 if y^j = k, and 0 otherwise. Formally,
simple plurality uses the following decision rule

\hat{y}_{\text{PLU}} = \arg\max_{k \in \{1,\dots,K\}} \sum_{j=1}^{R} y_k^j    (2.1)

to find the fused label ŷ_PLU. Thus, simple plurality gives equal weight to the labels
from all experts. However, since the reliability of different experts can be variable and
also data-dependent, it is reasonable to emphasize a more reliable expert's judgment
while making the overall decision. But computing expert reliability in many real-world
data labeling problems is challenging and can be exacerbated by the hidden
or ambiguous nature of the true class labels in many cases. Dawid and Skene [41],
and later Smyth et al. [158], proposed a simple but powerful approach to this problem
based on the Expectation-Maximization (EM) algorithm. I describe their approach in
Section 2.2. In this model, the Bayes optimal maximum a-posteriori (MAP) decision
rule for combining labels from R experts becomes a weighted sum of their 1-in-K
encoding.
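As a concrete illustration (not from the original text), the following short Python sketch implements the simple plurality rule of (2.1); the function and variable names are my own, and classes are indexed from 0 rather than 1.

```python
import numpy as np

def simple_plurality(expert_labels, num_classes):
    """Fuse labels from R experts by simple plurality, as in (2.1)."""
    votes = np.zeros(num_classes)
    for y_j in expert_labels:
        votes[y_j] += 1           # accumulate the 1-in-K encoding of expert j's label
    return int(np.argmax(votes))  # ties break toward the lowest class index

# Example: R = 5 expert labels over K = 3 classes.
print(simple_plurality([0, 2, 2, 1, 2], num_classes=3))  # -> 2
```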
One of the limitations of the model in [41], [158] is that a classifier has to be
learnt separately from the estimation of the labeling parameters, making the overall
estimation sub-optimal. To overcome this difficulty, Raykar et al. [140] proposed an
extension by explicitly incorporating a classifier linking the feature vector and the
hidden reference label. They showed the accuracy of this model over a variety of
datasets to be better than a classifier learned using the labels obtained from simple
plurality fusion and the model in [41], [158].
All the above models assume expert reliability to be constant over all data in-
stances. However, in real-world scenarios, this reliability varies from one instance
to another, as illustrated by means of examples in the next section. Furthermore,
instances close to each other in the feature space tend to be labeled with similar
reliability. These two ideas form the basis for the expert ensemble labeling model
proposed in this section. A generative model captures the feature variability in this
model. I consider a feature space generated using a Gaussian mixture model (GMM).
The given feature vector generates the hidden reference label using the multinomial
logistic regression or maximum entropy (MaxEnt) model. As will become apparent
later, this model can use any classifier that can be trained with soft labels. The model
assumes each expert's reliability to be constant only over each mixture component, as
opposed to the entire feature space. I hence refer to this model as the globally-variant
locally-constant (GVLC) model.
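To make the generative story concrete, here is a minimal simulation sketch of the GVLC idea described above, assuming a GMM over the feature space, a MaxEnt (softmax) classifier from the feature vector to the hidden true label, and one reliability matrix per expert per mixture component. All dimensions and parameter values below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: M mixture components, K classes, R experts, d features.
M, K, R, d = 4, 3, 5, 2

mix_weights = np.full(M, 1.0 / M)            # GMM mixing weights
means = rng.normal(size=(M, d))              # GMM component means (unit covariance)
W, b = rng.normal(size=(K, d)), np.zeros(K)  # MaxEnt weights and biases for P(y | x)
# A[j, m, k] plays the role of the k-th column of expert j's reliability matrix in
# cluster m: a distribution over noisy labels given true label k.
A = rng.dirichlet(np.ones(K), size=(R, M, K))

def sample_gvlc():
    z = rng.choice(M, p=mix_weights)            # hidden GMM component (local cluster)
    x = rng.normal(means[z], 1.0)               # feature vector drawn from that cluster
    p_y = np.exp(W @ x + b)
    p_y /= p_y.sum()                            # MaxEnt posterior over the true label
    y = rng.choice(K, p=p_y)                    # hidden true label
    noisy = [rng.choice(K, p=A[j, z, y]) for j in range(R)]  # cluster-specific corruption
    return x, y, noisy

print(sample_gvlc())
```

The only difference from the globally-constant models reviewed in Section 2.2 is the extra cluster index z on the reliability matrices.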
A preliminary version of the GVLC model appeared in Interspeech-2010 [7]. This
chapter develops the model in a Bayesian framework for the multi-class case and
analyzes the different models both theoretically and experimentally. I present detailed
experiments using a variety of data sets that include simulated data and standard data
sets from the UCI Machine Learning Repository. Finally, I report results on tasks
of classifying four emotional categories, as well as emotional valence, activation and
intensity from speech. While the experts in the case of emotional speech tasks refer to
human evaluators, they are machine classifiers in the case of the UCI databases. The
next section presents a review of prior work in this domain, followed by a description
of the GVLC model in Section 2.3. I describe details on experiments conducted
on simulated and real world databases in Section 2.4. Subsection 2.4.4 attempts to
explain the observed benefit of the proposed GVLC model. Section 2.5 concludes the
chapter with some directions for future work.
2.2 Prior Work
Fig. 2.1 shows the Bayesian network for one of the first models proposed for this
problem, due to Dawid and Skene [41] and Smyth et al. [158]. Let y be a K-valued
random variable that represents the unobserved reference label for a given training
example. The authors assume that each of the R experts is characterized by a K×K
reliability matrix A^j, j ∈ {1,...,R}. When asked to give a label corresponding to the
true label y = k, the j-th expert samples from the K-valued conditional distribution
{A^j(k_1,k)}_{k_1=1}^{K}. A^j(k_1,k) is the probability that the j-th expert confuses class k_1
for class k. Given a training corpus, i.e., N independent and identically distributed
(IID) R-tuples of labels from the R experts, the learning task is to estimate the prior
probability distribution of y and {A^1,...,A^R}. The authors use the EM algorithm [42]
to find maximum likelihood (ML) estimates of these parameters. Once the parameter
estimation is complete, the authors infer the true label y given observed noisy labels
Figure 2.1: This figure shows the Bayesian network for the model presented in [41],
[158]. Shaded and unshaded nodes represent observed and unobserved random variables
respectively. The latent true label y generates R noisy labels y^1,...,y^R using
the conditional probability matrices A^1,...,A^R.
{y^1,...,y^R} as follows:

\hat{y}_{\text{MAP}} = \arg\max_{k \in \{1,\dots,K\}} \log P(y = k \mid y^1,\dots,y^R)
                     = \arg\max_{k \in \{1,\dots,K\}} \Big[ \log \pi_k + \sum_{j=1}^{R} \sum_{k_1=1}^{K} y_{k_1}^j \log A^j(k_1,k) \Big]    (2.2)
This decision rule is a weighted simple plurality, where the label from the more
reliable expert is given greater weight through the logarithm of the elements of the
conditional probability matrices A^1,...,A^R. The emotion recognition community has
developed similar intuitively-inspired efforts to incorporate evaluator reliability during
label fusion [65]. Proposition 2.1 states a sufficient condition for the equivalence of
the decision rule in (2.2) and simple plurality.
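Before turning to Proposition 2.1, here is a minimal sketch of the fusion rule (2.2) in Python, assuming the prior π and the reliability matrices A^1,...,A^R have already been estimated (for example by the EM procedure referenced above); the names and the small numerical example are mine, and classes are indexed from 0.

```python
import numpy as np

def map_fuse(expert_labels, log_prior, log_A):
    """Weighted label fusion as in (2.2).

    expert_labels: length-R list of integer labels in {0, ..., K-1}.
    log_prior:     length-K array of log pi_k.
    log_A:         (R, K, K) array, log_A[j, k1, k] = log P(expert j reports k1 | true class k).
    """
    scores = log_prior.copy()
    for j, y_j in enumerate(expert_labels):
        scores += log_A[j, y_j, :]   # only the row selected by expert j's label contributes
    return int(np.argmax(scores))

# Two experts, two classes: expert 0 is 90% reliable, expert 1 only 60% reliable,
# so a disagreement is resolved in favor of expert 0.
A = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.6, 0.4], [0.4, 0.6]]])
print(map_fuse([0, 1], np.log(np.full(2, 0.5)), np.log(A)))  # -> 0
```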
Proposition 2.1. If the prior probability distribution of y is uniform (π_k = 1/K for all
k ∈ {1,...,K}) and all expert reliability matrices are equal (A^j = A for all j ∈ {1,...,R})
with the following values:

A(k,k) = \frac{\alpha}{\alpha + K - 1} \quad \forall k \in \{1,\dots,K\}    (2.3)

A(t,k) = \frac{1}{\alpha + K - 1} \quad \forall t \neq k    (2.4)

where α ∈ R^+ − {1}, then (2.2) reduces to the simple plurality rule in (2.1).
Proof. I write the pairwise discriminant functions for the two decision rules in
(2.1) and (2.2) as

h(k,t) = \sum_{j=1}^{R} \big( y_k^j - y_t^j \big) \quad \text{and} \quad g(k,t) = \log\frac{\pi_k}{\pi_t} + \sum_{j=1}^{R} \sum_{k_1=1}^{K} y_{k_1}^j \log\frac{A^j(k_1,k)}{A^j(k_1,t)}    (2.5)

for k ≠ t. The two decision rules now become the following in terms of the pairwise
discriminant functions:

\hat{y}_{\text{PLU}} = k \iff h(k,t) \geq 0 \;\; \forall t \neq k \quad \text{and} \quad \hat{y}_{\text{MAP}} = k \iff g(k,t) \geq 0 \;\; \forall t \neq k    (2.6)

Equality of the two pairwise discriminant functions is a sufficient condition for
ŷ_PLU to be equal to ŷ_MAP. Next, I write g(k,t) as

g(k,t) = \log\frac{\pi_k}{\pi_t} + \sum_{j=1}^{R} y_k^j \log\frac{A^j(k,k)}{A^j(k,t)} + \sum_{j=1}^{R} y_t^j \log\frac{A^j(t,k)}{A^j(t,t)} + \sum_{j=1}^{R} \sum_{k_1 \neq k,t} y_{k_1}^j \log\frac{A^j(k_1,k)}{A^j(k_1,t)}.    (2.7)

Comparing the above equation with h(k,t), I obtain the following sufficient conditions
for g(k,t) to be equal to h(k,t):

\pi_k = \pi_t; \quad A^j(k,t) = \frac{1}{\alpha} A^j(k,k), \quad A^j(t,k) = \frac{1}{\alpha} A^j(t,t) \;\; \forall j; \quad A^j(k_1,k) = A^j(k_1,t) \;\; \forall j \text{ and } k_1 \neq k,t    (2.8)

where α ∈ R^+ − {1} is the base of the logarithm. Since A^j is singly-stochastic with
the entries of each column adding to 1, one can find its entries using the above constraints.
These turn out to be the same as in (2.3) and (2.4).
Proposition 2.1 requires that, for each expert, the probability of retaining the true label is the same for all values of y. In addition, the probability of making an error is constant for all choices of the true and noisy label. Finally, the reference label should be equally likely to assume any one of the K possible values. α ∈ [0, 1) implies that A^j(k, k) < A^j(t, k) ∀ t ≠ k and j, which means that all experts are adversarial and less likely than chance to retain the true label. α = 0 denotes totally adversarial experts who always flip the true label to some incorrect label. For α ∈ (1, +∞), A^j(k, k) > A^j(t, k) and all experts are non-adversarial. α → +∞ denotes perfect experts, who always retain the true label.

Figure 2.2: This figure shows the model presented by Raykar et al. [140]. A MaxEnt classifier explicitly maps the feature vector x to the true hidden label y.
Learning a classifier requires one to first use the above model to infer the true hidden label y. The inferred labels y and the given feature vector x then provide training data for the classifier. However, while these inference and learning steps are individually optimal, the overall process is not. Raykar et al. [140] overcome this limitation by jointly learning the classifier and the expert reliability matrices (Figure 2.2). As compared to Figure 2.1, we observe that Figure 2.2 explicitly includes the generation of the hidden reference label y through the link between x and y. A MaxEnt classifier governs the conditional distribution of y given x. ML parameter estimation is again performed within the EM framework. The classifier trains on soft instead of hard labels since it utilizes the posterior distribution of y estimated during the E-step.
If, in addition to the feature vector, noisy labels from the R experts are also available, one can infer the true label y as follows:

ŷ_MAP = argmax_k P(y = k | y^1, ..., y^R, x) = argmax_k [ log P(y = k | x) + log P(y^1, ..., y^R | y = k) ]
      = argmax_k [ w_k^T x + b_k + Σ_{j=1}^{R} Σ_{k_1=1}^{K} y^j_{k_1} log A^j(k_1, k) ]    (2.9)

This decision rule is very similar to (2.2), except for the presence of an affine function of x, due to the MaxEnt classifier, instead of the prior probability of y. Using a different classifier would modify this term. The second term, however, would still remain a weighted linear combination of the decisions given by the R experts.
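The change from (2.2) to (2.9) is small in code: the log prior is replaced by the affine classifier score. A hedged sketch, reusing the conventions of the previous snippet (W and b stand for the assumed MaxEnt parameters, not values from the thesis):

```python
import numpy as np

def map_fuse_with_classifier(x, expert_labels, W, b, A):
    """MAP inference as in (2.9): classifier score plus expert log-reliabilities.

    x : (D,) feature vector;  W : (K, D), b : (K,) MaxEnt parameters
    expert_labels : length-R list of labels in {0, ..., K-1}
    A : (R, K, K) reliability matrices, A[j, k1, k] = P(expert j says k1 | y = k)
    """
    scores = W @ x + b                       # affine term w_k^T x + b_k
    for j, y_j in enumerate(expert_labels):
        scores = scores + np.log(A[j, y_j, :])
    return int(np.argmax(scores))
```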
Both the above models assume each expert's reliability matrix to be constant over the entire feature space, i.e., independent of x. In other words, these models capture the global reliability of experts. However, it is natural to expect this reliability to vary over the feature space. The primary reason is that all instances are not equally easy to label (and not equally easy for all experts). For example, Figure 2.3 shows images of "4", "9" and "1" from the UCI Handwritten Pen Digits database [1]. The first row shows standard styles of writing the digits, while the second row shows non-prototypical styles. While a human or machine expert will have no difficulty in correctly recognizing the digits in the first row, the non-prototypical styles may be more easily misrecognized as other digits. Figure 2.4 illustrates the variable nature of expert reliability even further. I performed K-means clustering (with K = 4) on the Yeast database [78] from UCI using one of the features. I then trained three classifiers (logistic regression, naive Bayes and a J48 decision tree) using the entire feature set, and computed their error rates over the 4 clusters. One can observe that the error rates vary across the three classifiers. Moreover, the error rate of each individual classifier also varies across the clusters. Thus, modeling reliability as constant across the entire feature space is a very strict assumption.

Figure 2.3: These figures show some prototypical and non-prototypical shapes of three digits (4, 9 and 1) from the Handwritten Pen Digits database [1] in the first and second rows respectively.
Figure 2.4: This plot shows the variation of error probability for three classifiers (logistic regression, Naive Bayes and J48) on the Yeast database with one of the features. I divided the data set into 4 clusters using K-means and computed the error rates over these clusters. The feature histogram is scaled by 2.5 for illustration purposes.

Whitehill et al. [182] model this behavior by incorporating a difficulty parameter for each instance in addition to a measure of expert reliability. The i-th instance has difficulty 1/β_i ∈ R^+. 1/β_i → +∞ implies that the instance is extremely difficult to label, while 1/β_i → 0 denotes an instance that is simple to label. The reliability of the j-th expert is governed by a parameter α_j ∈ R, where large positive values of α_j denote a more reliable expert. The probability of retaining the true label is assumed to be a bilinear sigmoid function – P(y^j_i = y_i) = 1 / (1 + e^{−α_j β_i}). The authors propose an EM algorithm to learn all α and β parameters, in addition to inferring a soft estimate of the true hidden label. However, given a new instance, one has to re-run training over the expanded database in order to learn its difficulty and infer the hidden label, which is time consuming. Also, the number of parameters grows linearly with the number of instances, potentially leading to overfitting. In addition, a classifier is not trained jointly with the estimation of the other parameters.
Another model that assigns a different error probability to each instance is presented in [187]. The hidden reference label is generated from the feature vector x by a binary logistic regression model. For the j-th expert, the probability of retaining the true label given the feature vector is η_j(x) = 1 / (1 + e^{−w_j^T x − γ_j}). Thus, a specific form of the expert's reliability matrix (analogous to a binary symmetric channel) is adopted. Also, η_j(x) is constrained to be a logistic function of x. Each expert is characterized by the parameters w_j and γ_j. Apart from the above restrictions, this model is also not fully generative, since it cannot generate the feature vector x.

The proposed model presented in the next section attempts to address the above concerns with previous models. Not only is expert reliability globally variant and constant over clusters in the feature space, the model is truly generative as well, i.e., sampling from the model generates an instance of the feature vector and multiple noisy expert labels. This enables the model to better explain the joint variability of the feature vector and the labels from multiple experts. This model also does not make any restrictive assumptions about the form of the reliability matrix. There has recently been interest in several data processing communities in crowd-sourcing services such as Amazon Mechanical Turk and CrowdFlower. Examples range from speech and natural language processing [109], [159] and computer vision [43], [161] to visualization design [71]. Most of these works require multiple experts performing some complex labeling task such as paraphrasing an article. In this chapter, I concentrate only on the combination of simple categorical class labels from multiple experts.
Figure 2.5: This figure shows the proposed globally-variant locally-constant (GVLC) model. z is the hidden variable in the GMM which generates the feature vector x.
2.3 The Globally-Variant Locally-Constant Model
As previously mentioned, one of the strict assumptions of the models in [41], [140], [158] is that each expert's reliability is identical over the entire feature space. However, practical examples such as those presented in the previous section illustrate that this assumption may not always hold. Expert reliability can vary from one data instance to the next. However, imposing a different reliability matrix for each instance would lead to a huge number of parameters to estimate. One way to keep the number of parameters manageable is to require the reliability matrices to be constant over local clusters of instances in the feature space. This is intuitive, since one anticipates experts to have similar reliability for instances which are close to each other in the feature space. However, finding representative features, distance metrics to quantify feature similarity, and hence clusters in the feature space, are all challenging problems. Instances close to each other in the feature space may not be perceptually close to an expert. In this chapter, I assume that the given feature space organization retains the perceptual closeness of instances.

Figure 2.5 shows the proposed model. The feature space distribution is modeled by a GMM. In the case of a Gaussian centered at each data instance, a GMM becomes a kernel density estimator and converges to the true feature space distribution [51]. One can substitute the GMM with a mixture of discrete distributions in case the feature space is discrete. Each expert has a reliability matrix at each Gaussian in the GMM, thus modeling the data-dependent reliability while keeping the total number of parameters manageable. The generative process for (x, y^1, ..., y^R) is:
1. A Gaussian is selected from the M-valued distribution of z, i.e., P(z = m) = π_m. If the m_0-th Gaussian is selected, the feature vector x is sampled from N(x; μ_{m_0}, Σ_{m_0}).

2. The reference label is generated using the K-valued distribution implied by the MaxEnt classifier, i.e., P(y = k | x) ∝ exp(w_k^T x + b_k). Let y = k_0 be the sampled reference label.

3. The label for the j-th expert is generated by sampling the K-valued distribution in the k_0-th column of the reliability matrix A^j_{m_0}, i.e., P(y^j = k_1 | y = k_0, z = m_0) = A^j_{m_0}(k_1, k_0). y^{j_1} and y^{j_2} (j_1 ≠ j_2) are assumed to be independent given y = k_0 and z = m_0.
Not only does this model address issues such as labeler variability and its data dependence, it is also flexible. Increasing the number of Gaussians leads to finer modeling of the feature distribution and of the variation in expert reliability. In the case of one mixture component, each expert has a single global reliability matrix, similar to [140].
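A minimal forward-sampling sketch of this generative process is given below (NumPy only; the array shapes follow the notation above, and the helper name sample_gvlc is mine, not the author's):

```python
import numpy as np

def sample_gvlc(N, pi, mu, Sigma, W, b, A, rng=None):
    """Forward-sample N instances (x, y, y^1..y^R) from the GVLC model.

    pi    : (M,) mixture weights          mu : (M, D) means
    Sigma : (M, D, D) covariances         W  : (K, D), b : (K,) MaxEnt parameters
    A     : (M, R, K, K), A[m, j, k1, k] = P(y^j = k1 | y = k, z = m)
    """
    rng = np.random.default_rng() if rng is None else rng
    M, D = mu.shape
    K, R = b.shape[0], A.shape[1]
    X = np.zeros((N, D))
    Y = np.zeros(N, dtype=int)
    Yexp = np.zeros((N, R), dtype=int)
    for i in range(N):
        m = rng.choice(M, p=pi)                      # step 1: pick a Gaussian
        X[i] = rng.multivariate_normal(mu[m], Sigma[m])
        logits = W @ X[i] + b                        # step 2: sample the reference label
        p_y = np.exp(logits - logits.max())
        p_y /= p_y.sum()
        Y[i] = rng.choice(K, p=p_y)
        # step 3: each expert's noisy label, conditionally independent given y and z
        for j in range(R):
            Yexp[i, j] = rng.choice(K, p=A[m, j, :, Y[i]])
    return X, Y, Yexp
```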
2.3.1 ML Parameter Estimation using the EM algorithm
Consider N IID training instances, each consisting of the D-dimensional feature vector x_i and R noisy expert labels {y^1_i = k^1_i, ..., y^R_i = k^R_i} (i ∈ {1,...,N}). Each label y^j_i can assume K possible values, denoted by {1,...,K}. Let the entire set of parameters to be estimated be denoted by Θ = {(π_m, μ_m, Σ_m, (A^j_m)_{j=1}^{R})_{m=1}^{M}, (w_k, b_k)_{k=1}^{K}}. The observed data (D_obs) log-likelihood is:

log P(D_obs | Θ) = Σ_{i=1}^{N} log P(x_i, (y^j_i = k^j_i)_{j=1}^{R} | Θ)
                 = Σ_{i=1}^{N} log Σ_{m=1}^{M} Σ_{k=1}^{K} P(x_i, (y^j_i = k^j_i)_{j=1}^{R}, y_i = k, z_i = m | Θ).    (2.10)

ML estimation of the parameters Θ by direct maximization of log P(D_obs | Θ) is mathematically intractable. Thus I resort to the EM algorithm with z_i and y_i (i ∈ {1,...,N})
as the unobserved data. The complete data log-likelihood factors as:

log P(D_obs, D_unobs | Θ) = Σ_{i=1}^{N} log P(x_i, (y^j_i = k^j_i)_{j=1}^{R}, y_i, z_i | Θ)
  = Σ_{i=1}^{N} [ Σ_{m=1}^{M} z_{im} ( log π_m + log N(x_i; μ_m, Σ_m) ) + Σ_{k=1}^{K} y_{ik} log σ(x_i; w_k, b_k) + Σ_{m=1}^{M} Σ_{k=1}^{K} Σ_{j=1}^{R} z_{im} y_{ik} log A^j_m(k^j_i, k) ]    (2.11)

where σ(x_i; w_k, b_k) = exp(w_k^T x_i + b_k) / Σ_{k̃=1}^{K} exp(w_{k̃}^T x_i + b_{k̃}) is the soft-max function of the MaxEnt model, and z_{im} and y_{ik} denote the 1-in-M and 1-in-K encodings of z_i and y_i respectively. I now need to compute the posterior probability density function (PDF) of the hidden variables (z_i and y_i) given the observed variables and the current parameter estimates. This can be written as follows:

P(z_i = m, y_i = k | x_i, (y^j_i = k^j_i)_{j=1}^{R}) ∝ P(z_i = m, y_i = k, x_i, (y^j_i = k^j_i)_{j=1}^{R})
  = P(z_i = m) P(x_i | z_i = m) P(y_i = k | x_i, z_i = m) P((y^j_i = k^j_i)_{j=1}^{R} | z_i = m, x_i, y_i = k)

∴ ζ_{ikm} ∝ π_m N(x_i; μ_m, Σ_m) σ(x_i; w_k, b_k) Π_{j=1}^{R} A^j_m(k^j_i, k).    (2.12)

We note that ζ_{ikm} = E{y_{ik} z_{im}}. The expectation of the complete data log-likelihood with respect to the above posterior PDF additionally involves computation of the following quantities:

E{z_{im}} = P(z_i = m | x_i, (y^j_i = k^j_i)_{j=1}^{R}) = Σ_{k=1}^{K} ζ_{ikm} = γ_{im}    (2.13)
E{y_{ik}} = P(y_i = k | x_i, (y^j_i = k^j_i)_{j=1}^{R}) = Σ_{m=1}^{M} ζ_{ikm} = η_{ik}.    (2.14)

The expectation of the complete data log-likelihood thus becomes:

E{log P(D_obs, D_unobs | Θ)} = Σ_{i=1}^{N} Σ_{m=1}^{M} { γ_{im} log π_m + γ_{im} log N(x_i; μ_m, Σ_m) } + Σ_{i=1}^{N} Σ_{k=1}^{K} η_{ik} log σ(x_i; w_k, b_k) + Σ_{i=1}^{N} Σ_{m=1}^{M} Σ_{k=1}^{K} Σ_{j=1}^{R} ζ_{ikm} log A^j_m(k^j_i, k).    (2.15)
The M-step consists of maximizing the above expectation with respect to Θ subject to the following constraints:

Σ_{m=1}^{M} π_m = 1   and   Σ_{k_1=1}^{K} A^j_m(k_1, k) = 1   ∀ j, m, k.    (2.16)

We determine the re-estimation equations for the GMM parameters and the reliability matrices using the Lagrange multiplier method. The MaxEnt parameters can be estimated by solving the following optimization problem:

(ŵ_k, b̂_k) = argmax_{w_k, b_k} Σ_{i=1}^{N} Σ_{k=1}^{K} η_{ik} log σ(x_i; w_k, b_k) = argmin_{w_k, b_k} E((w_k, b_k)_{k=1}^{K})    (2.17)

where E is the negative of the MaxEnt log-likelihood function, also called the cross-entropy. The above objective function is the same as that of a conventional MaxEnt classifier when η_{ik} is the 1-in-K encoding of the hard label of the i-th instance. Since the objective function is convex, any gradient-based method is guaranteed to find the globally-optimal parameter estimates. Moreover, the gradient and Hessian of this objective function have the following closed-form expressions [17]:

∇_{w_k, b_k} E = Σ_{i=1}^{N} ( σ(x_i; w_k, b_k) − η_{ik} ) [x_i^T 1]^T    (2.18)

∇_{w_k, b_k} ∇_{w_t, b_t} E = Σ_{i=1}^{N} σ(x_i; w_k, b_k) ( δ_K(k, t) − σ(x_i; w_t, b_t) ) [x_i^T 1]^T [x_i^T 1]    (2.19)

where δ_K(·) is the Kronecker delta function. I summarize the final EM equations below; a code sketch of one EM iteration follows the list:
• Initialization:

  [w_k^T  b_k]^T = [0, ..., 0]^T   ∀ k ∈ {1,...,K}    (2.20)

  (π_m, μ_m, Σ_m)_{m=1}^{M} = kmeans([x_1 ... x_N], M)    (2.21)

  A^j_m(k_1, k) = [ Σ_{i=1}^{N} δ_K(y^j_i = k_1, y_{i,PLU} = k | z_i = m) ] / [ Σ_{i=1}^{N} δ_K(y_{i,PLU} = k | z_i = m) ]
    ∀ j ∈ {1,...,R}, m ∈ {1,...,M}, k_1 and k ∈ {1,...,K}    (2.22)

  where kmeans(X, M) is a function that performs K-means clustering [17] over the data matrix X using M clusters, and returns the cluster weights, means and covariance matrices computed from the instances in X. y_{i,PLU} is the label obtained by fusing y^1_i, ..., y^R_i by simple plurality. The i-th training instance is assigned to the closest cluster centroid generated by K-means. Thus, for the purpose of initializing the reliability matrices, I consider the simple plurality label as a proxy for the true label.

• E-step: The E-step computes the following variables:

  ζ_{ikm} ∝ π_m N(x_i; μ_m, Σ_m) σ(x_i; w_k, b_k) Π_{j=1}^{R} A^j_m(k^j_i, k)
    ∀ i ∈ {1,...,N}, m ∈ {1,...,M}, and k ∈ {1,...,K}    (2.23)

  γ_{im} = Σ_{k=1}^{K} ζ_{ikm},   η_{ik} = Σ_{m=1}^{M} ζ_{ikm}
    ∀ i ∈ {1,...,N}, m ∈ {1,...,M}, and k ∈ {1,...,K}.    (2.24)

• M-step: The M-step computes estimates of the following GVLC model parameters:

  π_m = ( Σ_{i=1}^{N} γ_{im} ) / N   ∀ m ∈ {1,...,M}    (2.25)

  μ_m = ( Σ_{i=1}^{N} γ_{im} x_i ) / ( Σ_{i=1}^{N} γ_{im} )   ∀ m ∈ {1,...,M}    (2.26)

  Σ_m = ( Σ_{i=1}^{N} γ_{im} (x_i − μ_m)(x_i − μ_m)^T ) / ( Σ_{i=1}^{N} γ_{im} )   ∀ m ∈ {1,...,M}    (2.27)

  A^j_m(k_1, k) = ( Σ_{i=1}^{N} ζ_{ikm} y^j_{ik_1} ) / ( Σ_{i=1}^{N} ζ_{ikm} )   ∀ j ∈ {1,...,R}, m ∈ {1,...,M}, k_1 and k ∈ {1,...,K}    (2.28)

  (w_k, b_k)_{k=1}^{K} = train-soft-maxent([x_1 ... x_N], ([η_{i1} ... η_{iK}])_{i=1}^{N})    (2.29)

  Here, train-soft-maxent(X, L) denotes a function to train a K-class MaxEnt classifier using the features in the data matrix X and the soft labels in the N×K matrix L. Each row of L contains a probability distribution over the K class labels.

• Termination condition: Terminate the algorithm when the relative change in the log-likelihood of the observed data (2.10) is within a specified threshold ε > 0, i.e.,

  1 − log P(D_obs | Θ_curr) / log P(D_obs | Θ_prev) ≤ ε    (2.30)
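The following sketch of a single EM iteration follows (2.23)-(2.29) directly. It is a simplified illustration rather than the thesis implementation: it assumes full-covariance Gaussians, represents the expert labels as an (N, R) integer array Yexp, and replaces the train-soft-maxent call by a single gradient step on the soft cross-entropy using (2.18).

```python
import numpy as np
from scipy.stats import multivariate_normal

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def em_iteration(X, Yexp, pi, mu, Sigma, W, b, A, lr=0.1):
    """One ML EM iteration for the GVLC model, following (2.23)-(2.29)."""
    N, D = X.shape
    M, R, K, _ = A.shape

    # ----- E-step: joint soft counts zeta_{ikm}, eq. (2.23) -----
    sig = softmax(X @ W.T + b)                              # (N, K), sigma(x_i; w_k, b_k)
    zeta = np.zeros((N, K, M))
    for m in range(M):
        gauss = multivariate_normal.pdf(X, mean=mu[m], cov=Sigma[m])   # (N,)
        rel = np.ones((N, K))
        for j in range(R):
            rel *= A[m, j, Yexp[:, j], :]                   # prod_j A^j_m(k^j_i, k)
        zeta[:, :, m] = (pi[m] * gauss)[:, None] * sig * rel
    zeta /= zeta.sum(axis=(1, 2), keepdims=True)            # normalize over (k, m)
    gamma = zeta.sum(axis=1)                                 # (N, M), eq. (2.24)
    eta = zeta.sum(axis=2)                                   # (N, K), eq. (2.24)

    # ----- M-step: GMM parameters, eqs. (2.25)-(2.27) -----
    Nm = gamma.sum(axis=0)
    pi = Nm / N
    mu = (gamma.T @ X) / Nm[:, None]
    for m in range(M):
        Xc = X - mu[m]
        Sigma[m] = (gamma[:, m, None] * Xc).T @ Xc / Nm[m]

    # ----- M-step: reliability matrices, eq. (2.28) -----
    for m in range(M):
        for j in range(R):
            for k1 in range(K):
                mask = (Yexp[:, j] == k1).astype(float)      # 1-in-K encoding y^j_{i k1}
                A[m, j, k1, :] = mask @ zeta[:, :, m]
            A[m, j] /= A[m, j].sum(axis=0, keepdims=True)

    # ----- M-step: soft MaxEnt, eq. (2.29), here a single gradient step on (2.18) -----
    grad = sig - eta
    W -= lr * grad.T @ X / N
    b -= lr * grad.mean(axis=0)

    return pi, mu, Sigma, W, b, A
```

Iterating this function and checking the relative change in the observed-data log-likelihood (2.10) against a threshold ε reproduces the termination test in (2.30).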
I first explain the variables computed in the E-step of the above algorithm. ζ_{ikm} is the posterior probability of the m-th mixture component and the k-th hidden label occurring jointly, given the observed data and the current parameter values. It can be thought of as a joint soft count of the number of occurrences of z = m and y = k. Similarly, γ_{im} and η_{ik} are the individual soft counts of the occurrences of the m-th mixture component and the k-th class. It should be noted that while γ_{im} has the same meaning as in EM-based training of a GMM, its expression is different. This is because of the links between the observed expert labels y^j and z in Figure 2.5, which are absent in the Bayesian network of a simple GMM.

The parameter update equations in the M-step are also intuitively meaningful. The parameters of the GMM are updated as in the case of a simple GMM, except that the soft weights γ_{im} are defined differently. A^j_m(k_1, k) is equal to a convex combination of the k_1-th entry of the 1-in-K encoding of the labels from the j-th expert over the database. Put differently, it is proportional to the sum of the soft counts ζ_{ikm} over those instances where the j-th expert assigned label k_1.

It must be noted that the EM algorithm can get stuck in local maxima of the log-likelihood function. To combat this problem, the algorithm allows early stopping of the EM iterations based on the model's accuracy on a development corpus. The next subsection presents a Bayesian version of the model and the associated MAP EM algorithm.
2.3.2 A Bayesian Version of the Proposed Model
While the proposed model can account for the data-dependent behavior of experts, it still involves a large number of parameters. This is in spite of the fact that the GVLC model constrains the reliability matrices to be the same over each mixture component for a given expert. As shown in Table 2.1, simple plurality is parameter-free and thus does not involve a training stage. The models in [41], [158] and [140] differ by the presence of a K-class classifier in the latter. The proposed GVLC model additionally involves M − 1 + (D + D^2)M GMM parameters and M reliability matrices for each expert instead of just one. As an aside, in terms of computational complexity, simple plurality is roughly O(1), the method by Smyth et al. is O(RK^2), the one by Raykar et al. is O(RK^2 + KD), while the proposed model further scales that complexity by O(M). Thus, training of the proposed model is roughly M times slower than the one by Raykar et al.

Model                  Number of parameters
Simple plurality       0
Smyth et al. [158]     K + (K^2 − K)R
Raykar et al. [140]    (D + 1)K + (K^2 − K)R
GVLC model             M − 1 + (D + D^2)M + (D + 1)K + (K^2 − K)MR

Table 2.1: This table shows the number of parameters for simple plurality, the models presented in [41], [158], [140] and the proposed GVLC model. Simple plurality and the model in [41], [158] do not involve a K-class classifier.
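The counts in Table 2.1 are easy to tabulate programmatically; a small helper (mine, for illustration only) makes the comparison concrete:

```python
def parameter_counts(K, D, R, M):
    """Number of free parameters for each fusion model (cf. Table 2.1)."""
    return {
        "simple plurality": 0,
        "Smyth et al.":     K + (K**2 - K) * R,
        "Raykar et al.":    (D + 1) * K + (K**2 - K) * R,
        "GVLC":             M - 1 + (D + D**2) * M + (D + 1) * K + (K**2 - K) * M * R,
    }

# Example: binary task, 10-dimensional features, 3 experts, 4 mixture components.
print(parameter_counts(K=2, D=10, R=3, M=4))
```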
The difference in the number of parameters of the GVLC model and the model in [140] is M − 1 + 2DM + (K^2 − K)R(M − 1), assuming diagonal covariance matrices for each Gaussian in the GMM. This number is quadratic in K and linear in D, M and R. It is thus natural to expect the proposed model to severely overfit the data as these quantities (particularly the number of classes) increase. One approach to deal with this overfitting is to impose priors on the parameters themselves. I consider two sets of priors – the first one on the MaxEnt parameters (excluding the bias term), and the second one on the expert reliability matrices. This leads to the Bayesian version of the GVLC model shown in Figure 2.6.

Figure 2.6: This figure shows the Bayesian version of the proposed GVLC model. All unmentioned PDFs are the same as in Figure 2.5. Each MaxEnt weight vector w_k is assumed to be drawn from N(w_k; 0, σ^2 I). A^j_PLU denotes the global reliability matrix of the j-th expert computed using the simple plurality labels as a proxy for the true labels. The k-th column of the reliability matrix A^j is assumed to be drawn independently from a Dirichlet distribution with parameter vector equal to α times the k-th column of A^j_PLU. This is denoted by P(A^j) = Dir(A; α A^j_PLU).
I assume that each of the K coefficient vectors in the MaxEnt model (excluding the biases b_k) is generated from a zero-mean Gaussian distribution with covariance matrix σ^2 I. This effectively leads to an L_2 regularization in the MaxEnt objective function. Let A^j_PLU denote the global reliability matrix of the j-th expert computed using the simple plurality labels as a proxy for the reference ones. Column k of A^j is generated from a Dirichlet distribution with a parameter vector which is α times the k-th column of A^j_PLU. One could assume a different variance for the prior distribution of each w_k. Similarly, each entry of the Dirichlet parameter vectors could have been tuned independently. But this would result in too many hyperparameters to tune.
This model has just two additional hyperparameters to tune – σ and α. Estimation of the parameters of this model can be performed using maximum a posteriori (MAP) EM, which requires the computation of the a posteriori PDF of the parameters given the complete data:

log P(Θ | D_obs, D_unobs) = log P(D_obs, D_unobs | Θ) + log P(Θ) − log P(D_obs, D_unobs).    (2.31)

The last term is independent of Θ and can be ignored in the optimization. The first term is the same as in the case of ML EM and the second term is the prior imposed on the parameters. In the case of the Bayesian network in Figure 2.6, this prior can be written as:

log P(Θ) = Σ_{k=1}^{K} log N(w_k; 0, σ^2 I) + Σ_{j=1}^{R} log Dir(A^j; α A^j_PLU).    (2.32)
The E-step in the MAP EM algorithm computes the expectation of the conditional PDF log P(Θ | D_obs, D_unobs) with respect to the posterior PDF of the hidden variables given the observed variables and the current parameter estimates. The M-step maximizes this expectation with respect to the parameters. The final EM equations turn out to be exactly the same as in the case of the ML EM algorithm presented earlier, but with the following changes:
1. The estimation equations for the reliability matrices become:

   A^j_m(k_1, k) = [ Σ_{i=1}^{N} ζ_{ikm} y^j_{ik_1} + α_j(k_1, k) − 1 ] / [ Σ_{i=1}^{N} ζ_{ikm} + Σ_{k_1=1}^{K} α_j(k_1, k) − K ]
     ∀ j ∈ {1,...,R}, m ∈ {1,...,M}, k_1 and k ∈ {1,...,K}    (2.33)

   where α_j(k_1, k) is the k_1-th entry of the k-th column of α A^j_PLU. We see that the Dirichlet parameters have the effect of increasing the total soft count in the numerator of the above equation. Thus if α_j(k, k) > α_j(k_1, k) ∀ k_1 ≠ k, i.e., the prior count of an expert assigning the correct label is greater than the count of assigning an incorrect label, then A^j_m(k, k) is given a larger additive bias than A^j_m(k_1, k). A code sketch of this smoothed update follows this list.

2. The objective function of the soft MaxEnt classifier now contains an L_2 penalty term, −λ Σ_{k=1}^{K} ||w_k||^2, where λ = 1/(2σ^2). Hence weight vectors with large L_2 norms are penalized more heavily in the optimization.
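A hedged sketch of the smoothed update in (2.33), written as a drop-in replacement for the reliability M-step of the earlier EM sketch (the array shapes and names follow that sketch and are mine, not the original code). Because each column of A^j_PLU sums to one, normalizing the columns of the numerator reproduces the denominator of (2.33); the clipping step is my own practical safeguard for sparse cells where the correction term α_j(k_1, k) − 1 would drive the count negative:

```python
import numpy as np

def map_reliability_update(zeta, Yexp, A_plu, alpha, eps=1e-12):
    """Dirichlet-smoothed (MAP) reliability update, eq. (2.33).

    zeta  : (N, K, M) joint soft counts from the E-step
    Yexp  : (N, R) integer expert labels in {0, ..., K-1}
    A_plu : (R, K, K) global reliabilities estimated with simple-plurality labels
    alpha : scalar Dirichlet strength; the prior parameter matrix is alpha * A_plu[j]
    """
    N, K, M = zeta.shape
    R = Yexp.shape[1]
    A = np.zeros((M, R, K, K))
    onehot = np.eye(K)[Yexp]                          # (N, R, K), y^j_{i k1}
    for m in range(M):
        for j in range(R):
            # numerator of (2.33): soft counts plus prior pseudo-counts minus one
            num = onehot[:, j, :].T @ zeta[:, :, m] + alpha * A_plu[j] - 1.0
            num = np.clip(num, eps, None)             # safeguard (my addition)
            A[m, j] = num / num.sum(axis=0, keepdims=True)
    return A
```

The L_2 penalty of item 2 is an equally small change: it simply adds 2λ w_k to the gradient in (2.18) before each update of w_k.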
The hyperparameters λ (or σ) and α can be tuned based on classification performance on a development set. Once the parameters of the model have been estimated using either the ML or the MAP criterion, I perform inference of the hidden label given the feature vector x and the labels from multiple experts, as shown in the next subsection.
2.3.3 Inference of the Hidden Reference Label
Given a feature vector x and the associated noisy labels from R experts (y^1, ..., y^R), the MAP estimate of the true hidden label y is

ŷ_MAP = argmax_k P(y = k | x, y^1, ..., y^R)
      = argmax_k Σ_{m=1}^{M} { π_m N(x; μ_m, Σ_m) exp(w_k^T x + b_k) Π_{j=1}^{R} Π_{k_1=1}^{K} A^j_m(k_1, k)^{y^j_{k_1}} }.    (2.34)

If the reliability matrices are independent of the mixture component index m (i.e., A^j_m(k_1, k) = A^j(k_1, k) ∀ m ∈ {1,...,M}), then the decision rule in (2.34) reduces to the one in (2.9). This implies that the decision rules of the proposed GVLC model and the one by Raykar et al. [140] are equivalent if each expert has a single reliability matrix for the entire feature space. We can also conclude that the above decision rule reduces to simple plurality if the sufficient conditions of Proposition 2.1 are additionally satisfied.

One could also perform MAP inference of the hidden true label given just the feature vector x, without the noisy labels, corresponding to the practical situation where the experts have not labeled the instances in the test set. It can easily be shown that this inference is the same as using the MaxEnt classifier to classify the input feature vector. The next section compares the various models, first on simulated data and then on real databases from the UCI repository and two speech corpora for emotion classification. The expert ensemble contains machine classifiers in the case of the UCI databases and human labelers for the emotional speech data sets.
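A sketch of the decision rule in (2.34), reusing the array conventions of the earlier EM sketch (again an illustration rather than the original code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gvlc_infer(x, expert_labels, pi, mu, Sigma, W, b, A):
    """MAP inference of the hidden label, eq. (2.34).

    x : (D,) feature vector; expert_labels : length-R list of labels in {0,...,K-1}
    pi : (M,); mu : (M, D); Sigma : (M, D, D); W : (K, D); b : (K,)
    A : (M, R, K, K), A[m, j, k1, k] = P(y^j = k1 | y = k, z = m)
    """
    M, K = pi.shape[0], b.shape[0]
    logits = W @ x + b
    clf = np.exp(logits - logits.max())      # exp(w_k^T x + b_k), up to a constant factor
    scores = np.zeros(K)
    for m in range(M):
        gauss = multivariate_normal.pdf(x, mean=mu[m], cov=Sigma[m])
        rel = np.ones(K)
        for j, y_j in enumerate(expert_labels):
            rel *= A[m, j, y_j, :]           # prod_j A^j_m(y^j, k)
        scores += pi[m] * gauss * clf * rel
    # With no expert labels (rel = 1), the rule reduces to the MaxEnt classifier alone.
    return int(np.argmax(scores))
```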
2.4 Experiments and Results
2.4.1 Classification Experiments on Synthetic Data
I conducted classification experiments on synthetic data to understand the behavior of the proposed model. The synthetic database was generated by forward sampling of the Bayesian network shown in Figure 2.5. The feature space dimension was set to 2 in a binary classification scenario with 3 experts. The feature vectors were assumed to be generated from a GMM with 4 components, with equal weights assigned to each Gaussian. All covariance matrices were set to 0.01 I, where I is the identity matrix in R^2. The mean vector of the m-th Gaussian was set to the components of the m-th fourth root of unity:

μ_m(1) = cos( 2(m − 1)π / 4 )   and   μ_m(2) = sin( 2(m − 1)π / 4 ).    (2.35)

The logistic regression weight vectors were set to w_1 = [1 1]^T, w_2 = [−1 −1]^T and b_1 = b_2 = 0. Each reliability matrix had a constant diagonal. The diagonal entries of the first expert's reliability matrices were A^1_1(k, k) = 0.6, A^1_2(k, k) = 0.7, A^1_3(k, k) = 0.8 and A^1_4(k, k) = 0.9 (∀ k ∈ {1, 2}), corresponding to four equally spaced points in the interval [0.55, 0.95]. Let us represent these diagonal entries by the 4-tuple (0.6, 0.7, 0.8, 0.9). The off-diagonal entries were picked to ensure that each column adds to 1. The diagonal entries for the second and third experts were set to the tuples (0.9, 0.6, 0.7, 0.8) and (0.8, 0.9, 0.6, 0.7) respectively, representing circular right-shifts by 1 and 2 positions of the tuple for the first expert.
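For reference, the parameter arrays just described can be constructed as follows (a sketch using the exact values stated above; the label-flipping noise described in the next paragraph is then applied on top of data sampled from these parameters):

```python
import numpy as np

# Parameters of the synthetic GVLC data described above (D = 2, K = 2, M = 4, R = 3).
M, K, R = 4, 2, 3
pi = np.full(M, 0.25)                                   # equal mixture weights
angles = 2 * np.arange(M) * np.pi / 4                   # m-th fourth root of unity
mu = np.stack([np.cos(angles), np.sin(angles)], axis=1) # (4, 2) means
Sigma = np.tile(0.01 * np.eye(2), (M, 1, 1))            # covariances 0.01 I
W = np.array([[1.0, 1.0], [-1.0, -1.0]])                # logistic-regression weights
b = np.zeros(2)

diag = np.array([0.6, 0.7, 0.8, 0.9])                   # first expert's diagonals
A = np.zeros((M, R, K, K))
for j in range(R):
    d = np.roll(diag, j)                                # circular right-shift per expert
    for m in range(M):
        A[m, j] = np.full((K, K), (1.0 - d[m]) / (K - 1))
        np.fill_diagonal(A[m, j], d[m])                 # each column sums to 1
```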
Since the two synthetic classes are linearly separable, all algorithms were able to achieve very high classification accuracy, close to 100%. Hence, I introduced some noise into the data generation process: the true labels generated by the MaxEnt model were flipped with probability 0.2 before generating the expert labels. I generated 2500 instances for training, and 1250 instances each for development and testing. The development set was used for early stopping of the EM iterations. Both α and λ were set to 0. Figure 2.7 shows the inference accuracies of the proposed model and the one by Raykar et al. Simple plurality and the model by Smyth et al. [158] perform equally well at 79.6%, but significantly worse than the model by Raykar et al. [140]. The inference accuracy of the proposed GVLC model is better than all the baselines for M = 4, 5, 6 (the correct number of Gaussians was M = 4). The accuracy becomes erratic for larger values of M, indicating over-fitting. This highlights the fact that the choice of the number of mixture components is extremely critical. I tune this number on a development set in further experiments.

[Plot: inference accuracy (90.9%-91.6%) versus the number of mixture components (2-14) for the proposed algorithm and the algorithm by Raykar et al.]
Figure 2.7: Inference accuracies of the various models using the synthetic binary classification database. The performance of simple plurality and the algorithm by Smyth et al. was 79.6%. The vertical line corresponds to the true number of Gaussians (4).
Database                      No. of classes    No. of instances   No. of features
Magic Gamma Telescope [70]    2                 19020              10
Pima Indians Diabetes [156]   2                 768                8
Abalone [118]                 2, 3 (see note)   4177               7
Yeast [78]                    4                 1484               6
Handwritten Pen Digits [1]    10                10992              16

Table 2.2: This table shows a summary of the different databases from the UCI repository used in the experiments.

Note: The Abalone database had 29 class labels. However, the data distribution among these classes is very uneven – 11 classes had less than 20 instances. Hence we converted the problem into binary (age ≤ 9 and ≥ 10) and 3-class (age ≤ 8, 9-10 and ≥ 11) classification. The class binning was done in a way to ensure that all bins have a nearly equal number of samples.
2.4.2 Classification Experiments on UCI Databases
I next performed classification experiments on 5 chosen databases from the UCI repository [58] for testing the performance of the various models in fusing labels from multiple classifiers. The database details are summarized in Table 2.2. As can be observed, the number of instances, features and classes varies from one database to another. This allows testing the models under different conditions – binary to multi-class and data-rich to data-sparse domains. One of the biggest advantages of using these databases is that the reference label is available, making performance evaluation easy.

All databases were split into four sets – a training set for the classifiers (30%), a training set for the label fusion algorithms (30%), a development set for tuning M, α, λ and for early stopping of the EM algorithms (20%), and a test set (20%). For ease in setting a range for λ, all features were standardized using the classifier training set. Three standard classifiers from Weka [67] were used as experts – J48 (an implementation of the C4.5 decision tree [136]), logistic regression and naive Bayes. The choice of classifiers was arbitrary and others (like SVMs or random forests) could have been selected.
Two sets of experiments were conducted for each of the models – classification using the estimated MaxEnt classifier, and inference of the true hidden label using the observed data. Inference of the hidden label is done using (2.1), (2.2), (2.9) and (2.34) for simple plurality, the models by Smyth et al. [158] and Raykar et al. [140], and the proposed GVLC model, respectively. It must be noted that the first two models do not involve the feature vector x, in contrast to the latter two. Hence inference of y is performed using only the multiple noisy labels in those cases. For computing the classification accuracy, we trained a MaxEnt model separately using the inferred labels for these two models. This was not needed for the model by Raykar et al. [140] and the proposed GVLC model since the classifier is already trained as part of the Bayesian network.
The number of Gaussians in the GMM was varied from 1 to min(⌊N/50⌋, 10), where N is the number of training instances. The upper limit prevents too many Gaussians from being trained for a small training set. It must be noted that the log-likelihood term in the objective function of the L_2-regularized MaxEnt classifier scales as O(N), making the penalty term (−λ Σ_{k=1}^{K} ||w_k||^2) negligible in magnitude. Thus, the regularization parameter λ was set equal to βN, where β was varied from 0 to 0.1 in steps of 0.005. The Dirichlet parameter α was varied from 0 to 0.2 in steps of 0.02. Larger values of α resulted in excessive smoothing and hence poorer performance.
Classifier/Model       Classification Accuracy              Inference Accuracy
J48                    83.88                                -
Logistic               79.77                                -
Naive Bayes            72.99                                -
Simple plurality       79.48 (β = 0)                        81.97
Smyth et al.           79.49 (β = 0)                        81.97
Raykar et al.          79.72 (β = 0, α = 0.2)               81.71 (β = 0, α = 0.08)
GVLC model             80.04 (β = 0, α = 0.14, M = 10)      81.71 (β = 0, α = 0.08, M = 1)
GVLC model (oracle)    80.12 (β = 0, α = 0, M = 2)          81.87 (β = 0, α = 0, M = 3)

Table 2.3: This table shows the classification and inference accuracies for the Magic Gamma Telescope database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
Classifier/Model       Classification Accuracy                Inference Accuracy
J48                    72.56                                  -
Logistic               79.27                                  -
Naive Bayes            76.83                                  -
Simple plurality       76.83 (β = 0.005)                      79.27
Smyth et al.           76.83 (β = 0.005)                      79.27
Raykar et al.          77.44 (β = 0.015, α = 0)               78.66 (β = 0.035, α = 0.06)
GVLC model             78.05 (β = 0.005, α = 0.08, M = 2)     79.27 (β = 0.1, α = 0.12, M = 3)
GVLC model (oracle)    78.66 (β = 0.005, α = 0.04, M = 3)     80.49 (β = 0.04, α = 0.16, M = 3)

Table 2.4: This table shows the classification and inference accuracies for the Pima Indians database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
Tables 2.3-2.8 show the accuracies for the different databases on the test set. The first three rows of each table give the accuracies of the three individual classifiers. The oracle accuracy of the GVLC model is obtained by tuning all the hyperparameters on the test set and gives an upper bound on the model's performance.
Classifier/Model       Classification Accuracy                Inference Accuracy
J48                    77.09                                  -
Logistic               78.79                                  -
Naive Bayes            73.45                                  -
Simple plurality       74.91 (β = 0.005)                      78.42
Smyth et al.           74.91 (β = 0.005)                      78.42
Raykar et al.          76.12 (β = 0.005, α = 0)               78.42 (β = 0.02, α = 0.12)
GVLC model             76.36 (β = 0.005, α = 0, M = 4)        79.15 (β = 0.005, α = 0.04, M = 4)
GVLC model (oracle)    76.73 (β = 0.005, α = 0, M = 3)        79.88 (β = 0, α = 0, M = 9)

Table 2.5: This table shows the classification and inference accuracies for the Abalone database (binary) using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
Classifier/Model       Classification Accuracy                Inference Accuracy
J48                    61.18                                  -
Logistic               66.59                                  -
Naive Bayes            57.57                                  -
Simple plurality       62.38 (β = 0.005)                      64.54
Smyth et al.           62.50 (β = 0.005)                      64.54
Raykar et al.          62.26 (β = 0.005, α = 0.20)            64.66 (β = 0.005, α = 0.04)
GVLC model             63.94 (β = 0.005, α = 0.12, M = 6)     65.02 (β = 0.025, α = 0.20, M = 9)
GVLC model (oracle)    64.90 (β = 0.005, α = 0.16, M = 9)     65.99 (β = 0.010, α = 0.16, M = 4)

Table 2.6: This table shows the classification and inference accuracies for the Abalone database (3-class) using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
Classifier/Model       Classification Accuracy                Inference Accuracy
J48                    53.29                                  -
Logistic               55.71                                  -
Naive Bayes            57.09                                  -
Simple plurality       56.06 (β = 0.005)                      55.02
Smyth et al.           56.06 (β = 0.005)                      55.71
Raykar et al.          56.06 (β = 0.005, α = 0)               55.02 (β = 0.005, α = 0)
GVLC model             57.09 (β = 0.010, α = 0.10, M = 4)     57.09 (β = 0.075, α = 0.08, M = 3)
GVLC model (oracle)    57.99 (β = 0.010, α = 0.12, M = 3)     58.48 (β = 0.005, α = 0, M = 4)

Table 2.7: This table shows the classification and inference accuracies for the Yeast database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
Classifier/Model       Classification Accuracy              Inference Accuracy
J48                    93.17                                -
Logistic               95.03                                -
Naive Bayes            85.78                                -
Simple plurality       90.43 (β = 0)                        95.03
Smyth et al.           90.71 (β = 0)                        94.89
Raykar et al.          90.85 (β = 0, α = 0)                 94.80 (β = 0, α = 0.12)
GVLC model             92.15 (β = 0, α = 0.12, M = 8)       95.45 (β = 0, α = 0.12, M = 4)
GVLC model (oracle)    92.15 (β = 0, α = 0.12, M = 8)       95.91 (β = 0, α = 0.18, M = 3)

Table 2.8: This table shows the classification and inference accuracies for the Handwritten Pen Digits database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
It is expected that with less mismatch between the development and test sets, and with finer parameter tuning, one can come very close to this bound. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test. As can be observed, the GVLC model gives the highest classification accuracy for 6 databases and the highest inference accuracy for 4 databases (all except the Magic Gamma Telescope and Pima Indians databases). The improvement over simple plurality is statistically significant for 5 databases for classification and 3 for inference. In contrast, the algorithm by Raykar et al. gives a statistically significant improvement for only 2 databases for classification and none for inference. The oracle GVLC model gives a statistically significant improvement for 6 and 5 databases in classification and inference respectively, indicating that careful tuning of the hyperparameters is crucial.
2.4.3 Emotion Classification from Speech
I consider here the problem of human emotion recognition from speech. As mentioned earlier, even though human emotion expressions span a continuum, they are often quantized into categories such as {angry, happy, sad, neutral}. Labeling human speech for emotions is a difficult task, and multiple human evaluators are typically used. I use two emotional speech databases for the experiments. The first database [98] (called the EMA database) has 3 trained actors reading 10 sentences 5 times each, portraying the four aforementioned emotional classes. This results in 150 audio clips per emotional class. All the clips were then labeled by 4 human evaluators who assigned a class label to each clip. The emotion which the actor was asked to synthesize was taken as the reference label. I extracted the root mean squared energy (RMSE) and 12 Mel-frequency cepstral coefficients (MFCCs) over 20 ms frames with a 10 ms shift using the openSMILE toolkit [52]. The component-wise mean of this feature vector was computed over each utterance, resulting in a 13-dimensional utterance-level feature vector. The data was randomly split into three sets for training (40%), testing (30%) and development (30%). Similar to the procedure adopted for the UCI databases, the features were standardized using the mean and variance computed from the training set. Table 2.9 shows the emotion classification and inference accuracies for the various models. The proposed GVLC model performs better than the three baseline models both in classification and in inference of the true emotion label. In addition, it is the only model which achieves statistically significant improvements over simple plurality.
Model                  Classification Accuracy                Inference Accuracy
Annotator 1            93.37                                  -
Annotator 2            90.36                                  -
Annotator 3            98.19                                  -
Annotator 4            77.71                                  -
Simple plurality       81.93 (β = 0.025)                      98.80
Smyth et al.           82.53 (β = 0.025)                      96.99
Raykar et al.          82.53 (β = 0.025, α = 0)               98.80 (β = 0, α = 0)
GVLC model             84.94 (β = 0.015, α = 0.04, M = 1)     99.40 (β = 0, α = 0, M = 2)
GVLC model (oracle)    86.14 (β = 0, α = 0, M = 4)            99.40 (β = 0, α = 0, M = 2)

Table 2.9: This table shows the emotion classification and inference accuracies for the EMA database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
Apart from modeling categorical representations, research in the emotional speech analysis community has also focused on dimensional representations. The most popular of these are activation and valence [82]. Valence is a bipolar rating of the pleasantness of speech. Activation, on the other hand, denotes the excitation in speech. For example, the angry emotional class is expected to have negative valence and positive activation. To further test the performance of the various models, we conducted valence and activation classification experiments on the same database as above. Following convention, {angry, happy} were assigned high and {sad, neutral} were assigned low activation. For valence, {angry, sad} were labeled as negative while {happy, neutral} were labeled as positive. Tables 2.10-2.11 show the classification and inference accuracies for these two cases. While the GVLC model gives a statistically significant improvement over simple plurality, it performs only as well as the baseline models. This is understandable in the case of inference, since the accuracy is already extremely high. I attempt to further explain this observation in Subsection 2.4.4. Before that, I present results on the SEMAINE database.
Model                  Classification Accuracy                Inference Accuracy
Annotator 1            95.36                                  -
Annotator 2            94.70                                  -
Annotator 3            98.01                                  -
Annotator 4            84.77                                  -
Simple plurality       82.78 (β = 0)                          98.01
Smyth et al.           84.77 (β = 0)                          98.68
Raykar et al.          84.11 (β = 0, α = 0)                   98.68 (β = 0, α = 0.2)
GVLC model             84.77 (β = 0, α = 0.02, M = 3)         98.68 (β = 0, α = 0.2, M = 1)
GVLC model (oracle)    86.09 (β = 0.005, α = 0.02, M = 3)     98.68 (β = 0, α = 0, M = 1)

Table 2.10: This table shows the valence classification and inference accuracies for the EMA database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.

Model                  Classification Accuracy                Inference Accuracy
Annotator 1            99.36                                  -
Annotator 2            96.79                                  -
Annotator 3            99.36                                  -
Annotator 4            83.33                                  -
Simple plurality       92.99 (β = 0.005)                      96.18
Smyth et al.           95.54 (β = 0.005)                      99.36
Raykar et al.          95.54 (β = 0.005, α = 0)               100.00 (β = 0.005, α = 0.2)
GVLC model             95.54 (β = 0.005, α = 0, M = 1)        99.36 (β = 0.005, α = 0, M = 1)
GVLC model (oracle)    99.36 (β = 0.005, α = 0.20, M = 1)     100.00 (β = 0, α = 0, M = 2)

Table 2.11: This table shows the activation classification and inference accuracies for the EMA database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.

The SEMAINE database [111] is a large multimodal, audio-visual data set, collected as part of a research effort to build Sensitive Artificial Listener (SAL) agents. These agents should be able to interact with a human in a sustained, emotionally-colored conversation. All interactions involve two persons, a human user and an operator (who can be either a machine or a human simulating a machine agent). The operator has four personalities – Spike (angry), Poppy (happy), Obadiah (sad) and Prudence (sensible/neutral). The operator's task is to induce his/her own personality into the user as the conversation goes along. There are a total of 94 sessions in the SEMAINE database, where each session includes audio and video recordings of the interaction. Each session is rated by multiple human evaluators for various emotional dimensions (such as valence, activation, power and intensity) and characteristics of the interaction (such as breakdown of engagement, social concealment, etc.). For the experiments, I picked three emotional dimensions – valence, activation and intensity – for comparing the various algorithms. Intensity captures how far the speaker is from a state of cool rationality, irrespective of the direction. Valence and activation have the same meanings as explained earlier. Since the emotion of the character being played by the human operator is clearly defined, I decided to use just the operator's audio. Furthermore, only sessions 19, 20, 21 and 22 (Obadiah, Spike, Poppy and Prudence respectively) contained ratings from the same set of 3 human evaluators (evaluators R1, R2 and R3). Hence, I only used these sessions for the experiments. For each session, the ratings from an annotator are recorded as a time series and are available every 20 ms. As a pre-processing step, I segmented the operator's audio into sentences using the time-aligned text transcriptions available in the database. Next, RMS energy and 12 MFCC features were extracted over 20 ms frames with a 10 ms shift from each sentence using the openSMILE toolkit. These 13-dimensional feature vectors and the evaluator ratings were averaged over 5 contiguous frames, since sentence-level averaging would have resulted in too few instances to train on. I finally obtained a total of 4452 instances.
Since the dimensional ratings were continuous, they had to be quantized appropriately before the various models could be trained. I observed that the valence, activation and intensity ratings are well-represented by 3, 2 and 2 clusters respectively. This was corroborated by observing the ratings for the different operator personalities. In terms of activation, Prudence clearly falls in the low activation category along with Obadiah. However, its valence ratings are neither extremely positive nor extremely negative. Thus, I quantized the valence and activation ratings into 3 and 2 levels respectively using K-means for each evaluator independently. Intensity was represented by 2 clusters since Prudence usually had a low rating, while the other 3 personalities had high ratings. The reference labels for valence classification were obtained by assigning label 1 to Obadiah and Spike, 2 to Prudence and 3 to Poppy. Similarly, the reference activation labels were obtained by mapping Obadiah and Prudence to 1, and Spike and Poppy to 2. For intensity, I assigned 1 to Prudence and 2 to the remaining personalities. The 4452 instances were split into a training (40%), test (30%) and development (30%) set. Tables 2.12-2.14 show the valence, activation and intensity classification accuracies of the various algorithms. The GVLC model gives a statistically significant improvement over simple plurality for valence inference and for activation/intensity classification.
Model                  Classification Accuracy                Inference Accuracy
Annotator 1            96.71                                  -
Annotator 2            92.23                                  -
Annotator 3            92.31                                  -
Simple plurality       52.95 (β = 0.005)                      96.56
Smyth et al.           52.85 (β = 0.005)                      96.56
Raykar et al.          53.02 (β = 0.005, α = 0.04)            96.56 (β = 0, α = 0)
GVLC model             53.25 (β = 0.005, α = 0.08, M = 7)     96.79 (β = 0, α = 0.06, M = 9)
GVLC model (oracle)    53.32 (β = 0.005, α = 0, M = 7)        96.79 (β = 0, α = 0.06, M = 9)

Table 2.12: This table shows the valence classification and inference accuracies for the SEMAINE database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.
2.4.4 An Insight into GVLC Model’s Benefit
The proposed GVLC model gives a statistically significant improvement over simple plurality for 10 and 7 out of the 12 test cases (for classification and inference respectively). This is appreciably better than the performance of the next best algorithm (the one by Raykar et al.), which achieves a statistically significant improvement in only 4 cases for classification and 2 for inference. It is interesting to note that the benefit obtained by the proposed data-dependent model over the data-independent one by Raykar et al. varies across databases. An insight into this variation can be gained by recalling the essential difference between the two models. While the model by Raykar et al. assumes each expert's reliability matrix to be constant over the entire feature space, the GVLC model makes this assumption only over clusters of homogeneous instances. Thus, it is natural to expect that the GVLC model gives a greater performance benefit for databases with greater variation of expert reliability over the feature space. To check this intuition, I defined a measure of the variation of an expert's reliability given a database.

Model                  Classification Accuracy                Inference Accuracy
Annotator 1            98.29                                  -
Annotator 2            63.05                                  -
Annotator 3            94.72                                  -
Simple plurality       73.16 (β = 0.015)                      97.25
Smyth et al.           73.16 (β = 0.015)                      97.25
Raykar et al.          73.31 (β = 0.015, α = 0.06)            96.88 (β = 0.005, α = 0.08)
GVLC model             73.75 (β = 0.015, α = 0.02, M = 5)     97.03 (β = 0, α = 0.04, M = 9)
GVLC model (oracle)    73.75 (β = 0.015, α = 0, M = 5)        97.25 (β = 0, α = 0, M = 7)

Table 2.13: This table shows the activation classification and inference accuracies for the SEMAINE database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.

Model                  Classification Accuracy                Inference Accuracy
Annotator 1            97.15                                  -
Annotator 2            90.77                                  -
Annotator 3            98.87                                  -
Simple plurality       63.66 (β = 0.02)                       97.30
Smyth et al.           63.66 (β = 0.02)                       97.30
Raykar et al.          63.81 (β = 0.025, α = 0)               97.30 (β = 0, α = 0)
GVLC model             64.34 (β = 0.06, α = 0, M = 9)         97.30 (β = 0, α = 0, M = 1)
GVLC model (oracle)    64.64 (β = 0.065, α = 0, M = 1)        97.30 (β = 0, α = 0, M = 1)

Table 2.14: This table shows the intensity classification and inference accuracies for the SEMAINE database using various algorithms. Values in bold represent a statistically significant improvement in performance over simple plurality at the 5% significance level using the exact one-sided binomial test.

Consider a feature space consisting of M clusters of data instances derived from some clustering algorithm (we used K-means). Assuming the availability of the true reference labels, I can estimate the global K×K reliability matrix of the j-th expert, A^j_glob, where A^j_glob(k_1, k) = P(y^j = k_1 | y = k). Similarly, I can estimate a local reliability matrix from all instances belonging to the m-th cluster, A^j_loc,m. The reliability variation of expert j given cluster m and reference label k is

RV_j(k, m) = [ JSD( A^j_glob(:, k), A^j_loc,m(:, k) ) ]^{1/2}    (2.36)

where JSD(p, q) denotes the Jensen-Shannon divergence [101] between probability mass functions p and q, and X(:, k) denotes the k-th column of X. Taking the square root makes the Jensen-Shannon divergence a metric. Now, the expected reliability variation of expert j is

E{RV_j} = Σ_{m=1}^{M} Σ_{k=1}^{K} RV_j(k, m) P(y = k, z = m).    (2.37)

Upon averaging E{RV_j} over all experts for a given database, I get a metric representing the average reliability variation of the experts from their global reliabilities. I next compute the Pearson correlation coefficient between this average metric and the relative inference performance improvement obtained by the proposed model with respect to the model by Raykar et al. [140] over all 12 test cases. The oracle performance was used since it guards against any noise introduced due to hyperparameter mismatch. The correlation coefficient was found to be 0.74 (significant at the 5% level). This indicates that the benefit of the proposed model is greater when expert reliability is highly variable over the feature space. In case the reliability is nearly constant, overfitting due to the increase in the number of parameters nullifies any gain obtained by modeling the reliability variation.
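A sketch of this reliability-variation measure is given below (NumPy/SciPy; the function names are mine, and SciPy's jensenshannon already returns the square root of the divergence, i.e., RV_j(k, m) in (2.36), here with natural logarithms):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def reliability_variation(y_true, y_exp, clusters, K, M):
    """Average expected reliability variation E{RV_j} over experts, eqs. (2.36)-(2.37).

    y_true   : (N,) reference labels in {0,...,K-1}
    y_exp    : (N, R) expert labels
    clusters : (N,) cluster indices in {0,...,M-1} (e.g., from K-means)
    """
    N, R = y_exp.shape

    def reliability(idx, j):
        """K x K matrix with column k = P(y^j = k1 | y = k) on the instances in idx."""
        A = np.zeros((K, K))
        for k in range(K):
            sel = idx & (y_true == k)
            if sel.sum() > 0:
                A[:, k] = np.bincount(y_exp[sel, j], minlength=K) / sel.sum()
            else:
                A[:, k] = 1.0 / K
        return A

    rv_total = 0.0
    for j in range(R):
        A_glob = reliability(np.ones(N, dtype=bool), j)
        for m in range(M):
            A_loc = reliability(clusters == m, j)
            for k in range(K):
                rv = jensenshannon(A_glob[:, k], A_loc[:, k])     # sqrt of JSD, eq. (2.36)
                p_km = np.mean((y_true == k) & (clusters == m))   # P(y = k, z = m)
                rv_total += rv * p_km                             # eq. (2.37)
    return rv_total / R                                           # average over experts
```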
2.5 Conclusions and Future Work Directions
This chapter presented the globally-variant locally-constant (GVLC) model for capturing the data-dependent behavior of an ensemble of humans and machine systems (experts). It is based on the observation that experts (whether machine classifiers or human evaluators) do not have equal reliability for all instances processed. Rather, their reliability varies from one region of the feature space to another. In this work, this reliability is assumed to be constant over clusters of instances in the feature space, which are modeled by a Gaussian mixture model. This enables one to model each expert as having a reliability matrix for each Gaussian in the mixture. The hidden reference label is assumed to be generated from the feature vector using a MaxEnt model. This hidden label is then distorted by each expert using the reliability matrix corresponding to the Gaussian which generated the feature vector. All the parameters of this model are learned in the ML sense using the EM algorithm. A Bayesian version of this model is also proposed, where the MaxEnt classifier coefficients are L_2-regularized and the expert reliability matrix entries are generated from a Dirichlet distribution. Experiments on simulated data, a variety of databases from the UCI repository, and two emotion classification databases show improvements in both classification and inference accuracy when using the proposed model.
There are many interesting directions for future work in this domain. First, the GVLC model assumes that all the instances are independent. However, many practical problems involve labeling of time series. For example, speaker clustering and diarization involve labeling frames of an audio clip with speaker indices. Labeling of human body motion capture data for special events of interest (e.g., body gestures) is another example. It would be interesting to extend the proposed GVLC model to handle multiple labelings of time series. One simple way to do this is to impose a first-order Markov chain structure on the temporal evolution of the hidden variable z in Figure 2.5, resulting in a dynamic Bayesian network (DBN).

Second, in the case of most classifiers as experts, it is easy to generate a complete posterior distribution over the label set for each instance. The problem then becomes one of inferring the true posterior distribution (or just the true label) using posterior distributions from multiple noisy experts. Jin and Ghahramani [81] consider a related problem where each training instance is associated with a subset of labels, exactly one of which is correct. A more general version of this problem involves a single posterior distribution over labels associated with each instance. In the case of human experts, it is hard to obtain a posterior distribution, but a ranked list of labels is much easier to obtain. Thus the problem can be modified to devising a scheme for combining ranked lists from multiple human experts. It would be interesting to see whether a more structured way of inferring the true hidden label from multiple noisy ranked lists gives benefits over standard voting algorithms like the Borda count and Schulze's method [20], [150].
Chapter 3
Analysis of Ensemble Diversity:
Generalized Ambiguity
Decomposition (GAD)
The previous chapter presented the globally-variant locally-constant (GVLC) model – a Bayesian network for modeling the behavior of an ensemble of diverse human and machine experts. The current chapter analyzes this diversity from a different perspective. Diversity or complementarity of the experts in ensemble pattern recognition and information processing systems is widely observed by researchers to be crucial for achieving a performance improvement upon fusion. Understanding the link between ensemble diversity and fusion performance is thus an important research question. However, prior work has theoretically characterized ensemble diversity and linked it with ensemble performance only in very restricted settings. This chapter presents a generalized ambiguity decomposition (GAD) theorem as a broad framework for answering these questions. The GAD theorem applies to a generic convex ensemble of experts for any arbitrary twice-differentiable loss function. It shows that the ensemble performance approximately decomposes into a difference of the average expert performance and the diversity of the ensemble. It thus provides a theoretical explanation for the empirically-observed benefit of fusing outputs from diverse classifiers and regressors. It also provides a loss function-dependent, ensemble-dependent, and data-dependent definition of diversity. I present extensions of this decomposition to common regression and classification loss functions, and report a simulation-based analysis of the diversity term and the accuracy of the decomposition. I finally present experiments on standard pattern recognition data sets which indicate the accuracy of the decomposition for real-world classification and regression problems.
3.1 Introduction
Researchers across several fields have empirically observed that an ensemble of multiple experts (classifiers or regressors) performs better than a single expert. Well-known examples demonstrating this performance benefit span a large variety of applications, such as:

• Automatic speech and language processing: Most teams in large-scale projects involving automatic speech recognition and processing, such as the DARPA GALE [143], CALO [172], RATS [108] and IARPA BABEL [39] programs, use a combination of multiple systems for achieving state-of-the-art performance. The use of multiple systems is also widespread in text and natural language processing applications, with examples ranging from parsing [141] to text categorization [152].

• Recommendation systems: Many industrial and academic teams competing for the Netflix prize [16] used ensembles of diverse systems for the movie rating prediction task. The $1 Million grand prize winning system from the team BellKor's Pragmatic Chaos was composed of multiple systems from three independent teams (BellKor [14], Pragmatic Theory [132], and BigChaos [168]).

• Web information retrieval: Researchers have also used ensembles of diverse systems for information retrieval tasks on the web. For instance, the winning teams [29], [131] in the Yahoo! Learning to Rank Challenge [32] used ensemble methods (such as bagging, boosting, random forests, and lambda-gradient models) for improving document ranking performance. The challenge overview article [32] also emphasizes the performance benefits obtained by all teams when using ensemble methods.

• Computer vision: Ensemble methods are often popular in computer vision tasks as well, such as tracking [13], object detection [107], and pose estimation [66].

• Human state pattern recognition: Systems for multimodal physical activity detection [100] often fuse classifiers trained on different feature sets for achieving an improvement in accuracy. Several teams competing in the Interspeech challenges have also used ensembles for classification of human emotion [145], age and gender [146], intoxication and sleepiness [147], personality, likability, and pathology [148], and social signals [149].
The above list is only a small fraction of the large number of applications which
have used ensembles of multiple systems. Dietterich [46] offers three main reasons for
the observed benefits of an ensemble. First, an ensemble can potentially have a lower
generalization error than a single expert. Second, the parameter estimation involved
in training most state of the art expert systems such as neural networks involves
solving a non-convex optimization problem. A single expert can get stuck in local
optima whereas an ensemble of multiple experts can provide parameter estimates
closer to the global optima. Finally, the true underlying function for a problem at
hand may be too complex for a single expert and an ensemble may be better able to
approximate it.
Intuition and documented research such as the ones listed above suggests that the
experts intheensemble shouldbeoptimallydiverse. Diversity actsasahedge against
uncertainty in the evaluation data set, and the mismatch between the loss functions
used for training and evaluation. Kuncheva [92] gives a simple intuitive argument
in favor of having the right amount of diversity in an ensemble. She says that just
one expert suffices if all experts produce identical output, however, if the experts
disagree in their outputs very frequently, it indicates that they are individually poor
estimators of the target variable. Ambiguity decomposition [90] (AD) explains this
tradeoff for the special case of the squared error loss function. Let X ∈ X ⊆ R
D
and Y ∈Y ⊆R denote the D-dimensional input and 1-dimensional target (output)
random variables respectively. Let f
k
: X → R be the k
th
expert which maps the
input spaceX to the real lineR. f is a convex combination of K experts when
f(X)=
K
X
k=1
w
k
f
k
(X) where w
k
≥ 0 and
K
X
k=1
w
k
= 1. (3.1)
AD states that the squared error between the above f(X) and Y is
[Y −f(X)]
2
=
K
X
k=1
w
k
[Y −f
k
(X)]
2
−
K
X
k=1
w
k
[f
k
(X)−f(X)]
2
. (3.2)
43
The first term on the right hand side is the weighted squared error of the individual
experts with respect to Y. The second term quantifies the diversity of the ensemble
and is the squared error spread of the experts around f(X). For two ensembles with
identical weighted squared error, one with a greater diversity will have a lower overall
squared error. The bias-variance-covariance decomposition is an equivalent result by
Ueda and Nakano [173] with neural network ensembles as the focus. Neural networks
are a suitable application because an ensemble of neural networks often consists of
almost equally-accurate but diverse networks due to the non-convex training opti-
mization problem. AD is also related to the bias-variance decomposition (BVD) [63]
which says that the expected squared error between a regressor f
D
(X) trained on
datasetD and the target variable Y is
E
D
{[f
D
(X)−Y]
2
}= [Y −E
D
{f
D
(X)}]
2
+E
D
{[f
D
(X)−E
D
{f
D
(X)}]
2
}. (3.3)
The first term on the right hand side is the square of the bias, which is the difference
between the target Y and the expected prediction over the distribution of D. The
secondtermmeasuresthevarianceoftheensemble. BVDreducestoADwhenexperts
have the same functional form (e.g., linear) and when the training set D is drawn
from a convex mixture of training sets{D
k
}
K
k=1
with mixture weights{w
k
}
K
k=1
.
Many existing algorithms attempt to promote diversity while training an ensem-
ble. Examples include ensembles of decision trees [24], support vector machines [175],
conditional maximum entropy models [11], negative correlationlearning [106]forneu-
ral networks and DECORATE [115], which is a meta-algorithm based on generation
of synthetic data. AdaBoost [60] is another prominent algorithm which incremen-
tally creates a diverse ensemble of weak experts by modifying the distribution from
which training instances are sampled. However, only few studies have focused on
understanding the impact of diversity on ensemble performance for both classifiers
and regressors. AD provides this link only for least squares regression. The analysis
presented by Tumer and Ghosh [169] assumes classification as regression over class
posterior distribution.
Thischapterpresentsageneralizedambiguitydecomposition(GAD)theoremthat
isapplicabletobothclassificationandregression. Itdoesnotassumethattheclassifier
isestimatingaposteriordistributionoverthelabelsetY incaseofclassification. This
isoftenencounteredinpractice,forexampleincaseofsupportvectormachines. Some
44
prior work has been done for deriving a BVD for a single expert with different loss
functions [23], [49], [84]. The proposed GAD theorem is different. It focuses on a
convex combination of experts rather than a single expert. Even though one can
link the BVD to AD by considering a mixture of training sets as mentioned before,
this link requires that the individual experts should have the same functional form.
The proposed GAD theorem does not make such assumptions. This result applies
pointwise for any given (X,Y)∈X ×Y rather than relying on an ensemble average.
I present the GAD theorem and its proof in the next section. I derive the decom-
position for some common regression and classification loss functions in Section 3.3.
I present a simulation-based analysis in Section 3.4. I then evaluate the presented de-
compositiononmultiplestandardclassificationandregressiondatasetsinSection3.5.
Section 3.6 presents the conclusion and some directions for future work.
3.2 GeneralizedAmbiguityDecomposition(GAD)
Theorem
The concept of a loss function is central to statistical learning theory [176]. It com-
putes the mismatch between the prediction of an expert and the true target value.
Lemma 3.1 below presents useful bounds on a class of loss functions which are used
widely in supervised machine learning.
Lemma 3.1. (Taylor’s Theorem for Loss Functions [85]) Let x,Y ∈ R and
B ⊆ R be a closed and bounded set containing x. Let l : R×R → R be a loss
function which is twice-differentiable in its second argument with continuous second
derivative overB. Let
M
l,B
(Y)= sup
z∈B
l
′′
(Y,z)<∞ and (3.4)
m
l,B
(Y)= inf
z∈B
l
′′
(Y,z)>−∞. (3.5)
Then for any Y
0
∈B, we can write the following quadratic bounds on the loss function:
l(Y,Y
0
)≥l(Y,x)+l
′
(Y,x)(Y
0
−x)+
m
l,B
(Y)
2
(Y
0
−x)
2
and (3.6)
l(Y,Y
0
)≤l(Y,x)+l
′
(Y,x)(Y
0
−x)+
M
l,B
(Y)
2
(Y
0
−x)
2
. (3.7)
45
Proof. Since l(Y,Y
0
) is twice-differentiable in its second argument over R×B, by
Taylor’s theorem [85],∃ a function h
2
:R×B→R such that
l(Y,Y
0
)=l(Y,x)+l
′
(Y,x)(Y
0
−x)+h
2
(Y,Y
0
)(Y
0
−x)
2
where lim
Y
0
→x
h
2
(Y,Y
0
) = 0
(3.8)
for any given x ∈ B. h
2
(Y,Y
0
)(Y
0
−x)
2
is called remainder or residue and has the
following form due to the Mean Value Theorem [85]:
h
2
(Y,Y
0
)(Y
0
−x)
2
=
l
′′
(Y,z)
2
(Y
0
−x)
2
where z∈ (Y
0
,x). (3.9)
The second derivative of the loss function is continuous over the closed and bounded
setB. Weierstrass’ Extreme Value Theorem [85] gives:
m
l,B
(Y)≤ l
′′
(Y,z)≤M
l,B
(Y) ∀z∈B (3.10)
where m
l,B
(Y)= inf
z∈B
l
′′
(Y,z)>−∞ (3.11)
and M
l,B
(Y)= sup
z∈B
l
′′
(Y,z)<∞. (3.12)
We note that m
l,B
(Y) = 0 is an obvious choice if l is convex. Using the bounds in
(3.10) in Taylor’s theorem from (3.8) results in the desired inequalities:
l(Y,Y
0
)≥ l(Y,x)+l
′
(Y,x)(Y
0
−x)+
m
l,B
(Y)
2
(Y
0
−x)
2
(3.13)
l(Y,Y
0
)≤ l(Y,x)+l
′
(Y,x)(Y
0
−x)+
M
l,B
(Y)
2
(Y
0
−x)
2
. (3.14)
The second argument of l is always bounded in practice since it represents the
predictionoftheexpert. Hence, limitingthedomainoftwice-differentiability andcon-
tinuity ofthesecond derivative fromR×RtoR×B is areasonable assumption. The
next lemma presents ambiguity decomposition forthe squared error loss function [90].
I denote f(X) as f and f
k
(X) as f
k
from now on for notational simplicity.
Lemma 3.2. (Ambiguity Decomposition (AD) [90])
Consider an ensemble of K experts {f
k
: X → R,k = 1,2,...,K} and let f =
46
P
K
k=1
w
k
f
k
be a convex combination of these experts. Then
[Y −f]
2
=
K
X
k=1
w
k
[Y −f
k
]
2
−
K
X
k=1
w
k
[f
k
−f]
2
∀(X,Y)∈X ×R. (3.15)
Proof. I start by expanding the following term:
K
X
k=1
w
k
[Y −f
k
]
2
=
K
X
k=1
w
k
[Y −f−(f
k
−f)]
2
(3.16)
=
K
X
k=1
w
k
[Y −f]
2
+
K
X
k=1
w
k
[f
k
−f]
2
−2
K
X
k=1
w
k
[Y −f][f
k
−f] (3.17)
=[Y −f]
2
+
K
X
k=1
w
k
[f
k
−f]
2
−2[Y −f]
K
X
k=1
w
k
[f
k
−f] (3.18)
=[Y −f]
2
+
K
X
k=1
w
k
[f
k
−f]
2
−2[Y −f][
K
X
k=1
w
k
f
k
−f] (3.19)
=[Y −f]
2
+
K
X
k=1
w
k
[f
k
−f]
2
. (3.20)
Iarrive attheAmbiguity Decompositionby re-arrangingterms intheabove equation.
[Y −f]
2
=
K
X
k=1
w
k
[Y −f
k
]
2
−
K
X
k=1
w
k
[f
k
−f]
2
. (3.21)
Ambiguity decomposition describes the tradeoff between the accuracy of individ-
ual experts and the diversity of the ensemble. But it applies only to the squared
error loss function. I now state and prove the Generalized Ambiguity Decomposition
(GAD) theorem using Lemmas 3.1 and 3.2.
Theorem 3.1. (Generalized Ambiguity Decomposition (GAD) Theorem)
Consider an ensemble of K experts {f
k
: X → R,k = 1,2,...,K} and let f =
P
K
k=1
w
k
f
k
be a convex combination of these experts. Assume that all f
k
are finite.
47
Let (X,Y)∈X ×R and letB⊆R be the following closed and bounded set:
B = [b
min
,b
max
] where (3.22)
b
min
= min{Y,f
1
,...,f
K
} and (3.23)
b
max
= max{Y,f
1
,...,f
K
}. (3.24)
B is the smallest closed and bounded set which contains Y and all f
k
. Let l : R×
R → R be a loss function which is twice-differentiable in its second argument with
continuous second derivative overB. Let:
M
l,B
(Y) =sup
z∈B
l
′′
(Y,z)<∞, (3.25)
M
l,B
(f) =sup
z∈B
l
′′
(f,z)∈ (0,∞), and (3.26)
m
l,B
(Y) = inf
z∈B
l
′′
(Y,z)>−∞. (3.27)
Then the ensemble loss is upper-bounded as given below:
l(Y,f)≤
K
X
k=1
w
k
l(Y,f
k
)−
M
l,B
(Y)
M
l,B
(f)
h
K
X
k=1
w
k
l(f,f
k
)−l(f,f)
i
+
1
2
M
l,B
(Y)−m
l,B
(Y)
K
X
k=1
w
k
(Y −f
k
)
2
. (3.28)
Proof. B isaclosedandboundedsetwhich includesY andallf
k
bydefinition. Hence
we can write the following lower-bound for l(Y,f
k
) using Lemma 3.1:
l(Y,f
k
)≥l(Y,Y)+l
′
(Y,Y)(f
k
−Y)+
m
l,B
(Y)
2
(f
k
−Y)
2
. (3.29)
Taking a convex sum on both sides of the above inequality gives
K
X
k=1
w
k
l(Y,f
k
)≥
K
X
k=1
w
k
l(Y,Y)+
K
X
k=1
w
k
l
′
(Y,Y)(f
k
−Y)+
K
X
k=1
w
k
m
l,B
(Y)
2
(f
k
−Y)
2
=l(Y,Y)+l
′
(Y,Y)(f−Y)+
m
l,B
(Y)
2
K
X
k=1
w
k
(f
k
−Y)
2
. (3.30)
B also includes f because it includes all f
k
and f is their convex combination. Thus,
48
I consider the following upper-bound on l(Y,f) using Lemma 3.1:
l(Y,f)≤l(Y,Y)+l
′
(Y,Y)(f−Y)+
M
l,B
(Y)
2
(f−Y)
2
(3.31)
⇐⇒ l(Y,Y)+l
′
(Y,Y)(f−Y)≥l(Y,f)−
M
l,B
(Y)
2
(f−Y)
2
. (3.32)
Substituting this inequality in (3.30) gives
K
X
k=1
w
k
l(Y,f
k
)≥ l(Y,f)−
M
l,B
(Y)
2
(f−Y)
2
+
m
l,B
(Y)
2
K
X
k=1
w
k
(f
k
−Y)
2
. (3.33)
I then use the AD in Lemma 3.2 for (f−Y)
2
and write the above bound as:
K
X
k=1
w
k
l(Y,f
k
)≥l(Y,f)−
1
2
(M
l,B
(Y)−m
l,B
(Y))
K
X
k=1
w
k
(f
k
−Y)
2
+
M
l,B
(Y)
2
K
X
k=1
w
k
(f
k
−f)
2
. (3.34)
I finally invoke the following upper bound on l(f,f
k
) using Lemma 3.1:
l(f,f
k
)≤l(f,f)+l
′
(f,f)(f−f
k
)+
M
l,B
(f)
2
(f−f
k
)
2
(3.35)
⇐⇒
M
l,B
(f)
2
(f−f
k
)
2
≥l(f,f
k
)−l(f,f)−l
′
(f,f)(f−f
k
) (3.36)
⇐⇒
M
l,B
(f)
2
K
X
k=1
w
k
(f−f
k
)
2
≥
K
X
k=1
w
k
l(f,f
k
)−l(f,f). (3.37)
Ifinally getthedesired result by substituting the aboveinequality in(3.34)andusing
49
the fact that M
l
(f)> 0:
K
X
k=1
w
k
l(Y,f
k
)≥l(Y,f)−
1
2
(M
l,B
(Y)−m
l,B
(Y))
K
X
k=1
w
k
(f
k
−Y)
2
+
M
l,B
(Y)
M
l,B
(f)
h
K
X
k=1
w
k
l(f,f
k
)−l(f,f)
i
(3.38)
⇐⇒ l(Y,f)≤
K
X
k=1
w
k
l(Y,f
k
)−
M
l,B
(Y)
M
l,B
(f)
h
K
X
k=1
w
k
l(f,f
k
)−l(f,f)
i
+
1
2
(M
l,B
(Y)−m
l,B
(Y))
K
X
k=1
w
k
(Y −f
k
)
2
. (3.39)
The GAD Theorem is a natural extension of AD in Lemma 3.2 and reduces to it
for the case of squared error loss. One can gain more intuition about this result by
defining the following quantities:
Ensemble loss: l(Y,f) (3.40)
Weighted expert loss:
K
X
k=1
w
k
l(Y,f
k
) (3.41)
Diversity: d
l
(f
1
,...,f
K
) =
M
l,B
(Y)
M
l,B
(f)
"
K
X
k=1
w
k
l(f,f
k
)−l(f,f)
#
(3.42)
Curvature spread (CS): s
l,B
(Y) =M
l,B
(Y)−m
l,B
(Y)≥ 0 (3.43)
Ignoring the term involving curvature spread, GAD says that the ensemble loss is
upper-bounded by weighted expert loss minus the diversity of the ensemble. Thus,
the upper-bound involves a tradeoff between the performance of individual experts
(weighted experts loss) and the diversity. Diversity measures the spread of the expert
predictions about the ensemble’s predictions and is 0 when f
k
= f,∀k. Diversity is
non-negative for a convex loss function due to Jensen’s inequality [85]. Furthermore,
diversity depends on the loss function, the true target Y and the prediction of the
ensemble f atthecurrent datapoint. Thus, all datapointsare notequally important
from a diversity perspective. It is also interesting to note that the GAD theorem
50
provides a decomposition of the ensemble loss into a supervised (weighted expert
loss) and unsupervised (diversity) term. The latter term does not require labeled
data to compute. This makes the overall framework applicable to semi-supervised
settings.
The following corollary of Theorem 3.1 gives a simple upper-bound on the error
between l(Y,f) and its approximation motivated by the GAD theorem.
Corollary 3.1. (Error Bound for GAD Loss Function Approximation)
If
K
X
k=1
w
k
(Y −f
k
)
2
=β(X,Y) and (3.44)
max
k∈{1,...,K}
(Y −f
k
)
2
=δ(X,Y), (3.45)
then the error between the true loss and its GAD approximation is bounded as:
l(Y,f)−l
GAD
(Y,f)≤
1
2
s
l,B
(Y,f)β(X,Y) (3.46)
≤
1
2
s
l,B
(Y,f)δ(X,Y), (3.47)
where s
l,B
(Y,f) is the curvature spread defined previously and
l
GAD
(Y,f)=
K
X
k=1
w
k
l(Y,f
k
)−d
l
(f
1
,...,f
K
) (3.48)
is an approximation for l(Y,f) motivated by GAD.
Proof. Theorem 3.1 gives:
l(Y,f)−l
GAD
(Y,f)≤
1
2
(M
l,B
(Y)−m
l,B
(Y))
K
X
k=1
w
k
(Y −f
k
)
2
=
1
2
(M
l,B
(Y)−m
l,B
(Y))β(X,Y). (3.49)
I also note that:
K
X
k=1
w
k
(Y −f
k
)
2
≤ max
k∈{1,...,K}
(Y −f
k
)
2
=δ(X,Y). (3.50)
51
Hence we can also write the following less tight upper bound on the error:
l(Y,f)−l
GAD
(Y,f)≤
1
2
(M
l,B
(Y)−m
l,B
(Y))δ(X,Y). (3.51)
Corollary 3.1 shows that l
GAD
(Y,f) is a good approximation for l(Y,f) when the
curvature spread is small and all expert predictions are close to the true target Y.
For instances (X,Y) where multiple experts in the ensemble are far away from the
true target, l
GAD
(Y,f)has a high error. To summarize, the accuracy ofl
GAD
depends
on the data instance, loss function and the expert predictions.
The diversity term in the GAD theorem computes the loss function between each
expert f
k
and the ensemble prediction f. However, it is sometimes useful to under-
stand diversity in terms of pairwise loss functions between the expert predictions
themselves. The next corollary to the GAD theorem shows that one can indeed
re-write the diversity term in pairwise fashion for a metric loss function.
Corollary3.2. (PairwiseGADTheoremforMetricLossFunctions)Consider
a metric loss function l and also let w
k
= 1/K ∀k for simplicity. Then the GAD
theorem becomes
l(Y,f)≤
1
K
K
X
k=1
l(Y,f
k
)−
M
l,B
(Y)
M
l,B
(f)
"
1
K(K−1)
K
X
k
1
=1
K
X
k
2
=k
1
+1
l(f
k
1
,f
k
2
)
#
+
1
2
M
l,B
(Y)−m
l,B
(Y)
K
X
k=1
(Y −f
k
)
2
. (3.52)
Proof. The loss function satisfies the triangle inequality because it is given to be a
metric. One can visualize the output of each expert and the ensemble’s prediction as
points in a metric space induced by the metric loss function. Hence
l(f
k
1
,f
k
2
)≤ l(f,f
k
1
)+l(f,f
k
2
) (3.53)
for all k
1
∈{1,...,K} and k
2
∈{k
1
+1,...,K}. Adding these K(K−1) inequalities
52
gives
K
X
k
1
=1
K
X
k
2
=k
1
+1
l(f
k
1
,f
k
2
)≤ (K−1)
K
X
k=1
l(f,f
k
). (3.54)
We alsonotethatl(f,f) =0because l is ametric. Hence onegets thefollowing lower
bound on the diversity term in GAD
M
l,B
(Y)
M
l,B
(f)
"
1
K
K
X
k=1
l(f,f
k
)−l(f,f)
#
≥
M
l,B
(Y)
M
l,B
(f)
"
1
K(K−1)
K
X
k
1
=1
K
X
k
2
=k
1
+1
l(f
k
1
,f
k
2
)
#
(3.55)
Substituting this lower-bound in GAD from Theorem 3.1 gives the desired decompo-
sition with pairwise diversity.
The squared error and absolute error loss functions used for regression are metric
functions and thus permit a decomposition with a pairwise diversity term as given in
Corollary 3.2 above. We now derive the quantities required for GAD approximation
of common loss functions in the next section.
3.3 GAD for Common Loss Functions
Figure 3.1 plots some common regression and classification loss functions that I will
consider in this section. These curves have been plotted for reference label Y = 0 for
regression and−1 for classification. The regression loses are a convex approximation
to the following 0/1 loss:
l
reg, 0/1
(Y,f)=
(
0 Y =f
1 Y 6=f
. (3.56)
The classification loss functions approximate a similar 0/1 loss function:
l
class, 0/1
(Y,f)=
(
0 Yf ≥ 0
1 Yf <0
. (3.57)
The computation of M
l,B
(Y) and m
l,B
(Y) is critical to the GAD theorem. Hence
53
−1.5 −1 −0.5 0 0.5 1 1.5
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
f(X)
l(Y,f(X))
Sqr error
Smooth abs error
Logistic
Exponential
Smooth Hinge
Figure 3.1: This figure shows common regression and classification loss functions for
target variables Y = 0 and−1 respectively. Smoothed versions of absolute error and
hinge loss are shown with ǫ = 0.1.
the following subsections focus on deriving these quantities for various common clas-
sification and regression loss functions.
3.3.1 Squared Error Loss
Squared error is the most common loss function used for regression and is defined as
given below:
l
sqr
(Y,Y
0
) = (Y −Y
0
)
2
where Y,Y
0
∈R. (3.58)
ItssecondderivativewithrespecttoY
0
isl
′′
sqr
(Y,Y
0
)= 2. HenceM
l,B
(Y) =m
l,B
(Y)=
2∀Y and CS is 0. Thus GAD reduces to AD in Lemma 3.2.
3.3.2 Absolute Error Loss
Absolute error loss function is more robust than squared error for outliers and is
defined as:
l
abs
(Y,Y
0
) =|Y −Y
0
| where Y,Y
0
∈R. (3.59)
54
This function is not differentiable at Y
0
= Y. I thus consider two commonly used
smooth approximations to the absolute error loss function. The first one uses the
integral of the inverse tangent function which approximates the sign function. This
leads to the following approximation:
l
abs, approx1
(Y,Y
0
)=
2(Y −Y
0
)
π
tan
−1
Y −Y
0
ǫ
!
where Y,Y
0
∈R and ǫ> 0.
(3.60)
One can get an arbitarily close approximation by setting a suitably small positive
value of ǫ. The second derivative of the loss function with respect to Y
0
is
l
′′
abs, approx1
(Y,Y
0
)=
4
ǫπ
h
1+
Y−Y
0
ǫ
2
i
2
. (3.61)
I need to compute the maximum and minimum of the above second derivative for
Y
0
∈ B. The above function is monotonically increasing for Y
0
< Y, achieves its
maxima at Y
0
= Y, and monotonically decreases for Y
0
≥ Y. Now Y ∈ B by
definition of B. Hence, the maximum of l
′′
abs, approx1
(Y,Y
0
) over B occurs at Y
0
= Y
and is given by:
M
l,B
(Y)=l
′′
abs, approx1
(Y,Y) =
4
πǫ
. (3.62)
The minimum value depends on the location ofB = [b
min
,b
max
] and is given below:
m
l,B
(Y) =
(
l
′′
abs, approx1
(Y,b
min
) ; if b
max
+b
min
< 2Y
l
′′
abs, approx1
(Y,b
max
) ; otherwise
. (3.63)
I also consider a second smooth approximation of absolute error:
l
abs, approx2
(Y,Y
0
) =
p
(Y −Y
0
)
2
+ǫ−
√
ǫ where Y,Y
0
∈R and ǫ>0. (3.64)
This approximation becomes better with smaller positive values of ǫ. The second
derivative of the above approximation with respect to Y
0
is
l
′′
abs, approx2
(Y,Y
0
) =
ǫ
[(Y −Y
0
)
2
+ǫ]
3/2
. (3.65)
55
The behavior of the above function with Y
0
is the same as l
′′
abs, approx1
(Y,Y
0
). It has
a monotonic increase for Y
0
< Y, achieves maxima at Y
0
= Y, and has a monotonic
decrease for Y
0
≥ Y. This results in the following second derivative maxima and
minima overB:
M
l,B
(Y) =l
′′
abs,approx2
(Y,Y)=
1
√
ǫ
. (3.66)
m
l,B
(Y) =
(
l
′′
abs, approx2
(Y,b
min
) ; if b
max
+b
min
< 2Y
l
′′
abs, approx2
(Y,b
max
) ; otherwise
. (3.67)
Both types of smooth absolute error loss functions give a non-zero curvature spread
when compared to the squared error loss function. This leads to a non-zero approxi-
mation error for the GAD theorem.
3.3.3 Logistic Loss
Logistic regression is a popular technique for classification. I consider the binary
classification case where the label setY ={-1,1}. The logistic loss function is
l
log
(Y,Y
0
) =log(1+exp(−YY
0
)) where Y ∈{−1,1} and Y
0
∈R . (3.68)
Y
0
is replaced by the expert’s prediction for supervised learning and is typically mod-
eled by an affine function of X. The ensemble is thus a convex combination of affine
experts. The second derivative of the above loss with respect to Y
0
is
l
′′
log
(Y,Y
0
) =
Y
2
exp(−YY
0
)
(1+exp(−YY
0
))
2
. (3.69)
l
′′
log
(Y,Y
0
) is an even function of Y
0
. It is monotonically increasing for Y
0
≤ 0, reaches
its maximum at Y
0
= 0, and is monotonically decreasing for Y
0
≥ 0. Hence m
l,B
(Y)
becomes
m
l,B
(Y)=
(
l
′′
log
(Y,b
min
) ; if b
max
<0 or if b
max
+b
min
< 0
l
′′
log
(Y,b
max
) ; otherwise
. (3.70)
56
Similarly, M
l,B
(Y) becomes
M
l,B
(Y) =
l
′′
log
(Y,b
max
) ; if b
max
< 0
l
′′
log
(Y,b
min
) ; if b
min
> 0
l
′′
log
(Y,0)=Y
2
/4 ; otherwise
. (3.71)
3.3.4 Exponential Loss
AdaBoost.M1 [60] uses the exponential loss function which is defined as
l
exp
(Y,Y
0
) =exp(−YY
0
) where Y ∈{−1,1} and Y
0
∈R. (3.72)
The second derivative of the loss function is
l
′′
exp
(Y,Y
0
)=Y
2
exp(−YY
0
). (3.73)
The above function of Y
0
is monotonically increasing when Y < 0 and monotonically
decreasing when Y ≥ 0. Hence m
l,B
(Y) becomes
m
l,B
(Y)=
(
l
′′
exp
(Y,b
min
) ; if Y <0
l
′′
exp
(Y,b
max
) ; otherwise
. (3.74)
Similarly, M
l,B
becomes
M
l,B
(Y) =
(
l
′′
exp
(Y,b
max
) ; if Y < 0
l
′′
exp
(Y,b
min
) ; otherwise
. (3.75)
3.3.5 Hinge Loss
The hinge loss is another popular loss function which is used for training support
vector machines (SVMs) [36] and is defined as
l
hinge
(Y,Y
0
)= max(0,1−YY
0
) where Y ∈{−1,1} and Y
0
∈R. (3.76)
57
The above loss function is not differentiable when YY
0
= 1. Hence I use the following
smooth approximation from Smooth SVM (SSVM) [99]:
l
hinge, smooth
(Y,Y
0
)= 1−YY
0
+ǫlog
"
1+exp
−
1−YY
0
ǫ
!#
where ǫ>0.
(3.77)
The above approximation is based on the logistic sigmoidal approximation ofthe sign
function which is often used in neural networks [88]. Picking a small positive value
of ǫ ensures low approximation error. The second derivative with respect to Y
0
is
l
′′
hinge, smooth
(Y,Y
0
) =
Y
2
exp
−
1−YY
0
ǫ
ǫ
h
1+exp
−
1−YY
0
ǫ
i
2
. (3.78)
The above function of Y
0
is symmetrical about Y
0
= 1/Y, increases for Y
0
< 1/Y,
attains its maximum for Y
0
=1/Y, and decreases for Y
0
≥ 1/Y. Hence M
l,B
(Y) is
M
l,B
(Y)=
l
′′
hinge, smooth
(Y,b
max
) ; if b
max
< 1/Y
l
′′
hinge, smooth
(Y,b
min
) ; if b
min
> 1/Y
l
′′
hinge, smooth
(Y,1/Y) or Y
2
/(4ǫ) ; otherwise
. (3.79)
Similarly, the value of m
l,B
(Y) also depends on the location of the intervalB and
is given below:
m
l,B
(Y)=
(
l
′′
hinge, smooth
(Y,b
min
) ; if b
max
< 1/Yor if b
max
+b
min
<2/Y
l
′′
hinge, smooth
(Y,b
max
) ; otherwise
. (3.80)
The expressions for M
l,B
(Y) and m
l,B
(Y) derived in this section for various loss func-
tions are used to derive the GAD approximation for the ensemble loss. Theoretical
analysisofthisapproximationisnoteasyforalllossfunctions. Hencethenextsection
presents simulation experiments for understanding the GAD theorem.
58
3.4 SimulationExperimentsontheGADTheorem
for Common Loss Functions
This section begins by understanding the tradeoff between the diversity term and
weighted expert loss in the GAD theorem. We next analyze the accuracy of the
ensemble loss approximation motivated by the GAD theorem.We finally contrast the
GAD approximation with the Taylor series approximation used in gradient boosting.
3.4.1 Behavior of Weighted Expert Loss and Diversity in
GAD
Consider the following proxy for the true loss function implied by GAD:
l
GAD
(Y,f)=
K
X
k=1
w
k
l(Y,f
k
)−d
l
(f
1
,...,f
K
). (3.81)
where d
l
(f
1
,...,f
K
) is the diversity. The first term on the right hand side of the
above equation is the weighted sum of the individual expert’s losses. We note that
this term provides a simple upper bound on l(Y,f) due to Jensen’s inequality for
convex loss functions:
l(Y,f)≤
K
X
k=1
w
k
l(Y,f
k
) =l
WGT
(Y,f). (3.82)
Tounderstand thetradeoffbetween thetwo termsontherighthandside of(3.81),
I performed Monte Carlo simulations because the analytical forms of d
l
(f
1
,...,f
K
)
for common loss functions derived in the previous section are not amenable to direct
theoreticalanalysis. TheK expertpredictionsweresampledfromanindependentand
identically distributed (IID) Gaussian random variable with mean μ
f
and variance
σ
2
f
. That is
f
k
∼N(μ
f
,σ
2
f
) for all k∈{1,...,K}. (3.83)
Aunimodaldistributionwasusedsinceitisintuitivetoexpectmostoftheexperts
to give numerically close predictions. I used a Gaussian probability density function
(PDF) for our simulations since it is the most popular unimodal PDF. The convex
59
nature ofthe ensemble ensures thatμ
f
is alsothe expected ensemble prediction. This
is because
E{f}=E
n
K
X
k=1
w
k
f
k
o
=
K
X
k=1
w
k
E{f
k
} =
K
X
k=1
w
k
μ
f
=μ
f
. (3.84)
The variance σ
2
f
governs the spread of the predictions around the mean. I varied
μ
f
around the true label Y = 1. I picked Y = 1 because the two regression loss
functions depend only on the distance of the prediction from the target. The analysis
also extends easily to Y = −1 for classification loss functions. I generated 1000
Monte Carlo samples for K = 3 and 7 experts. We set σ
2
f
= 2 for these simulations.
Figures3.2-3.6showthemedianvalues ofl(Y,f),l
GAD
(Y,f),andtheweighted expert
loss for various loss functions. These figures also plot the median diversity term
d
l
(f
1
,...,f
K
) with the expected ensemble prediction μ
f
.
Letusfirstanalysetheplotsforthetworegressionlossfunctions. Figure3.2shows
thecaseforthesquarederrorlossfunction. Bothl(Y,f)andl
GAD
(Y,f)overlapforall
values of μ
f
because the GAD theorem reduces to the ambiguity decomposition. The
diversity term also remains nearly constant because it is the maximum likelihood
estimator of the variance σ
2
f
. Diversity corrects for the bias between the weighted
expert loss (green curve) and the actual ensemble loss function (black curve). Fig-
ure 3.3 shows the corresponding figure for the smooth absolute error loss function
with ǫ = 0.5. l
GAD
(Y,f) provides a very accurate approximation of the true loss
function around the true label Y = 1 and becomes a poorer approximation as we
move away. This is because GAD assumes the experts predictions to be close to the
true label. l
WGT
(Y,f) gives a much larger approximation error in comparison around
Y =1. Also,thediversitytermisnearlyconstantbecauseitscomputationnormalizes
for the value of μ
f
by subtracting f from each f
k
.
Figures 3.4-3.6 show the plots for the three classification loss functions - logistic,
exponential, and smooth hinge (ǫ = 0.5) with true label Y = 1. l
GAD
(Y,f) provides
anaccurate approximationtol(Y,f)nearthetrue labelY = 1as was thecase forthe
regression loss functions. However the diversity term is not constant, but unimodal
with a peak around the decision boundary μ
f
= 0. This is because the experts
disagree a lot at the decision boundary which causes high diversity. Diversity reduces
as we move away from the decision boundary in both directions. Diversity in GAD
is agnostic to the true label and only quantifies the spread of the expert predictions
60
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
2
4
6
8
10
K = 3 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.8
0.9
1
1.1
1.2
Expected ensemble prediction μ
f
Diversity Term
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
2
4
6
8
10
K = 7 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
1.5
1.55
1.6
1.65
Expected ensemble prediction μ
f
Diversity Term
Figure 3.2: The top plot in each figure shows the median actual ensemble loss, its
GAD approximation and weighted expert loss across 1000 Monte Carlo samples in
an ensemble of K = 3 and K = 7 experts for the squared error loss function as a
function of expected ensemble prediction μ
f
. I used σ
2
f
= 2. Y = 1 is the correct
label. I also show the median diversity term for the same setup in the bottom plot.
with respect to the given loss function. The weighted expert loss term captures the
accuracy of the experts with respect to the true label. Even though all experts agree
when they are predicting the incorrect class for μ
f
< 0, the overall loss rises due to
an increase in the weighted expert loss.
3.4.2 AccuracyofGAD-MotivatedApproximationofEnsem-
ble Loss
Inthissubsection,IanalyzetheerroroftheapproximateGADensemblelossl
GAD
(Y,f)
in terms of absolute deviation from the true ensemble loss l(Y,f). I also investigate
the behavior of the bound on the approximation error|l(Y,f)−l
GAD
(Y,f)| presented
inCorollary3.1. Iusedthesameexperimentalsetupforsimulationsasintheprevious
section. Figures 3.7-3.10 show the plots of the approximation error using l
GAD
(Y,f)
andtheweightedexpertlossl
WGT
(Y,f)forvariouslossfunctionsdiscussed previously.
I did not consider the squared error loss function because GAD reduces to AD and
we get 0 absolute error.
Figures 3.7-3.10show that the GAD approximation l
GAD
(Y,f) (red curve) always
provides significantly lower approximation error rate than the weighted expert loss
l
WGT
(Y,f) (green curve) when μ
f
is close to the true label Y = 1. This is because
61
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.5
1
1.5
2
K = 3 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.4
0.45
0.5
Expected ensemble prediction μ
f
Diversity Term
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.5
1
1.5
2
K = 7 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.58
0.6
0.62
0.64
Expected ensemble prediction μ
f
Diversity Term
Figure 3.3: The top plot in each figure shows the median actual ensemble loss, its
GAD approximation and weighted expert loss across 1000 Monte Carlo samples in
an ensemble of K = 3 and K =7 experts for the smooth absolute error loss function
as a function of expected ensemble prediction μ
f
. I used σ
2
f
= 2 and ǫ = 0.5. Y = 1
is the correct label. I also show the median diversity term for the same setup in the
bottom plot.
thesecond orderTaylor series expansion used intheGADtheorem’s proofisaccurate
when the expert predictions are close to the true label. I also note that the bound on
the approximation error |l(Y,f)−l
GAD
(Y,f)| (blue curve) follows the general trend
of the error but is not very tight.
3.4.3 Comparison with Loss Function Approximation Used
in Gradient Boosting
Gradient boosting [61] is a popular machine learning algorithm which sequentially
trains an ensemble of base learners. Gradient boosting also utilizes a Taylor series
expansionforitssequentialtraining. Hence,Idevotethissubsectiontounderstanding
thedifferences betweenthelossfunctionapproximationusedingradientboostingand
GAD.
Consider an ensemble of K−1 experts f
k
, and their linear combination
g =
K−1
X
k=1
v
k
f
k
(3.85)
to generate the ensemble prediction g. Gradient boosting does not require the coeffi-
62
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.5
1
1.5
2
K = 3 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
0.05
0.1
0.15
0.2
Expected ensemble prediction μ
f
Diversity Term
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.5
1
1.5
2
K = 7 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
0.05
0.1
0.15
0.2
Expected ensemble prediction μ
f
Diversity Term
Figure 3.4: The top plot in each figure shows the median actual ensemble loss, its
GAD approximation and weighted expert loss across 1000 Monte Carlo samples in
an ensemble of K = 3 and K = 7 experts for the logistic loss function as a function
of expected ensemble prediction μ
f
. I used σ
2
f
= 2. Y = 1 is the correct label. I also
show the median diversity term for the same setup in the bottom plot.
cients {v
k
}
K−1
k=1
to be convex weights. Now if one adds a new expert f
K
to g with a
weight v
K
, theloss ofthenew ensemble becomesl(Y,g+v
K
f
K
). Since gradientboost-
ing estimates v
K
andf
K
given estimates of{v
k
,f
k
}
K−1
k=1
, it assumes thatv
K
f
K
is close
to0. In other words, it assumes that the new base learner f
K
is weak and contributes
only that information which has not been learned by the current ensemble. The new
ensemble’s loss is therefore approximated by using a Taylor series expansion around
v
K
f
K
= 0. Assuming the loss function to be convex, we can write the following first
order Taylor series expansion:
l(Y,g+v
K
f
K
)≤l(Y,g)+v
K
f
K
l
′
(Y,g)=l
GB
(Y,f). (3.86)
Minimizing the above upper bound with respect to v
K
and f
K
is equivalent to min-
imizing v
K
f
K
l
′
(Y,g), or maximizing the correlation between v
K
f
K
and the negative
loss function gradient−l
′
(Y,g). This is the central idea used in training an ensemble
using gradient boosting.
The above Taylor series expansion highlights the key differences between gradient
boosting and GAD. First, the loss function upper bound used in gradient boosting
is a means to perform sequential training of an ensemble of weak experts. Each
new expert adds only incremental information to the ensemble, but is insufficiently
63
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
2
4
6
8
10
12
K = 3 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
0.2
0.4
0.6
0.8
Expected ensemble prediction μ
f
Diversity Term
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
5
10
15
K = 7 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
1
2
3
Expected ensemble prediction μ
f
Diversity Term
Figure 3.5: The top plot in each figure shows the median actual ensemble loss, its
GAD approximation and weighted expert loss across 1000Monte Carlo samples in an
ensemble of K =3 and K =7 experts for the exponential loss function as a function
of expected ensemble prediction μ
f
. I used σ
2
f
= 2. Y = 1 is the correct label. I also
show the median diversity term for the same setup in the bottom plot.
trainedtopredictthetargetvariablesonitsown. Thisreducestheutilityoftheabove
approximation in (3.86) in situations when the individual experts are themselves
strong. This arises when, for example, the experts have been trained on different
feature sets, data sets, utilize different functional forms, or have not been trained
using gradient boosting. Second, the GAD approximation l
GAD
(Y,f) provides an
intuitive decomposition of the ensemble loss into the weighted expert loss l
WGT
(Y,f)
and the diversity d(f
1
,...,f
K
) which measures the spread of the expert predictions
about f. Gradient boosting does not offer such an intuitive decomposition.
Figure 3.11 shows the median approximation error for l
GAD
(Y,f) and l
GB
(Y,f)
using the exponential loss function. One observes that gradient boosting has mini-
mum errorwhen theensemble meanμ
f
isnearthedecision boundarybecausef
K
≈ 0.
However, theapproximation becomes pooras we move away fromthe decision bound-
ary. l
GAD
(Y,f) provides a good approximation around the true label Y = 1 as noted
in the previous section. Thus the two loss functions l
GAD
(Y,f) and l
GB
(Y,f) provide
complementary regions of low approximation error. I observed a similar trend for the
other loss functions as well.
64
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0.5
1
1.5
2
2.5
K = 3 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
0.05
0.1
0.15
0.2
Expected ensemble prediction μ
f
Diversity Term
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
1
2
3
K = 7 experts
Expected ensemble prediction μ
f
Loss Function
Ensemble Loss (GAD)
Weighted Expert Loss
Ensemble Loss (Actual)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
0.1
0.2
0.3
0.4
Expected ensemble prediction μ
f
Diversity Term
Figure 3.6: The top plot in each figure shows the median actual ensemble loss, its
GAD approximation and weighted expert loss across 1000 Monte Carlo samples in
an ensemble of K = 3 and K = 7 experts for the smooth hinge loss function as a
function of expected ensemble prediction μ
f
. I used σ
2
f
=2 and ǫ =0.5. Y = 1 is the
correct label. I also show the median diversity term for the same setup in the bottom
plot.
3.5 Experiments on GAD with Standard Machine
Learning Tasks
The previous sections presented empirical analysis of the GAD theorem based on
simulations. This section presents experiments on some real-world data sets which
will reveal the utility of the GAD theorem to machine learning problems of interest.
I used five data sets from the UCI Machine Learning Repository [58] as listed in
Table3.1fortheexperiments. IconductedtwosetsofexperimentsusingtheUCIdata
sets. These experiments mimic common scenarios usually encountered by researchers
while training systems with multiple classifiers or regressors.
The first class of experiments tries to understand diversity and its impact on en-
semble performance incase theensemble consists ofdifferentclassifiers andregressors
trained on the same data set. I trained 3 classifiers and 3 regressors for each data set.
I used logistic regression, linear support vector machine (SVM) from the Lib-Linear
toolkit [54], and a homoscedastic linear discriminant analysis (LDA)-based classifier
fromMatlabforclassification. The three linearregressors were trainedby minimizing
least squares, least absolute deviation, and the Huber loss function. GAD was used
to analyse the diversity of the trained classifiers and regressors for each data set. The
65
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−2
10
−1
10
0
Expected ensemble prediction μ
f
Median absolute error
K = 3 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−1
10
0
Expected ensemble prediction μ
f
Median absolute error
K = 7 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
Figure 3.7: This figure shows the median absolute approximation error and error
bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts
for smooth absolute error loss function as a function of expected ensemble prediction
μ
f
. I used σ
2
f
= 2, ǫ = 0.5 and Y =1 as the correct label.
second experiment considers the situations where the experts are trained on poten-
tially overlapping subsets of instances from a given data set. I used bagging [21] for
creating the multiple training subsets by sampling instances with replacement. The
experiments used the classifiers and regressors mentioned above one at a time.
Theevaluationmetricoftheaboveexperimentsistherelativeapproximationerror
between the true loss and its approximation:
E
x
=
1−
l
x
(Y,f(X))
l(Y,f(X))
(3.87)
wherexisoneofGAD,WGT(usingweightedlossofensemble (3.82)),andGB(using
approximation used in gradient boosting (3.86)). I assigned equal weights {w
k
} to
the experts in all experiments.
Table 3.2 shows the relative absolute error for various classification data sets and
loss functions when different experts were trained on the same data set. I observe
that the GAD approximation provides the lowest median absolute error for all cases.
This result is statistically significant at α = 0.01 level using the paired t-test. It is
often an order of magnitude better than the other two approximations. Table 3.3
shows the relative absolute error when one expert was trained on three versions of
thesame dataset created by sampling withreplacement. Iused logistic regression for
66
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−2
10
−1
10
0
Expected ensemble prediction μ
f
Median absolute error
K = 3 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−1
10
0
Expected ensemble prediction μ
f
Median absolute error
K = 7 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
Figure 3.8: This figure shows the median absolute approximation error and error
bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts
for logistic loss function as a function of expected ensemble prediction μ
f
. I used
σ
2
f
= 2 and Y = 1 as the correct label.
the classification loss functions and least squares linear regression for the regression
loss functions. Table 3.3 shows that the GAD approximation gives the lowest error
for all cases except for the Wine Quality data set with smooth absolute error loss
function.
These experiments indicatethatl
GAD
(Y,f)providesanaccurateapproximationof
the actual ensemble loss l(Y,f). This adds value to the proposed approximation for
designing supervised machine learning algorithms, in addition to providing an intu-
itive definition and explanation of the impact of diversity on ensemble performance.
Data set Target set No. of instances No. of features
Magic Gamma Telescope [70] {−1,1} 19020 10
Pima Indians Diabetes [156] {−1,1} 768 8
Abalone [118] {−1,1},Z
+
4177 7
Parkinson’s Disease [165] R
+
5875 20
Wine Quality [37] {0,...,10} 6497 11
Table 3.1: This table gives a description ofvariousUCI Machine Learning Repository
data sets used for experiments on GAD. All data sets with {−1,1} as the target
set were used for training binary classifiers. Others were used for training regressors.
Abalone was used for binary classification as well by first thresholding the target
variable at 10.
67
Data set E
GAD
E
WGT
E
GB
Logistic Loss
Magic Gamma Telescope [70] 1.3e-2 6.5e-2 6.6e-2
Pima Indians Diabetes [156] 7.0e-3 2.9e-2 2.3e-2
Abalone [118] 9.0e-3 4.4e-1 4.2e-1
Exponential Loss
Magic Gamma Telescope [70] 3.5e-2 1.3e-1 1.4e-1
Pima Indians Diabetes [156] 1.5e-2 6.8e-2 6.1e-2
Abalone [118] 2.2e-2 9.2e-2 9.2e-2
Smooth Hinge Loss (ǫ = 0.5)
Magic Gamma Telescope [70] 2.2e-2 1.6e-1 1.4e-1
Pima Indians Diabetes [156] 7.0e-3 4.9e-2 3.1e-2
Abalone [118] 1.2e-2 8.9e-2 6.8e-2
Squared Error Loss
Parkinson’s Disease [165] 0 1.7e-1 1.7
Wine Quality [37] 0 1.9e-1 1.5e-1
Abalone [118] 0 2.8e-2 1.4e-1
Smooth Absolute Error Loss (ǫ = 0.5)
Parkinson’s Disease [165] 1e-2 1.4e-1 1.3
Wine Quality [37] 9.1e-2 9e-2 9.9e-2
Abalone [118] 7e-3 1.9e-2 9.9e-2
Table 3.2: This table shows the relative absolute error E
x
for various UCI data sets
between the ensemble loss and approximation x which is one of GAD,WGT, and GB
corresponding to GAD, weighted sum of expert loses, and gradient boosting upper-
boundontotalloss. Thefirstthreelosesusedanensemble ofthreeclassifiers -logistic
regression, linear support vector machine, and Fisher’s linear discriminant analysis
classifiers. The two last two regressors used three regressors obtained by minimizing
squared error, absolute error, and Huber loss function. The GAD approximation has
significantly lower error than the other approximations for all cases except the Wine
quality data set for the smooth absolute loss.
68
Data set E
GAD
E
WGT
E
GB
Logistic Loss
Magic Gamma Telescope [70] 1.76e-4 5.38e-4 1.07e-1
Pima Indians Diabetes [156] 4.02e-3 1.86e-2 3.62e-2
Abalone [118] 1.24e-3 3.91e-3 5.54e-2
Exponential Loss
Magic Gamma Telescope [70] 4.03e-4 9.92e-4 1.77e-1
Pima Indians Diabetes [156] 4.22e-3 1.61e-2 7.19e-2
Abalone [118] 1.49e-3 4.09e-3 9.70e-2
Smooth Hinge Loss (ǫ = 0.5)
Magic Gamma Telescope [70] 8.93e-4 1.89e-3 3.31e-1
Pima Indians Diabetes [156] 1.26e-2 4.93e-2 5.93e-2
Abalone [118] 2.14e-3 6.18e-3 1.46e-1
Squared Error Loss
Parkinson’s Disease [165] 0 5.5e-3 1.6
Wine Quality [37] 0 1.5e-2 8.7e-2
Abalone [118] 0 9.3e-3 1.5e-1
Smooth Absolute Error Loss (ǫ = 0.5)
Parkinson’s Disease [165] 5.3e-4 7.4e-3 1.2
Wine Quality [37] 1.2e-2 4.4e-3 6.0e-2
Abalone [118] 1.4e-3 3.4e-3 8.2e-2
Table 3.3: This table shows the relative absolute error E
x
for various UCI data sets
between the ensemble loss and approximation x which is one of GAD,WGT, and GB
corresponding to GAD, weighted sum of expert loses, and gradient boosting upper-
bound on total loss. The first three loses used logistic regression classifiers trained on
threedatasetscreatedbysamplingwithreplacement (baggedtrainingsets). Thelast
two loses used alinear regressor obtained by minimizing squared errorand trained on
3 bagged training sets. The GAD approximation has significantly lower error than
theotherapproximationsforallcasesexcepttheWinequalitydatasetforthesmooth
absolute loss.
69
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−1
10
0
10
1
10
2
Expected ensemble prediction μ
f
Median absolute error
K = 3 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−1
10
0
10
1
10
2
Expected ensemble prediction μ
f
Median absolute error
K = 7 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
Figure 3.9: The figure shows the median absolute approximation error and error
bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts
for exponential loss function as a function of expected ensemble prediction μ
f
. I used
σ
2
f
= 2 and Y = 1 as the correct label.
3.6 Conclusion and Future Work
This chapter presented the generalized ambiguity decomposition (GAD) theorem
which explains the link between the diversity of experts in an ensemble and the
ensemble’s overall performance. The GAD theorem applies to a convex ensemble of
arbitrary experts with a second order differentiable loss function. It also provides
a data-dependent and loss function-dependent definition of diversity. I applied this
theorem to some commonly used classification and regression loss functions and pro-
vided a simulation-based analysis of diversity term and accuracy of the resulting loss
function approximation. This chapter also presented results on many UCI data sets
for two frequently encountered situations using ensembles of experts. These results
demonstrate the utility of the proposed decomposition to ensembles used in these
real-world problems.
Future work should design supervised learning and ensemble selection algorithms
utilizing the proposed GADtheorem. Such algorithmsmight extend existing work on
trainingdiverseensembles ofneuralnetworksusingnegativecorrelationlearning[106]
and conditional maximum entropy models [11]. The GAD loss function approxima-
tion is especially attractive because the diversity term does not require labeled data
for computation. This opens the possibility of developing semi-supervised learning
algorithms which use large amounts of unlabeled data.
70
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−2
10
−1
10
0
Expected ensemble prediction μ
f
Median absolute error
K = 3 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−2
10
−1
10
0
Expected ensemble prediction μ
f
Median absolute error
K = 7 experts
Ensemble loss (GAD)
Weighted expert loss
GAD error bound
Decision boundary
Correct label
Incorrect label
Figure 3.10: This figure shows the median absolute approximation error and error
bound across 1000 Monte Carlo samples in an ensemble of K = 3 and K = 7 experts
for smooth hinge loss function as a function of expected ensemble prediction μ
f
. I
used σ
2
f
= 2, ǫ = 0.5, and Y =1 as the correct label.
Another interesting research direction is understanding the impact of diversity
introduced at various stages of a conventional supervised learning algorithm on the
final ensemble performance. The individual experts can be trained on different data
sets and can use different feature sets. It would be useful to understand the most
beneficial ways in which diversity can be introduced in the ensemble. I would also
like to study the impact of diversity in sequential classifiers such as automatic speech
recognition (ASR) systems. My recent work [12] develops a GAD-like framework for
theoretically analyzing the impact of diversity on fusion performance of state-of-the-
art ASR systems. Finally, characterizing diversity in an ensembles of human experts
presents atougherchallengebecauseitisdifficulttoquantifytheunderlyinglossfunc-
tion. However, many real-world problems involving crowd-sourcing [3], [5], [6], [31],
[44], [71], [109], [159], [161] and understanding human behavior involve annotation by
multiple human experts [8]–[10], [41], [140], [179], [184], [187]. Extending the GAD
theorem to such cases will contribute significantly to these domains.
71
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−3
10
−2
10
−1
10
0
Expected ensemble prediction μ
f
Median absolute error
K = 3 experts
Ensemble loss (GAD)
Ensemble loss (Grad. boost.)
Decision boundary
Correct label
Incorrect label
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
10
−3
10
−2
10
−1
Expected ensemble prediction μ
f
Median absolute error
K = 7 experts
Ensemble loss (GAD)
Ensemble loss (Grad. boost.)
Decision boundary
Correct label
Incorrect label
Figure3.11: This figureshows themedianabsoluteapproximationerrorforGADand
the gradient boosting upper bound across 1000 Monte Carlo samples in an ensemble
of K = 3 and K = 7 experts for smooth hinge loss function as a function of expected
ensemble prediction μ
f
. I used σ
2
f
= 0.1 and Y =1 as the correct label.
72
Chapter 4
Analysis of Ensemble Diversity:
Automatic Speech Recognition
Diversity or complementarity of automatic speech recognition (ASR) systems is cru-
cialforachievingareductioninworderrorrate(WER)uponfusionusingtheROVER
algorithm. This chapter presents a theoretical proof explaining this often-observed
link between ASR system diversity and ROVER performance. This is in contrast to
many previous works thathave only presented empirical evidence forthis linkorhave
focused on designing diverse ASR systems using intuitive algorithmic modifications.
I prove that the WER of the ROVER output approximately decomposes into a dif-
ference of the average WER of the individual ASR systems and the average WER of
the ASR systems with respect tothe ROVER output. I refer tothe latter quantity as
the diversity of the ASR system ensemble because it measures the spread of the ASR
hypotheses about the ROVER hypothesis. This result explains the trade-off between
the WER of the individual systems and the diversity of the ensemble. This chap-
ter supports this result through ROVER experiments using multiple ASR systems
trained on standard data sets with the Kaldi toolkit. I use the proposed theorem to
explain the lower WERs obtained by ASR confidence-weighted ROVER as compared
to word frequency-based ROVER. I also quantify the reduction in ROVER WER
with increasing diversity of the N-best list. I finally present a simple discriminative
framework for jointly training multiple diverse acoustic models (AMs) based on the
proposed theorem. The proposed framework generalizes and provides a theoretical
basis for some recent intuitive modifications to well-known discriminative training
criterion for training diverse AMs.
73
4.1 Introduction
Automatic speech recognition (ASR) is a challenging task due to several factors rang-
ingfromvariabilityinspeakerandenvironmentalacousticcharacteristicstomismatch
in topics or domains of the conversation [138]. Most ASR systems adopt statistical
models such as hidden Markov models (HMMs) and N-gram language models (LMs)
to account for this variability. However, many large-scale ASR tasks such as exem-
plified by DARPA GALE [160], TRANSTAC [162], EARS [33], CALO [171], and
BABEL [39] projects have shown that the fusion of hypotheses from multiple ASR
systemsisessentialtoachievestate-of-the-artworderrorrates(WERs). TheROVER
algorithm [56] typically performs this fusion. The observed reduction in WER is pri-
marily due to complementary errors made by the different ASR systems.
Researchers hence have also focused on designing diverse ASR systems that make
complementary errors for fusion. The simplest approach is to use intuitively di-
verse data sets and diverse acoustic features such as Mel frequency cepstral coef-
ficients (MFCCs) and perceptual linear prediction (PLP) features for training the
ASR systems. Another common approach is to combine structurally diverse AMs
such as based on Gaussian mixture model (GMM) HMMs and deep neural networks
(DNNs) [39]. Other works [35], [48], [142], [151], [155], [186] have used machine learn-
ing techniques such as bagging [22], boosting [60], [61] and random forests [25] to
train diverse ASR systems. Breslin [26] provides a comprehensive review of these
techniques. A recent work by Cui, Huang, and Chien [40] presents a multi-view and
multi-objective algorithm forsemi-supervised training of HMM acoustic models. The
authors generate multiple ASR systems on different views of the training data by
using different front-ends and randomized decision trees. The diversity of these ASR
systems is a necessary condition for their mutual information-based optimization al-
gorithm because they use the ROVER output on unlabeled data as reference. Chen
and Zhao [34] train diverse AMs using cross-validation and speaker clustering. They
also explore several intuitive measures of AM diversity such as the standard devia-
tion of per-frame acoustic scores computed by the AMs. They empirically show that
the increased diversity compensates for a reduction in the quality of the AMs being
combined, which results in an improvement in WER after fusion. No prior work has
however theoretically analyzed the impact of ASR diversity on fusion performance to
the best of my knowledge despite widespread interest in training and fusing diverse
74
ASR systems.
On the other hand, system complementarity or diversity is a well-studied problem in machine learning and statistical signal processing. Many researchers have formally explored the benefits of ensemble diversity in machine learning [46], [90], [92], [169], [173]. Motivated by these previous works, this chapter presents a theoretical link between ASR system diversity and ROVER performance. It specifically uses the ambiguity decomposition proposed in [90] for squared-error regression as the framework. The ambiguity decomposition states that the squared error of a convex sum of regressors decomposes into the difference between the weighted squared error of the individual regressors and the diversity of the ensemble. This diversity is defined as the weighted squared error of each regressor from the ensemble's prediction. This result is equivalent to the bias-variance-covariance decomposition derived in [173] and forms the basis of negative correlation learning for neural networks [106]. It is also intuitively similar to the bias-variance decomposition [63]. Consider a training data set D = {x_i, y_i}_{i=1}^{N} ⊆ R^D × R for learning a regressor f : R^D → R by minimizing the squared error loss function. I know that the optimal regressor which minimizes the squared error is the conditional expectation E_{p(y|x)}{y|x}, or E{y|x} to simplify the notation. I denote the learned regressor as f^*_D because it is a function of the training data set D. Consider a given x. The random variable f^*_D(x) might be an excellent approximation to E{y|x} only for a specific choice of training data set D. Hence, a better measure of the approximation error is the expected squared error between the optimal and learned regressors, where the expectation is taken over p(D), i.e. a probability distribution over the space of training data sets.
The bias-variance decomposition [63] states that for an input x, this expected squared error decomposes as:

E_{p(D)}{ (f^*_D(x) − E{y|x})^2 } = ( E_{p(D)}{f^*_D(x)} − E{y|x} )^2 + E_{p(D)}{ (f^*_D(x) − E_{p(D)}{f^*_D(x)})^2 }.    (4.1)

The first term on the right-hand side is the squared bias of the learned regressor f^*_D(x) and computes the squared error between the mean prediction and the optimal estimate E{y|x}. The second term is the variance of the random variable f^*_D(x) over p(D) and measures the spread of the predictions about the mean prediction.

I now show the relation between (4.1) and the ambiguity decomposition [90]. Let p(D) be the following mixture model over M data sets:

p(D) = Σ_{m=1}^{M} w_m δ(D − D_m).    (4.2)

Hence the expectation of any function g(D) over p(D) becomes the convex sum

E_{p(D)}{g(D)} = Σ_{m=1}^{M} w_m g(D_m).    (4.3)

Substituting the above expression in (4.1) and re-arranging the terms gives the ambiguity decomposition

( Σ_{m=1}^{M} w_m f^*_{D_m}(x) − E{y|x} )^2 = Σ_{m=1}^{M} w_m ( f^*_{D_m}(x) − E{y|x} )^2 − Σ_{m=1}^{M} w_m ( f^*_{D_m}(x) − Σ_{n=1}^{M} w_n f^*_{D_n}(x) )^2.    (4.4)

I conclude that the bias-variance decomposition equals the ambiguity decomposition only when all regressors have the same functional form f and are estimating the conditional expectation E{y|x} as the target. The ambiguity decomposition does not impose such constraints.
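To make (4.4) concrete, the following short numerical check (a sketch added here for illustration, with arbitrary toy weights and predictions rather than values from any experiment in this thesis) verifies that the squared error of the convex combination equals the weighted individual squared error minus the ensemble diversity.

    import numpy as np

    rng = np.random.default_rng(0)

    M = 4                                # number of regressors in the ensemble
    w = rng.random(M); w /= w.sum()      # convex combination weights w_m
    f = rng.normal(size=M)               # predictions f_m(x) of the M regressors at one input x
    y = 0.3                              # target E{y|x}

    f_bar = np.dot(w, f)                 # ensemble prediction

    ensemble_err = (f_bar - y) ** 2                 # left-hand side of (4.4)
    weighted_err = np.dot(w, (f - y) ** 2)          # weighted individual squared error
    diversity    = np.dot(w, (f - f_bar) ** 2)      # spread of the regressors about the ensemble

    assert np.isclose(ensemble_err, weighted_err - diversity)
    print(ensemble_err, weighted_err - diversity)   # the two numbers match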
I first propose a simple vector space model for ROVER WER in Section 4.2 to establish the link with the ambiguity decomposition. I use this model to approximate the ROVER WER by a simpler expression involving the squared error. This enables me to directly apply the ambiguity decomposition in Section 4.3 and decompose the WER between the reference transcription and the hypothesis transcription generated by ROVER. I show that my proposed decomposition applies both at the per-utterance level and also on average over a data set. Section 4.4 describes my experimental setup using the Kaldi ASR toolkit [135] and ASR confidence estimation using a variety of lattice-based and prosodic features within a conditional random field (CRF) [96] model. It also describes the different fusion strategies I tested in ROVER and my test data set. I empirically validate my proposed theorems on the test set in Section 4.5, among other experiments and analysis. I then use my theoretical results to give a unified discriminative training framework using the minimum Bayes risk (MBR) criterion for training diverse ASR systems in Section 4.6. The chapter concludes in Section 4.7 with some directions for future work.
Variable      Description
M             Number of ASR hypotheses being fused by ROVER
K             Vocabulary size
N             Number of cohort sets in a given confusion network generated by ROVER
C_i           i-th cohort set
w_i^m         1-in-K encoding of system m's word hypothesis in C_i
s_i^m         1-in-K encoding of system m's confidence score in C_i
α             word frequency weight in ROVER
h_i^m         α w_i^m + (1 − α) s_i^m
h_i^{avg}     (1/M) Σ_{m=1}^{M} h_i^m
r_i           1-in-K encoding of the reference word in C_i
h_i^*         1-in-K encoding of the ROVER word hypothesis in C_i
E(·)          true WER
E_{approx}    approximate WER
p             probability of a correct word
p_{ML}        ML estimate of p

Table 4.1: This table gives a list of the key variables and their descriptions used in Lemmas 4.1-4.2 and Theorems 4.1-4.3.
4.2 A Vector Space Model for ROVER WER
This section presents a vector space model for the WER of the ROVER output of an ASR system ensemble. I will use this model for proving the link between hypothesis diversity and ROVER WER. Table 4.1 lists the key variables of this vector space model and their descriptions. ROVER first aligns the multiple word sequence hypotheses using dynamic programming (DP). This alignment allows for insertions and deletions in the word sequence hypotheses and minimizes the total cost of word insertions, deletions, and substitutions (Footnote 1). This total cost is also known as the Levenshtein distance metric over the set of strings. The aligned output is called a word confusion network (WCN). It consists of a temporal sequence of sets of competing word hypotheses. I refer to each such set as a cohort set [185].

Footnote 1: I will assume an equal cost of 1 for insertions, deletions, and substitutions. The theoretical analysis easily extends to unequal costs.
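For reference, the equal-cost Levenshtein distance mentioned above can be computed with the standard DP recursion. The sketch below is an illustrative implementation added here; it is not the actual alignment code used by ROVER or by my experiments.

    def levenshtein(ref, hyp):
        """Minimum number of insertions, deletions, and substitutions (all with cost 1)
        needed to turn the word sequence `ref` into the word sequence `hyp`."""
        R, H = len(ref), len(hyp)
        d = [[0] * (H + 1) for _ in range(R + 1)]   # d[i][j]: distance between ref[:i] and hyp[:j]
        for i in range(R + 1):
            d[i][0] = i                             # delete the remaining reference words
        for j in range(H + 1):
            d[0][j] = j                             # insert the remaining hypothesis words
        for i in range(1, R + 1):
            for j in range(1, H + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + sub)    # match or substitution
        return d[R][H]

    print(levenshtein("the cat sat".split(), "the cat sat down".split()))   # prints 1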
Consider M ASR systems decoding a single given audio file. Readers must note that this includes the case of an M-best list generated from any single ASR system. I first perform the DP alignment of the M decoded 1-best sentence hypotheses. Let N be the total number of cohort sets in the resulting confusion network. The presented theoretical analysis applies to ROVER fusion of any type of ASR output (1-best, M-best, confusion network, and word lattice) because ROVER always performs fusion on the confusion network output by the DP alignment. Let K be the number of words in the decoding vocabulary of the ASR systems. I assume that this vocabulary also includes a special symbol to account for word insertions and deletions during the DP alignment. Consider the i-th cohort set C_i. I encode each word in this cohort set using a 1-in-K or one-hot encoding scheme (Footnote 2). Let w_i^m be the 1-in-K encoding of the word hypothesis from ASR system m in cohort set C_i. Each ASR system can additionally provide a confidence score in [0,1] for every word hypothesis. I also embed this confidence score in a K-dimensional vector s_i^m which contains the score in the location of the word hypothesis and zeros everywhere else.

ROVER computes the following convex combination of the word bit vector w_i^m and the confidence score vector s_i^m:

h_i^m = α w_i^m + (1 − α) s_i^m    (4.5)

where α ∈ [0,1] is a user-defined parameter. ROVER then averages h_i^m across the M ASR systems to obtain

h_i^{avg} = (1/M) Σ_{m=1}^{M} h_i^m    (4.6)
          = α (1/M) Σ_{m=1}^{M} w_i^m + (1 − α) (1/M) Σ_{m=1}^{M} s_i^m    (4.7)
          = α w_i^{avg} + (1 − α) s_i^{avg}.    (4.8)

Footnote 2: w_i^m is a K-dimensional bit vector with 1 in the position of the occurring word and 0 everywhere else.

The first term on the right-hand side of the above equation (4.8) is a vector containing the frequency of each vocabulary word in the cohort set C_i. The second term contains the average confidence score of each word in the vocabulary. Elements of h_i^{avg} lie in [0,1] but they do not sum to 1 (1^T h_i^{avg} ≠ 1) because 1^T s_i^{avg} ≠ 1.
ROVER next thresholds h_i^{avg} such that its maximum element is set to 1 and all others are set to 0. I denote the resulting 1-in-K bit vector by h_i^*. Let r_i be the 1-in-K encoding of the DP-aligned reference word for C_i. Thus the WER for C_i is the following 0/1 loss function:

E(r_i, h_i^*) = 0 if r_i = h_i^*, and 1 if r_i ≠ h_i^*.    (4.9)

Both r_i and h_i^* are 1-in-K bit vectors, which allows us to rewrite the WER using the L_2 norm as

E(r_i, h_i^*) = (1/2) ||r_i − h_i^*||_2^2.    (4.10)

The total number of word errors in the given audio file is just the sum of E(r_i, h_i^*) over all N cohort sets C_i. The next lemma proves the relation between E(r_i, h_i^*) and the probability of a word error under the simplistic assumption of independent and identically distributed (IID) Bernoulli errors.

Lemma 4.1. Define the probability of a correct word

p = P(r^T h^* = 1)    (4.11)

where both r and h^* are random vectors. Assume that all word errors are IID Bernoulli random variables with parameter 1 − p. Then

p_{ML} = 1 − (1/N) Σ_{i=1}^{N} E(r_i, h_i^*)    (4.12)

is the maximum likelihood (ML) estimate of p.

Proof. Expand E(r_i, h_i^*) in (4.10) as

E(r_i, h_i^*) = (1/2) ( ||r_i||_2^2 + ||h_i^*||_2^2 − 2 r_i^T h_i^* ).    (4.13)

Both ||r_i||_2^2 and ||h_i^*||_2^2 are 1 because r_i and h_i^* are 1-in-K bit vectors. Hence

1 − E(r_i, h_i^*) = r_i^T h_i^*.    (4.14)

Now {r_i^T h_i^*}_{i=1}^{N} are IID samples from a Bernoulli random variable with parameter p. Hence the ML estimate of p is

p_{ML} = (1/N) Σ_{i=1}^{N} r_i^T h_i^*    (4.15)
       = 1 − (1/N) Σ_{i=1}^{N} E(r_i, h_i^*).    (4.16)
I note that h_i^* is a non-linear (threshold) function of h_i^{avg}, which makes the analysis of diversity in E(r_i, h_i^*) from (4.10) difficult. The ambiguity decomposition considers the average prediction from the individual regressors. Thus I instead propose to use the following approximation to E(r_i, h_i^*):

E_{approx}(r_i, h_i^{avg}) = (1/2) ||r_i − h_i^{avg}||_2^2.    (4.17)

E_{approx}(r_i, h_i^{avg}) is easier to analyze because it directly uses h_i^{avg} from (4.8) in place of its non-linear transformation h_i^*. The next lemma relates E_{approx}(r_i, h_i^{avg}) to the ML estimate p_{ML} of the probability of a correct word, p. Each r_i^T h_i^{avg} takes a value between 0 and 1. r_i^T h_i^{avg} = 0 occurs when the true word does not appear in the cohort set C_i, leading to a word error. r_i^T h_i^{avg} = 1 occurs when all the M ASR systems predict the correct word in the cohort set. r_i^T h_i^{avg} thus equals the empirical estimate of the total probability of the correct word in C_i. I thus make the natural assumption that the r_i^T h_i^{avg} are IID random variables supported on [0,1] with mean p.

Lemma 4.2. Assume that the r_i^T h_i^{avg} are IID random variables supported on [0,1] with mean p. Then

p_{ML} ≤ 1 − (1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg})    (4.18)

and

p_{ML} ≥ (1/2) (1 + α^2/M) − (1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg})    (4.19)

where M is the number of ASR systems and α ∈ [0,1] is the ROVER parameter.
Proof. I first prove the upper-bound on p_{ML}. Expand E_{approx}(r_i, h_i^{avg}) as

E_{approx}(r_i, h_i^{avg}) = (1/2) ( ||r_i||_2^2 + ||h_i^{avg}||_2^2 − 2 r_i^T h_i^{avg} )    (4.20)

where ||r_i||_2^2 = 1 because r_i is a 1-in-K bit vector. Also

||h_i^{avg}||_2^2 = ||α w_i^{avg} + (1 − α) s_i^{avg}||_2^2    (4.21)
                 ≤ α ||w_i^{avg}||_2^2 + (1 − α) ||s_i^{avg}||_2^2    (4.22)

due to (4.8) and Jensen's inequality for the convex squared-L_2 norm. I next use the fact that ||w_i^{avg}||_2 ≤ ||w_i^{avg}||_1. Since ||w_i^{avg}||_1 = 1^T w_i^{avg} = 1 because all entries of w_i^{avg} are non-negative and sum to 1, this gives us ||w_i^{avg}||_2 ≤ 1. Similarly, ||s_i^{avg}||_2 ≤ ||s_i^{avg}||_1 = 1^T s_i^{avg} ≤ 1 because the sum of the entries of the average confidence score vector s_i^{avg} is less than or equal to 1. Hence I can write the upper-bound

(1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg}) ≤ 1 − (1/N) Σ_{i=1}^{N} r_i^T h_i^{avg}.    (4.23)

I now use the fact that {r_i^T h_i^{avg}}_{i=1}^{N} are IID random variables with support [0,1] and mean p. Hence

p_{ML} = (1/N) Σ_{i=1}^{N} r_i^T h_i^{avg}    (4.24)

is the ML estimate of p. Thus (4.23) becomes

(1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg}) ≤ 1 − p_{ML}.    (4.25)

Re-arranging this gives us the desired upper-bound on p_{ML}. I use the following inequality for ||h_i^{avg}||_2^2 in place of the upper-bound in (4.22) to prove the lower-bound on p_{ML}:

||h_i^{avg}||_2^2 ≥ α^2 ||w_i^{avg}||_2^2 + (1 − α)^2 ||s_i^{avg}||_2^2    (4.26)
                 ≥ α^2/M    (4.27)

because ||w_i^{avg}||_1 = 1^T w_i^{avg} = 1 ≤ √M ||w_i^{avg}||_2 and ||s_i^{avg}||_2^2 ≥ 0. I have utilized the fact that the effective dimension of w_i^{avg} is M and not K in the upper-bound on ||w_i^{avg}||_1. The rest of the steps of the proof for the lower-bound on p_{ML} remain the same as for the upper-bound.
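As a quick numerical sanity check on the two bounds on ||h_i^{avg}||_2^2 used in this proof, the sketch below (added for illustration; it draws random word hypotheses and confidence scores rather than using real ASR output) verifies that α^2/M ≤ ||h_i^{avg}||_2^2 ≤ α ||w_i^{avg}||_2^2 + (1 − α) ||s_i^{avg}||_2^2 over many random cohort sets.

    import numpy as np

    rng = np.random.default_rng(1)
    K, M, alpha = 20, 5, 0.7              # vocabulary size, number of systems, ROVER weight

    for _ in range(1000):
        # Random 1-in-K word hypotheses and confidence vectors for one cohort set
        w = np.zeros((M, K)); w[np.arange(M), rng.integers(K, size=M)] = 1.0
        s = w * rng.random((M, 1))        # confidence score placed in each hypothesis slot

        w_avg, s_avg = w.mean(axis=0), s.mean(axis=0)
        h_avg = alpha * w_avg + (1 - alpha) * s_avg

        sq = np.dot(h_avg, h_avg)
        upper = alpha * np.dot(w_avg, w_avg) + (1 - alpha) * np.dot(s_avg, s_avg)  # Jensen, (4.22)
        lower = alpha ** 2 / M                                                     # bound (4.27)
        assert lower <= sq + 1e-12 and sq <= upper + 1e-12

    print("both bounds on ||h_i^avg||_2^2 held in all trials")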
The assumptions in Lemma 4.2 allow us to relate the sample mean of r_i^T h_i^{avg} over the N cohort sets to the sample mean of the IID Bernoulli random variables r_i^T h_i^* from Lemma 4.1. Lemmas 4.1 and 4.2 thus give us the following bounds, which relate the averages of the true E(r_i, h_i^*) and its approximation E_{approx}(r_i, h_i^{avg}) over all N cohort sets for a given audio file.

Theorem 4.1. Assume that all word errors are IID Bernoulli random variables with parameter 1 − p and that the r_i^T h_i^{avg} are IID random variables with mean p and support [0,1]. Then

(1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg}) ≤ (1/N) Σ_{i=1}^{N} E(r_i, h_i^*)    and    (4.28)

(1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg}) ≥ (1/N) Σ_{i=1}^{N} E(r_i, h_i^*) − (1/2) (1 − α^2/M).    (4.29)

Proof. I substitute the equality for p_{ML} from Lemma 4.1 into the inequalities for p_{ML} proved in Lemma 4.2 to prove the theorem.
I note that the lower-bound in (4.29) becomes independent of α as the number of ASR systems M increases. In the limit of the number of systems M → ∞, or when α = 0, the approximate ROVER WER lies in the interval:

(1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg}) ∈ [ (1/N) Σ_{i=1}^{N} E(r_i, h_i^*) − 1/2 ,  (1/N) Σ_{i=1}^{N} E(r_i, h_i^*) ].    (4.30)
Both the true and the approximate ROVER WERs are random variables. Hence, it is useful to think of the above bounds as a statistical confidence interval in which the approximate ROVER WER lies. The assumption of IID word errors in Theorem 4.1 will be violated for a few decoded utterances. But the approximate ROVER WER is highly likely to lie in this confidence interval for a large majority of the test cases. I illustrate this through experiments in Section 4.5.
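As a small worked example (added here for illustration), the sketch below evaluates the width (1/2)(1 − α^2/M) of this interval for a few settings of the ROVER weight α and the number of systems M, showing that the width approaches 1/2 as M grows and equals 1/2 exactly when α = 0.

    # Width of the interval implied by (4.29)-(4.30): 0.5 * (1 - alpha^2 / M)
    for M in (2, 4, 8, 100):
        for alpha in (0.0, 0.5, 1.0):
            width = 0.5 * (1 - alpha ** 2 / M)
            print(f"M = {M:3d}, alpha = {alpha:.1f}: interval width = {width:.3f}")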
Theorem 4.1 provides a link between the averages of the true and the approximate ROVER WERs. This justifies my use of the simpler approximate ROVER WER in place of the true WER for analyzing diversity using the ambiguity decomposition in the next section. I finally use Theorem 4.1 to derive a link between diversity and the true ROVER WER.
4.3 Ambiguity Decomposition for ROVER WER
The previous section presented a vector space model for the true ROVER WER, its tractable approximation, and some bounds that relate the true and approximate WERs. This framework allows us to define the diversity of the hypotheses from the ASR systems being combined and its impact on the ROVER WER in this section.

There are several definitions of diversity for an ensemble of machine classifiers [93], [154]. I use the ambiguity decomposition [90] because it readily applies to a convex combination of predictions from M regressors using the squared-error loss function. This fits my approximate WER proposed in the previous section and also provides an easily interpretable result. The bias-variance-covariance decomposition [173] is an equivalent result which was derived in the context of neural networks. The ambiguity decomposition uses a scalar target variable, but it is easy to extend to the vector case, which I use for my approximate ROVER WER. The next theorem presents the ambiguity decomposition for ROVER WER.
Theorem 4.2. The approximate ROVER WER for any cohort set C_i decomposes as

E_{approx}(r_i, h_i^{avg}) = (1/M) Σ_{m=1}^{M} E_{approx}(r_i, h_i^m) − (1/M) Σ_{m=1}^{M} E_{approx}(h_i^{avg}, h_i^m).    (4.31)

Proof. I use the definition of the approximate ROVER WER from the last section and expand

2 E_{approx}(r_i, h_i^m) = ||r_i − h_i^m||_2^2    (4.32)
                        = ||r_i − h_i^{avg} + h_i^{avg} − h_i^m||_2^2    (4.33)
                        = ||r_i − h_i^{avg}||_2^2 + ||h_i^{avg} − h_i^m||_2^2 + 2 (r_i − h_i^{avg})^T (h_i^{avg} − h_i^m)    (4.34)
                        = 2 E_{approx}(r_i, h_i^{avg}) + 2 E_{approx}(h_i^{avg}, h_i^m) + 2 (r_i − h_i^{avg})^T (h_i^{avg} − h_i^m).    (4.35)

Taking the average of both sides over m and dividing by 2 gives

(1/M) Σ_{m=1}^{M} E_{approx}(r_i, h_i^m) = E_{approx}(r_i, h_i^{avg}) + (1/M) Σ_{m=1}^{M} E_{approx}(h_i^{avg}, h_i^m) + (r_i − h_i^{avg})^T ( h_i^{avg} − (1/M) Σ_{m=1}^{M} h_i^m ).    (4.36)

The final term on the right-hand side is zero because of the definition of h_i^{avg}. Re-arranging the resulting equation gives us the ambiguity decomposition for ROVER WER.

Theorem 4.2 says that for any cohort set C_i, the approximate ROVER WER equals the average approximate WER of the individual systems minus the diversity of the ASR system ensemble, defined as

Diversity: (1/M) Σ_{m=1}^{M} E_{approx}(h_i^{avg}, h_i^m).    (4.37)

This diversity equals the average of the approximate WERs of the individual systems from ROVER's prediction. Figure 4.1 illustrates the nature of the ambiguity decomposition by means of an example in a 3-dimensional Euclidean space for a given cohort set with M = 3 ASR systems. Diversity measures the average spread of the individual ASR predictions around the average prediction h^{avg}.
Figure 4.1: This figure illustrates the ambiguity decomposition for ROVER WER presented in Theorem 4.2. I consider recognition of a single word out of a vocabulary of K = 3 words and M = 3 ASR systems. Each of the ASR systems predicts a different word in the given cohort set. The three axes constitute the Euclidean vector space arising due to the 1-in-3 encoding of words. The average ROVER prediction h^{avg} is [1/3, 1/3, 1/3] and the approximate WER is 1/3. Theorem 4.2 decomposes this into a difference of the average WER of the 3 systems, computed from the average squared length of the finely-dotted lines (2/3), and the diversity of the ensemble, computed from the average squared length of the thick-dotted lines (1/3).
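The numbers in Figure 4.1 can be reproduced directly. The short sketch below (added for illustration) evaluates both sides of (4.31) for the K = 3, M = 3 cohort set in which each system predicts a different word.

    import numpy as np

    # K = 3 vocabulary, M = 3 systems, each predicting a different word (alpha = 1, no confidences)
    h = np.eye(3)                     # h_i^1, h_i^2, h_i^3 are the three unit vectors
    r = np.array([1.0, 0.0, 0.0])     # the reference word is the one predicted by system 1
    h_avg = h.mean(axis=0)            # ROVER's average prediction [1/3, 1/3, 1/3]

    def e_approx(a, b):
        """E_approx of (4.17): half the squared Euclidean distance."""
        return 0.5 * np.sum((a - b) ** 2)

    approx_rover_wer = e_approx(r, h_avg)                              # 1/3
    avg_system_wer   = np.mean([e_approx(r, hm) for hm in h])          # 2/3
    diversity        = np.mean([e_approx(h_avg, hm) for hm in h])      # 1/3

    print(approx_rover_wer, avg_system_wer, diversity)
    assert np.isclose(approx_rover_wer, avg_system_wer - diversity)    # Theorem 4.2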
Because (4.31) applies point-wise, i.e. for each cohort set C_i, it also applies on average across the N cohort sets for the given audio file. This gives

(1/N) Σ_{i=1}^{N} E_{approx}(r_i, h_i^{avg}) = (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E_{approx}(r_i, h_i^m) − (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E_{approx}(h_i^{avg}, h_i^m).    (4.38)
The above equation in conjunction with Theorem 4.1 enables me to derive ambi-
guity decomposition bounds for the average true ROVER WER. I present this in the
next theorem.
Theorem 4.3. The average true ROVER WER over the N cohort sets {C_1, ..., C_N} decomposes into the following bounds:

(1/N) Σ_{i=1}^{N} E(r_i, h_i^*) ≤ (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E(r_i, h_i^m) − (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E(h_i^m, h_i^*) − (α^2/M − 1)    and    (4.39)

(1/N) Σ_{i=1}^{N} E(r_i, h_i^*) ≥ (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E(r_i, h_i^m) − (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E(h_i^m, h_i^*).    (4.40)

Proof. I start with the result of Theorem 4.2 and substitute the lower-bound in (4.29) on the average of E_{approx}(r_i, h_i^{avg}). This gives

(1/N) Σ_{i=1}^{N} E(r_i, h_i^*) ≤ (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E_{approx}(r_i, h_i^m) − (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E_{approx}(h_i^{avg}, h_i^m) − (1/2) (α^2/M − 1).    (4.41)

I then use the fact that E_{approx}(r_i, h_i^m) = E(r_i, h_i^m) because h_i^m is a 1-in-K bit vector. I finally again use the lower-bound in (4.29) on the average of E_{approx}(h_i^{avg}, h_i^m) to give the desired upper-bound.

The proof of the lower-bound proceeds similarly, except that I use the upper-bound in (4.28) instead of (4.29).

Theorem 4.3 gives insight into the relation between the ROVER WER and ASR ensemble diversity in terms of the true WER. I have made two key statistical assumptions to prove this result: the word errors are IID Bernoulli random variables with probability of error 1 − p, and the r_i^T h_i^{avg} are IID random variables with mean p. These assumptions are not realistic because word errors are not IID random variables. ASR word errors often occur in clusters and tend to be bursty due to the context used in the AM/LM and the sequential nature of decoding. Furthermore, Theorem 4.3 only provides bounds on the true ROVER WER in terms of ensemble diversity. However, the results of this section do explain the impact of diversity on ROVER WER and motivate the ASR system fusion experiments in the next section. I now present my experiments with multiple ASR systems to check the applicability of the proposed decomposition.
4.4 Experimental Setup
I first describe the training of the various ASR systems using the Kaldi toolkit in the next section. This is followed by a description of my ASR confidence estimation module in Section 4.4.2 and my test data set in Section 4.4.3.
4.4.1 ASR System Training in Kaldi
The Kaldi toolkit [135] provides state-of-the-art open-source tools for training ASR systems. It offers many advantages over other publicly available ASR toolkits (e.g. HTK [188] and Sphinx [97]) such as tight integration with finite state transducers (FSTs) using the OpenFST toolkit [2], a generic and extensible design, the Apache 2.0 license, and recipes for various standard data sets. I used data from the Wall Street Journal (WSJ) [130], the HUB4 English broadcast news [57], and the ICSI meeting [79] data sets for my experiments. These data sets are popular with researchers in automatic speech recognition and speech processing.

I used the default Kaldi training recipe for WSJ to train all the ASR systems. This recipe uses the CMU pronunciation dictionary and its phone set. It first computes Mel frequency cepstral coefficients (MFCCs) over 25 msec long speech frames with a 10 msec shift. It then trains monophone models using the Viterbi-EM algorithm. Kaldi's AM training tools do not use the Baum-Welch or exact EM algorithm because the computationally cheaper Viterbi-EM algorithm gives similar word recognition performance. The training recipe next aligns the training data using these models and uses the resulting alignments to train triphone models (my baseline system M1). I used 2000 leaves for decision tree clustering and 10000 total Gaussians for the triphone models.
I then obtained the second system (system M2) by using the alignments from
the M1 models to perform Linear Discriminant Analysis (LDA) and Maximum Like-
lihood Linear Transformation (MLLT). Both these steps increase the discrimination
between the various phones and thus lead to better recognition accuracy. I then
discriminatively adapted the M2 acoustic models using the Maximum Mutual In-
formation (MMI) [133] criterion which resulted in the final system M3. The MMI
optimization maximizes the mutual information between the true state sequence and
the acoustic feature vectors. Prior work has shown that it improves WER beyond a
generatively-trained AM. Table 4.2 summarizes the three systems and their training
steps.
ASR System    Training Steps
M1            Triphone Viterbi-EM training
M2            M1 → LDA → MLLT
M3            M2 → MMI

Table 4.2: Table 4.2 summarizes the training steps for the three ASR systems used in this chapter for each of the WSJ, HUB4, and ICSI data sets.

1-best WERs
       HUB4                WSJ                 ICSI
       MR    BO    JL      MR    BO    JL      MR    BO    JL
M1     31.8  34.2  46      39.3  47.7  55.5    40.1  44.2  54.1
M2     29    30.4  44.1    37.1  44.3  59      35.1  41.3  54.8
M3     28.2  29.6  43.6    36    43.2  58.5    33.9  39.6  53.4

Table 4.3: Table 4.3 summarizes the testing set 1-best WERs for the various ASR systems. The systems trained on the HUB4 data set provide the lowest WERs. The M3 models perform the best out of the three models trained for each data set, except WSJ for speaker JL.
I now discuss my ASR confidence estimation module, which enables me to test the applicability of the ambiguity decomposition for confidence-weighted ROVER in addition to the conventional word frequency-based ROVER.
4.4.2 ASR Confidence Estimation
ROVER allows each word to possess a confidence score between 0 and 1 during fusion. A confidence score of 1 indicates that the word is correct and thus should be assigned more weight in its corresponding cohort set. ASR confidence scores can also provide useful information to the user of the ASR system and to other modules which use the ASR output, such as spoken dialog systems. Many researchers have thus focused on designing algorithms for ASR confidence estimation. [80] provides a detailed survey of these algorithms. Most of these techniques use rich representations of ASR hypotheses such as word lattices, confusion networks, and N-best lists.

The most common approach for ASR confidence estimation is to compute the posterior probability of the word hypotheses on each arc using the forward-backward algorithm [137] on the word lattice [83], [180]. This directly gives a word confidence score without any additional processing. However, many researchers have found that training a classifier which uses additional features derived from the word lattice and AM/LM scores is better able to predict ASR word errors. Examples of such classifiers include maximum entropy models [181], conditional random fields [153], boosted weak learners such as decision stumps [116], and neural networks [178]. These classifier-based techniques typically out-perform the word lattice posterior probability in terms of standard metrics such as normalized cross-entropy (NCE) and equal error rate (EER) because they optimally combine additional features to minimize a loss function on a labeled data set.

I used conditional random fields (CRFs) [96] for performing confidence estimation in this chapter. A CRF is a discriminative model which estimates the conditional probability p(y|x) of a label sequence y given input data x. This conditional probability is written as the normalized log-linear function

p_Λ(y|x) = (1/Z_Λ(x)) exp( Σ_k λ_k f_k(x, y) )    (4.42)

where f_k(x, y) is the k-th feature, λ_k is its weight, and Z_Λ(x) is the normalization constant, also known as the partition function. The features are defined on fully-connected subgraphs or cliques of a Markov random field (MRF) composed of elements of x and y. Quasi-Newton optimization algorithms such as L-BFGS [103] typically perform parameter estimation of a CRF by maximizing the conditional likelihood of a labeled training data set. The negative L1 and/or L2 norm of the parameter vector Λ added to the log-likelihood objective function performs regularization and gives lower weight to redundant features.
Elements of my extracted data vector x fall into two categories: ASR system lattice-based cues and speech signal-based prosodic cues. I use the time boundaries for each word in the given hypothesis sentence to extract the following cues:

1. ASR lattice-based cues: The ASR lattice provides valuable cues for confidence estimation because it retains many competing word hypotheses and their associated AM/LM scores within the time boundaries of a given word in the 1-best hypothesis. I extracted the following cues from the ASR lattice:
(a) Word hypothesis posterior probability - A high posterior probability of the 1-best word in the current time segment indicates high confidence.
(b) Entropy of word posterior PMF - I computed the posterior probability mass function (PMF) over all words in the lattice within the given time boundaries. A peaky posterior distribution characterized by low entropy indicates that the ASR is confident about its word hypothesis.
(c) Frequency of word hypothesis - The frequency of the 1-best word hypothesis in the given time segment provides a proxy for its posterior probability.
(d) Entropy of word frequency distribution - This feature was motivated by the entropy of the word posterior PMF. A low entropy again indicates high confidence.
(e) Number of unique word hypotheses - A high number of unique word hypotheses indicates that the ASR system is confused between several competing word hypotheses and the 1-best hypothesis is likely to be of lower confidence.
(f) Number of unique context-dependent states - Kaldi's decoder generates lattices which contain the context-dependent state (transition ID) sequence on each arc. A high number of unique context-dependent states indicates a large number of state transitions and thus acoustic instability. This is likely to happen in case of a word error.
(g) AM and LM scores - I used the duration-normalized acoustic and language model log-likelihoods as additional cues.
(h) Duration of the hypothesis word - Including the word duration helps us model any duration-dependent behavior of word errors.
(i) Word identity - I also included word identity to model the observation that the ASR systems committed more frequent errors on certain words than others.

2. Prosody-based cues: I also extracted the mean, standard deviation, minimum, and maximum of the loudness, voicing probability, and pitch using the OpenSMILE toolkit [52]. Prior work has shown the usefulness of prosodic features for predicting ASR errors [64], [77], [102] and speech disfluencies [104].
Most CRF software packages, including the one I used for this chapter (CRF++, Footnote 3), allow only discrete inputs. I thus quantized each of the above cues into 10 levels using the K-means algorithm. Using a higher number of quantization levels led to an exponential increase in the number of CRF features and resulted in over-fitting. Let y_t = 1 if the t-th word in a given hypothesis sequence is correct and 0 otherwise. Let x_t represent the corresponding vector of the above confidence cues after quantization. I computed two broad categories of binary features for the CRF model - state (unigram) features which depend only on the current label y_t, and transition (bigram) features which depend on both the current label and the previous label y_{t-1}. Examples of these unigram and bigram features are

f_k(x_t, y_t) = I(x_{tl} = b and y_t = 1)    and    (4.43)
f_k(x_t, y_t, y_{t-1}) = I(x_{tl} = b and y_t = 1 and y_{t-1} = 0)    (4.44)

where x_{tl} is the l-th component of x_t, b is the discrete bin index, and I is the indicator function. I also included features which utilized contextual information for the input x: f_k(x_{t-1}, y_t), f_k(x_{t+1}, y_t), f_k(x_{t-1}, y_t, y_{t-1}), and f_k(x_{t+1}, y_t, y_{t-1}). This was motivated by the observed benefits of using temporal context in improving OOV detection [124].

Footnote 3: http://crfpp.googlecode.com/svn/trunk/doc/index.html
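As a rough illustration of this feature pipeline (the cue values, labels, and feature naming below are invented for the sketch and do not reproduce the exact CRF++ templates I used), the following code quantizes one continuous cue into 10 K-means bins and emits binary unigram and bigram indicator features of the form in (4.43)-(4.44).

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)

    # One continuous confidence cue (e.g. a word posterior) for a sequence of hypothesis words
    cue = rng.random(200)
    labels = rng.integers(2, size=200)            # y_t: 1 = correct word, 0 = word error

    # Quantize the cue into 10 discrete levels with K-means, as done for the CRF inputs
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(cue.reshape(-1, 1))
    bins = km.predict(cue.reshape(-1, 1))         # x_tl after quantization

    def unigram_feature(b, y):
        """Binary state feature of the form I(x_tl = b and y_t = y), cf. (4.43)."""
        return [int(bins[t] == b and labels[t] == y) for t in range(len(bins))]

    def bigram_feature(b, y, y_prev):
        """Binary transition feature I(x_tl = b and y_t = y and y_{t-1} = y_prev), cf. (4.44)."""
        return [int(bins[t] == b and labels[t] == y and labels[t - 1] == y_prev)
                for t in range(1, len(bins))]

    print(sum(unigram_feature(b=0, y=1)), sum(bigram_feature(b=0, y=1, y_prev=0)))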
I mitigated over-fitting the training data by including an L2 penalty in the training
objective function. Using an L1 penalty instead reduced the ASR confidence estima-
tion performance slightly. I also omitted any features which occurred less than 10
times in the training data.
The next section describes the test data set used in this chapter and its associated in-domain language model.
4.4.3 The 2012 US Presidential Debate Data Set
I downloaded the audio and human-generated transcriptions for the first 2012 United States Presidential debate between President Barack Obama (BO) and Governor Mitt Romney (MR) from the National Public Radio (NPR) website (Footnote 4). This debate was held on October 3, 2012 at the Magness Arena of the University of Denver in Denver, Colorado and was hosted by Mr. Jim Lehrer (JL). The audio is approximately 90 minutes long. I did not compose the test set using held-out audio from the WSJ, HUB4, ICSI or other well-known ASR data sets because I wanted to investigate the benefits of diversity in the ASR ensemble on a totally new domain.

I performed speaker diarization on this audio using the system from [177], which uses voice activity detection followed by correlation-based segmentation and hierarchical clustering. This diarization system generated three clusters corresponding to JL, BO, and MR. I did not manually correct the obtained cluster labels. I then segmented the audio into clips up to 10 sec long for faster decoding within memory constraints. The number of utterances for JL, BO, and MR were 107, 282, and 253, respectively. I cleaned the reference debate transcripts by removing punctuation marks and mapping numbers to words (e.g. "$5 trillion" to "five trillion dollars"). I did not add any out-of-vocabulary (OOV) words from the transcripts to the ASR pronunciation dictionary, to minimize any data set-specific processing.

Text data from the three standard ASR data sets used here is inappropriate for training the language models due to target domain mismatch. I thus obtained transcripts of all US Presidential and Vice-Presidential debates from 1960-2012 (Footnote 5), excluding the first Presidential debate in 2012. These transcripts contain approximately 0.5 million words and were used to train a 4-gram LM with back-off using SRILM [163]. Table 4.3 gives the WER of the 1-best (Viterbi) hypothesis for each speaker in the testing set using the various systems. I observe that the systems trained on the HUB4 broadcast news data set perform the best. The M3 systems give the lowest WER in most cases. I am making the debate audio, transcripts, and LM available here (Footnote 6).

Footnote 4: http://www.npr.org/2012/10/03/162258551/transcript-first-obama-romney-presidential-debate
Footnote 5: http://www.debates.org/index.php?page=debate-transcripts

The next section presents various experimental results using this presidential debate data set.
4.5 Results and Analysis
I start this section by presenting my experimental results for ASR system confidence
estimation and WER after N-best fusion using different variants of ROVER. I then
provide experimental validation for my proposed diversity-ROVER WER link.
4.5.1 ASR System Confidence Estimation
I generated word lattices with the Kaldi toolkit for all audio files in the US presidential debate data set using 9 different ASR systems: systems M1, M2, and M3, each trained on the WSJ, HUB4, and ICSI data sets. I then extracted the 10-best lists from the lattices along with the time boundaries and AM/LM scores for each word hypothesis. I did not pick a larger N-best list size because the lattices for a few short audio files contained less than 10 unique sentence hypotheses. I then computed the prosodic and lattice-based cues for word confidence estimation as described in Section 4.4.2. I also labeled each word as correct or incorrect using the available human-generated transcriptions. I finally trained 3 separate CRFs for confidence estimation using the labeled 10-best lists for the ASR systems trained on the 3 different data sets (HUB4, WSJ, and ICSI). Each CRF used data from all 3 ASR systems (M1, M2, and M3) for a given training data set (e.g. HUB4). I performed 4-fold cross-validation during training by leaving out 25% of each test speaker's utterances for testing and using the remaining data for training. I used the default value of 1 for the L2 penalty weight in CRF++.

I used two popular performance metrics for evaluating the performance of the soft confidence score estimates given the true word error labels. The first one is the equal error rate (EER), which is the false alarm rate or the miss rate at the confidence score threshold where the false alarm and miss rates are equal. The EER can be graphically computed as the false alarm rate or the miss rate at the point where the 45° line intersects the detection error tradeoff (DET) curve.

Footnote 6: sail.usc.edu/data.php
I used normalized cross-entropy (NCE) as the second performance metric. I computed it as

NCE = 1 − H_{cond}/H_{base}    (4.45)

where

H_{cond} = − Σ_{t=1}^{T} [ y_t log p(y = 1|x_t) + (1 − y_t) log p(y = 0|x_t) ]    (4.46)

H_{base} = − T_1 log(T_1/T) − (T − T_1) log((T − T_1)/T).    (4.47)

T is the total number of word hypotheses, T_1 is the number of correct words, and y_t is the label of the t-th word (1 for correct and 0 for incorrect). A small value of H_{cond} indicates that the pdfs of the true error labels and the estimated confidence scores are close. Division by H_{base} normalizes for the chance case when all the word confidence scores equal the prior probability of the word being correct. NCE thus increases with better confidence score estimates.
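Both metrics can be computed directly from the per-word confidence scores and the 0/1 correctness labels. The sketch below is added for illustration; it follows the definitions in (4.45)-(4.47), finds the EER with a simple threshold sweep rather than from the DET-curve intersection, and evaluates synthetic scores rather than my real CRF outputs.

    import numpy as np

    def nce(conf, labels):
        """Normalized cross-entropy of confidence scores against 0/1 labels, (4.45)-(4.47)."""
        conf = np.clip(conf, 1e-12, 1 - 1e-12)
        T, T1 = len(labels), labels.sum()
        h_cond = -np.sum(labels * np.log(conf) + (1 - labels) * np.log(1 - conf))
        h_base = -T1 * np.log(T1 / T) - (T - T1) * np.log((T - T1) / T)
        return 1.0 - h_cond / h_base

    def eer(conf, labels):
        """Equal error rate: threshold where the false alarm and miss rates (approximately) meet."""
        best = None
        for th in np.unique(conf):
            accept = conf >= th                                 # words accepted as correct
            fa   = np.mean(accept[labels == 0])                 # false alarm rate on error words
            miss = np.mean(~accept[labels == 1])                # miss rate on correct words
            if best is None or abs(fa - miss) < best[0]:
                best = (abs(fa - miss), 0.5 * (fa + miss))
        return best[1]

    rng = np.random.default_rng(4)
    labels = rng.integers(2, size=5000).astype(float)
    conf = np.clip(0.4 * labels + 0.6 * rng.random(5000), 0, 1)   # noisy but informative scores
    print(f"NCE = {nce(conf, labels):.3f}, EER = {eer(conf, labels):.3f}")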
Normalized Cross-Entropy (NCE)
       HUB4                        WSJ                         ICSI
       MR      BO      JL          MR      BO      JL          MR      BO      JL
M1     0.3735  0.3884  0.3776      0.3387  0.316   0.3181      0.3105  0.3398  0.276
M2     0.4682  0.5344  0.4306      0.4502  0.4791  0.4191      0.4032  0.4573  0.3928
M3     0.4628  0.5301  0.4321      0.4508  0.473   0.4194      0.3994  0.4565  0.3692

Table 4.4: Table 4.4 summarizes the testing set normalized cross-entropy (NCE) for ASR system confidence estimation. Higher values of NCE indicate better ASR confidence estimates. Perfect ASR confidence estimates give an NCE of 1.
Table 4.4 shows the NCE values and Table 4.5 shows the EER values for the various ASR systems and test set speakers. I obtain an average NCE of 0.41 and an average EER of 0.17 over all the test cases. I next evaluate the utility of the estimated ASR confidence scores by using them in N-best ROVER.
Equal Error Rate (EER)
       HUB4                        WSJ                         ICSI
       MR      BO      JL          MR      BO      JL          MR      BO      JL
M1     0.1807  0.1814  0.1786      0.2     0.1946  0.2034      0.2101  0.2006  0.21
M2     0.15    0.1334  0.1721      0.1621  0.1574  0.1747      0.1759  0.1672  0.1801
M3     0.155   0.1334  0.1676      0.1563  0.1597  0.1789      0.1764  0.1624  0.1872

Table 4.5: Table 4.5 summarizes the testing set equal error rate (EER) for ASR system confidence estimation. Lower values of EER indicate better ASR confidence estimates. Perfect ASR confidence estimates give an EER of 0.
4.5.2 ROVER WER
I used the ROVER algorithm to perform fusion of the 10-best lists obtained from the different ASR systems described in Section 4.4.1. I adopted three strategies for this fusion, which allowed me to test the diversity-ROVER WER link across different variants of the fusion rule.
Word Frequency-based ROVER
The traditional word frequency-based ROVER was the first fusion scheme. It results from giving all the weight to the average word frequency vector w_i^{avg} for each cohort set C_i in (4.8) by setting α = 1. The first block of rows in Table 4.6 shows the WERs obtained after 10-best fusion for different systems. I observe a reduction in WER with respect to the 1-best WER in Table 4.3 for 19 out of the 27 cases for 10-best fusion within each model (M1, M2, and M3) and within each training set (HUB4, WSJ, and ICSI). I next compare the WERs for the M1+M2+M3 row with the WERs of the individual systems and observe that the ROVER WER is lower than the best component system's WER in 5 out of 9 cases. I observe that 10-best ROVER out-performs the WER of the best component system in 2 out of 9 cases when fusing across training sets in the last column block under WSJ+HUB4+ICSI. The WERs for fusion across models and training sets show a similar trend. This leads me to conclude that the word frequency-based ROVER algorithm may not always reduce the WER beyond the WER of the best component ASR system. My experiments in Section 4.5.3 give a possible reason for this observation by showing that this variant of ROVER is unable to effectively utilize the inherent diversity in the N-best list.
Oracle Confidence-based ROVER
I next evaluated the WER of the oracle confidence-weighted ROVER algorithm to find the lower bound on the WER after N-best fusion. Each hypothesis word in the N-best list is assigned a confidence score of 1 if it is correct with respect to the human-generated transcriptions and 0 otherwise. I used this oracle confidence score in (4.8) and tuned the trade-off parameter α using cross-validation. The second block of rows in Table 4.6 shows the WER after oracle confidence-weighted ROVER for various cases. I observe a significant reduction in WER with respect to the unweighted ROVER WER for all cases. The best performance is obtained when the 10-best list is composed of decoded sentences from all three models and ASR systems trained on all three training data sets (Footnote 7). This is intuitive because such a 10-best list is very diverse and each sentence hypothesis makes complementary errors. The oracle confidence-weighted ROVER fusion rule is expected to utilize this diversity much better than simple word frequency-based ROVER. I illustrate this point through further experiments using the ambiguity decomposition in Section 4.5.3.
CRF Confidence-based ROVER
The final fusion rule uses the confidence scores generated by the CRF-based system described in Section 4.4.2 for ROVER fusion. I again tuned the trade-off parameter α using cross-validation. The third (last) block of rows in Table 4.6 shows the WER after using the CRF confidence scores for ROVER fusion. I observe that the CRF confidence-weighted ROVER reduces WER beyond the unweighted ROVER in 40 out of all 48 cases. This reduction in WER ranges from 0.1% to 1.5% (absolute). This indicates that while using the CRF confidence scores is better than word frequency-based ROVER, there is still a big gap in WER with respect to the lowest achievable WER after ROVER fusion using the oracle confidence scores. This is despite the acceptable values of EER and NCE for the ASR confidence estimation, as shown in Tables 4.5 and 4.4, respectively. I thus expect the CRF confidence-weighted ROVER to utilize N-best diversity slightly better than the traditional variant but worse than the oracle ROVER. The experiments in the next section confirm this assertion.

Footnote 7: I have highlighted the corresponding WERs in bold font in Table 4.6.
4.5.3 Analysis of Diversity-ROVER WER Link
This section presents my experiments to validate the theoretical link between N-best diversity and ROVER WER presented in Section 4.3. I use Theorems 4.1 and 4.2 to derive my proposed link between diversity and ROVER WER in Theorem 4.3. The proof of Theorem 4.3 simply requires substituting Theorem 4.1 in Theorem 4.2. Hence, I first empirically validate Theorems 4.1 and 4.2. I generated confusion networks for the combined N-best list and reference sentence hypothesis over all system combination variations on the test set in Table 4.6.

Figures 4.2(a) and 4.2(b) show the PDFs of the error between the left-hand (approximate ROVER WER) and right-hand sides of the two bounds in Theorem 4.1. The PDF of the error for the upper bound should ideally have all its density supported on the negative real axis, while the PDF of the lower-bound error should be supported on the positive real axis. Figures 4.2(a) and 4.2(b) show that this is true for an overwhelming majority of the test instances (≥ 93%). However, Theorem 4.1 assumes that the word errors are IID, which is not the case in practice. Hence, the bounds are violated for a rare minority of the test instances, as indicated by the area under the two PDFs on their respective wrong sides of 0.

Figures 4.3(a) and 4.3(b) show the scatter plot between the approximate ROVER WER in Theorem 4.2 and the difference of the average N-best WER and the N-best diversity. Theorem 4.2 says that both these quantities are equal. This is evident from Figures 4.3(a) and 4.3(b) for all test instances across all system combination scenarios in Table 4.6.
To gain further insight into the nature of my proposed diversity-ROVER WER link, I performed more experiments on Theorem 4.3. This theorem relates the ROVER WER to the average N-best WER and the N-best diversity through upper and lower bounds. In order to compensate for the error in these bounds and provide interpretability to the experiments, I instead consider the following version of the decomposition for a given utterance with N cohort sets:

(1/N) Σ_{i=1}^{N} E(r_i, h_i^*) ≈ (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E_{approx}(r_i, h_i^m) − γ (1/(NM)) Σ_{i=1}^{N} Σ_{m=1}^{M} E_{approx}(h_i^*, h_i^m)    (4.48)

(ROVER WER ≈ Avg. N-best WER − γ × N-best Diversity)
where γ is a parameter that I learn through least squares regression. γ also provides additional interpretation: it equals the decrease in ROVER WER with a unit increase in N-best diversity, keeping the average N-best WER constant. It thus denotes the degree to which a given ROVER fusion rule utilizes the diversity in the N-best list. Higher values of γ indicate a more diversity-sensitive fusion rule.
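Concretely, γ can be estimated by least squares from per-utterance statistics. The sketch below is added for illustration and uses synthetic per-utterance numbers; my actual experiments compute these quantities from the ROVER confusion networks. Here the coefficient of the average N-best WER is fixed at 1, which is one simple way of fitting (4.48).

    import numpy as np

    rng = np.random.default_rng(5)

    # Synthetic per-utterance statistics: average N-best WER, N-best diversity, and ROVER WER (in %)
    n_utts = 300
    avg_wer   = 30 + 10 * rng.random(n_utts)
    diversity = 5 * rng.random(n_utts)
    rover_wer = avg_wer - 0.6 * diversity + rng.normal(scale=1.0, size=n_utts)

    # Fit gamma in (4.48): ROVER WER ~ avg N-best WER - gamma * diversity,
    # i.e. regress (avg_wer - rover_wer) on diversity through the origin.
    gamma = np.dot(diversity, avg_wer - rover_wer) / np.dot(diversity, diversity)
    pred  = avg_wer - gamma * diversity

    rmse = np.sqrt(np.mean((pred - rover_wer) ** 2))
    corr = np.corrcoef(pred, rover_wer)[0, 1]
    print(f"gamma = {gamma:.2f}, RMSE = {rmse:.2f}, correlation = {corr:.3f}")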
Figures 4.4 and 4.5 show the scatter plots of the ROVER WER with the average N-best WER and the N-best diversity, respectively, using the oracle confidence scores. As expected, the ROVER WER reduces with decreasing average WER of the N-best list in Figure 4.4. It is interesting to note from Figure 4.5 that the ROVER WER shows a decreasing trend with increasing diversity of the N-best list. Both Figures 4.4 and 4.5, however, show only a moderately linear correlation.

I next performed least squares linear regression on the ROVER WER using both the average N-best WER and the N-best diversity to check the proposed approximate decomposition in (4.48). Figure 4.6 shows the resulting scatter plot. The correlation coefficient is significantly higher than in Figures 4.4 and 4.5. I also obtain an appreciably small RMSE of 2.89. These results highlight the accuracy of the proposed approximate decomposition for the ROVER WER in (4.48). The estimated γ of 0.56 shows that the ROVER WER decreases by 0.56% (absolute) with a unit increase in the N-best diversity, keeping the average N-best WER constant.

Figures 4.4 and 4.5 also highlight the trade-off between average N-best WER and N-best diversity predicted by the ambiguity decomposition. I consider the N-best list produced by the M3 model trained on the HUB4 data set, and a combination of the N-best lists from the M1, M2, and M3 models trained on the same data set. Both have identical ROVER WERs close to 30%. Figure 4.4 shows that the M3 N-best list has a significantly lower average WER than the combined M1+M2+M3 N-best list. This is intuitive because the sentence hypotheses from the M1 and M2 models have a higher WER than the hypotheses from the M3 models. However, Figure 4.5 shows that the M1+M2+M3 N-best list also has a much higher N-best diversity than the M3 N-best list. The approximate ambiguity decomposition in (4.48) says that this higher diversity compensates for the higher average N-best WER of the M1+M2+M3 N-best list, and the fused hypotheses have similar WERs in the two cases, as is apparent from Figure 4.6.

I next present statistical significance tests over the entire test set for the three speakers. Table 4.7 shows the median per-utterance correlation coefficient for each speaker between the ROVER WER and the right-hand side of (4.48), with the optimal γ found using least squares regression. I present results for the three ROVER fusion rules - word frequency-based (α = 1), oracle confidence-based, and CRF confidence-based. The α for the latter two rules were found using cross-validation and are the same as in Section 4.5.2. I observe that all correlation coefficients are close to 0.9 and significant at the 10% level using a bootstrap confidence interval. This indicates that the proposed approximate decomposition accurately predicts the true ROVER WER.
I next compare the three ROVER fusion rules with respect to their sensitivity to N-best diversity. The WER results in Section 4.5.2 showed that ROVER with oracle confidence scores gave the lowest WER after fusion. This was followed by the CRF confidence-weighted ROVER. The word frequency-based ROVER gave the least improvement over the 1-best WER. I now provide an explanation for this observation based on the proposed decomposition. Table 4.8 shows the median per-utterance γ for the three ROVER fusion rules across speakers in the test set. (4.48) says that γ is the sensitivity of the fusion rule to diversity in the N-best list. More sensitive fusion rules are expected to have a higher γ and thus utilize the N-best diversity better, leading to a lower WER upon ROVER fusion.

Table 4.8 shows that the oracle confidence ROVER has the highest median γ for all test speakers out of the three fusion strategies. This is intuitive because it uses the true word error label as a confidence score and is thus able to give low emphasis to erroneous words in each confusion bin during fusion. This leads to better utilization of the diversity or complementarity in the N-best list. The CRF confidence-weighted ROVER gives the next highest γ because the confidence scores generated by the CRF model are not perfect, as indicated by the EER and NCE results in Section 4.5.1. Hence some words with high confidence might actually be incorrect and may thus degrade ROVER fusion performance. These errors in ASR confidence estimation thus prevent ROVER from taking full advantage of the inherent diversity in the N-best list. The word frequency-based ROVER weighs each system equally during fusion and is thus totally oblivious to word errors. Hence it gives the highest WERs out of the three schemes.

I have established and investigated my proposed theoretical link between ASR diversity and ROVER WER in this section through several experiments. The next section gives a general discriminative framework for jointly training diverse ASR systems which utilizes the decomposition presented in this chapter. I also show that
some recent approaches for training diverse ASR systems are special cases of my framework.

4.6 A Unified Discriminative Approach for Jointly Training Diverse ASR Systems
Prior work has used intuitive algorithmic modifications to train diverse ASR systems, as discussed in Section 4.1. However, some recent works have also focused on explicit discriminative training algorithms for this purpose [27], [45], [164]. The proposed theoretical link between diversity and ROVER WER unifies these approaches and motivates a principled approach for jointly training diverse ASR systems in a discriminative fashion.
Let the M ASR systems have parameters Θ = {Θ^1, ..., Θ^M}. Consider a given training data set of T audio files with observed acoustic feature vector sequences O = {O_1, ..., O_T} and reference word sequences W = {W_1, ..., W_T}. Let the M ASR systems produce word hypothesis sequences {H_t^1, ..., H_t^M} for O_t. I consider a multi-system version of the minimum Bayes risk (MBR) training objective function [50] because it generalizes other popular discriminative training objective functions (Footnote 8), such as the ones used in maximum mutual information (MMI) [133], minimum word error (MWE) [134], and minimum phone error (MPE) training. The joint MBR optimization problem minimizes the expected WER of the ROVER word hypothesis sequence H_t^*(α) obtained by combining the M hypotheses for O_t. This optimization problem is

Θ^* = argmin_Θ Σ_{t=1}^{T} Σ_{H_t^1} ... Σ_{H_t^M} E(W_t, H_t^*(α); Θ) × P(H_t^1, ..., H_t^M | O_t; Θ)    (4.49)

where E(·) computes the WER between two sentence hypotheses and P(H_t^1, ..., H_t^M | O_t; Θ) is the joint probability density function (pdf) of the sentence hypotheses from the M ASR systems. Thus E(·) equals the average over each audio file of the confusion bin WERs E(·) which I used in the prior sections. The above optimization problem is difficult to solve because of the expectation with respect to the joint pdf of hypotheses from all M ASR systems and the use of the ROVER hypothesis H_t^*(α). I now show how the proposed decomposition in this chapter considerably simplifies the MBR objective function in (4.49).

Footnote 8: Heigold et al. [72] provide an excellent overview of various discriminative training algorithms for ASR.
I first upper-bound the WER of the ROVER sentence hypothesis H_t^*(α) by the average WER of the sentence hypotheses and the diversity, using Theorem 4.3:

E(W_t, H_t^*(α); Θ) ≤ (1/M) Σ_{m=1}^{M} E(W_t, H_t^m; Θ^m) − (γ/M) Σ_{m=1}^{M} E(H_t^*(α), H_t^m; Θ^m) + constants.    (4.50)

The constants in the above equation depend on the ROVER parameter α and on M as discussed in Section 4.3, and are independent of Θ. I hence ignore them. The trade-off parameter γ is non-negative. I note that the diversity term still depends on the ROVER hypothesis H_t^*(α), which makes joint training difficult because I need to sum over all possible hypothesis sequences from all M ASR systems in (4.49). However, E is just the Levenshtein string metric, and I use the triangle inequality for any pair of hypothesis sentences H_t^m and H_t^n:

E(H_t^n, H_t^m; Θ^m, Θ^n) ≤ E(H_t^*(α), H_t^m; Θ^m) + E(H_t^*(α), H_t^n; Θ^n)    ∀ m, n ∈ {1, ..., M}.    (4.51)

Adding the above inequalities over all possible unique pairs of hypotheses gives the following pairwise lower-bound on the diversity term:

Σ_{m=1}^{M} E(H_t^*(α), H_t^m; Θ^m) ≥ (1/(M−1)) Σ_{m=1}^{M} Σ_{n=m+1}^{M} E(H_t^m, H_t^n; Θ^m, Θ^n).    (4.52)

Hence the upper-bound on the WER of the ROVER hypothesis in (4.50) can be relaxed to the following upper-bound that is independent of the ROVER hypothesis H_t^*(α):

E(W_t, H_t^*(α); Θ) ≤ (1/M) Σ_{m=1}^{M} E(W_t, H_t^m; Θ^m) − (γ/(M(M−1))) Σ_{m=1}^{M} Σ_{n=m+1}^{M} E(H_t^m, H_t^n; Θ^m, Θ^n).    (4.53)

The above bound enables me to marginalize the joint pdf in the diversity term and gives the following relaxation of the original joint MBR problem in (4.49):

Θ^* = argmin_Θ  Σ_{m=1}^{M} [ Σ_{t=1}^{T} Σ_{H_t^m} E(W_t, H_t^m; Θ^m) P(H_t^m | O_t, Θ^m) ]
      − (γ/(M−1)) Σ_{m=1}^{M} Σ_{n=m+1}^{M} [ Σ_{t=1}^{T} Σ_{H_t^m} Σ_{H_t^n} E(H_t^m, H_t^n; Θ^m, Θ^n) × P(H_t^m, H_t^n | O_t, Θ^m, Θ^n) ].    (4.54)
The first term in the above objective function is proportional to the average Bayes risk of all M ASR systems, and the second term is proportional to the average pairwise Bayes risk of the systems. This latter diversity term is significantly easier to compute than the joint diversity term in (4.49) because it involves a sum over all hypotheses from pairs of ASR systems.

Setting γ = 0 in (4.54) leads to disjoint MBR training of the M ASR systems without any explicit diversity criterion, though the systems can be made diverse through implicit techniques such as randomized decision trees [155]. For γ > 0, the optimization problem in (4.54) can be solved iteratively over each ASR system. Thus the parameters for system m are estimated such that they have a low Bayes risk with respect to the reference transcriptions and a high average Bayes risk, or diversity, from all other (M−1) ASR systems.
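To make the pairwise diversity term in (4.54) concrete, the sketch below (an illustrative approximation added here; it uses small invented N-best lists with assumed independent hypothesis posteriors rather than lattice-based expectations) estimates the expected pairwise Levenshtein distance between the hypotheses of two systems for a single utterance.

    import itertools

    def lev(a, b):
        """Word-level Levenshtein distance with unit costs (same recursion as the Section 4.2 sketch)."""
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            curr = [i]
            for j, wb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (wa != wb)))   # substitution or match
            prev = curr
        return prev[-1]

    # N-best hypotheses (word lists) with posterior probabilities for two systems on one utterance
    nbest_m = [("the cat sat".split(), 0.6), ("a cat sat".split(), 0.4)]
    nbest_n = [("the cat sat down".split(), 0.7), ("the cap sat".split(), 0.3)]

    # Approximate E{E(H_t^m, H_t^n)} assuming the two systems' hypothesis posteriors are independent
    expected = sum(p_a * p_b * lev(h_a, h_b)
                   for (h_a, p_a), (h_b, p_b) in itertools.product(nbest_m, nbest_n))
    print(expected)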
The proposed joint optimization approach is different from the MBR leveraging (MBRL) algorithm by Breslin and Gales [27]. MBRL is a sequential training algorithm where the m-th ASR system is trained with respect to the previously trained (m−1) systems. The confusion networks from the previous (m−1) systems are aligned with the reference to find incorrectly predicted words, which are then assigned a higher loss E while training the m-th ASR system. The proposed objective function in (4.54) is an upper-bound on the Bayes risk of the ROVER hypothesis. Jointly training the M ASR systems via (4.54) thus directly impacts the average fusion performance.

Tachioka and Watanabe [164] propose an MMI-based discriminative training criterion for training diverse ASR systems. The m-th system is trained by maximizing the mutual information with the correct word sequence and minimizing the average mutual information with the hypothesis word sequences of the (m−1) base systems. This is a special case of my ambiguity decomposition-based formulation in (4.54) because the MBR objective with a 0/1 error function E reduces to the MMI objective. The proposed approach also similarly reduces to the complementary phone error (CPE) formulation by Diehl and Woodland [45] when the ASR systems are trained sequentially and only the 1-best hypothesis transcriptions from the previous (m−1) systems are used for computing the diversity, instead of taking an expectation over all possible hypothesis sequences as in (4.54).
4.7 Conclusion and Future Work
This chapter presented a theoretical basis for the link between the WER of the fused hypothesis generated by the ROVER algorithm and the diversity of the constituent ASR systems. I draw upon the ensemble methods literature in machine learning for my purpose by first presenting a vector space model for ROVER fusion. This enables me to approximately decompose the WER of the ROVER output in terms of the average WER of the sentence hypotheses being combined and the diversity of the N-best list. This decomposition gives a natural definition of N-best diversity - the spread of the individual sentence hypotheses about the ROVER output in the proposed vector space model. It also highlights the trade-off between average N-best WER and diversity. Sentence hypotheses with high WER can lower the ROVER WER provided they are diverse and the fusion rule is able to take advantage of them. I also refine my proposed approximate decomposition through upper and lower bounds on the error rate.

I next present experimental evidence for the accuracy of the proposed decomposition using multiple ASR systems trained with the Kaldi toolkit and multiple ROVER fusion schemes. The experiments also provide insights into the different ROVER WERs of the different fusion schemes based on the decomposition. Using the true (oracle) word error label as the confidence score leads to the lowest WER upon fusion because it is more sensitive to diversity in the N-best list, as indicated by a higher γ in (4.48). This is intuitively followed by ROVER using CRF-generated confidence scores and ROVER using no confidence scores (only word frequencies), in order of increasing ROVER WER. I also present a unified minimum Bayes risk (MBR) approach for systematically training diverse AMs using the proposed decomposition. This discriminative training framework generalizes several recent attempts at incorporating diversity in the AM training objective function.

There are several interesting implications of this work. First and foremost, it provides theoretical insight into the often-observed empirical benefit of fusing diverse ASR systems using ROVER. This is especially important given the prevalence of real-world systems involving multiple ASR systems. Second, I believe that this work can motivate more principled approaches for training diverse ASR systems. Recent work on minimum Bayes risk leveraging [27], complementary phone error training [45], and diverse MMI training [164] for ASR systems are steps in the right direction (Footnote 9). I also believe that the proposed results can be useful in non-ASR contexts, such as ROVER fusion of transcriptions from multiple human annotators in a crowd-sourcing setting.

Footnote 9: [11] presents a similar approach for training diverse maximum entropy models.
                  HUB4                 WSJ                  ICSI                 WSJ + HUB4 + ICSI
                  MR    BO    JL       MR    BO    JL       MR    BO    JL       MR    BO    JL
10-best Unweighted ROVER WERs
M1                31.7  33.9  45.6     39.2  47.7  55.3     39.8  44.1  53.6     31.7  34.3  47.3
M2                29    30.4  44       37    44.2  58.7     35.3  41.1  54.6     29    31.6  44.8
M3                28.2  29.7  43.4     35.8  42.8  58.2     33.9  39.4  52.8     27.9  30.4  43
M1 + M2 + M3      28    30    43.5     35.8  41.6  55       34.3  39.2  50.7     29.5  33.8  45.2
10-best Oracle Confidence-Weighted ROVER WERs (α = 0.65 after cross-validation)
M1                28.7  31.6  43.2     36.2  45.1  52.6     37    41.8  50.9     26.3  30.3  41.5
M2                25.9  28.2  41.4     34.4  41.7  56.5     32.5  38.7  51.3     23.9  27.5  40.8
M3                24.9  27.1  40.8     33.2  40.7  56.0     30.6  36.9  50.9     22.8  25.9  39.7
M1 + M2 + M3      24.8  26.9  39.9     30.7  37.6  49.5     29.6  35.5  47.1     22.6  25.2  38.2
10-best CRF Confidence-Weighted ROVER WERs (α = 0.8 for #, 0.9 for *, and 0.85 for all other cases, after cross-validation)
M1                31.6  33.9  45.8     39.1  47.6  55       39.7  44.1  53.2     31.6  34.1  46.9
M2                29    30.3  43.9     36.9  44    58.5     35.1  41    54.6     28.6  31.3  45.3
M3                28    29.6  43.7     35.7  42.7  58       33.7  39.2  53       27.3  29.6  42.4
M1 + M2 + M3      27.9* 29.8* 43.2*    35.5* 41.3* 55*      33.9* 39.1* 50.6*    28.4# 32.3# 45.2#

Table 4.6: Table 4.6 summarizes the testing set WERs for various ASR systems after 10-best ROVER under various conditions. '+' denotes fusion of N-best lists across training data sets and/or systems. E.g. the M1+M2+M3 row for the WSJ data set indicates that I fused the top-3 hypotheses from the M1, M2, and M3 models before performing ROVER. Oracle confidence-based ROVER performs appreciably better than unweighted ROVER, while the CRF confidence-based ROVER gives a minor improvement.
105
−1 −0.5 0 0.5 1
0
0.02
0.04
0.06
0.08
PDF
Upper bound violations: 3.86%, Lower bound violations: 5.15%
Thm. 1 Upper Bound Error
Thm. 1 Lower Bound Error
−1 −0.5 0 0.5 1
0
0.2
0.4
0.6
0.8
1
1
N
P
N
i=1
E
approx
(r
i
,h
avg
i
) - Upper/Lower Bound
CDF (for lower bound)
1−CDF (for upper bound)
Thm. 1 Upper Bound Error
Thm. 1 Lower Bound Error
(a) α =1.00
−1 −0.5 0 0.5 1
0
0.02
0.04
0.06
PDF
Upper bound violations: 4.20%, Lower bound violations: 6.49%
Thm. 1 Upper Bound Error
Thm. 1 Lower Bound Error
−1 −0.5 0 0.5 1
0
0.2
0.4
0.6
0.8
1
1
N
P
N
i=1
E
approx
(r
i
,h
avg
i
) - Upper/Lower Bound
CDF (for lower bound)
1−CDF (for upper bound)
Thm. 1 Upper Bound Error
Thm. 1 Lower Bound Error
(b) Optimal α tuned on development set
Figure 4.2: This figure shows the probability density function (pdf) and cumulative
distribution function (cdf) for the error in the bounds in Theorem 4.1. I observe
that the bounds hold for an appreciably high fraction (≥ 93%) of the decoded test
files over all system combination variations in Table 4.6. The fraction of the few
bound violations is denoted by the height of the curves in the two bottom figures
at the point of intersection with the black 0 line. These violations occur because
Theorem 4.1 assumes word errors to be IID, which is not the case in practice.
106
0 1 2 3 4 5
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Avg. N−best WER − N−best Diversity
Approx. ROVER WER
Density of points
10
20
30
40
50
60
(a) α =1.00
0 0.5 1 1.5 2 2.5 3 3.5
0
0.5
1
1.5
2
2.5
3
3.5
Avg. N−best WER − N−best Diversity
Approx. ROVER WER
Density of points
10
20
30
40
50
60
(b) Optimal α tuned on development set
Figure4.3: ThisfigureshowsthescatterplotbetweentheapproximateROVERWER
and the difference of the average N-best WER and N-best diversity for all decoded
test files over all system combination variations in Table 4.6. Theorem 4.2 says that
the approximate ROVER WERequals the difference ofthe average N-best WER and
the N-best diversity. The above plots illustrate this because all the points lie on the
45
o
line.
107
34 36 38 40 42 44 46 48 50
28
30
32
34
36
38
40
42
44
46
WSJ, M1
WSJ, M2
WSJ, M3
WSJ, M1+M2+M3
HUB4, M1
HUB4, M2
HUB4, M3
HUB4, M1+M2+M3
ICSI, M1
ICSI, M2
ICSI, M3
ICSI, M1+M2+M3
WSJ+HUB4+ICSI, M1
WSJ+HUB4+ICSI, M2
WSJ+HUB4+ICSI, M3
WSJ+HUB4+ICSI, M1+M2+M3
Avg. N−best WER
ROVER WER
All Test Utterances (RMSE = 7.5618, Correlation Coefficient = 0.7178)
Figure 4.4: This figure shows the scatter plot between ROVER WER and average
N-best WER of different 10-best lists combined using the oracle confidence-weighted
ROVER algorithm. I have averaged the WERs across all test utterances and ignored
any impact of N-best diversity in this plot. I observe a correlation coefficient of 0.718
and root mean-squared error (RMSE) of 7.562between average N-best WER and the
ROVER WER. Hence in general, ROVER WER decreases as the ASR systems being
combined become more accurate.
Word frequency-based ROVER
α MR BO JL
1.00 0.898 0.902 0.869
(0.539,0.968) (0.524,0.971) (0.285,0.966)
Oracle Confidence ROVER
α MR BO JL
0.65 0.916 0.912 0.868
(0.607,0.980) (0.604,0.976) (0.297,0.970)
CRF Confidence ROVER
α MR BO JL
0.85 0.889 0.897 0.866
(0.501,0.966) (0.556,0.965) (0.251,0.967)
Table 4.7: Table 4.7 shows the median per-utterance correlation coefficients between
the ROVER WER and its optimal approximation in (4.48). This table also contains
the 90% bootstrap confidence intervals for the correlation coefficients. I observe that
all correlation coefficients are close to 0.9 and significant at the 10% level.
108
5 10 15 20 25 30
28
30
32
34
36
38
40
42
44
46
WSJ, M1
WSJ, M2
WSJ, M3
WSJ, M1+M2+M3
HUB4, M1
HUB4, M2
HUB4, M3
HUB4, M1+M2+M3
ICSI, M1
ICSI, M2
ICSI, M3
ICSI, M1+M2+M3
WSJ+HUB4+ICSI, M1
WSJ+HUB4+ICSI, M2
WSJ+HUB4+ICSI, M3
WSJ+HUB4+ICSI, M1+M2+M3
N−best Diversity
ROVER WER
All Test Utterances (RMSE = 23.3139, Correlation Coefficient = −0.5377)
Figure 4.5: This figure shows the scatter plot between ROVER WER and diversity
of different 10-best lists combined using the oracle confidence-weighted ROVER algo-
rithm. I have averaged the WERs across all test utterances and ignored any average
N-bestWER inthis plot. Iobserve acorrelationcoefficient of−0.538between N-best
diversity and the ROVER WER. Hence in general, ROVER WER decreases as the
ASR systems being combined become more diverse.
Word frequency-based ROVER
α MR BO JL
1.00 0.145 0.120 0.134
Oracle Confidence ROVER
α MR BO JL
0.65 0.508
∗#
0.468
∗#
0.404
∗#
CRF Confidence ROVER
α MR BO JL
0.85 0.186
∗
0.240
∗
0.208
∗
Table 4.8: Table 4.8 shows the median per-utterance γ in (4.48) estimated using
least squared regression between the ROVER WER and its optimal approximation.
∗
indicates that the γ is significantly higher than the corresponding γ for α =1 using
Wilcoxon’s signed rank test at the 10% significance level.
#
indicates that the γ is
significantly higher than the corresponding γ for α = 0.85 using CRF confidence-
weighted ROVER. I observe that the oracle confidence ROVER is most sensitive to
diversity in the N-best list due to its significantly higher γ, followed by the CRF
confidence ROVER and the word frequency-based ROVER.
109
30 32 34 36 38 40 42 44 46
28
30
32
34
36
38
40
42
44
46
WSJ, M1
WSJ, M2
WSJ, M3
WSJ, M1+M2+M3
HUB4, M1
HUB4, M2
HUB4, M3
HUB4, M1+M2+M3
ICSI, M1
ICSI, M2
ICSI, M3
ICSI, M1+M2+M3
WSJ+HUB4+ICSI, M1
WSJ+HUB4+ICSI, M2
WSJ+HUB4+ICSI, M3
WSJ+HUB4+ICSI, M1+M2+M3
Avg. N−best WER −0.49 N−best Diversity
ROVER WER
All Test Utterances (RMSE = 0.7842, Correlation Coefficient = 0.9910)
Figure 4.6: This figure shows the scatter plot between ROVER WER and an optimal
linear combination of average N-best WER and diversity of different 10-best lists
combined using the oracle confidence-weighted ROVER algorithm. I have averaged
the WERs across all test utterances and computed the coefficient γ = 0.49 using
least-squares linear regression. I observe a correlation coefficient of 0.991 and RMSE
of 0.784 between the optimal linear combination and the ROVER WER. Thus this
optimallinearcombinationpredictstheROVERWERbetterthantheaverageN-best
WER and diversity considered individually.
110
Chapter 5
Diversity in Ensemble Design:
Diverse Maximum Entropy Models
Diversityofaclassifierensemblehasbeenshowntobenefitoverallclassificationperfor-
mance. The previous chapters focused on modeling a diverse human/machine system
ensemble and analyzing the impact of diversity on ensemble performance. Most con-
ventionalmethodsoftrainingensembles offernocontrolontheextentofdiversity and
aremeta-learners. Thischapterpresents amethodforcreatinganensemble ofdiverse
maximum entropy (∂MaxEnt) models, which are popularin speech and languagepro-
cessing. I modify the objective function for conventional training of a MaxEnt model
such thatitsoutputposteriordistributionisdiverse withrespecttoareferencemodel.
Two diversity scores are explored – Kullback-Leibler (KL) divergence and posterior
cross-correlation. Experiments on the CoNLL-2003 Named Entity Recognition task
and the IEMOCAP emotion recognition database show the benefits of a ∂MaxEnt
ensemble.
5.1 Introduction
Ensembles of multiple experts have out-performed single experts in many pattern
classification tasks. Well-known examples include theNetflix Challenge [15], the2009
KDD Orange Cup [119] and the DARPA GALE program [144]. Dietterich [47] notes
three reasons which can explain this. First, an ensemble can potentially have lower
generalization error as compared to individual classifiers. Second, the training of
most state of the art classifiers (e.g. neural networks) involves solving a non-convex
111
optimization problem. Thus, while the individual classifiers can get stuck in local
optima, the ensemble has a better chance to come close to the global optima. Finally,
the true decision boundary for the problem at hand may be too complex for a single
classifier and an ensemble may better approximate it.
Two popular methods for training classifier ensembles are bagging (bootstrap
aggregating) [22] and AdaBoost (adaptive boosting) [60]. Consider a training set T
containing N pairs of feature vectors and target variables, {(x
n
,y
n
)}
N
n=1
. Bagging
proceeds by sampling T with replacement and creating M bootstrapped data sets
T
1
,...,T
M
. The m
th
classifier (or regressor) is then trained on T
m
. Given a test
feature vector x, results from the M experts are averaged to yield the estimated
target variable. Breiman uses a bias-variance decomposition to prove that in case of
regression, the mean squared error of the average regressor is less than or equal to
the average mean squared error over the individual regressors. The second method,
AdaBoost, works bysequentially trainingtheclassifiers intheensemble. Thetraining
dataforthem
th
classifier,T
m
, is created by weighted sampling fromT, where greater
probability mass is assigned to the instances which are misclassified by classifiers
1,...,m−1. Freund and Schapire have derived an upper bound on the training error
of the ensemble, which indicates that increasing the size of the ensemble in AdaBoost
reduces the training error towards zero. AdaBoost can also be viewed as minimizing
the exponential loss between the training and predicted label.
As noted in [94], the diversity of classifiers in an ensemble is crucial for its overall
performance. Ueda and Nakano [174] consider diversity in an ensemble of regressors
and derive a bias-variance-covariance decomposition for the average regressor’s mean
squared error. The mean squared error reduces as the pairwise diversity between
individual regressors (accounted for by the covariance term) increases. Tumer and
Ghosh [170] extend the analysis to classification by treating it as regression over the
class posteriors. The additional error of the ensemble over the Bayes optimal error is
shown to be dependent on the correlation coefficient between class posteriors.
A typical approach to introduce diversity is to use radically different classifiers
and/or feature sets. However, this does not offer explicit control on the extent of
diversity achieved. Bagging and boosting also suffer from this issue, and require
weak/unstable base classifiers for giving a substantial performance gain. Inspite of
the evidence linking diversity and ensemble performance, only a few works deal with
explicity creating diverse classifier ensembles. Negative Correlation Learning [105]
112
involves decorrelating errors from the individual neural networks as part of their
training. Another work is DECORATE [114], a meta-learner where the ensemble is
built incrementally with each successive classifier trained on a mix of artificial and
natural data. Artificial training instances are labeled contrary to the opinion of the
current ensemble.
This chapter focusses on training diverse maximum entropy (MaxEnt) models.
MaxEnt models are state of the art classifiers in many domains, especially speech
and language processing. They possess several desirable properties such as flexibility
in adding new features, scalable training, easy parameter estimation and minimal as-
sumptions about the posteriors. The next section discusses our approach for training
a diverse MaxEnt (∂MaxEnt) ensemble. I present experiments and analysis on the
CoNLL-2003 Named Entity Recognition task and the IEMOCAP emotion recogni-
tion database in section 5.3. Conclusions and scope for future work are presented in
section 5.4.
5.2 Training a ∂MaxEnt Ensemble
I first review the standard MaxEnt model to set up the notation. Let x ∈ X and
y∈Y denote the feature vector and class label respectively. The maximum entropy
principleaimstofindaprobabilitydistributionP(y|x)withmaximumentropysubject
to the following first order moment constraints for training dataT:
N
X
n=1
E
P
{f
i
(x
n
,y)}=
N
X
n=1
f
i
(x
n
,y
n
) ∀i∈{1,...,F} (5.1)
where f
i
is the i
th
feature - an arbitrary function of x and y. E
P
denotes the expec-
tation with respect to P(y|x
n
). This problem can be solved by Lagrange’s method,
and the resulting distribution is:
P
Λ
(y|x)=
exp(
P
F
i=1
λ
i
f
i
(x,y))
P
y∈Y
exp(
P
F
i=1
λ
i
f
i
(x,y))
(5.2)
113
where λ
i
are the Lagrange multipliers. The log-likelihood function of the MaxEnt
model over training dataT is
1
:
L(Λ)=
N
X
n=1
(
F
X
i=1
λ
i
f
i
(x
n
,y
n
)−logZ(x
n
)) (5.3)
where Z(x
n
) is the normalization sum in the denominator of (5.2). I note thatL(Λ)
is concave in λ
i
∀i. Hence a simple gradient ascent, Newton-Raphson or a quasi-
Newton method (such as L-BFGS [103]) can be used to find the maximum likelihood
parameter estimates. The gradient ofL(Λ) is given as:
∂L(Λ)
∂λ
i
=
N
X
n=1
f
i
(x
n
,y
n
)−E
P
{f
i
(x
n
,y)}
(5.4)
The task of this chapter is to train an ensemble of diverse MaxEnt models. I first
study the simpler case of training a MaxEnt model which fits the data well but is
diverse with respect to a reference model Q
Λ
′. A natural way to achieve this is to
introduce a diversity term in the log-likelihood function as follows:
L
tot
(Λ)=L(Λ)+αD(P
Λ
,Q
Λ
′) (5.5)
where α≥ 0 is the diversity weight andD(P
Λ
,Q
Λ
′) is the diversity between the two
models. As is noted in [95], there are multiple ways to capture diversity between two
classifiers. I use two intuitive diversity scores - the Kullback-Leibler (KL) divergence
between posterior distributions and negative posterior cross-correlation.
5.2.1 KL Divergence Diversity
The KL divergence from Q
Λ
′(y|x
n
) to P
Λ
(y|x
n
) is the following ensemble average:
KL
n
(Q
Λ
′||P
Λ
)=
X
y∈Y
Q
Λ
′(y|x
n
)log
Q
Λ
′(y|x
n
)
P
Λ
(y|x
n
)
(5.6)
I did not use KL
n
(P
Λ
||Q
Λ
′) due to difficulty in interpreting its gradient. Adding
this expectation over all instances in the training data, the modified log-likelihood
1
Penalizing this function by the L
1
and L
2
norms of Λ has been empirically shown to give
performance benefits.
114
becomes:
L
tot
(Λ) =L(Λ)+α
N
X
n=1
KL
n
(Q
Λ
′||P
Λ
). (5.7)
While L(Λ) is concave in Λ, KL
n
(Q
Λ
′||P
Λ
) is convex, attaining a minimum value
of 0 at Λ = Λ
′
. Thus the overall objective function is neither concave nor convex
and one can only hope to obtain locally optimal estimates of Λ. Furthermore, KL
divergence can potentially approach +∞, making the objective function unbounded.
The gradient ofL
tot
(Λ) can be written as:
∂L
tot
(Λ)
∂λ
i
=
N
X
n=1
f
i
(x
n
,y
n
)−[(1−α)E
P
{f
i
(x
n
,y)}+αE
Q
{f
i
(x
n
,y)}]
. (5.8)
This expression is the same as for a conventional MaxEnt model (5.4), except that a
linear combination of the feature expectation under P
Λ
and Q
Λ
′ is taken. Increasing
α has the effect of increasing the weight on the expectation from the reference model
(Q
Λ
′). While it seems that KL divergence should succeed in achieving diversity be-
tween the models, it can be easily shown that this may not be the case in practice.
Let the reference model Q
Λ
′ be trained on data setT using features{f
i
}
F
i=1
by max-
imizing L(Λ
′
). Upon convergence of its training, the gradient of L(Λ
′
) will be zero.
Hence:
N
X
n=1
E
Q
{f
i
(x
n
,y)}=
N
X
n=1
f
i
(x
n
,y
n
) ∀i∈{1,...,F}. (5.9)
If P
Λ
is trained to be diverse with respect to Q
Λ
′ by maximizing L
tot
(Λ) using the
same data and feature set, I can substitute the above equation in (5.8) and arrive at
the following result:
∂L
tot
(Λ)
∂λ
i
=(1−α)
N
X
n=1
f
i
(x
n
,y
n
)−E
P
{f
i
(x
n
,y)}
. (5.10)
Hence the gradients for a MaxEnt and∂MaxEnt are the same upto a scalar multi-
ple. At a local optima, the parameter estimates will satisfy the same constraint as in
the case of a conventional MaxEnt model. This problem with KL divergence can be
115
mitigated to some extent by using distinct training sets or features for Q
Λ
′ and P
Λ
.
However it necessitates the search for another diversity score. The next subsection
introduces posterior cross-correlation to this end.
5.2.2 Posterior Cross-Correlation (PCC) Diversity
Makingasimplisticassumption, considerindependent randomvariablesy
P
∼P
Λ
(y|x)
and y
Q
∼Q
Λ
′(y|x). The conditional probability of them being unequal is:
Pr{y
P
6=y
Q
|x} = 1−
X
y∈Y
P
Λ
(y|x)Q
Λ
′(y|x). (5.11)
Thus, negative cross-correlation between the two posterior distributions is a natural
diversity score. The modified log-likelihood function can be written as follows:
L
tot
(Λ)=L(Λ)−α
N
X
n=1
X
y∈Y
.P
Λ
(y|x
n
)Q
Λ
′(y|x
n
) (5.12)
This objective function is again neither convex nor concave. However, unlike KL
divergence, it has the following finite bounds:
min
y∈Y
Q
Λ
′(y|x
n
)≤
X
y∈Y
P
Λ
(y|x
n
)Q
Λ
′(y|x
n
)≤ max
y∈Y
Q
Λ
′(y|x
n
). (5.13)
The gradient can be shown to be equal to:
∂L
tot
(Λ)
∂λ
i
=
N
X
n=1
f
i
(x
n
,y
n
)−
N
X
n=1
[(1−Z
PQ
(x
n
)α)E
P
{f
i
(x
n
,y)}
+Z
PQ
(x
n
)αE
PQ
{f
i
(x
n
,y)}] (5.14)
where PQ
Λ,Λ
′(y|x
n
) is the normalized product distribution:
PQ
Λ,Λ
′(y|x
n
) =
Q
Λ
′(y|x
n
)P
Λ
(y|x
n
)
Z
PQ
(x
n
)
(5.15)
and Z
PQ
(x
n
) =
P
y∈Y
Q
Λ
′(y|x
n
)P
Λ
(y|x
n
) is the normalization constant. The above
gradient is similar to the one for KL divergence except for two modifications – expec-
tation with respect to the product distribution is used instead of Q
Λ
′ and the linear
116
combinationweights becomedependent ontheinstancex
n
. Thus, forinstances where
Z
PQ
(x
n
)ishigh(i.e. thecurrentandreferencemodelposteriorsarehighlycorrelated),
more weight is given to the expectation with respect to the product distribution. In
effect, themodeldeviatesmorefromtheMLestimateintheseinstances. Also,incase
ofidenticaltrainingsetsandfeaturesforthetwomodels, thegradientdoesnotreduce
to the standard MaxEnt model’s gradient. Till now, I have discussed a method to
train a MaxEnt model P
Λ
to be diverse with respect to another MaxEnt model Q
Λ
′.
The next subsection discusses one possible method in which an ensemble of M ≥ 2
∂MaxEnt models can be trained.
5.2.3 Sequential Training of a ∂MaxEnt Ensemble
Consider the training of an ensemble ofM MaxEnt classifiers P
Λ
1
,...,P
Λ
M
with corre-
sponding training sets T
1
,...,T
M
. A simple strategy is to train the ensemble sequen-
tially. Let MaxEnt(T) denote a function which trains a conventional MaxEnt model
on T and returns the parameters Λ. Let ∂MaxEnt(T ,Q
Λ
′,α,Λ
0
) denote a function
which trains a ∂MaxEnt model on T with respect to Q
Λ
′ using α as the diversity
weight and Λ
0
as the initial value of the parameters. The sequential training process
is as follows:
• Train model 1: Λ
1
= MaxEnt(T
1
).
• For m =2→M
– Initialize: Λ
0
m
= MaxEnt(T
m
).
– Interpolate models 1,...,m−1:
Q(y|x
n
) =
1
m−1
P
m−1
j=1
P
Λ
j
(y|x
n
)
∀y∈Y,n∈{1,...,|T
m
|}.
– Train model m: Λ
m
= ∂MaxEnt(T
m
,Q,α,Λ
0
m
)
Since the objective function is no longer concave, I train a ∂MaxEnt model in two
passes. The first pass finds the ML estimates of the parameters. The second pass
performs ∂MaxEnt training using the ML parameters as the starting point. This
ensures that L-BFGS converges at a local maxima which is not too far from the ML
estimate while ensuring diversity. α is tuned based on F1 score on a development
set. During the test phase, labels from all classifiers in the ensemble are fused by
117
simple plurality. More sophisticated ways of classifier fusion were not experimented
with since they are not the focus of this chapter.
5.3 Experiments and Results
The CoNLL-2003 Named Entity Recognition (NER) Task has four types of named
entities - persons, locations, organizations and miscellaneous [167]. The English task
consistsofnewswirestoriesfromtheReuterscorpusbetweenAugust1996andAugust
1997. IusedbinaryfeaturesfromStanford’sNERsystem which includewordidentity,
POS tags, word character N-grams etc [55]. Original training, development and
evaluation sets were used. Performance was measured in terms of the F1 score for
named entity detection [167].
Table5.1showstheF1scoresforensemblesof5conventionalMaxEntand∂MaxEnt
models usingthetwodiversity scores. Two cases areconsidered –when the5training
sets are identical and when they are created by bagging. I can observe that for iden-
tical training sets, KL divergence gives almost the same performance as 1 MaxEnt
model. The minute difference is due to deliberate smoothing of the posterior distri-
butions to prevent KL divergence from becoming indeterminate. On the other hand,
PCC-based∂MaxEnt modelsgiveanappreciableincreaseinperformance. Inthecase
of bagged training sets, KL divergence is able to achieve a statistically insignificant
performancegainover 5MaxEnt models. However, PCC-based∂MaxEnt modelsstill
performsignificantly better. I notethat in[189], gradient boosting with 100002-level
decision trees and Newton-Raphson optimization of the exponential loss was shown
to give a similar gain over a MaxEnt model.
Since PCC performs significantly better than KL divergence, I analyse it further.
Figure5.1showstheF1scoreontheevaluationsetwithanincreasingnumberofmod-
els (1 to 25). The performance of bagging saturates much earlier than the ∂MaxEnt
ensemble. Thus the relative performance improvement of the ∂MaxEnt ensemble in-
creases as the number of models is increased. The performance for α
e
indicates an
upper bound on the performance for the ∂MaxEnt ensemble. As a final analysis of
the ∂MaxEnt model with PCC diversity, Figure 5.2 shows the variation of the de-
velopment set F1 score and average log-likelihood for an ensemble of 5 models with
increasingα. TheF1scoreincreaseswithαuntilaroundα =1.45,afterwhichitstarts
decreasing again. Furthermore, its behaviour becomes more variable with increasing
118
Identical training sets Dev set Eval set
1 MaxEnt model 91.22 86.75
5 KL-∂MaxEnt (α
d
= 1.66) 91.31 86.73
(α
e
= 1.58) - 86.77
5 PCC-∂MaxEnt (α
d
=1.46) 91.70 87.05
(α
e
= 1.27) - 87.25
Bagged training sets
5 MaxEnt models 90.49 85.98
5 KL-∂MaxEnt (α
d
= 0.24) 90.62 86.15
(α
e
= 0.13) - 86.34
5 PCC-∂MaxEnt (α
d
=1.45) 91.21 86.74
(α
e
= 1.45) - 86.74
Table 5.1: This table shows the NER F1 scores for 5 MaxEnt and ∂MaxEnt models
using KL/PCC diversity. α
d
andα
e
denote the best values ofα tuned on the develop-
mentandevaluationset respectively. Valuesinboldindicateastatistically significant
improvement over the MaxEnt ensemble at the 5% level using McNemar’s test.
5 10 15 20 25
85.5
86
86.5
87
87.5
88
Number of models
Eval set F1 score
∂ MaxEnt Ensemble (α
d
)
∂ MaxEnt Ensemble (α
e
)
Bagged Ensemble
Figure5.1: F1scoreontheNERevaluationsetforPCC-∂MaxEnt andbaggedensem-
bles ofincreasing size. α
d
was tuned onthe development set andα
e
onthe evaluation
set.
119
0 0.5 1 1.5 2
0.9
0.905
0.91
α
Dev set F1 score
0 0.5 1 1.5 2
−0.5
−0.4
−0.3
−0.2
−0.1
0
α
Avg. dev set log−likelihood
Figure 5.2: This figure shows the F1 score and average log-likelihood for the NER
development set with increasing diversity weight α. Five ∂MaxEnt models were used
with PCC diversity on bagged data.
α because the optimization problem is become more non-concave. It is interesting
to note that the log-likelihood remains practically constant until α = 1, while the
F1 score increases significantly over the same range. The drop in log-likelihood from
α = 1 to 1.45 does not adversely impact the performance.
Next,IconductedemotionclassificationexperimentsontheIEMOCAPdatabase[30].
It is an acted, multimodal and multi-speaker database consisting of dyadic sessions
where actors are asked to elicit emotional expressions. Each session was labeled by
multiple human evaluators in terms of 4 categorical emotions - {angry, happy, sad,
Identical training sets Dev set Eval set
1 MaxEnt model 43.79 44.64
5 PCC-∂MaxEnt (α
d
=0.33) 48.73 48.10
(α
e
= 0.33) - 48.10
Bagged training sets
5 MaxEnt models 43.01 43.65
5 PCC-∂MaxEnt (α
d
=0.52) 46.74 46.09
(α
e
= 0.68) - 47.09
Table 5.2: This table shows weighted F1 scores for emotion classification with 5
models on the IEMOCAP database.
120
neutral}. The multiple labels were fused using simple plurality and sessions where
a tie occured were excluded. A total of 5498 sessions were used, and 385 acoustic-
prosodic features from the OpenSMILE toolkit [53] were extracted. These included
pitch, energy, Mel-filter bank coefficients and their per-session statistics. Table 5.2
shows the classification performance. The ∂MaxEnt model ensemble performs signifi-
cantly better than 1 MaxEnt model trained on the entire data and 5 MaxEnt models
trained on bagged data. With 25 models PCC-∂MaxEnt models, I get an additional
improvement of approximately 1-2%. This shows the benefit of using the ∂MaxEnt
ensemble on a more difficult classification task with continuous features.
5.4 Conclusion and Scope for Future Work
This chapterpresented amethodtocreatediverse ensembles ofMaxEnt models. Two
intuitive diversity scores were explored - KL divergence and negative posterior cross-
correlation. Experiments conducted on two classification tasks (the CoNLL-2003
Named Entity Recognition Task and the IEMOCAP emotion classification database)
show the advantages of training a ∂MaxEnt ensemble. It was demonstrated that
under reasonable assumptions, KL divergence achieves no gain in performance, while
posteriorcross-correlationperformssignificantlybetter. Therearemultipledirections
for future work. Introduction of a diversity term in the standard MaxEnt model
objective function made it non-concave – an undesirable property for optimization. I
need to explore ways to train diverse models while retaining concavity. Second, since
gradientboostingshowsasimilargainoveraMaxEntmodel(albeitwiththousandsof
models in the ensemble), the link between popular variants of boosting and ensemble
diversity needs to be explored. Finally, insight into the choice of diversity scores for
a given ensemble and database is required.
121
Chapter 6
Diversity in Design: Noisy
Backpropagation and Deep
Bidirectional Pre-training
Neural networks are state-of-the-art systems in several pattern recognition problems
such as automatic speech recognition, object classification from images, computer
vision, bioinformatics, and robotics. Supervised training of shallow and deep neural
networks oftenrequires thousands oflabeled datainstances andseveral hoursofCPU
time. This chapter explores a novel way to utilize diversity in pattern recognition. It
proves that diversifying the training data through noise addition can speed conver-
gence of the backpropagation algorithm for training a feed-forward neural network.
I first prove that the backpropagation algorithm is a special case of the gener-
alized Expectation-Maximization (EM) algorithm for iterative maximum likelihood
estimation. Arecent EMnoisebenefitresultthengives asufficient conditionfornoise
to speed up the backpropagation training. I show that such noise also improves the
accuracy of the trained neural network because it improves the data log-likelihood at
every iteration and since the log-likelihood approximates the classification accuracy.
I prove that the noise speed-up also applies to the deep bidirectional pre-training of
theneuralnetworkusingbidirectionalassociativememories(BAMs)ortheequivalent
restricted Boltzmann machines (RBMs). The optimal noise adds to input visible neu-
rons for BAMs and output target neurons for the backpropagation algorithm. The
geometry of the noise benefit region depends on the probability structure of these
neurons in both cases. Logistic sigmoidal neurons produce a “forbidden” noise region
122
that lies below a hyperplane. This forbidden noise region is a sphere if the neurons
have a Gaussian signal or activation function. This chapter demonstrates all noise
benefits using MNIST digit classification experiments.
6.1 Noise Benefits in Backpropagation
Thischapterprovesforthefirsttimethatnoisecanspeedconvergenceandimproveac-
curacyofthepopularbackpropagationgradient-descentalgorithmfortrainingfeedfor-
ward multilayer-perceptron neural networks [69], [88]. The proof casts backpropaga-
tionintermsofmaximum likelihoodestimation[18]andthenshows thattheiterative
backpropagation algorithm is a special case of the general Expectation-Maximization
(EM) algorithm. It then invokes the new noisy EM (NEM) theorem [121]–[123] that
gives a sufficient condition for speeding convergence in the EM algorithm. Then the
NEM result speeds convergence in the backpropagation algorithm. Figure 6.1 shows
thenoisebenefitforcrossentropytrainingofafeedforwardneuralnetwork. TheNEM
version displays a 18% median decrease in cross entropy per iteration compared to
noiseless backpropagation training. Figure 6.2 shows that adding blind noise instead
of NEM noise only gives a miniscule improvement of 1.7% in cross entropy over the
noiseless EM-BP algorithm.
I show that NEM-BP also gives better classification accuracy at each training
iteration than the noiseless EM-BP algorithm. This happens because NEM noise
improves the cross entropy at every iteration and because cross entropy is an ap-
proximation to the classification error rate. Figure 6.3 shows that NEM-BP gives a
15% median improvement in the per-iteration classification error rate for the training
set and a 10% improvement for the testing set at the optimal noise variance of 0.42.
Figure 6.4 shows that this noise benefit disappears upon using blind noise in place of
NEM noise.
This chapter further shows that a related NEM result holds for the pre-training
of the individual layers of neurons in the multilayer perceptron. These so-called
restricted Boltzmann machine (RBM) [73], [76], [157] layers are in fact simple bidi-
rectional associative memories (BAMs) [86]–[88] that undergo synchronous updating
of the neurons. They are BAMs because the neurons in contiguous layers use a con-
nection matrix M in the forward pass and the corresponding matrix transpose M
T
in the backward pass and because the neurons have no within-layer connections. The
123
general BAM convergence theorem [86]–[88] guarantees that all such rectangular ma-
trices M are bidirectionally stable for synchronous or asynchronous neuron updates
and for quite general neuronal activation nonlinearities because the RBM energy
function is a Lyapunov function for the BAM network. This gives almost immediate
convergence to a BAM fixed point after only a small number of synchronous back-
and-forth updates when both layers use logistic neurons. Figure 6.5 shows the noise
benefit for NEM training of a logistic-logistic BAM with 784 visible and 40 hidden
neurons. NEM training gives around 16% improvement in the per-iteration squared
reconstruction error over noiseless training. Figure 6.6shows that training with blind
noise does not give any significant difference.
The NEM Theorem gives a type of “forbidden” condition [121]–[123], [126], [127]
that ensures a noise speed up so long as the noise lies outside of a specified region
in the noise state space. Figures 6.7 and 6.8 show that the noise must lie outside
such regions to speed convergence. The neuron probability density function (pdf)
and network connection or synaptic weights control the geometry of this forbidden
region. Logistic neurons give the forbidden region as a half-space while Gaussian
neurons give it as a sphere.
Theorems6.3and6.4givethesufficientconditionsforanoisebenefitinthepopular
casesofneuralnetworkswithlogisticandGaussianoutputneurons. Theorems6.6and
6.7 give similar sufficient conditions for Bernoulli-Bernoulli and Gaussian-Bernoulli
BAMs. This is a type of “stochastic resonance” effect where a small amount of noise
improves the performance of a nonlinear system while too much noise harms the
system [28], [59], [62], [89], [125]–[129], [183].
Some prior research has found an approximate regularizing effect of adding white
noise to backpropagation [4], [19], [68], [110]. I instead add non-white noise that
satisfies a simple geometric condition that depends on both the network parameters
and the output activations.
The next section casts the backpropagation algorithm as ML estimation. Sec-
tion 6.3 presents the EM algorithm for neural network training and proves that it
reduces to the backpropagation algorithm. Section 6.4 reviews the NEM theorem.
Section 6.5 proves the noise benefit sufficient conditions for a neural network. Sec-
tion6.6presentsargumentsfortheobservedclassificationaccuracyimprovementfrom
the addition of NEM noise. Section 6.7 reviews RBMs or BAMs. Section 6.8 derives
sufficient conditions for a noise benefit in ML training of Bernoulli-Bernoulli and
124
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
−15
−10
−5
0
5
Noise variance
% reduction in cross−entropy NEM−BP noise benefit in training set cross−entropy over first 10 iterations
using a 5−layer neural network with 40 neurons in each hidden layer
Training set
Testing set
5 10 15 20 25 30 35 40 45
0.5
1
1.5
2
2.5
3
Training iterations
Training set cross−entropy
Training set cross−entropy for optimal noise variance 4.2e−1
EM−BP
NEM−BP
Figure 6.1: This figure shows the percent median reduction in per-iteration cross
entropy for the NEM-backpropagation (NEM-BP) training relative to the noiseless
EM-BP training of a 10-class classification neural network trained on 1000 images
from the MNIST data set. I observe a reduction in cross entropy of 18% for the
training and the testing set at the optimal noise standard deviation of 0.42. The
neural network used three logistic (sigmoidal) hidden layers with 40 neurons each.
The input layer used 784 logistic neurons and the output layer used 10 neurons with
Gibbs activation function. The bottom figure shows the training set cross entropy as
iterationsproceed forEM-BPandNEM-BP trainingusingthe optimalnoisevariance
of 0.42. The knee-point of the NEM-BP curve at iteration 4 achieves the same cross
entropy as does the noiseless EM-BP at iteration 15.
Gaussian-Bernoulli RBMs. Section 6.9 presents simulation results.
6.2 Backpropagation as Maximum Likelihood Es-
timation
This section shows that the backpropagation algorithm performs ML estimation of a
neural network’s parameters. It uses a 3-layer neural network for notational conve-
nience. All results in this chapter extend to“deep” networks with more hidden layers.
x denotes the neuron values at the input layer that consists of I neurons. a
h
is the
125
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
−1
0
1
2
Noise variance
% reduction in cross−entropy
Blind−BP noise benefit in training set cross−entropy over the first 10 iterations
using a 5−layer neural network with 40 neurons in each hidden layer
Training set
Testing set
5 10 15 20 25 30 35 40 45
0.5
1
1.5
Training iterations
Training set cross−entropy
Training set cross−entropy for optimal noise variance 5.4e−1
EM−BP
Blind−BP
Figure 6.2: This figure shows the percent median reduction in per-iteration cross
entropy for the EM-backpropagation training with blind noise (Blind-BP) relative
to the noiseless EM-BP training of a 10-class classification neural network trained
on 1000 images from the MNIST data set. I observe a marginal reduction in cross
entropy of 1.7% for the training and the testing set at the optimal noise standard
deviation of 0.54. The neural network used three logistic (sigmoidal) hidden layers
with 40 neurons each. The input layer used 784 logistic neurons and the output
layer used 10 neurons with Gibbs activation function. The bottom figure shows the
training set cross entropy as iterations proceed for EM-BP and Blind-BP training
using the optimal noise variance of 0.54. Both the blind noise EM-BP and the noise
less EM-BP give similar cross-entropies for all iterations.
vector of hidden neuron sigmoidal activations whose j
th
element is
a
h
j
=
1
1+exp
−
P
I
i=1
w
ji
x
i
=σ
I
X
i=1
w
ji
x
i
(6.1)
where w
ji
is the weight of the link that connects the i
th
visible and j
th
hidden neuron.
y denotes the K-valued target variable and t is its 1-in-K encoding. t
k
is the k
th
126
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
−10
−5
0
5
10
15
Noise variance
% reduction in
classification errors
NEM−BP noise benefit in training set classification error over first 10 iterations
using a 5−layer neural network with 40 neurons in each hidden layer
Training set
Testing set
5 10 15 20 25 30 35 40 45
0.2
0.3
0.4
0.5
0.6
Training iterations
Training set
classification error rate
Training set classification error rate for optional noise variance 4.2e−1
EM−BP
NEM−BP
Figure 6.3: This figure shows the percent median reduction in per-iteration classifi-
cation error rate for the NEM-backpropagation (NEM-BP) training relative to the
noiseless EM-BP training of a 10-class classification neural network trained on 1000
images from the MNIST data set. I observe a reduction in classification error rate of
15%forthe trainingand around10% forthetesting set atthe optimalnoise standard
deviation of 0.42. The neural network used three logistic (sigmoidal) hidden layers
with 40 neurons each. The input layer used 784 logistic neurons and the output layer
used10neuronswithGibbsactivationfunction. Thebottomfigureshowsthetraining
set classification error rate as iterations proceed for EM-BP and NEM-BP training
using the optimal noise variance of 0.42. The knee-point of the NEM-BP curve at
iteration 4 achieves the same classification error rate as does the noiseless EM-BP at
iteration 11.
output neuron’s value with activation
a
t
k
=
exp
P
J
j=1
u
kj
a
h
j
P
K
k
1
=1
exp
P
J
j=1
u
k
1
j
a
h
j
(6.2)
=p(y =k|x,Θ) (6.3)
where u
kj
is the weight of the link that connects the j
th
hidden andk
th
target neuron.
a
t
k
dependsoninputxandparametermatricesUandW. Backpropagationminimizes
127
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
0
2
4
Noise variance
% reduction in
classification errors
Blind−BP noise benefit in training set classification error over first 10 iterations
using a 5−layer neural network with 40 neurons in each hidden layer
Training set
Testing set
5 10 15 20 25 30 35 40 45
0.2
0.3
0.4
0.5
0.6
Training iterations
Training set
classification error rate
Training set classification error rate for optional noise variance 2.8e−1
EM−BP
Blind−BP
Figure 6.4: This figure shows the percent median reduction in per-iteration classifi-
cation error rate for the EM-backpropagation training with blind noise (Blind-BP)
relative to the noiseless EM-BP training of a 10-class classification neural network
trained on 1000 images from the MNIST data set. I observe a minor reduction in
classification error rate of 1% for the training and the testing set at the optimal
noise standard deviation of 0.28. The neural network used three logistic (sigmoidal)
hidden layers with 40 neurons each. The input layer used 784 logistic neurons and
the output layer used 10 neurons with Gibbs activation function. The bottom figure
shows the training set classification error rate as iterations proceed for EM-BP and
Blind-BP training using the optimal noise variance of 0.28. Both the curves show
similar classification error rates for all iterations.
the following cross entropy:
E =−
K
X
k=1
t
k
log(a
t
k
). (6.4)
The cross entropy equals the negative conditional log-likelihood L of the target y
given the inputs x because
E =−log
h
K
Y
k=1
(a
t
k
)
t
k
i
=−log
h
K
Y
k=1
p(y =k|x,Θ)
t
k
i
(6.5)
=−logp(y|x,Θ)=−L . (6.6)
128
10
0
10
1
10
2
10
3
−15
−10
−5
0
Noise variance
% reduction
in reconstruction error
NEM noise benefit in training set squared reconstruction error over first 50 iterations
using a logistic−logistic BAM with 784 visible and 40 hidden neurons
20 40 60 80 100 120 140 160 180 200
60
80
100
120
140
160
Training iterations
Squared
reconstruction error
Training set squared reconstruction error for optimal noise variance 1024
Noiseless training
NEM training
Figure 6.5: This figure shows the percent median reduction in per-iteration squared
reconstruction errorfor the training with NEM noise relative to the noiseless training
of a BAM on 1000 images from the MNIST data set. I observe a reduction of 16%
in the training set squared reconstruction error at the optimal noise variance of 1024.
The BAM used one logistic (sigmoidal) hidden layers with 40 neurons and an input
layer with 784 logistic neurons. The bottom figure shows the training set squared
reconstruction error as iterations proceed for NEM and noiseless training using the
optimal noise variance of 1024.
Backpropagation updates the network parameters Θ using gradient ascent to maxi-
mize the loglikelihood logp(y|x,Θ). The partial derivative of this log-likelihood with
respect to u
kj
is
∂L
∂u
kj
= (t
k
−a
t
k
)a
h
j
(6.7)
and with respect to w
ji
is
∂L
∂w
ji
=a
h
j
(1−a
h
j
)x
i
K
X
k=1
(t
k
−a
t
k
)u
kj
. (6.8)
So (6.7) and (6.8) give the partial derivatives that perform gradient ascent on the
log-likelihood L.
A linear signal function often replaces the Gibbs function at the output layer for
129
10
0
10
1
10
2
10
3
5
10
15
20
25
Noise variance
% reduction
in reconstruction error
Blind noise benefit in training set squared reconstruction error over first 50 training iterations
using a logistic−logistic BAM with 784 visible and 40 hidden neurons
20 40 60 80 100 120 140 160 180 200
60
80
100
120
140
160
Training iterations
Squared
reconstruction error
Training set squared reconstruction error for optimal noise variance 1
Noiseless training
Blind noise training
Figure 6.6: This figure shows the percent median reduction in per-iteration squared
reconstruction error for the training with blind noise relative to the noiseless training
ofaBAMon1000imagesfromtheMNISTdataset. Iobservenosignificantdifference
in the per-iteration squared reconstruction error for the two cases. The BAM used
one logistic (sigmoidal) hidden layers with 40 neurons and an input layer with 784
logistic neurons.
regression:
a
t
k
=
J
X
j=1
u
kj
a
h
j
. (6.9)
The target values t of the output neuron layer can assume any real values for regres-
sion. Then backpropagation minimizes the following squared-error function:
E =
1
2
K
X
k=1
(t
k
−a
t
k
)
2
. (6.10)
I assume that t is Gaussian with mean a
t
and identity covariance matrix I. Then
backpropagation maximizes the following log-likelihood function:
L = logp(t|x,Θ)=logN(t;a
t
,I) (6.11)
130
Figure 6.7: Noise benefit region for a neural network with Bernoulli (logistic) output
neurons: Noisespeedsupmaximum-likelihoodparameterestimationoftheneuralnet-
work with Bernoulli output neurons if the noise lies above a hyperplane that passes
through the origin of the noise space. The activation signal a
t
of the output layer
controls the normal to the hyperplane. The hyperplane changes as learning proceeds
because the parameters and hidden layer neuron activations change. I used indepen-
dent and identically distributed (i.i.d.) Gaussian noise with mean 0, variance 3, and
(3,1,1) as the normal to the hyperplane.
for
N(t;a
t
,I)=
1
(2π)
d/2
exp
(
−
1
2
K
X
k=1
(t
k
−a
t
k
)
2
)
. (6.12)
Thus the gradient partial derivatives of this log-likelihood function are the same as
those for the K-class classification case in (6.7) and (6.8).
131
Figure 6.8: Noise benefit region for a neural network with Gaussian output neurons:
Noisespeedsupmaximum-likelihoodparameterestimationoftheneuralnetworkwith
Gaussian output neurons if the noise lies inside a hypersphere. The activation signal
a
t
of the output layer and the target signal t control the center and radius of this
hypersphere. This hypersphere changes as learning proceeds because the parameters
and hidden-layer neuron activations change. I used i.i.d. Gaussian noise with mean
0, variance 3, and center t−a
t
=(1,1,1).
6.3 EM Algorithm for Neural Network ML Esti-
mation
Both backpropagation and the EM algorithm find the ML estimate of a neural net-
work’s parameters. The next theorem shows that backpropagation is a special case
of the generalized EM algorithm.
Theorem 6.1. Backpropagation is the GEM Algorithm
The backpropagation update equation for a differentiable likelihood function p(y|x,Θ)
at epoch n
Θ
n+1
= Θ
n
+η∇
Θ
logp(y|x,Θ)
Θ=Θ
n
(6.13)
equals the GEM update equation at epoch n
Θ
n+1
= Θ
n
+η∇
Θ
Q(Θ|Θ
n
)
Θ=Θ
n
(6.14)
132
where GEM uses the differentiable Q-function
Q(Θ|Θ
n
) =E
p(h|x,y,Θ
n
)
n
logp(y,h|x,Θ)
o
. (6.15)
Proof: I know that [18], [120]
logp(y|x,Θ)=Q(Θ|Θ
n
)+H(Θ|Θ
n
) (6.16)
if H(Θ|Θ
n
) is the following cross entropy [38]:
H(Θ|Θ
n
) =−
Z
lnp(h|x,y,Θ) dp(h|x,y,Θ
n
). (6.17)
Hence
H(Θ|Θ
n
) =logp(y|x,Θ)−Q(Θ|Θ
n
). (6.18)
Now expand the Kullback-Leibler divergence [91]:
D
KL
(Θ
n
||Θ)=
Z
ln
p(h|x,y,Θ
n
)
p(h|x,y,Θ)
!
dp(h|x,y,Θ
n
) (6.19)
=
Z
lnp(h|x,y,Θ
n
) dp(h|x,y,Θ
n
)
−
Z
lnp(h|x,y,Θ) dp(h|x,y,Θ
n
) (6.20)
=−H(Θ
n
|Θ
n
)+H(Θ|Θ
n
). (6.21)
SoH(Θ|Θ
n
)≥H(Θ
n
|Θ
n
)forallΘbecauseD
KL
(Θ
n
||Θ)≥0[91]. Thus Θ
n
minimizes
H(Θ|Θ
n
) and hence∇
Θ
H(Θ|Θ
n
)= 0 at Θ= Θ
n
. Putting this in (6.18) gives
∇
Θ
logp(y|x,Θ)
Θ=Θ
n
=∇
Θ
Q(Θ|Θ
n
)
Θ=Θ
n
. (6.22)
Hence the backpropagation and GEM update equations are identical.
The GEM algorithm uses a probabilistic description of the hidden layer neurons.
I assume that the hidden layer neurons are Bernoulli random variables. So their
133
activation is the following conditional probability:
a
h
j
=p(h
j
=1|x,Θ). (6.23)
I can now formulate an EM algorithm for ML estimation of a feedforward neural
network’s parameters. The E-step computes the Q-function in (6.15). Computing
the expectation in (6.15) requires 2
J
values of p(h|x,y,Θ
n
). This is computationally-
intensive for large values of J. So I use Monte Carlo sampling to approximate the
above Q-function. The strong law of large numbers ensures that this Monte Carlo
approximation converges almost surely to the true Q-function. Bayes theorem gives
p(h|x,y,Θ
n
) as
p(h|x,y,Θ
n
)=
p(h|x,Θ
n
)p(y|h,Θ
n
)
P
h
p(h|x,Θ
n
)p(y|h,Θ
n
)
. (6.24)
Icansamplemoreeasilyfromp(h|x,Θ
n
)thanfromp(h|x,y,Θ
n
)becausetheh
j
terms
areindependentgivenx. Ireplacep(h|x,Θ
n
)byitsMonteCarloapproximationusing
M independent and identically-distributed (i.i.d.) samples:
p(h|x,Θ
n
)≈
1
M
M
X
m=1
δ
K
(h−h
m
) (6.25)
where δ
K
is the J-dimensional Kronecker delta function. The Monte Carlo approxi-
mation of the hidden data conditional PDF becomes
p(h|x,y,Θ
n
)≈
P
M
m=1
δ
K
(h−h
m
)p(y|h,Θ
n
)
P
h
P
M
m
1
=1
δ
K
(h−h
m
1
)p(y|h,Θ
n
)
(6.26)
=
P
M
m=1
δ
K
(h−h
m
)p(y|h
m
,Θ
n
)
P
M
m
1
=1
p(y|h
m
1
,Θ
n
)
(6.27)
=
M
X
m=1
δ
K
(h−h
m
)γ
m
(6.28)
where γ
m
=
p(y|h
m
,Θ
n
)
P
M
m
1
=1
p(y|h
m
1
,Θ
n
)
(6.29)
is the “importance” of h
m
. (6.28) gives an importance-sampled approximation of
134
p(h|x,y,Θ
n
) where each sample h
m
has weight γ
m
. I can now approximate the Q-
function as
Q(Θ|Θ
n
)≈
X
h
M
X
m=1
γ
m
δ
K
(h−h
m
)logp(y,h|x,Θ) (6.30)
=
M
X
m=1
γ
m
logp(y,h
m
|x,Θ) (6.31)
=
M
X
m=1
γ
m
h
logp(h
m
|x,Θ)+logp(y|h
m
,Θ)
i
(6.32)
where
logp(h
m
|x,Θ)=
J
X
j=1
h
h
m
j
loga
h
j
+(1−h
m
j
)log(1−a
h
j
)
i
(6.33)
for sigmoidal hidden layer neurons. Gibbs activation neurons at the output layer give
logp(y|h
m
,Θ)=
K
X
k=1
t
k
loga
mt
k
(6.34)
where a
h
j
is as in (6.1) and
a
mt
k
=
exp
P
J
j=1
u
kj
a
mh
j
P
K
k
1
=1
exp
P
J
j=1
u
k
1
j
a
mh
j
. (6.35)
Gaussian output layer neurons give
logp(y|h
m
,Θ) =−
1
2
K
X
k=1
(t
k
−a
mt
k
)
2
. (6.36)
The Q-function in (6.32) equals a sum of log-likelihood functions for two 2-layer
neural networks between the visible-hidden and hidden-output layers. The M-step
maximizes this Q-function by gradient ascent. So it is equivalent to two distinct
backpropagation steps performed on these two 2-layer neural networks.
135
6.4 TheNoisyExpectation-MaximizationTheorem
The Noisy Expectation-Maximization (NEM) algorithm [122], [123] modifies the EM
scheme andachieves fasterconvergence timesonaverage. TheNEMalgorithminjects
additive noise intothe dataateach EM iteration. The noise decays withthe iteration
count to guarantee convergence to the optimal parameters of the original data model.
The additive noise must also satisfy the NEM condition below that guarantees that
the NEM parameter estimates will climb faster up the likelihood surface on average.
6.4.1 NEM Theorem
The NEM Theorem[122], [123]states ageneralsufficient conditionwhen noisespeeds
up the EM algorithm’s convergence to a local optimum. The NEM Theorem uses the
following notation. The noise random variable N has pdf p(n|x). So the noise N
can depend on the data x. h are the latent variables in the model. {Θ
(n)
} is a
sequence of EM estimates for Θ. Θ
∗
= lim
n→∞
Θ
(n)
is the converged EM estimate
for Θ. Define the noisy Q function Q
N
(Θ|Θ
(n)
) =E
h|x,Θ
k
[lnp(x+N,h|θ)]. Assume
that the differential entropy of all random variables is finite and that the additive
noise keeps the data in the likelihood function’s support. Then I can state the NEM
theorem [122], [123].
Theorem 6.2. Noisy Expectation Maximization (NEM)
The EM estimation iteration noise benefit
Q(Θ
∗
|Θ
∗
)−Q(Θ
(n)
|Θ
∗
)≥Q(Θ
∗
|Θ
∗
)−Q
N
(Θ
(n)
|Θ
∗
) (6.37)
or equivalently
Q
N
(Θ
(n)
|Θ
∗
)≥Q(Θ
(n)
|Θ
∗
) (6.38)
holds on average if the following positivity condition holds:
E
x,h,N|Θ
∗
ln
p(x+N,h|Θ
k
)
p(x,h|Θ
k
)
≥ 0. (6.39)
The NEM Theorem states that each iteration of a suitably noisy EM algorithm
gives higher likelihood estimates on average than do the regular EM’s estimates. So
136
the NEM algorithm converges faster than EM if I can identify the data model. The
faster NEM convergence occurs both because the likelihood function has an upper
bound and because the NEM algorithm takes larger average steps up the likelihood
surface.
Maximum A Posteriori (MAP) estimation for missing information problems can
useamodifiedversionoftheEMalgorithm. TheMAPversionmodifiestheQ-function
by adding a log prior term G(Θ) =lnp(Θ) [42], [112]:
Q(Θ|Θ
t
)=E
h|x,Θt
[lnp(x,h|Θ)]+G(Θ). (6.40)
The MAP version of the NEM algorithm applies a similar modification to the Q
N
-
function:
Q
N
(Θ|Θ
t
) =E
h|x,Θt
[lnp(x+N,h|Θ)]+G(Θ). (6.41)
Many latent-variable models are not identifiable [166] and thus need not have
global optima. These models include Gaussian mixture models [113], hidden Markov
models [139], and neural networks). The EM and NEM algorithms converge to local
optima in these cases. The additive noise in the NEM algorithm helps the NEM
estimates search other nearby local optima. The NEM Theorem still guarantees
that NEM estimates have higher likelihood on average than EM estimates for non-
identifiable models.
6.5 Noise Benefits in Neural Network ML Estima-
tion
Consider adding noise n to the 1-in-K encoding t of the target variable y. I first
present the noise benefit sufficient condition for Gibbs activation output neurons
used in K-class classification.
Theorem 6.3. Forbidden Hyperplane Noise Benefit Condition
The NEM positivity condition holds for ML training of feedforward neural network
with Gibbs activation output neurons if
E
t,h,n|x,Θ
∗
n
n
T
log(a
t
)
o
≥ 0. (6.42)
137
Proof: I add noise to the target 1-in-K encodingt. The likelihood ratio in the NEM
sufficient condition becomes
p(t+n,h|x,Θ)
p(t,h|x,Θ)
=
p(t+n|h,Θ)p(h|x,Θ)
p(t|h,Θ)p(h|x,Θ)
(6.43)
=
p(t+n|h,Θ)
p(t|h,Θ)
(6.44)
=
K
Y
k=1
(a
t
k
)
t
k
+n
k
(a
t
k
)
t
k
=
K
Y
k=1
(a
t
k
)
n
k
. (6.45)
So the NEM positivity condition becomes
E
t,h,n|x,Θ
∗
n
log
K
Y
k=1
(a
t
k
)
n
k
o
≥ 0. (6.46)
This condition is equivalent to
E
t,h,n|x,Θ
∗
n
K
X
k=1
n
k
log(a
t
k
)
o
≥ 0. (6.47)
I can rewrite this positivity condition as the following matrix inequality:
E
t,h,n|x,Θ
∗{n
T
log(a
t
)}≥ 0 (6.48)
where log(a
t
) is the vector of output neuron log-activations.
The above sufficient condition requires that the noise n lie above a hyperplane
with normal log(a
t
). The next theorem gives a sufficient condition for a noise benefit
in the case of Gaussian output neurons.
Theorem 6.4. Forbidden Sphere Noise Benefit Condition
The NEM positivity condition holds for ML training of a feedforward neural network
with Gaussian output neurons if
E
t,h,n|,x,Θ
∗
n
n−a
t
+t
2
−
a
t
−t
2
o
≤ 0 (6.49)
where||.|| is the L
2
vector norm.
138
Proof: I add noise n to the K output neuron values t. The log-likelihood in the
NEM sufficient condition becomes
p(t+n,h|x,Θ)
p(t,h|x,Θ)
=
p(t+n|h,Θ)p(h|x,Θ)
p(t|h,Θ)p(h|x,Θ)
(6.50)
=
N(t+n;a
t
,I)
N(t;a
t
,I)
(6.51)
= exp
1
2
h
t−a
t
2
−
t+n−a
t
2
i
. (6.52)
So the NEM sufficient condition becomes
E
t,h,n|,xΘ
∗
n
n−a
t
+t
2
−
a
t
−t
2
o
≤ 0. (6.53)
The above sufficient condition defines a forbidden noise region outside a sphere
with centert−a
t
and radius||t−a
t
||. All noise inside this sphere speeds convergence
of ML estimation in the neural network on average.
This section presented sufficient conditions for a noise benefit in training a neural
network that uses the EM algorithm. I now discuss an argument for the observed
per-iteration improvement in classification accuracy due to NEM noise addition in
Figure 6.3.
6.6 Noise Benefits in Classification Accuracy
Iconsiderbinaryclassificationusinganeuralnetwork withonelogisticoutputneuron.
The target value for this neuron is t ∈{0,1} corresponding to classes 0 and 1. Let
this output neuron have an activation a
t
∈ (0,1]. Then the accuracy is
A =t I
a
t
≥
1
2
+(1−t) I
a
t
<
1
2
(6.54)
where I is the indicator function. The log-likelihood function is
L =t log(a
t
)+(1−t) log(1−a
t
). (6.55)
The next lemma gives a useful relation between the indicator and logarithmic func-
tions.
139
Lemma 6.1. Let x∈ (0,1]. Then
I
x≥
1
2
≥ log(2x). (6.56)
Proof: First consider x∈
0,
1
2
i
. The indicator function is 0 while the logarithm is
non-positive. Hence I
x ≥
1
2
≥ log(2x). Next consider x∈
1
2
,1
i
. The indicator
function is 1. The logarithm is positive and monotonically increasing with a value of
0 at x =
1
2
and log(2) at x =1. But log(2) is less than 1. Hence I
x≥
1
2
≥ log(2x)
for x∈
1
2
,1
i
as well.
The next theorem shows that the above log-likelihood is a lower-bound on the
classification accuracy in (6.54).
Theorem 6.5. Classification Accuracy-Likelihood Inequality The accuracy A
in (6.54) is related to the log-likelihood L in (6.55) as
A≥ L+log2. (6.57)
Proof: Lemma 6.1 gives
I
a
t
≥
1
2
≥ log(2a
t
). (6.58)
Next I note that
I
a
t
<
1
2
=I
1−a
t
≥
1
2
. (6.59)
A simple transformation x→1−x in Lemma 6.1 now gives
I
a
t
<
1
2
≥ log(2(1−a
t
)). (6.60)
Thus the two inequalities in (6.58) and (6.60) give the desired result:
t I
a
t
≥
1
2
+(1−t) I
a
t
<
1
2
≥t log(a
t
)+(1−t) log(1−a
t
)+log(2). (6.61)
140
ML estimation of a neural network’s parameters thus maximizes a lower-bound
on the classification accuracy.
I now discuss noise benefit in pre-training or initialization of the parameters of a
neural network using RBMs or the equivalent BAMs.
6.7 Training BAMs or Restricted Boltzmann Ma-
chines
Restricted Boltzmann Machines [73], [157] are a special type of bidirectional asso-
ciative memory (BAM) [86]–[88]. So they enjoy rapid convergence to a bidirectional
fixed point for synchronous updating of the neurons. A BAM is a two-layer heteroas-
sociative network that uses the synaptic connection matrixW on the forward pass of
the neuronal signals from the lower layer to the higher layer but also uses the trans-
pose matrix W
T
on the backward pass from the higher layer to the lower layer. The
lower layer is visible during training of “deep” neural networks [73] while the higher
field is hidden. The general BAM Theorem ensures that any such matrix W is bidi-
rectionally stable for threshold neurons as well for most continuous neurons. Logistic
neurons satisfy the BAM Theorem because logistic signal functions are bounded and
monotone decreasing. The following results use the term RBM instead of BAM for
simplicity.
Consider an RBM with I visible neurons and J hidden neurons. Let x
i
and h
j
denote the values of the i
th
visible and j
th
hidden neuron.
Let E(x,h;Θ) be the energy function for the network. Then the joint probability
density function of x and h is the Gibbs distribution:
p(x,h;Θ)=
exp
−E(x,h;Θ)
Z(Θ)
(6.62)
where Z(Θ)=
X
x
X
h
exp(−E(x,h;Θ)). (6.63)
Integrals replace sums for continuous variables in the above partition function Z(Θ).
TheGibbsenergyfunctionE(v,h;Θ)dependsonthetypeofRBM.ABernoulli(visible)-
Bernoulli(hidden) RBMhaslogisticconditionalPDFsatthehiddenandvisible layers
141
and has the following BAM energy or Lyapunov function [86]–[88]:
E(x,h;Θ)=−
I
X
i=1
J
X
j=1
w
ij
x
i
h
j
−
I
X
i=1
b
i
x
i
−
J
X
j=1
a
j
h
j
(6.64)
where w
ij
is the weight of the connection between the i
th
visible and j
th
hidden
neuron, b
i
is the bias for the i
th
visible neuron, and a
j
is the bias for the j
th
hidden
neuron. A Gaussian(visible)-Bernoulli(hidden) RBM has Gaussian conditional PDFs
at the visible layer, logistic conditional PDFs at the hidden layer, and the energy
function [73], [74]
E(x,h;Θ)=−
I
X
i=1
J
X
j=1
w
ij
x
i
h
j
+
1
2
I
X
i=1
(x
i
−b
i
)
2
−
J
X
j=1
a
j
h
j
. (6.65)
The neural network uses an RBM or BAM as a building block. The systems finds
ML estimates of each RBM’s parameters and then stacks up the resulting RBMs on
topofeach other. Then backpropagation trainsthe neural network. The next section
reviews ML training for an RBM.
6.7.1 ML Training for RBMs using Contrastive Divergence
The maximum likelihood (ML) estimate of the parameters $\Theta$ for an RBM is
$$ \Theta^{*} = \arg\max_{\Theta} \log p(x;\Theta). \qquad (6.66) $$
Gradient ascent can iteratively solve this optimization problem. I estimate $w_{ij}$ in the quadratic forms in (6.64) and (6.65) because these terms are the same for a Bernoulli-Bernoulli and a Gaussian-Bernoulli RBM. The gradient of $\log p(x;\Theta)$ with respect to $w_{ij}$ is
$$ \frac{\partial \log p(x;\Theta)}{\partial w_{ij}} = E_{p(h|x,\Theta)}\{x_i h_j\} - E_{p(x,h|\Theta)}\{x_i h_j\}. \qquad (6.67) $$
So the update rule for $w_{ij}$ at iteration $(n+1)$ becomes
$$ w_{ij}^{(n+1)} = w_{ij}^{(n)} + \eta\left( E_{p(h|x,\Theta^{(n)})}\{x_i h_j\} - E_{p(x,h|\Theta^{(n)})}\{x_i h_j\} \right) \qquad (6.68) $$
where $\eta > 0$ is the learning rate. I can easily compute $p(h|x,\Theta^{(n)})$ for the RBM because there are no connections between any two hidden or any two visible neurons. This gives the expectation $E_{p(h|x,\Theta^{(n)})}\{x_i h_j\}$. But I cannot so easily compute $p(x,h|\Theta^{(n)})$ due to the partition function $Z(\Theta)$ in (6.63). Contrastive divergence (CD) [73] approximates this intractable model expectation through activations that derive from a forward and a backward pass in the RBM.
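The following NumPy sketch shows one CD-1 step for a Bernoulli-Bernoulli RBM in this spirit. It is a minimal illustration, not the code of [75]: the batch layout, sampling choices, and learning-rate handling are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, a, b, eta=0.05, rng=None):
    # One CD-1 step on a batch X whose rows are binary visible vectors.
    # Positive statistics estimate E_{p(h|x,Theta)}{x_i h_j}; negative statistics from one
    # Gibbs reconstruction approximate E_{p(x,h|Theta)}{x_i h_j}, as in the update (6.68).
    rng = rng or np.random.default_rng(0)
    ph = sigmoid(X @ W + a)                          # forward pass: p(h_j = 1 | x)
    h = (rng.random(ph.shape) < ph).astype(float)    # sample hidden states
    pv = sigmoid(h @ W.T + b)                        # backward pass: reconstruct visibles
    ph_recon = sigmoid(pv @ W + a)                   # re-infer hidden probabilities
    grad_W = (X.T @ ph - pv.T @ ph_recon) / X.shape[0]
    grad_a = (ph - ph_recon).mean(axis=0)
    grad_b = (X - pv).mean(axis=0)
    return W + eta * grad_W, a + eta * grad_a, b + eta * grad_b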
6.7.2 ML Training for RBMs using the EM algorithm
The EM algorithm provides an iterative method for learning RBM parameters. Consider an RBM with $I$ visible neurons and $J$ hidden neurons. EM maximizes a simpler lower bound on $\log p(x;\Theta)$ because the log-likelihood can be intractable to compute. This lower bound at $\Theta = \Theta^{(n)}$ is
$$ Q(\Theta|\Theta^{(n)}) = E_{h|x,\Theta^{(n)}}\{\log p(x,h;\Theta)\} \qquad (6.69) $$
$$ = E_{h|x,\Theta^{(n)}}\{-E(x,h;\Theta) - \log Z(\Theta)\}. \qquad (6.70) $$
A generalized EM (GEM) algorithm uses gradient ascent to iteratively maximize the above Q-function. The gradient with respect to $w_{ij}$ is
$$ \frac{\partial Q(\Theta|\Theta^{(n)})}{\partial w_{ij}} = \frac{\partial E_{h|x,\Theta^{(n)}}\{-E(x,h;\Theta) - \log Z(\Theta)\}}{\partial w_{ij}} \qquad (6.71) $$
$$ = E_{h|x,\Theta^{(n)}}\left\{ -\frac{\partial E(x,h;\Theta)}{\partial w_{ij}} - \frac{\partial \log Z(\Theta)}{\partial w_{ij}} \right\} \qquad (6.72) $$
$$ = E_{h|x,\Theta^{(n)}}\left\{ x_i h_j - \frac{1}{Z(\Theta)}\frac{\partial Z(\Theta)}{\partial w_{ij}} \right\}. \qquad (6.73) $$
But the partition function term expands as
$$ \frac{1}{Z(\Theta)}\frac{\partial Z(\Theta)}{\partial w_{ij}} = \frac{1}{Z(\Theta)}\,\frac{\partial \left\{ \sum_{x}\sum_{h}\exp(-E(x,h;\Theta)) \right\}}{\partial w_{ij}} \qquad (6.74) $$
$$ = \frac{1}{Z(\Theta)}\sum_{x}\sum_{h} \frac{\partial \exp(-E(x,h;\Theta))}{\partial w_{ij}} \qquad (6.75) $$
$$ = \frac{1}{Z(\Theta)}\sum_{x}\sum_{h} \exp(-E(x,h;\Theta))\, x_i h_j \qquad (6.76) $$
$$ = \sum_{x}\sum_{h} \frac{\exp(-E(x,h;\Theta))}{Z(\Theta)}\, x_i h_j \qquad (6.77) $$
$$ = \sum_{x}\sum_{h} p(x,h|\Theta)\, x_i h_j = E_{x,h|\Theta}\{x_i h_j\}. \qquad (6.78) $$
So the partial derivative of the Q-function becomes
$$ \frac{\partial Q(\Theta|\Theta^{(n)})}{\partial w_{ij}} = E_{h|x,\Theta^{(n)}}\left\{ x_i h_j - E_{x,h|\Theta}\{x_i h_j\} \right\} \qquad (6.79) $$
$$ = E_{h|x,\Theta^{(n)}}\{x_i h_j\} - E_{x,h|\Theta}\{x_i h_j\}. \qquad (6.80) $$
This leads to the key GEM gradient ascent equation:
$$ w_{ij}^{(n+1)} = w_{ij}^{(n)} + \eta\left( E_{p(h|x,\Theta^{(n)})}\{x_i h_j\} - E_{p(x,h|\Theta^{(n)})}\{x_i h_j\} \right). \qquad (6.81) $$
The above update equation is the same as the contrastive divergence update equation in (6.68). The next section shows that this equivalence between CD and GEM lets us derive a NEM sufficient condition for RBM training.
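As a sanity check on the gradient that both CD and GEM target, the following sketch computes the two expectations in (6.80) exactly for a tiny Bernoulli-Bernoulli RBM by enumerating all configurations. The toy sizes and variable names are assumptions for illustration; exact enumeration is only possible because the model is small.

import itertools
import numpy as np

rng = np.random.default_rng(1)
I, J = 3, 2
W = 0.3 * rng.standard_normal((I, J))
b, a = np.zeros(I), np.zeros(J)
x = np.array([1.0, 0.0, 1.0])                      # one visible training vector

def energy(xv, hv):                                # Bernoulli-Bernoulli energy (6.64)
    return -(xv @ W @ hv) - b @ xv - a @ hv

H = [np.array(h, float) for h in itertools.product([0, 1], repeat=J)]
V = [np.array(v, float) for v in itertools.product([0, 1], repeat=I)]

# Data term: E_{h|x,Theta}{x_i h_j} from the exact conditional p(h|x,Theta).
ph = np.array([np.exp(-energy(x, h)) for h in H])
ph /= ph.sum()
pos = sum(p * np.outer(x, h) for p, h in zip(ph, H))

# Model term: E_{x,h|Theta}{x_i h_j} from the exact joint Gibbs distribution (6.62).
pairs = [(v, h) for v in V for h in H]
pj = np.array([np.exp(-energy(v, h)) for v, h in pairs])
pj /= pj.sum()
neg = sum(p * np.outer(v, h) for p, (v, h) in zip(pj, pairs))

print("exact ML/GEM gradient (6.80):")
print(pos - neg)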
6.8 Noise Benefits in RBM ML Estimation
Consider now addition of noise $n$ to the input data $x$. A NEM noise benefit exists in the RBM if
$$ E_{x,h,n|\Theta^{*}}\left\{ \log \frac{p(x+n,h;\Theta)}{p(x,h;\Theta)} \right\} \ge 0. \qquad (6.82) $$
The noisy complete data likelihood is
$$ p(x+n,h;\Theta) = \frac{\exp(-E(x+n,h;\Theta))}{Z(\Theta)}. \qquad (6.83) $$
So a NEM noise benefit for an RBM occurs if
$$ E_{x,h,n|\Theta^{*}}\left\{ \log \frac{\exp(-E(x+n,h;\Theta))}{\exp(-E(x,h;\Theta))} \right\} \ge 0. \qquad (6.84) $$
This is equivalent to the RBM noise benefit inequality:
$$ E_{x,h,n|\Theta^{*}}\left\{ -E(x+n,h;\Theta) + E(x,h;\Theta) \right\} \ge 0. \qquad (6.85) $$
The theorem below states a simple hyperplane separation condition that guaran-
tees a noise benefit in the Bernoulli-Bernoulli RBM.
Theorem 6.6. Forbidden Hyperplane Noise Benefit Condition
The NEM positivity condition holds for Bernoulli-Bernoulli RBM training if
$$ E_{x,h,n|\Theta^{*}}\left\{ n^{T}(Wh + b) \right\} \ge 0. \qquad (6.86) $$
Proof: The noise benefit for a Bernoulli(visible)-Bernoulli(hidden) RBM results if I
apply the energy function from (6.64) to the expectation in (6.85) to get
$$ E_{x,h,n|\Theta^{*}}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i \right\} \ge 0. \qquad (6.87) $$
The term in brackets is equivalent to
$$ \sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i = n^{T}(Wh + b). \qquad (6.88) $$
So the noise benefit sufficient condition becomes
$$ E_{x,h,n|\Theta^{*}}\left\{ n^{T}(Wh + b) \right\} \ge 0. \qquad (6.89) $$
The above sufficient condition is similar to the hyperplane condition for neural
network training in Theorem 6.3. All noise above a hyperplane based on the RBM’s
parameters gives a noise benefit. The next theorem states a spherical separation
condition that guarantees a noise benefit in the Gaussian-Bernoulli RBM.
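A minimal rejection-sampling sketch shows one way the hyperplane condition (6.86) could be used in practice for a Bernoulli-Bernoulli RBM: draw candidate noise and keep only samples on the beneficial side of the hyperplane. The Gaussian base noise, the helper name, and the per-sample screening strategy are illustrative assumptions, not part of Theorem 6.6.

import numpy as np

def sample_nem_noise_hyperplane(W, b, h, sigma=0.1, size=1000, rng=None):
    # Keep only noise samples n with n^T (W h + b) >= 0, i.e. noise that lies on or
    # above the 'forbidden' hyperplane of Theorem 6.6 (a sufficient screening rule).
    rng = rng or np.random.default_rng(0)
    direction = W @ h + b                        # normal vector of the hyperplane
    n = sigma * rng.standard_normal((size, len(b)))
    keep = n @ direction >= 0.0                  # hyperplane condition n^T(Wh + b) >= 0
    return n[keep]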
Theorem 6.7. Forbidden Sphere Noise Benefit Condition
The NEM positivity condition holds for Gaussian-Bernoulli RBM training if
$$ E_{x,h,n|\Theta^{*}}\left\{ \frac{1}{2}\|n\|^{2} - n^{T}(Wh + b - x) \right\} \le 0. \qquad (6.90) $$
Proof: Putting the energy function in (6.65) into (6.85) gives the noise benefit condition for a Gaussian(visible)-Bernoulli(hidden) RBM:
$$ E_{x,h,n|\Theta^{*}}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i - \frac{1}{2}\sum_{i=1}^{I} n_i^{2} - \sum_{i=1}^{I} n_i x_i \right\} \ge 0. \qquad (6.91) $$
The term in brackets equals the following matrix expression:
$$ \sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} n_i h_j + \sum_{i=1}^{I} n_i b_i - \frac{1}{2}\sum_{i=1}^{I} n_i^{2} - \sum_{i=1}^{I} n_i x_i = n^{T}(Wh + b - x) - \frac{1}{2}\|n\|^{2}. \qquad (6.92) $$
So the noise benefit sufficient condition becomes
$$ E_{x,h,n|\Theta^{*}}\left\{ \frac{1}{2}\|n\|^{2} - n^{T}(Wh + b - x) \right\} \le 0. \qquad (6.93) $$
The above condition bisects the noise space. But the bisecting surface for (6.90) is a hypersphere. This condition is also similar to the noise benefit condition for neural network ML training in Theorem 6.4.
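A companion sketch screens noise for the Gaussian-Bernoulli case using the spherical condition (6.90): the inequality 0.5*||n||^2 <= n^T (W h + b - x) keeps noise inside a hypersphere centered at W h + b - x. As before, the Gaussian candidate noise and the helper name are assumptions made for illustration, not part of Theorem 6.7.

import numpy as np

def sample_nem_noise_sphere(W, b, x, h, sigma=0.1, size=1000, rng=None):
    # Keep noise samples n with 0.5*||n||^2 - n^T (W h + b - x) <= 0, the spherical
    # screening condition of Theorem 6.7 for a Gaussian-Bernoulli RBM.
    rng = rng or np.random.default_rng(0)
    center = W @ h + b - x                       # the admissible ball is centered here
    n = sigma * rng.standard_normal((size, len(x)))
    keep = 0.5 * np.sum(n ** 2, axis=1) - n @ center <= 0.0
    return n[keep]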
6.9 Simulation Results
I modified the Matlab code available in [75] to inject noise during EM-backpropagation training of a neural network. I used 1000 training instances from the training set of the MNIST digit classification dataset. Each image in the dataset had 28×28 pixels with each pixel value lying between 0 and 1. I fed each pixel into an input neuron of the neural network. I used a 5-layer neural network with 40 neurons in each of the three hidden layers and 10 neurons in the output layer for classifying the 10 digits. The output layer used the Gibbs activation function for the 10-class classification network. I used logistic activation functions in all other layers. Simulations used 10 Monte Carlo samples for approximating the Q-function in the 10-class classification network. Figure 6.1 shows the noise benefit for cross-entropy training of a feedforward neural network. The NEM version displays an 18% median decrease in cross entropy per iteration compared to noiseless backpropagation training. Figure 6.2 shows that adding blind noise instead of NEM noise gives only a minuscule improvement of 1.7% in cross entropy over the noiseless EM-BP algorithm. Figure 6.3 shows that NEM-BP gives a 15% median improvement in the per-iteration classification error rate for the training set and a 10% improvement for the testing set at the optimal noise variance of 0.42. Figure 6.4 shows that this noise benefit disappears upon using blind noise in place of NEM noise.
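For orientation, here is a schematic sketch of where NEM noise injection enters one training epoch of EM-backpropagation. It is not the modified Matlab code of [75]: every helper (forward, backward, update, noise_screen) is an assumed placeholder, the noise is shown perturbing the training targets purely for illustration, and the annealing schedule is an assumption rather than the setting used in the reported simulations.

import numpy as np

def nem_backprop_epoch(batches, params, forward, backward, update,
                       noise_screen, sigma, epoch, decay=1.0):
    # One epoch of backpropagation with screened (NEM-admissible) noise injection.
    # noise_screen(noise, activations, params) is assumed to return noise of the same
    # shape, zeroing out samples that violate the relevant NEM sufficient condition.
    scale = sigma / (1.0 + decay * epoch)        # anneal the noise as training proceeds
    rng = np.random.default_rng(epoch)
    for x, t in batches:
        acts = forward(x, params)
        noise = scale * rng.standard_normal(np.shape(t))
        noise = noise_screen(noise, acts, params)
        grads = backward(x, t + noise, acts, params)   # noisy E-step / backward pass
        params = update(params, grads)
    return params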
6.10 Conclusions
The backpropagation algorithm is a special case of the generalized EM algorithm. So proper noise injection speeds backpropagation convergence because it speeds EM convergence. The sufficient conditions for this noise benefit follow from the recent noisy EM (NEM) theorem. Similar sufficient conditions hold for a noise benefit in pre-training neural networks based on the NEM theorem. Noise-injection simulations on the MNIST digit recognition data set reduced both the network cross entropy and the classification error rate.
Bibliography
[1] F. Alimoglu and E. Alpaydin, “Methods of Combining Multiple Classifiers
Based on Different Representations for Pen-based Handwritten Digit Recogni-
tion”, in Proc. Fifth Turkish Artificial Intelligence and Artificial Neural Net-
works Symposium (TAINN 96), 1996.
[2] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: A
general and efficient weighted finite-state transducer library”, Implementation
and Application of Automata, pp. 11–23, 2007.
[3] V. Ambati and S. Vogel, "Can crowds build parallel corpora for machine translation systems?", in Proc. HLT-NAACL, 2010, pp. 62–65.
[4] G. An, “The effects of adding noise during backpropagation training on a
generalization performance”, Neural Computation, vol. 8, no. 3, pp. 643–674,
1996.
[5] K Audhkhasi, P. Georgiou, and S. Narayanan, “Accurate transcription of
broadcast news speech using multiple noisy transcribers and unsupervised re-
liability metrics”, in Acoustics, Speech and Signal Processing (ICASSP), 2011
IEEE International Conference on, IEEE, 2011, pp. 4980–4983.
[6] K. Audhkhasi, P. Georgiou, and S. Narayanan, “Reliability-weighted acous-
tic model adaptation using crowd sourced transcriptions”, Proc. Interspeech,
Florence, 2011.
[7] K. Audhkhasi and S. Narayanan, “Data-dependent evaluator modeling and
its application to emotional valence classification from speech”, in Proc. Inter-
speech, 2010.
[8] ——, “Data-dependent evaluator modeling and its application to emotional
valence classification from speech”, in Proceedings of InterSpeech, 2010.
[9] ——, “Emotion classification from speech using evaluator reliability-weighted
combination of ranked lists”, in Proc. ICASSP, IEEE, 2011, pp. 4956–4959.
[10] ——, “A globally-variant locally-constant model for fusion of labels from mul-
tiple diverse experts without using reference labels”, IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 769–783, 2013.
[11] K. Audhkhasi, A. Sethy, B. Ramabhadran, and S. Narayanan, “Creating en-
semble of diverse maximum entropy models”, in Proceedings of ICASSP, 2012.
[12] K. Audhkhasi, A. Zavou, P. Georgiou, and S. Narayanan, "Theoretical analysis of diversity in an ensemble of automatic speech recognition systems", Accepted for publication in IEEE Transactions on Audio, Speech, and Language Processing, 2013.
[13] S. Avidan, “Ensemble tracking”, IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 29, no. 2, pp. 261–271, 2007.
[14] R. M. Bell, Y. Koren, and C. Volinsky, “The BellKor solution to the Netflix
prize”, Netflix prize documentation, 2007.
[15] J. Bennett and S. Lanning, “The Netflix Prize”, in Proceedings of KDD Cup
and Workshop, 2007, pp. 35–38.
[16] ——,“TheNetflixprize”,inProceedingsofKDDCupandWorkshop,vol.2007,
2007, p. 35.
[17] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[18] ——, Pattern recognition and machine learning, 4. Springer New York, 2006,
vol. 4.
[19] C. M. Bishop, “Training with noise is equivalent to Tikhonov regularization”,
Neural computation, vol. 7, no. 1, pp. 108–116, 1995.
[20] D. Black, The theory of committees and elections. Springer, 1986.
[21] L. Breiman, “Bagging predictors”, Machine learning, vol. 24, no. 2, pp. 123–
140, 1996.
[22] ——, “Bagging predictors”, Machine Learning, vol. 24, no. 2, pp. 123–140,
1996.
[23] ——, “Bias, variance and arcing classifiers (Technical Report 460)”, Statistics
Department, University of California at Berkeley, Berkeley, CA, 1996.
[24] ——, “Random forests”, Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[25] ——, “Random forests”, Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[26] C. Breslin, “Generation and combination of complementary systems for au-
tomatic speech recognition”, PhD thesis, Cambridge University Engineering
Department and Darwin College, 2008.
[27] C. Breslin and M. J. F. Gales, “Generating complementary systems for speech
recognition”, in Interspeech, 2006.
[28] A. Bulsara, R. Boss, and E. Jacobs, “Noise effects in an electronic model of a
single neuron”, Biological cybernetics, vol. 61, no. 3, pp. 211–222, 1989.
[29] C. J. C. Burges, K. M. Svore, P. N. Bennett, A. Pastusiak, and Q. Wu, "Learning to rank using an ensemble of lambda-gradient models", Journal of Machine Learning Research (Proceedings Track), vol. 14, pp. 25–35, 2011.
[30] C. Busso et al., “IEMOCAP: Interactive Emotional Dyadic Motion Capture
Database”, Language resources and evaluation, vol. 42, no. 4, pp. 335–359,
2008.
[31] C. Callison-Burch, “Fast, cheap, and creative: Evaluating translation quality
using Amazon’s Mechanical Turk”, in Proc. EMNLP, 2009, pp. 286–295.
[32] O. Chapelle and Y. Chang, “Yahoo! learning to rank challenge overview”,
Journal of Machine Learning Research, vol. 14, pp. 1–24, 2011.
[33] S. F. Chen, B. Kingsbury, L. Mangu, D. Povey, G. Saon, H. Soltau, and G.
Zweig, “Advances in speech transcription at IBM under the DARPA EARS
program”,IEEE Transactions on Audio, Speech,and Language Processing,vol.
14, no. 5, pp. 1596–1608, 2006.
[34] X. Chen and Y. Zhao, “Building acoustic model ensembles by data sampling
with enhanced trainings and features”, IEEE Transactions on Audio, Speech,
and Language Processing, vol. 21, no. 3-4, pp. 498–507, 2013.
[35] G. Cook and T. Robinson, “Boosting the performance of connectionist large
vocabulary speech recognition”, in Proc. ICSLP, IEEE, vol. 3, 1996, pp. 1305–
1308.
[36] C. Cortes and V. Vapnik, “Support-vector networks”, Machine learning, vol.
20, no. 3, pp. 273–297, 1995.
[37] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, “Modeling wine
preferences by data mining from physicochemical properties”, Decision Sup-
port Systems, vol. 47, no. 4, pp. 547–553, 2009.
[38] T. M. Cover and J. A. Thomas, Elements of information theory. Wiley-Interscience, 2012.
[39] J. Cui, X. Cui, B. Ramabhadran, J. Kim, B. Kingsbury, J. Mamou, L. Mangu,
M. Picheny, T. N. Sainath, and A. Sethy, “Developing speech recognition
systems for corpus indexing under the IARPA BABEL program”, in Proc.
ICASSP, IEEE, 2013.
[40] X. Cui, J. Huang, and J.-T. Chien, "Multi-view and multi-objective semi-supervised learning for HMM-based automatic speech recognition", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, pp. 1923–1935, 2012.
[41] A. P. Dawid and A. M. Skene, “Maximum likelihood estimation of observer
error-rates using the EM algorithm”, Journal of the Royal Statistical Society.
Series C (Applied Statistics), vol. 28, no. 1, pp. 20–28, 1979.
[42] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society: Series B, vol. 39, pp. 1–38, 1977.
[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A
Large-Scale Hierarchical Image Database”, in CVPR, 2009.
[44] M. Denkowski, H. Al-Haj, and A. Lavie, “Turker-assisted paraphrasing for
English-Arabic machine translation”, in Proc. HLT-NAACL, 2010, pp. 66–70.
[45] F. Diehl and P. C. Woodland, "Complementary phone error training", in Proc. Interspeech, 2012.
[46] T. Dietterich, “Ensemble methods in machine learning”, Multiple classifier
systems, pp. 1–15, 2000.
[47] ——, “Ensemble methods in machine learning”, Multiple classifier systems,
pp. 1–15, 2000.
[48] C. Dimitrakakis and S. Bengio, "Boosting HMMs with an application to speech recognition", in Proc. ICASSP, IEEE, vol. 5, 2004, pp. 618–621.
[49] P. Domingos, “A unified bias-variance decomposition”, in Proc. ICML, 2000.
[50] V. Doumpiotis and W. Byrne, "Lattice segmentation and minimum Bayes risk discriminative training for large vocabulary continuous speech recognition", Speech Communication, vol. 48, no. 2, pp. 142–160, 2006.
[51] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley, New
York, 2001, vol. 2.
[52] F. Eyben, M. Wöllmer, and B. Schuller, "OpenSMILE: The Munich versatile and fast open-source audio feature extractor", in Proceedings of the international conference on Multimedia, ACM, 2010, pp. 1459–1462.
[53] ——, "OpenSMILE: the Munich versatile and fast open-source audio feature extractor", in Proc. ACM Multimedia, 2010, pp. 1459–1462.
[54] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library
for large linear classification”, The Journal of Machine Learning Research,vol.
9, pp. 1871–1874, 2008.
[55] J. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling", in Proc. ACL, 2005, pp. 363–370.
[56] J. Fiscus, "A post processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)", in Proc. ASRU, IEEE, 1997, pp. 347–354.
[57] J. Fiscus, J. Garofolo, M. Przybocki, W. Fisher, and D. Pallett, “English
broadcast news speech (HUB4)”, Linguistic Data Consortium, Philadelphia,
1997.
[58] A. Frank and A. Asuncion, "UCI machine learning repository", http://archive.ics.uci.edu/ml, 2010.
[59] B. Franzke and B. Kosko, “Noise Can Speed Convergence in Markov Chains”,
Physical Review E, vol. 84, no. 4, p. 041112, 2011.
[60] Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line
learning and an application to boosting”, in Computational learning theory,
1995, pp. 23–37.
[61] J. Friedman, “Greedy function approximation: a gradient boosting machine”,
Annals of Statistics, pp. 1189–1232, 2001.
[62] L. Gammaitoni, P. H¨ anggi, P. Jung, and F. Marchesoni, “Stochastic reso-
nance”, Reviews of Modern Physics, vol. 70, no. 1, p. 223, 1998.
[63] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma", Neural computation, vol. 4, no. 1, pp. 1–58, 1992.
[64] S. Goldwater, D. Jurafsky, and C. D. Manning, “Which words are hard to
recognize? Prosodic, lexical, and disfluency factors that increase speech recog-
nition error rates”, Speech Communication, vol. 52, no. 3, pp. 181–200, 2010.
[65] M. Grimm and K. Kroschel, “Evaluation of natural emotions using self as-
sessment manikins”, in IEEE Workshop on Automatic Speech Recognition and
Understanding, 2005, pp. 381–385.
[66] S. Gutta, J. R. J. Huang, P. Jonathon, and H. Wechsler, “Mixture of experts
for classification of gender, ethnic origin, and pose of human faces”, IEEE
Transactions on Neural Networks, vol. 11, no. 4, pp. 948–960, 2000.
[67] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten,
“The WEKA data mining software: an update”, ACM SIGKDD Explorations
Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[68] Y. Hayakawa, A. Marumoto, and Y. Sawada, “Effects of the chaotic noise
on the performance of a neural network model for optimization problems”,
Physical review E, vol. 51, no. 4, pp. 2693–2696, 1995.
[69] S. Haykin, Neural networks: A comprehensive foundation. Prentice Hall, 1998.
[70] D. Heck, J. Knapp, J. Capdevielle, G. Schatz, and T. Thouw, CORSIKA: A Monte Carlo code to simulate extensive air showers. FZKA 6019, Forschungszentrum Karlsruhe, 1998, vol. 6019.
[71] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design", in Proc. Intl. Conf. on Human Factors in Computing Systems, 2010, pp. 203–212.
[72] G. Heigold, H. Ney, R. Schluter, and S. Wiesler, “Discriminative training
for automatic speech recognition: Modeling, criteria, optimization, implemen-
tation, and performance”, IEEE Signal Processing Magazine, vol. 29, no. 6,
pp. 58–69, Nov. 2012.
[73] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets", Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[74] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks", Science, vol. 313, no. 5786, pp. 504–507, 2006.
[75] G. Hinton, Training a deep autoencoder or a classifier on MNIST digits, http://www.cs.toronto.e [Online; accessed 20-Feb-2013].
[76] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V.
Vanhoucke, P. Nguyen, T. Sainath, et al., “Deep neural networks for acoustic
modeling in speech recognition”, IEEE Signal Processing Magazine, 2012.
[77] J. Hirschberg, D. Litman, and M. Swerts, “Prosodic and other cues to speech
recognition failures”, Speech Communication,vol. 43, no. 1, pp. 155–175,2004.
[78] P. Horton and K. Nakai, “A probabilistic classification system for predicting
the cellular localization sites of proteins”, in Proc. Fourth International Con-
ference on Intelligent Systems for Molecular Biology, 1996, pp. 109–115.
[79] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin,
T. Pfau, E. Shriberg, and A. Stolcke, “The ICSI meeting corpus”, in Proc.
ICASSP, IEEE, vol. 1, 2003, pp. I–364.
[80] H. Jiang, "Confidence measures for speech recognition: A survey", Speech communication, vol. 45, no. 4, pp. 455–470, 2005.
[81] R. Jin and Z. Ghahramani, “Learning with multiple labels”, Proc. NIPS,
pp. 921–928, 2003, issn: 1049–5258.
[82] R. Kehrein, “The prosody of authentic emotions”, in Proc. Speech Prosody
Conference, 2002, pp. 423–426.
[83] T. Kemp and T. Schaaf, “Estimating confidence using word lattices”, in Proc.
Eurospeech, Rhodes, Greece: ESCA, vol. 2, 1997, pp. 827–830.
[84] E. Kong and T. Dietterich, “Error-correcting output coding corrects bias and
variance”, in Proceedings of International Conference on Machine Learning,
vol. 313, 1995, p. 321.
[85] G. Korn and T. Korn, Mathematical handbook for scientists and engineers: definitions, theorems, and formulas for reference and review. Dover Publications, 2000.
[86] B. Kosko, “Adaptive bidirectional associative memories”, Applied optics, vol.
26, no. 23, pp. 4947–4960, 1987.
[87] ——, “Bidirectional associative memories”, Systems, Man and Cybernetics,
IEEE Transactions on, vol. 18, no. 1, pp. 49–60, 1988.
[88] ——, Neural networks and fuzzy systems: A dynamical systems approach to
machine intelligence. Prentice-Hall International Editions, 1992.
[89] ——, Noise. Viking, 2006, isbn: 0670034959.
[90] A. Krogh and J. Vedelsby, “Neural network ensembles, cross validation, and
active learning”, Advances in neural information processing systems, pp. 231–
238, 1995.
[91] S. Kullback and R. A. Leibler, “On information and sufficiency”, The Annals
of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[92] L. I. Kuncheva, Combining pattern classifiers: methods and algorithms. Wiley-
Interscience, 2004.
[93] L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy", Machine learning, vol. 51, no. 2, pp. 181–207, 2003.
[94] L. Kuncheva, Combining pattern classifiers: methods and algorithms. Wiley-Interscience, 2004.
[95] L. Kuncheva and C. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy", Machine Learning, vol. 51, no. 2, pp. 181–207, 2003.
[96] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data”, in Proceedings of
ICML-01, 2001, pp. 282–289.
[97] K.-F. Lee, H.-W. Hon, and R. Reddy, “An overview of the SPHINX speech
recognition system”, IEEE Transactions on Acoustics, Speech and Signal Pro-
cessing, vol. 38, no. 1, pp. 35–45, 1990.
[98] S. Lee, S. Yildirim, A. Kazemzadeh, and S. Narayanan, “An articulatory
study of emotional speech production”, in Proc. Ninth European Conference
on Speech Communication and Technology, 2005.
[99] Y. Lee and O. Mangasarian, “SSVM: a smooth support vector machine for
classification”, Computational optimization and Applications, vol. 20, no. 1,
pp. 5–22, 2001.
[100] M. Li, V. Rozgic, G. Thatte, S. Lee, A. Emken, M. Annavaram, U. Mitra,
D.Spruijt-Metz, and S. Narayanan, “Multimodal physical activity recognition
by fusing temporal and cepstral information”, IEEE Transactions on Neural
Systems and Rehabilitation Engineering, vol. 18, no. 4, pp. 369–380, 2010.
[101] J. Lin, "Divergence measures based on the Shannon entropy", IEEE Trans. on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
[102] D. J. Litman, J. B. Hirschberg, and M. Swerts, “Predicting automatic speech
recognition performance using prosodic cues”, in Proceedings of the 1st North
American chapter of the Association for Computational Linguistics conference,
Association for Computational Linguistics, 2000, pp. 218–225.
[103] D. Liu and J. Nocedal, “On the limited memory BFGS method for large scale
optimization”, Mathematical programming, vol. 45, no. 1, pp. 503–528, 1989.
[104] Y. Liu, E. Shriberg, and A. Stolcke, “Automatic disfluency identification in
conversational speech using multiple knowledge sources”, in Proc. Eurospeech,
Geneva, Switzerland, vol. 1, 2003, pp. 957–960.
[105] Y. Liu and X. Yao, “Ensemble learning via negative correlation”, Neural Net-
works, no. 10, pp. 1399–1404, 1999.
[106] ——, “Ensemble learning via negative correlation”, Neural Networks, vol. 12,
no. 10, pp. 1399–1404, 1999.
[107] T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble of exemplar-SVMs for
object detection and beyond”, in Proc. ICCV, IEEE, 2011, pp. 89–96.
[108] L. Mangu, H. Soltau, H. Kuo, B. Kingsbury, and G. Saon, "Exploiting diversity for spoken term detection", in Proc. ICASSP, IEEE, 2013, pp. 8282–8286.
[109] M. Marge, S. Banerjee, and A. I. Rudnicky, “Using the Amazon Mechanical
Turk for transcription of spoken language”, in Proc. ICASSP, 2010.
[110] K. Matsuoka, "Noise injection into inputs in back-propagation learning", Systems, Man and Cybernetics, IEEE Transactions on, vol. 22, no. 3, pp. 436–440, 1992.
[111] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, “The SE-
MAINE database: Annotated multimodal records of emotionally coloured con-
versations between a person and a limited agent”, IEEE Trans. on Affective
Computing, pp. 1–14, 2011.
[112] G. J. McLachlan and T. Krishnan, The EM algorithm and extensions. Wiley-
Interscience, 2007, vol. 382.
[113] G. McLachlan and D. Peel, Finite mixture models. Wiley-Interscience, 2000.
[114] P. Melville and R. Mooney, “Constructing diverse classifier ensembles using
artificial training examples”, in Proc. IJCAI, 2003, pp. 505–510.
[115] ——, “Constructing diverse classifier ensembles using artificial training exam-
ples”,inInternational Joint Conferenceon Artificial Intelligence,vol.18,2003,
pp. 505–512.
[116] P. J. Moreno, B. Logan, and B. Raj, "A boosting approach for confidence scoring", in Proceedings of the 7th European Conference on Speech Communication and Technology, 2001.
[117] E. Mower, A. Metallinou, C.-C. Lee, A. Kazemzadeh, C. Busso, S. Lee, and
S. Narayanan, “Interpreting ambiguous emotionalexpressions”, in Proc. ACII,
Sep. 2009.
[118] W. Nash, The Population Biology of Abalone (Haliotis Species) in Tasmania.
I. Blacklip Abalone (H Rubra) from the North Coast and the Islands of Bass
Strait. Sea Fisheries Division, Marine Research Laboratories – Taroona, Dept.
of Primary Industry and Fisheries, Tasmania, 1978.
[119] A. Niculescu-Mizil et al., “Winning the KDD Cup Orange Challenge with
ensemble selection”, in KDD Cup and Workshop in conjunction with KDD,
2009.
[120] D. Oakes, “Direct calculation of the information matrix via the EM”, Journal
of the Royal Statistical Society: Series B (Statistical Methodology), vol. 61, no.
2, pp. 479–482, 1999.
[121] O. Osoba and B. Kosko, “Noise-Enhanced Clustering and Competitive Learn-
ing Algorithms”, Neural Networks, Jan. 2013.
[122] O. Osoba, S. Mitaim, and B. Kosko, “Noise Benefits in the Expectation-
Maximization Algorithm: NEM theorems and Models”, in The International
Joint Conference on Neural Networks (IJCNN), IEEE, 2011, pp. 3178–3183.
[123] ——, “The Noisy Expectation-Maximization Algorithm”, in review, 2013.
[124] C. Parada, M. Dredze, D. Filimonov, and F. Jelinek, “Contextual information
improves OOV detection in speech”, in Proc. NAACL, 2010.
[125] A. Patel and B. Kosko, “Levy Noise Benefits in Neural Signal Detection”, in
Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE Interna-
tional Conference on, vol. 3, 2007, pp. III–1413 –III–1416.
[126] ——, “Stochastic Resonance in Continuous and Spiking Neurons with Levy
Noise”, IEEE Transactions on Neural Networks, vol. 19, no. 12, pp. 1993–
2008, 2008.
[127] ——, “Error-probability noise benefits in threshold neural signal detection”,
Neural Networks, vol. 22, no. 5, pp. 697–706, 2009.
[128] ——, “Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Esti-
mation”, IEEE Signal Processing Letters, vol.17,no.12,pp.1005–1009,2010,
issn: 1070-9908.
[129] ——, “Noise Benefits in Quantizer-Array Correlation Detection and Water-
mark Decoding”, IEEE Transactions on Signal Processing, vol. 59, no. 2,
pp. 488 –505, 2011, issn: 1053-587X.
[130] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus", in Proc. Workshop on Speech and Natural Language, Association for Computational Linguistics, 1992, pp. 357–362.
[131] D. Y. Pavlov, A. Gorodilov, and C. A. Brunk, “BagBoo: a scalable hybrid
bagging-the-boosting model”, in Proc. CIKM, ACM, 2010, pp. 1897–1900.
[132] M. Piotte and M. Chabbert, “The Pragmatic Theory solution to the Netflix
grand prize”, Netflix prize documentation, 2009.
[133] D. Povey, “Discriminative training for large vocabulary speech recognition”,
PhD thesis, Cambridge University, 2003.
[134] ——, “Discriminative training for large vocabulary speech recognition”, PhD
thesis, Cambridge University, 2003.
[135] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M.
Hannemann, P. Motlicek, Y. Qian, and P. Schwarz, “The Kaldi Speech Recog-
nition Toolkit”, in Proc. ASRU, Hilton Waikoloa Village, Big Island, Hawaii,
US: IEEE, Dec. 2011.
[136] J. Quinlan, C4.5: programs for machine learning. Morgan Kaufmann, 1993.
[137] L. Rabiner and B. Juang, “An introduction to hidden Markov models”, ASSP
Magazine, IEEE, vol. 3, no. 1, pp. 4–16, 1986.
[138] L. Rabiner and B.-H. Juang, Fundamentals of speech recognition. Prentice Hall, 1993.
[139] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[140] V. C. Raykar, S. Yu, L. S. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and
L. Moy, “Learning from Crowds”, Journal of Machine Learning Research, vol.
11, pp. 1297–1322, Mar. 2010.
[141] K. Sagae and J. Tsujii, “Dependency Parsing and Domain Adaptation with
LR Models and Parser Ensembles”, in Proc. EMNLP-CoNLL, vol. 2007, 2007,
pp. 1044–1050.
[142] G. Saon and H. Soltau, “Boosting systems for large vocabulary continuous
speech recognition”, Speech Communication, vol. 54, no. 2, pp. 212–218, 2012.
[143] G. Saon, H. Soltau, U. Chaudhari, S. Chu, B. Kingsbury, H. Kuo, L. Mangu,
and D. Povey, “The IBM 2008 GALE Arabic speech transcription system”, in
Proc., 2010, pp. 4378–4381.
[144] G. Saon et al., “The IBM 2008 GALE Arabic speech transcription system”, in
Proc. ICASSP, 2010, pp. 4378–4381.
[145] B. Schuller, S. Steidl, and A. Batliner, “The INTERSPEECH 2009 emotion
challenge”, in Proc. Interspeech, 2009, pp. 312–315.
[146] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. A. Müller, and S. S. Narayanan, "The INTERSPEECH 2010 paralinguistic challenge", in Proc. Interspeech, 2010, pp. 2794–2797.
[147] B. Schuller, S. Steidl, A. Batliner, F. Schiel, and J. Krajewski, “The INTER-
SPEECH 2011 speaker state challenge”, in Proc. Interspeech, 2011, pp. 3201–
3204.
[148] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. Van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, and G. Weiss, "The INTERSPEECH 2012 speaker trait challenge", in Proc. Interspeech, 2012.
[149] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M.
Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A.
Polychroniou, F. Valente, and S. Kim, “The INTERSPEECH 2013 computa-
tional paralinguistics challenge: social signals, conflict, emotion, autism”, in
Proc. Interspeech, 2013.
[150] M. Schulze, “A new monotonic and clone-independent single-winner election
method”, Voting matters, vol. 17, pp. 9–19, 2003.
[151] H. Schwenk, "Using boosting to improve a hybrid HMM/neural network speech recognizer", in Proc. ICASSP, IEEE, vol. 2, 1999, pp. 1009–1012.
[152] F. Sebastiani, “Machine learning in automated text categorization”, ACM
Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[153] M. S. Seigel and P. C. Woodland, "Combining information sources for confidence estimation with CRF models", in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2011, pp. 905–908.
[154] C. A. Shipp and L. I. Kuncheva, "Relationships between combination methods and measures of diversity in combining classifiers", Information Fusion, vol. 3, no. 2, pp. 135–148, 2002.
[155] O. Siohan, B. Ramabhadran, and B. Kingsbury, "Constructing ensembles of ASR systems using randomized decision trees", in Proc. ICASSP, IEEE, vol. 1, 2005, pp. 197–200.
[156] J. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. Johannes, "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus", in Proc. Annual Symposium on Computer Application in Medical Care, 1988, p. 261.
[157] P. Smolensky, “Information processing in dynamical systems: foundations of
harmony theory”, 1986.
[158] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi, "Inferring ground truth from subjective labeling of Venus images", in Proc. NIPS, 1995, pp. 1085–1092.
[159] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng, “Cheap and fast—but is
it good?: Evaluating non-expert annotations for natural language tasks”, in
Proc. EMNLP, 2008, pp. 254–263.
[160] H. Soltau, G. Saon, B. Kingsbury, H. K. J. Kuo, L. Mangu, D. Povey, and A.
Emami, “Advances in Arabic speech transcription at IBM under the DARPA
GALE program”, IEEE Transactions on Audio, Speech, and Language Pro-
cessing, vol. 17, no. 5, pp. 884–894, 2009.
[161] A. Sorokin and D. Forsyth, "Utility data annotation with Amazon Mechanical Turk", in Proc. CVPR, 2008, pp. 1–8.
[162] D. Stallard, F. Choi, C. L. Kao, K. Krstovski, P. Natarajan, R. Prasad, S. Saleem, and K. Subramanian, "The BBN 2007 Displayless English/Iraqi Speech-to-Speech Translation System", in Proc. Interspeech, 2007.
[163] A. Stolcke, “SRILM - an extensible language modeling toolkit”, in Proc. IC-
SLP, 2002, pp. 901–904.
[164] Y. Tachioka and S. Watanabe, “Discriminative training of acoustic models for
system combination”, in Interspeech, 2013.
[165] A. Tanas, M. A. Little, P. E. McSharry, and L. O. Raming, "Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests", IEEE Transactions on Biomedical Engineering, vol. 57, no. 4, pp. 884–893, 2010.
[166] H. Teicher, “Identifiability of finite mixtures”, The Annals of Mathematical
Statistics, vol. 34, no. 4, pp. 1265–1269, 1963.
[167] E. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition", in Proc. CoNLL, 2003, pp. 142–147.
[168] A. Töscher, M. Jahrer, and R. M. Bell, "The BigChaos solution to the Netflix grand prize", Netflix prize documentation, 2009.
[169] K. Tumer and J. Ghosh, "Analysis of decision boundaries in linearly combined neural classifiers", Pattern Recognition, vol. 29, no. 2, pp. 341–348, 1996.
[170] ——, “Analysis of decision boundaries in linearly combined neural classifiers”,
Pattern Recognition, vol. 29, no. 2, pp. 341–348, 1996.
[171] G. Tur, A. Stolcke, L. Voss, J. Dowding, B. Favre, R. Fernandez, M. Frampton, M. Frandsen, C. Frederickson, and M. Graciarena, "The CALO meeting speech recognition and understanding system", in Proc. SLT, IEEE, 2008, pp. 69–72.
[172] G. Tur, A. Stolcke, L. Voss, S. Peters, D. Hakkani-Tur, J. Dowding, B. Favre, R. Fernández, M. Frampton, M. Frandsen, C. Frederickson, M. Graciarena, D. Kintzing, K. Leveque, S. Mason, J. Niekrasz, M. Purver, K. Riedhammer, E. Shriberg, J. Tien, D. Vergyri, and F. Yang, "The CALO meeting assistant system", IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1601–1611, 2010.
[173] N. Ueda and R. Nakano, “Generalization error of ensemble estimators”, in
IEEE International Conference on Neural Networks, vol. 1, 1996, pp. 90–95.
[174] ——, “Generalization error of ensemble estimators”, in Proc. ICNN, vol. 1,
1996, pp. 90–95.
[175] G. Valentini and T. Dietterich, “Bias-variance analysis of support vector ma-
chines for the development of svm-based ensemble methods”, The Journal of
Machine Learning Research, vol. 5, pp. 725–775, 2004.
[176] V. Vapnik, The nature of statistical learning theory. Springer, 1999.
[177] W. Wang, P. Lu, and Y. Yan, “An improved hierarchical speaker clustering”,
Acta Acoustica, 2006.
[178] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, and A. Stolcke, “Neural-
network based measures of confidence for word recognition”, in Acoustics,
Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International
Conference on, IEEE, vol. 2, 1997, pp. 887–890.
[179] P. Welinder, S. Branson, S. Belongie, and P. Perona, “The multidimensional
wisdom of crowds”, 2010.
[180] F. Wessel, R. Schluter, K. Macherey, and H. Ney, “Confidence measures for
largevocabularycontinuousspeech recognition”,Speech and Audio Processing,
IEEE Transactions on, vol. 9, no. 3, pp. 288–298, 2001.
[181] C. White, J. Droppo, A. Acero, and J. Odell, “Maximum entropy confidence
estimation for speech recognition”, in Acoustics, Speech and Signal Processing,
2007. ICASSP 2007. IEEE International Conference on, IEEE, vol. 4, 2007,
pp. IV–809.
[182] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan, “Whose vote
should count more: Optimal integration of labels from labelers of unknown
expertise”, in Proc. NIPS, vol. 22, 2009, pp. 2035–2043.
[183] M. Wilde and B. Kosko, “Quantum forbidden-interval theorems for stochastic
resonance”, Journal of Physical A: Mathematical Theory, vol. 42, no. 46, 2009.
[184] M. Wollmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie, "Abandoning emotional classes - Towards continuous emotion recognition with modeling of long-range dependencies", in Proc. Interspeech, 2008, pp. 597–600.
[185] P. Xu, D. Karakos, and S. Khudanpur, "Self-supervised discriminative training of statistical language models", in Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on, IEEE, 2009, pp. 317–322.
[186] J. Xue and Y. Zhao, “Random forests of phonetic decision trees for acoustic
modeling in conversational speech recognition”, IEEE Transactions on Audio,
Speech, and Language Processing, vol. 16, no. 3, pp. 519–528, 2008.
[187] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy,
and G. J. Dy, “Modeling annotator expertise: Learning when everyone knows
a bit of something”, in Proc. AISTATS, 2010.
[188] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, "The HTK book", Cambridge University Engineering Department, 2002.
[189] B. Zhang, A. Sethy, T. N. Sainath, and B. Ramabhadran, "Application specific loss minimization using gradient boosting", in Proc. ICASSP, 2011, pp. 4880–4883.