EFFICIENT ESTIMATION AND DISCRIMINATIVE TRAINING FOR THE
TOTAL VARIABILITY MODEL
by
Ruchir Travadi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2019
Copyright 2019 Ruchir Travadi
Dedication
To my family
Acknowledgments
My doctoral endeavor has only been made possible with the ceaseless help and
support that I have received from a number of people around me. I am deeply
thankful to all those who have supported me throughout this journey and made it
a memorable part of my life.
First and foremost I would like to express my sincerest gratitude towards my
advisor Prof. Shrikanth Narayanan for providing me the opportunity to pursue my
doctoral studies. His constant guidance and encouragement have been extremely
crucial in motivating my progress, and the research environment that he has nur-
tured at the Signal Analysis and Interpretation Laboratory has played an impor-
tant role in the development of my thesis.
I would like to thank Prof. Antonio Ortega, Prof. Mahdi Soltanolkotabi and
Prof. Joseph Lim for serving on my qualifying exam committee, Prof. Louis
Goldstein for serving on my defense committee and Prof. Panayiotis Georgiou for
serving on both. Their insightful feedback has been vital in the improvement of
my thesis.
During my stay at USC I’ve had the opportunity to expand my knowledge in
a variety of different subjects through the courses I’ve taken and I am thankful
to all the faculty members under whom I’ve had the pleasure of studying them. I
would also like to thank USC and its staff for providing the perfect environment
and services for conducting my research.
Discussions with my colleagues at the Signal Analysis and Interpretation Lab
have been extremely pivotal in the development of my research work, and I am
deeply grateful to them for their help on various research projects.
My life outside of studies and research has been made especially memorable
because of my friends and it is my pleasure to thank all of them, especially Akshay,
Aamir, Praveen and Arvind. I am also extremely thankful to Shelby for her help
and affection during the stressful period of writing my dissertation.
Finally, I would be remiss not to thank my family. They have been an eternal
source of support throughout my life and I am truly indebted to them for that.
Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Research directions
1.1.1 Efficient estimation
1.1.2 Statistical analysis of model assumptions
1.1.3 Discriminative training
1.2 Outline
2 System Framework and Model Background
2.1 Speech processing system framework
2.2 Speaker and Language Identification
2.2.1 Problem formulation
2.2.2 Evaluation metrics
2.3 Statistical models
2.3.1 Notation
2.3.2 Gaussian Mixture Models
2.3.3 Joint Factor Analysis
2.3.4 Total Variability Model
2.4 Parameter estimation and inference for the Total Variability Model
2.4.1 Parameter estimation
2.4.2 Inference
3 Efficient Estimation
3.1 Estimation using randomized SVD
3.1.1 Likelihood function
3.1.2 Maximization using approximation
3.2 Advantages of randomized SVD estimation
3.2.1 Computational complexity of parameter estimation
3.2.2 Computational complexity of ivector extraction
3.2.3 Interpretability of ivectors
3.3 Experiments
3.3.1 Description of databases and experimental setup
3.3.2 Results and analysis
4 Statistical Analysis of Model Assumptions
4.1 Distribution of feature vectors
4.1.1 Distribution free formulation
4.1.2 Asymptotic distribution of Baum-Welch statistics
4.1.3 Likelihood function
4.1.4 Posterior distribution of the ivector
4.1.5 Statistical test
4.1.6 Implications
4.2 Globally tied component weights and covariance matrices
4.2.1 Statistical test
4.2.2 Explaining the RSVD results
4.3 Low rank nature of variability in the mean supervector
4.3.1 Statistical test
4.3.2 Generalization
4.3.3 Parameter estimation
4.3.4 Results and analysis
4.4 Conclusion
5 Discriminative Training
5.1 Introduction
5.2 Deep Neural Network Embeddings for speaker verification
5.3 TVM in discriminatively trained systems
5.3.1 Distribution-free reformulation of TVM
5.3.2 Discriminative training for TVM
5.3.3 Hybrid architecture: TVM as a network layer
5.3.4 Gradients for the Total Variability layer
5.4 Experiments
5.4.1 Datasets
5.4.2 System Details
5.4.3 Results
5.5 Conclusion
6 Conclusion and Future Work
6.1 Summary of proposed ideas and contributions
6.1.1 Efficient Estimation
6.1.2 Statistical Analysis of Model Assumptions
6.1.3 Discriminative Training
6.2 Future Work
6.2.1 Applicability to other problems and modalities
6.2.2 Acoustic model adaptation for automatic speech recognition
Reference List
List of Tables

3.1 Algorithmic complexity and computation time for parameter estimation and ivector extraction
3.2 Equal Error Rate (%) on SRE 2008 data
3.3 Equal Error Rate (%) on RATS LID data
3.4 Correlation coefficients between EER and negative log likelihood
4.1 Equal Error Rate (%) on SRE 2008 and RATS LID data
5.1 Distribution of training data
5.2 Results on the SITW eval set
List of Figures

1.1 Factors of variability in signals
1.2 Automatic information extraction from speech and its applications
2.1 System framework for speech processing applications
2.2 Detection Error Tradeoff (DET) curve and Equal Error Rate (EER)
3.1 Singular Values of $\widetilde{F}$
3.2 First two ivector dimensions for WSJ eval92 data
3.3 Negative log likelihood per frame for different models on SRE 2008
3.4 Negative log likelihood per frame for different models on RATS LID
3.5 EER(%) for different models on SRE 2008
3.6 EER(%) for different models on RATS LID
4.1 Histograms of individual dimensions of $\hat{F}_{kc}$ for SRE 2008 data
4.2 Scatter plots of $\hat{F}_{kc}$ projected to 2 dimensions on SRE 2008 data
4.3 Doornik-Hansen Multivariate Normality statistics for SRE 2008 data
4.4 $\bar{p}_{uc}$ statistics for different utterances of RATS LID data
4.5 $S_I$ statistics for SRE 2008 and RATS LID data
4.6 Estimated $B_c$ at the end of EM iteration 1 for SRE 2008 male data
5.1 Loss computation for discriminative training of TVM
5.2 X-vector and hybrid system architectures
6.1 Regularized acoustic model training schematic
6.2 Regularized acoustic model training schematic
Abstract
Signals encountered in real life are affected by multiple factors of variability. Infor-
mation about these sources gets encoded in the signal, but an appropriate repre-
sentation needs to be derived from the signal in order to extract this information
automatically. Particularly in speech analysis, the Total Variability Model has
been widely used as a mechanism to extract a low-dimensional representation of
the speech signal especially for applications such as speaker and language identifi-
cation. This model has conventionally been posed as a generative latent variable
model and the model parameters are typically estimated using several iterations
of the Expectation Maximization (EM) algorithm which can be computationally
demanding. We propose a computationally efficient randomized algorithm for
parameter estimation that relies on some approximations within the model likelihood
function. We show that this algorithm reduces the computation time required
for parameter estimation by a significant factor while also providing performance
improvement on a language identification task. We then present extensive statisti-
cal analysis aimed at verifying the validity of some of the assumptions made within
the generative Total Variability Model formulation. We show that many of these
assumptions are not valid in practice and propose a way to reformulate the model
in a distribution-free and discriminative manner. Along with discriminative train-
ing, this reformulation enables the integration of Deep Neural Networks within
the model. In addition, it also allows the model to be viewed abstractly as a net-
work layer that can be incorporated in other Deep Neural Network architectures.
Through experiments on speaker verification we show that incorporating this layer
leads to a significant improvement in performance.
Chapter 1
Introduction
A number of real life applications rely on the processing of recorded signal streams
in order to extract certain types of information. However, these signals are usually
affected by (and therefore encode information about) a variety of different sources
of variability. For example, as shown in Figure 1.1, a speech signal can exhibit
variation depending on identity of the speaker or their characteristics such as age
or gender, the language being spoken as well as the emotional state of the speaker.
Similarly, biological signals can also be affected by physical, cognitive as well as
emotional state of the subject.
Systems can be built for automatic extraction of such information from the
signal. For example, different systems have been built in order to automatically
extract from a speech signal information such as the sequence of words being
communicated [21], the language in which they are spoken [11], the identity of the
speaker [18] and their characteristics (such as age and gender [30]), the degree to
which the signal is corrupted by noise [22], as well as the emotional state of the
speaker [28] and a wide array of other paralinguistic information [44, 43, 45].
These systems which can extract such information automatically from speech
are useful in a number of applications (Figure 1.2). For example, virtual assistants
[5] that allow users to interact with their devices through voice commands are
growing in popularity, and rely extensively on these systems. In addition, such
systems are also useful in security and defense applications [1], in commercial
services such as call centers [57] and air traffic control [8], in a clinical setting
for enabling communication between doctors and patients that speak different
languages [13], in behavioral analysis [37], and also in providing assistance during
emergency situations for situational awareness and resource deployment [32].

Figure 1.1: Factors of variability in signals
A crucial component of these systems involves the transformation of the raw
source signal into a form of representation that is suitable for the intended appli-
cation. In particular, the Total Variability Model [10] is a very versatile model
that has found use in some capacity within almost all of the applications men-
tioned earlier. Most commonly, it is used for obtaining a low-dimensional signal
representation in speaker and language identification systems [10, 11]. However, it
has also been used for speaker adaptation in automatic speech recognition [42, 47],
for adaptation to unknown noise conditions in SNR estimation [38], and also for
paralinguistic information extraction [55, 2].
Typically in these systems, the speech signal is first transformed into a sequence
of vectors known as feature vectors. The Total Variability Model is formulated as a
generative model where the distribution of an observed collection of feature vectors
is assumed to be a Gaussian Mixture Model (GMM), with its parameters varying
across different sequences.

Figure 1.2: Automatic information extraction from speech and its applications (information extraction: speech recognition, language identification, speaker identification, emotion recognition; applications: voice assistants, speech translation, security and defense, call centers)

Historically, the roots of this model can be traced
back to speaker identification systems that used a separate GMM to estimate the
distribution of every enrolled speaker [40]. This was followed by the systems that
trained a single GMM over combined training data, also known as a Universal
Background Model (UBM), and then adapted it to different speakers, in order
to solve the problem of data sparsity [39]. Later, generative models began to be
used as a front-end for discriminative methods. Initially, the front-end consisted
of UBM adaptation statistics that were stacked into the so-called supervectors,
and classification was performed using a Support Vector Machine (SVM) [3]. This
paved the way for Joint Factor Analysis (JFA) [26], which attempted to capture
the variability in high-dimensional supervectors within separate low-dimensional
vectors that represented source and channel factors. That eventually motivated
TVM formulation, where a single low-dimensional vector representation is instead
used to capture all sources of variability, on which compensation methods such
as Linear Discriminant Analysis (LDA), Within Class Covariance Normalization
(WCCN) [19] or Nuisance Attribute Projection (NAP) [4] are subsequently applied.
1.1 Research directions
Our work is focused on three aspects related to the Total Variability Model: com-
putationally efficient methods for parameter estimation and inference, statistical
analysis of the generative assumptions made within the model, and a distribution-
free reformulation of the model which enables it to be integrated within Deep
Neural Network architectures that can be trained discriminatively. In this section,
we first discuss our motivation behind this research, its relation to other work
in literature as well as the contributions made and their implications. Then, we
summarize some of the proposed directions for future work on extending the ideas
emerging from this research.
1.1.1 Efficient estimation
In the conventional formulation of the Total Variability Model, the generative
distribution underlying the collection of feature vectors from an utterance is char-
acterized using a low-dimensional vector representation that is known as the ivec-
tor for the utterance. Typically, the model parameters are estimated using several
iterations of the Expectation Maximization (EM) algorithm, where the ivectors for
utterances in the training set are treated as hidden variables. However, estimation
using the EM algorithm can be computationally demanding, and is only guaran-
teed to converge to a local optimum of the likelihood function. These drawbacks
of the EM algorithm were the primary motivation behind our research on better
estimation methods for the model. We analyzed the model likelihood function,
identified the bottlenecks that made it difficult to maximize it directly, and pro-
posed justifiable approximations which made it easier to do so in a computationally
efficient manner.
Related work
One way to reduce the computational complexity of the EM algorithm is to use
a more efficient procedure for ivector extraction, since for every iteration of the
EM algorithm the ivectors need to be extracted for all training utterances in order
to compute the statistics required for model updates. Therefore, many modifica-
tions have been proposed in literature that attempt to reduce the computational
complexity of ivector extraction by making some approximations. In [16] assump-
tions such as constant GMM component alignment and approximate simultane-
ous orthogonalization were used to simplify ivector extraction. The variational
Bayes algorithm was used in [25] to reduce the memory footprint of ivector extrac-
tion. Pre-normalization of Baum-Welch statistics was used for speeding up ivector
extraction in [31], and approximate matrix factorization as well as the conjugate
gradient method were used for increasing the efficiency of parameter estimation
and ivector extraction in [9].
Contributions
While these methods can be used to increase the efficiency of the estimation pro-
cedure, they still operate under the iterative framework of the EM algorithm.
However, the problem of estimating the parameters of the Total Variability Model
is closely tied to many other problems involving low rank matrix approximations,
and randomized algorithms provide a computationally efficient solution for such
problems [17]. We investigated the possibility of employing such algorithms for esti-
mating the parameters of the Total Variability Model. Unlike the EM algorithm,
we used an estimation procedure that attempts to maximize the marginalized
likelihood function directly (by making some approximations), and uses a random-
ized Singular Value Decomposition (SVD) algorithm which leads to a significant
reduction in the computational complexity. The approximations made within the
likelihood function are suitably justified if the assumptions made within the model
are valid, and the utterances used in the training set are long enough. In addition,
we showed that the computational complexity of ivector extraction can also be
reduced by using similar approximations.
1.1.2 Statistical analysis of model assumptions
The approximations we used for simplifying the likelihood function for efficient
estimation using the randomized SVD algorithm relied on the validity of model
assumptions. Therefore, it was necessary to verify the validity of these assumptions
in order to justify the approximations. However, when we performed statistical
analysis for this purpose, we found that many of the assumptions were in fact
inconsistent with the observed data. This motivated the need for reformulating
the model in a way that could explain the success of the model despite this incon-
sistency, and also possibly provide insights on ways in which it can be improved.
Related work
A number of modifications have been proposed in literature that modify a part of
the model or use it in a slightly different way compared to the way it was origi-
nally formulated. Many of these methods focus on post-processing of the ivector
representation, such as normalization of the ivector length [14] or modifications to
the classifiers that operate on the ivectors [24]. Some systems use different feature
representations compared to those traditionally used, such as bottleneck features
[33] or a fusion of many different feature types [46]. Modifications to the model
structure such as changing the prior on ivectors [52] or phonetically aware compu-
tation of Baum-Welch statistics using Deep Neural Networks (DNNs) trained for
speech recognition have also been proposed [27, 29].
Contributions
While these approaches change a part of the model (or the broader system in which
the model is used) in different ways, they still rely on many of the assumptions
made in the conventional formulation in order to arrive at the representation.
However, a systematic evaluation of the validity of these assumptions has been
missing from literature. We enumerated the important assumptions made in the
conventional formulation and used statistical tests to analyze their validity. From
this analysis, we found that many of them were inconsistent with the observed
data. We showed that the model can be reformulated in a way that preserves all
its important aspects (the likelihood function, and procedures for estimation and
inference) without relying on assumptions about the form of distribution under-
lying the feature vectors. We also showed that assumption about the low rank
nature of some of the model parameters needed to be modified, and incorporating
a residual term in the model in order to account for this observation led to an
improvement in performance.
1.1.3 Discriminative training
Since the Total Variability Model has conventionally been viewed as a generative
model, its parameters are estimated by maximizing the model likelihood function.
However, under the distribution-free formulation proposed as a result of our sta-
tistical analysis, the model can potentially be cast in a discriminative form. This
discriminative reformulation of the Total Variability Model creates opportunities
for integrating DNNs within the model as well as making it trainable discrimina-
tively on large labeled corpora.
Related work
The discriminative training approach that we propose differs from the conventional
TVM formulation in two aspects. First, it uses a Deep Neural Network (DNN) to
compute statistics for estimation. Second, the model is trained discriminatively
using backpropagation on labeled data as opposed to being trained generatively
using the EM algorithm on unlabeled data.
Individually, these variants have been proposed in the literature before.
Discriminative training was used for TVM training in [15], but their system did
not use a DNN. DNNs have previously been used to compute sufficient statis-
tics for parameter estimation in [29, 41, 35]. But these systems were not trained
discriminatively. Instead, they use a fixed DNN that was trained separately to
predict phonetic classes for Automatic Speech Recognition (ASR) which requires
transcriptions for training.
Apart from these approaches that use a DNN as a component within the Total
Variability Model, a number of other methods that use Deep Neural Networks
(DNN) for the speaker verification task have been proposed in the literature, both
for text-dependent [56, 20] as well as text-independent speaker verification [6, 58].
DNN embeddings known as x-vectors were proposed in [49] where they showed a
significant performance improvement over the Total Variability Model on speaker
verification.
Contributions
We propose a technique for discriminatively training TVM that is motivated by a
distribution-free reformulation of the model. This technique enables TVM to be
trained on labeled corpora, while also allowing for integration of DNNs within the
model architecture. In addition, it also enables TVM to be used as a network layer
in an arbitrary DNN embedding architecture. In particular, we propose a hybrid
architecture that uses TVM instead of a global statistics pooling layer in the x-
vector architecture. By means of experiments on speaker verification we show that
this hybrid architecture leads to a significant performance improvement over the
x-vector baseline.
1.2 Outline
The rest of this thesis is organized as follows:
• In Chapter 2, we briefly describe the system framework and the problem
formulation for speaker and language identification tasks. We review some
of the early statistical models used for these applications and describe the
9
conventional formulation as well as training and inference algorithms for the
Total Variability Model.
• In Chapter 3, we show that the task of estimating the parameters that max-
imize the model likelihood can be reduced to an SVD computation over
a matrix of statistics by making some approximations in the marginalized
model likelihood function. We present experimental results on speaker and
language identification tasks to compare the performance obtained using this
estimation method with that obtained using the EM algorithm.
• In Chapter 4, we show by means of statistical analysis that some of the model
assumptions are not consistent with the data on which it is used in prac-
tice. We present a distribution free reformulation of the model, and propose
other ways to generalize the model that lead to an improved performance on
speaker and language identification tasks.
• In Chapter 5, we propose a discriminative training approach for the Total
Variability Model that is motivated by the distribution-free reformulation
discussed in Chapter 4. We also show that the idea can be extended to allow
the model to be viewed as a trainable layer in arbitrary Deep Neural Network
architectures. Through experiments on speaker verification we show that the
inclusion of this layer results in a significant improvement in performance.
• In Chapter 6, we conclude with a summary of important findings and a
discussion on possible directions for future work.
Chapter 2
System Framework and Model
Background
In this chapter, we first provide a brief description of the basic framework that is
commonly used in many speech processing systems. Then we describe the prob-
lem formulation as well as the evaluation metrics used for speaker and language
identification applications. We provide a historical context for the Total Variabil-
ity Model formulation by reviewing some of the statistical models that had been
proposed for these applications prior to it in literature. Finally, we describe in
detail the parameter estimation and inference algorithms commonly used for the
Total Variability Model.
2.1 Speech processing system framework
The spectral content of a speech signal is affected by a number of factors such as the
phoneme being spoken, the shape of the vocal tract and pitch (among many others).
Since the spoken phoneme changes over time, the spectral properties of the signal
also change, and it is therefore not a stationary signal. However, for the purpose
of analysis it can be considered quasi-stationary, such that the spectral properties
are roughly constant when examined over very small segments. Therefore in most
speech processing systems, the signal is first windowed into small but overlapping
segments (also known as frames) of length roughly 20 ms each, as shown in Figure
2.1.

Figure 2.1: System framework for speech processing applications (Speech Signal → Windowing → Feature Extraction → Model → Label Prediction)

Then, each frame is transformed into a feature vector, which is typically some
low-dimensional representation of its power spectrum. The Mel Frequency Cepstral
Coefficients (MFCCs) are the most commonly used feature vector representation
scheme. Once the sequence of feature vectors has been obtained, it is fed as input
to a model which then converts it to a predicted label.
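As a concrete illustration of this front-end, the short sketch below computes 20-dimensional MFCCs over roughly 20 ms overlapping windows. The use of the librosa library, the filename and the exact parameter values are illustrative assumptions, not part of the original system description.

```python
import librosa  # assumed front-end library; any MFCC implementation would do

# Hypothetical example: s_u(n) -> sequence of feature vectors x_ut.
signal, sr = librosa.load("utterance.wav", sr=16000)        # placeholder filename
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20,
                            n_fft=int(0.020 * sr),          # ~20 ms window
                            hop_length=int(0.010 * sr))     # ~10 ms hop (overlapping frames)
X_u = mfcc.T                                                # shape (T_u, D) with D = 20
```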
2.2 Speaker and Language Identification
2.2.1 Problem formulation
The task of a speaker (or language) identification system is to determine the iden-
tity of the speaker (or the spoken language) corresponding to a speech utterance.
The identity of a speaker (or a spoken language) is associated with a sample utter-
ance (or a set of sample utterances) provided by the speaker (or spoken in the
language) during the enrollment phase. There are two variants of the evaluation
phase for these tasks:
• Verification: In this case, a claimed identity is provided along with a test
utterance. The task is to determine if the test utterance indeed corresponds
to the claimed identity. The label to predict in this case is therefore a binary
variable.
• Recognition: In this case, no claimed identity is provided. The task is to
determine which (or possibly none) of the enrolled speakers (or languages)
corresponds to the test utterance. The label to predict in this case is therefore
a nominal variable.
2.2.2 Evaluation metrics
Detection Error Tradeoff and Equal Error Rate
These metrics are suitable for the verification task. For this task, every pair of test
utterance and claimed identity is defined as a trial. A model is used to predict a
real valued score for every trial. The claim is assumed to be true if the score is
above a particular threshold τ, and false if it is below. There are potentially two
types of errors that can arise:
• A false alarm occurs if a false claim is adjudged to be true. The False Alarm
Rate (FAR) is defined as:
\[
\text{FAR (\%)} = \frac{\text{Number of false alarms}}{\text{Total number of trials}} \times 100 \tag{2.1}
\]
• A missed detection occurs if a true claim is adjudged to be false. The Missed
Detection Rate (MDR) is defined as:
\[
\text{MDR (\%)} = \frac{\text{Number of missed detections}}{\text{Total number of trials}} \times 100 \tag{2.2}
\]
For a given model, there is a tradeoff between the relative rate of false alarms
and missed detections, depending on the value of the threshold τ. This can be
summarized by means of a Detection Error Tradeoff (DET) curve, which plots the
FAR against the MDR for different values of the threshold τ as shown in Figure
2.2. The FAR (or MDR) at which the two rates are equal is known as the Equal
Error Rate (EER).
Figure 2.2: Detection Error Tradeoff (DET) curve and Equal Error Rate (EER) (axes: FAR(τ) vs. MDR(τ))
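To make the metric concrete, the following minimal sketch computes FAR, MDR and the EER from a set of trial scores by sweeping the threshold τ, following Equations (2.1)–(2.2). The function name and the score/label conventions are illustrative assumptions.

```python
import numpy as np

def equal_error_rate(scores, is_true_claim):
    """Sweep the threshold tau and return the EER (%) where FAR and MDR coincide."""
    scores = np.asarray(scores, dtype=float)
    is_true_claim = np.asarray(is_true_claim, dtype=bool)
    n_trials = len(scores)
    best_gap, eer = np.inf, None
    for tau in np.unique(scores):
        accept = scores >= tau                      # claim adjudged true
        false_alarms = np.sum(accept & ~is_true_claim)
        missed = np.sum(~accept & is_true_claim)
        far = 100.0 * false_alarms / n_trials       # Eq. (2.1)
        mdr = 100.0 * missed / n_trials             # Eq. (2.2)
        if abs(far - mdr) < best_gap:
            best_gap, eer = abs(far - mdr), 0.5 * (far + mdr)
    return eer
```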
Recognition Error
This metric is suitable for the recognition task. An error is said to occur if the
estimated label does not match the true label. Recognition Error is then defined
as the percentage of errors over the entire test set:
\[
\text{Recognition Error (\%)} = \frac{\text{Number of errors}}{\text{Total number of test utterances}} \times 100 \tag{2.3}
\]
2.3 Statistical models
In this section, we first describe the notation used to denote different model vari-
ables. Then, we briefly review some of the statistical models that have been used
for speaker and language identification tasks in literature. This is not an exhaus-
tive list, but the objective is to provide a historical context for the Total Variability
Model which is our focus of research.
2.3.1 Notation
Let $s_u(n)$ denote a sequence of samples from a speech signal corresponding to
an utterance indexed by $u$. A speech corpus comprising $U$ utterances can be
represented as $\{s_u(n)\}_{u=1}^{U}$. Let $\{w_{ut}(n)\}_{t=1}^{T_u}$ denote the collection of samples from
$T_u$ frames for the signal $s_u(n)$, $X_u = \{x_{ut}\}_{t=1}^{T_u}$ denote the corresponding feature
vector sequence, and $X = \{X_u\}_{u=1}^{U}$ denote the entire collection of feature vectors
from all utterances in the corpus. Let $D$ be the dimensionality of each feature
vector: $x_{ut} \in \mathbb{R}^{D}$. Then, the model takes $X_u$ as input, and converts it to a
prediction $\hat{y}_u$ corresponding to a true label $y_u$. Depending on the application, the
label representation $y_u$ could be a scalar, vector or a sequence.
2.3.2 Gaussian Mixture Models
Gaussian Mixture Models (GMMs) were one of the earliest statistical models used
for speaker recognition [40]. In this model, it is assumed that the probability
density function $f(x_{ut})$ corresponding to the feature vectors $x_{ut}$ is a mixture of $C$
Gaussian components:
\[
f(x_{ut}) = \sum_{c=1}^{C} \frac{p_{uc}}{(2\pi)^{\frac{D}{2}} |\Sigma_{uc}|^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (x_{ut} - \mu_{uc})^{T} \Sigma_{uc}^{-1} (x_{ut} - \mu_{uc}) \right) \tag{2.4}
\]
where $\{p_{uc} \in \mathbb{R}\}_{c=1}^{C}$ are known as component weights (which sum to 1),
$\{\mu_{uc} \in \mathbb{R}^{D}\}_{c=1}^{C}$ are known as component mean vectors, and $\{\Sigma_{uc} \in \mathbb{R}^{D \times D}\}_{c=1}^{C}$ are
known as component covariance matrices.
It is assumed that the model parameters $\{p_{uc}, \mu_{uc}, \Sigma_{uc}\}_{c=1}^{C}$ remain constant
across different utterances from the same speaker, but differ across different speak-
ers. The parameters for each speaker are estimated from their enrollment speech
data using the Expectation Maximization (EM) algorithm. In the testing phase,
the log likelihood of feature vectors from the test utterance is evaluated against
every speaker model, and used as the model score for the corresponding speaker.
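As a sketch of this scoring procedure, the per-frame log likelihood of a test utterance under one speaker's GMM (Eq. 2.4) can be computed as below. The sketch assumes, for simplicity, diagonal covariance matrices; all variable names are illustrative.

```python
import numpy as np

def gmm_avg_log_likelihood(X, weights, means, variances):
    """Average per-frame log f(x_ut) under a diagonal-covariance GMM (Eq. 2.4).

    X: (T, D) frames; weights: (C,); means: (C, D); variances: (C, D).
    """
    T, D = X.shape
    log_comp = np.empty((T, len(weights)))
    for c, (p_c, mu_c, var_c) in enumerate(zip(weights, means, variances)):
        diff = X - mu_c
        log_comp[:, c] = (np.log(p_c)
                          - 0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var_c))
                                   + np.sum(diff * diff / var_c, axis=1)))
    # log-sum-exp over the C components, then average over frames
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))))
```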
GMM-UBM system
Estimation of GMM parameters separately for each speaker created a problem of
data sparsity, since data from a single speaker was often insufficient to estimate
all the GMM parameters robustly. This led to the practice of using the collective
enrollment data from all speakers for training a GMM, which was also known as a
Universal Background Model (UBM). The speaker specific GMM parameters were
then initialized using the UBM parameters $\{p_{0c}, \mu_{0c}, \Sigma_{0c}\}_{c=1}^{C}$, and then adapted to
the specific speaker using their enrollment data [39].
2.3.3 Joint Factor Analysis
Although the GMM-UBM system solved the problem of data sparsity for GMMs,
it still allocated a large number of parameters for every speaker. The Joint Fac-
tor Analysis (JFA) model was proposed to reduce the number of speaker-specific
parameters in the density function.
The core idea behind this model is to restrict the degrees of freedom for the
manner in which model parameters are allowed to differ across different speakers.
This is achieved by formulating the model as a generative latent variable model. In
other words, it is assumed that the distribution of the observed feature vectors from
an utterance depends on some utterance-independent model parameters as well as
some (low-dimensional) utterance-dependent latent variables which characterize
the variability arising from different sources.
In particular, it is assumed that the component weights and covariance matri-
ces are globally tied across all utterances: $p_{uc} = p_c$ and $\Sigma_{uc} = \Sigma_c$, so the only
difference in feature vector distribution across different utterances is due to vari-
ability in component means $\mu_{uc}$. However, these differences are not allowed to
be arbitrary since the model introduces dependence between variability in $\mu_{uc}$
across different components $c$. For this purpose, a mean supervector $M_u \in \mathbb{R}^{CD}$ is
defined as a vector formed by stacking the $D$-dimensional mean vectors $\mu_{uc}$ from
all $C$ components:
\[
M_u = \begin{bmatrix} \mu_{u1}^{T} & \ldots & \mu_{uC}^{T} \end{bmatrix}^{T} \tag{2.5}
\]
Then, it is assumed that the variability in the mean supervector $M_u$ across differ-
ent utterances $u$ arises due to differences in speakers, recording channels and some
utterance specific factors, where the speaker and channel variabilities lie along a
low-dimensional subspace. To describe the idea formally, for every utterance $u$
from a speaker $s$ the mean supervector $M_u(s)$ is assumed to have the following
form:
\[
M_u(s) = M_0 + U x_u(s) + V y(s) + D z(s) \tag{2.6}
\]
where $M_0 \in \mathbb{R}^{CD}$ is a supervector formed by stacking the UBM component means
$\mu_{0c}$, $U \in \mathbb{R}^{CD \times K_{\mathrm{chn}}}$ is known as the channel factor loading matrix ($K_{\mathrm{chn}} \ll CD$),
elements of $x_u(s) \in \mathbb{R}^{K_{\mathrm{chn}}}$ are known as the channel factors, $V \in \mathbb{R}^{CD \times K_{\mathrm{spk}}}$ is
known as the speaker factor loading matrix ($K_{\mathrm{spk}} \ll CD$), elements of $y(s) \in \mathbb{R}^{K_{\mathrm{spk}}}$
are known as the speaker factors, and $D \in \mathbb{R}^{CD \times CD}$ is a diagonal matrix. The prior
distributions on vectors $x_u(s), y(s), z(s)$ are assumed to be standard normal:
\[
x_u(s), y(s), z(s) \sim \mathcal{N}(0, I) \tag{2.7}
\]
2.3.4 Total Variability Model
The formulation of the Total Variability Model is similar to that for the Joint Factor
Analysis model, in the sense that it also imposes restrictions on the variability for
the mean supervector $M_u$. However, unlike the Joint Factor Analysis model, it
does not use separate subspaces for speaker and channel factors. Instead, a single
subspace is assumed to account for all the variability in $M_u$, and a single latent
variable is used to represent the source of all variability. This was motivated by
an experimental observation that the channel factors estimated using Joint Factor
Analysis were also found to contain some speaker specific information [10].
After the parameters of the Total Variability Model have been estimated, an
estimate of the latent variable corresponding to an utterance is used as a fixed-
dimensional representation of the generative distribution underlying the utterance.
Compensation methods are subsequently applied on this representation in order
to isolate the variability from the desired source, and it is ultimately used as the
input for a classifier that is discriminatively trained.
Describing more formally, the mean supervector $M_u$ is assumed to have the
following form:
\[
M_u = M_0 + T w_u \tag{2.8}
\]
where $T \in \mathbb{R}^{CD \times K}$ is known as the Total Variability matrix, and $w_u \in \mathbb{R}^{K}$ is
known as the ivector corresponding to the utterance $u$. The prior distribution for
$w_u$ is assumed to be standard normal:
\[
w_u \sim \mathcal{N}(0, I) \tag{2.9}
\]
The component means $\mu_{uc}$ can then be described as
\[
\mu_{uc} = \mu_{0c} + T_c w_u \tag{2.10}
\]
where $\{T_c\}_{c=1}^{C} \in \mathbb{R}^{D \times K}$ are sub-matrices of the Total Variability matrix $T$ such
that $T = \begin{bmatrix} T_1^{T} & \ldots & T_C^{T} \end{bmatrix}^{T}$.
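To make the generative assumptions of Eqs. (2.8)–(2.10) explicit, the sketch below samples one utterance from the model. It assumes diagonal covariances, and all array shapes and names are illustrative rather than taken from the original text.

```python
import numpy as np

def sample_tvm_utterance(M0, T, p, variances, num_frames, seed=0):
    """Draw frames from the Total Variability Model (Eqs. 2.8-2.10).

    M0: (C, D) UBM means mu_0c; T: (C, D, K) sub-matrices T_c;
    p: (C,) component weights; variances: (C, D) diagonal covariances.
    """
    rng = np.random.default_rng(seed)
    C, D, K = T.shape
    w_u = rng.standard_normal(K)                   # ivector prior: w_u ~ N(0, I)
    mu_u = M0 + T @ w_u                            # component means mu_uc = mu_0c + T_c w_u
    comps = rng.choice(C, size=num_frames, p=p)    # component index c_ut for each frame
    noise = rng.standard_normal((num_frames, D)) * np.sqrt(variances[comps])
    return mu_u[comps] + noise, w_u                # feature vectors x_ut and the ivector
```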
2.4 Parameter estimation and inference for the
Total Variability Model
In this section, we describe the algorithms that have been used conventionally for
parameter estimation and inference for the Total Variability Model.
2.4.1 Parameter estimation
The problem of estimating the model parameters $\Theta = \left\{ \{p_c, \mu_{0c}, \Sigma_c\}_{c=1}^{C},\, T \right\}$ is
usually set up as a Maximum Likelihood problem:
\[
\Theta^{*} = \arg\max_{\Theta} \log f(X \mid \Theta)
\]
where $f(X \mid \Theta)$ denotes the likelihood of the observed feature vector collection
$X$ given the model parameters $\Theta$. It is common to adopt the parameters
$\{p_c, \mu_{0c}, \Sigma_c\}_{c=1}^{C}$ from a UBM which is trained on all feature vectors $X$. The Total
Variability matrix $T$ is then usually estimated using the Expectation Maximization
(EM) algorithm where the ivectors $w_u$ are treated as hidden variables.
Joint Log Likelihood Function
For simplifying the estimation procedure, it is assumed that the component indices
$C_u = \{c_{ut}\}_{t=1}^{T_u}$ associated$^1$ with the feature vectors $X_u = \{x_{ut}\}_{t=1}^{T_u}$ are known
in a probabilistic sense. For this purpose, the component posterior probabilities
$\gamma_{utc} = p\left( c_{ut} = c \mid x_{ut}, \{p_c, \mu_{0c}, \Sigma_c\}_{c=1}^{C} \right)$ are obtained using the UBM parameters.
Then, the expected value of the joint log likelihood of $X_u, w_u$ given the component
associations $C_u$ and the model parameters $\Theta$ (where the expectation is over $C_u$
sampled according to the UBM posteriors $\Gamma_u = \{\gamma_{utc}\}_{t=1}^{T_u}$) is given as:
\[
\mathbb{E}_{C_u \sim \Gamma_u} \left[ \log f(X_u, w_u \mid C_u, \Theta) \right]
= \mathbb{E}_{C_u \sim \Gamma_u} \left[ \sum_{t=1}^{T_u} \log \mathcal{N}\left( x_{ut} \mid \mu_{0c_{ut}}, \Sigma_{c_{ut}} \right) \right]
+ \frac{1}{2} \sum_{c=1}^{C} N_{uc} F_{u0c}^{T} \Sigma_c^{-1} T_c w_u
- \frac{1}{2} w_u^{T} \left( I + \sum_{c=1}^{C} N_{uc} T_c^{T} \Sigma_c^{-1} T_c \right) w_u \tag{2.11}
\]
where $\mathcal{N}(x \mid \mu, \Sigma)$ represents the likelihood of an observation $x$ with respect to a
Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, and $\{N_{uc}, F_{u0c}\}_{c=1}^{C}$
refer to statistics that are obtained from feature vectors $x_{ut}$ and the component
posterior probabilities $\gamma_{utc}$ as below:
\[
N_{uc} = \sum_{t=1}^{T_u} \gamma_{utc} \qquad
F_{uc} = \frac{1}{N_{uc}} \sum_{t=1}^{T_u} \gamma_{utc}\, x_{ut} \qquad
F_{u0c} = F_{uc} - \mu_{c} \tag{2.12}
\]
$^1$ Since a GMM is a convex mixture of Gaussian distributions, its generative process can be
viewed as first sampling a component $c$ with probability $p_c$ and then generating $x_{ut}$ from the
selected Gaussian component. In that context, the phrase “component index associated with a
feature vector” refers to the Gaussian component index that was selected in order to generate
the feature vector.
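The statistics in Eq. (2.12) are straightforward to compute from the frames and the UBM posteriors. The following minimal sketch assumes a diagonal-covariance UBM and in-memory data; all names are illustrative.

```python
import numpy as np

def baum_welch_stats(X_u, ubm_weights, ubm_means, ubm_vars):
    """Zeroth- and first-order statistics of Eq. (2.12) for one utterance.

    X_u: (T_u, D) feature vectors; ubm_weights: (C,); ubm_means: (C, D); ubm_vars: (C, D).
    Returns N_u (C,), F_u (C, D) and the centered F_u0 (C, D).
    """
    T, D = X_u.shape
    C = len(ubm_weights)
    # Frame-level component log posteriors gamma_utc under the UBM.
    log_lik = np.empty((T, C))
    for c in range(C):
        diff = X_u - ubm_means[c]
        log_lik[:, c] = (np.log(ubm_weights[c])
                         - 0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(ubm_vars[c]))
                                  + np.sum(diff * diff / ubm_vars[c], axis=1)))
    log_post = log_lik - np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
    gamma = np.exp(log_post)                      # (T, C)
    N_u = gamma.sum(axis=0)                       # N_uc = sum_t gamma_utc
    F_u = (gamma.T @ X_u) / N_u[:, None]          # F_uc = (1/N_uc) sum_t gamma_utc x_ut
    F_u0 = F_u - ubm_means                        # centered by the UBM component means
    return N_u, F_u, F_u0
```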
Expectation Maximization
For updating the model parameters using the EM algorithm, the ivectors $\{w_u\}_{u=1}^{U}$
are treated as hidden variables. For every iteration $i$, first the posterior distribution
of $w_u$ given the current model parameters $\Theta^{(i)}$ is evaluated as below:
\[
w_u \mid \Theta^{(i)} \sim \mathcal{N}\left( \mu_{w_u}, \Sigma_{w_u} \right) \tag{2.13}
\]
where the mean and covariance of the posterior distribution are given as below:
\[
\begin{aligned}
\mu_{w_u} &= \left( I + \sum_{c=1}^{C} N_{uc} {T_c^{(i)}}^{T} \Sigma_c^{-1} T_c^{(i)} \right)^{-1} \left( \sum_{c=1}^{C} N_{uc} {T_c^{(i)}}^{T} \Sigma_c^{-1} F_{u0c} \right) \\
\Sigma_{w_u} &= \left( I + \sum_{c=1}^{C} N_{uc} {T_c^{(i)}}^{T} \Sigma_c^{-1} T_c^{(i)} \right)^{-1}
\end{aligned} \tag{2.14}
\]
Then, the model parameters $\Theta^{(i+1)}$ for the next iteration are obtained by maximizing
the following expectation over the joint log likelihood function:
\[
\Theta^{(i+1)} = \arg\max_{\Theta} \sum_{u=1}^{U} \mathbb{E}_{w_u \mid \Theta^{(i)}} \left[ \mathbb{E}_{C_u \sim \Gamma_u} \left[ \log f(X_u, w_u \mid C_u, \Theta) \right] \right] \tag{2.15}
\]
The value of sub-matrices $T_c^{(i+1)}$ that maximize the expected joint log likelihood
in (2.15) is given as below:
\[
T_c^{(i+1)} = \left( \sum_{u=1}^{U} N_{uc} F_{u0c}\, \mu_{w_u}^{T} \right) \left( \sum_{u=1}^{U} N_{uc} \left( \mu_{w_u} \mu_{w_u}^{T} + \Sigma_{w_u} \right) \right)^{-1} \tag{2.16}
\]
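A single EM iteration therefore amounts to computing the posterior moments of Eq. (2.14) for every utterance and accumulating the sums of Eq. (2.16). The sketch below assumes diagonal covariances and statistics held in memory; it is illustrative rather than an optimized implementation, and all names are assumptions.

```python
import numpy as np

def em_update_T(T, N, F0, Sigma_inv):
    """One EM iteration for the Total Variability matrix (Eqs. 2.14-2.16).

    T: (C, D, K) current sub-matrices T_c; N: (U, C) stats N_uc;
    F0: (U, C, D) centered stats F_u0c; Sigma_inv: (C, D) inverse diagonal covariances.
    """
    C, D, K = T.shape
    U = N.shape[0]
    A = np.zeros((C, K, K))                                   # M-step accumulators
    B = np.zeros((C, D, K))
    TtSi = T.transpose(0, 2, 1) * Sigma_inv[:, None, :]       # T_c^T Sigma_c^{-1}, (C, K, D)
    TtSiT = TtSi @ T                                          # T_c^T Sigma_c^{-1} T_c, (C, K, K)
    for u in range(U):
        # E-step: posterior covariance and mean of w_u (Eq. 2.14).
        prec = np.eye(K) + np.einsum('c,ckl->kl', N[u], TtSiT)
        Sigma_w = np.linalg.inv(prec)
        mu_w = Sigma_w @ np.einsum('c,ckd,cd->k', N[u], TtSi, F0[u])
        Eww = np.outer(mu_w, mu_w) + Sigma_w
        # M-step accumulation (Eq. 2.16).
        A += N[u][:, None, None] * Eww[None, :, :]
        B += N[u][:, None, None] * F0[u][:, :, None] * mu_w[None, None, :]
    # Closed-form update for each T_c.
    return np.stack([B[c] @ np.linalg.inv(A[c]) for c in range(C)])
```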
2.4.2 Inference
Given a sequence of feature vectors $X_u = \{x_{ut}\}_{t=1}^{T_u}$ and the set of model parameters
$\Theta$, the maximum a posteriori (MAP) estimate of the ivector $w_u^{*}$ is given as:
\[
w_u^{*} = \left( I + \sum_{c=1}^{C} N_{uc} T_c^{T} \Sigma_c^{-1} T_c \right)^{-1} \left( \sum_{c=1}^{C} N_{uc} T_c^{T} \Sigma_c^{-1} F_{u0c} \right) \tag{2.17}
\]
The MAP estimate $w_u^{*}$ of the ivector can then ultimately be used as a vector
representation of the utterance for the purpose of classification or regression.
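Given the per-utterance statistics, Eq. (2.17) is a single $K$-dimensional linear solve. A minimal sketch, again assuming diagonal covariances and illustrative names:

```python
import numpy as np

def extract_ivector(T, N_u, F0_u, Sigma_inv):
    """MAP ivector estimate of Eq. (2.17).

    T: (C, D, K); N_u: (C,); F0_u: (C, D); Sigma_inv: (C, D).
    """
    C, D, K = T.shape
    TtSi = T.transpose(0, 2, 1) * Sigma_inv[:, None, :]            # T_c^T Sigma_c^{-1}, (C, K, D)
    precision = np.eye(K) + np.einsum('c,ckd,cdl->kl', N_u, TtSi, T)
    rhs = np.einsum('c,ckd,cd->k', N_u, TtSi, F0_u)
    return np.linalg.solve(precision, rhs)
```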
Chapter 3
Efficient Estimation
The EM algorithm provides one possible approach to the problem of estimating the
parameters of the Total Variability Model (TVM) under the maximum likelihood
criterion. However, it does have some drawbacks: it only converges to a local
maximum of the likelihood function, and is also computationally expensive. A
number of approaches have been proposed in literature that attempt to reduce
the computational complexity of the EM algorithm by facilitating efficient ivector
extraction under some approximations [16, 25, 31, 9]. However, all these methods
still operate under the iterative framework of the EM algorithm.
In this chapter, we investigate the possibility of maximizing the marginalized
likelihood function directly, unlike the EM algorithm. We show that the likelihood
function can be simplified by making some approximations that are justified if the
model assumptions are valid and the training utterances are long enough. Then,
we show that estimation of the parameters that maximize the (simplified) likeli-
hood function reduces to a Singular Value Decomposition (SVD) computation over
a matrix of normalized statistics. Randomized algorithms provide a computation-
ally efficient solution for estimating this SVD [17] and we show that this estimation
procedure for TVM is much more computationally efficient compared to the EM
algorithm. In addition, we also show that the same approximations used for simpli-
fying the marginalized likelihood function can also be used for efficiently extracting
ivectors. We then compare the performance obtained using the proposed estima-
tion procedure (and ivector extraction) to that using the EM algorithm on speaker
and language identification tasks.
3.1 Estimation using randomized SVD
3.1.1 Likelihood function
Just like in the case of EM updates, we start by assuming that the component
associations C are observed in a probabilistic sense, and distributed as per UBM
posteriors Γ, where:
\[
\begin{aligned}
C &= \{C_u\}_{u=1}^{U} & C_u &= \{c_{ut}\}_{t=1}^{T_u} & c_{ut} &\in \{1, \ldots, C\} \\
\Gamma &= \{\Gamma_u\}_{u=1}^{U} & \Gamma_u &= \{\gamma_{utc}\}_{t=1}^{T_u} & \gamma_{utc} &= p\left( c_{ut} = c \mid x_{ut}, \mu_{0c}, \Sigma_c \right)
\end{aligned} \tag{3.1}
\]
Hence, the problem we consider is that of maximizing the expected log likelihood
L(Θ) defined as below:
\[
\mathcal{L}(\Theta) = \mathbb{E}_{C \sim \Gamma} \left[ \log f(X \mid C, \Theta) \right] = \sum_{u=1}^{U} \mathbb{E}_{C_u \sim \Gamma_u} \left[ \log f(X_u \mid C_u, \Theta) \right] \tag{3.2}
\]
$f(X_u \mid C_u, \Theta)$ can be obtained by marginalizing over $w_u$:
\[
f(X_u \mid C_u, \Theta) = \int_{w_u} f(X_u \mid C_u, w_u, \Theta)\, \mathcal{N}(w_u \mid 0, I)\, dw_u \tag{3.3}
\]
where $\mathcal{N}(w_u \mid 0, I)$ represents the standard normal prior on $w_u$. According to the
assumption in the Total Variability Model, $f(X_u \mid C_u, w_u, \Theta)$ is given as below:
\[
f(X_u \mid C_u, w_u, \Theta) = \prod_{c=1}^{C} \prod_{t: c_{ut} = c} \mathcal{N}\left( x_{ut} \mid \mu_{0c} + T_c w_u, \Sigma_c \right) \tag{3.4}
\]
By separating the terms involving $T$ from the rest of the expression, it is possible to
factorize $f(X_u \mid C_u, w_u, \Theta)$ as follows:
\[
f(X_u \mid C_u, w_u, \Theta) = g(X_u, \Theta)\, h(X_u, \Theta) \tag{3.5}
\]
where $g(X_u, \Theta)$ and $h(X_u, \Theta)$ are given as below:
\[
\begin{aligned}
g(X_u, \Theta) &= \prod_{c=1}^{C} \prod_{t: c_{ut} = c} \mathcal{N}\left( x_{ut} \mid \mu_{0c}, \Sigma_c \right) \\
h(X_u, \Theta) &= \exp\left[ \sum_{c=1}^{C} \sum_{t: c_{ut} = c} \left( w_u^{T} T_c^{T} \Sigma_c^{-1} (x_{ut} - \mu_{0c}) - \frac{1}{2} w_u^{T} T_c^{T} \Sigma_c^{-1} T_c w_u \right) \right]
\end{aligned} \tag{3.6}
\]
Substituting this expression for $f(X_u \mid C_u, w_u, \Theta)$ in (3.3), and eventually substitut-
ing the obtained expression back in (3.2), we get $\mathcal{L}(\Theta) = \mathcal{L}_1(\Theta) + \mathcal{L}_2(\Theta) + \mathcal{L}_3(\Theta)$,
where:
\[
\begin{aligned}
\mathcal{L}_1(\Theta) &= \sum_{u=1}^{U} \mathbb{E}_{C_u \sim \Gamma_u} \left[ \log g(X_u, \Theta) \right] \\
\mathcal{L}_2(\Theta) &= \frac{1}{2} \sum_{u=1}^{U} F_{u0}^{T} \Sigma^{-1} N_u T \left( I + T^{T} \Sigma^{-1} N_u T \right)^{-1} T^{T} \Sigma^{-1} N_u F_{u0} \\
\mathcal{L}_3(\Theta) &= -\frac{1}{2} \sum_{u=1}^{U} \log \left| I + T^{T} \Sigma^{-1} N_u T \right|
\end{aligned} \tag{3.7}
\]
where $F_{u0} \in \mathbb{R}^{CD}$ is a supervector formed by stacking vectors $F_{u0c} \in \mathbb{R}^{D}$ (as given
by equation (5.1)), and $\Sigma, N_u \in \mathbb{R}^{CD \times CD}$ are block-diagonal matrices consisting
of $C$ blocks of dimension $D \times D$, with the $c^{\text{th}}$ block being $\Sigma_c$ and $N_{uc} I$ respectively:
\[
F_{u0} = \begin{bmatrix} F_{u01}^{T} & \ldots & F_{u0C}^{T} \end{bmatrix}^{T}
\qquad
N_u = \begin{bmatrix} N_{u1} I & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & N_{uC} I \end{bmatrix}
\qquad
\Sigma = \begin{bmatrix} \Sigma_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \Sigma_C \end{bmatrix} \tag{3.8}
\]
3.1.2 Maximization using approximation
Since $\mathcal{L}_1(\Theta)$ is constant with respect to $T$, we have $T^{*} = \arg\max_{T} J(T)$, where:
\[
J(T) = \mathcal{L}_2(\Theta) + \mathcal{L}_3(\Theta) \tag{3.9}
\]
Approximation
Unfortunately, there is no straightforward solution for estimating $T^{*}$ that maxi-
mizes the non-convex function $J(T)$ in a computationally efficient manner for a
general case. However, by making some approximations, it is possible to reduce
it to a problem that can be solved efficiently. Let the Cholesky decomposition of
$\Sigma^{-1}$ be $L L^{T}$. Let $\widetilde{F}_{u0}, \widetilde{T}_u$ be defined as:
\[
\widetilde{F}_{u0} = N_u^{\frac{1}{2}} L^{T} F_{u0} \qquad \widetilde{T}_u = N_u^{\frac{1}{2}} L^{T} T \tag{3.10}
\]
At this point, we can approximate $N_{uc} \approx T_u\, p_c$. If the assumptions made within
TVM are correct, then $\frac{N_{uc}}{T_u} \to p_c$ as $T_u \to \infty$ by the Law of Large Numbers.
Therefore, this would be a reasonable approximation for large $T_u$, and we get:
\[
\widetilde{T}_u \approx \sqrt{T_u}\, \widetilde{T} \qquad \widetilde{T} = P^{\frac{1}{2}} L^{T} T \quad \text{and} \quad \frac{1}{T_u} I \approx \frac{1}{T} I \tag{3.11}
\]
where $P \in \mathbb{R}^{CD \times CD}$ is a block diagonal matrix with the $c^{\text{th}}$ block given by $p_c I$, and
T is the average utterance length in the training set. By using the invariance of
matrix trace under cyclic permutations, we can combine (3.7), (3.10) and (3.11)
to get:
\[
\begin{aligned}
\mathcal{L}_2(\Theta) &= \frac{1}{2} \operatorname{Tr}\left[ \left( \sum_{u=1}^{U} \widetilde{F}_{u0} \widetilde{F}_{u0}^{T} \right) \widetilde{T} \left( \frac{1}{T} I + \widetilde{T}^{T} \widetilde{T} \right)^{-1} \widetilde{T}^{T} \right] \\
\mathcal{L}_3(\Theta) &= -\frac{1}{2} \sum_{u=1}^{U} \log \left| I + T_u \widetilde{T}^{T} \widetilde{T} \right|
\end{aligned} \tag{3.12}
\]
where Tr[·] refers to matrix trace.
Maximization
Let $\widetilde{F}$ denote a matrix containing $\widetilde{F}_u$ as columns. Then, we have:
\[
\left( \sum_{u=1}^{U} \widetilde{F}_u \widetilde{F}_u^{T} \right) = \widetilde{F} \widetilde{F}^{T} \tag{3.13}
\]
Let the Singular Value Decomposition (SVD) for $\widetilde{F}, \widetilde{T}$ be:
\[
\widetilde{F} = U_F D_F V_F^{T} \qquad \widetilde{T} = U_T D_T V_T^{T} \tag{3.14}
\]
where $D_F \in \mathbb{R}^{CD \times U}$ and $D_T \in \mathbb{R}^{CD \times K}$ are diagonal with sorted entries: $d_{F_1} \geq
\ldots \geq d_{F_{CD}}$, $d_{T_1} \geq \ldots \geq d_{T_K}$, and $U_F, U_T, V_F, V_T$ are orthonormal. To find $T^{*}$, we need
to find $U_T^{*}, D_T^{*}, V_T^{*}$ that maximize the likelihood. Define:
\[
S_F = \widetilde{F} \widetilde{F}^{T} \qquad \widetilde{D}_T = D_T \left( \frac{1}{T} I + D_T^{T} D_T \right)^{-1} D_T^{T} \tag{3.15}
\]
Then, $\mathcal{L}_2(\Theta)$ and $\mathcal{L}_3(\Theta)$ can be expressed as:
\[
\mathcal{L}_2(\Theta) = \frac{1}{2} \operatorname{Tr}\left[ S_F U_T \widetilde{D}_T U_T^{T} \right] \tag{3.16}
\]
\[
\mathcal{L}_3(\Theta) = -\frac{1}{2} \sum_{u=1}^{U} \log \left| I + T_u D_T^{T} D_T \right| \tag{3.17}
\]
Neither $\mathcal{L}_2(\Theta)$ nor $\mathcal{L}_3(\Theta)$ depends on $V_T$, so without loss of generality, we can
choose $V_T^{*} = I$. $\mathcal{L}_3(\Theta)$ does not depend on $U_T$, so we can obtain $U_T^{*}$ by maximizing
$\mathcal{L}_2(\Theta)$. To that end, we use the following result:
Theorem ([7, Theorem 4.1]). Let $A, B$ be $n \times n$ Hermitian matrices, with
eigenvalues $\alpha_i, \beta_i$ respectively, both similarly ordered: $\alpha_1 \geq \cdots \geq \alpha_n$, $\beta_1 \geq \cdots \geq \beta_n$. Then:
\[
\max_{U \text{ unitary}} \operatorname{Tr}\left[ A U^{T} B U \right] = \sum_{i=1}^{n} \alpha_i \beta_i
\]
Substituting $A \leftarrow S_F$, $U \leftarrow U_T^{T}$, $B \leftarrow \widetilde{D}_T$ in the statement of the theorem, we get:
\[
\max_{U_T \text{ unitary}} \mathcal{L}_2(\Theta) = \frac{1}{2} \operatorname{Tr}\left[ D_F D_F^{T} \widetilde{D}_T \right] \tag{3.18}
\]
The maximum value can be achieved by setting $U_T^{*} = U_F$. Substituting it back
in $J(T)$, we get the following expression in terms of $d_{T_k}$:
\[
J(T) = \frac{1}{2} \sum_{k=1}^{K} \left[ \frac{T d_{F_k}^{2} d_{T_k}^{2}}{1 + T d_{T_k}^{2}} - \sum_{u=1}^{U} \log\left( 1 + T_u d_{T_k}^{2} \right) \right] \tag{3.19}
\]
Taking the derivative with respect to $d_{T_k}$, then making approximations $(1 +
T d_{T_k}^{2})^{2} \approx T^{2} d_{T_k}^{4} + 2 T d_{T_k}^{2}$ and $1 + T_u d_{T_k}^{2} \approx T_u d_{T_k}^{2}$, and setting to zero, we get:
\[
d_{T_k}^{*} \approx
\begin{cases}
\sqrt{\dfrac{d_{F_k}^{2}}{U T} - \dfrac{2}{T}} & \text{if } d_{F_k}^{2} \geq 2U \\
0 & \text{otherwise}
\end{cases} \tag{3.20}
\]
To summarize, we get:
\[
\widetilde{T}^{*} = U_T^{*} D_T^{*} \tag{3.21}
\]
where $U_T^{*} = U_F$, and the entries of $D_T^{*}$ are given by (3.20). $\widetilde{T}^{*}$ can then be
unnormalized to get $T^{*}$ as below:
\[
T^{*} = L^{-1} P^{-\frac{1}{2}} \widetilde{T}^{*} \tag{3.22}
\]
Summary of randomized SVD estimation
Overall, the process of estimation using randomized SVD (RSVD) has been sum-
marized in the procedure TVM-RSVD-Estimation below.
Procedure TVM-RSVD-Estimation
Input: Feature vectors $x_{ut}$, posteriors $\gamma_{ut}$, covariance matrices $\Sigma_c$
Output: Total Variability Matrix $T$
1: for $u = 1$ to $U$ do
2:     Collect statistics $N_u, F_u$                          ▷ Equation (5.1)
3:     Normalize the statistics $F_u$ to get $\widetilde{F}_u$   ▷ Equation (3.10)
4: Get SVD of $\widetilde{F}$ using randomized algorithms        ▷ Section 3.2.1
5: Get the Total Variability Matrix $T$                          ▷ Equation (3.22)
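Under the stated assumptions, the whole procedure reduces to building the normalized statistics matrix, taking a (randomized) SVD, and rescaling its leading left singular vectors. The following sketch mirrors the steps above for in-memory statistics and a diagonal-covariance UBM; in a real run the SVD in step 4 would be computed with a randomized solver, and all names are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def estimate_T_rsvd(F0, N, Sigma_inv, p, ivector_dim, T_bar):
    """Sketch of TVM-RSVD-Estimation.

    F0: (U, C, D) centered stats F_u0c; N: (U, C) counts N_uc; Sigma_inv: (C, D);
    p: (C,) UBM weights; T_bar: average utterance length in the training set.
    """
    U, C, D = F0.shape
    L = np.sqrt(Sigma_inv)                                # Cholesky of a diagonal Sigma^{-1}
    # Normalized statistics F_tilde_u0 = N_u^{1/2} L^T F_u0, stacked as columns (Eq. 3.10).
    F_tilde = (np.sqrt(N)[:, :, None] * L[None] * F0).reshape(U, C * D).T   # (CD, U)
    # SVD of F_tilde (randomized in practice); keep the top ivector_dim directions.
    U_F, d_F, _ = np.linalg.svd(F_tilde, full_matrices=False)
    U_F, d_F = U_F[:, :ivector_dim], d_F[:ivector_dim]
    # Singular values of T_tilde from Eq. (3.20): zero whenever d_F^2 < 2U.
    d_T = np.sqrt(np.maximum(d_F**2 / (U * T_bar) - 2.0 / T_bar, 0.0))
    T_tilde = U_F * d_T                                   # U_T^* D_T^* (Eq. 3.21)
    # Unnormalize: T = L^{-1} P^{-1/2} T_tilde (Eq. 3.22), all diagonal here.
    P_inv_sqrt = 1.0 / np.sqrt(np.repeat(p, D))           # (CD,)
    L_inv = 1.0 / L.reshape(C * D)
    T_flat = (L_inv * P_inv_sqrt)[:, None] * T_tilde      # (CD, K)
    return T_flat.reshape(C, D, ivector_dim)
```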
3.2 Advantages of randomized SVD estimation
3.2.1 Computational complexity of parameter estimation
Although we have shown that we can obtain the Total Variability Matrix by com-
puting the SVD of the matrix $\widetilde{F}$, it is not immediately clear that this would be a
better choice than EM computationally, especially because the matrix $\widetilde{F}$ is of
dimension $CD \times U$, which is typically fairly large.
ized algorithms available that can solve this problem very efficiently. A summary
of various algorithms and the associated bounds can be found in [17]. To illustrate
the efficiency of these algorithms, we highlight a few key points:
• Although the outcome is probabilistic, the probability of failure is a user-specified
parameter, and can be rendered negligible (say $10^{-15}$), with a nominal
impact on the computational resources required
• Most of the computationally intensive steps in these algorithms are paralleliz-
able, allowing the advantage of exploiting a large number of parallel nodes,
if available
• The computations do not require loading the entire matrix to memory, and
can be modified to require only a single pass over the matrix stored on a disk
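For reference, one off-the-shelf randomized SVD that exposes these knobs is scikit-learn's randomized_svd; the snippet below is a generic illustration on a stand-in matrix, not the exact configuration used in this work.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd  # one available randomized SVD implementation

# Illustrative stand-in for the CD x U matrix of normalized statistics F_tilde.
F_tilde = np.random.default_rng(0).standard_normal((6000, 2000))
# Top-K singular triplets via a randomized algorithm; the failure probability is
# controlled by the oversampling and power-iteration parameters.
U_F, d_F, Vt_F = randomized_svd(F_tilde, n_components=400,
                                n_oversamples=10, n_iter=4, random_state=0)
```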
3.2.2 Computational complexity of ivector extraction
The MAP estimate of the ivector given model parameters $\Theta$ and utterance statistics
$F_u, N_u$ is given in equation (5.4). However, by making similar approximations to
those made in section 3.1.2, we can simplify (5.4) to:
\[
w_u^{*} = \frac{1}{\sqrt{T_u}} \left( \frac{1}{T_u} I + \widetilde{T}^{T} \widetilde{T} \right)^{-1} \widetilde{T}^{T} \widetilde{F}_u \tag{3.23}
\]
Since $\widetilde{T}^{T} \widetilde{T} = D_T^{T} D_T$ is diagonal, matrix inversion is greatly simplified and enables
much faster ivector extraction.
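Because the system matrix in Eq. (3.23) is diagonal, extraction becomes a projection followed by an elementwise division. A minimal sketch under the same diagonal-covariance assumption, with illustrative names:

```python
import numpy as np

def extract_ivector_approx(U_T, d_T, F0_u, N_u, Sigma_inv, T_u):
    """Approximate ivector of Eq. (3.23).

    U_T: (C*D, K) left singular vectors of F_tilde; d_T: (K,) entries of D_T;
    F0_u: (C, D) centered stats; N_u: (C,) counts; Sigma_inv: (C, D); T_u: frame count.
    """
    C, D = F0_u.shape
    L = np.sqrt(Sigma_inv)                                  # diagonal L with Sigma^{-1} = L L^T
    # Normalized statistics F_tilde_u = N_u^{1/2} L^T F_u0, flattened to a CD vector.
    F_tilde_u = (np.sqrt(N_u)[:, None] * L * F0_u).reshape(C * D)
    # T_tilde^T F_tilde_u with T_tilde = U_T diag(d_T).
    proj = d_T * (U_T.T @ F_tilde_u)
    # (1/T_u I + D_T^T D_T)^{-1} is an elementwise division.
    return proj / (1.0 / T_u + d_T ** 2) / np.sqrt(T_u)
```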
3.2.3 Interpretability of ivectors
Consider for example the singular values of the matrix $\widetilde{F}$ obtained for the Wall
Street Journal (WSJ) si284 corpus, shown in Figure 3.1. It is apparent that most
of the variability in the matrix is explained by the first few subspace dimensions.
Figure 3.1: Singular Values of $\widetilde{F}$ (axes: $k$ vs. $d_F(k)$)
For further illustration, the first two dimensions of the extracted ivectors for the
WSJ test set eval92 are shown in Figure 3.2. Different speakers are shown by
markers of different color, male and female speakers are shown by circular and
triangular markers respectively. It is clear that just the first two ivector dimensions
already capture quite a lot of the speaker variability, and yield fairly congregated
clusters.
Figure 3.2: First two ivector dimensions for WSJ eval92 data (axes: ivector dimension 1 vs. ivector dimension 2)
Moreover, observing the speaker gender reveals yet another interesting fact:
with the exception of a few utterances from one speaker, the value of the first
ivector dimension is negative for male utterances and positive for female utterances,
indicating that it is implicitly encoding gender information. In other words, the
model has effectively identified gender as the most important factor of acoustic
variability in the data, even though no labels are provided during training. By
similarly exploring correlations between speaker metadata and the first few ivector
dimensions, it might become possible to assign them interpretable notions, while
also understanding the relative impact of different factors on acoustic variability.
3.3 Experiments
In order to evaluate the performance of the randomized SVD algorithm as well
as the approximate ivector estimation, we performed experiments on speaker and
language identification tasks.
3.3.1 Description of databases and experimental setup
Speaker Identification
For the Speaker Identification (SID) task, we conducted experiments on the NIST
SRE 2008 database, for the short2-short3 evaluation condition.
• Database: A combination of SRE 04-06, Switchboard Cellular and Fisher
data was used as training data for estimating the UBM, the Total Vari-
ability Matrix and the Probabilistic Linear Discriminant Analysis (PLDA)
classifier. The training dataset included approximately 2866 hours of speech
(after excluding silence portions) from 48273 utterances, with an average
duration of 214 seconds. The data was split based on gender and all the
models were trained separately for each gender, with the male portion of the
dataset containing 20750 utterances and the female portion containing 27523
utterances. Experiments were evaluated on 39433 trials with 1270 enrollment
utterances and 2528 test utterances for male data, and on 59343 trials with
1993 enrollment utterances and 3849 test utterances for female data.
• Experimental Setup: For each frame, we obtained 20-dimensional MFCC vec-
tors, concatenated with delta and delta-delta coefficients. A UBM of 2048
components, and TVM with ivector dimension of 400 were trained after
removing silence regions. The overall setup consisted of extracting the ivec-
tors as a front-end, followed by a Probabilistic Linear Discriminant Analysis
(PLDA) classifier.
Language Identification
We conducted experiments for Language Identification (LID) on the DARPA
Robust Automatic Transcription of Speech (RATS) database.
• Database: The RATS LID database consists of audio recordings of vary-
ing lengths that are corrupted by different noise types of varying SNRs. It
contains audio from six classes: five classes corresponding to five target lan-
guages (Arabic, Dari, Farsi, Pashto, Urdu) and one class corresponding to
10 non-target languages. Train and test data splits were chosen according
to the description in [54]. The data used for training the Total Variability
Matrix contains 800 hours of speech, from 96000 recordings (16000 from each
class) of 30s each. In addition, there are four datasets, consisting of utter-
ances with length 10s, 5s, 3s, and 1s respectively. For each of these splits,
a labeled training set of 96000 utterances is available for classifier training,
and a test set of 2000 utterances of the same length is used to evaluate the
performance.
• Experimental Setup: Similar to the SID setup, we obtained 20-dimensional
MFCC vectors for each frame as features, which were concatenated with delta
and delta-delta coefficients. A UBM of 2048 components, and an ivector
model of 400 dimensions were trained on the 30s dataset. For classification,
we used a Support Vector Machine (SVM) with a fifth order polynomial
kernel. SVMs were separately trained on datasets containing utterances of
length 10s, 5s, 3s and 1s, and evaluated on the respective test set. Because
of the noisy nature of the dataset, Within Class Covariance Normalization
(WCCN) was used for compensating unwanted sources of variability.
3.3.2 Results and analysis
On both tasks, we compared the performance of the estimate obtained using the
randomized SVD (RSVD) algorithm$^1$ to that obtained using the EM algorithm.
For all the experiments, the parameters $\{p_c, \mu_{0c}, \Sigma_c\}_{c=1}^{C}$ were obtained from the
UBM, and EM or RSVD (or both) were only used for estimating the Total Vari-
ability matrix. For the EM algorithm, we initialized the Total Variability Matrix T
by setting L
T
T to the subspace obtained using using Principal Component Anal-
ysis (PCA) over normalized statistics L
T
F
u0
. For the RSVD estimate, we also
compared the results obtained using the MAP ivector expression in (5.4) to those
obtained using the approximate ivector expression in (5.6). All the estimation
algorithms were parallelized on 16 processors, and we analyzed the performance in
terms of the amount of time taken for estimation, the Equal Error Rate (EER),
and the log likelihood value for the obtained estimate.
Comparing computational complexity
The computational complexity of the different methods for parameter estimation and ivector extraction, as well as the time taken for these operations on the different datasets, is shown in Table 3.1. The computation time shown does not include the time required for obtaining the statistics $F_{u0}$ and $N_u$. For all experiments, we found that 5 iterations of EM were sufficient to achieve convergence. It can be seen that the time required for estimation using RSVD is significantly smaller than the time required for estimation using the EM algorithm. In fact, the complexity of RSVD estimation is the same as that of the PCA initialization used for the EM algorithm (since PCA can also be computed using the randomized SVD algorithm [17]). For RSVD, the time taken for ivector extraction using the approximate expression in (5.6) is also much smaller than that required for extraction of the MAP estimate given in (5.4). In addition, the memory complexity of RSVD estimation as well as approximate ivector extraction is smaller than that of the EM algorithm and the exact MAP estimate respectively: $O(KCD)$ for RSVD and approximate extraction as opposed to $O(K^2 C + KCD)$ for EM and the MAP estimate.
Table 3.1: Algorithmic complexity and computation time for parameter estimation and ivector extraction

Parameter Estimation
Algorithm        Complexity                            Computation Time
                                                       SID(M)   SID(F)   LID
EM (Sec 2.4)     $O((K^2 C + K^3 + KCD) U N_I)$        79m      103m     8.3h
RSVD (Sec 3.1)   $O((KCD + K^2) U)$                    181s     243s     24m

Ivector Extraction
Algorithm          Complexity                  Computation Time
MAP (Eq (5.4))     $O(K^2 C + K^3 + KCD)$      0.23s
Approx (Eq (5.6))  $O(KCD)$                    0.04s

Notes: $U$ denotes the number of utterances and $N_I$ the number of EM iterations. The RSVD complexity assumes that the RSVD implementation is based on the proto-algorithm in [17], and that SVD computation for an $m \times n$ matrix with $m \leq n$ has complexity $O(m^2 n)$.
Comparing Equal Error Rates
The performance in terms of Equal Error Rate (EER) on the speaker and language identification tasks is shown in Tables 3.2 and 3.3 respectively. On both tasks, there seems to be a degradation in performance when the RSVD estimate is used with MAP ivector extraction. However, there is a huge contrast between the two tasks in terms of the results obtained using the approximate ivector expression. In the case of the speaker identification task, the approximation leads to poor performance, with EER values almost twice those obtained by models trained using EM. On the other hand, in the case of the language identification task, it provides the best performance, leading to a significant improvement over using the MAP estimate. More surprisingly, the improvement is higher on shorter utterances, where the approximations being made are the most likely to be violated. It is clear that some further analysis is required to explain these results.
Table 3.2: Equal Error Rate (%) on SRE 2008 data

Parameter estimation   ivector extraction   Male   Female
EM                     MAP                  3.62   4.61
RSVD                   MAP                  3.94   5.81
RSVD                   Approx               7.13   8.56

Table 3.3: Equal Error Rate (%) on RATS LID data

Parameter estimation   ivector extraction   10s     5s      3s      1s
EM                     MAP                  8.75    11.05   15.35   24.15
RSVD                   MAP                  10.05   11.45   15.65   24.10
RSVD                   Approx               8.35    10.00   13.30   21.70
Further analysis
For a more detailed analysis of these results, we tried to investigate a couple of
questions:
• How do the two methods compare in terms of the log likelihood of the
obtained estimate? In other words, are worse EERs explained by a sub-
optimal likelihood of the corresponding parameter estimate, or is the likeli-
hood function itself not a good proxy for EER?
• Is overfitting affecting any of these results?
In addition, we also carried out experiments where both the estimation methods
are used together, by using the RSVD estimate as an initialization for the EM
algorithm. We compared it with two other popular initialization schemes: using a
random Gaussian matrix and using Principal Component Analysis (PCA).
The obtained results in terms of log likelihood are shown in Figures 3.3 and 3.4, and the corresponding EERs in Figures 3.5 and 3.6. For all the plots, the data point corresponding to iteration 0
represents the likelihood or EER corresponding to the initial estimate without
using any EM iterations. For the EER plots, the dotted line represents the results
obtained using the RSVD estimation (without EM) and the approximate ivector
expression in (5.6).
In terms of the likelihood function, the initial estimate obtained using RSVD
achieves a higher value of likelihood compared to that for random or PCA initial-
ization on all tasks. In fact, on all experiments, the initial RSVD estimate achieves
a higher likelihood value compared to that for the estimate obtained at the end of
5 iterations of EM algorithm with random initialization. However, this is not the
case when we compare it to the estimate obtained after 5 iterations of EM initial-
ized using PCA. This suggests that even though the RSVD algorithm attempts to
maximize the likelihood function directly, the approximations being made within
the likelihood function introduce enough degradation in the estimate to rule it
out as an effective way of achieving the global maximum. When it is used as
an initialization scheme, despite starting with a significantly higher likelihood value in all cases, the difference in likelihood relative to PCA initialization disappears
after a few EM iterations. In many cases the final likelihood for EM algorithm
with PCA initialization is in fact slightly higher than that for EM algorithm with
RSVD initialization. Also, for all the experiments, the trends observed in terms
of likelihood on train and test sets are very similar, indicating that overfitting to
training data is not a serious issue for any of the estimates.
Figure 3.3: Negative log likelihood per frame for different models on SRE 2008 (panels: (a) Male, train; (b) Male, test; (c) Female, train; (d) Female, test)
Figure 3.4: Negative log likelihood per frame for different models on RATS LID (panels: (a) Train, 30 sec; (b) Test, 10 sec)
Figure 3.5: EER (%) for different models on SRE 2008 (panels: (a) Male; (b) Female)
Figure 3.6: EER (%) for different models on RATS LID (panels: (a) 10 sec; (b) 5 sec; (c) 3 sec; (d) 1 sec)
The trends in terms of the EER very closely mirror those observed in terms of model likelihood. To analyze this quantitatively, we evaluated the correlation coefficients between EER on the test set (which is the primary evaluation metric) and likelihoods on the training set (which is the function being optimized) for different parameter estimates (corresponding to different initializations and EM iterations). For each condition, the correlation is calculated over 17 values: 3 types of initialization and 6 EM iterations (0 to 5), excluding iteration 0 for random initialization because it had a very different likelihood as well as EER compared to all other estimates. The values are listed in Table 3.4, and it can be seen that they are quite high (more than 0.8 in all cases). Therefore, we can conclude that as an objective function for optimization, the model likelihood function is a good proxy for EER, and that the variation in the obtained EERs is explained to a large extent by the variation in model likelihood corresponding to the obtained estimate.
Table 3.4: Correlation coefficients between EER and negative log likelihood

SRE 2008            RATS LID
Male     Female     10s     5s      3s      1s
0.93     0.96       0.91    0.82    0.91    0.93
Overall, these results provide us with a general idea about the relative perfor-
mance of different estimation schemes under different evaluation metrics. However,
they also raise a few important questions that need to be answered by means of
further investigation:
1. The only case in which RSVD algorithm provides a significant improvement
in terms of EER is on RATS LID data, when it is used in conjunction with
the approximate ivector extraction method. However, the same method leads
to a much higher EER when used on SRE data. How can we explain the
disparity in these results?
2. Even though the RSVD algorithm attempts to maximize the log likelihood function directly, it fails to achieve the global maximum, indicating that the approximations being made to simplify the likelihood function do not hold in practice. However, the approximations are in turn based on the Law of Large Numbers and the validity of the generative distribution assumed by the model (in particular the uniformity of the weights $p_c$ across utterances). Which of these assumptions leads to the suboptimality of RSVD? In other words, is the variation in the observed $N_{uc}$ within the statistical bounds expected under the
3. Even more fundamentally, given that we are ultimately interested in accu-
racy of our predictions, is moving towards purely discriminative models and
estimation procedures a better solution? If so, can our understanding of
TVM provide any design principles that guide our choice of discriminative
architectures and corresponding loss functions?
Chapter 4
Statistical Analysis of Model
Assumptions
In order to answer the questions posed in the previous chapter, it is essential to
carry out a careful examination of the different assumptions being made within
the Total Variability Model (TVM) and investigate their validity. In doing so, we
can not only answer these questions, but also find ways to generalize the model
beyond its current scope. In this chapter, we present the results from statistical
tests carried out for analyzing the validity of the model assumptions, propose
ways to reformulate the model in a more general sense, and provide answers to the
questions posed in the previous chapter.
Model assumptions
In TVM formulation, the generative process underlying the observed feature vec-
tors is characterized by three important assumptions:
1. Distribution of feature vectors: The observed feature vectors follow a distri-
bution that has the form of a Gaussian Mixture Model (GMM).
2. Globally tied component weights and covariance matrices: The component
weights and covariance matrices for the GMM distribution do not change
across different utterances.
3. Low rank nature of variability in the mean supervector: Unlike the component weights and covariance matrices, the component means do change across utterances, and the variability in the mean supervector lies entirely along a low-dimensional subspace.
In the sections to follow, we present a detailed analysis of each of these assumptions.
4.1 Distribution of feature vectors
Since TVM is a generative model, it would seem that the assumption about the
form of distribution of observed feature vectors is a very fundamental aspect of
the model, and that changing the form of distribution would perhaps require a
complete redesign of the model. However, it is in fact possible to arrive at the
same model without assuming a particular form for the distribution of observed
feature vectors. This is because the fundamental aspects that characterize the model (the likelihood function as well as the estimation and inference procedures) are not exclusive to the generative process assumed in the conventional formulation. All these aspects of the model can be preserved while reformulating it in terms of the distribution of the observed statistics $F_{uc}$, which asymptotically (as $T_u \to \infty$) follow a Gaussian distribution regardless of the actual form of distribution followed by the feature vectors $x_{ut}$.
4.1.1 Distribution free formulation
In this section, we show that instead of formulating TVM as a model for observed
feature vectors, it can be formulated as a model for observed statistics, which
asymptotically follow a Gaussian distribution. In particular, we show that:
1. The distribution of Baum-Welch statistics used in TVM parameter estimation is asymptotically Gaussian, regardless of the feature space distribution (Section 4.1.2).
2. The likelihood function for the observed statistics under the asymptotic Gaussian approximation is similar to the likelihood function of the observed feature vectors under the GMM assumption in conventional TVM, where the difference between the two likelihood expressions is a constant that does not depend on the subspace matrix $T$, and therefore does not affect its estimation (Section 4.1.3).
3. The expression for the posterior distribution of the ivector obtained from the distribution free formulation is also the same as that in conventional TVM (Section 4.1.4).
4.1.2 Asymptotic distribution of Baum-Welch statistics
Let $\Gamma : \mathbb{R}^D \mapsto [0,\infty)^C$ be an arbitrary function, and $\gamma_{utc} = \Gamma_c(x_{ut})$, where $\Gamma_c$ denotes the $c$-th component of $\Gamma$. For example, one possible choice for $\Gamma$ could correspond to the component posterior probabilities obtained from a GMM: $\gamma_{utc} = p(c|x_{ut})$. (For all experiments in this paper, we use UBM posteriors as $\gamma_{utc}$, but the model formulation itself is not restricted to this choice.) Let $p_{uc}$, $\mu_{uc}$, $\Sigma_{uc}$ define the following expected values (with respect to the distribution of $x_{ut}$):

$$p_{uc} = \mathbb{E}[\gamma_{utc}] \qquad \mu_{uc} = \frac{1}{p_{uc}} \mathbb{E}[\gamma_{utc} x_{ut}]$$
$$\Sigma_{uc} = \frac{1}{p_{uc}} \mathbb{E}\left[(\gamma_{utc} x_{ut} - p_{uc}\mu_{uc})(\gamma_{utc} x_{ut} - p_{uc}\mu_{uc})^T\right] \qquad (4.1)$$

If $\gamma_{utc}$ were to correspond to the GMM component posterior probabilities $p(c|x_{ut})$, then $\mu_{uc}$ would denote the component GMM means.
Let the statistics $N_{uc}$, $F_{uc}$ be defined by the expression in (5.1) (but with the posteriors $\gamma_{utc}$ now given by the arbitrary function $\Gamma_c$):

$$N_{uc} = \sum_{t=1}^{T_u} \gamma_{utc} \qquad F_{uc} = \frac{1}{N_{uc}} \sum_{t=1}^{T_u} \gamma_{utc} x_{ut} \qquad (4.2)$$
We can rearrange the expression for $F_{uc}$ as follows:

$$F_{uc} = \frac{1}{N_{uc}} \sum_{t=1}^{T_u} \gamma_{utc} x_{ut} \qquad (4.3)$$
$$= \frac{1}{\sqrt{N_{uc}}} \sqrt{\frac{T_u}{N_{uc}}} \left[ \frac{1}{\sqrt{T_u}} \sum_{t=1}^{T_u} (\gamma_{utc} x_{ut} - p_{uc}\mu_{uc}) \right] + \frac{T_u p_{uc}}{N_{uc}} \mu_{uc} \qquad (4.4)$$
By the multivariate Central Limit Theorem (CLT), as $T_u \to \infty$,

$$\frac{1}{\sqrt{T_u}} \sum_{t=1}^{T_u} (\gamma_{utc} x_{ut} - p_{uc}\mu_{uc}) \xrightarrow{d} \mathcal{N}(0, p_{uc}\Sigma_c) \qquad (4.5)$$

where $\xrightarrow{d}$ denotes convergence in distribution. Similarly,

$$\sqrt{\frac{T_u}{N_{uc}}} \xrightarrow{d} \frac{1}{\sqrt{p_{uc}}}, \qquad \frac{T_u p_{uc}}{N_{uc}} \xrightarrow{d} 1 \qquad (4.6)$$

By Slutsky's theorem, it follows that:

$$\sqrt{\frac{T_u}{N_{uc}}} \frac{1}{\sqrt{T_u}} \sum_{t} (\gamma_{utc} x_{ut} - p_{uc}\mu_{uc}) \xrightarrow{d} \mathcal{N}(0, \Sigma_c) \qquad (4.7)$$
Thus, the statistics $F_{uc}$ are asymptotically normally distributed, regardless of the distribution of the feature vectors $x_{ut}$ (as long as it satisfies the conditions for the multivariate CLT to hold) and the choice of the function $\Gamma$:

$$F_{uc} \mid N_{uc} \sim \mathcal{N}(\mu_{uc}, N_{uc}^{-1}\Sigma_c) \qquad (4.8)$$
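As a concrete illustration of the quantities involved, the following NumPy sketch computes the statistics $N_{uc}$ and $F_{uc}$ of equation (4.2) for a single utterance from an arbitrary non-negative weighting function. It is only a sketch under the assumptions stated in the comments; the function and variable names are illustrative and are not taken from the released code.

```python
import numpy as np

def sufficient_stats(X, gamma):
    """Zeroth- and first-order statistics of equation (4.2).

    X     : (T_u, D) array of feature vectors for one utterance.
    gamma : (T_u, C) array of non-negative weights gamma_utc = Gamma_c(x_ut),
            e.g. UBM component posteriors (assumed precomputed).
    Returns N of shape (C,) and F of shape (C, D), where
    F[c] = (1 / N[c]) * sum_t gamma[t, c] * X[t].
    """
    N = gamma.sum(axis=0)                                  # N_uc = sum_t gamma_utc
    F = (gamma.T @ X) / np.maximum(N, 1e-10)[:, None]      # guard against near-empty components
    return N, F
```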
4.1.3 Likelihood function
Similar to the conventional TVM formulation, we assume that the covariance matrices are tied across all utterances: $\Sigma_{uc} = \Sigma_c$. In addition, if we also assume that the variability in the mean supervector is confined to a low-dimensional subspace $T$ (i.e., that equations (2.8) and (2.9) hold, with $\mu_{uc}$ now given by the expected value in equation (5.3) and $\mu_c$ representing the global mean of $\mu_{uc}$ across all utterances), then we have:

$$F_{uc} \mid N_{uc}, w_u \sim \mathcal{N}(\mu_c + T_c w_u, N_{uc}^{-1}\Sigma_c) \qquad w_u \sim \mathcal{N}(0, I) \qquad (4.9)$$
We also assume that the component-wise statistics $F_{uc}$ are conditionally independent of each other given the ivector $w_u$ and the statistics $\{N_{uc}\}_{c=1}^{C}$. Then, the total log likelihood of all observed statistics $\{F_u\}_{u=1}^{U}$ given $\{\{N_{uc}\}_{c=1}^{C}\}_{u=1}^{U}$ and $\Theta$, marginalized over the ivectors, is given as:

$$\mathcal{L}(\Theta) = \sum_{u=1}^{U} \log\left[ \mathbb{E}_{w_u}\left[ f(F_u \mid \{N_{uc}\}_{c=1}^{C}, w_u, \Theta) \right] \right] \qquad (4.10)$$

where $f$ denotes the conditional likelihood of $F_u$ given $\{N_{uc}\}_{c=1}^{C}$, $w_u$ and the parameters $\Theta$. The expression in equation (4.10) can be simplified to:

$$\mathcal{L}(\Theta) = \mathcal{L}_1(\Theta) + \mathcal{L}_2(\Theta) + \mathcal{L}_3(\Theta) \qquad (4.11)$$

where $\mathcal{L}_1(\Theta)$, $\mathcal{L}_2(\Theta)$, $\mathcal{L}_3(\Theta)$ are given as:

$$\mathcal{L}_1(\Theta) = -\frac{1}{2} \sum_{u=1}^{U} \left[ F_{u0}^T \Sigma^{-1} N_u F_{u0} + \sum_{c=1}^{C} \log\left|N_{uc}^{-1}\Sigma_c\right| \right]$$
$$\mathcal{L}_2(\Theta) = \frac{1}{2} \sum_{u=1}^{U} F_{u0}^T \Sigma^{-1} N_u T \left( I + T^T \Sigma^{-1} N_u T \right)^{-1} T^T \Sigma^{-1} N_u F_{u0}$$
$$\mathcal{L}_3(\Theta) = -\frac{1}{2} \sum_{u=1}^{U} \log\left| I + T^T \Sigma^{-1} N_u T \right| \qquad (4.12)$$
Here, $F_{u0} = F_u - M_0$, and $\Sigma, N_u \in \mathbb{R}^{CD \times CD}$ are block diagonal matrices with the $c$-th block given by $\Sigma_c$ and $N_{uc} I$ respectively (as per (3.8)).
By comparing (4.12) to the expression of the log likelihood in conventional TVM (3.7), it can be seen that the terms $\mathcal{L}_2(\Theta)$ and $\mathcal{L}_3(\Theta)$ (and therefore the objective function $J(T)$) in (4.12) are identical to the corresponding terms in (3.7). The term $\mathcal{L}_1(\Theta)$ differs between the two expressions, but it does not depend on $T$, and therefore does not affect the estimation of the Total Variability matrix.
4.1.4 Posterior distribution of the ivector
Under the model assumptions made for the distribution free formulation of TVM in Section 4.1.3, the posterior distribution of the ivector given the statistics $F_u$, $\{N_{uc}\}_{c=1}^{C}$ is given as:

$$w_u \mid F_u, \{N_{uc}\}_{c=1}^{C}, \Theta \sim \mathcal{N}(\mu_{w_u}, \Sigma_{w_u}) \qquad (4.13)$$

where the mean and covariance of the posterior distribution are given as below:

$$\mu_{w_u} = \left( I + \sum_{c=1}^{C} N_{uc} T_c^T \Sigma_c^{-1} T_c \right)^{-1} \left( \sum_{c=1}^{C} N_{uc} T_c^T \Sigma_c^{-1} F_{u0c} \right)$$
$$\Sigma_{w_u} = \left( I + \sum_{c=1}^{C} N_{uc} T_c^T \Sigma_c^{-1} T_c \right)^{-1} \qquad (4.14)$$
which is identical to the expression for posterior distribution of the ivector in the
conventional TVM as given by (2.13). Consequently, the MAP estimate of the
ivector under the distribution free formulation is also identical to that for the
conventional formulation, as given by (5.4).
In essence, the distribution free formulation preserves all the important aspects
of the model without placing any requirements on the distribution of observed
feature vectors (except those required for the Central Limit Theorem to hold).
4.1.5 Statistical test
Although the distribution free formulation is appealing, it is necessary to test
whether the asymptotic distribution in equation (4.8) holds true on the datasets
used in practice (other assumptions leading to the model formulation in (4.9) are
addressed separately later).
Normalization
While a statistical test for multivariate normality can be used for this purpose, some normalization is necessary before such a test can be used. This is because the parameters of the Gaussian distribution in (4.8) could potentially be different for each sample $F_{uc}$ due to differences in $\mu_{uc}$ and $N_{uc}$. However, as long as we have multiple observations of $F_{uc}$ coming from the same source distribution (i.e., the same $\mu_{uc}$), it is possible to transform them to yield normalized observations that follow a zero mean Gaussian distribution with the same covariance matrix.
Suppose for a component $c$, we have a set of $U_s$ ($> 1$) observed statistics $\{F_{uc}, N_{uc}\}_{u=1}^{U_s}$ coming from the same source $s$. Let $\hat{F}_{kc}$, $k \in \{1,\ldots,U_s - 1\}$, be a weighted linear combination of $F_{uc}$:

$$\hat{F}_{kc} = \sum_{u=1}^{U_s} c_{uk} \sqrt{N_{uc}}\, F_{uc} \qquad (4.15)$$

where the combination weights $c_{uk}$ satisfy the following properties:

$$\sum_{u=1}^{U_s} c_{uk} \sqrt{N_{uc}} = 0 \qquad \sum_{u=1}^{U_s} c_{uj} c_{uk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases} \qquad (4.16)$$
Then, if equation (4.8) holds true, it can be shown that the vectors $\hat{F}_{kc}$ follow a Gaussian distribution with mean zero and covariance $\Sigma_c$:

$$\hat{F}_{kc} \sim \mathcal{N}(0, \Sigma_c) \qquad (4.17)$$

Therefore, the validity of the assumption in equation (4.8) can be tested by verifying the validity of equation (4.17), which is easy to test statistically since the distribution parameters for $\hat{F}_{kc}$ do not vary across different samples.
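One way to obtain combination weights satisfying (4.16) is to complete the vector of $\sqrt{N_{uc}}$ values to an orthonormal basis and use the remaining basis vectors as the columns $c_{uk}$. The sketch below illustrates this construction in NumPy under the assumption that all $N_{uc}$ are strictly positive; the helper name is illustrative and not part of any released code.

```python
import numpy as np

def normalized_stats(F, N):
    """Normalized vectors hat{F}_kc of equation (4.15) for one component of one source.

    F : (U_s, D) statistics F_uc from U_s utterances sharing the same mu_uc.
    N : (U_s,) corresponding counts N_uc (assumed strictly positive).
    Returns a (U_s - 1, D) array that, under (4.8), should be i.i.d. N(0, Sigma_c).
    """
    v = np.sqrt(N)
    v = v / np.linalg.norm(v)
    # Complete v to an orthonormal basis of R^{U_s}; the remaining columns are
    # orthonormal and orthogonal to sqrt(N_uc), i.e. they satisfy (4.16).
    basis = np.column_stack([v, np.eye(len(N))[:, :-1]])
    Q, _ = np.linalg.qr(basis)
    C = Q[:, 1:]                                  # (U_s, U_s - 1) weights c_uk
    return C.T @ (np.sqrt(N)[:, None] * F)        # rows are hat{F}_kc
```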
For this purpose, we collected the statistics $\hat{F}_{kc}$ for each component $c$ on the SRE 2008 data, where we assumed that $\mu_{uc}$ remains constant across different utterances of the same speaker. To visualize, the histograms for the 60 individual dimensions of $\hat{F}_{kc}$ for a randomly chosen component (after normalizing them by the square root of the sample covariance) for both male and female data are shown in Figure 4.1, where a histogram of the same number of samples from actual Gaussian data is shown beside it for comparison. Very similar histograms were observed for other components as well. From the figures, the individual dimensions of $\hat{F}_{kc}$ appear to be close to Gaussian.
Figure 4.1: Histograms of individual dimensions of $\hat{F}_{kc}$ for SRE 2008 data (panels: (a) Male data; (b) Sample from Gaussian data; (c) Female data; (d) Sample from Gaussian data)
In order to visualize the joint distribution across different dimensions of $\hat{F}_{kc}$, we can project the statistics to two dimensions, since linear projections of Gaussian vectors are also Gaussian. A scatter plot of normalized $\hat{F}_{kc}$ projected to 2 dimensions after multiplication with a random projection matrix is shown in Figure 4.2. Again, for comparison, we overlay the plot with a scatter plot of the same number of samples from 2-dimensional projections of actual 60-dimensional Gaussian data, where the points in blue represent SRE 2008 data, whereas points in orange represent the samples from Gaussian data. The scatter plots from the observed data seem to largely resemble those of true Gaussian data.
Figure 4.2: Scatter plots of $\hat{F}_{kc}$ projected to 2 dimensions on SRE 2008 data (panels: (a) Male; (b) Female)
Finally, to statistically test the normality of $\hat{F}_{kc}$, we used the Doornik-Hansen Omnibus multivariate Normality test [12] (from the implementation provided in [53]). The test statistic in this case follows a $\chi^2(2D)$ distribution if the data is indeed Gaussian. Separate tests were conducted for each component $c$. In order to study the behavior of the statistic with increasing $N_{uc}$, we performed multiple tests. For a particular test, only samples for which the value of $N_{uc}$ was above a certain threshold were included, and the threshold was varied across a range for different tests. A plot of the median value of the statistic (across different components) versus the value of the threshold for $N_{uc}$ is shown in Figure 4.3. The dotted line in the figure represents the edge of the 95% confidence interval for the statistic. For small $N_{uc}$, it can be seen that the median value lies far away from the confidence interval, and the statistics are not Gaussian. Even for larger values of $N_{uc}$, the median value is outside the confidence interval, but the asymptotic nature of approaching Gaussianity can be observed: the statistic keeps getting closer to its confidence interval with increasing $N_{uc}$.
Figure 4.3: Doornik-Hansen Multivariate Normality statistics for SRE 2008 data
4.1.6 Implications
On one hand, the distribution free formulation might seem like a minor technicality, since it does not change any important aspects of the model such as the likelihood function or the MAP estimate of the ivector. However, it has many important consequences. First, it is a more accurate representation of the way the model is used in practice, since the posteriors $\gamma_{utc}$ are generally evaluated from UBM parameters rather than from the parameters of the GMM actually assumed to have generated the data. This is admissible in the case of the distribution free formulation, since $\Gamma$ is allowed to be an arbitrary function of $x_{ut}$. More importantly, however, the distribution free formulation changes the nature of the model; it is no longer generative, since we only model the properties of the statistics $F_{uc}$, without needing to assume anything about the generative process underlying the feature vectors. This fact, along with the generality of the formulation for an arbitrary $\Gamma$, can be exploited for discriminative training of the model parameters, and we will revisit this topic while discussing future work in Section 6.2.
4.2 Globally tied component weights and covariance matrices
The second major assumption that is made in TVM is that the component weights $p_c$ and the covariance matrices $\Sigma_c$ do not vary across utterances. From a model complexity perspective, it makes sense to assume the covariance matrices to be tied globally, since allowing them to change for each utterance would require introducing a very large number of model parameters. However, this is not the case for the model weights. In fact, unlike the conventional formulation, the distribution free formulation described earlier does not require global tying of the component weights. It is still important to test the validity of this assumption, since it allows us to answer the questions posed earlier regarding RSVD estimation.
4.2.1 Statistical test
Let $\bar{p}_{uc} = \frac{N_{uc}}{T_u}$ define the sample component weights. If the weights are indeed globally tied, then $\bar{p}_{uc}$ should have mean $p_c$ and variance $\frac{1}{T_u} p_c (1 - p_c)$. We can try to visualize whether this is true for RATS LID data, since all the utterances in this dataset have the same length $T_u$, which implies that the confidence interval for $\bar{p}_{uc}$ is the same for all utterances. The sample statistics $\bar{p}_{uc}$ for different utterances are plotted in Figure 4.4. The solid line at the bottom represents the mean value, while the dotted line above represents 10 standard deviations above the mean. It is clear that for many utterances the statistics lie far outside their expected range.

Figure 4.4: $\bar{p}_{uc}$ statistics for different utterances of RATS LID data

We can formalize this by means of a statistical test. If the component weights do not change across utterances, then the statistics $N_{uc}$ for all utterances should follow a multinomial distribution with parameters $(T_u, [p_1,\ldots,p_C])$. We can then construct a $\chi^2$ test statistic for a Pearson Chi-squared test:

$$\chi^2 = \sum_{c=1}^{C} \frac{(N_{uc} - T_u p_c)^2}{T_u p_c} \qquad (4.18)$$

This statistic follows a $\chi^2$ distribution with $C$ degrees of freedom. We performed
this test on both SRE 2008 as well as RATS LID data. The hypothesis of globally
tied weights was rejected (at 99% confidence) on 20715 out of 20750 utterances
for SRE male data, on 27517 out of 27523 utterances for SRE female data, and on
95982 out of 96000 utterances for RATS LID data. It is clear that the component
weights are not globally tied for any of these datasets.
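A minimal SciPy sketch of the per-utterance test described above is shown below. It follows the text in comparing the statistic of (4.18) against a chi-squared distribution with C degrees of freedom at 99% confidence; the function name and argument layout are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def tied_weight_test(N_u, p, alpha=0.01):
    """Pearson chi-squared test of (4.18) for a single utterance.

    N_u : (C,) observed zeroth-order statistics N_uc for utterance u.
    p   : (C,) globally tied component weights p_c (e.g. UBM weights).
    Returns (statistic, reject): reject=True means the hypothesis of globally
    tied weights is rejected at confidence level 1 - alpha.
    """
    T_u = N_u.sum()
    expected = T_u * p
    stat = np.sum((N_u - expected) ** 2 / expected)
    threshold = chi2.ppf(1.0 - alpha, df=len(p))   # C degrees of freedom, as in the text
    return stat, stat > threshold
```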
4.2.2 Explaining the RSVD results
The fact that the component weights are not globally tied directly answers the question about the suboptimality of the RSVD estimation procedure. Since the weights are not tied, the approximation $N_{uc} \approx T_u p_c$ made within the likelihood function for RSVD estimation is not valid. Therefore, the suboptimal value of the likelihood corresponding to the RSVD estimate is not due to statistical variation of $N_{uc}$ around $T_u p_c$, but due to differences across utterances in the underlying expected values $p_{uc}$, which are different from the global weights $p_c$. From the perspective of the conventional TVM formulation, this would require a reformulation of the model. However, the distribution free formulation stays valid even without global tying of component weights.
In addition, the non-tied nature of the component weights along with the distribution free formulation also explains the disparity in performance between the LID and SRE tasks for the approximate ivector extraction using equation (5.6). This is because RSVD estimation coupled with approximate ivector extraction can in fact be seen as modifying the TVM formulation to model a different statistic. To illustrate, let $\tilde{F}_{uc}$ be defined as below:

$$\tilde{F}_{uc} = \sqrt{\bar{p}_{uc}}\, L_c^T F_{uc} \qquad (4.19)$$

If we assume Gaussianity of $F_{uc}$ as per equation (4.8) and also that $\bar{p}_{uc} \approx p_{uc}$ (which is different from $p_c$), then we have:

$$\tilde{F}_{uc} \sim \mathcal{N}(\tilde{\mu}_{uc}, T_u^{-1} I) \qquad (4.20)$$

where $\tilde{\mu}_{uc} = \sqrt{p_{uc}}\, L_c^T \mu_{uc}$. Now, if we construct TVM by assuming instead that the supervector $\tilde{M}_u$ (constructed similarly to $M_u$, but by stacking $\tilde{\mu}_{uc}$ instead) lies along a low-dimensional subspace $\tilde{T}$, then the (exact) likelihood function for this model can be shown to be the same as the function being maximized for RSVD estimation, given by equation (3.12) (apart from the term $\mathcal{L}_1(\Theta)$, which would be constant in $\tilde{T}$). Additionally, the MAP estimate of the ivector for this model can also be shown to be the same as the ivector estimate from equation (5.6).
Effectively, instead of viewing RSVD estimation and ivector extraction from equation (5.6) as approximations to TVM defined as a model over $F_u$, we can view them as exact procedures for TVM defined over a different statistic $\tilde{F}_u$. If the component weights are globally tied, then the two models are simply related by the matrix multiplication in equation (3.22). But if they are not (which is the case for our datasets), then they are different models operating over a different set of statistics. There are some clear advantages to using the model defined over $\tilde{F}_u$: the RSVD estimation procedure provides a global maximum of the likelihood for this model in a very efficient way, and the corresponding MAP estimate of the ivector can also be efficiently extracted. However, the ultimate metric of interest is the accuracy of classification, and it seems that in this respect the behavior of the two models is task dependent. For the SID task, modeling $F_u$ leads to better accuracy, whereas for the RATS LID task, it seems that modeling $\tilde{F}_u$ is the better choice.
4.3 Low rank nature of variability in the mean
supervector
This assumption, summarized by equation (2.8), is one of the key assumptions in
TVM, both in the conventional as well as the distribution free formulation, and the
compactness of the ivector representation stems primarily from this assumption.
4.3.1 Statistical test
The validity of the assumption in equation (2.8) can be checked by verifying the
validity of equation (4.9). Since (4.9) has to hold true for both the conventional
and distribution free formulations, this verification would be independent of which
model formulation we choose to operate under.
A difficulty with verifying the result in equation (4.9) is that the actual values
of the parameters $T$, $\Sigma$ and the ivectors $w_u$ are unknown. To circumvent this
problem, we can use an estimated value of these quantities as a proxy for the true
values, and then verify if the result holds true on a held out set. In particular, if
equation (4.9) holds true, it follows that:
$$\mathbb{E}\left[ \frac{N_{uc}}{D} (F_{u0c} - T_c w_u)^T \Sigma_c^{-1} (F_{u0c} - T_c w_u) \right] = 1 \qquad (4.21)$$
A sample estimate $S$ of the expectation in (4.21) can be obtained as below:

$$S = \frac{1}{UCD} \sum_{u,c} N_{uc} (F_{u0c} - T_c w_u)^T \Sigma_c^{-1} (F_{u0c} - T_c w_u) \qquad (4.22)$$
To get a better idea about how $S$ behaves with respect to changes in $N_{uc}$, we can evaluate $S$ separately over utterances and components where $N_{uc}$ falls within a specific interval $I$. For this, we can construct a statistic $S_I$ as below:

$$S_I = \frac{1}{U_I D} \sum_{u,c: N_{uc} \in I} N_{uc} (F_{u0c} - T_c w_u)^T \Sigma_c^{-1} (F_{u0c} - T_c w_u) \qquad (4.23)$$

where $U_I$ represents the number of cases for which $N_{uc} \in I$. If the assumption in (4.9) is indeed true, then $S_I$ would be distributed as a scaled $\chi^2$ random variable with a mean value of 1 and variance $\frac{2}{U_I D}$ (which is very small).
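For reference, a NumPy sketch of the binned statistic $S_I$ in (4.23) is given below. The array shapes and names are illustrative assumptions; in particular, the point estimates $w_u$ and the interval edges are taken as inputs.

```python
import numpy as np

def binned_S(F0, N, T, Sigma_inv, w, bin_edges):
    """Binned goodness-of-fit statistic S_I of equation (4.23).

    F0        : (U, C, D) centered statistics F_u0c.
    N         : (U, C) counts N_uc.
    T         : (C, D, K) per-component blocks T_c of the Total Variability matrix.
    Sigma_inv : (C, D, D) inverse covariances Sigma_c^{-1}.
    w         : (U, K) ivector point estimates w_u.
    bin_edges : increasing edges defining the intervals I over N_uc.
    Returns one value of S_I per interval (NaN if the interval is empty).
    """
    resid = F0 - np.einsum('cdk,uk->ucd', T, w)                 # F_u0c - T_c w_u
    quad = np.einsum('ucd,cde,uce->uc', resid, Sigma_inv, resid)
    contrib = N * quad                                          # N_uc times the quadratic form
    D = F0.shape[2]
    values = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (N >= lo) & (N < hi)
        U_I = mask.sum()
        values.append(contrib[mask].sum() / (U_I * D) if U_I > 0 else np.nan)
    return np.array(values)
```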
We performed the aforementioned analysis on SRE 2008 as well as RATS LID data. In our experiments, we used the EM estimate for $T$, the UBM covariance matrices as $\Sigma_c$, and MAP estimates for the ivectors $w_u$. A plot of $S_I$ evaluated on the corresponding test set against the corresponding center of the interval $I$, denoted here by $N_I$, is shown in Figure 4.5 below.
Figure 4.5: $S_I$ statistics for SRE 2008 and RATS LID data
As can be seen, the value of $S_I$ is slightly smaller than 1 for small $N_I$, which might be due to the use of the UBM covariance matrices, which are slightly inflated compared to the actual covariance matrix of the statistics. However, for SRE data, it shows an almost linearly increasing trend with respect to $N_I$, and increases above 1 for large $N_I$. For RATS LID data, it seems to saturate near 1.5, but in terms of the standard deviation, it is still much higher than its expected value of 1. This observation suggests that the assumption in (4.9) did not hold for the observed data.
The higher value of $S_I$ for large $N_{uc}$ can be explained by a violation of the assumption in equation (2.8) about the low rank nature of the vector $M_u - M_0$.
To illustrate this, let us assume that $M_u = M_0 + T w_u + \epsilon_u$, where $T w_u$ is the projection of $M_u - M_0$ onto the subspace $T$ and the vector $\epsilon_u$ accounts for the residual error between $M_u - M_0$ and $T w_u$. In that case, it can be shown that the expected value in equation (4.21) would be equal to $1 + \frac{N_{uc}}{D} \epsilon_{uc}^T \Sigma_c^{-1} \epsilon_{uc}$, which is higher than 1 due to the presence of a non-zero $\epsilon_{uc}$.
4.3.2 Generalization
Based on our observations, we hypothesize that the assumption in TVM that all of the variability in the mean supervector lies along a low dimensional subspace does not hold in practice. However, this issue can be mitigated if we explicitly represent the effect of $\epsilon_u$ within the model as below:

$$F_{uc} \mid N_{uc}, w_u, \epsilon_{uc} \sim \mathcal{N}(\mu_{0c} + T_c w_u + \epsilon_{uc}, N_{uc}^{-1}\Sigma_c) \qquad (4.24)$$
In other words, we can hypothesize that the desired source of variability (such as speaker or language information) is captured entirely within the low-dimensional subspace, as represented by the term $T w_u$ (which can also possibly contain noise from undesired sources such as channel effects). However, we also include in the model the residual variability represented by $\epsilon_u$, which accounts for any type of variability in the mean supervector that is not captured within the subspace $T$.
We assume that the error vector $\epsilon_{uc}$ in equation (4.24) is a zero mean vector with covariance $\Sigma_{\epsilon c}$ that is independent of $w_u$. Then, by marginalizing (4.24) over $\epsilon_{uc}$, it can be shown that

$$F_{uc} \mid N_{uc}, w_u \sim \mathcal{N}(\mu_{0c} + T_c w_u, N_{uc}^{-1}\Sigma_c + \Sigma_{\epsilon c}) \qquad (4.25)$$
In order to avoid a substantial increase in the number of model parameters, we also assume that the covariance matrix $\Sigma_{\epsilon c}$ is just a scaled version of $\Sigma_c$: $\Sigma_{\epsilon c} = B_c \Sigma_c$. In that case:

$$F_{uc} \mid N_{uc}, w_u \sim \mathcal{N}(\mu_{0c} + T_c w_u, (N_{uc}^{-1} + B_c)\Sigma_c) \qquad (4.26)$$
One way to interpret equation (4.26) is to view $B_c$ as a regularizer which effectively creates an upper bound over the value of $N_{uc}$. To illustrate the point, let $N'_{uc}$ be defined as follows:

$$N'_{uc} = \frac{N_{uc}}{1 + B_c N_{uc}} \qquad (4.27)$$

Then, equation (4.26) can be expressed in terms of $N'_{uc}$ as below:

$$F_{uc} \mid N_{uc}, w_u \sim \mathcal{N}(\mu_{0c} + T_c w_u, N'^{-1}_{uc}\Sigma_c) \qquad (4.28)$$

By comparing equations (4.9) and (4.28), we can see that the role of $N_{uc}$ in the original TVM formulation has been replaced with the role of $N'_{uc}$ in the proposed formulation. Also, it is apparent from equation (4.27) that $B_c$ introduces an upper bound on the value of $N'_{uc}$: $N'_{uc} \to B_c^{-1}$ as $N_{uc} \to \infty$.
Another way to interpret the model is to consider $B_c$ as a regularizer over the covariance of $F_{uc}$ as assumed by the model. As we can see from equation (4.9), the assumption in the original TVM formulation was that the covariance of the statistics $F_{uc}$ approaches zero as $N_{uc} \to \infty$. However, in the proposed model, the presence of $B_c$ introduces a covariance floor, and ensures that the assumed covariance instead approaches $B_c \Sigma_c$ as $N_{uc} \to \infty$.
4.3.3 Parameter estimation
Parameter estimation for the proposed model can be carried out using the EM algorithm. Let $\Theta^{(i)}$ represent the estimated model parameters at the end of EM iteration $i$. For estimating $\Theta^{(i)}$, we maximize the following objective function with respect to $\Theta^{(i)}$:

$$\mathbb{E}_{w_u \mid F_u, \{N_{uc}\}_{c=1}^{C}, \Theta^{(i-1)}} \left[ \sum_{u=1}^{U} \log f(F_u \mid \{N_{uc}\}_{c=1}^{C}, w_u, \Theta^{(i)}) \right] \qquad (4.29)$$
We maximize (4.29) sequentially, first with respect to $B_c$ and then with respect to $T$, rather than maximizing it jointly with respect to both, which is difficult. However, the likelihood is still guaranteed to increase at every iteration. The necessary steps for the EM updates are described below.
Posterior distribution of $w_u$
It can be shown that the posterior distribution of the ivector given the observed statistics and the model parameters $\Theta$ is similar to that for the conventional TVM (2.13), with $N_{uc}$ replaced by $N'_{uc}$:

$$w_u \mid F_u, \{N_{uc}\}_{c=1}^{C}, \Theta^{(i)} \sim \mathcal{N}(\mu_{w_u}, \Sigma_{w_u}) \qquad (4.30)$$
where the mean and covariance of the posterior distribution are given as below:

$$\mu_{w_u} = \left( I + \sum_{c=1}^{C} N'_{uc} T_c^{(i)T} \Sigma_c^{-1} T_c^{(i)} \right)^{-1} \left( \sum_{c=1}^{C} N'_{uc} T_c^{(i)T} \Sigma_c^{-1} F_{u0c} \right)$$
$$\Sigma_{w_u} = \left( I + \sum_{c=1}^{C} N'_{uc} T_c^{(i)T} \Sigma_c^{-1} T_c^{(i)} \right)^{-1} \qquad (4.31)$$
Maximization with respect to $B_c$
For compactness, let $S_u$, $\delta_{uc}$ be defined as below:

$$S_u = \mu_{w_u} \mu_{w_u}^T + \Sigma_{w_u}$$
$$\delta_{uc} = \left( F_{u0c} - T_c \mu_{w_u} \right)^T \Sigma_c^{-1} \left( F_{u0c} - T_c \mu_{w_u} \right) + \mathrm{Tr}\left( T_c^T \Sigma_c^{-1} T_c S_u \right) \qquad (4.32)$$
where $\mathrm{Tr}(\cdot)$ refers to the matrix trace. Then, excluding all terms constant with respect to $B_c$ in (4.29), we need to minimize:

$$\sum_{u=1}^{U} \frac{N_{uc}\,\delta_{uc}}{1 + B_c N_{uc}} + D \sum_{u=1}^{U} \log(1 + B_c N_{uc}) \qquad (4.33)$$
Unfortunately, it is difficult to find a closed form expression for $B_c$ which minimizes (4.33). Therefore, we simply performed a search by evaluating (4.33) over a range of values for $B_c$ (0 to 1 with a step of 0.001) and picked the value that minimized it.
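A minimal sketch of this grid search in NumPy is shown below, assuming the per-utterance values $N_{uc}$ and $\delta_{uc}$ of (4.32) have already been computed for the component; the function name is illustrative.

```python
import numpy as np

def search_Bc(N_c, delta_c, D, grid=None):
    """Grid search for B_c that minimizes the objective in (4.33) for one component.

    N_c     : (U,) counts N_uc over the training utterances.
    delta_c : (U,) values delta_uc from equation (4.32).
    D       : feature dimensionality.
    """
    if grid is None:
        grid = np.arange(0.0, 1.0 + 1e-9, 0.001)   # 0 to 1 with a step of 0.001
    objective = [np.sum(N_c * delta_c / (1.0 + b * N_c)) + D * np.sum(np.log1p(b * N_c))
                 for b in grid]
    return grid[int(np.argmin(objective))]
```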
Maximization with respect to T
It can be shown that for a given value of $B_c$, the objective function in (4.29) is maximized with respect to $T_c$ when

$$T_c = \left( \sum_{u=1}^{U} N'_{uc} F_{u0c} \mu_{w_u}^T \right) \left( \sum_{u=1}^{U} N'_{uc} S_u \right)^{-1} \qquad (4.34)$$
Inference
For inference, it can be shown that for the proposed model, the MAP estimate of the ivector is also obtained by replacing $N_{uc}$ in the MAP estimate for the conventional TVM (equation (5.4)) with $N'_{uc}$:

$$w^*_u = \left( I + \sum_{c=1}^{C} N'_{uc} T_c^T \Sigma_c^{-1} T_c \right)^{-1} \left( \sum_{c=1}^{C} N'_{uc} T_c^T \Sigma_c^{-1} F_{u0c} \right) \qquad (4.35)$$
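The following NumPy sketch computes this MAP estimate for a single utterance; with B set to None it reduces to the conventional estimate of equation (5.4), and with a vector of $B_c$ values it implements (4.27) and (4.35). Array shapes and names are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def map_ivector(F0, N, T, Sigma_inv, B=None):
    """MAP ivector estimate: equation (5.4) if B is None, equation (4.35) otherwise.

    F0        : (C, D) centered statistics F_u0c for one utterance.
    N         : (C,) counts N_uc.
    T         : (C, D, K) per-component blocks T_c of the Total Variability matrix.
    Sigma_inv : (C, D, D) inverse covariances Sigma_c^{-1}.
    B         : optional (C,) regularizers B_c; replaces N_uc by N'_uc as in (4.27).
    """
    Neff = N if B is None else N / (1.0 + B * N)
    K = T.shape[2]
    TtSi = np.einsum('cdk,cde->cke', T, Sigma_inv)              # T_c^T Sigma_c^{-1}
    A = np.eye(K) + np.einsum('c,cke,cem->km', Neff, TtSi, T)   # posterior precision
    b = np.einsum('c,cke,ce->k', Neff, TtSi, F0)
    return np.linalg.solve(A, b)                                # posterior mean (MAP estimate)
```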
4.3.4 Results and analysis
We compared the performance of the conventional TVM with that of the proposed modification on both the SRE 2008 and RATS LID datasets. The results are summarized in Table 4.1. As can be seen, the EER for the proposed model with $B_c$ estimated using the EM algorithm was fairly close to the baseline model, with very minor degradation on SRE 2008 data and very minor improvements on RATS LID data.

Table 4.1: Equal Error Rate (%) on SRE 2008 and RATS LID data

Model                        SRE 2008          RATS LID
Type        $B_c$            Male    Female    10s     5s      3s      1s
Baseline    0                3.62    4.61      8.75    11.05   15.35   24.15
Proposed    EM               3.68    4.67      8.75    11.00   15.20   24.10
Proposed    $k p_c^{-1}$     2.99    4.14      8.55    10.85   14.90   23.75
However, when we examined the values of $B_c$ estimated by the model, we found an interesting correlation. In Figure 4.6, we have plotted the estimated value of $B_c$ at the end of the first EM iteration for SRE 2008 male data. The components have been ordered by increasing UBM weight, and the solid black line represents the inverse UBM weight (appropriately scaled) for that component. As we can see, the estimated value of $B_c$ is fairly correlated with the inverse UBM weight, and indeed the correlation coefficient was found to have a value of $\rho = 0.59$. However, it had dropped to $\rho = 0.19$ by the end of the final iteration.

Figure 4.6: Estimated $B_c$ at the end of EM iteration 1 for SRE 2008 male data

Having observed this, we tried to study the effect of using a fixed value of $B_c$ that was inversely proportional to the UBM weights, rather than estimating it through EM. This can be seen as trying to learn a subspace that fits the more frequently observed components better than those that are relatively infrequent, since we allow a higher variance for the residual in components with a low UBM weight. We used $B_c = k p_c^{-1}$, where the constant $k$ was selected such that the minimum value of $B_c$ was 0.05. As we can see from the table, this strategy of using a fixed value of $B_c$ inversely proportional to the UBM weights led to a significant improvement in EER over the baseline on SRE 2008 data, with a relative improvement of 17.4% for male trials and 10.2% for female trials. We also observed minor improvements for RATS LID data, ranging between 1.5-3% relative.
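Since $B_c = k\, p_c^{-1}$ is smallest for the component with the largest weight, the constant $k$ described above can be obtained directly from the UBM weights, as in the short sketch below (the function name and the choice of returning the full vector are illustrative).

```python
import numpy as np

def fixed_Bc(ubm_weights, min_Bc=0.05):
    """Fixed regularizers B_c = k / p_c, with k chosen so that min_c B_c equals min_Bc."""
    k = min_Bc * np.max(ubm_weights)   # min_c (k / p_c) = k / max_c p_c = min_Bc
    return k / ubm_weights
```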
4.4 Conclusion
We started our analysis with the objective of addressing drawbacks of the EM
algorithm for estimating the Total Variability matrix: the high computational cost
associated with it, and the fact that it only converges to a local maximum of the
likelihood function. We tried to mitigate these issues by using the RSVD algorithm,
which attempts to maximize the likelihood function directly, but requires some
approximations in order to simplify the estimation procedure. While the RSVD
algorithm was able to provide a much faster alternative to the EM algorithm, it
failed to achieve the global maximum of the likelihood function. Similarly, the
experiments in reducing the complexity of the MAP ivector extraction procedure
using approximations led to faster extraction, but ended with mixed results in
terms of accuracy, leading to degradation on the SID task but improvements on
the LID task.
These results were later explained by a detailed analysis of model assumptions.
By statistical tests, we were able to determine that the inaccuracy of approxima-
tions made for RSVD estimation (and thereby the suboptimality of the algorithm)
was rooted in an invalid assumption about globally tied weights in the original
model formulation. However, we showed that neither global tying of weights nor
any particular form of distribution underlying the observed feature vectors needed
to be assumed in order to arrive at the model likelihood function or the ivector
extraction procedure, by reformulating the model in terms of observed statistics
instead. This also explained the differences in the performance between the SID
and LID tasks for what we had earlier viewed as an efficient but approximate ivector
extraction method. By changing the statistic being modeled in the distribution
free formulation, we were able to view the RSVD algorithm and faster ivector
extraction procedure as exact procedures for a model over the modified statistic,
rather than as approximations over the original statistic. Differences in the results
across different tasks can therefore be attributed to differences in suitability of
the statistic being modeled to the task on which they are used. Finally, we also showed that the assumption that the variability in the mean supervector is confined entirely to a low rank subspace does not hold in practice, and modifications introduced within the model in order to account for the residual variability led to
an improved performance.
To summarize the improvements achieved, the RSVD estimation procedure
along with a faster method for ivector extraction led to a reduction in estimation
time by a factor of almost 20, and in ivector extraction by a factor of almost
5, while also providing between 5-13% relative improvement in terms of EER on
the RATS LID database. While this procedure did lead to worse results on the
SRE 2008 database, the reasons why it is advantageous to try it as an estimation
procedure (alongside EM) for any task are clear: it can be viewed as finding a global maximum for a different but related model, its computational requirements are an order of magnitude smaller compared to EM and it could also be used as an initialization scheme for EM, it facilitates much faster ivector extraction, and most
importantly, it could potentially lead to better performance. In addition, modify-
ing the model by accounting for the residual variability in the mean supervector
outside the low rank subspace also led to relative improvements between 10-17%
on SRE 2008 data.
Chapter 5
Discriminative Training
As discussed in previous chapters, the Total Variability Model (TVM) has con-
ventionally been formulated as a generative latent variable model. Because of its
generative nature, it can be trained on unlabeled data in a completely unsuper-
vised fashion. However, this generative nature can act as a limitation when a
large amount of labeled training data is available. Deep Neural Network (DNN)
embeddings known as x-vectors [48, 49] have been recently shown to outperform
i-vectors on the speaker verification task. The DNN model used in this system is
trained on a large labeled training corpus, which is further augmented by including
additional copies of clean input signals that are artificially degraded with multiple
types (and varying levels) of noise and reverberation. In that context, our focus
in this chapter is to revisit TVM in a discriminative sense.
5.1 Introduction
From a theoretical perspective, our approach is motivated by the idea from Chapter
4 [51, 50] where we had shown that TVM can be reformulated in a distribution-free
manner. Rather than considering TVM as a model that hypothesizes a specific
form of distribution over feature vectors, this reformulation views it as a model
defined over statistics derived from these feature vectors. Because of the alge-
braic form of these statistics, they can be shown to be (asymptotically) Gaussian
regardless of how the feature vectors are distributed.
From a system design perspective, this formulation is advantageous primarily
for two reasons. First, it allows the model to be viewed as a discriminative model
since it is agnostic to the generative process underlying the feature vectors. In
other words, instead of training the model to best explain how a collection of
feature vectors was generated, it can be trained to extract the statistics that are
most informative for the application at hand. Second, the algebraic form of the
statistics used in this formulation is general enough to allow for the inclusion of
a trainable parametric model such as a DNN within the system. This allows the
model to leverage the availability of large labeled training corpora during training.
Like the x-vector system, the proposed discriminative TVM system is a train-
able mapping from a sequence of input features to an embedding vector. By
comparing the similarities and differences between the system components used
within both architectures, we show that it is also possible to construct a hybrid
architecture that borrows ideas from both architectures. This hybrid architecture
uses TVM as one of the model layers that acts on top of the frame-level layers
that are used in the x-vector architecture. In our experiments on the Speakers In
The Wild (SITW) corpus, the hybrid model outperforms the x-vector as well as
discriminative i-vector systems.
5.2 Deep Neural Network Embeddings for
speaker verification
A number of different methods that use Deep Neural Networks (DNN) for the
speaker verification task have been proposed in literature. Some of these systems
use a DNN for text-dependent speaker verification [56, 20]. In the domain of
text-independent verification, many systems have deployed a DNN as a compo-
nent within an i-vector system, either replacing the Universal Background Model
(UBM) for obtaining the sufficient statistics [29, 41] or replacing the Mel Fre-
quency Cepstral Coefficient (MFCC) features with activations from a bottleneck
layer [35]. In both cases, the DNN used within the system is trained separately to
predict senone posterior probabilities on a transcribed dataset. Some of the other
approaches that leverage DNNs include the use of Siamese networks trained
using contrastive loss [6] and DNNs trained with triplet loss [58].
More recently, DNN embeddings known as x-vectors were proposed in [49].
As shown in Figure 5.2a, the x-vector model architecture involves a frame-level Time Delay Neural Network (TDNN) that maps the sequence of input features $X_u$ to a sequence of statistics $H_u$. Because of the TDNN structure, each vector $h_{ut}$ in the sequence $H_u$ is a function of a window of $2n + 1$ feature vectors $\{x_{u(t-n)}, \ldots, x_{u(t+n)}\}$ centered at $t$. The sequence $H_u$ is reduced to a single vector by obtaining globally pooled statistics such as the mean and (element-wise) standard deviation. This vector of pooled statistics is then projected through an affine mapping to another vector $w_u$ which is used as an input to a classifier network. The classifier network produces posterior probabilities for the speaker labels as the output, and the whole system is trained in an end-to-end fashion with a cross-entropy loss function. During the test phase, the classifier network is discarded and the embedding $w_u$ is used as the speaker representation.
5.3 TVM in discriminatively trained systems
Although TVM has conventionally been formulated as a generative model, we had
shown in [51, 50] that it can also be reformulated as a discriminative model. In this
section we briefly explain this reformulation that motivates our approach, followed
by a description of the proposed model architecture.
5.3.1 Distribution-free reformulation of TVM
In the conventional formulation of TVM, the algorithms used for parameter estimation and i-vector extraction often involve the computation of Baum-Welch statistics defined as below:

$$N_{uc} = \sum_{t=1}^{T_u} \gamma_{utc} \qquad F_{uc} = \frac{1}{N_{uc}} \sum_{t=1}^{T_u} \gamma_{utc} x_{ut} \qquad (5.1)$$

where $\gamma_{utc}$ corresponds to the posterior probability $p(c|x_{ut})$ that a particular feature vector $x_{ut}$ is associated with the Gaussian component $c$.
In order to formulate TVM in a distribution-free manner, the central idea used in [51, 50] is to view the Baum-Welch statistics, rather than the feature vectors, as the variables on which the model likelihood function is defined. In particular, it was shown that the statistics $F_{uc}$ are asymptotically Gaussian regardless of how the feature vectors $x_{ut}$ are distributed:

$$F_{uc} \sim \mathcal{N}(\mu_{uc}, N_{uc}^{-1}\Sigma_c) \qquad (5.2)$$

Crucially, this result is not restricted to the particular choice of $\gamma_{utc}$ as posterior probabilities of Gaussian components. Instead, the result holds for statistics $F_{uc}$ and $N_{uc}$ computed as given by equation (5.1) using an arbitrary positive function $\gamma_{utc} = \Gamma_c(x_{ut})$, $\Gamma_c: \mathbb{R}^D \mapsto [0,\infty)$. In this formulation, the constants $\mu_{uc}$ and $\Sigma_{uc}$ in equation (5.2) refer to the following expected values:

$$p_{uc} = \mathbb{E}[\gamma_{utc}] \qquad \mu_{uc} = \frac{1}{p_{uc}} \mathbb{E}[\gamma_{utc} x_{ut}]$$
$$\Sigma_{uc} = \frac{1}{p_{uc}} \mathbb{E}\left[(\gamma_{utc} x_{ut} - p_{uc}\mu_{uc})(\gamma_{utc} x_{ut} - p_{uc}\mu_{uc})^T\right] \qquad (5.3)$$
5.3.2 Discriminative training for TVM
In the conventional TVM formulation the model parameters are estimated using
the Maximum Likelihood criterion, typically through the Expectation Maximiza-
tion (EM) algorithm. Broadly speaking, this procedure returns parameter esti-
mates that are most likely to have produced the observed collection of feature
vectors under the assumed generative process.
However, the distribution-free formulation does not make any assumptions
about the generative process. Therefore, instead of viewing the model as a tool
to explain how the feature vectors were generated, it can be viewed as a train-
able mapping from the feature vector sequence to the i-vector, which is typically
estimated from the feature sequence $X_u$ according to the expression below:

$$w^*_u = \left( I + T^T \Sigma^{-1} N_u T \right)^{-1} T^T \Sigma^{-1} N_u F_{u0} \qquad (5.4)$$

where $N_u$ and $\Sigma$ are block diagonal matrices with $N_{uc}$ and $\Sigma_c$ as the diagonal blocks, and $F_{u0} = F_u - M_0$.
The Total Variability matrix $T$, the (super)vector $M_0$ and the functions $\Gamma_c$ used to compute the statistics $N_{uc}$ and $F_{uc}$ can be treated as trainable parameters of this mapping. Since the $\Gamma_c$ can be arbitrary positive functions, a natural choice to parametrize them is using a DNN with a positive-valued nonlinearity at the output layer. Similar to the x-vector system, the model can be trained by appending a classifier network on top of the i-vector, and using backpropagation to minimize a discriminative loss function like the cross-entropy between the true and predicted labels. The overall training schematic for this system is shown in Figure 5.1. In the figure, $\Gamma$ refers to the DNN with $C$ (positive valued) output nodes corresponding to the functions $\{\Gamma_c\}_{c=1}^{C}$, $T$ refers to the block which extracts the i-vector as per equation (5.4), $y_u$ and $\hat{y}_u$ refer to the true and predicted labels respectively, and $\mathcal{L}$ refers to the loss function being minimized.

Figure 5.1: Loss computation for discriminative training of TVM (block diagram: $X_u \to \Gamma \to (N_u, F_u) \to T \to w_u \to$ Classifier $\to \hat{y}_u$, compared against $y_u$ by the loss $\mathcal{L}$)
Related work: The discriminative training approach proposed here differs
from the conventional TVM formulation in two aspects. First, it replaces the UBM
with a (trainable) DNN. Second, the model is trained using backpropagation on
labeled data as opposed to the EM algorithm on unlabeled data.
Both these methods have individually been proposed in the literature before.
Discriminative training was used for TVM training in [15], but their system did
not use a DNN. As mentioned earlier in Section 5.2, a DNN was used to replace
the UBM in [29, 41, 35] but the system was not trained discriminatively. Instead,
they use a fixed DNN that was trained separately to predict phonetic classes for
Automatic Speech Recognition (ASR) which requires transcriptions for training.
5.3.3 Hybrid architecture: TVM as a network layer
The abstraction of viewing TVM as a trainable mapping from the feature sequence
to the i-vector brings out some of the parallels with the x-vector architecture. In
order to make an explicit comparison between the two, we focus on the pooling
mechanism that each of them employs for reducing a frame-level sequence to a
single vector.
As shown in Figure 5.2, the x-vector system uses a TDNN to obtain a sequence of statistics $H_u$ and subsequently reduces it to a single vector using global mean and standard deviation pooling. On the other hand, the pooling in the case of discriminative TVM training happens by means of the computation of $F_{uc}$ as given by equation (5.1). If we interpret the positive-valued $\gamma_{utc} = \Gamma_c(x_{ut})$ as defining a soft indicator function of an arbitrary region in the feature space, then the vector $F_{uc}$ can be viewed as a local mean of the feature vectors within this region. The supervector $F_u$ is then formed by concatenating multiple local means $\{F_{uc}\}_{c=1}^{C}$. To make the comparison explicit, the global mean pooling operation in the x-vector can be interpreted as the computation of $F_u$ for the trivial case of $C = 1$ and a constant function $\Gamma_c = 1$. In other words, the x-vector architecture uses a single global mean extracted from a sequence of TDNN-processed features, whereas the discriminative TVM architecture uses multiple local means $\{F_{uc}\}_{c=1}^{C}$ extracted directly from the input features.

Figure 5.2: X-vector and hybrid system architectures ((a) X-vector architecture: $X_u \to$ TDNN (frame level) $\to H_u \to$ global statistics, affine mapping $\to w_u \to$ Classifier $\to \hat{y}_u$; (b) Hybrid architecture: $X_u \to$ TDNN (frame level) $\to H_u \to$ local statistics, i-vector extraction $\to w_u \to$ Classifier $\to \hat{y}_u$)
It is therefore natural to consider a hybrid architecture where a TDNN is used to process the input features as in the x-vector architecture and, instead of global mean pooling, multiple local means are extracted using the vector $F_u$ as in the discriminative TVM architecture. A block diagram of this hybrid architecture is shown in Figure 5.2b. This model effectively uses TVM as a network layer that acts on the sequence $H_u$.
5.3.4 Gradients for the Total Variability layer
One of the major difficulties involved in training the discriminative TVM as well
as the hybrid architectures is that i-vector extraction using equation (5.4) involves
a matrix inversion. This makes it difficult to efficiently estimate the gradients for
backpropagation.
To circumvent this issue, we explore two possible solutions. One workaround
is to use a normalized statistic $\tilde{F}_u$ instead of $F_u$ as in [50]:
$$p_{uc} = \frac{N_{uc}}{T_u} \qquad \Sigma_c^{-1} = L_c L_c^T \qquad \tilde{F}_{uc} = \sqrt{p_{uc}}\, L_c^T F_{uc} \qquad (5.5)$$
As discussed in [50], the MAP i-vector estimate for this normalized statistic is given (approximately) by a projection to a normalized subspace $\tilde{T}$ as below:

$$w_u = \left( \tilde{T}^T \tilde{T} \right)^{-1} \tilde{T}^T \tilde{F}_{u0} \qquad (5.6)$$
The projection matrix $(\tilde{T}^T \tilde{T})^{-1} \tilde{T}^T$ can effectively be represented by a single affine layer, making it easy to estimate gradients.
Another possibility is to project the unnormalized supervector $F_u$ directly to an embedding $w_u$ using an affine transformation. In this case, the embedding $w_u$ does not have an interpretation as a MAP i-vector estimate.
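To make the shape of this computation concrete, the following PyTorch sketch implements the second option: soft occupancy weights from a positive-valued network $\Gamma$, local means as in (5.1), and an affine projection of the unnormalized supervector to the embedding. It is a minimal sketch under these assumptions (layer sizes, names and the single-layer $\Gamma$ are illustrative), not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn

class TotalVariabilityPooling(nn.Module):
    """Pooling layer sketch: gamma = softplus(linear(H)), local means F_uc as in (5.1),
    and an affine map of the flattened supervector F_u to the embedding w_u."""

    def __init__(self, feat_dim, num_components, embed_dim):
        super().__init__()
        self.gamma = nn.Sequential(nn.Linear(feat_dim, num_components), nn.Softplus())
        self.project = nn.Linear(feat_dim * num_components, embed_dim)

    def forward(self, H):                        # H: (batch, T, feat_dim) frame-level features
        g = self.gamma(H)                        # (batch, T, C) positive weights gamma_utc
        N = g.sum(dim=1)                         # (batch, C) zeroth-order statistics
        F = torch.einsum('btc,btd->bcd', g, H)   # (batch, C, D) sum_t gamma_utc h_ut
        F = F / N.clamp_min(1e-6).unsqueeze(-1)  # local means F_uc
        return self.project(F.flatten(1))        # affine projection to w_u
```

Under the naming of Table 5.2, the Hybrid (1500, 32) configuration would correspond in this sketch to applying TotalVariabilityPooling(1500, 32, 512) to the TDNN outputs.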
5.4 Experiments
5.4.1 Datasets
We evaluated our models for the speaker verification task using the Speaker In
The Wild (SITW) corpus [34]. For training, we used a combination of subsets
Table 5.1: Distribution of training data
Dataset Utterances Speakers Hours
NIST SRE 2004 4521 307 199
NIST SRE 2005 2733 710 115
NIST SRE 2006 18158 2220 766
NIST SRE 2008 13269 1319 891
NIST SRE 2012 16111 971 721
Mixer 6 24214 591 1664
Comprehensive Switchboard 28156 2594 1337
Voxceleb 1 15343 1190 290
Voxceleb 2 111803 5994 2257
Combined 234308 14630 8240
from NIST SRE 2004-2008, NIST SRE 2012, Mixer 6, Comprehensive Switch-
board as well as Voxceleb [36] data. Overall, the data consisted of roughly 230
thousand utterances from approximately 15 thousand speakers and close to 8000
hours of audio (after silence removal). Utterances from speakers that overlap
between Voxceleb and the SITW development and evaluation sets were removed
from training. Details about the number of speakers and utterances as well as total
speech duration from each individual database are reported in Table 5.1. Similar
to [49], we also further augmented the training dataset by including 4 artificially
degraded copies for each utterance: 1 copy each for degradation with background
noise, music, babble and reverberation.
5.4.2 System Details
For feature extraction we used 20-dimensional MFCCs concatenated with first and second order delta coefficients. For all the models, the dimension of the embedding was fixed to 512. For the generative TVM model, we used a UBM with 2048 components, and the Total Variability matrix was trained using the EM algorithm. All other models were trained using a Stochastic Gradient Descent (SGD) optimizer with a learning rate that was linearly reduced from 0.05 to 0.001 and a momentum of 0.5. Similar to [49], we used a minibatch size of 64 such that each utterance in the batch consisted of a chunk of audio with a duration randomly chosen between 2 and 4 seconds. For regularization, we used a dropout probability of 0.1 that was
applied after training for 1 epoch. The x-vector model architecture was identical to
that used in [49]. As described in Section 5.3.4 we experimented with normalized
as well as unnormalized statistics for discriminative training of TVM. Since we
obtained better results using unnormalized statistics, all the hybrid models were
trained using unnormalized statistics.
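The minibatch construction described above (64 chunks, each 2 to 4 seconds long) could look roughly like the following sketch; the 100 frames-per-second rate and the function names are assumptions made here for illustration.

```python
import random

FRAMES_PER_SECOND = 100  # assumes a 10 ms frame shift

def sample_chunk(features):
    """Pick a random 2-4 second chunk from a (T, feat_dim) feature matrix."""
    dur = random.uniform(2.0, 4.0)
    n = min(int(dur * FRAMES_PER_SECOND), len(features))
    start = random.randint(0, len(features) - n)
    return features[start:start + n]

def make_minibatch(utterance_features, batch_size=64):
    """utterance_features: list of (T, feat_dim) arrays; returns batch_size random chunks."""
    picked = random.choices(utterance_features, k=batch_size)
    return [sample_chunk(f) for f in picked]
```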
During evaluation, the embeddings were first projected to a lower dimension using Linear Discriminant Analysis (LDA) and then mean- as well as length-normalized before scoring by means of a Probabilistic Linear Discriminant Analysis (PLDA) model [23]. The LDA and PLDA models were trained on a subset of 250k randomly chosen utterances from the augmented training set. For every embedding, the dimension of the LDA transform was tuned on the development set. With the exception of the generative TVM model, for which the optimum LDA dimension was 250, all the other embeddings used an LDA dimension of 150. The mean vector for mean normalization was also obtained from the development set.
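As an illustration of this back-end, the following sketch (using scikit-learn for the LDA step, with hypothetical variable names, and with PLDA scoring omitted) shows the LDA projection, mean normalization and length normalization applied to an embedding before scoring.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_backend(train_emb, train_lab, dev_emb, lda_dim=150):
    """Fit the LDA projection and the dev-set mean used for mean normalization.

    train_emb: (N, 512) embeddings, train_lab: (N,) speaker labels,
    dev_emb: (M, 512) development-set embeddings.
    """
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)  # 250 for the generative TVM model
    lda.fit(train_emb, train_lab)
    dev_mean = lda.transform(dev_emb).mean(axis=0)
    return lda, dev_mean

def backend_transform(w, lda, dev_mean):
    """LDA projection, mean normalization, then length normalization of one embedding."""
    w = lda.transform(w.reshape(1, -1))[0] - dev_mean
    return w / np.linalg.norm(w)

# The transformed enrollment and test embeddings would then be scored with a PLDA model [23].
```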
Table 5.2: Results on the SITW eval set
Model                 EER (%)  DCF (10^-2)  DCF (10^-3)
Generative TVM           6.86        0.551        0.726
Discr TVM (n)            5.25        0.490        0.666
Discr TVM (un)           5.14        0.471        0.669
x-vector                 4.57        0.411        0.613
Hybrid (1500, 32)        4.29        0.367        0.542
Hybrid (500, 100)        3.88        0.363        0.541
Hybrid (220, 220)        3.88        0.353        0.530
Hybrid (50, 1000)        3.61        0.352        0.535
5.4.3 Results
The Equal Error Rate (EER) and the Detection Cost Function (DCF) evaluated at target probabilities of 0.01 and 0.001 for all the different models on the core trials of the SITW evaluation set are presented in Table 5.2.
For the discriminative TVM models, (n) or (un) is used in parentheses to indicate whether normalized or unnormalized statistics were used for training the model. For the hybrid models, we performed experiments by varying the dimensionality of the TDNN output layer as well as the dimensionality C of the output layer of Γ. We tried 4 different configurations, keeping the dimensionality of the supervector close to 50000. In Table 5.2, Hybrid (D, C) refers to a hybrid model with a D-dimensional output layer for the TDNN and a C-dimensional output layer for Γ.
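(For reference, the supervector dimensionality in each configuration is the product D · C: 1500 · 32 = 48,000, 500 · 100 = 50,000, 220 · 220 = 48,400, and 50 · 1000 = 50,000, all close to the 50,000 target.)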
It can be seen that discriminatively trained TVM significantly outperforms its
generative counterpart, with the model trained using unnormalized statistics per-
forming slightly better. The x-vector model provides an even better performance,
but the best performance is obtained using hybrid models that leverage system
components from both architectures. Comparing the different hybrid models, the
trend seems to favor using a smaller dimensionality for the TDNN output and a
higher dimensionality for the output layer of Γ. Compared to the x-vector model,
the best performing hybrid model results in a roughly 20% relative improvement
on EER, 17% relative improvement on DCF(0.01) and 13% relative improvement
on DCF(0.001).
5.5 Conclusion
We have presented a technique for discriminatively training TVM that is motivated by a distribution-free reformulation of the model. This technique enables TVM to be trained on labeled corpora, while also allowing for the integration of DNNs within the model architecture. In addition, it also enables TVM to be used as a network layer in an arbitrary DNN embedding architecture. In particular, we have presented a hybrid architecture that uses TVM instead of a global statistics pooling layer in the x-vector architecture. This led to significant performance gains in speaker verification experiments on the SITW evaluation set.
Chapter 6
Conclusion and Future Work
In this chapter, we first summarize the important ideas and contributions from
our research work described in the earlier chapters. Then, we describe some of the
ways in which these ideas can be extended for future work.
6.1 Summary of proposed ideas and contributions
6.1.1 Efficient Estimation
Our initial focus was on the high computational complexity of the Expectation
Maximization (EM) algorithm used to estimate the parameters of the Total Vari-
ability Model. We proposed an algorithm to maximize the marginalized likelihood
directly, while making some approximations that would be suitably justified if the
generative assumptions made in the model are true. This approach reduced the
computational time required for parameter estimation by a factor of around 20-25,
while also reducing the time required for i-vector extraction by a factor of around
5. In addition, it also led to a performance improvement on the RATS language
identification task. However, a noticeable performance degradation was observed
when this method was used on a speaker verification task.
6.1.2 Statistical Analysis of Model Assumptions
The observed performance degradation on speaker verification motivated us to
analyze the underlying causes. From our analysis, we observed that the approxi-
mation we had made in the likelihood function was invalid because the generative
assumptions made in the model themselves did not hold in practice. We presented
a number of statistical tests in order to demonstrate this rigorously. However, we
also showed that the model itself can be reformulated in a manner that did not
rely on these assumptions. The proposed distribution-free reformulation enabled
the model to be viewed in a discriminative sense.
6.1.3 Discriminative Training
Motivated by the distribution-free reformulation of the Total Variability Model,
we proposed an algorithm for training the model discriminatively on large labeled
corpora. We showed that (trainable) Deep Neural Networks can be integrated to
extract statistics within the model. In addition, we showed that the model itself
can be abstracted as a layer that can be integrated as a part of other Deep Neural
Network embedding models. We presented experiments on speaker verification where this approach resulted in a 20% relative improvement in Equal Error Rate (EER) compared to a very competitive DNN embedding baseline.
6.2 Future Work
6.2.1 Applicability to other problems and modalities
We have primarily focused on speaker and language identification tasks in all the experiments we have presented. However, as noted in Chapter 1, the Total Variability Model is very versatile in its utility for speech analysis applications. The techniques presented in our research are therefore applicable to a large number of other problems. In particular, the idea of treating the model as a discriminatively trainable mapping has huge potential in many domains that use the i-vector representation as a fixed system component. This idea allows the i-vector representation used within the system to be optimized for the particular application at hand.
In addition, these ideas also allow the model to be extended to other signal
modalities such as text, image, video as well as biological signals. The genera-
tive formulation of the Total Variability Model was limiting because the particu-
lar probabilistic assumptions made for speech might not be appropriate in other
modalities. However, the distribution-free reformulation makes it independent of
these assumptions. In addition, the discriminative training algorithm allows it to
be integrated abstractly as a pooling layer in any Deep Neural Network scheme
acting on an arbitrary modality. Therefore, the model can be used in neural archi-
tectures for any domain that requires reducing a sequence to a vector. A number of tasks in text, video and biological signal processing require such an operation, and the use of the Total Variability Model in such domains can therefore be explored.
6.2.2 Acoustic model adaptation for automatic speech
recognition
In addition to enabling discriminative training for obtaining better representations, the results underlying the distribution-free formulation can also enable other types of training and architectures for automatic speech recognition. We briefly explain a couple of approaches that can be explored in the future.
Adversarial regularization for acoustic models in speech recognition
The task of an acoustic model in a speech recognition system is to estimate the probability p(y_ut | X_u) that a frame at index t corresponds to a senone y_ut (which is a representation corresponding to a cluster of context-dependent phonemes). In an acoustic model, any variability in x_ut that arises from sources other than y_ut (such as changes in speaker or recording condition) is a source of noise, and adversely affects its performance. Therefore, it is desirable to build the acoustic model over a representation that is robust to such changes.
One way of achieving robustness is by augmenting the model objective function L_θ with a regularization term L_R that accounts for the susceptibility of a model representation to changes from noisy sources: L = L_θ + λ L_R. To express the idea formally, let y_L = f(x_ut) denote a model representation (for example, the last layer of the acoustic model DNN). Then, we can use y_L as feature vectors to extract the statistics F_uc and N_uc given by (5.1). Ideally, if the representation y_L is truly robust to text-independent variability, then the mean and covariance parameters of these statistics should also be globally tied across all utterances. Therefore, the total log likelihood L_R(F_uc | N_uc, μ_c, Σ_c) under globally tied parameters μ_c and Σ_c can be used as a regularization term to encourage robustness in the representation y_L. In addition, the generality of the distribution-free formulation for an arbitrary Γ can also be exploited here, by updating Γ in an adversarial manner. A schematic for this type of (potentially adversarial) regularized acoustic model training is shown in Figure 6.1.
[Figure 6.1: Regularized acoustic model training schematic. The input X_u passes through the hidden layers Θ_hid to produce the representation y_L. One branch applies the output layer Θ_out to obtain senone predictions ŷ_ut, scored against the targets y_ut with the loss L_θ; a second branch applies Γ to extract the statistics N_uc and F_uc, which are scored against the globally tied parameters μ_c, Σ_c with the regularization loss L_R.]
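To make the proposed regularization concrete, here is a minimal PyTorch-style sketch of the combined loss L = L_θ + λ L_R. It is not from the dissertation: the layer sizes, the use of a softmax layer for Γ, the diagonal-covariance form of the tied Σ_c, and treating μ_c and Σ_c as trainable parameters are all assumptions made for illustration, and since the losses are written for minimization, L_R below is the negative of the log likelihood described above. One way to realize the adversarial variant would be to update Γ in the opposite direction, for example via a gradient-reversal layer.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegularizedAcousticModel(nn.Module):
    """Sketch of training with L = L_theta + lambda * L_R, where L_R penalizes deviation of the
    local means F_uc from globally tied Gaussian parameters (mu_c, Sigma_c)."""

    def __init__(self, feat_dim=40, hid_dim=512, n_senones=4000, n_comp=64, lam=0.1):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.ReLU(),
                                    nn.Linear(hid_dim, hid_dim), nn.ReLU())   # Theta_hid
        self.out = nn.Linear(hid_dim, n_senones)                              # Theta_out
        self.gamma = nn.Linear(hid_dim, n_comp)                               # Gamma
        self.mu = nn.Parameter(torch.zeros(n_comp, hid_dim))                  # tied means mu_c
        self.log_var = nn.Parameter(torch.zeros(n_comp, hid_dim))             # tied diagonal covariances
        self.lam = lam

    def forward(self, x, senone_targets):
        # x: (T, feat_dim) frames of one utterance u; senone_targets: (T,) senone labels y_ut
        y_L = self.hidden(x)                                                  # representation y_L
        loss_theta = F.cross_entropy(self.out(y_L), senone_targets)           # L_theta

        post = F.softmax(self.gamma(y_L), dim=-1)                             # (T, C) soft assignments
        N_uc = post.sum(dim=0).clamp_min(1e-6)                                # zeroth-order statistics
        F_uc = post.t() @ y_L / N_uc.unsqueeze(1)                             # (C, hid_dim) local means

        # L_R: negative Gaussian log likelihood of F_uc; its covariance under the tied
        # parameters shrinks as Sigma_c / N_uc.
        var = self.log_var.exp() / N_uc.unsqueeze(1)
        loss_R = 0.5 * ((F_uc - self.mu) ** 2 / var + torch.log(2 * math.pi * var)).sum()

        return loss_theta + self.lam * loss_R
```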
Joint speech and speaker recognition
So far, we have suggested a schematic for joint discriminative training of TVM that can be used to extract text-independent information from speech, and a regularized acoustic model schematic that can be used to extract the text information. However, both these modes of information are ultimately encoded in the same signal, and both types of models often even use the same feature representation (MFCCs) for extracting information. Therefore, it is natural to think of a combined model design within which both these models are incorporated.
One possible way to achieve this is by means of a DNN with a shared representation, which is used as an input by both the acoustic model as well as TVM, as shown in the schematic in Figure 6.2. This model brings together all the ideas we have proposed for the TVM formulation in our current work as well as for future work, and is a promising research direction.
[Figure 6.2: Schematic of the combined model with a shared representation. The input X_u is mapped by Θ_shared to a shared representation y_shared, which feeds both an acoustic-model branch (Θ_AM producing y_AM, followed by Θ_out and Γ_AM to give the senone predictions ŷ_ut, trained with the loss L_AM) and a TVM branch (Θ_TVM producing y_TVM, followed by Γ_TVM, T and B to give the embedding w_u, trained with the loss L_TVM).]
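A minimal sketch of this combined design is given below (again PyTorch-style, with invented layer sizes, and reusing the TVMPoolingLayer sketch from Chapter 5 above). The shared trunk feeds an acoustic-model head scored with L_AM and a TVM embedding head scored with L_TVM, and the two losses would be combined, for example as a weighted sum, for joint training.

```python
import torch.nn as nn

class JointSpeechSpeakerModel(nn.Module):
    """Sketch: a shared representation feeding an acoustic-model head and a TVM embedding head."""

    def __init__(self, feat_dim=40, shared_dim=512, n_senones=4000, n_speakers=14630):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, shared_dim), nn.ReLU(),
                                    nn.Linear(shared_dim, shared_dim), nn.ReLU())  # Theta_shared
        self.am_head = nn.Linear(shared_dim, n_senones)          # acoustic-model branch
        self.tvm_head = TVMPoolingLayer(frame_dim=shared_dim)    # TVM branch (Gamma, affine map)
        self.spk_out = nn.Linear(512, n_speakers)                # speaker classifier on the 512-dim w_u

    def forward(self, x):
        # x: (T, feat_dim) frames of one utterance
        y_shared = self.shared(x)
        senone_logits = self.am_head(y_shared)   # frame-level predictions, scored with L_AM
        w_u = self.tvm_head(y_shared)            # utterance-level embedding, scored with L_TVM
        return senone_logits, self.spk_out(w_u)

# Joint training would then minimize, e.g., L = L_AM + beta * L_TVM.
```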
Reference List
[1] Robust Automatic Transcription of Speech (RATS). https://www.darpa.
mil/program/robust-automatic-transcription-of-speech. [Online].
[2] Brandon M Booth, Rahul Gupta, Pavlos Papadopoulos, Ruchir Travadi, and
Shrikanth S Narayanan. Automatic Estimation of Perceived Sincerity from
Spoken Language. In INTERSPEECH, pages 2021–2025, 2016.
[3] William M Campbell, Douglas E Sturim, and Douglas A Reynolds. Support
Vector Machines using GMM Supervectors for Speaker Verification. IEEE
signal processing letters, 13(5):308–311, 2006.
[4] William M Campbell, Douglas E Sturim, Douglas A Reynolds, and Alex
Solomonoff. SVM based Speaker Verification using a GMM Supervector Ker-
nel and NAP Variability Compensation. In Acoustics, Speech and Signal Pro-
cessing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Confer-
ence on, volume 1, pages I–I. IEEE, 2006.
[5] Nil Goksel Canbek and Mehmet Emin Mutlu. On the track of Artificial Intel-
ligence: Learning with Intelligent Personal Assistants. Journal of Human
Sciences, 13(1):592–601, 2016.
[6] Ke Chen and Ahmad Salman. Extracting speaker-specific information with
a regularized siamese deep network. In Advances in Neural Information Pro-
cessing Systems, pages 298–306, 2011.
[7] I.D. Coope and P.F. Renaud. Trace Inequalities with Applications to Orthogonal Regression and Matrix Nearness Problems. JIPAM. Journal of Inequalities in Pure and Applied Mathematics, 10(4), 2009.
[8] José Manuel Cordero, Manuel Dorado, and José Miguel de Pablo. Automated
Speech Recognition in ATC environment. In Proceedings of the 2nd Inter-
national Conference on Application and Theory of Automation in Command
and Control Systems, pages 46–53. IRIT Press, 2012.
[9] Sandro Cumani and Pietro Laface. Factorized Sub-space Estimation for Fast
and Memory Effective I-vector Extraction. Audio, Speech, and Language Pro-
cessing, IEEE/ACM Transactions on, 22(1):248–259, 2014.
[10] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre
Ouellet. Front-end Factor Analysis for Speaker Verification. IEEE Transac-
tions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.
[11] Najim Dehak, Pedro A Torres-Carrasquillo, Douglas Reynolds, and Reda
Dehak. Language Recognition via i-vectors and Dimensionality Reduction.
In Twelfth Annual Conference of the International Speech Communication
Association, 2011.
[12] Jurgen A Doornik and Henrik Hansen. An Omnibus Test for Univariate
and Multivariate Normality. Oxford Bulletin of Economics and Statistics,
70(s1):927–939, 2008.
[13] Emil Ettelaie, Sudeep Gandhe, Panayiotis Georgiou, Kevin Knight, Daniel
Marcu, Shrikanth Narayanan, David Traum, and Robert Belvin. Transonics:
A Practical Speech-to-Speech Translator for English-Farsi Medical Dialogues.
In Proceedings of the ACL 2005 on Interactive poster and demonstration ses-
sions, pages 89–92. Association for Computational Linguistics, 2005.
[14] Daniel Garcia-Romero and Carol Y Espy-Wilson. Analysis of I-vector Length
Normalization in Speaker Recognition Systems. In Interspeech, volume 2011,
pages 249–252, 2011.
[15] Ondřej Glembek, Lukáš Burget, Niko Brümmer, Oldřich Plchot, and Pavel
Matějka. Discriminatively trained i-vector extractor for speaker verification.
In Twelfth Annual Conference of the International Speech Communication
Association, 2011.
[16] Ondřej Glembek, Lukáš Burget, Pavel Matějka, Martin Karafiát, and Patrick
Kenny. Simplification and Optimization of i-vector Extraction. In Acoustics,
Speech and Signal Processing (ICASSP), 2011 IEEE International Conference
on, pages 4516–4519. IEEE, 2011.
[17] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding Structure
with Randomness: Probabilistic Algorithms for Constructing Approximate
Matrix Decompositions. SIAM review, 53(2):217–288, 2011.
[18] John HL Hansen and Taufiq Hasan. Speaker Recognition by Machines and
Humans: A tutorial review. IEEE Signal processing magazine, 32(6):74–99,
2015.
[19] Andrew O Hatch, Sachin S Kajarekar, and Andreas Stolcke. Within-Class
Covariance Normalization for SVM-based Speaker recognition. In INTER-
SPEECH, 2006.
[20] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer. End-to-end
text-dependent speaker verification. In Acoustics, Speech and Signal Process-
ing (ICASSP), 2016 IEEE International Conference on, pages 5115–5119.
IEEE, 2016.
[21] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-Rahman
Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick
Nguyen, Tara N Sainath, et al. Deep Neural Networks for Acoustic Mod-
eling in Speech Recognition: The shared views of four research groups. IEEE
Signal Processing Magazine, 29(6):82–97, 2012.
[22] Hans-Günter Hirsch and Christoph Ehrlicher. Noise Estimation Techniques
for Robust Speech Recognition. In Acoustics, Speech, and Signal Processing,
1995. ICASSP-95., 1995 International Conference on, volume 1, pages 153–
156. IEEE, 1995.
[23] Sergey Ioffe. Probabilistic linear discriminant analysis. In European Confer-
ence on Computer Vision, pages 531–542. Springer, 2006.
[24] Patrick Kenny. Bayesian speaker verification with heavy-tailed priors. In
Odyssey, page 14, 2010.
[25] Patrick Kenny. A small footprint i-vector extractor. In Odyssey, pages 1–6,
2012.
[26] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel. Joint
Factor Analysis versus Eigenchannels in Speaker Recognition. IEEE Trans-
actions on Audio, Speech, and Language Processing, 15(4):1435–1447, 2007.
[27] Patrick Kenny, Vishwa Gupta, Themos Stafylakis, P Ouellet, and J Alam.
Deep Neural Networks for extracting Baum-Welch statistics for Speaker
Recognition. In Proc. Odyssey, pages 293–298, 2014.
[28] Chul Min Lee and Shrikanth S Narayanan. Toward Detecting Emotions in
Spoken Dialogs. IEEE Transactions on Speech and Audio Processing, 13(2):293–
303, 2005.
[29] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren. A Novel
Scheme for Speaker Recognition Using a Phonetically-Aware Deep Neural
Network. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE
International Conference on, pages 1695–1699. IEEE, 2014.
[30] Ming Li, Kyu J Han, and Shrikanth Narayanan. Automatic speaker age and
gender recognition using acoustic and prosodic level information fusion. Com-
puter Speech & Language, 27(1):151–167, 2013.
[31] Ming Li and Shrikanth Narayanan. Simplified Supervised I-vector Modeling
with Application to Robust and Efficient Language Identification and Speaker
Verification. Computer Speech & Language, 28(4):940–958, 2014.
[32] Nikolaos Malandrakis, Anil Ramakrishna, Victor Martinez, Tanner Sorensen,
Dogan Can, and Shrikanth Narayanan. The ELISA Situation Frame extrac-
tion for low resource languages pipeline for LoReHLT’2016. Machine Trans-
lation, pages 1–16, 2017.
[33] Pavel Matejka, Le Zhang, Tim Ng, Harish Sri Mallidi, Ondrej Glembek, Jeff
Ma, and Bing Zhang. Neural Network Bottleneck Features for Language
Identification. Proc. IEEE Odyssey, pages 299–304, 2014.
[34] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson. The
speakers in the wild (sitw) speaker recognition database. In Interspeech, pages
818–822, 2016.
[35] Mitchell McLaren, Yun Lei, and Luciana Ferrer. Advances in deep neural
network approaches to speaker recognition. In Acoustics, Speech and Signal
Processing (ICASSP), 2015 IEEE International Conference on, pages 4814–
4818. IEEE, 2015.
[36] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a large-
scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
[37] Shrikanth Narayanan and Panayiotis G Georgiou. Behavioral Signal Pro-
cessing: Deriving Human Behavioral Informatics From Speech and Language.
Proceedings of the IEEE, 101(5):1203–1233, 2013.
[38] Pavlos Papadopoulos, Ruchir Travadi, and Shrikanth Narayanan. Global
SNR Estimation of Speech Signals for Unknown Noise Conditions using Noise
Adapted Non-linear Regression. Proc. Interspeech 2017, pages 3842–3846,
2017.
[39] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn. Speaker Ver-
ification Using Adapted Gaussian Mixture Models. Digital signal processing,
10(1-3):19–41, 2000.
[40] Douglas A Reynolds and Richard C Rose. Robust Text-Independent Speaker
Identification Using Gaussian Mixture Speaker Models. IEEE transactions on
speech and audio processing, 3(1):72–83, 1995.
[41] Fred Richardson, Douglas Reynolds, and Najim Dehak. Deep Neural Network
Approaches to Speaker and Language Recognition. IEEE Signal Processing
Letters, 22(10):1671–1675, 2015.
[42] George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. Speaker
Adaptation of Neural Network Acoustic Models Using I-Vectors. In IEEE
Workshop on Automatic Speech Recognition and Understanding, pages 55–59,
2013.
[43] Björn Schuller, Stefan Steidl, Anton Batliner, Simone Hantke, Florian Hönig,
Juan Rafael Orozco-Arroyave, Elmar Nöth, Yue Zhang, and Felix Weninger.
The INTERSPEECH 2015 Computational Paralinguistics Challenge: Native-
ness, Parkinson’s & Eating Condition. In Sixteenth Annual Conference of the
International Speech Communication Association, 2015.
[44] Björn W Schuller, Stefan Steidl, Anton Batliner, Julien Epps, Florian Eyben,
Fabien Ringeval, Erik Marchi, and Yue Zhang. The INTERSPEECH 2014
Computational Paralinguistics Challenge: Cognitive & Physical Load. In
INTERSPEECH, pages 427–431, 2014.
[45] Björn W Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee K
Burgoon, Alice Baird, Aaron C Elkins, Yue Zhang, Eduardo Coutinho, and
Keelan Evanini. The INTERSPEECH 2016 Computational Paralinguistics
Challenge: Deception, Sincerity & Native Language. In INTERSPEECH,
pages 2001–2005, 2016.
[46] Maarten van Segbroeck, Ruchir Travadi, and Shrikanth S Narayanan. UBM
Fused Total Variability Modeling for Language Identification. In Fifteenth
Annual Conference of the International Speech Communication Association,
2014.
[47] Andrew W Senior and Ignacio Lopez-Moreno. Improving DNN Speaker Inde-
pendence with I-vector Inputs. In ICASSP, pages 225–229, 2014.
[48] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and
S. Khudanpur. Deep neural network-based speaker embeddings for end-to-
end speaker verification. pages 165–170, 2016.
[49] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. pages 5329–5333, 2018.
[50] Ruchir Travadi and Shrikanth Narayanan. Efficient Estimation and Model Generalization for the Total Variability Model. Computer Speech & Language, 53:43–64, 2019.
[51] Ruchir Travadi and Shrikanth S Narayanan. A Distribution Free Formulation
of the Total Variability Model. In INTERSPEECH, pages 1576 – 1580, 2017.
[52] Ruchir Travadi, Maarten Van Segbroeck, and Shrikanth S Narayanan.
Modified-prior i-Vector Estimation for Language Identification of Short Dura-
tion Utterances. In Fifteenth Annual Conference of the International Speech
Communication Association, 2014.
[53] A. Trujillo-Ortiz, R. Hernandez-Walls, K. Barba-Rojo, and L. Cupul-Magana. DorHanomunortest: Doornik-Hansen Omnibus Multivariate (Univariate) Normality Test. A MATLAB file. [WWW document]. https://www.mathworks.com/matlabcentral/fileexchange/17530-dorhanomunortest, 2007.
[54] Maarten Van Segbroeck, Ruchir Travadi, and Shrikanth S Narayanan. Rapid
Language Identification. IEEE Transactions on Audio, Speech, and Language
Processing, 23(7):1118–1129, 2015.
[55] Maarten Van Segbroeck, Ruchir Travadi, Colin Vaz, Jangwon Kim,
Matthew P Black, Alexandros Potamianos, and Shrikanth S Narayanan. Clas-
sification of Cognitive Load from Speech using an i-vector Framework. In
INTERSPEECH, pages 751–755, 2014.
[56] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez-Moreno, and
Javier Gonzalez-Dominguez. Deep neural networks for small footprint text-
dependent speaker verification. In ICASSP, volume 14, pages 4052–4056.
Citeseer, 2014.
[57] JG Wilpon and DB Roe. At&t telephone network applications of speech
recognition. In Proc. COST232 Workshop, Rome, Italy, 1992.
[58] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker
verification with triplet loss on short utterances. In Proc. of Interspeech, 2017.
Abstract
Signals encountered in real life are affected by multiple factors of variability. Information about these sources gets encoded in the signal, but an appropriate representation needs to be derived from the signal in order to extract this information automatically. Particularly in speech analysis, the Total Variability Model has been widely used as a mechanism to extract a low-dimensional representation of the speech signal, especially for applications such as speaker and language identification. This model has conventionally been posed as a generative latent variable model, and the model parameters are typically estimated using several iterations of the Expectation Maximization (EM) algorithm, which can be computationally demanding. We propose a computationally efficient randomized algorithm for parameter estimation that relies on some approximations within the model likelihood function. We show that this algorithm reduces the computation time required for parameter estimation by a significant factor while also providing a performance improvement on a language identification task. We then present extensive statistical analysis aimed at verifying the validity of some of the assumptions made within the generative Total Variability Model formulation. We show that many of these assumptions are not valid in practice and propose a way to reformulate the model in a distribution-free and discriminative manner. Along with discriminative training, this reformulation enables the integration of Deep Neural Networks within the model. In addition, it also allows the model to be viewed abstractly as a network layer that can be incorporated in other Deep Neural Network architectures. Through experiments on speaker verification, we show that incorporating this layer leads to a significant improvement in performance.